Large language models (LLMs) demonstrate impressive performance in language understanding and a variety of reasoning tasks. However, their capacity for spatial reasoning, a key aspect of human cognition, remains understudied. Humans can form mental images of unseen objects and actions through a process known as the mind's eye, which lets them imagine the unseen world. Inspired by this cognitive ability, researchers proposed "Visualization of Thought" (VoT). VoT elicits spatial reasoning in LLMs by having them visualize their reasoning traces, which then guide the subsequent reasoning steps. The researchers applied VoT to multi-hop spatial reasoning tasks: natural language navigation, visual navigation, and visual tiling in a two-dimensional grid world. Experimental results show that VoT significantly enhances the spatial reasoning capabilities of LLMs. Notably, VoT outperforms existing multimodal large language models (MLLMs) on these tasks.

Introduction

In recent years, LLMs have achieved remarkable performance on various language-related tasks. Despite their success in mathematical reasoning, commonsense reasoning, and other reasoning tasks such as symbolic or logical reasoning, their capabilities in spatial reasoning remain underexplored.

Spatial reasoning is a fundamental function of human cognition, allowing us to interact with our environment. It supports tasks that require understanding and reasoning about the spatial relationships between objects and their motion. Language models rely heavily on language to reason about spatial information, whereas human cognition extends well beyond verbal reasoning. Humans can not only create task-relevant abstract representations from visual perception, but also imagine unseen scenes through the mind's eye. This ability is itself a topic of ongoing research.
Figure 1: Humans can enhance their spatial awareness and guide decision-making by creating mental images during spatial reasoning. Likewise, LLMs can build internal mental images. The researchers proposed VoT to trigger the "mind's eye" of LLMs by visualizing their thinking at each intermediate step, thereby promoting spatial reasoning.

Inspired by this cognitive mechanism, the researchers speculate that LLMs can create and manipulate mental images in the mind's eye for spatial reasoning. As shown in Figure 1, LLMs may be able to process and understand spatial information in various formats, visualize their internal states, and manipulate these mental images to guide subsequent reasoning steps. The researchers therefore proposed Visualization of Thought (VoT) prompting to elicit this ability. The method gives LLMs a visual-spatial sketchpad on which to visualize their reasoning steps and guide subsequent ones. VoT uses zero-shot prompting rather than relying on few-shot demonstrations or on text-to-image visualization with CLIP. This choice stems from the ability of LLMs to acquire a variety of mental images from text-based visual art.
To evaluate the effectiveness of VoT in spatial reasoning, the researchers selected three tasks that require spatial awareness from LLMs: natural language navigation, visual navigation, and visual tiling. These tasks require an understanding of space, direction, and geometric shapes. To simulate human-like multisensory perception, the researchers designed 2D grid worlds that use special characters as an enriched input format for the visual navigation and visual tiling tasks. Different models (GPT-4, GPT-4V) and prompting techniques were compared on these three tasks. The results show that VoT prompting consistently induces LLMs to visualize their reasoning steps and use them to guide subsequent steps, and that the method thereby achieves significant performance improvements on the corresponding tasks.
Figure 2: Examples of navigation maps in different settings, with a house emoji representing the starting point and an office emoji representing the destination.

Spatial Reasoning

Spatial reasoning refers to the ability to understand and reason about the spatial relationships between objects, their movements, and their interactions. This skill matters for a wide range of real-world applications, such as navigation, robotics, and autonomous driving, all of which require action planning based on visual perception and a detailed understanding of spatial dimensions.

Although several tasks and datasets have been developed to explore the spatial semantics embedded in text, research has generally focused on how spatial terms are structured linguistically. Recently, impressive results have been achieved on these benchmarks by converting spatial terms into logical forms and applying logic programming. Strong performance on these tasks therefore does not necessarily mean that LLMs truly understand spatial information, nor does it provide an accurate measure of their spatial awareness. Spatial awareness involves understanding spatial relationships, directions, distances, and geometry, all of which are essential for planning actions in the physical world. To assess the spatial awareness and spatial reasoning abilities of LLMs, the researchers selected tasks that test navigation and geometric reasoning skills: natural language navigation, visual navigation, and visual tiling.

Natural Language Navigation

Natural language navigation involves traversing an underlying spatial structure by means of a random walk, with the goal of recognizing previously visited locations. The concept was inspired by previous research on human cognition that used a similar approach, a random walk along a graph structure. The task requires an understanding of loop closure, which is critical for spatial navigation.
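To make the idea of loop closure concrete, here is a minimal Python sketch. It is not the benchmark's code: the 2D lattice, the `MOVES` table, and the function names `random_walk` and `loop_closures` are all assumptions made for illustration. It records every cell a walker has visited and reports the steps at which the walker returns to one of them.

```python
import random

# Unit moves on a 2D lattice; the lattice itself is an assumption of this sketch.
MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}


def random_walk(n_steps, seed=0):
    """Return a list of direction names drawn uniformly at random."""
    rng = random.Random(seed)
    return [rng.choice(list(MOVES)) for _ in range(n_steps)]


def loop_closures(steps, start=(0, 0)):
    """Return the 1-based step indices at which the walker re-enters a visited cell."""
    x, y = start
    visited = {start}
    closures = []
    for i, step in enumerate(steps, start=1):
        dx, dy = MOVES[step]
        x, y = x + dx, y + dy
        if (x, y) in visited:
            closures.append(i)
        visited.add((x, y))
    return closures


square = ["up", "right", "down", "left"]
print(loop_closures(square))  # [4]: the fourth step returns to the start
```

A model that solves the task has to do the equivalent bookkeeping implicitly, from the textual description of the walk alone.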
Visual Navigation

The visual navigation task presents LLMs with a synthetic 2D grid world and challenges them to navigate using visual cues. The model must generate navigation instructions that move in four directions (left, right, up, and down) from a starting point to a destination while avoiding obstacles. The task comprises two subtasks, route planning and next-step prediction. Both require multi-hop spatial reasoning, and route planning is the more complex of the two.

Visual Tiling

Tiling is a classic spatial reasoning challenge. The researchers extend it to test the ability of LLMs to understand, organize, and reason about shapes within a confined area, which broadens the assessment of spatial reasoning skills. The task involves a rectangle with unfilled cells and various polyomino pieces, such as the I-tetromino, which consists of four aligned squares. The model must choose the appropriate variation of a piece, for example the orientation of the I-tetromino, to solve the puzzle, which is posed as a question-answering problem.
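The choice among piece variations can be illustrated with a short Python sketch. Everything here is an assumption made for illustration rather than the benchmark's actual format: the `.`/`#` grid encoding (empty and filled cells), the representation of a piece as a set of `(row, col)` cells, and the helper names. The sketch enumerates the rotations and mirror images of a piece and checks which of them fit into the unfilled cells.

```python
def normalize(cells):
    """Shift a set of (row, col) cells so its minimum row and column are 0."""
    min_r = min(r for r, _ in cells)
    min_c = min(c for _, c in cells)
    return frozenset((r - min_r, c - min_c) for r, c in cells)


def variations(cells):
    """Return all distinct rotations and mirror images of a piece."""
    seen = set()
    shape = set(cells)
    for _ in range(4):
        shape = {(c, -r) for r, c in shape}               # rotate by 90 degrees
        seen.add(normalize(shape))
        seen.add(normalize({(r, -c) for r, c in shape}))  # mirrored copy
    return seen


def fits(grid, shape, top, left):
    """True if every cell of `shape`, offset by (top, left), lands on an empty cell."""
    for r, c in shape:
        rr, cc = top + r, left + c
        if not (0 <= rr < len(grid) and 0 <= cc < len(grid[0])):
            return False
        if grid[rr][cc] != ".":
            return False
    return True


def placements(grid, cells):
    """List every (shape, top, left) at which some variation of the piece fits."""
    found = []
    for shape in variations(cells):
        for top in range(len(grid)):
            for left in range(len(grid[0])):
                if fits(grid, shape, top, left):
                    found.append((sorted(shape), top, left))
    return found


I_PIECE = {(0, 0), (0, 1), (0, 2), (0, 3)}  # four aligned squares

GRID = ["#.##",
        "#.##",
        "#.##",
        "#.##"]

# Only the vertical orientation fits, placed in the second column.
print(placements(GRID, I_PIECE))
```

In this toy grid the horizontal I-piece has no valid position, so the only answer is the vertical orientation, which is the kind of decision the task asks the model to make.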
Figure 3: Example of visual tiling with masked polyomino pieces. The image does not show the rotated and mirrored variations of the pieces.

Visualization of Thought Prompting

When humans process spatial information in tasks such as navigation, they often create mental images, such as maps, to enhance their spatial awareness or to simulate movement and so guide decision-making. The goal of the research is to elicit the spatial awareness of LLMs and to ground their reasoning by having them visualize their intermediate reasoning steps. The researchers introduce the Visualization of Thought (VoT) prompt: "Visualize the state after each reasoning step." This new spatial reasoning paradigm aims to generate reasoning traces and visualizations in an interleaved manner.

Figure 4: Examples of VoT prompting in the three tasks, in which the LLM generates reasoning traces and visualizations in an interleaved manner to track the state as it changes over time.

Paper: https://arxiv.org/pdf/2404.03622.pdf
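The following Python sketch shows how a VoT prompt might be assembled for a small navigation map, and what an interleaved trace of moves and visualized states looks like. Only the instruction string is taken from the article. The ASCII symbols (`S` start, `D` destination, `#` obstacle, `.` road, `@` current position), the task wording, and the helpers `build_prompt`, `render`, and `trace` are assumptions for illustration. The trace is produced by plain code, not by a model, so it only mimics the output format VoT asks for.

```python
VOT_INSTRUCTION = "Visualize the state after each reasoning step."  # quoted from the article

# (row, col) offsets; row 0 is the top of the map.
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def find(grid, symbol):
    """Return the (row, col) of the first occurrence of `symbol`."""
    for r, row in enumerate(grid):
        c = row.find(symbol)
        if c != -1:
            return (r, c)
    raise ValueError(f"{symbol!r} not in grid")


def render(grid, pos):
    """Draw the grid as text, marking the current position with '@'."""
    return "\n".join(
        "".join("@" if (r, c) == pos else ch for c, ch in enumerate(row))
        for r, row in enumerate(grid)
    )


def build_prompt(grid):
    """Zero-shot prompt: the map, a task description (hypothetical wording), the VoT line."""
    task = "Navigate from S to D, moving up, down, left or right and avoiding #."
    return "\n".join(grid) + "\n" + task + "\n" + VOT_INSTRUCTION


def trace(grid, moves):
    """Interleave each move with a rendering of the state it produces."""
    pos = find(grid, "S")
    lines = []
    for i, move in enumerate(moves, start=1):
        dr, dc = MOVES[move]
        r, c = pos[0] + dr, pos[1] + dc
        if not (0 <= r < len(grid) and 0 <= c < len(grid[0])) or grid[r][c] == "#":
            raise ValueError(f"step {i} ({move}) leaves the road")
        pos = (r, c)
        lines.append(f"Step {i}: move {move}\n{render(grid, pos)}")
    return pos, "\n\n".join(lines)


GRID = ["S.#",
        "#.#",
        "#.D"]

print(build_prompt(GRID))
final, text = trace(GRID, ["right", "down", "down", "right"])
print(text)
```

Because the prompt contains no worked examples, it matches the zero-shot setting described earlier; the printed trace can serve as a reference when checking whether a model's visualized states are consistent with its moves.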