Microsoft's Open-Source Large Action Model Brings Spatial Reasoning to LLMs
Microsoft's open-source large action model brings spatial reasoning to LLMs, letting them visualize and reason about spatial relationships through visualization-of-thought (VoT) prompting.
4 minutes

May 8, 2024
This content was generated using AI and curated by humans

In a groundbreaking development, Microsoft has released research that brings spatial reasoning capabilities to large language models (LLMs), and an open-source large action model now puts that capability into practice. Much as the Rabbit R1 can control applications within the Android environment through natural language, there is now a completely open-source equivalent for the Windows environment.

Microsoft released a research paper outlining how it achieved this feat, and an accompanying open-source project can be downloaded and used immediately. The paper, titled "Visualization-of-Thought Elicits Spatial Reasoning in Large Language Models," describes a method for giving LLMs spatial reasoning, a capability that has historically been missing from these models.

Understanding Spatial Reasoning

Spatial reasoning refers to the ability to visualize and comprehend the relationships between objects in a 2D or 3D environment. It is a crucial aspect of human cognition, enabling us to imagine unseen objects and actions through what is commonly called the mind's eye. Humans possess a remarkable ability to create mental images, allowing us to navigate, plan routes, and solve spatial challenges with ease.

However, LLMs have struggled with spatial reasoning because they rely on language alone. Yann LeCun, Meta's chief AI scientist, has pointed to this limitation as a core missing capability that keeps LLMs from reaching artificial general intelligence (AGI). Microsoft's research paper challenges this notion, demonstrating that spatial reasoning can be elicited from LLMs through a technique called visualization-of-thought (VoT) prompting.

Visualization of Thought (VoT) Prompting

VoT prompting augments LLMs with a visuospatial sketchpad, allowing them to visualize their reasoning steps and use those visualizations to inform subsequent steps. Unlike conventional prompting techniques that map an input directly to an output, VoT introduces an additional requirement: the LLM must render its intermediate state at each step along the way.
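
To make this concrete, here is a minimal sketch, in our own words rather than the paper's exact template, of how a VoT-style prompt differs from a plain one:

```python
# Illustrative sketch (our wording, not the paper's exact template) of a
# VoT-style prompt. The key addition is asking the model to draw its
# intermediate state after every reasoning step.

plain_prompt = (
    "You are in a 2D grid world. Starting at the top-left corner, "
    "follow these moves: right, right, down. Where do you end up?"
)

vot_prompt = (
    "You are in a 2D grid world. Starting at the top-left corner, "
    "follow these moves: right, right, down. After EACH move, draw the "
    "grid as ASCII art, marking your current position with '@', then "
    "state the next move. Finish with your final coordinates."
)

print(vot_prompt)
```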

The research paper presents three tasks that require spatial awareness in LLMs:

  • Natural Language Navigation: Navigating through a 2D grid world using natural language instructions.
  • Visual Navigation: Navigating a synthetic 2D grid world using visual cues and generating navigation instructions.
  • Visual Tiling: Comprehending, organizing, and reasoning with shapes in a confined area, similar to the classic polyomino tiling challenge.

By designing 2D grid worlds using special characters as enriched input formats, the researchers were able to evaluate the effectiveness of VoT prompting in eliciting spatial reasoning in LLMs.
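
As an illustration, a toy grid world of this kind might be rendered with special characters like so (the symbols here are our own choice, not necessarily the paper's):

```python
# A toy 2D grid world rendered with special characters, in the spirit of
# the enriched input formats described in the paper (the exact symbols
# here are our own choice): '#' wall, '.' open space, 'S' start, 'G' goal.

grid = [
    "#####",
    "#S..#",
    "#.#.#",
    "#..G#",
    "#####",
]
print("\n".join(grid))
```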

Impressive Results and Performance Improvements

The results of the study are impressive: VoT prompting consistently induced LLMs to visualize their reasoning steps and use those visualizations to inform subsequent steps. The approach achieved significant performance improvements across all three tasks compared with other prompting techniques, such as chain-of-thought prompting without explicit visualization.

The research also highlights the importance of visualizing at every step rather than only once. Runs with complete tracking, in which the LLM visualizes its state at every single step, performed best on route planning, next-step prediction, and visual tiling.
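
The following toy script (our own illustration, not code from the paper) shows what complete tracking amounts to: re-rendering the agent's position after every single move rather than only at the end.

```python
# Minimal illustration of "complete tracking": the grid is redrawn after
# every move, so each reasoning step is grounded in a fresh visualization.

moves = {"right": (0, 1), "down": (1, 0), "left": (0, -1), "up": (-1, 0)}

def render(pos, rows=3, cols=4):
    """Draw the grid with '@' at the agent's current position."""
    return "\n".join(
        "".join("@" if (r, c) == pos else "." for c in range(cols))
        for r in range(rows)
    )

pos = (0, 0)
for step in ["right", "right", "down"]:
    dr, dc = moves[step]
    pos = (pos[0] + dr, pos[1] + dc)
    print(f"after '{step}':\n{render(pos)}\n")
```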

Open-Source Implementation: PyWinAssistant

To demonstrate the practical application of VoT prompting, an open-source project called PyWinAssistant has been released. Described by its author as the first open-source large action model, a generalist artificial narrow intelligence, PyWinAssistant controls human user interfaces using only natural language.

Users can interact with a cute virtual assistant character and task it with various actions within the Windows environment. The assistant analyzes the given instructions, generates a series of steps to accomplish the task, and executes them step by step. It visualizes each step along the way, providing a clear understanding of its reasoning process.
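
In spirit, that control loop looks something like the sketch below. This is not PyWinAssistant's actual code: plan_steps() is a hypothetical stand-in for the LLM call that decomposes an instruction into actions, and pyautogui is simply one common Python library for driving a desktop UI.

```python
import time
import pyautogui  # pip install pyautogui; only meaningful in a desktop session

def plan_steps(instruction: str) -> list[dict]:
    """Hypothetical planner. In a real large action model this would be an
    LLM call that decomposes the instruction; it is hard-coded here so the
    sketch runs without any API key."""
    return [
        {"action": "hotkey", "keys": ["win", "r"]},  # open the Run dialog
        {"action": "type", "text": "notepad"},
        {"action": "press", "key": "enter"},         # launch Notepad
        {"action": "type", "text": instruction},     # type into the window
    ]

def execute(step: dict) -> None:
    """Run one planned step, announcing it first, a crude stand-in for
    visualizing each step of the plan."""
    print(f"executing: {step}")
    if step["action"] == "hotkey":
        pyautogui.hotkey(*step["keys"])
    elif step["action"] == "press":
        pyautogui.press(step["key"])
    elif step["action"] == "type":
        pyautogui.typewrite(step["text"], interval=0.05)
    time.sleep(0.5)  # give the UI a moment to react

for step in plan_steps("Hello from a large action model!"):
    execute(step)
```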

Examples of tasks that PyWinAssistant can perform include:

  • Opening a web browser and navigating to a specific website
  • Searching for and playing a video on YouTube
  • Creating a new post on Twitter with a specific message
  • Writing a joke about engineers in the form of a short essay
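
Reusing the hypothetical plan_steps()/execute() helpers sketched earlier, tasking such an assistant with the last item might look like this (the real PyWinAssistant exposes its own interface, not these helpers):

```python
# Hypothetical usage of the plan_steps()/execute() sketch above.
for step in plan_steps("Write a short essay-style joke about engineers"):
    execute(step)
```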

The open-source nature of PyWinAssistant allows developers and researchers to explore and build upon this technology, opening up new possibilities for natural language interaction with computer interfaces.

Conclusion

Microsoft's research on VoT prompting and the release of the open-source PyWinAssistant represent a significant milestone in the development of large action models and the integration of spatial reasoning capabilities into LLMs. By enabling LLMs to visualize and reason about spatial relationships, this technology has the potential to transform domains such as navigation, robotics, and autonomous systems.

As the field of AI continues to evolve, the ability to combine natural language processing with spatial reasoning will be crucial in creating more intelligent and intuitive systems. Microsoft's contribution to this area is commendable, and it will undoubtedly inspire further research and advancements in the field.
