NVIDIA has released Cosmos Reason 2, a reasoning vision language model for physical AI that is more accurate than its predecessor and tops the Physical AI Bench and Physical Reasoning leaderboards as the #1 open model for visual understanding. The model is designed to close the gap between current AI capabilities and human-like reasoning, so that robots and AI agents can see, understand, plan, and act in the physical world as humans do.

Background and Context

Cosmos Reason 2 builds on its predecessor, Cosmos Reason, which was introduced to address the limitations of traditional vision-language models. Traditional models have rapidly improved at tasks like object and pattern recognition in images, but they still struggle with things humans find natural, such as planning several steps ahead, dealing with uncertainty, or adapting to new situations. Cosmos Reason 2 is designed to give robots and AI agents stronger common sense and reasoning capabilities so they can solve complex problems step by step.

The model uses chain-of-thought reasoning to understand world dynamics without human annotations, making it particularly valuable for robotics, autonomous systems, and any application where understanding physical cause and effect is critical. Cosmos Reason 2 has been post-trained on physical common sense and embodied reasoning data, using both supervised fine-tuning and reinforcement learning.

What Makes Cosmos Reason 2 Special

Cosmos Reason 2 is built to handle the long tail of diverse physical-world scenarios. It offers improved spatio-temporal understanding and timestamp precision, along with reasoning explanations and labels. Its spatial understanding and visual perception capabilities have also been expanded to cover object detection with 2D/3D point localization and bounding-box coordinates, trajectory data, and OCR.
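To make the bounding-box capability concrete, here is a minimal sketch of how an application might consume box coordinates returned by a vision language model. The reply format is an assumption for illustration only: a JSON list of objects with a "label" string and a "box_2d" list of normalized [x1, y1, x2, y2] values. The names Box, parse_boxes, and box_2d are hypothetical and are not taken from the Cosmos Reason 2 documentation, which should be consulted for the actual output format.

```python
import json
from dataclasses import dataclass


@dataclass
class Box:
    """One detected object with corners normalized to the [0, 1] range."""
    label: str
    x1: float
    y1: float
    x2: float
    y2: float

    def to_pixels(self, width: int, height: int) -> tuple[int, int, int, int]:
        """Scale normalized corners to integer pixel coordinates."""
        return (
            round(self.x1 * width),
            round(self.y1 * height),
            round(self.x2 * width),
            round(self.y2 * height),
        )


def parse_boxes(reply: str) -> list[Box]:
    """Parse the assumed JSON reply, skipping entries without four coordinates."""
    boxes = []
    for item in json.loads(reply):
        coords = item.get("box_2d")  # hypothetical field name
        if not isinstance(coords, list) or len(coords) != 4:
            continue
        x1, y1, x2, y2 = (float(c) for c in coords)
        # Reorder corners so (x1, y1) is top-left and (x2, y2) is bottom-right.
        boxes.append(
            Box(str(item.get("label", "")),
                min(x1, x2), min(y1, y2), max(x1, x2), max(y1, y2))
        )
    return boxes


# Made-up reply text in the assumed format; the second entry is malformed.
sample_reply = '[{"label": "pallet", "box_2d": [0.25, 0.5, 0.75, 1.0]}, {"label": "bad"}]'
boxes = parse_boxes(sample_reply)
print(boxes[0].label, boxes[0].to_pixels(1280, 720))
```

Normalized coordinates are used here only because they keep the example independent of image resolution; a real integration would follow whatever convention the model's sample prompts specify.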

The model's long-context understanding now extends to as many as 256K input tokens, allowing it to handle complex tasks that require a deeper understanding of the physical world. Cosmos Cookbook recipes help developers adapt Cosmos Reason 2 to multiple use cases and integrate it into their applications.
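As a rough illustration of what a 256K-token input window means for video, the sketch below estimates how many frames fit into the context after reserving space for the text prompt. Everything except the 256K figure is an assumption: "256K" is read as 256 × 1024 tokens, and the per-frame token cost, prompt size, and sampling rate are hypothetical inputs that depend on the actual tokenizer and preprocessing.

```python
CONTEXT_LIMIT = 256 * 1024  # assumption: "256K" interpreted as 262,144 tokens


def max_frames(tokens_per_frame: int, prompt_tokens: int,
               context_limit: int = CONTEXT_LIMIT) -> int:
    """Return how many whole frames fit after the text prompt is accounted for."""
    if tokens_per_frame <= 0:
        raise ValueError("tokens_per_frame must be positive")
    remaining = context_limit - prompt_tokens
    return max(0, remaining // tokens_per_frame)


def clip_seconds(frames: int, sampled_fps: float) -> float:
    """Convert a frame count to clip duration at the given sampling rate."""
    return frames / sampled_fps


# Hypothetical numbers: 256 tokens per frame, a 2,048-token prompt, 2 sampled frames/s.
frames = max_frames(tokens_per_frame=256, prompt_tokens=2048)
print(frames, clip_seconds(frames, sampled_fps=2.0))  # 1016 508.0
```

Under these made-up inputs the window holds about eight and a half minutes of video sampled at two frames per second; different per-frame costs would change that figure proportionally.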

Why It Matters to the Industry

Cosmos Reason 2 has significant implications for industries that rely on robotics and autonomous systems, such as logistics, manufacturing, and transportation. The model's ability to reason about complex physical scenarios and adapt to new situations makes it an attractive solution for applications where traditional AI models struggle.

The improved spatio-temporal understanding and timestamp precision also make Cosmos Reason 2 well suited to video analytics AI agents, which extract insights from massive volumes of video data. Its expanded spatial understanding and visual perception capabilities are a good fit for applications where accurate object detection and tracking are critical.

What Comes Next

NVIDIA has made Cosmos Reason 2 available on its website, along with documentation and sample prompts for generating bounding boxes and robot trajectories. The model will also be available soon on Amazon Web Services, Google Cloud, and Microsoft Azure. Developers can jump-start development of video analytics AI agents by using the NVIDIA blueprint for video search and summarization (VSS) with Cosmos Reason as the VLM.

Key Facts

  • Cosmos Reason 2 surpasses its previous version in accuracy and tops the Physical AI Bench and Physical Reasoning leaderboards as the #1 open model for visual understanding.
  • The model uses chain-of-thought reasoning capabilities to understand world dynamics without human annotations.
  • Cosmos Reason 2 has improved spatio-temporal understanding and timestamp precision, along with reasoning explanations and labels.
  • The model supports an expanded set of spatial understanding and visual perception capabilities, including object detection with 2D/3D point localization and bounding-box coordinates, trajectory data, and OCR.
  • Cosmos Reason 2 is adaptable to multiple use cases with easy-to-use Cosmos Cookbook recipes.