Q: Which industries could benefit from MolmoMotion?

Robotics and video generation could benefit from MolmoMotion, since both depend on accurate motion forecasting for planning and synthesis tasks.

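Q: How does the MolmoMotion framework work?

The MolmoMotion framework builds on the Molmo 2 vision-language backbone and comes in two variants: an autoregressive (AR) variant that emits coordinates as quantized text, and a flow-matching (FM) variant that generates continuous trajectories from noise.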

Allen Institute for AI Unveils MolmoMotion: Revolutionizing Motion Forecasting

Q: What is MolmoMotion and what does it do?

MolmoMotion is a motion forecasting model that predicts 3D point trajectories under natural-language action instructions.

Q: How does MolmoMotion differ from existing forecasting methods?

MolmoMotion achieves substantially stronger performance than existing forecasting methods on PointMotionBench, a human-validated benchmark.

Q: What is MolmoMotion-1M?

MolmoMotion-1M is a large dataset of action-described, object-grounded 3D point trajectories annotated from 1.16M unconstrained videos.

The new MolmoMotion model predicts 3D point trajectories under natural-language action instructions, with potential applications in robotics and video generation.

The Allen Institute for AI has released MolmoMotion, a new motion forecasting model that predicts 3D point trajectories under natural-language action instructions. The model is relevant to fields such as robotics and video generation, where anticipating how objects move supports planning and synthesis.

What Happened

MolmoMotion is a 4B-parameter vision-language model that forecasts 3D point trajectories expressed in the coordinate frame of the camera at t₀. Its inputs are a short RGB observation history, user-specified query points with their initial 3D positions, and a language description of the intended action. The model achieves substantially stronger performance than existing forecasting methods on PointMotionBench, a human-validated benchmark designed to measure object-centric 3D motion forecasting accuracy.
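To make the input format concrete, here is a minimal Python sketch of how a forecasting query could be assembled. The `ForecastQuery` container, its field names, and the `world_to_camera_t0` helper are hypothetical and are not part of any released MolmoMotion API. The sketch only illustrates what it means to express query points in the coordinate frame of the camera at t₀, assuming the camera pose at t₀ is known as a rotation matrix and a translation vector.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]


@dataclass
class ForecastQuery:
    """Hypothetical container mirroring the three inputs described above."""
    frames: List[str]            # identifiers of the short RGB observation history
    query_points: List[Point3D]  # initial 3D positions, in the camera frame at t0
    instruction: str             # natural-language description of the intended action


def world_to_camera_t0(
    point_world: Sequence[float],
    rotation_t0: Sequence[Sequence[float]],
    translation_t0: Sequence[float],
) -> Point3D:
    """Map a world-space point into the camera frame at t0: p_cam = R @ p_world + t."""
    x, y, z = (
        sum(rotation_t0[i][j] * point_world[j] for j in range(3)) + translation_t0[i]
        for i in range(3)
    )
    return (x, y, z)


# Assumed pose for illustration: no rotation, world origin 2 m in front of the camera.
R_T0 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
T_T0 = (0.0, 0.0, 2.0)

point_cam = world_to_camera_t0((0.5, -0.1, 1.0), R_T0, T_T0)
query = ForecastQuery(
    frames=["frame_000.png", "frame_001.png"],
    query_points=[point_cam],
    instruction="pick up the mug",
)
```

Because every future position is reported in this single fixed frame, the predicted trajectory does not depend on how the camera moves after t₀.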

The MolmoMotion framework builds on the Molmo 2 vision-language backbone, which grounds the language instruction to objects and points in the image. The model comes in two variants: an autoregressive (AR) variant that emits coordinates as quantized text, and a flow-matching (FM) variant that generates continuous trajectories from noise.
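The difference between the two output styles can be illustrated with a toy Python sketch. Everything here is an assumption made for illustration: the coordinate range, the bin count, the text format, and the hand-written velocity field are not taken from MolmoMotion. In a real flow-matching model the velocity field would be a learned network conditioned on the video, query points, and instruction; here it is replaced by a closed-form field that moves samples along straight lines toward a fixed target.

```python
import random
from typing import Callable, List, Sequence, Tuple

Point3D = Tuple[float, ...]

# Assumed values; the source does not state the real range or number of bins.
COORD_MIN, COORD_MAX, NUM_BINS = -2.0, 2.0, 1000


def quantize(value: float) -> int:
    """Clip a coordinate to the assumed range and map it to an integer bin."""
    clipped = min(max(value, COORD_MIN), COORD_MAX)
    fraction = (clipped - COORD_MIN) / (COORD_MAX - COORD_MIN)
    return min(int(fraction * NUM_BINS), NUM_BINS - 1)


def dequantize(bin_index: int) -> float:
    """Return the centre of a bin, so round-trip error is at most half a bin."""
    width = (COORD_MAX - COORD_MIN) / NUM_BINS
    return COORD_MIN + (bin_index + 0.5) * width


def trajectory_to_text(trajectory: Sequence[Sequence[float]]) -> str:
    """AR-style output: one 'x,y,z' group of bin indices per time step."""
    return " ".join(",".join(str(quantize(c)) for c in p) for p in trajectory)


def text_to_trajectory(text: str) -> List[Point3D]:
    """Decode the text form back into approximate continuous coordinates."""
    return [tuple(dequantize(int(tok)) for tok in step.split(","))
            for step in text.split()]


Velocity = Callable[[List[float], float], List[float]]


def straight_line_velocity(target: Sequence[float]) -> Velocity:
    """Toy stand-in for a learned velocity field: points straight at `target`."""
    def velocity(x: List[float], t: float) -> List[float]:
        return [(g - xi) / (1.0 - t) for g, xi in zip(target, x)]
    return velocity


def sample_flow(velocity: Velocity, dim: int, steps: int = 50, seed: int = 0) -> List[float]:
    """FM-style sampling: start from Gaussian noise at t=0, Euler-integrate to t=1."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    dt = 1.0 / steps
    for k in range(steps):
        v = velocity(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


trajectory = [(0.10, -0.25, 1.50), (0.12, -0.20, 1.45)]
as_text = trajectory_to_text(trajectory)      # discrete tokens, small rounding error
decoded = text_to_trajectory(as_text)
sampled = sample_flow(straight_line_velocity([0.3, -0.1, 1.2]), dim=3)  # continuous
```

The text route trades a bounded quantization error for compatibility with an ordinary language-model decoder, while the flow route produces continuous values directly at the cost of running several integration steps.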

Background and Context

Motion forecasting is central to visual intelligence, enabling agents to anticipate how objects will move in order to plan actions, reason about physical interactions, and synthesize realistic futures. However, predicting motion is harder than observing it, as it requires anticipating future events rather than simply explaining past ones.

The Allen Institute for AI has developed MolmoMotion-1M, a large dataset of action-described, object-grounded 3D point trajectories annotated from 1.16M unconstrained videos. This dataset is used to train the MolmoMotion model and evaluate its performance on PointMotionBench.

Why It Matters

MolmoMotion is relevant to fields such as robotics and video generation, where accurate motion forecasting is crucial for planning and synthesis tasks. The model's ability to predict 3D point trajectories under natural-language action instructions could enable more realistic and physically plausible motion in generated videos.

The MolmoMotion framework also provides a general motion representation that is class-agnostic, view-stable, compact, and directly useful for downstream tasks such as robotics planning and video generation.

What Comes Next

The MolmoMotion model and dataset are now available for the research community to study, improve, and customize. The Allen Institute for AI encourages developers to try MolmoMotion by downloading the weights, inspecting the training data, and evaluating its performance on PointMotionBench.
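For readers who want to score their own predictions, the sketch below computes two error measures that are widely used in trajectory forecasting: average displacement error (ADE) and final displacement error (FDE). This is an assumption for illustration only. The source does not state which metrics PointMotionBench reports, and the function names are hypothetical.

```python
import math
from typing import Sequence

Trajectory = Sequence[Sequence[float]]  # one (x, y, z) position per time step


def average_displacement_error(pred: Trajectory, truth: Trajectory) -> float:
    """Mean Euclidean distance between predicted and true positions over all steps."""
    if len(pred) != len(truth) or len(pred) == 0:
        raise ValueError("trajectories must be non-empty and of equal length")
    return sum(math.dist(p, g) for p, g in zip(pred, truth)) / len(pred)


def final_displacement_error(pred: Trajectory, truth: Trajectory) -> float:
    """Euclidean distance between the last predicted and last true position."""
    if len(pred) != len(truth) or len(pred) == 0:
        raise ValueError("trajectories must be non-empty and of equal length")
    return math.dist(pred[-1], truth[-1])


predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
ground_truth = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]

ade = average_displacement_error(predicted, ground_truth)
fde = final_displacement_error(predicted, ground_truth)
```

Both values are in the same units as the input coordinates, so they are only comparable across methods when all trajectories share one coordinate frame and scale.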

Key Facts

MolmoMotion is a 4B vision-language model that forecasts 3D point trajectories under natural-language action instructions.
The model achieves substantially stronger performance than existing forecasting methods on PointMotionBench, a human-validated benchmark designed to measure object-centric 3D motion forecasting accuracy.
MolmoMotion builds on the Molmo 2 vision-language backbone and comes in two variants: an autoregressive (AR) variant that emits coordinates as quantized text, and a flow-matching (FM) variant that generates continuous trajectories from noise.
The MolmoMotion framework provides a general motion representation that is class-agnostic, view-stable, compact, and directly useful for downstream tasks.
MolmoMotion has significant implications for industries such as robotics and video generation, where accurate motion forecasting is crucial for planning and synthesis tasks.
