The Allen Institute for AI has released MolmoMotion, a new motion forecasting model that predicts 3D point trajectories under natural-language action instructions. The model targets applications such as robotics planning and video generation, where anticipating how objects will move is essential.

What Happened

MolmoMotion is a 4B-parameter vision-language model that forecasts 3D point trajectories expressed in the camera coordinate frame at the initial time t₀. It takes three inputs: a short RGB observation history, user-specified query points with their initial 3D positions, and a language description of the intended action. The model achieves substantially stronger performance than existing forecasting methods on PointMotionBench, a human-validated benchmark designed to measure object-centric 3D motion forecasting accuracy.
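To make the input format and the choice of coordinate frame concrete, here is a minimal sketch in Python. It is not MolmoMotion's actual interface: the `ForecastQuery` container, its field names, the `world_to_camera_t0` helper, and the example file names and instruction are all hypothetical. It assumes only that the camera pose at t₀ is available as a rigid world-to-camera transform (rotation R and translation t), so that a point is expressed in the t₀ camera frame as p_cam = R p_world + t.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]


@dataclass
class ForecastQuery:
    """Hypothetical container for one forecasting request (names are illustrative)."""
    frames: List[str]            # identifiers of the short RGB observation history
    query_points: List[Point3D]  # initial 3D positions, in the camera frame at t0
    instruction: str             # language description of the intended action


def world_to_camera_t0(p_world: Point3D,
                       rotation: Sequence[Sequence[float]],
                       translation: Point3D) -> Point3D:
    """Express a world-space point in the t0 camera frame: p_cam = R p_world + t."""
    return tuple(
        sum(rotation[i][j] * p_world[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


# Example world-to-camera transform at t0 (assumed values): a 90-degree
# rotation about the z-axis, with the world origin 2 units along camera z.
R = [[0.0, -1.0, 0.0],
     [1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0]]
t = (0.0, 0.0, 2.0)

query = ForecastQuery(
    frames=["frame_000.png", "frame_001.png"],
    query_points=[world_to_camera_t0((1.0, 0.0, 0.0), R, t)],
    instruction="push the mug to the left",
)
```

Fixing every predicted point to the camera frame at t₀ means the forecast does not depend on how the camera moves after the first frame, which is one plausible reading of why the representation is described as view-stable.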

The MolmoMotion framework builds on the Molmo 2 vision-language backbone, which grounds the language instruction to objects and points in the image. The model comes in two variants: an autoregressive (AR) variant that emits coordinates as quantized text, and a flow-matching (FM) variant that generates continuous trajectories from noise.
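The difference between the two decoding styles can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the bin count, the coordinate range, and the function names are hypothetical, and the "velocity" function is an oracle that stands in for a trained network. The first half shows how a continuous coordinate can be turned into a text token and back, at the cost of rounding error. The second half shows the general flow-matching recipe of starting from Gaussian noise and integrating a velocity field from t = 0 to t = 1 with Euler steps.

```python
import random
from typing import Callable, List

NUM_BINS = 1000     # assumed number of coordinate bins (hypothetical)
LO, HI = -2.0, 2.0  # assumed coordinate range (hypothetical)


def quantize(value: float) -> str:
    """AR-style encoding: clamp to the range, map to an integer bin, emit as text."""
    clamped = min(max(value, LO), HI)
    bin_index = round((clamped - LO) / (HI - LO) * (NUM_BINS - 1))
    return str(bin_index)


def dequantize(token: str) -> float:
    """Invert quantize() up to half a bin width of rounding error."""
    return LO + int(token) / (NUM_BINS - 1) * (HI - LO)


def euler_sample(velocity: Callable[[List[float], float], List[float]],
                 noise: List[float], steps: int = 10) -> List[float]:
    """FM-style decoding: integrate dx/dt = velocity(x, t) from t=0 to t=1."""
    x, dt = list(noise), 1.0 / steps
    for k in range(steps):
        v = velocity(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Stand-in for a trained network: the exact velocity that carries any sample
# to a fixed target along a straight line. A real model would instead predict
# the velocity from the video, the query points, and the instruction.
target = [0.10, -0.25, 1.50]


def oracle_velocity(x: List[float], t: float) -> List[float]:
    return [(gi - xi) / (1.0 - t) for gi, xi in zip(target, x)]


random.seed(0)
noise = [random.gauss(0.0, 1.0) for _ in target]
continuous = euler_sample(oracle_velocity, noise)  # continuous-valued output
tokens = [quantize(c) for c in target]             # coordinates as text tokens
decoded = [dequantize(tok) for tok in tokens]      # quantized reconstruction
```

In this toy setup the quantized path can only recover each coordinate to within half a bin width, while the integrated path returns unrounded continuous values. How either trade-off plays out in MolmoMotion itself is not something this sketch can show.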

Background and Context

Motion forecasting is central to visual intelligence, enabling agents to anticipate how objects will move in order to plan actions, reason about physical interactions, and synthesize realistic futures. However, predicting motion is harder than observing it, as it requires anticipating future events rather than simply explaining past ones.

The Allen Institute for AI has also developed MolmoMotion-1M, a large dataset of action-described, object-grounded 3D point trajectories annotated from 1.16M unconstrained videos. This dataset is used to train the MolmoMotion model, whose performance is then evaluated on PointMotionBench.

Why It Matters

Accurate motion forecasting is crucial for planning in robotics and for synthesis in video generation, and MolmoMotion is aimed at both. Predicting 3D point trajectories under natural-language action instructions could enable more realistic and physically plausible motion in generated videos.

The MolmoMotion framework also provides a general motion representation that is class-agnostic, view-stable, compact, and directly useful for downstream tasks such as robotics planning and video generation.

What Comes Next

The MolmoMotion model and dataset are now available for the research community to study, improve, and customize. The Allen Institute for AI encourages developers to try MolmoMotion by downloading the weights, inspecting the training data, and evaluating its performance on PointMotionBench.
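For readers who want a feel for what evaluating a trajectory forecast involves, the sketch below computes average displacement error, a metric commonly used in trajectory forecasting. This is an assumption for illustration only: the article does not say which metrics PointMotionBench reports, and the function name and sample values are hypothetical.

```python
import math
from typing import List, Tuple

Point3D = Tuple[float, float, float]
Trajectory = List[Point3D]  # one 3D point per future timestep


def average_displacement_error(pred: List[Trajectory],
                               truth: List[Trajectory]) -> float:
    """Mean Euclidean distance over all query points and all timesteps."""
    dists = [
        math.dist(p, q)
        for pred_traj, true_traj in zip(pred, truth)
        for p, q in zip(pred_traj, true_traj)
    ]
    return sum(dists) / len(dists)


# One query point tracked over two future timesteps (made-up values).
truth = [[(0.0, 0.0, 1.0), (0.1, 0.0, 1.0)]]
pred = [[(0.0, 0.0, 1.0), (0.1, 0.3, 1.4)]]
ade = average_displacement_error(pred, truth)
```

Lower values mean the predicted points stay closer to the ground-truth points on average.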

Key Facts

  • MolmoMotion is a 4B-parameter vision-language model that forecasts 3D point trajectories under natural-language action instructions.
  • The model achieves substantially stronger performance than existing forecasting methods on PointMotionBench, a human-validated benchmark designed to measure object-centric 3D motion forecasting accuracy.
  • MolmoMotion builds on the Molmo 2 vision-language backbone and comes in two variants: an autoregressive (AR) variant that emits coordinates as quantized text, and a flow-matching (FM) variant that generates continuous trajectories from noise.
  • The MolmoMotion framework provides a general motion representation that is class-agnostic, view-stable, compact, and directly useful for downstream tasks.
  • MolmoMotion targets robotics and video generation, where accurate motion forecasting is crucial for planning and synthesis tasks.