Nicolas Ballas

Michal Drozdzal

Adriana Romero

State-of-the-art video generative models produce promising visual content yet often violate basic physics principles, limiting their utility. While some attribute this deficiency to insufficient physics understanding from pre-training, we find that the shortfall in physics plausibility also stems from suboptimal inference strategies. We therefore introduce WMReward and treat improving physics plausibility of video generation as an inference-time alignment problem. In particular, we leverage the strong physics prior of a latent world model (here, VJEPA-2) as a reward to search and steer multiple candidate denoising trajectories, enabling scaling test-time compute for better generation performance. Empirically, our approach substantially improves physics plausibility across image-conditioned, multiframe-conditioned, and text-conditioned generation settings, with validation from human preference study. Notably, in the ICCV 2025 Perception Test PhysicsIQ Challenge, we achieve a final score of 62.64%, winning first place and outperforming the previous state of the art by 7.42%. Our work demonstrates the viability of using latent world models to improve physics plausibility of video generation, beyond this specific instantiation or parameterization.

2026-01-14

ArXiv (preprint)

Learning Latent Action World Models In The Wild

Quentin Garrido

Tushar Nagarajan

Basile Terver

Michael G. Rabbat

Agents capable of reasoning and planning in the real world require the ability of predicting the consequences of their actions. While world models possess this capability, they most often require action labels, that can be complex to obtain at scale. This motivates the learning of latent action models, that can learn an action space from videos alone. Our work addresses the problem of learning latent actions world models on in-the-wild videos, expanding the scope of existing works that focus on simple robotics simulations, video games, or manipulation data. While this allows us to capture richer actions, it also introduces challenges stemming from the video diversity, such as environmental noise, or the lack of a common embodiment across videos. To address some of the challenges, we discuss properties that actions should follow as well as relevant architectural choices and evaluations. We find that continuous, but constrained, latent actions are able to capture the complexity of actions from in-the-wild videos, something that the common vector quantization does not. We for example find that changes in the environment coming from agents, such as humans entering the room, can be transferred across videos. This highlights the capability of learning actions that are specific to in-the-wild videos. In the absence of a common embodiment across videos, we are mainly able to learn latent actions that become localized in space, relative to the camera. Nonetheless, we are able to train a controller that maps known actions to latent ones, allowing us to use latent actions as a universal interface and solve planning tasks with our world model with similar performance as action-conditioned baselines. Our analyses and experiments provide a step towards scaling latent action models to the real world.

2026-01-07

ArXiv (preprint)

Inference-time Physics Alignment of Video Generative Models with Latent World Models

Jianhao Yuan

Xiaofeng Zhang

Felix Friedrich

Nicolas Beltran-Velez

Melissa Hall

Reyhane Askari-Hemmat

Xiaochuang Han

Michal Drozdzal

Adriana Romero-Soriano

State-of-the-art video generative models produce promising visual content yet often violate basic physics principles, limiting their utility. While some attribute this deficiency to insufficient physics understanding from pre-training, we find that the shortfall in physics plausibility also stems from suboptimal inference strategies. We therefore introduce WMReward and treat improving physics plausibility of video generation as an inference-time alignment problem. In particular, we leverage the strong physics prior of a latent world model (here, VJEPA-2) as a reward to search and steer multiple candidate denoising trajectories, enabling scaling test-time compute for better generation performance. Empirically, our approach substantially improves physics plausibility across image-conditioned, multiframe-conditioned, and text-conditioned generation settings, with validation from human preference study. Notably, on the challenging PhysicsIQ benchmark we achieve 62.00% final score, outperforming previous state of the art by 6.78%. Our work demonstrates the viability of using latent world models to improve physical plausibility of video generation, beyond this specific instantiation or parameterization.

2025-12-31

IEEE/CVF Conference on Computer Vision and Pattern Recognition (Accept (Highlight))

Improving the Physics of Video Generation with VJEPA-2 Reward Signal

Jianhao Yuan

Xiaofeng Zhang

Felix Friedrich

Nicolas Beltran-Velez

Melissa Hall

Reyhane Askari Hemmat

Xiaochuang Han

Michal Drozdzal

Adriana Romero

2025-10-21

ArXiv (preprint)

Gaussian Embeddings: How JEPAs Secretly Learn Your Data Density

Randall Balestriero

Michael G. Rabbat

Joint Embedding Predictive Architectures (JEPAs) learn representations able to solve numerous downstream tasks out-of-the-box. JEPAs combine two objectives: (i) a latent-space prediction term, i.e., the representation of a slightly perturbed sample must be predictable from the original sample's representation, and (ii) an anti-collapse term, i.e., not all samples should have the same representation. While (ii) is often considered as an obvious remedy to representation collapse, we uncover that JEPAs' anti-collapse term does much more--it provably estimates the data density. In short, any successfully trained JEPA can be used to get sample probabilities, e.g., for data curation, outlier detection, or simply for density estimation. Our theoretical finding is agnostic of the dataset and architecture used--in any case one can compute the learned probabilities of sample

2025-10-06

ArXiv (preprint)

V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

Mahmoud Assran

Adrien Bardes

David Fan

Quentin Garrido

Russell Howes

Mojtaba Komeili

Matthew J. Muckley

Ammar Rizvi

Claire Roberts

Koustuv Sinha

Artem Zholus

Sergio Arnaud

Abha Gejji

Ada Martin

Francois Hogan

Daniel Dugas

Piotr Bojanowski

Vasil Khalidov

Patrick Labatut

Francisco Massa

Marc Szafraniec

K. Krishnakumar

Ying Li

Xiaodong Ma

A. Chandar

Franziska Meier

Michael G. Rabbat

FAIR at Meta

Mila - Québec AI Institute

Polytechnique Montréal

A major challenge for modern AI is to learn to understand the world and learn to act largely by observation. This paper explores a self-supervised approach that combines internet-scale video data with a small amount of interaction data (robot trajectories), to develop models capable of understanding, predicting, and planning in the physical world. We first pre-train an action-free joint-embedding-predictive architecture, V-JEPA 2, on a video and image dataset comprising over 1 million hours of internet video. V-JEPA 2 achieves strong performance on motion understanding (77.3 top-1 accuracy on Something-Something v2) and state-of-the-art performance on human action anticipation (39.7 recall-at-5 on Epic-Kitchens-100) surpassing previous task-specific models. Additionally, after aligning V-JEPA 2 with a large language model, we demonstrate state-of-the-art performance on multiple video question-answering tasks at the 8 billion parameter scale (e.g., 84.0 on PerceptionTest, 76.9 on TempCompass). Finally, we show how self-supervised learning can be applied to robotic planning tasks by post-training a latent action-conditioned world model, V-JEPA 2-AC, using less than 62 hours of unlabeled robot videos from the Droid dataset. We deploy V-JEPA 2-AC zero-shot on Franka arms in two different labs and enable picking and placing of objects using planning with image goals. Notably, this is achieved without collecting any data from the robots in these environments, and without any task-specific training or reward. This work demonstrates how self-supervised learning from web-scale data and a small amount of robot interaction data can yield a world model capable of planning in the physical world.

2025-06-10

ArXiv (preprint)

Locate 3D: Real-World Object Localization via Self-Supervised Learning in 3D

Sergio Arnaud

Paul McVay

Ada Martin

Arjun Majumdar

Krishna Murthy

Phillip Thomas

Ruslan Partsey

Daniel Dugas

Abha Gejji

Alexander Sax

Vincent-Pierre Berges

Mikael Henaff

Ayush Jain

Ang Cao

Ishita Prasad

Mrinal Kalakrishnan

Michael G. Rabbat

Mahmoud Assran

Oleksandr Maksymets

Aravind Rajeswaran

Franziska Meier

2025-04-30

ICML.cc/2025/Conference (poster)

proceedings.mlr.press

Scaling Language-Free Visual Representation Learning

David Fan

Shengbang Tong

Jiachen Zhu

Koustuv Sinha

Zhuang Liu

Xinlei Chen

Michael G. Rabbat

Amir Bar

Saining Xie

2025-03-31

ArXiv (preprint)

Intuitive physics understanding emerges from self-supervised pretraining on natural videos

Quentin Garrido

Mahmoud Assran

Adrien Bardes

Laurent Najman

Michael G. Rabbat

Emmanuel Dupoux

We investigate the emergence of intuitive physics understanding in general-purpose deep neural network models trained to predict masked regions in natural videos. Leveraging the violation-of-expectation framework, we find that video prediction models trained to predict outcomes in a learned representation space demonstrate an understanding of various intuitive physics properties, such as object permanence and shape consistency. In contrast, video prediction in pixel space and multimodal large language models, which reason through text, achieve performance closer to chance. Our comparisons of these architectures reveal that jointly learning an abstract representation space while predicting missing parts of sensory input, akin to predictive coding, is sufficient to acquire an understanding of intuitive physics, and that even models trained on one week of unique video achieve above chance performance. This challenges the idea that core knowledge -- a set of innate systems to help understand the world -- needs to be hardwired to develop an understanding of intuitive physics.

2025-02-16

ArXiv (preprint)

Revisiting Feature Prediction for Learning Visual Representations from Video

Adrien Bardes

Quentin Garrido

Jean Ponce

Xinlei Chen

Michael G. Rabbat

Mahmoud Assran

2024-08-08

TMLR (accepted)

openreview.net

Discovering Environments with XRM

Mohammad Pezeshki

Diane Bouchacourt

Mark Ibrahim

P Vincent

David Lopez-Paz

Environment annotations are essential for the success of many out-of-distribution (OOD) generalization methods. Unfortunately, these are costly to obtain and often limited by human annotators’ biases. To achieve robust generalization, it is essential to develop algorithms for automatic environment discovery within datasets. Current proposals, which divide examples based on their training error, suffer from one fundamental problem. These methods introduce hyper-parameters and early-stopping criteria, which require a validation set with human-annotated environments, the very information subject to discovery. In this paper, we propose Cross-Risk Minimization (XRM) to address this issue. XRM trains twin networks, each learning from one random half of the training data, while imitating confident held-out mistakes made by its sibling. XRM provides a recipe for hyper-parameter tuning, does not require early-stopping, and can discover environments for all training and validation data. Algorithms built on top of XRM environments achieve oracle worst-group-accuracy, addressing a long-standing challenge in OOD generalization.

2024-07-07

Proceedings of the 41st International Conference on Machine Learning (published)

proceedings.mlr.press

Modeling Caption Diversity in Contrastive Vision-Language Pretraining

Samuel Lavoie

Polina Kirichenko

Mark Ibrahim

Mahmoud Assran

Andrew Gordon Wilson

Aaron Courville

There are a thousand ways to caption an image. Contrastive Language-Image Pretraining (CLIP), on the other hand, works by mapping an image and its caption to a single vector -- limiting how well CLIP-like models can represent the diverse ways to describe an image. In this work, we introduce Llip, Latent Language Image Pretraining, which models the diversity of captions that could match an image. Llip's vision encoder outputs a set of visual features that are mixed into a final representation by conditioning on information derived from the text. We show that Llip outperforms non-contextualized baselines like CLIP and SigLIP on a variety of tasks even with large-scale encoders. Llip improves zero-shot classification by an average of 2.9% on zero-shot classification benchmarks with a ViT-G/14 encoder. Specifically, Llip attains a zero-shot top-1 accuracy of 83.5% on ImageNet, outperforming a similarly sized CLIP by 1.4%. We also demonstrate improvement on zero-shot retrieval on MS-COCO by 6.0%. We provide a comprehensive analysis of the components introduced by the method and demonstrate that Llip leads to richer visual representations.

2024-07-07

Proceedings of the 41st International Conference on Machine Learning (published)