Michael Rabbat

IntPhys 2: Benchmarking Intuitive Physics Understanding In Complex Synthetic Environments

Florian Bordes

Quentin Garrido

Justine T Kao

Adina Williams

Emmanuel Dupoux

We present IntPhys 2, a video benchmark designed to evaluate the intuitive physics understanding of deep learning models. Building on the original IntPhys benchmark, IntPhys 2 focuses on four core principles related to macroscopic objects: Permanence, Immutability, Spatio-Temporal Continuity, and Solidity. These conditions are inspired by research into intuitive physical understanding emerging during early childhood. IntPhys 2 offers a comprehensive suite of tests, based on the violation of expectation framework, that challenge models to differentiate between possible and impossible events within controlled and diverse virtual environments. Alongside the benchmark, we provide performance evaluations of several state-of-the-art models. Our findings indicate that while these models demonstrate basic visual understanding, they face significant challenges in grasping intuitive physics across the four principles in complex scenes, with most models performing at chance levels (50%), in stark contrast to human performance, which achieves near-perfect accuracy. This underscores the gap between current models and human-like intuitive physics understanding, highlighting the need for advancements in model architectures and training methodologies.

2025-06-11

ArXiv (preprint)

V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

Mahmoud Assran

Adrien Bardes

David Fan

Quentin Garrido

Russell Howes

Mojtaba Komeili

Matthew J. Muckley

Ammar Rizvi

Claire Roberts

Koustuv Sinha

Artem Zholus

Sergio Arnaud

Abha Gejji

Ada Martin

Francois Robert Hogan

Daniel Dugas

Piotr Bojanowski

Vasil Khalidov

Patrick Labatut

Francisco Massa

Marc Szafraniec

K. Krishnakumar

Yong Li

Xiaodong Ma

Sarath Chandar

Franziska Meier

Yann LeCun

Nicolas Ballas

FAIR at Meta

Mila - Québec AI Institute

Polytechnique Montréal

A major challenge for modern AI is to learn to understand the world and learn to act largely by observation. This paper explores a self-supervised approach that combines internet-scale video data with a small amount of interaction data (robot trajectories), to develop models capable of understanding, predicting, and planning in the physical world. We first pre-train an action-free joint-embedding-predictive architecture, V-JEPA 2, on a video and image dataset comprising over 1 million hours of internet video. V-JEPA 2 achieves strong performance on motion understanding (77.3 top-1 accuracy on Something-Something v2) and state-of-the-art performance on human action anticipation (39.7 recall-at-5 on Epic-Kitchens-100) surpassing previous task-specific models. Additionally, after aligning V-JEPA 2 with a large language model, we demonstrate state-of-the-art performance on multiple video question-answering tasks at the 8 billion parameter scale (e.g., 84.0 on PerceptionTest, 76.9 on TempCompass). Finally, we show how self-supervised learning can be applied to robotic planning tasks by post-training a latent action-conditioned world model, V-JEPA 2-AC, using less than 62 hours of unlabeled robot videos from the Droid dataset. We deploy V-JEPA 2-AC zero-shot on Franka arms in two different labs and enable picking and placing of objects using planning with image goals. Notably, this is achieved without collecting any data from the robots in these environments, and without any task-specific training or reward. This work demonstrates how self-supervised learning from web-scale data and a small amount of robot interaction data can yield a world model capable of planning in the physical world.

2025-06-11

ArXiv (preprint)

Locate 3D: Real-World Object Localization via Self-Supervised Learning in 3D

Paul McVay

Sergio Arnaud

Ada Martin

Arjun Majumdar

Krishna Murthy

Phillip Thomas

Ruslan Partsey

Daniel Dugas

Abha Gejji

Alexander Sax

Vincent-Pierre Berges

Mikael Henaff

Ayush Jain

Ang Cao

Ishita Prasad

Mrinal Kalakrishnan

Nicolas Ballas

Mahmoud Assran

Oleksandr Maksymets

Aravind Rajeswaran

Franziska Meier

2025-05-01

ICML.cc/2025/Conference (poster)

openreview.net

Locate 3D: Real-World Object Localization via Self-Supervised Learning in 3D

Sergio Arnaud

Paul McVay

Ada Martin

Arjun Majumdar

Krishna Murthy

Phillip Thomas

Ruslan Partsey

Daniel Dugas

Abha Gejji

Alexander Sax

Vincent-Pierre Berges

Mikael Henaff

Ayush Jain

Ang Cao

Ishita Prasad

Mrinal Kalakrishnan

Nicolas Ballas

Mido Assran

Oleksandr Maksymets

Aravind Rajeswaran

Franziska Meier

2025-04-19

ArXiv (preprint)

Scaling Language-Free Visual Representation Learning

David Fan

Shengbang Tong

Jiachen Zhu

Koustuv Sinha

Zhuang Liu

Xinlei Chen

Nicolas Ballas

Yann LeCun

Amir Bar

Saining Xie

2025-04-01

ArXiv (preprint)

Scaling Language-Free Visual Representation Learning

David Fan

Shengbang Tong

Jiachen Zhu

Koustuv Sinha

Zhuang Liu

Xinlei Chen

Nicolas Ballas

Yann LeCun

Amir Bar

Saining Xie

2025-04-01

arXiv (published)

Intuitive physics understanding emerges from self-supervised pretraining on natural videos

Quentin Garrido

Nicolas Ballas

Mahmoud Assran

Adrien Bardes

Laurent Najman

Emmanuel Dupoux

Yann LeCun

We investigate the emergence of intuitive physics understanding in general-purpose deep neural network models trained to predict masked regions in natural videos. Leveraging the violation-of-expectation framework, we find that video prediction models trained to predict outcomes in a learned representation space demonstrate an understanding of various intuitive physics properties, such as object permanence and shape consistency. In contrast, video prediction in pixel space and multimodal large language models, which reason through text, achieve performance closer to chance. Our comparisons of these architectures reveal that jointly learning an abstract representation space while predicting missing parts of sensory input, akin to predictive coding, is sufficient to acquire an understanding of intuitive physics, and that even models trained on one week of unique video achieve above chance performance. This challenges the idea that core knowledge -- a set of innate systems to help understand the world -- needs to be hardwired to develop an understanding of intuitive physics.

2025-02-17

ArXiv (preprint)

Accelerating neural network training: An analysis of the AlgoPerf competition

Priya Kasimbeg

Frank Schneider

Runa Eschenhagen

Juhan Bae

Chandramouli Shama Sastry

Mark Saroufim

Boyuan Feng

Less Wright

Edward Z. Yang

Zachary Nado

Sourabh Medapati

Philipp Hennig

George E. Dahl

The goal of the AlgoPerf: Training Algorithms competition is to evaluate practical speed-ups in neural network training achieved solely by improving the underlying training algorithms. In the external tuning ruleset, submissions must provide workload-agnostic hyperparameter search spaces, while in the self-tuning ruleset they must be completely hyperparameter-free. In both rulesets, submissions are compared on time-to-result across multiple deep learning workloads, training on fixed hardware. This paper presents the inaugural AlgoPerf competition's results, which drew 18 diverse submissions from 10 teams. Our investigation reveals several key findings: (1) The winning submission in the external tuning ruleset, using Distributed Shampoo, demonstrates the effectiveness of non-diagonal preconditioning over popular methods like Adam, even when compared on wall-clock runtime. (2) The winning submission in the self-tuning ruleset, based on the Schedule Free AdamW algorithm, demonstrates a new level of effectiveness for completely hyperparameter-free training algorithms. (3) The top-scoring submissions were surprisingly robust to workload changes. We also discuss the engineering challenges encountered in ensuring a fair comparison between different training algorithms. These results highlight both the significant progress so far, and the considerable room for further improvements.

2025-01-22

ICLR.cc/2025/Conference (poster)

openreview.net

Accelerating neural network training: An analysis of the AlgoPerf competition

Priya Kasimbeg

Frank Schneider

Runa Eschenhagen

Juhan Bae

Chandramouli Shama Sastry

Mark Saroufim

Boyuan Feng

Less Wright

Edward Z. Yang

Zachary Nado

Sourabh Medapati

Philipp Hennig