Yann LeCun

Alumni

Publications

A Lightweight Library for Energy-Based Joint-Embedding Predictive Architectures
Randall Balestriero
Megi Dervishi
David Fan
Quentin Garrido
Tushar Nagarajan
Wancong Zhang
Amir Bar
We present EB-JEPA, an open-source library for learning representations and world models using Joint-Embedding Predictive Architectures (JEPAs). JEPAs learn to predict in representation space rather than pixel space, avoiding the pitfalls of generative modeling while capturing semantically meaningful features suitable for downstream tasks. Our library provides modular, self-contained implementations that illustrate how representation learning techniques developed for image-level self-supervised learning can transfer to video, where temporal dynamics add complexity, and ultimately to action-conditioned world models, where the model must additionally learn to predict the effects of control inputs. Each example is designed for single-GPU training within a few hours, making energy-based self-supervised learning accessible for research and education. We provide ablations of JEA components on CIFAR-10. Probing these representations yields 91% accuracy, indicating that the model learns useful features. Extending to video, we include a multi-step prediction example on Moving MNIST that demonstrates how the same principles scale to temporal modeling. Finally, we show how these representations can drive action-conditioned world models, achieving a 97% planning success rate on the Two Rooms navigation task. Comprehensive ablations reveal the critical importance of each regularization component for preventing representation collapse. Code is available at https://github.com/facebookresearch/eb_jepa.
Parallel Stochastic Gradient-Based Planning for World Models
Michael Psenka
Aditi Krishnapriyan
Amir Bar
World models simulate environment dynamics from raw sensory inputs like video. However, using them for planning can be challenging due to the vast and unstructured search space. We propose a robust and highly parallelizable planner that leverages the differentiability of the learned world model for efficient optimization, solving long-horizon control tasks from visual input. Our method treats states as optimization variables ("virtual states") with soft dynamics constraints, enabling parallel computation and easier optimization. To facilitate exploration and avoid local optima, we introduce stochasticity into the states. To mitigate sensitive gradients through high-dimensional vision-based world models, we modify the gradient structure to descend towards valid plans while only requiring action-input gradients. Our planner, which we call GRASP (Gradient RelAxed Stochastic Planner), can be viewed as a stochastic version of a non-condensed or collocation-based optimal controller. We provide theoretical justification and experiments on video-based world models, where our resulting planner outperforms existing planning algorithms like the cross-entropy method (CEM) and vanilla gradient-based optimization (GD) on long-horizon experiments, both in success rate and time to convergence.
Learning Latent Action World Models In The Wild
Quentin Garrido
Tushar Nagarajan
Agents capable of reasoning and planning in the real world require the ability to predict the consequences of their actions. While world models possess this capability, they most often require action labels, which can be complex to obtain at scale. This motivates the learning of latent action models, which can learn an action space from videos alone. Our work addresses the problem of learning latent action world models on in-the-wild videos, expanding the scope of existing works that focus on simple robotics simulations, video games, or manipulation data. While this allows us to capture richer actions, it also introduces challenges stemming from the video diversity, such as environmental noise or the lack of a common embodiment across videos. To address some of the challenges, we discuss properties that actions should follow as well as relevant architectural choices and evaluations. We find that continuous, but constrained, latent actions are able to capture the complexity of actions from in-the-wild videos, something that the common vector quantization does not. For example, we find that changes in the environment coming from agents, such as humans entering the room, can be transferred across videos. This highlights the capability of learning actions that are specific to in-the-wild videos. In the absence of a common embodiment across videos, we are mainly able to learn latent actions that become localized in space, relative to the camera. Nonetheless, we are able to train a controller that maps known actions to latent ones, allowing us to use latent actions as a universal interface and solve planning tasks with our world model with similar performance as action-conditioned baselines. Our analyses and experiments provide a step towards scaling latent action models to the real world.
World Models Can Leverage Human Videos for Dexterous Manipulation
Raktim Gautam Goswami
Amir Bar
David Fan
Tsung-Yen Yang
Gaoyue Zhou
Prashanth Krishnamurthy
Farshad Khorrami
Dexterous manipulation is challenging because it requires understanding how subtle hand motion influences the environment through contact with objects. We introduce DexWM, a Dexterous Manipulation World Model that predicts the next latent state of the environment conditioned on past states and dexterous actions. To overcome the scarcity of dexterous manipulation datasets, DexWM is trained on over 900 hours of human and non-dexterous robot videos. To enable fine-grained dexterity, we find that predicting visual features alone is insufficient; therefore, we introduce an auxiliary hand consistency loss that enforces accurate hand configurations. DexWM outperforms prior world models conditioned on text, navigation, and full-body actions, achieving more accurate predictions of future states. DexWM also demonstrates strong zero-shot generalization to unseen manipulation skills when deployed on a Franka Panda arm equipped with an Allegro gripper, outperforming Diffusion Policy by over 50% on average in grasping, placing, and reaching tasks.
V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
Mahmoud Assran
Adrien Bardes
David Fan
Quentin Garrido
Russell Howes
Mojtaba Komeili
Matthew J. Muckley
Ammar Rizvi
Claire Roberts
Sergio Arnaud
Abha Gejji
Ada Martin
Francois Robert Hogan
Daniel Dugas
Piotr Bojanowski
Vasil Khalidov
Patrick Labatut
Francisco Massa
Marc Szafraniec
K. Krishnakumar
Yong Li
Xiaodong Ma
Franziska Meier
FAIR at Meta
Mila - Québec AI Institute
Polytechnique Montréal
A major challenge for modern AI is to learn to understand the world and learn to act largely by observation. This paper explores a self-supervised approach that combines internet-scale video data with a small amount of interaction data (robot trajectories), to develop models capable of understanding, predicting, and planning in the physical world. We first pre-train an action-free joint-embedding-predictive architecture, V-JEPA 2, on a video and image dataset comprising over 1 million hours of internet video. V-JEPA 2 achieves strong performance on motion understanding (77.3 top-1 accuracy on Something-Something v2) and state-of-the-art performance on human action anticipation (39.7 recall-at-5 on Epic-Kitchens-100) surpassing previous task-specific models. Additionally, after aligning V-JEPA 2 with a large language model, we demonstrate state-of-the-art performance on multiple video question-answering tasks at the 8 billion parameter scale (e.g., 84.0 on PerceptionTest, 76.9 on TempCompass). Finally, we show how self-supervised learning can be applied to robotic planning tasks by post-training a latent action-conditioned world model, V-JEPA 2-AC, using less than 62 hours of unlabeled robot videos from the Droid dataset. We deploy V-JEPA 2-AC zero-shot on Franka arms in two different labs and enable picking and placing of objects using planning with image goals. Notably, this is achieved without collecting any data from the robots in these environments, and without any task-specific training or reward. This work demonstrates how self-supervised learning from web-scale data and a small amount of robot interaction data can yield a world model capable of planning in the physical world.
Scaling Language-Free Visual Representation Learning
David Fan
Shengbang Tong
Jiachen Zhu
Zhuang Liu
Xinlei Chen
Amir Bar
Saining Xie
Intuitive physics understanding emerges from self-supervised pretraining on natural videos
Quentin Garrido
Mahmoud Assran
Adrien Bardes
Laurent Najman
Emmanuel Dupoux
We investigate the emergence of intuitive physics understanding in general-purpose deep neural network models trained to predict masked regions in natural videos. Leveraging the violation-of-expectation framework, we find that video prediction models trained to predict outcomes in a learned representation space demonstrate an understanding of various intuitive physics properties, such as object permanence and shape consistency. In contrast, video prediction in pixel space and multimodal large language models, which reason through text, achieve performance closer to chance. Our comparisons of these architectures reveal that jointly learning an abstract representation space while predicting missing parts of sensory input, akin to predictive coding, is sufficient to acquire an understanding of intuitive physics, and that even models trained on one week of unique video achieve above chance performance. This challenges the idea that core knowledge -- a set of innate systems to help understand the world -- needs to be hardwired to develop an understanding of intuitive physics.