Portrait of Yann Lecun is unavailable

Yann Lecun

Alumni

Publications

Seq-VCR: Preventing Collapse in Intermediate Transformer Representations for Enhanced Reasoning
Md Rifat Arefin
Nicolas Gontier
Ravid Shwartz-Ziv
MetaMorph: Multimodal Understanding and Generation via Instruction Tuning
Shengbang Tong
David Fan
Jiachen Zhu
Yunyang Xiong
Xinlei Chen
Saining Xie
Zhuang Liu
In this work, we propose Visual-Predictive Instruction Tuning (VPiT) - a simple and effective extension to visual instruction tuning that en… (see more)ables a pretrained LLM to quickly morph into an unified autoregressive model capable of generating both text and visual tokens. VPiT teaches an LLM to predict discrete text tokens and continuous visual tokens from any input sequence of image and text data curated in an instruction-following format. Our empirical investigation reveals several intriguing properties of VPiT: (1) visual generation ability emerges as a natural byproduct of improved visual understanding, and can be unlocked efficiently with a small amount of generation data; (2) while we find understanding and generation to be mutually beneficial, understanding data contributes to both capabilities more effectively than generation data. Building upon these findings, we train our MetaMorph model and achieve competitive performance on both visual understanding and generation. In visual generation, MetaMorph can leverage the world knowledge and reasoning abilities gained from LLM pretraining, and overcome common failure modes exhibited by other generation models. Our results suggest that LLMs may have strong"prior"vision capabilities that can be efficiently adapted to both visual understanding and generation with a relatively simple instruction tuning process.
Revisiting Feature Prediction for Learning Visual Representations from Video
Adrien Bardes
Quentin Garrido
Jean Ponce
Xinlei Chen
Mahmoud Assran
Stochastic positional embeddings improve masked image modeling
Amir Bar
Assaf Shocher
Mahmoud Assran
Trevor Darrell
Amir Globerson
Stochastic positional embeddings improve masked image modeling
Amir Bar
Assaf Shocher
Mahmoud Assran
Trevor Darrell
Amir Globerson
Masked Image Modeling (MIM) is a promising self-supervised learning approach that enables learning from unlabeled images. Despite its recent… (see more) success, learning good representations through MIM remains challenging because it requires predicting the right semantic content in accurate locations. For example, given an incomplete picture of a dog, we can guess that there is a tail, but we cannot determine its exact location. In this work, we propose to incorporate location uncertainty into MIM by using stochastic positional embeddings (StoP). Specifically, we condition the model on stochastic masked token positions drawn from a Gaussian distribution. StoP reduces overfitting to location features and guides the model toward learning features that are more robust to location uncertainties. Quantitatively, StoP improves downstream MIM performance on a variety of downstream tasks, including
Revisiting Feature Prediction for Learning Visual Representations from Video
Adrien Bardes
Quentin Garrido
Jean Ponce
Xinlei Chen
Mahmoud Assran
This paper explores feature prediction as a stand-alone objective for unsupervised learning from video and introduces V-JEPA, a collection o… (see more)f vision models trained solely using a feature prediction objective, without the use of pretrained image encoders, text, negative examples, reconstruction, or other sources of supervision. The models are trained on 2 million videos collected from public datasets and are evaluated on downstream image and video tasks. Our results show that learning by predicting video features leads to versatile visual representations that perform well on both motion and appearance-based tasks, without adaption of the model's parameters; e.g., using a frozen backbone. Our largest model, a ViT-H/16 trained only on videos, obtains 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet1K.
Revisiting Feature Prediction for Learning Visual Representations from Video
Adrien Bardes
Quentin Garrido
Jean Ponce
Xinlei Chen
Mahmoud Assran
This paper explores feature prediction as a stand-alone objective for unsupervised learning from video and introduces V-JEPA, a collection o… (see more)f vision models trained solely using a feature prediction objective, without the use of pretrained image encoders, text, negative examples, reconstruction, or other sources of supervision. The models are trained on 2 million videos collected from public datasets and are evaluated on downstream image and video tasks. Our results show that learning by predicting video features leads to versatile visual representations that perform well on both motion and appearance-based tasks, without adaption of the model's parameters; e.g., using a frozen backbone. Our largest model, a ViT-H/16 trained only on videos, obtains 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet1K.
Blockwise Self-Supervised Learning at Scale
Shoaib Ahmed Siddiqui
Stephane Deny
Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture
Mahmoud Assran
Quentin Duval
Ishan Misra
Piotr Bojanowski
This paper demonstrates an approach for learning highly semantic image representations without relying on hand-crafted data-augmentations. W… (see more)e introduce the Image-based Joint-Embedding Predictive Architecture (I-JEPA), a non-generative approach for self-supervised learning from images. The idea behind I-JEPA is simple: from a single context block, predict the representations of various target blocks in the same image. A core design choice to guide I-JEPA towards producing semantic representations is the masking strategy; specifically, it is crucial to (a) sample target blocks with sufficiently large scale (semantic), and to (b) use a sufficiently informative (spatially distributed) context block. Empirically, when combined with Vision Transformers, we find I-JEPA to be highly scalable. For instance, we train a ViT-Huge/14 on ImageNet using 16 A100 GPUs in under 72 hours to achieve strong downstream performance across a wide range of tasks, from linear classification to object counting and depth prediction.
Catalyzing next-generation Artificial Intelligence through NeuroAI
Anthony Zador
Sean Escola
Bence Ölveczky
Kwabena Boahen
Matthew Botvinick
Dmitri Chklovskii
Anne Churchland
Claudia Clopath
James DiCarlo
Surya
Surya Ganguli
Jeff Hawkins
Konrad Paul Kording
Alexei Koulakov
Timothy P. Lillicrap
Adam
Adam Marblestone … (see 9 more)
Bruno Olshausen
Alexandre Pouget
Cristina Savin
Terrence Sejnowski
Eero Simoncelli
Sara Solla
David Sussillo
Andreas S. Tolias
Doris Tsao
Blockwise self-supervised learning with Barlow Twins
Shoaib Ahmed Siddiqui
Stephane Deny
Current state-of-the-art deep networks are all powered by backpropagation. In this paper, we explore alternatives to full backpropagation in… (see more) the form of blockwise learning rules, leveraging the latest developments in self-supervised learning. Notably, we show that a blockwise pretraining procedure consisting of training independently the 4 main blocks of layers of a ResNet-50 with Barlow Twins loss function at each block performs almost as well as end-to-end backpropagation on ImageNet: a linear probe trained on top of our blockwise pretrained model obtains a top-1 classification accuracy of 70.48\%, only 1.1\% below the accuracy of an end-to-end pretrained network (71.57\% accuracy). We perform extensive experiments to understand the impact of different components within our method and explore a variety of adaptations of self-supervised learning to the blockwise paradigm, building an exhaustive understanding of the critical avenues for scaling local learning rules to large networks, with implications ranging from hardware design to neuroscience.
Deep learning, reinforcement learning, and world models
Yu Matsuo
Maneesh Sahani
David Silver
Masashi Sugiyama
Eiji Uchibe
J. Morimoto