Nicolas Ballas

Assaf Shocher

Mahmoud Assran

P Vincent

Trevor Darrell

Amir Globerson

Yann LeCun

2024-07-07

Proceedings of the 41st International Conference on Machine Learning (published)

proceedings.mlr.press

DINOv2: Learning Robust Visual Features without Supervision

Maxime Oquab

Timothée Darcet

Théo Moutakanni

Huy V. Vo

Marc Szafraniec

Vasil Khalidov

Pierre Fernandez

Daniel Haziza

Francisco Massa

Alaaeldin El-Nouby

Mahmoud Assran

Wojciech Galuba

Russell Howes

Po-Yao Huang

Shang-Wen Li

Ishan Misra

Michael G. Rabbat

Vasu Sharma

Gabriel Synnaeve

Huijiao Xu

Hu Xu

Herve Jegou

Julien Mairal

Patrick Labatut

Armand Joulin

Piotr Bojanowski

The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning. This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources. We revisit existing approaches and combine different techniques to scale our pretraining in terms of data and model size. Most of the technical contributions aim at accelerating and stabilizing the training at scale. In terms of data, we propose an automatic pipeline to build a dedicated, diverse, and curated image dataset instead of uncurated data, as typically done in the self-supervised literature. In terms of models, we train a ViT model with 1B parameters and distill it into a series of smaller models that surpass the best available all-purpose features, OpenCLIP, on most of the benchmarks at image and pixel levels.

2024-01-10

TMLR (accepted)

Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture

Mahmoud Assran

Quentin Duval

Ishan Misra

Piotr Bojanowski

This paper demonstrates an approach for learning highly semantic image representations without relying on hand-crafted data-augmentations. We introduce the Image-based Joint-Embedding Predictive Architecture (I-JEPA), a non-generative approach for self-supervised learning from images. The idea behind I-JEPA is simple: from a single context block, predict the representations of various target blocks in the same image. A core design choice to guide I-JEPA towards producing semantic representations is the masking strategy; specifically, it is crucial to (a) sample target blocks with sufficiently large scale (semantic), and to (b) use a sufficiently informative (spatially distributed) context block. Empirically, when combined with Vision Transformers, we find I-JEPA to be highly scalable. For instance, we train a ViT-Huge/14 on ImageNet using 16 A100 GPUs in under 72 hours to achieve strong downstream performance across a wide range of tasks, from linear classification to object counting and depth prediction.

2023-06-16

2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (published)

A surprisingly simple technique to control the pretraining bias for better transfer: Expand or Narrow your representation

Samuel Lavoie

Randall Balestriero

P Vincent

2023-04-10

arXiv (preprint)

ImageNet-X: Understanding Model Mistakes with Factor of Variation Annotations

Badr Youbi Idrissi

Diane Bouchacourt

Randall Balestriero

Ivan Evtimov

Caner Hazirbas

P Vincent

Michal Drozdzal

David Lopez-Paz

Mark Ibrahim

Deep learning vision systems are widely deployed across applications where reliability is critical. However, even today's best models can fail to recognize an object when its pose, lighting, or background varies. While existing benchmarks surface examples challenging for models, they do not explain why such mistakes arise. To address this need, we introduce ImageNet-X, a set of sixteen human annotations of factors such as pose, background, or lighting for the entire ImageNet-1k validation set as well as a random subset of 12k training images. Equipped with ImageNet-X, we investigate 2,200 current recognition models and study the types of mistakes as a function of a model's (1) architecture, e.g., transformer vs. convolutional, (2) learning paradigm, e.g., supervised vs. self-supervised, and (3) training procedures, e.g., data augmentation. Regardless of these choices, we find models have consistent failure modes across ImageNet-X categories. We also find that while data augmentation can improve robustness to certain factors, it induces spill-over effects to other factors. For example, color-jitter augmentation improves robustness to color and brightness, but surprisingly hurts robustness to pose. Together, these insights suggest that to advance the robustness of modern vision models, future research should focus on collecting additional data and understanding data augmentation schemes. Along with these insights, we release a toolkit based on ImageNet-X to spur further study into the mistakes image recognition systems make.

2023-01-31

ICLR.cc/2023/Conference (notable)

The Hidden Uniform Cluster Prior in Self-Supervised Learning

Mahmoud Assran

Randall Balestriero

Quentin Duval

Ishan Misra

Piotr Bojanowski

P Vincent

Michael G. Rabbat

A successful paradigm in representation learning is to perform self-supervised pretraining using tasks based on mini-batch statistics (e.g., SimCLR, VICReg, SwAV, MSN). We show that in the formulation of all these methods is an overlooked prior to learn features that enable uniform clustering of the data. While this prior has led to remarkably semantic representations when pretraining on class-balanced data, such as ImageNet, we demonstrate that it can hamper performance when pretraining on class-imbalanced data. By moving away from conventional uniformity priors and instead preferring power-law distributed feature clusters, we show that one can improve the quality of the learned representations on real-world class-imbalanced datasets. To demonstrate this, we develop an extension of the Masked Siamese Networks (MSN) method to support the use of arbitrary feature priors.

2023-01-31

ICLR.cc/2023/Conference (poster)

Cascaded Video Generation for Videos In-the-Wild

Lluis Castrejon

Aaron Courville

Videos can be created by first outlining a global view of the scene and then adding local details. Inspired by this idea, we propose a cascaded model for video generation which follows a coarse-to-fine approach. First, our model generates a low-resolution video, establishing the global scene structure, which is then refined by subsequent cascade levels operating at larger resolutions. We train each cascade level sequentially on partial views of the videos, which reduces the computational complexity of our model and makes it scalable to high-resolution videos with many frames. We empirically validate our approach on UCF101 and Kinetics-600, for which our model is competitive with the state-of-the-art. We further demonstrate the scaling capabilities of our model and train a three-level model on the BDD100K dataset which generates 256x256 pixel videos with 48 frames.

2022-08-20

2022 26th International Conference on Pattern Recognition (ICPR) (published)

VIM: Variational Independent Modules for Video Prediction

Lluis Castrejon

2022-06-27

Proceedings of the First Conference on Causal Learning and Reasoning (published)

proceedings.mlr.press

Masked Siamese Networks for Label-Efficient Learning

Mahmoud Assran

Mathilde Caron

Ishan Misra

Piotr Bojanowski

P Vincent

Armand Joulin

Michael G. Rabbat

We propose Masked Siamese Networks (MSN), a self-supervised learning framework for learning image representations. Our approach matches the representation of an image view containing randomly masked patches to the representation of the original unmasked image. This self-supervised pre-training strategy is particularly scalable when applied to Vision Transformers since only the unmasked patches are processed by the network. As a result, MSNs improve the scalability of joint-embedding architectures, while producing representations of a high semantic level that perform competitively on low-shot image classification. For instance, on ImageNet-1K, with only 5,000 annotated images, our base MSN model achieves 72.4% top-1 accuracy, and with 1% of ImageNet-1K labels, we achieve 75.7% top-1 accuracy, setting a new state-of-the-art for self-supervised learning on this benchmark. Our code is publicly available.

2022-04-13

arXiv (preprint)

INFERNO: Inferring Object-Centric 3D Scene Representations without Supervision

Lluis Castrejon

Aaron Courville

We propose INFERNO, a method to infer object-centric representations of visual scenes without annotations. Our method decomposes a scene into multiple objects, with each object having a structured representation that disentangles its shape, appearance and pose. Each object representation defines a localized neural radiance field used to generate 2D views of the scene through differentiable rendering. Our model is subsequently trained by minimizing a reconstruction loss between inputs and corresponding rendered scenes. We empirically show that INFERNO discovers objects in a scene without supervision. We also validate the interpretability of the learned representations by manipulating inferred scenes and showing the corresponding effect in the rendered output. Finally, we demonstrate the usefulness of our 3D object representations in a visual reasoning task using the CATER dataset.

2022-03-24

ICLR.cc/2022/Workshop/OSC (poster)

Neural Attentive Circuits

Nasim Rahaman

Martin Weiss

Francesco Locatello

Chris Pal

Yoshua Bengio

Bernhard Schölkopf

Li Erran Li

Recent work has seen the development of general purpose neural architectures that can be trained to perform tasks across diverse data modalities. General purpose models typically make few assumptions about the underlying data-structure and are known to perform well in the large-data regime. At the same time, there has been growing interest in modular neural architectures that represent the data using sparsely interacting modules. These models can be more robust out-of-distribution, computationally efficient, and capable of sample-efficient adaptation to new data. However, they tend to make domain-specific assumptions about the data, and present challenges in how module behavior (i.e., parameterization) and connectivity (i.e., their layout) can be jointly learned. In this work, we introduce a general purpose, yet modular neural architecture called Neural Attentive Circuits (NACs) that jointly learns the parameterization and a sparse connectivity of neural modules without using domain knowledge. NACs are best understood as the combination of two systems that are jointly trained end-to-end: one that determines the module configuration and the other that executes it on an input. We demonstrate qualitatively that NACs learn diverse and meaningful module configurations on the NLVR2 dataset without additional supervision. Quantitatively, we show that by incorporating modularity in this way, NACs improve upon a strong non-modular baseline in terms of low-shot adaptation on the CIFAR and CUBs datasets by about 10%, and OOD robustness on Tiny ImageNet-R by about 2.5%. Further, we find that NACs can achieve an 8x speedup at inference time while losing less than 3% performance. Finally, we find NACs to yield competitive results on diverse data modalities spanning point-cloud classification, symbolic processing and text-classification from ASCII bytes, thereby confirming their general purpose nature.

2021-12-31

Advances in Neural Information Processing Systems 35 (NeurIPS 2022) (published)

Hierarchical Video Generation for Complex Data

Lluis Castrejon

Aaron Courville

2021-06-03

arXiv (preprint)