Pascal Vincent

Assaf Shocher

Mahmoud Assran

Nicolas Ballas

Trevor Darrell

Amir Globerson

Yann LeCun

Masked Image Modeling (MIM) is a promising self-supervised learning approach that enables learning from unlabeled images. Despite its recent… (voir plus) success, learning good representations through MIM remains challenging because it requires predicting the right semantic content in accurate locations. For example, given an incomplete picture of a dog, we can guess that there is a tail, but we cannot determine its exact location. In this work, we propose to incorporate location uncertainty into MIM by using stochastic positional embeddings (StoP). Specifically, we condition the model on stochastic masked token positions drawn from a Gaussian distribution. StoP reduces overfitting to location features and guides the model toward learning features that are more robust to location uncertainties. Quantitatively, StoP improves downstream MIM performance on a variety of downstream tasks, including

2024-05-01

ICML.cc/2024/Conference (poster)

On the Identifiability of Quantized Factors

Vitoria Barin Pacela

Kartik Ahuja

Disentanglement aims to recover meaningful latent ground-truth factors from the observed distribution solely, and is formalized through the… (voir plus) theory of identifiability. The identifiability of independent latent factors is proven to be impossible in the unsupervised i.i.d. setting under a general nonlinear map from factors to observations. In this work, however, we demonstrate that it is possible to recover quantized latent factors under a generic nonlinear diffeomorphism. We only assume that the latent factors have independent discontinuities in their density, without requiring the factors to be statistically independent. We introduce this novel form of identifiability, termed quantized factor identifiability, and provide a comprehensive proof of the recovery of the quantized factors.

2024-03-15

Proceedings of the Third Conference on Causal Learning and Reasoning (publié)

proceedings.mlr.press

arxiv.org

Motif: Intrinsic Motivation from Artificial Intelligence Feedback

Martin Klissarov

Pierluca D'Oro

Shagun Sodhani

Roberta Raileanu

Pierre-Luc Bacon

Amy Zhang

Mikael Henaff

Exploring rich environments and evaluating one's actions without prior knowledge is immensely challenging. In this paper, we propose Motif, … (voir plus)a general method to interface such prior knowledge from a Large Language Model (LLM) with an agent. Motif is based on the idea of grounding LLMs for decision-making without requiring them to interact with the environment: it elicits preferences from an LLM over pairs of captions to construct an intrinsic reward, which is then used to train agents with reinforcement learning. We evaluate Motif's performance and behavior on the challenging, open-ended and procedurally-generated NetHack game. Surprisingly, by only learning to maximize its intrinsic reward, Motif achieves a higher game score than an algorithm directly trained to maximize the score itself. When combining Motif's intrinsic reward with the environment reward, our method significantly outperforms existing approaches and makes progress on tasks where no advancements have ever been made without demonstrations. Finally, we show that Motif mostly generates intuitive human-aligned behaviors which can be steered easily through prompt modifications, while scaling well with the LLM size and the amount of information given in the prompt.

2024-01-16

ICLR.cc/2024/Conference (poster)

doi.org

Discovering environments with XRM

Mohammad Pezeshki

Diane Bouchacourt

Mark Ibrahim

Nicolas Ballas

David Lopez-Paz

Successful out-of-distribution generalization requires environment annotations. Unfortunately, these are resource-intensive to obtain, and t… (voir plus)heir relevance to model performance is limited by the expectations and perceptual biases of human annotators. Therefore, to enable robust AI systems across applications, we must develop algorithms to automatically discover environments inducing broad generalization. Current proposals, which divide examples based on their training error, suffer from one fundamental problem. These methods add hyper-parameters and early-stopping criteria that are impossible to tune without a validation set with human-annotated environments, the very information subject to discovery. In this paper, we propose Cross-Risk-Minimization (XRM) to address this issue. XRM trains two twin networks, each learning from one random half of the training data, while imitating confident held-out mistakes made by its sibling. XRM provides a recipe for hyper-parameter tuning, does not require early-stopping, and can discover environments for all training and validation data. Domain generalization algorithms built on top of XRM environments achieve oracle worst-group-accuracy, solving a long-standing problem in out-of-distribution generalization.

2023-10-27

NeurIPS.cc/2023/Workshop/DistShift (poster)

Self-Supervised Disentanglement by Leveraging Structure in Data Augmentations

Cian Eastwood

Julius von Kügelgen

Linus Ericsson

Diane Bouchacourt

Mark Ibrahim

Bernhard Schölkopf

Self-supervised representation learning often uses data augmentations to induce some invariance to "style" attributes of the data. However, … (voir plus)with downstream tasks generally unknown at training time, it is difficult to deduce a priori which attributes of the data are indeed "style" and can be safely discarded. To address this, we introduce a more principled approach that seeks to disentangle style features rather than discard them. The key idea is to add multiple style embedding spaces where: (i) each is invariant to all-but-one augmentation; and (ii) joint entropy is maximized. We formalize our structured data-augmentation procedure from a causal latent-variable-model perspective, and prove identifiability of both content and (multiple blocks of) style variables. We empirically demonstrate the benefits our approach on synthetic datasets and then present promising but limited results on ImageNet.

2023-10-27

NeurIPS.cc/2023/Workshop/CRL (poster)

PUG: Photorealistic and Semantically Controllable Synthetic Data for Representation Learning

Shashank Shekhar

Mark Ibrahim

Diane Bouchacourt

Ari S. Morcos

Synthetic image datasets offer unmatched advantages for designing and evaluating deep neural networks: they make it possible to (i) render a… (voir plus)s many data samples as needed, (ii) precisely control each scene and yield granular ground truth labels (and captions), (iii) precisely control distribution shifts between training and testing to isolate variables of interest for sound experimentation. Despite such promise, the use of synthetic image data is still limited -- and often played down -- mainly due to their lack of realism. Most works therefore rely on datasets of real images, which have often been scraped from public images on the internet, and may have issues with regards to privacy, bias, and copyright, while offering little control over how objects precisely appear. In this work, we present a path to democratize the use of photorealistic synthetic data: we develop a new generation of interactive environments for representation learning research, that offer both controllability and realism. We use the Unreal Engine, a powerful game engine well known in the entertainment industry, to produce PUG (Photorealistic Unreal Graphics) environments and datasets for representation learning. In this paper, we demonstrate the potential of PUG to enable more rigorous evaluations of vision models.

Do SSL Models Have Déjà Vu? A Case of Unintended Memorization in Self-supervised Learning

Casey Meehan

Kamalika Chaudhuri

Chuan Guo

Self-supervised learning (SSL) algorithms can produce useful image representations by learning to associate different parts of natural image… (voir plus)s with one another. However, when taken to the extreme, SSL models can unintendedly memorize specific parts in individual training samples rather than learning semantically meaningful associations. In this work, we perform a systematic study of the unintended memorization of image-specific information in SSL models -- which we refer to as d\'ej\`a vu memorization. Concretely, we show that given the trained model and a crop of a training image containing only the background (e.g., water, sky, grass), it is possible to infer the foreground object with high accuracy or even visually reconstruct it. Furthermore, we show that d\'ej\`a vu memorization is common to different SSL algorithms, is exacerbated by certain design choices, and cannot be detected by conventional techniques for evaluating representation quality. Our study of d\'ej\`a vu memorization reveals previously unknown privacy risks in SSL models, as well as suggests potential practical mitigation strategies. Code is available at https://github.com/facebookresearch/DejaVu.

On the Identifiability of Quantized Factors

Vitoria Barin Pacela

Kartik Ahuja

Disentanglement aims to recover meaningful latent ground-truth factors from the observed distribution solely, and is formalized through the … (voir plus)theory of identifiability. The identifiability of independent latent factors is proven to be impossible in the unsupervised i.i.d. setting under a general nonlinear map from factors to observations. In this work, however, we demonstrate that it is possible to recover quantized latent factors under a generic nonlinear diffeomorphism. We only assume that the latent factors have independent discontinuities in their density, without requiring the factors to be statistically independent. We introduce this novel form of identifiability, termed quantized factor identifiability, and provide a comprehensive proof of the recovery of the quantized factors.

2023-06-28

ArXiv (prépublication)

arxiv.org

Identifiability of Discretized Latent Coordinate Systems via Density Landmarks Detection

Vitória Barin-Pacela

Kartik Ahuja

2023-06-19

ICML.cc/2023/Workshop/SPIGM (poster)

Identifiability of Discretized Latent Coordinate Systems via Density Landmarks Detection

Vitória Barin-Pacela

Kartik Ahuja

Disentanglement aims to recover meaningful latent ground-truth factors from only the observed distribution. Identifiability provides the the… (voir plus)oretical grounding for disentanglement to be well-founded. Unfortunately, unsupervised identifiability of independent latent factors is a theoretically proven impossibility in the i.i.d. setting under a general nonlinear smooth map from factors to observations. In this work, we show that, remarkably, it is possible to recover discretized latent coordinates under a highly generic nonlinear smooth mapping (a diffeomorphism) without any additional inductive bias on the mapping. This is, assuming that latent density has axis-aligned discontinuity landmarks, but without making the unrealistic assumption of statistical independence of the factors. We introduce this novel form of identifiability, termed quantized coordinate identifiability , and provide a comprehensive proof of the recovery of discretized coordinates.

2023-06-19

ICML.cc/2023/Workshop/SPIGM (poster)

doi.org

Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture

Mahmoud Assran

Quentin Duval

Ishan Misra

Piotr Bojanowski

Michael Rabbat

Yann LeCun

Nicolas Ballas

This paper demonstrates an approach for learning highly semantic image representations without relying on hand-crafted data-augmentations. W… (voir plus)e introduce the Image-based Joint-Embedding Predictive Architecture (I-JEPA), a non-generative approach for self-supervised learning from images. The idea behind I-JEPA is simple: from a single context block, predict the representations of various target blocks in the same image. A core design choice to guide I-JEPA towards producing semantic representations is the masking strategy; specifically, it is crucial to (a) sample target blocks with sufficiently large scale (semantic), and to (b) use a sufficiently informative (spatially distributed) context block. Empirically, when combined with Vision Transformers, we find I-JEPA to be highly scalable. For instance, we train a ViT-Huge/14 on ImageNet using 16 A100 GPUs in under 72 hours to achieve strong downstream performance across a wide range of tasks, from linear classification to object counting and depth prediction.

2023-06-17

2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (publié)

doi.org

arxiv.org

A surprisingly simple technique to control the pretraining bias for better transfer: Expand or Narrow your representation

Samuel Lavoie

Randall Balestriero

Nicolas Ballas