Portrait of Pascal Vincent

Pascal Vincent

Core Industry Member
Associate Professor, Université de Montréal, Department of Computer Science and Operations Research
Research Scientist, Facebook AI Research (FAIR) Montréal

Biography

Pascal Vincent is a research scientist in the Fundamental AI Research (FAIR) team at Meta and an adjunct professor in the Department of Computer Science and Operations Research (DIRO) at Université de Montréal.

He is also a founding member of Mila – Quebec Artificial Intelligence Institute and an associate fellow in CIFAR’s Learning in Machines & Brains program.

Vincent’s research on principles and algorithms in representation learning led him to uncover several seminal ideas that became key enablers for the successes of deep learning methods. Among his most influential contributions is the seminal paper on neural language models “A Neural Probabilistic Language Model” (Bengio et al. 2013), which laid the foundations on which all artificial neural network based language models are built.

His work on denoising autoencoders (Vincent et al. 2008, 2010) was the first to propose the pretext task of filling in artificially introduced blanks for the sake of learning useful representations in any modality, a precursor of what is today called self-supervised learning.

In another seminal paper, “A Connection Between Score Matching and Denoising Autoencoders” (Vincent 2011), he developed the “denoising score matching” principle, which is now routinely used to train diffusion-based generative models.

Vincent’s current research focuses on novel theory and algorithms for representation learning to enable robust generalization out-of-distribution.

Current Students

Publications

Motif: Intrinsic Motivation from Artificial Intelligence Feedback
Martin Klissarov
Pierluca D'Oro
Shagun Sodhani
Roberta Raileanu
Amy Zhang
Mikael Henaff
Exploring rich environments and evaluating one's actions without prior knowledge is immensely challenging. In this paper, we propose Motif, … (see more)a general method to interface such prior knowledge from a Large Language Model (LLM) with an agent. Motif is based on the idea of grounding LLMs for decision-making without requiring them to interact with the environment: it elicits preferences from an LLM over pairs of captions to construct an intrinsic reward, which is then used to train agents with reinforcement learning. We evaluate Motif's performance and behavior on the challenging, open-ended and procedurally-generated NetHack game. Surprisingly, by only learning to maximize its intrinsic reward, Motif achieves a higher game score than an algorithm directly trained to maximize the score itself. When combining Motif's intrinsic reward with the environment reward, our method significantly outperforms existing approaches and makes progress on tasks where no advancements have ever been made without demonstrations. Finally, we show that Motif mostly generates intuitive human-aligned behaviors which can be steered easily through prompt modifications, while scaling well with the LLM size and the amount of information given in the prompt.
Discovering environments with XRM
Mohammad Pezeshki
Diane Bouchacourt
Mark Ibrahim
Nicolas Ballas
David Lopez-Paz
Successful out-of-distribution generalization requires environment annotations. Unfortunately, these are resource-intensive to obtain, and t… (see more)heir relevance to model performance is limited by the expectations and perceptual biases of human annotators. Therefore, to enable robust AI systems across applications, we must develop algorithms to automatically discover environments inducing broad generalization. Current proposals, which divide examples based on their training error, suffer from one fundamental problem. These methods add hyper-parameters and early-stopping criteria that are impossible to tune without a validation set with human-annotated environments, the very information subject to discovery. In this paper, we propose Cross-Risk-Minimization (XRM) to address this issue. XRM trains two twin networks, each learning from one random half of the training data, while imitating confident held-out mistakes made by its sibling. XRM provides a recipe for hyper-parameter tuning, does not require early-stopping, and can discover environments for all training and validation data. Domain generalization algorithms built on top of XRM environments achieve oracle worst-group-accuracy, solving a long-standing problem in out-of-distribution generalization.
Self-Supervised Disentanglement by Leveraging Structure in Data Augmentations
Cian Eastwood
Julius von Kügelgen
Linus Ericsson
Diane Bouchacourt
Mark Ibrahim
Bernhard Schölkopf
Self-supervised representation learning often uses data augmentations to induce some invariance to "style" attributes of the data. However, … (see more)with downstream tasks generally unknown at training time, it is difficult to deduce a priori which attributes of the data are indeed "style" and can be safely discarded. To address this, we introduce a more principled approach that seeks to disentangle style features rather than discard them. The key idea is to add multiple style embedding spaces where: (i) each is invariant to all-but-one augmentation; and (ii) joint entropy is maximized. We formalize our structured data-augmentation procedure from a causal latent-variable-model perspective, and prove identifiability of both content and (multiple blocks of) style variables. We empirically demonstrate the benefits our approach on synthetic datasets and then present promising but limited results on ImageNet.
PUG: Photorealistic and Semantically Controllable Synthetic Data for Representation Learning
Florian Bordes
Shashank Shekhar
Mark Ibrahim
Diane Bouchacourt
Ari S. Morcos
Synthetic image datasets offer unmatched advantages for designing and evaluating deep neural networks: they make it possible to (i) render a… (see more)s many data samples as needed, (ii) precisely control each scene and yield granular ground truth labels (and captions), (iii) precisely control distribution shifts between training and testing to isolate variables of interest for sound experimentation. Despite such promise, the use of synthetic image data is still limited -- and often played down -- mainly due to their lack of realism. Most works therefore rely on datasets of real images, which have often been scraped from public images on the internet, and may have issues with regards to privacy, bias, and copyright, while offering little control over how objects precisely appear. In this work, we present a path to democratize the use of photorealistic synthetic data: we develop a new generation of interactive environments for representation learning research, that offer both controllability and realism. We use the Unreal Engine, a powerful game engine well known in the entertainment industry, to produce PUG (Photorealistic Unreal Graphics) environments and datasets for representation learning. In this paper, we demonstrate the potential of PUG to enable more rigorous evaluations of vision models.
Do SSL Models Have Déjà Vu? A Case of Unintended Memorization in Self-supervised Learning
Casey Meehan
Florian Bordes
Kamalika Chaudhuri
Chuan Guo
Self-supervised learning (SSL) algorithms can produce useful image representations by learning to associate different parts of natural image… (see more)s with one another. However, when taken to the extreme, SSL models can unintendedly memorize specific parts in individual training samples rather than learning semantically meaningful associations. In this work, we perform a systematic study of the unintended memorization of image-specific information in SSL models -- which we refer to as d\'ej\`a vu memorization. Concretely, we show that given the trained model and a crop of a training image containing only the background (e.g., water, sky, grass), it is possible to infer the foreground object with high accuracy or even visually reconstruct it. Furthermore, we show that d\'ej\`a vu memorization is common to different SSL algorithms, is exacerbated by certain design choices, and cannot be detected by conventional techniques for evaluating representation quality. Our study of d\'ej\`a vu memorization reveals previously unknown privacy risks in SSL models, as well as suggests potential practical mitigation strategies. Code is available at https://github.com/facebookresearch/DejaVu.
Stochastic positional embeddings improve masked image modeling
Amir Bar
Florian Bordes
Assaf Shocher
Mahmoud Assran
Nicolas Ballas
Trevor Jackson Darrell
Amir Globerson
Yann LeCun
Masked Image Modeling (MIM) is a promising self-supervised learning approach that enables learning from unlabeled images. Despite its recent… (see more) success, learning good representations through MIM remains challenging because it requires predicting the right semantic content in accurate locations. For example, given an incomplete picture of a dog, we can guess that there is a tail, but we cannot determine its exact location. In this work, we propose to incorporate location uncertainty into MIM by using stochastic positional embeddings (StoP). Specifically, we condition the model on stochastic masked token positions drawn from a Gaussian distribution. StoP reduces overfitting to location features and guides the model toward learning features that are more robust to location uncertainties. Quantitatively, StoP improves downstream MIM performance on a variety of downstream tasks, including
On the Identifiability of Quantized Factors
Vitória Barin Pacela
Kartik Ahuja
Disentanglement aims to recover meaningful latent ground-truth factors from the observed distribution solely, and is formalized through the … (see more)theory of identifiability. The identifiability of independent latent factors is proven to be impossible in the unsupervised i.i.d. setting under a general nonlinear map from factors to observations. In this work, however, we demonstrate that it is possible to recover quantized latent factors under a generic nonlinear diffeomorphism. We only assume that the latent factors have independent discontinuities in their density, without requiring the factors to be statistically independent. We introduce this novel form of identifiability, termed quantized factor identifiability, and provide a comprehensive proof of the recovery of the quantized factors.
Identifiability of Discretized Latent Coordinate Systems via Density Landmarks Detection
Vitória Barin-Pacela
Kartik Ahuja
Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture
Mahmoud Assran
Quentin Duval
Ishan Misra
Piotr Bojanowski
Yann LeCun
Nicolas Ballas
This paper demonstrates an approach for learning highly semantic image representations without relying on hand-crafted data-augmentations. W… (see more)e introduce the Image-based Joint-Embedding Predictive Architecture (I-JEPA), a non-generative approach for self-supervised learning from images. The idea behind I-JEPA is simple: from a single context block, predict the representations of various target blocks in the same image. A core design choice to guide I-JEPA towards producing semantic representations is the masking strategy; specifically, it is crucial to (a) sample target blocks with sufficiently large scale (semantic), and to (b) use a sufficiently informative (spatially distributed) context block. Empirically, when combined with Vision Transformers, we find I-JEPA to be highly scalable. For instance, we train a ViT-Huge/14 on ImageNet using 16 A100 GPUs in under 72 hours to achieve strong downstream performance across a wide range of tasks, from linear classification to object counting and depth prediction.
Guillotine Regularization: Why removing layers is needed to improve generalization in Self-Supervised Learning
Florian Bordes
Randall Balestriero
Quentin Garrido
Adrien Bardes
One unexpected technique that emerged in recent years consists in training a Deep Network (DN) with a Self-Supervised Learning (SSL) method,… (see more) and using this network on downstream tasks but with its last few projector layers entirely removed. This trick of throwing away the projector is actually critical for SSL methods to display competitive performances on ImageNet for which more than 30 percentage points can be gained that way. This is a little vexing, as one would hope that the network layer at which invariance is explicitly enforced by the SSL criterion during training (the last projector layer) should be the one to use for best generalization performance downstream. But it seems not to be, and this study sheds some light on why. This trick, which we name Guillotine Regularization (GR), is in fact a generically applicable method that has been used to improve generalization performance in transfer learning scenarios. In this work, we identify the underlying reasons behind its success and show that the optimal layer to use might change significantly depending on the training setup, the data or the downstream task. Lastly, we give some insights on how to reduce the need for a projector in SSL by aligning the pretext SSL task and the downstream task.
A surprisingly simple technique to control the pretraining bias for better transfer: Expand or Narrow your representation
Florian Bordes
Samuel Lavoie
Randall Balestriero
Nicolas Ballas
Instance-Conditioned GAN Data Augmentation for Representation Learning
Pietro Astolfi
Arantxa Casanova
Jakob Verbeek
Michal Drozdzal