
Michael Rabbat

Associate Industry Member
Adjunct Professor, McGill University, Department of Electrical and Computer Engineering
Research Scientist, Facebook AI Research
Research Topics
Representation Learning
Optimization
Distributed Systems

Biography

Mike Rabbat is an affiliate member of Mila – Quebec Artificial Intelligence Institute and a research scientist director on the FAIR (Fundamental AI Research) team at Meta. His research focuses on efficient and robust representation learning, in particular self-supervised learning. He is also interested in optimization for efficient model training.

Publications

EvalGIM: A Library for Evaluating Generative Image Models
Melissa Hall
Reyhane Askari
Mark Ibrahim
Candace Ross
Pietro Astolfi
Tariq Berrada
Marton Havasi
Yohann Benchetrit
Karen Ullrich
Carolina Braga
Abhishek Charnalia
Maeve Ryan
Jakob Verbeek
As the use of text-to-image generative models increases, so does the adoption of automatic benchmarking methods used in their evaluation. However, while metrics and datasets abound, there are few unified benchmarking libraries that provide a framework for performing evaluations across many datasets and metrics. Furthermore, the rapid introduction of increasingly robust benchmarking methods requires that evaluation libraries remain flexible to new datasets and metrics. Finally, there remains a gap in synthesizing evaluations in order to deliver actionable takeaways about model performance. To enable unified, flexible, and actionable evaluations, we introduce EvalGIM (pronounced "EvalGym"), a library for evaluating generative image models. EvalGIM contains broad support for datasets and metrics used to measure quality, diversity, and consistency of text-to-image generative models. In addition, EvalGIM is designed with flexibility for user customization as a top priority and contains a structure that allows plug-and-play additions of new datasets and metrics. To enable actionable evaluation insights, we introduce "Evaluation Exercises" that highlight takeaways for specific evaluation questions. The Evaluation Exercises contain easy-to-use and reproducible implementations of two state-of-the-art evaluation methods of text-to-image generative models: consistency-diversity-realism Pareto Fronts and disaggregated measurements of performance disparities across groups. EvalGIM also contains Evaluation Exercises that introduce two new analysis methods for text-to-image generative models: robustness analyses of model rankings and balanced evaluations across different prompt styles. We encourage text-to-image model exploration with EvalGIM and invite contributions at https://github.com/facebookresearch/EvalGIM/.
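As a rough illustration of the plug-and-play structure this abstract describes, here is a minimal Python sketch of a metric/dataset registry. The names below are hypothetical and are not EvalGIM's actual API; see the repository linked above for the real interface.

    # Hypothetical sketch of a plug-and-play evaluation registry.
    # This is NOT EvalGIM's real API.
    from typing import Callable, Dict, List

    METRICS: Dict[str, Callable[[List[str], List[str]], float]] = {}
    DATASETS: Dict[str, List[str]] = {}

    def register_metric(name: str):
        """Decorator that adds a metric function to the registry."""
        def wrap(fn):
            METRICS[name] = fn
            return fn
        return wrap

    @register_metric("nonempty_output_rate")
    def nonempty_output_rate(prompts, images):
        # Toy placeholder metric: fraction of prompts that produced an output.
        return sum(bool(img) for img in images) / max(len(prompts), 1)

    def evaluate(generate: Callable[[str], str]):
        """Run every registered metric on every registered dataset."""
        results = {}
        for ds_name, prompts in DATASETS.items():
            images = [generate(p) for p in prompts]
            results[ds_name] = {m: fn(prompts, images) for m, fn in METRICS.items()}
        return results

    if __name__ == "__main__":
        DATASETS["toy_prompts"] = ["a red cube", "a cat on a skateboard"]
        print(evaluate(lambda prompt: f"<image for: {prompt}>"))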
Dualformer: Controllable Fast and Slow Thinking by Learning with Randomized Reasoning Traces
DiJia Su
Sainbayar Sukhbaatar
Yuandong Tian
Qinqing Zheng
In human cognition theory, human thinking is governed by two systems: the fast and intuitive System 1 and the slower but more deliberative System 2. Recent studies have shown that incorporating a System 2 process into Transformers, including large language models (LLMs), significantly enhances their reasoning capabilities. Nevertheless, models that purely resemble System 2 thinking require substantially higher computational costs and are much slower to respond. To address this challenge, we present Dualformer, a single Transformer model that seamlessly integrates both the fast and slow reasoning modes. Dualformer is obtained by training on data with randomized reasoning traces, where different parts of the traces are dropped during training. The dropping strategies are specifically tailored according to the trace structure, analogous to analyzing our thinking process and creating shortcuts with patterns. At inference time, our model can be configured to output only the solutions (fast mode), both the reasoning chain and the final solution (slow mode), or to automatically decide which mode to engage (auto mode). In all cases, Dualformer outperforms the corresponding baseline models in both performance and computational efficiency: (1) in slow mode, Dualformer optimally solves unseen 30 x 30 maze navigation tasks 97.6% of the time, surpassing the baseline performance of 93.3% from Searchformer (trained on data with complete reasoning traces), while using 45.5% fewer reasoning steps; (2) in fast mode, Dualformer completes those tasks with an 80% optimal rate, significantly outperforming the Solution-Only model (trained on solution-only data), which has an optimal rate of only 30%. For math problems, our techniques have also achieved improved performance with LLM fine-tuning, demonstrating their generalization beyond task-specific models.
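A minimal sketch of the trace-randomization idea in Python, under simplified assumptions: the function and token format below are illustrative only, and the paper's actual dropping strategies are tailored to the trace structure rather than uniform.

    import random

    def randomize_trace(trace_tokens, solution_tokens, p_drop_all=0.3, p_drop_step=0.3):
        """Randomly drop reasoning-trace tokens so training data mixes
        full traces (slow mode), partial traces, and solution-only targets
        (fast mode). Simplified sketch, not the paper's exact scheme."""
        if random.random() < p_drop_all:
            kept = []  # solution-only example -> teaches the fast mode
        else:
            kept = [t for t in trace_tokens if random.random() > p_drop_step]
        return kept + solution_tokens

    # Example: a toy trace of search steps followed by the final plan.
    trace = ["expand A", "expand B", "expand C"]
    solution = ["plan: A->C"]
    print(randomize_trace(trace, solution))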
The Factorization Curse: Which Tokens You Predict Underlie the Reversal Curse and More
Ouail Kitouni
Niklas Nolte
Adina Williams
Diane Bouchacourt
Mark Ibrahim
Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping
Sainbayar Sukhbaatar
DiJia Su
Paul McVay
Qinqing Zheng
Yuandong Tian
While Transformers have enabled tremendous progress in various application settings, such architectures still lag behind traditional symbolic planners for solving complex decision making tasks. In this work, we demonstrate how to train Transformers to solve complex planning tasks. This is accomplished by training an encoder-decoder Transformer model to predict the search dynamics of the…
DP-RDM: Adapting Diffusion Models to Private Domains Without Fine-Tuning
Maziar Sanjabi
Pietro Astolfi
Kamalika Chaudhuri
Chuan Guo
Text-to-image diffusion models have been shown to suffer from sample-level memorization, possibly reproducing near-perfect replicas of images that they are trained on, which may be undesirable. To remedy this issue, we develop the first differentially private (DP) retrieval-augmented generation algorithm that is capable of generating high-quality image samples while providing provable privacy guarantees. Specifically, we assume access to a text-to-image diffusion model trained on a small amount of public data, and design a DP retrieval mechanism to augment the text prompt with samples retrieved from a private retrieval dataset. Our differentially private retrieval-augmented diffusion model (DP-RDM) requires no fine-tuning on the retrieval dataset to adapt to another domain, and can use state-of-the-art generative models to generate high-quality image samples while satisfying rigorous DP guarantees. For instance, when evaluated on MS-COCO, our DP-RDM can generate samples with a privacy budget of…
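A toy Python sketch of a differentially private retrieval step of the kind this abstract describes: average the nearest private embeddings with clipped contributions and added Gaussian noise. This is loosely inspired by the DP-RDM idea and is not the paper's mechanism or its calibrated privacy accounting; all names and parameters are illustrative.

    import numpy as np

    def dp_retrieve(query_emb, private_embs, k=8, clip_norm=1.0, sigma=0.5):
        """Retrieve the k nearest private embeddings, clip each one's norm so a
        single private sample has bounded influence, then release a noisy mean."""
        dists = np.linalg.norm(private_embs - query_emb, axis=1)
        nearest = private_embs[np.argsort(dists)[:k]]
        norms = np.linalg.norm(nearest, axis=1, keepdims=True)
        clipped = nearest * np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
        noise = np.random.normal(0.0, sigma * clip_norm / k, query_emb.shape)
        # The noisy aggregate would augment the text prompt's conditioning.
        return clipped.mean(axis=0) + noise

    rng = np.random.default_rng(0)
    print(dp_retrieve(rng.normal(size=4), rng.normal(size=(100, 4))))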
Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping
Sainbayar Sukhbaatar
Paul McVay
Yuandong Tian
While Transformers have enabled tremendous progress in various application settings, such architectures still lag behind traditional symbolic planners for solving complex decision making tasks. In this work, we demonstrate how to train Transformers to solve complex planning tasks and present Searchformer, a Transformer model that optimally solves previously unseen Sokoban puzzles 93.7% of the time, while using up to 26.8% fewer search steps than standard…
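To illustrate the "search dynamics" idea shared by the Beyond A* entries above, here is a small Python sketch that runs A* on a toy problem and records its expansions as a token sequence alongside the solution, the kind of trace an encoder-decoder model could be trained to predict. The tokenization and problem are illustrative assumptions, not the paper's actual setup.

    import heapq

    def astar_trace(start, goal, neighbors, h):
        """Run A* and record its search dynamics (node expansions) as tokens."""
        frontier = [(h(start), 0, start, [start])]
        seen, trace = set(), []
        while frontier:
            f, g, node, path = heapq.heappop(frontier)
            if node in seen:
                continue
            seen.add(node)
            trace.append(f"expand {node} g={g} h={h(node)}")
            if node == goal:
                return trace, [f"plan {'->'.join(map(str, path))}"]
            for nxt, cost in neighbors(node):
                if nxt not in seen:
                    heapq.heappush(frontier, (g + cost + h(nxt), g + cost, nxt, path + [nxt]))
        return trace, ["plan none"]

    # Toy 1-D world: move left/right from position 0 to position 3.
    nbrs = lambda x: [(x - 1, 1), (x + 1, 1)]
    trace, plan = astar_trace(0, 3, nbrs, h=lambda x: abs(3 - x))
    print(trace + plan)  # search-dynamics tokens followed by the solution tokens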
Revisiting Feature Prediction for Learning Visual Representations from Video
Adrien Bardes
Quentin Garrido
Jean Ponce
Xinlei Chen
Mahmoud Assran
This paper explores feature prediction as a stand-alone objective for unsupervised learning from video and introduces V-JEPA, a collection of vision models trained solely using a feature prediction objective, without the use of pretrained image encoders, text, negative examples, reconstruction, or other sources of supervision. The models are trained on 2 million videos collected from public datasets and are evaluated on downstream image and video tasks. Our results show that learning by predicting video features leads to versatile visual representations that perform well on both motion- and appearance-based tasks, without adaptation of the model's parameters, e.g., using a frozen backbone. Our largest model, a ViT-H/16 trained only on videos, obtains 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet1K.
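A schematic PyTorch sketch of a feature-prediction objective in the spirit of what this abstract describes: predict the (stop-gradient) features of masked video patches from the visible ones. The modules, shapes, and masking below are placeholder assumptions, not the actual V-JEPA architecture or training recipe.

    import torch
    import torch.nn as nn

    B, N, D = 2, 16, 32                      # batch, patch tokens, feature dim
    context_encoder = nn.Linear(D, D)        # stands in for a ViT on visible patches
    target_encoder = nn.Linear(D, D)         # an EMA copy in practice; frozen here
    predictor = nn.Linear(D, D)              # predicts features at masked positions

    patches = torch.randn(B, N, D)
    mask = torch.rand(B, N) < 0.5            # which tokens are hidden from the context

    with torch.no_grad():
        targets = target_encoder(patches)    # features to be predicted (no gradient)

    context = context_encoder(patches * (~mask).unsqueeze(-1))  # zero out masked tokens
    pred = predictor(context)
    loss = ((pred - targets) ** 2)[mask].mean()  # regress features only where masked
    loss.backward()
    print(float(loss))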