
Michael Rabbat

Associate Industry Member
Associate Professor, McGill University, Department of Electrical and Computer Engineering
Research Scientist, Facebook AI Research

Biography

Mike Rabbat is an associate industry member of Mila – Quebec Artificial Intelligence Institute and director of research science in the Fundamental AI Research (FAIR) team at Meta.

Rabbat’s research interests include efficient and robust representation learning, particularly self-supervised learning, as well as optimization methods for efficient model training.

Publications

Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping
Lucas Lehnert
Sainbayar Sukhbaatar
Paul McVay
Yuandong Tian
While Transformers have enabled tremendous progress in various application settings, such architectures still lag behind traditional symbolic planners for solving complex decision-making tasks. In this work, we demonstrate how to train Transformers to solve complex planning tasks and present Searchformer, a Transformer model that optimally solves previously unseen Sokoban puzzles 93.7% of the time, while using up to 26.8% fewer search steps than standard A* search.
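A toy sketch of the search-dynamics-bootstrapping idea described in this abstract: run A* on a small grid task, serialize each node creation and expansion as tokens, and pair that trace with the final plan as a sequence-to-sequence training target. The grid task, token names, and trace format below are illustrative stand-ins, not the paper's exact tokenization.

```python
# Illustrative sketch: log A* search dynamics on a toy grid as a token sequence
# that a seq2seq Transformer could be trained to predict (token scheme is made up).
import heapq

def astar_trace(grid, start, goal):
    """Run A* on a 4-connected grid and return (plan, token_trace)."""
    def h(p):  # Manhattan-distance heuristic
        return abs(p[0] - goal[0]) + abs(p[1] - goal[1])

    frontier = [(h(start), 0, start, [start])]
    seen = set()
    trace = []  # serialized search dynamics
    while frontier:
        f, g, node, path = heapq.heappop(frontier)
        if node in seen:
            continue
        seen.add(node)
        trace += ["close", f"x{node[0]}", f"y{node[1]}", f"c{g}"]
        if node == goal:
            plan = [f"plan x{x} y{y}" for x, y in path]
            return plan, trace
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (node[0] + dx, node[1] + dy)
            if (0 <= nxt[0] < len(grid) and 0 <= nxt[1] < len(grid[0])
                    and grid[nxt[0]][nxt[1]] == 0 and nxt not in seen):
                trace += ["create", f"x{nxt[0]}", f"y{nxt[1]}", f"c{g + 1}"]
                heapq.heappush(frontier, (g + 1 + h(nxt), g + 1, nxt, path + [nxt]))
    return None, trace

grid = [[0, 0, 0], [1, 1, 0], [0, 0, 0]]  # 1 = wall
plan, trace = astar_trace(grid, (0, 0), (2, 0))
print(len(trace), "trace tokens;", len(plan), "plan steps")
```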
Revisiting Feature Prediction for Learning Visual Representations from Video
Adrien Bardes
Quentin Garrido
Jean Ponce
Xinlei Chen
Yann LeCun
Mahmoud Assran
Nicolas Ballas
DINOv2: Learning Robust Visual Features without Supervision
Maxime Oquab
Timothée Darcet
Théo Moutakanni
Huy V. Vo
Marc Szafraniec
Vasil Khalidov
Pierre Fernandez
Daniel Haziza
Francisco Massa
Alaaeldin El-Nouby
Mahmoud Assran
Nicolas Ballas
Wojciech Galuba
Russell Howes
Po-Yao Huang
Shang-Wen Li
Ishan Misra
Vasu Sharma
Gabriel Synnaeve
Hu Xu
Huijiao Xu
Herve Jegou
Julien Mairal
Patrick Labatut
Armand Joulin
Piotr Bojanowski
The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning. This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources. We revisit existing approaches and combine different techniques to scale our pretraining in terms of data and model size. Most of the technical contributions aim at accelerating and stabilizing the training at scale. In terms of data, we propose an automatic pipeline to build a dedicated, diverse, and curated image dataset instead of uncurated data, as typically done in the self-supervised literature. In terms of models, we train a ViT model with 1B parameters and distill it into a series of smaller models that surpass the best available all-purpose features, OpenCLIP, on most of the benchmarks at image and pixel levels.
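For reference, the distilled DINOv2 backbones are released through torch.hub and can be used as frozen, all-purpose feature extractors. A minimal sketch, assuming network access to the public facebookresearch/dinov2 hub entry:

```python
# Minimal sketch: extract global image features with a released DINOv2 backbone.
import torch

model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")  # distilled ViT-S/14
model.eval()

# ViT-S/14 expects image side lengths that are multiples of the 14-pixel patch size.
x = torch.randn(1, 3, 224, 224)  # stand-in for a normalized RGB image batch
with torch.no_grad():
    feats = model(x)             # one global (CLS) feature vector per image
print(feats.shape)               # torch.Size([1, 384]) for ViT-S/14
```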
A Distributed Data-Parallel PyTorch Implementation of the Distributed Shampoo Optimizer for Training Neural Networks At-Scale
Hao-Jun Michael Shi
Tsung-Hsien Lee
Shintaro Iwasaki
Jose Gallego-Posada
Zhijing Li
Kaushik Rangadurai
Dheevatsa Mudigere
Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture
Mahmoud Assran
Quentin Duval
Ishan Misra
Piotr Bojanowski
Yann LeCun
Nicolas Ballas
This paper demonstrates an approach for learning highly semantic image representations without relying on hand-crafted data-augmentations. We introduce the Image-based Joint-Embedding Predictive Architecture (I-JEPA), a non-generative approach for self-supervised learning from images. The idea behind I-JEPA is simple: from a single context block, predict the representations of various target blocks in the same image. A core design choice to guide I-JEPA towards producing semantic representations is the masking strategy; specifically, it is crucial to (a) sample target blocks with sufficiently large scale (semantic), and to (b) use a sufficiently informative (spatially distributed) context block. Empirically, when combined with Vision Transformers, we find I-JEPA to be highly scalable. For instance, we train a ViT-Huge/14 on ImageNet using 16 A100 GPUs in under 72 hours to achieve strong downstream performance across a wide range of tasks, from linear classification to object counting and depth prediction.
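A toy sketch of the training signal described here: an EMA "target" encoder produces representations of sampled target blocks, and a predictor must regress them from the encoded context block, with no pixel-level reconstruction. The stand-in MLP encoders, block indices, and EMA rate below are illustrative; the paper uses Vision Transformers and a predictor conditioned on positional mask tokens.

```python
# Toy sketch of the I-JEPA objective: regress EMA-target-encoder representations
# of target blocks from a context block, in representation space.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

dim, n_patches, patch_dim = 64, 16, 48
context_encoder = nn.Sequential(nn.Linear(patch_dim, dim), nn.GELU(), nn.Linear(dim, dim))
predictor = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
target_encoder = copy.deepcopy(context_encoder)           # updated by EMA, not by gradients
for p in target_encoder.parameters():
    p.requires_grad_(False)

patches = torch.randn(2, n_patches, patch_dim)            # (batch, patches, patch_dim)
target_idx = torch.tensor([4, 5, 8, 9])                   # a sampled target block
context_idx = torch.tensor([0, 1, 2, 3, 12, 13, 14, 15])  # context block excludes targets

ctx = context_encoder(patches[:, context_idx]).mean(1)    # pooled context representation
pred = predictor(ctx).unsqueeze(1).expand(-1, len(target_idx), -1)
with torch.no_grad():
    tgt = target_encoder(patches[:, target_idx])          # target-block representations

loss = F.mse_loss(pred, tgt)                              # regression in latent space
loss.backward()

# EMA update of the target encoder after each optimizer step
with torch.no_grad():
    for tp, cp in zip(target_encoder.parameters(), context_encoder.parameters()):
        tp.lerp_(cp, 0.004)
```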
Benchmarking Neural Network Training Algorithms
George Edward Dahl
Frank Schneider
Zachary Nado
Naman Agarwal
Chandramouli Shama Sastry
Philipp Hennig
Sourabh Medapati
Runa Eschenhagen
Priya Kasimbeg
Daniel Suo
Juhan Bae
Justin M. Gilmer
A. L. Peirson
Bilal Muhammad Khan
Rohan Anil
Shankar Krishnan
Daniel Snider
Ehsan Amid
Kongtao Chen
Chris J. Maddison
R. Vasudev
Michal Badura
Ankush Garg
Peter Mattson
Green Federated Learning
Ashkan Yousefpour
Sheng Guo
Ashish V. Shenoy
Sayan Ghosh
Pierre Stock
Kiwan Maeng
Schalk-Willem Kruger
Carole-Jean Wu
Ilya Mironov
The rapid progress of AI is fueled by increasingly large and computationally intensive machine learning models and datasets. As a consequence, the amount of compute used in training state-of-the-art models is exponentially increasing (doubling every 10 months between 2015 and 2022), resulting in a large carbon footprint. Federated Learning (FL) - a collaborative machine learning technique for training a centralized model using data of decentralized entities - can also be resource-intensive and have a significant carbon footprint, particularly when deployed at scale. Unlike centralized AI that can reliably tap into renewables at strategically placed data centers, cross-device FL may leverage as many as hundreds of millions of globally distributed end-user devices with diverse energy sources. Green AI is a novel and important research area where carbon footprint is regarded as an evaluation criterion for AI, alongside accuracy, convergence speed, and other metrics. In this paper, we propose the concept of Green FL, which involves optimizing FL parameters and making design choices to minimize carbon emissions consistent with competitive performance and training time. The contributions of this work are two-fold. First, we adopt a data-driven approach to quantify the carbon emissions of FL by directly measuring real-world at-scale FL tasks running on millions of phones. Second, we present challenges, guidelines, and lessons learned from studying the trade-off between energy efficiency, performance, and time-to-train in a production FL system. Our findings offer valuable insights into how FL can reduce its carbon footprint, and they provide a foundation for future research in the area of Green AI.
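A back-of-the-envelope sketch of the kind of accounting this abstract motivates: total emissions come from per-device energy and server-side energy per round, scaled by the grid's carbon intensity. All constants below are made-up placeholders, not measurements from the paper's production system.

```python
# Illustrative carbon accounting for a cross-device FL job (all numbers are placeholders).
ROUNDS = 1_000
DEVICES_PER_ROUND = 200
DEVICE_WH_PER_ROUND = 0.5    # on-device compute + radio energy per participant (Wh)
SERVER_WH_PER_ROUND = 50.0   # aggregation-server energy per round (Wh)
CARBON_G_PER_KWH = 400.0     # grid carbon intensity (gCO2e/kWh), varies by region

device_kwh = ROUNDS * DEVICES_PER_ROUND * DEVICE_WH_PER_ROUND / 1_000
server_kwh = ROUNDS * SERVER_WH_PER_ROUND / 1_000
total_kg_co2e = (device_kwh + server_kwh) * CARBON_G_PER_KWH / 1_000
print(f"~{total_kg_co2e:.1f} kg CO2e for {ROUNDS} rounds")
```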
The Hidden Uniform Cluster Prior in Self-Supervised Learning
Mahmoud Assran
Randall Balestriero
Quentin Duval
Florian Bordes
Ishan Misra
Piotr Bojanowski
Nicolas Ballas
A successful paradigm in representation learning is to perform self-supervised pretraining using tasks based on mini-batch statistics (e.g., SimCLR, VICReg, SwAV, MSN). We show that the formulation of all these methods contains an overlooked prior to learn features that enable uniform clustering of the data. While this prior has led to remarkably semantic representations when pretraining on class-balanced data, such as ImageNet, we demonstrate that it can hamper performance when pretraining on class-imbalanced data. By moving away from conventional uniformity priors and instead preferring power-law distributed feature clusters, we show that one can improve the quality of the learned representations on real-world class-imbalanced datasets. To demonstrate this, we develop an extension of the Masked Siamese Networks (MSN) method to support the use of arbitrary feature priors.
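A rough sketch of swapping the implicit uniformity prior for a power-law prior, in the spirit of the MSN extension mentioned above: regularize the batch-mean prototype assignment toward a power-law target distribution instead of pushing it toward uniform. The prototype count, batch size, and exponent tau below are illustrative, not the paper's settings.

```python
# Sketch: penalize divergence of the mean cluster assignment from a power-law prior
# instead of encouraging a uniform assignment.
import torch
import torch.nn.functional as F

num_prototypes, tau = 1024, 0.25

# mean assignment probability over the batch (rows of `probs` are per-sample softmaxes)
probs = F.softmax(torch.randn(256, num_prototypes), dim=-1)
mean_probs = probs.mean(dim=0)

# power-law target: p(k) proportional to k^(-tau)
ranks = torch.arange(1, num_prototypes + 1, dtype=torch.float)
power_law = ranks.pow(-tau)
power_law = power_law / power_law.sum()

# KL(mean assignment || power-law prior) replaces the uniformity-encouraging term
reg = torch.sum(mean_probs * (mean_probs.log() - power_law.log()))
print(float(reg))
```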
Where to Begin? On the Impact of Pre-Training and Initialization in Federated Learning
John Nguyen
Jianyu Wang
Kshitiz Malik
Maziar Sanjabi
lo-fi: distributed fine-tuning without communication
Mitchell Wortsman
Suchin Gururangan
Shen Li
Ali Farhadi
Ludwig Schmidt
Ari S. Morcos
When fine-tuning large neural networks, it is common to use multiple nodes and to communicate gradients at each optimization step. By contrast, we investigate completely local fine-tuning, which we refer to as lo-fi. During lo-fi, each node fine-tunes independently without any communication. Then, the weights are averaged across nodes at the conclusion of fine-tuning. When fine-tuning DeiT-base and DeiT-large on ImageNet, this procedure matches accuracy in-distribution and improves accuracy under distribution shift compared to the baseline, which observes the same amount of data but communicates gradients at each step. We also observe that lo-fi matches the baseline's performance when fine-tuning OPT language models (up to 1.3B parameters) on Common Crawl. By removing the communication requirement, lo-fi reduces resource barriers for fine-tuning large models and enables fine-tuning in settings with prohibitive communication cost.
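A minimal sketch of the lo-fi recipe as described: each node fine-tunes an identical copy of the model with no gradient communication, and the weights are averaged once at the end. The local "fine-tuning" step below is a random-perturbation stand-in for brevity.

```python
# Sketch: independent local fine-tuning followed by one-shot weight averaging.
import copy
import torch
import torch.nn as nn

base = nn.Linear(16, 4)
node_models = [copy.deepcopy(base) for _ in range(4)]

# (each replica would be fine-tuned here on its own data, with no communication)
for m in node_models:
    for p in m.parameters():
        p.data.add_(0.01 * torch.randn_like(p))  # stand-in for local fine-tuning updates

# one-shot weight averaging at the conclusion of fine-tuning
avg_state = {
    k: torch.stack([m.state_dict()[k] for m in node_models]).mean(dim=0)
    for k in base.state_dict()
}
merged = copy.deepcopy(base)
merged.load_state_dict(avg_state)
```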