
Michael Rabbat

Associate Industry Member
Associate Professor, Department of Electrical and Computer Engineering, McGill University
Director, Research Science, Fundamental AI Research (FAIR), Meta
Research Topics
Distributed Systems
Optimization
Representation Learning

Biography

Mike Rabbat is an associate industry member of Mila – Quebec Artificial Intelligence Institute and director of research science in the Fundamental AI Research (FAIR) team at Meta.

Rabbat’s research interests include efficient and robust representation learning, in particular self-supervised learning. He is also interested in optimization for efficient model training.

Publications

Optimization and Analysis of Distributed Averaging With Short Node Memory
Boris Oreshkin
Distributed averaging describes a class of network algorithms for the decentralized computation of aggregate statistics. Initially, each node has a scalar data value, and the goal is to compute the average of these values at every node (the so-called average consensus problem). Nodes iteratively exchange information with their neighbors and perform local updates until the value at every node converges to the initial network average. Much previous work has focused on algorithms where each node maintains and updates a single value; every time an update is performed, the previous value is forgotten. Convergence to the average consensus is achieved asymptotically. The convergence rate is fundamentally limited by network connectivity, and it can be prohibitively slow on topologies such as grids and random geometric graphs, even if the update rules are optimized. In this paper, we provide the first theoretical demonstration that adding a local prediction component to the update rule can significantly improve the convergence rate of distributed averaging algorithms. We focus on the case where the local predictor is a linear combination of the node's current and previous values (i.e., two memory taps), and our update rule computes a combination of the predictor and the usual weighted linear combination of values received from neighboring nodes. We derive the optimal mixing parameter for combining the predictor with the neighbors' values, and conduct a theoretical analysis of the improvement in convergence rate that can be achieved using this acceleration methodology. For a chain topology on N nodes, this leads to a factor of N improvement over standard consensus, and for a two-dimensional grid, our approach achieves a factor of √N improvement.
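A minimal numerical sketch of the idea: each node keeps one extra memory tap (its previous value) and mixes it into the standard consensus update. The momentum-style update and parameter choice below are illustrative stand-ins, not the paper's exact predictor or its derived optimal mixing parameter, and the memory weight is tuned here from a global eigenvalue of the mixing matrix, whereas the paper works toward local rules.

```python
import numpy as np

def metropolis_weights(adj):
    """Doubly stochastic mixing matrix from an adjacency matrix
    via Metropolis-Hastings weights."""
    n = adj.shape[0]
    deg = adj.sum(axis=1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if adj[i, j]:
                W[i, j] = 1.0 / (1.0 + max(deg[i], deg[j]))
        W[i, i] = 1.0 - W[i].sum()
    return W

def consensus(W, x0, steps, beta=0.0):
    """Distributed averaging. beta=0 is standard consensus; beta>0 adds a
    two-tap memory term (each node reuses its previous value)."""
    x_prev, x = x0.copy(), x0.copy()
    avg = x0.mean()
    for _ in range(steps):
        x_next = (1.0 + beta) * (W @ x) - beta * x_prev
        x_prev, x = x, x_next
    return np.linalg.norm(x - avg)

# Chain topology on N nodes, where the paper reports a factor-N speedup.
N = 50
adj = np.diag(np.ones(N - 1), 1) + np.diag(np.ones(N - 1), -1)
W = metropolis_weights(adj)
x0 = np.random.default_rng(0).normal(size=N)

# Illustrative tuning of the memory weight from the second-largest
# eigenvalue magnitude of W (a global quantity in this sketch).
lam2 = np.sort(np.abs(np.linalg.eigvalsh(W)))[-2]
beta = ((1.0 - np.sqrt(1.0 - lam2**2)) / lam2) ** 2

print("plain consensus error:   ", consensus(W, x0, 500))
print("two-tap memory error:    ", consensus(W, x0, 500, beta=beta))
```

Because W is doubly stochastic, the memory term leaves the network average invariant at every step, so the accelerated iterates still converge to the correct consensus value, only much faster on poorly connected topologies like the chain.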
Distributed Average Consensus With Dithered Quantization
Tuncer Can Aysal
In this paper, we develop algorithms for distributed computation of averages of the node data over networks with bandwidth/power constraints or large volumes of data. Distributed averaging algorithms fail to achieve consensus when deterministic uniform quantization is adopted. We propose a distributed algorithm in which the nodes utilize probabilistically quantized information, i.e., dithered quantization, to communicate with each other. The algorithm we develop is a dynamical system that generates sequences achieving a consensus at one of the quantization values almost surely. In addition, we show that the expected value of the consensus is equal to the average of the original sensor data. We derive an upper bound on the mean-square-error performance of the probabilistically quantized distributed averaging (PQDA). Moreover, we show that the convergence of the PQDA is monotonic by studying the evolution of the minimum-length interval containing the node values. We reveal that the length of this interval is a monotonically nonincreasing function with limit zero. We also demonstrate that all the node values, in the worst case, converge to the final two quantization bins at the same rate as standard unquantized consensus. Finally, we report the results of simulations conducted to evaluate the behavior and the effectiveness of the proposed algorithm in various scenarios.
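A sketch of the two ingredients described above: an unbiased dithered quantizer (round to one of the two bracketing grid levels with probabilities making the expected output equal the input) and a consensus loop in which nodes mix the quantized values. The ring topology, uniform weights, grid spacing, and stopping test are illustrative assumptions; the paper's analysis covers general connected networks.

```python
import numpy as np

rng = np.random.default_rng(1)

def dithered_quantize(x, delta):
    """Probabilistic (dithered) quantization to a grid of spacing delta:
    round up or down with probabilities chosen so that E[Q(x)] = x."""
    lo = np.floor(x / delta) * delta
    p_up = (x - lo) / delta              # probability of rounding up
    return lo + delta * (rng.random(x.shape) < p_up)

def pqda(W, x0, delta, max_iters=20000):
    """Probabilistically quantized distributed averaging (sketch): nodes
    broadcast dithered-quantized values and mix them with weights W."""
    x = x0.copy()
    for t in range(max_iters):
        q = dithered_quantize(x, delta)
        if np.ptp(q) == 0:               # all nodes sit on one level
            return q, t
        x = W @ q
    return x, max_iters

# Ring of 8 nodes with uniform mixing weights (illustrative).
N = 8
W = np.zeros((N, N))
for i in range(N):
    W[i, i], W[i, (i - 1) % N], W[i, (i + 1) % N] = 0.5, 0.25, 0.25

x0 = rng.uniform(0, 10, size=N)
q, iters = pqda(W, x0, delta=0.5)
print(f"true average {x0.mean():.3f} -> consensus level {q[0]:.3f} "
      f"after {iters} rounds")
```

Since the quantizer is unbiased, each run lands on a quantization level near the true average, and repeating the experiment shows the expected consensus value matching the initial mean, as the abstract states.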
Greedy Gossip With Eavesdropping
Deniz Ustebay
Boris Oreshkin
This paper presents greedy gossip with eavesdropping (GGE), a novel randomized gossip algorithm for distributed computation of the average consensus problem. In gossip algorithms, nodes in the network randomly communicate with their neighbors and exchange information iteratively. The algorithms are simple and decentralized, making them attractive for wireless network applications. In general, gossip algorithms are robust to unreliable wireless conditions and time-varying network topologies. In this paper, we introduce GGE and demonstrate that greedy updates lead to rapid convergence. We do not require nodes to have any location information. Instead, greedy updates are made possible by exploiting the broadcast nature of wireless communications. During the operation of GGE, when a node decides to gossip, instead of choosing one of its neighbors at random, it makes a greedy selection, choosing the node which has the value most different from its own. In order to make this selection, nodes need to know their neighbors' values. Therefore, we assume that all transmissions are wireless broadcasts and nodes keep track of their neighbors' values by eavesdropping on their communications. We show that the convergence of GGE is guaranteed for connected network topologies. We also study the rates of convergence and illustrate, through theoretical bounds and numerical simulations, that GGE consistently outperforms randomized gossip and performs comparably to geographic gossip on moderate-sized random geometric graph topologies.
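A centralized simulation sketch of the greedy selection rule: when a node activates, it gossips with the neighbor whose value differs most from its own, and the pair replaces both values with their average. In this sketch the eavesdropped copies coincide with the true neighbor values, since every pairwise update is assumed to be broadcast and overheard without loss; the grid topology and round count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

def gge(adj_list, x0, rounds):
    """Greedy gossip with eavesdropping (sketch): at each tick a random
    node gossips with the neighbor whose overheard value differs most
    from its own; both take the pairwise average."""
    x = x0.copy()
    n = len(x)
    for _ in range(rounds):
        i = rng.integers(n)                    # random node activates
        # Greedy choice; in a real deployment x[k] would be the cached
        # value learned by eavesdropping on k's broadcasts.
        j = max(adj_list[i], key=lambda k: abs(x[k] - x[i]))
        x[i] = x[j] = 0.5 * (x[i] + x[j])      # average, then broadcast
    return x

# Simple 2D grid standing in for a random geometric graph.
side = 6
n = side * side
adj_list = {}
for r in range(side):
    for c in range(side):
        i = r * side + c
        adj_list[i] = [rr * side + cc
                       for rr, cc in ((r - 1, c), (r + 1, c),
                                      (r, c - 1), (r, c + 1))
                       if 0 <= rr < side and 0 <= cc < side]

x0 = rng.normal(size=n)
x = gge(adj_list, x0, rounds=2000)
print(f"average {x0.mean():+.4f}, spread after gossip {np.ptp(x):.2e}")
```

Each pairwise average preserves the network sum, so the iterates stay centered on the true average while the greedy choice attacks the largest local disagreement first, which is what drives the faster convergence reported in the paper.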