Edouard Oyallon

Stabilizing Native Low-Rank LLM Pretraining

Foundation models have achieved remarkable success, yet their growing parameter counts pose significant computational and memory challenges.… (voir plus) Low-rank factorization offers a promising route to reduce training and inference costs, but the community lacks a stable recipe for training models from scratch using exclusively low-rank weights while matching the performance of the dense model. We demonstrate that Large Language Models (LLMs) can be trained from scratch using exclusively low-rank factorized weights for all non-embedding matrices without auxiliary"full-rank"guidance required by prior methods. While native low-rank training often suffers from instability and loss spikes, we identify uncontrolled growth in the spectral norm (largest singular value) of the weight matrix update as the dominant factor. To address this, we introduce Spectron: Spectral renormalization with orthogonalization, which dynamically bounds the resultant weight updates based on the current spectral norms of the factors. Our method enables stable, end-to-end factorized training with negligible overhead. Finally, we establish compute-optimal scaling laws for natively low-rank transformers, demonstrating predictable power-law behavior and improved inference efficiency relative to dense models.

2026-02-11

ArXiv (prépublication)

arxiv.org

$\mu$LO: Compute-Efficient Meta-Generalization of Learned Optimizers

Benjamin Therien

Charles-Etienne Joseph

Learned optimizers (LOs) have the potential to significantly reduce the wall-clock training time of neural networks. However, they can strug… (voir plus)gle to optimize unseen tasks (*meta-generalize*), especially when training networks wider than those seen during meta-training. To address this, we derive the Maximal Update Parametrization (

2025-12-31

International Conference on Learning Representations (Accept (Poster))

openreview.net

ACCO: Accumulate While You Communicate for Communication-Overlapped Sharded LLM Training

Adel Nabli

Louis Fournier

Pierre Erbacher

Louis Serrano

Eugene Belilovsky

Edouard Oyallon

Training LLMs relies on distributed implementations using multiple GPUs to compute gradients in parallel with sharded optimizers. However, s… (voir plus)ynchronizing gradients in data parallel setups introduces communication overhead that grows with the number of workers, limiting parallelization efficiency. Local optimization algorithms reduce communications but incur high memory costs as they prevent optimizer state sharding, hindering scalability. To address this, we propose \textbf{AC}cumulate while \textbf{CO}mmunicate (ACCO), a memory-efficient optimization algorithm for distributed LLM training. By synchronizing delayed gradients while computing new ones, ACCO reduces GPU idle time and supports heterogeneous hardware. To mitigate the convergence issues caused by delayed updates, we introduce a novel technique ensuring training dynamics align with standard distributed optimization. Compared to ZeRO-1, our approach is significantly faster and scales effectively across heterogeneous hardware.

2025-09-17

NeurIPS.cc/2025/Conference (poster)

doi.org

openreview.net

Model Parallelism With Subnetwork Data Parallelism

Distributed pre-training of large models at scale often imposes heavy memory demands on individual nodes and incurs significant intra-node c… (voir plus)ommunication costs. We propose a novel alternative approach that reduces the memory requirements by training small, structured subnetworks of the model on separate workers. Unlike pipelining, our method avoids inter-node activation communication and maintains bandwidth requirements that are comparable to or lower than standard data parallel communication schemes based on all-reduce. We evaluate two subnetwork construction strategies guided by the principle of ensuring uniform representation of each parameter across the distributed training setup. Our results show that the stochastic block dropping technique consistently outperforms the width-wise subnetwork construction previously explored in federated learning. We empirically attribute this superior performance to stronger gradient alignment in subnetworks that retain blocks having skip connections. Preliminary experiments highlight the promise of our approach, achieving a

2025-06-10

ICML.cc/2025/Workshop/ES-FoMo-III (publié)

doi.org

openreview.net

PETRA: Parallel End-to-End Training of Reversible Architectures

Stéphane Rivaud

Louis Fournier

Thomas Pumir

Eugene Belilovsky

Mickael Eickenberg

Edouard Oyallon

Reversible architectures have been shown to be capable of performing on par with their non-reversible architectures, being applied in deep l… (voir plus)earning for memory savings and generative modeling. In this work, we show how reversible architectures can solve challenges in parallelizing deep model training. We introduce PETRA, a novel alternative to backpropagation for parallelizing gradient computations. PETRA facilitates effective model parallelism by enabling stages (i.e., a set of layers) to compute independently on different devices, while only needing to communicate activations and gradients between each other. By decoupling the forward and backward passes and keeping a single updated version of the parameters, the need for weight stashing is also removed. We develop a custom autograd-like training framework for PETRA, and we demonstrate its effectiveness on CIFAR-10, ImageNet32, and ImageNet, achieving competitive accuracies comparable to backpropagation using ResNet-18, ResNet-34, and ResNet-50 models.

2024-12-31

ICLR (publié)

doi.org

openreview.net

μLO: Compute-Efficient Meta-Generalization of Learned Optimizers

Benjamin Therien

Charles-Etienne Joseph

2024-10-09

NeurIPS.cc/2024/Workshop/OPT (publié)

doi.org

openreview.net

WASH: Train your Ensemble with Communication-Efficient Weight Shuffling, then Average

Masih Aminbeidokhti

The performance of deep neural networks is enhanced by ensemble methods, which average the output of several models. However, this comes at … (voir plus)an increased cost at inference. Weight averaging methods aim at balancing the generalization of ensembling and the inference speed of a single model by averaging the parameters of an ensemble of models. Yet, naive averaging results in poor performance as models converge to different loss basins, and aligning the models to improve the performance of the average is challenging. Alternatively, inspired by distributed training, methods like DART and PAPA have been proposed to train several models in parallel such that they will end up in the same basin, resulting in good averaging accuracy. However, these methods either compromise ensembling accuracy or demand significant communication between models during training. In this paper, we introduce WASH, a novel distributed method for training model ensembles for weight averaging that achieves state-of-the-art image classification accuracy. WASH maintains models within the same basin by randomly shuffling a small percentage of weights during training, resulting in diverse models and lower communication costs compared to standard parameter averaging methods.

2024-10-09

NeurIPS.cc/2024/Workshop/OPT (publié)

doi.org

openreview.net

Guiding The Last Layer in Federated Learning with Pre-Trained Models

Lucas Caccia

Federated Learning (FL) is an emerging paradigm that allows a model to be trained across a number of participants without sharing data. Rece… (voir plus)nt works have begun to consider the effects of using pre-trained models as an initialization point for existing FL algorithms; however, these approaches ignore the vast body of efficient transfer learning literature from the centralized learning setting. Here we revisit the problem of FL from a pre-trained model considered in prior work and expand it to a set of computer vision transfer learning problems. We first observe that simply fitting a linear classification head can be efficient and effective in many cases. We then show that in the FL setting, fitting a classifier using the Nearest Class Means (NCM) can be done exactly and orders of magnitude more efficiently than existing proposals, while obtaining strong performance. Finally, we demonstrate that using a two-phase approach of obtaining the classifier and then fine-tuning the model can yield rapid convergence and improved generalization in the federated setting. We demonstrate the potential our method has to reduce communication and compute costs while achieving better model performance.

2023-09-20

NeurIPS.cc/2023/Conference (poster)

doi.org

openreview.net

Preventing Dimensional Collapse in Contrastive Local Learning with Subsampling

Louis Fournier

Adeetya Patel

Michael Eickenberg

Edouard Oyallon

Eugene Belilovsky

2023-06-15

ICML.cc/2023/Workshop/LLW (publié)

openreview.net

A²CiD²: Accelerating Asynchronous Communication in Decentralized Deep Learning

Adel Nabli

Eugene Belilovsky

Edouard Oyallon

Distributed training of Deep Learning models has been critical to many recent successes in the field. Current standard methods primarily rel… (voir plus)y on synchronous centralized algorithms which induce major communication bottlenecks and synchronization locks at scale. Decentralized asynchronous algorithms are emerging as a potential alternative but their practical applicability still lags. In order to mitigate the increase in communication cost that naturally comes with scaling the number of workers, we introduce a principled asynchronous, randomized, gossip-based optimization algorithm which works thanks to a continuous local momentum named

2022-12-31

NeurIPS (publié)

doi.org

openreview.net

Can Forward Gradient Match Backpropagation?

Louis Fournier

Stéphane Rivaud

Eugene Belilovsky

Michael Eickenberg

Edouard Oyallon

Forward Gradients - the idea of using directional derivatives in forward differentiation mode - have recently been shown to be utilizable fo… (voir plus)r neural network training while avoiding problems generally associated with backpropagation gradient computation, such as locking and memorization requirements. The cost is the requirement to guess the step direction, which is hard in high dimensions. While current solutions rely on weighted averages over isotropic guess vector distributions, we propose to strongly bias our gradient guesses in directions that are much more promising, such as feedback obtained from small, local auxiliary networks. For a standard computer vision neural network, we conduct a rigorous study systematically covering a variety of combinations of gradient targets and gradient guesses, including those previously presented in the literature. We find that using gradients obtained from a local loss as a candidate direction drastically improves on random noise in Forward Gradient methods.

2022-12-31

ICML (publié)

doi.org

proceedings.mlr.press

Gradient Masked Averaging for Federated Learning

Irene Tenison

Sai Aravind Sreeramadas

Vaikkunth Mugunthan

Edouard Oyallon

Eugene Belilovsky

Irina Rish

Federated learning (FL) is an emerging paradigm that permits a large number of clients with heterogeneous data to coordinate learning of a u… (voir plus)nified global model without the need to share data amongst each other. A major challenge in federated learning is the heterogeneity of data across client, which can degrade the performance of standard FL algorithms. Standard FL algorithms involve averaging of model parameters or gradient updates to approximate the global model at the server. However, we argue that in heterogeneous settings, averaging can result in information loss and lead to poor generalization due to the bias induced by dominant client gradients. We hypothesize that to generalize better across non-i.i.d datasets, the algorithms should focus on learning the invariant mechanism that is constant while ignoring spurious mechanisms that differ across clients. Inspired from recent works in Out-of-Distribution generalization, we propose a gradient masked averaging approach for FL as an alternative to the standard averaging of client updates. This aggregation technique for client updates can be adapted as a drop-in replacement in most existing federated algorithms. We perform extensive experiments on multiple FL algorithms with in-distribution, real-world, feature-skewed out-of-distribution, and quantity imbalanced datasets and show that it provides consistent improvements, particularly in the case of heterogeneous clients.

2022-12-31

Trans. Mach. Learn. Res. (publié)

openreview.net

Mila sur Udemy

Désinformation 2.0 : quand l’IA brouille nos ondes

Publications du Fellowship en politiques de l'IA

Edouard Oyallon

Publications

Mila sur Udemy

Désinformation 2.0 : quand l’IA brouille nos ondes

Publications du Fellowship en politiques de l'IA

Mots-clés populaires:

Edouard Oyallon

Publications