Adel Nabli

ACCO: Accumulate While You Communicate for Communication-Overlapped Sharded LLM Training

Pierre Erbacher

Louis Serrano

Training LLMs relies on distributed implementations using multiple GPUs to compute gradients in parallel with sharded optimizers. However, s… (voir plus)ynchronizing gradients in data parallel setups introduces communication overhead that grows with the number of workers, limiting parallelization efficiency. Local optimization algorithms reduce communications but incur high memory costs as they prevent optimizer state sharding, hindering scalability. To address this, we propose \textbf{AC}cumulate while \textbf{CO}mmunicate (ACCO), a memory-efficient optimization algorithm for distributed LLM training. By synchronizing delayed gradients while computing new ones, ACCO reduces GPU idle time and supports heterogeneous hardware. To mitigate the convergence issues caused by delayed updates, we introduce a novel technique ensuring training dynamics align with standard distributed optimization. Compared to ZeRO-1, our approach is significantly faster and scales effectively across heterogeneous hardware.

2025-09-17

NeurIPS.cc/2025/Conference (poster)

doi.org

openreview.net

WASH: Train your Ensemble with Communication-Efficient Weight Shuffling, then Average

Masih Aminbeidokhti

The performance of deep neural networks is enhanced by ensemble methods, which average the output of several models. However, this comes at … (voir plus)an increased cost at inference. Weight averaging methods aim at balancing the generalization of ensembling and the inference speed of a single model by averaging the parameters of an ensemble of models. Yet, naive averaging results in poor performance as models converge to different loss basins, and aligning the models to improve the performance of the average is challenging. Alternatively, inspired by distributed training, methods like DART and PAPA have been proposed to train several models in parallel such that they will end up in the same basin, resulting in good averaging accuracy. However, these methods either compromise ensembling accuracy or demand significant communication between models during training. In this paper, we introduce WASH, a novel distributed method for training model ensembles for weight averaging that achieves state-of-the-art image classification accuracy. WASH maintains models within the same basin by randomly shuffling a small percentage of weights during training, resulting in diverse models and lower communication costs compared to standard parameter averaging methods.

2024-10-09

NeurIPS.cc/2024/Workshop/OPT (publié)

doi.org

openreview.net

A²CiD²: Accelerating Asynchronous Communication in Decentralized Deep Learning

Adel Nabli

Eugene Belilovsky

Edouard Oyallon

Distributed training of Deep Learning models has been critical to many recent successes in the field. Current standard methods primarily rel… (voir plus)y on synchronous centralized algorithms which induce major communication bottlenecks and synchronization locks at scale. Decentralized asynchronous algorithms are emerging as a potential alternative but their practical applicability still lags. In order to mitigate the increase in communication cost that naturally comes with scaling the number of workers, we introduce a principled asynchronous, randomized, gossip-based optimization algorithm which works thanks to a continuous local momentum named

2022-12-31

NeurIPS (publié)

doi.org

openreview.net

Publications du Fellowship en politiques de l'IA

La plateforme Mila Ventures

Boussole des politiques en IA

Publications

Publications du Fellowship en politiques de l'IA

La plateforme Mila Ventures

Boussole des politiques en IA

Mots-clés populaires:

Adel Nabli

Publications