Louis Fournier

ACCO: Accumulate while you Communicate, Hiding Communications in Distributed LLM Training

Pierre ERBACHER

Louis Serrano

2024-10-10

NeurIPS.cc/2024/Workshop/OPT (published)

WASH: Train your Ensemble with Communication-Efficient Weight Shuffling, then Average

Masih Aminbeidokhti

The performance of deep neural networks is enhanced by ensemble methods, which average the output of several models. However, this comes at an increased cost at inference. Weight averaging methods aim at balancing the generalization of ensembling and the inference speed of a single model by averaging the parameters of an ensemble of models. Yet, naive averaging results in poor performance as models converge to different loss basins, and aligning the models to improve the performance of the average is challenging. Alternatively, inspired by distributed training, methods like DART and PAPA have been proposed to train several models in parallel such that they will end up in the same basin, resulting in good averaging accuracy. However, these methods either compromise ensembling accuracy or demand significant communication between models during training. In this paper, we introduce WASH, a novel distributed method for training model ensembles for weight averaging that achieves state-of-the-art image classification accuracy. WASH maintains models within the same basin by randomly shuffling a small percentage of weights during training, resulting in diverse models and lower communication costs compared to standard parameter averaging methods.

2024-10-10

NeurIPS.cc/2024/Workshop/OPT (published)
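The weight-shuffling idea in the WASH abstract can be illustrated with a toy sketch. It assumes each model is a flat list of floats and that "shuffling" means permuting a selected parameter position across the ensemble members with probability `p`. The names `shuffle_weights` and `average_weights` are hypothetical, and this is not the authors' implementation.

```python
import random


def shuffle_weights(models, p, rng):
    """Permute each parameter position across models with probability p.

    `models` is a list of equal-length lists of floats (flattened parameters).
    Returns new lists; the inputs are left unmodified.
    """
    n_params = len(models[0])
    shuffled = [list(m) for m in models]
    for i in range(n_params):
        if rng.random() < p:
            # Collect the i-th parameter of every model and permute the values.
            values = [m[i] for m in models]
            rng.shuffle(values)
            for m, v in zip(shuffled, values):
                m[i] = v
    return shuffled


def average_weights(models):
    """Average the parameters position by position (the final 'then average' step)."""
    n = len(models)
    return [sum(column) / n for column in zip(*models)]


# Toy ensemble: 4 "models" with 50 parameters each (assumed values).
rng = random.Random(0)
ensemble = [[float(10 * k + i) for i in range(50)] for k in range(4)]
mixed = shuffle_weights(ensemble, p=0.2, rng=rng)
```

Because each selected position is only permuted among the members, the position-wise average of the ensemble is unchanged by a shuffle in this sketch; only the assignment of values to individual models changes.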

PETRA: Parallel End-to-end Training with Reversible Architectures

Stephane Rivaud

Thomas Pumir

Michael Eickenberg

Reversible architectures have been shown to be capable of performing on par with their non-reversible counterparts, being applied in deep learning for memory savings and generative modeling. In this work, we show how reversible architectures can solve challenges in parallelizing deep model training. We introduce PETRA, a novel alternative to backpropagation for parallelizing gradient computations. PETRA facilitates effective model parallelism by enabling stages (i.e., a set of layers) to compute independently on different devices, while only needing to communicate activations and gradients between each other. By decoupling the forward and backward passes and keeping a single updated version of the parameters, the need for weight stashing is also removed. We develop a custom autograd-like training framework for PETRA, and we demonstrate its effectiveness on CIFAR-10, ImageNet32, and ImageNet, achieving competitive accuracies comparable to backpropagation using ResNet-18, ResNet-34, and ResNet-50 models.

2024-06-04

ArXiv (preprint)

ACCO: Accumulate While You Communicate for Communication-Overlapped Sharded LLM Training

Pierre ERBACHER

Louis Serrano

Training LLMs relies on distributed implementations using multiple GPUs to compute gradients in parallel with sharded optimizers. However, synchronizing gradients in data parallel setups introduces communication overhead that grows with the number of workers, limiting parallelization efficiency. Local optimization algorithms reduce communications but incur high memory costs as they prevent optimizer state sharding, hindering scalability. To address this, we propose ACcumulate while COmmunicate (ACCO), a memory-efficient optimization algorithm for distributed LLM training. By synchronizing delayed gradients while computing new ones, ACCO reduces GPU idle time and supports heterogeneous hardware. To mitigate the convergence issues caused by delayed updates, we introduce a novel technique ensuring training dynamics align with standard distributed optimization. Compared to ZeRO-1, our approach is significantly faster and scales effectively across heterogeneous hardware.

2024-06-03

ArXiv (preprint)

ACCO: Accumulate while you Communicate, Hiding Communications in Distributed LLM Training

Pierre ERBACHER

Louis Serrano

Training Large Language Models (LLMs) relies heavily on distributed implementations, employing multiple GPUs to compute stochastic gradients on model replicas in parallel. However, synchronizing gradients in data parallel settings induces a communication overhead increasing with the number of distributed workers, which can impede the efficiency gains of parallelization. To address this challenge, optimization algorithms reducing inter-worker communication have emerged, such as local optimization methods used in Federated Learning. While effective in minimizing communication overhead, these methods incur significant memory costs, hindering scalability: in addition to extra momentum variables, if communications are only allowed between multiple local optimization steps, then the optimizer's states cannot be sharded among workers. In response, we propose

2024-06-03

ArXiv (preprint)

Preventing Dimensional Collapse in Contrastive Local Learning with Subsampling

Adeetya Patel

Michael Eickenberg

2023-06-16

ICML.cc/2023/Workshop/LLW (published)

Can Forward Gradient Match Backpropagation?

Stephane Rivaud

Michael Eickenberg

Forward Gradients - the idea of using directional derivatives in forward differentiation mode - have recently been shown to be utilizable for neural network training while avoiding problems generally associated with backpropagation gradient computation, such as locking and memorization requirements. The cost is the requirement to guess the step direction, which is hard in high dimensions. While current solutions rely on weighted averages over isotropic guess vector distributions, we propose to strongly bias our gradient guesses in directions that are much more promising, such as feedback obtained from small, local auxiliary networks. For a standard computer vision neural network, we conduct a rigorous study systematically covering a variety of combinations of gradient targets and gradient guesses, including those previously presented in the literature. We find that using gradients obtained from a local loss as a candidate direction drastically improves on random noise in Forward Gradient methods.

2023-01-01

ICML (published)
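The forward-gradient idea described in the abstract above can be sketched on a toy problem: the estimate is the guess direction scaled by the directional derivative of the loss along that guess, computed in forward mode. The sketch below is an illustration under stated assumptions and not the paper's code. It uses a minimal hypothetical `Dual` number class for forward-mode differentiation, and `loss` is an assumed sum-of-squares toy function.

```python
import math
import random


class Dual:
    """Number val + tan*eps with eps**2 == 0; `tan` carries the directional derivative."""

    def __init__(self, val, tan=0.0):
        self.val, self.tan = val, tan

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.tan + other.tan)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule for the tangent part.
        return Dual(self.val * other.val,
                    self.val * other.tan + self.tan * other.val)

    __rmul__ = __mul__


def directional_derivative(f, theta, direction):
    """Forward-mode derivative of f at theta along `direction` (one forward pass)."""
    duals = [Dual(t, d) for t, d in zip(theta, direction)]
    return f(duals).tan


def forward_gradient(f, theta, guess):
    """Forward-gradient estimate: (grad f . guess) * guess."""
    scale = directional_derivative(f, theta, guess)
    return [scale * g for g in guess]


def loss(x):
    """Hypothetical toy loss: sum of squares, whose true gradient is 2*x."""
    return sum(xi * xi for xi in x)


theta = [1.0, 2.0, 3.0]
true_grad = [2.0 * t for t in theta]

# An isotropic random guess: the estimate never points against the true gradient.
rng = random.Random(0)
random_guess = [rng.gauss(0.0, 1.0) for _ in theta]
random_estimate = forward_gradient(loss, theta, random_guess)

# A well-aligned unit-norm guess recovers the true gradient in this toy case.
norm = math.sqrt(sum(g * g for g in true_grad))
aligned_guess = [g / norm for g in true_grad]
aligned_estimate = forward_gradient(loss, theta, aligned_guess)
```

In this toy setting the aligned guess reproduces the gradient, while a random guess only captures its projection onto one direction. That is the intuition behind biasing guesses toward promising directions such as local-loss gradients.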