
Charles-Etienne Joseph

Alumni

Publications

Continual Pre-training of MoEs: How robust is your router?
Zain Sarwar
Ashwinee Panda
Anirban Das
Shi-Xiong Zhang
Stephen Rawls
Sambit Sahu
Sparsely-activated Mixture of Experts (MoE) transformers are promising architectures for foundation models. Compared to dense transformers that require the same amount of floating-point operations (FLOPs) per forward pass, MoEs benefit from improved sample efficiency at training time and achieve much stronger performance. Many closed-source and open-source frontier language models have thus adopted an MoE architecture. Naturally, practitioners will want to extend the capabilities of these models with large amounts of newly collected data without completely re-training them. Prior work has shown that a simple combination of replay, learning rate re-warming, and re-decaying can enable the continual pre-training (CPT) of dense decoder-only transformers with minimal performance degradation compared to full re-training. In the case of decoder-only MoE transformers, however, it is unclear how the routing algorithm will impact continual pre-training performance: 1) *do the MoE transformer's routers exacerbate forgetting relative to a dense model?*; 2) *do the routers maintain a balanced load on previous distributions after CPT?*; 3) *are the same strategies applied to dense models sufficient to continually pre-train MoE LLMs?* In what follows, we conduct a large-scale study training a 500M parameter dense transformer and four 500M-active/2B-total parameter MoE transformers, following the Switch Transformer architecture and a granular DeepSeek-inspired architecture. Each model is trained for 600B tokens. Our results establish a surprising robustness to distribution shifts for MoEs using both Sinkhorn-Balanced and Z-and-Aux-loss-balanced routing algorithms, even in MoEs continually pre-trained without replay. Moreover, we show that MoE LLMs maintain their sample efficiency (relative to a FLOP-matched dense model) during CPT and that they can match the performance of a fully re-trained MoE at a fraction of the cost.
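To make the learning-rate re-warming and re-decaying mentioned in the abstract concrete, here is a minimal schedule sketch. It is not the paper's exact recipe; the function name, hyperparameter values, and the choice of a linear re-warm followed by a cosine re-decay are illustrative assumptions.

```python
import math


def cpt_learning_rate(step, total_steps, peak_lr=3e-4, min_lr=3e-5, warmup_steps=1000):
    """Learning rate for a continual pre-training (CPT) phase.

    Sketch only: re-warm linearly from min_lr back to peak_lr, then
    re-decay with a cosine schedule down to min_lr. Hyperparameter
    defaults are hypothetical, not taken from the paper.
    """
    if step < warmup_steps:
        # Re-warming: climb back up from the end-of-pre-training LR.
        return min_lr + (peak_lr - min_lr) * step / warmup_steps
    # Re-decaying: cosine anneal over the remaining CPT steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```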
Meta-learning Optimizers for Communication-Efficient Learning
μLO: Compute-Efficient Meta-Generalization of Learned Optimizers
Can We Learn Communication-Efficient Optimizers?
Learning Optimizers for Local SGD
Communication-efficient variants of SGD, specifically local SGD, have received a great deal of interest in recent years. These approaches compute multiple gradient steps locally, that is on each worker, before averaging model parameters, helping relieve the critical communication bottleneck in distributed deep learning training. Although many variants of these approaches have been proposed, they can sometimes lag behind state-of-the-art optimizers for deep learning. In this work, we incorporate local optimizers that compute multiple updates into a learned optimization framework, allowing us to meta-learn potentially more efficient local SGD algorithms. Our results demonstrate that local learned optimizers can substantially outperform local SGD and its sophisticated variants while maintaining their communication efficiency. We show that the learned optimizers can generalize to new datasets and architectures, demonstrating the potential of learned optimizers for improving communication-efficient distributed learning.
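For reference, here is a minimal single-process sketch of the local SGD baseline described in the abstract: each simulated worker takes several SGD steps on its own data, and parameters are averaged only once per communication round. This is the plain SGD baseline, not the paper's learned-optimizer variant; the function name, loss choice, and hyperparameters are illustrative assumptions, and a real implementation would run workers in parallel over a distributed backend.

```python
import copy
import itertools

import torch


def local_sgd_round(global_model, worker_loaders, local_steps=8, lr=0.1):
    """One communication round of local SGD (single-process simulation).

    Each worker copies the global model, runs `local_steps` SGD updates on
    its own data loader, and the resulting parameters are averaged to form
    the new global model. Assumes a regression-style model with float
    parameters/buffers (illustrative sketch only).
    """
    worker_states = []
    for loader in worker_loaders:
        model = copy.deepcopy(global_model)
        opt = torch.optim.SGD(model.parameters(), lr=lr)
        batches = itertools.cycle(loader)  # reuse batches if the loader is short
        for _ in range(local_steps):
            x, y = next(batches)
            opt.zero_grad()
            loss = torch.nn.functional.mse_loss(model(x), y)
            loss.backward()
            opt.step()
        worker_states.append(model.state_dict())

    # Communication step: average parameters (and buffers) across workers.
    averaged = {
        key: torch.stack([state[key].float() for state in worker_states]).mean(dim=0)
        for key in worker_states[0]
    }
    global_model.load_state_dict(averaged)
    return global_model
```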