Benjamin Therien

Samuel Dare

Recently, there has been increased interest in globally distributed training, which has the promise to both reduce training costs and democr… (voir plus)atize participation in building large-scale foundation models. However, existing models trained in a globally distributed manner are relatively small in scale and have only been trained with whitelisted participants. Therefore, they do not yet realize the full promise of democratized participation. In this report, we describe Covenant-72B, an LLM produced by the largest collaborative globally distributed pre-training run (in terms of both compute and model scale), which simultaneously allowed open, permissionless participation supported by a live blockchain protocol. We utilized a state-of-the-art communication-efficient optimizer, SparseLoCo, supporting dynamic participation with peers joining and leaving freely. Our model, pre-trained on approximately 1.1T tokens, performs competitively with fully centralized models pre-trained on similar or higher compute budgets, demonstrating that fully democratized, non-whitelisted participation is not only feasible, but can be achieved at unprecedented scale for a globally distributed pre-training run.

2026-03-08

arXiv (prépublication)

arxiv.org

$\mu$LO: Compute-Efficient Meta-Generalization of Learned Optimizers

Charles-Etienne Joseph

Learned optimizers (LOs) have the potential to significantly reduce the wall-clock training time of neural networks. However, they can strug… (voir plus)gle to optimize unseen tasks (*meta-generalize*), especially when training networks wider than those seen during meta-training. To address this, we derive the Maximal Update Parametrization (

2025-12-31

International Conference on Learning Representations (Accept (Poster))

Continual Pre-training of MoEs: How robust is your router?

Charles-Etienne Joseph

Zain Sarwar

Ashwinee Panda

Anirban Das

Shi-Xiong Zhang

Stephen Rawls

Sambit Sahu

Irina Rish

2025-09-25

TMLR (accepté)

Revisiting Replay and Gradient Alignment for Continual Pre-Training of Large Language Models

Matthew D Riemer

Tsuguchika Tabaru

Hiroaki Kingetsu

A. Chandar

Irina Rish

2025-09-21

NeurIPS.cc/2025/Workshop/WiML (publié)

Communication Efficient LLM Pre-training with SparseLoCo

Amir M. Sarfi

Joel Lidin

2025-08-20

ArXiv (prépublication)

arxiv.org

Beyond Cosine Decay: On the effectiveness of Infinite Learning Rate Schedule for Continual Pre-training

The ever-growing availability of unlabeled data presents both opportunities and challenges for training artificial intelligence systems. Whi… (voir plus)le self-supervised learning (SSL) has emerged as a powerful paradigm for extracting meaningful representations from vast amounts of unlabeled data, existing methods still struggle to adapt to the non-stationary, non-IID nature of real-world data streams without forgetting previously learned knowledge. Recent works have adopted a repeated cosine annealing schedule for large-scale continual pre-training; however, these schedules (1) inherently cause forgetting during the re-warming phase and (2) have not been systematically compared to existing continual SSL methods. In this work, we systematically compare the widely used cosine schedule with the recently proposed infinite learning rate schedule and empirically find the latter to be a more effective alternative. Our extensive empirical evaluation across diverse image and language datasets demonstrates that the infinite learning rate schedule consistently enhances continual pre-training performance compared to a repeated cosine decay without being restricted to a fixed iteration budget. For instance, in a small-scale MAE pre-training setup, it outperforms several strong baselines from the literature. We then scale up our experiments to larger MAE pre-training and autoregressive language model pre-training. Our results show that the infinite learning rate schedule remains effective at scale, surpassing repeated cosine decay for both MAE pre-training and zero-shot LM benchmarks.

2025-06-10

ICML.cc/2025/Workshop/ES-FoMo-III (publié)

MuLoCo: Muon is a practical inner optimizer for DiLoCo

Xiaolong Huang

Irina Rish

2025-06-10

ICML.cc/2025/Workshop/ES-FoMo-III (publié)

PyLO: Towards Accessible Learned Optimizers in PyTorch

Paul Janson

Quentin Gregory Anthony

Xiaolong Huang

Abhinav Moudgil

Learned optimizers have been an active research topic over the past decade, with increasing progress toward practical, general-purpose optim… (voir plus)izers that can serve as drop-in replacements for widely used methods like Adam. However, recent advances -- such as VeLO, which was meta-trained for 4000 TPU-months -- remain largely inaccessible to the broader community, in part due to their reliance on JAX and the absence of user-friendly packages for applying the optimizers after meta-training. To address this gap, we introduce PyLO, a PyTorch-based library that brings learned optimizers to the broader machine learning community through familiar, widely adopted workflows. Unlike prior work focused on synthetic or convex tasks, our emphasis is on applying learned optimization to real-world large-scale pre-training tasks. Our release includes a CUDA-accelerated version of the small_fc_lopt learned optimizer architecture from (Metz et al., 2022a), delivering substantial speedups -- from 39.36 to 205.59 samples/sec throughput for training ViT B/16 with batch size 32. PyLO also allows us to easily combine learned optimizers with existing optimization tools such as learning rate schedules and weight decay. When doing so, we find that learned optimizers can substantially benefit. Our code is available at https://github.com/Belilovsky-Lab/pylo

2025-06-08

ICML.cc/2025/Workshop/CODEML (publié)