Publications

Model Parallelism With Subnetwork Data Parallelism

Distributed pre-training of large models at scale often imposes heavy memory demands on individual nodes and incurs significant intra-node c… (voir plus)ommunication costs. We propose a novel alternative approach that reduces the memory requirements by training small, structured subnetworks of the model on separate workers. Unlike pipelining, our method avoids inter-node activation communication and maintains bandwidth requirements that are comparable to or lower than standard data parallel communication schemes based on all-reduce. We evaluate two subnetwork construction strategies guided by the principle of ensuring uniform representation of each parameter across the distributed training setup. Our results show that the stochastic block dropping technique consistently outperforms the width-wise subnetwork construction previously explored in federated learning. We empirically attribute this superior performance to stronger gradient alignment in subnetworks that retain blocks having skip connections. Preliminary experiments highlight the promise of our approach, achieving a

2025-06-11

ICML.cc/2025/Workshop/ES-FoMo-III (publié)

doi.org

openreview.net

MuLoCo: Muon is a practical inner optimizer for DiLoCo

Benjamin Therien

Xiaolong Huang

Irina Rish

Eugene Belilovsky

2025-06-11

ICML.cc/2025/Workshop/ES-FoMo-III (publié)

doi.org

openreview.net

Next-Token Prediction Should be Ambiguity-Sensitive : A Meta-Learing Perspective

2025-06-11

ICML.cc/2025/Workshop/ES-FoMo-III (publié)

openreview.net

NovoMolGen: Rethinking Molecular Language Model Pretraining

Kamran Chitsaz

Roshan Balaji

Quentin Fournier

Nirav Pravinbhai Bhatt

Sarath Chandar

2025-06-11

ICML.cc/2025/Workshop/GenBio (poster)

doi.org

openreview.net

Overcoming Long-Context Limitations of State-Space Models via Context-Dependent Sparse Attention

Efficient long-context modeling remains a critical challenge for natural language processing (NLP), as the time complexity of the predominan… (voir plus)t Transformer architecture scales quadratically with the sequence length. While state-space models (SSMs) offer alternative sub-quadratic solutions, they struggle to capture long-range dependencies effectively. In this work, we focus on analyzing and improving the long-context modeling capabilities of SSMs. We show that the widely used synthetic task, associative recall, which requires a model to recall a value associated with a single key without context, insufficiently represents the complexities of real-world long-context modeling. To address this limitation, we extend the associative recall to a novel synthetic task, \emph{joint recall}, which requires a model to recall the value associated with a key given in a specified context. Theoretically, we prove that SSMs do not have the expressiveness to solve multi-query joint recall in sub-quadratic time complexity. To resolve this issue, we propose a solution based on integrating SSMs with Context-Dependent Sparse Attention (CDSA), which has the expressiveness to solve multi-query joint recall with sub-quadratic computation. To bridge the gap between theoretical analysis and real-world applications, we propose locality-sensitive Hashing Attention with sparse Key Selection (HAX), which instantiates the theoretical solution and is further tailored to natural language domains. Extensive experiments on both synthetic and real-world long-context benchmarks show that HAX consistently outperforms SSM baselines and SSMs integrated with context-independent sparse attention (CISA).

2025-06-11

ICML.cc/2025/Workshop/ES-FoMo-III (publié)

doi.org

openreview.net

Progressive Inference-Time Annealing of Diffusion Models for Sampling from Boltzmann Densities

Tara Akhound-Sadegh

Jungyoon Lee

Joey Bose

Valentin De Bortoli

Arnaud Doucet

Michael M. Bronstein

Sampling efficiently from a target unnormalized probability density remains a core challenge, with relevance across countless high-impact sc… (voir plus)ientific applications. A promising approach towards this challenge is the design of amortized samplers that borrow key ideas, such as probability path design, from state-of-the-art generative diffusion models. However, all existing diffusion-based samplers remain unable to draw samples from distributions at the scale of even simple molecular systems. In this paper, we propose Progressive Inference-Time Annealing (PITA), a novel framework to learn diffusion-based samplers that combines two complementary interpolation techniques: I.) Annealing of the Boltzmann distribution and II.) Diffusion smoothing. PITA trains a sequence of diffusion models from high to low temperatures by sequentially training each model at progressively higher temperatures, leveraging engineered easy access to samples of the temperature-annealed target density. In the subsequent step, PITA enables simulating the trained diffusion model to procure training samples at a lower temperature for the next diffusion model through inference-time annealing using a novel Feynman-Kac PDE combined with Sequential Monte Carlo. Empirically, PITA enables, for the first time, equilibrium sampling of N-body particle systems, Alanine Dipeptide, and tripeptides in Cartesian coordinates with dramatically lower energy function evaluations. Code available at: https://github.com/taraak/pita

2025-06-11

ICML.cc/2025/Workshop/GenBio (spotlight)

doi.org

openreview.net

Self-Predictive Representations for Combinatorial Generalization in Behavioral Cloning

Adriana Hugessen

Behavioral cloning (BC) methods trained with supervised learning (SL) are an effective way to learn policies from human demonstrations in do… (voir plus)mains like robotics. Goal-conditioning these policies enables a single generalist policy to capture diverse behaviors contained within an offline dataset. While goal-conditioned behavior cloning (GCBC) methods can perform well on in-distribution training tasks, they do not necessarily generalize zero-shot to tasks that require conditioning on novel state-goal pairs, i.e. combinatorial generalization. In part, this limitation can be attributed to a lack of temporal consistency in the state representation learned by BC; if temporally related states are encoded to similar latent representations, then the out-of-distribution gap for novel state-goal pairs would be reduced. Hence, encouraging this temporal consistency in the representation space should facilitate combinatorial generalization. Successor representations, which encode the distribution of future states visited from the current state, nicely encapsulate this property. However, previous methods for learning successor representations have relied on contrastive samples, temporal-difference (TD) learning, or both. In this work, we propose a simple yet effective representation learning objective,

2025-06-11

ArXiv (prépublication)

arxiv.org

Self-Predictive Representations for Combinatorial Generalization in Behavioral Cloning

Adriana Hugessen

Behavioral cloning (BC) methods trained with supervised learning (SL) are an effective way to learn policies from human demonstrations in do… (voir plus)mains like robotics. Goal-conditioning these policies enables a single generalist policy to capture diverse behaviors contained within an offline dataset. While goal-conditioned behavior cloning (GCBC) methods can perform well on in-distribution training tasks, they do not necessarily generalize zero-shot to tasks that require conditioning on novel state-goal pairs, i.e. combinatorial generalization. In part, this limitation can be attributed to a lack of temporal consistency in the state representation learned by BC; if temporally related states are encoded to similar latent representations, then the out-of-distribution gap for novel state-goal pairs would be reduced. Hence, encouraging this temporal consistency in the representation space should facilitate combinatorial generalization. Successor representations, which encode the distribution of future states visited from the current state, nicely encapsulate this property. However, previous methods for learning successor representations have relied on contrastive samples, temporal-difference (TD) learning, or both. In this work, we propose a simple yet effective representation learning objective,

2025-06-11

ArXiv (prépublication)

doi.org

arxiv.org

Task-Informed Meta-Learning for Remote Sensing

Gabriel Tseng

Hannah Kerner

David Rolnick

2025-06-11

2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) (publié)

doi.org

The 2025 PNPL Competition: Speech Detection and Phoneme Classification in the LibriBrain Dataset

Gilad Landau

Miran Ozdogan

Gereon Elvers

Francesco Mantegna

Pratik Somaiya

Dulhan Hansaja Jayalath

Luisa Kurth

Teyun Kwon

Brendan Shillingford

Greg Farquhar

Minqi Jiang

Karim Jerbi

Hamza Abdelhedi

Yorguin Mantilla Ramos

Caglar Gulçehre

M. Woolrich

Natalie Voets

Oiwi Parker Jones

The advance of speech decoding from non-invasive brain data holds the potential for profound societal impact. Among its most promising appli… (voir plus)cations is the restoration of communication to paralysed individuals affected by speech deficits such as dysarthria, without the need for high-risk surgical interventions. The ultimate aim of the 2025 PNPL competition is to produce the conditions for an"ImageNet moment"or breakthrough in non-invasive neural decoding, by harnessing the collective power of the machine learning community. To facilitate this vision we present the largest within-subject MEG dataset recorded to date (LibriBrain) together with a user-friendly Python library (pnpl) for easy data access and integration with deep learning frameworks. For the competition we define two foundational tasks (i.e. Speech Detection and Phoneme Classification from brain data), complete with standardised data splits and evaluation metrics, illustrative benchmark models, online tutorial code, a community discussion board, and public leaderboard for submissions. To promote accessibility and participation the competition features a Standard track that emphasises algorithmic innovation, as well as an Extended track that is expected to reward larger-scale computing, accelerating progress toward a non-invasive brain-computer interface for speech.

2025-06-11

ArXiv (prépublication)

doi.org

arxiv.org

The 2025 PNPL Competition: Speech Detection and Phoneme Classification in the LibriBrain Dataset

Gilad Landau

Miran Ozdogan

Gereon Elvers

Francesco Mantegna

Pratik Somaiya

Dulhan Hansaja Jayalath

Luisa Kurth

Teyun Kwon

Brendan Shillingford

Greg Farquhar

Minqi Jiang

Karim Jerbi

Hamza Abdelhedi

Yorguin Mantilla Ramos

Caglar Gulçehre

M. Woolrich

Natalie Voets

Oiwi Parker Jones

The advance of speech decoding from non-invasive brain data holds the potential for profound societal impact. Among its most promising appli… (voir plus)cations is the restoration of communication to paralysed individuals affected by speech deficits such as dysarthria, without the need for high-risk surgical interventions. The ultimate aim of the 2025 PNPL competition is to produce the conditions for an"ImageNet moment"or breakthrough in non-invasive neural decoding, by harnessing the collective power of the machine learning community. To facilitate this vision we present the largest within-subject MEG dataset recorded to date (LibriBrain) together with a user-friendly Python library (pnpl) for easy data access and integration with deep learning frameworks. For the competition we define two foundational tasks (i.e. Speech Detection and Phoneme Classification from brain data), complete with standardised data splits and evaluation metrics, illustrative benchmark models, online tutorial code, a community discussion board, and public leaderboard for submissions. To promote accessibility and participation the competition features a Standard track that emphasises algorithmic innovation, as well as an Extended track that is expected to reward larger-scale computing, accelerating progress toward a non-invasive brain-computer interface for speech.

2025-06-11

ArXiv (prépublication)

arxiv.org