Publications

SKOLR: Structured Koopman Operator Linear RNN for Time-Series Forecasting

Boris Oreshkin

Koopman operator theory provides a framework for nonlinear dynamical system analysis and time-series forecasting by mapping dynamics to a sp… (see more)ace of real-valued measurement functions, enabling a linear operator representation. Despite the advantage of linearity, the operator is generally infinite-dimensional. Therefore, the objective is to learn measurement functions that yield a tractable finite-dimensional Koopman operator approximation. In this work, we establish a connection between Koopman operator approximation and linear Recurrent Neural Networks (RNNs), which have recently demonstrated remarkable success in sequence modeling. We show that by considering an extended state consisting of lagged observations, we can establish an equivalence between a structured Koopman operator and linear RNN updates. Building on this connection, we present SKOLR, which integrates a learnable spectral decomposition of the input signal with a multilayer perceptron (MLP) as the measurement functions and implements a structured Koopman operator via a highly parallel linear RNN stack. Numerical experiments on various forecasting benchmarks and dynamical systems show that this streamlined, Koopman-theory-based design delivers exceptional performance. Our code is available at: https://github.com/networkslab/SKOLR.

2025-05-01

ICML.cc/2025/Conference (poster)

Structure-Aligned Protein Language Model

Can Chen

David Heurtel-Depeiges

Robert M. Vernon

Christopher J. Langmead

Yoshua Bengio

Quentin Fournier

2025-05-01

arXiv (published)

The Butterfly Effect: Neural Network Training Trajectories Are Highly Sensitive to Initial Conditions

Devin Kwok

Gül Sena Altıntaş

Colin Raffel

David Rolnick

Neural network training is inherently sensitive to initialization and the randomness induced by stochastic gradient descent. However, it is … (see more)unclear to what extent such effects lead to meaningfully different networks, either in terms of the models' weights or the underlying functions that were learned. In this work, we show that during the initial "chaotic" phase of training, even extremely small perturbations reliably causes otherwise identical training trajectories to diverge-an effect that diminishes rapidly over training time. We quantify this divergence through (i)

2025-05-01

ICML.cc/2025/Conference (poster)

The Courage to Stop: Overcoming Sunk Cost Fallacy in Deep Reinforcement Learning

Jiashun Liu

Johan Samir Obando Ceron

Pablo Samuel Castro

Aaron Courville

Ling Pan

Off-policy deep reinforcement learning (RL) typically leverages replay buffers for reusing past experiences during learning. This can help i… (see more)mprove sample efficiency when the collected data is informative and aligned with the learning objectives; when that is not the case, it can have the effect of"polluting"the replay buffer with data which can exacerbate optimization challenges in addition to wasting environment interactions due to wasteful sampling. We argue that sampling these uninformative and wasteful transitions can be avoided by addressing the sunk cost fallacy, which, in the context of deep RL, is the tendency towards continuing an episode until termination. To address this, we propose learn to stop (LEAST), a lightweight mechanism that enables strategic early episode termination based on Q-value and gradient statistics, which helps agents recognize when to terminate unproductive episodes early. We demonstrate that our method improves learning efficiency on a variety of RL algorithms, evaluated on both the MuJoCo and DeepMind Control Suite benchmarks.

2025-05-01

ICML.cc/2025/Conference (poster)

On the generalization of language models from in-context learning and finetuning: a controlled study

Andrew Lampinen

Arslan Chaudhry

Stephanie C.Y. Chan

Cody Wild

Diane Wan

Alexander Y. Ku

Alex Ku

Jorg Bornschein

Razvan Pascanu

Murray P. Shanahan

James L McClelland

2025-05-01

arXiv (published)

On the generalization of language models from in-context learning and finetuning: a controlled study

Andrew Lampinen

Arslan Chaudhry

Stephanie C.Y. Chan

Cody Wild

Diane Wan

Alexander Y. Ku

Jorg Bornschein

Razvan Pascanu

Murray P. Shanahan

James L McClelland

Large language models exhibit exciting capabilities, yet can show surprisingly narrow generalization from finetuning. E.g. they can fail to … (see more)generalize to simple reversals of relations they are trained on, or fail to make simple logical deductions based on trained information. These failures to generalize from fine-tuning can hinder practical application of these models. On the other hand, language models' in-context learning shows different inductive biases, and can generalize better in some cases. Here, we explore these differences in generalization between in-context- and fine-tuning-based learning. To do so, we constructed several novel datasets to evaluate and improve models' abilities to generalize from finetuning data. The datasets are designed to create clean tests of generalization, by isolating the knowledge in the dataset from that in pretraining. We expose pretrained large models to controlled subsets of the information in these datasets -- either in context, or through fine-tuning -- and evaluate their performance on test sets that require various types of generalization. We find overall that in data-matched settings, in-context learning can generalize more flexibly than fine-tuning (though we also find some qualifications of prior findings, such as cases when fine-tuning can generalize to reversals embedded in a larger structure of knowledge). We build on these findings to propose a method to enable improved generalization from fine-tuning: adding in-context inferences to finetuning data. We show that this method improves generalization across various splits of our datasets and other benchmarks. Our results have implications for understanding the inductive biases of different modes of learning in language models, and practically improving their performance.

2025-05-01

ArXiv (preprint)

The Impact of On-Policy Parallelized Data Collection on Deep Reinforcement Learning Networks

walter Mayor

Johan Samir Obando Ceron

Aaron Courville

Pablo Samuel Castro

The use of parallel actors for data collection has been an effective technique used in reinforcement learning (RL) algorithms. The manner in… (see more) which data is collected in these algorithms, controlled via the number of parallel environments and the rollout length, induces a form of bias-variance trade-off; the number of training passes over the collected data, on the other hand, must strike a balance between sample efficiency and overfitting. We conduct an empirical analysis of these trade-offs on PPO, one of the most popular RL algorithms that uses parallel actors, and establish connections to network plasticity and, more generally, optimization stability. We examine its impact on network architectures, as well as the hyper-parameter sensitivity when scaling data. Our analyses indicate that larger dataset sizes can increase final performance across a variety of settings, and that scaling parallel environments is more effective than increasing rollout lengths. These findings highlight the critical role of data collection strategies in improving agent performance.

2025-05-01

ICML.cc/2025/Conference (poster)

The NaijaVoices Dataset: Cultivating Large-Scale, High-Quality, Culturally-Rich Speech Data for African Languages

Chris Emezue

The NaijaVoices Community

Busayo Awobade

Abraham Owodunni

Handel Emezue

Gloria Monica Tobechukwu Emezue

N. N. Emezue

Sewade Ogun

Bunmi Akinremi

David Ifeoluwa Adelani

Chris Pal

2025-05-01

arXiv (published)

The Perils of Optimizing Learned Reward Functions: Low Training Error Does Not Guarantee Low Regret

Lukas Fluri

Leon Lang

Alessandro Abate

Patrick Forré

David Scott Krueger

Joar Max Viktor Skalse

In reinforcement learning, specifying reward functions that capture the intended task can be very challenging. Reward learning aims to addre… (see more)ss this issue by *learning* the reward function. However, a learned reward model may have a low error on the data distribution, and yet subsequently produce a policy with large regret. We say that such a reward model has an *error-regret mismatch*. The main source of an error-regret mismatch is the distributional shift that commonly occurs during policy optimization. In this paper, we mathematically show that a sufficiently low expected test error of the reward model guarantees low worst-case regret, but that for any *fixed* expected test error, there exist realistic data distributions that allow for error-regret mismatch to occur. We then show that similar problems persist even when using policy regularization techniques, commonly employed in methods such as RLHF. We hope our results stimulate the theoretical and empirical study of improved methods to learn reward models, and better ways to measure their quality reliably.

2025-05-01

ICML.cc/2025/Conference (poster)

A Theoretical Justification for Asymmetric Actor-Critic Algorithms

Gaspard Lambrechts

Damien Ernst

Aditya Mahajan

2025-05-01

ICML.cc/2025/Conference (poster)

Towards a Formal Theory of Representational Compositionality

Compositionality is believed to be fundamental to intelligence. In humans, it underlies the structure of thought and language. In AI, it ena… (see more)bles a powerful form of out-of-distribution generalization, in which a model systematically adapts to novel combinations of known concepts. However, while we have strong intuitions about what compositionality is, we lack satisfying formal definitions for it. Here, we propose such a definition called representational compositionality that is conceptually simple, quantitative, and grounded in algorithmic information theory. Intuitively, representational compositionality states that a compositional representation is both expressive and describable as a simple function of parts. We validate our definition on both real and synthetic data, and show how it unifies disparate intuitions from across the literature in both AI and cognitive science. We hope that our definition can inspire the design of novel, theoretically-driven models that better capture the mechanisms of compositional thought. We make our code available at https://github.com/EricElmoznino/complexity_compositionality.

2025-05-01

ICML.cc/2025/Conference (poster)