Khimya Khetarpal

Siddhartha Srinivasa

Abhishek Gupta

Parametric imitation learning via behavior cloning can suffer from poor generalization to out-of-distribution states due to compounding erro… (see more)rs during deployment. We show that reusing the training data during inference via a semi-parametric retrieval-based imitation learning approach can alleviate this challenge. We present Difference-Aware Retrieval Policies for Imitation Learning (DARP), a semi-parametric retrieval-based imitation learning approach that addresses this limitation by reparameterizing the imitation learning problem in terms of local neighborhood structure rather than direct state-to-action mappings. Instead of learning a global policy, DARP trains a model to predict actions based on

2026-06-07

arXiv (preprint)

Preventing Learning Stagnation in PPO by Scaling to 1 Million Parallel Environments

Michael Beukman

Zeyu Zheng

Will Dabney

Jakob Foerster

Michael Dennis

Clare Lyle

Plateaus, where an agent's performance stagnates at a suboptimal level, are a common problem in deep on-policy RL. Focusing on PPO due to it… (see more)s widespread adoption, we show that plateaus in certain regimes arise not because of known exploration, capacity, or optimization challenges, but because sample-based estimates of the loss eventually become poor proxies for the true objective over the course of training. As a recap, PPO switches between sampling rollouts from several parallel environments online using the current policy (which we call the outer loop) and performing repeated minibatch SGD steps against this offline dataset (the inner loop). In our work we consider only the outer loop, and conceptually model it as stochastic optimization. The step size is then controlled by the regularization strength towards the previous policy and the gradient noise by the number of samples collected between policy update steps. This model predicts that performance will plateau at a suboptimal level if the outer step size is too large relative to the noise. Recasting PPO in this light makes it clear that there are two ways to address this particular type of learning stagnation: either reduce the step size or increase the number of samples collected between updates. We first validate the predictions of our model and investigate how hyperparameter choices influence the step size and update noise, concluding that increasing the number of parallel environments is a simple and robust way to reduce both factors. Next, we propose a recipe for how to co-scale the other hyperparameters when increasing parallelization, and show that incorrectly doing so can lead to severe performance degradation. Finally, we vastly outperform prior baselines in a complex open-ended domain by scaling PPO to more than 1M parallel environments, thereby enabling monotonic performance improvement up to one trillion transitions.

2026-03-05

Open MIND (preprint)

Affordances Enable Partial World Modeling with LLMs

Gheorghe Comanici

Jonathan Richens

Jeremy Shar

Fei Xia

Laurent Orseau

Aleksandra Faust

Doina Precup

2026-02-10

ArXiv (preprint)

Robust Intervention Learning from Emergency Stop Interventions

Ethan Pronovost

Siddhartha Srinivasa

2026-02-02

ArXiv (preprint)

Plasticity as the Mirror of Empowerment

David Abel

Michael Bowling

Andre Barreto

Will Dabney

Shi Dong

Steven Hansen

Anna Harutyunyan

Clare Lyle

Razvan Pascanu

Georgios Piliouras

Doina Precup

Jonathan Richens

Mark Rowland

Tom Schaul

Satinder Singh

Agents are minimally entities that are influenced by their past observations and act to influence future observations. This latter capacity … (see more)is captured by empowerment, which has served as a vital framing concept across artificial intelligence and cognitive science. This former capacity, however, is equally foundational: In what ways, and to what extent, can an agent be influenced by what it observes? In this paper, we ground this concept in a universal agent-centric measure that we refer to as plasticity, and reveal a fundamental connection to empowerment. Following a set of desiderata on a suitable definition, we define plasticity using a new information-theoretic quantity we call the generalized directed information. We show that this new quantity strictly generalizes the directed information introduced by Massey (1990) while preserving all of its desirable properties. Under this definition, we find that plasticity is well thought of as the mirror of empowerment: The two concepts are defined using the same measure, with only the direction of influence reversed. Our main result establishes a tension between the plasticity and empowerment of an agent, suggesting that agent design needs to be mindful of both characteristics. We explore the implications of these findings, and suggest that plasticity, empowerment, and their relationship are essential to understanding agency

2025-09-17

NeurIPS.cc/2025/Conference (spotlight)

openreview.net

Long Range Navigator (LRN): Extending robot planning horizons beyond metric maps

Matt Schmittle

Rohan Baijal

Nathan Hatch

Rosario Scalise

Mateo Guaman Castro

Sidharth Talia

Byron Boots

Siddhartha Srinivasa

2025-08-07

robot-learning.org/CoRL/2025/Conference (poster)

proceedings.mlr.press

Self-Predictive Representations for Combinatorial Generalization in Behavioral Cloning

Adriana Hugessen

Behavioral cloning (BC) methods trained with supervised learning (SL) are an effective way to learn policies from human demonstrations in do… (see more)mains like robotics. Goal-conditioning these policies enables a single generalist policy to capture diverse behaviors contained within an offline dataset. While goal-conditioned behavior cloning (GCBC) methods can perform well on in-distribution training tasks, they do not necessarily generalize zero-shot to tasks that require conditioning on novel state-goal pairs, i.e. combinatorial generalization. In part, this limitation can be attributed to a lack of temporal consistency in the state representation learned by BC; if temporally related states are encoded to similar latent representations, then the out-of-distribution gap for novel state-goal pairs would be reduced. Hence, encouraging this temporal consistency in the representation space should facilitate combinatorial generalization. Successor representations, which encode the distribution of future states visited from the current state, nicely encapsulate this property. However, previous methods for learning successor representations have relied on contrastive samples, temporal-difference (TD) learning, or both. In this work, we propose a simple yet effective representation learning objective,

2025-06-30

rl-conference.cc/RLC/2025/Workshop/RLBrew (published)

openreview.net

Representation Learning via Non-Contrastive Mutual Information

Zhaohan Daniel Guo

Bernardo Avila Pires

Dale Schuurmans

Bo Dai

2025-04-22

ArXiv (preprint)

Cracking the Code of Action: A Generative Approach to Affordances for Reinforcement Learning

David Venuto

Agents that can autonomously navigate the web through a graphical user interface (GUI) using a unified action space (e.g., mouse and keyboar… (see more)d actions) can require very large amounts of domain-specific expert demonstrations to achieve good performance. Low sample efficiency is often exacerbated in sparse-reward and large-action-space environments, such as a web GUI, where only a few actions are relevant in any given situation. In this work, we consider the low-data regime, with limited or no access to expert behavior. To enable sample-efficient learning, we explore the effect of constraining the action space through

2025-03-04

ICLR.cc/2025/Workshop/DL4C (published)

openreview.net

Agency Is Frame-Dependent

David Abel

Andre Barreto

Michael Bowling

Will Dabney

Shi Dong

Steven Stenberg Hansen

Anna Harutyunyan

Clare Lyle

Razvan Pascanu

Georgios Piliouras

Doina Precup

Jonathan Richens

Mark Rowland

Tom Schaul

Satinder Singh

Agency is a system's capacity to steer outcomes toward a goal, and is a central topic of study across biology, philosophy, cognitive science… (see more), and artificial intelligence. Determining if a system exhibits agency is a notoriously difficult question: Dennett (1989), for instance, highlights the puzzle of determining which principles can decide whether a rock, a thermostat, or a robot each possess agency. We here address this puzzle from the viewpoint of reinforcement learning by arguing that agency is fundamentally frame-dependent: Any measurement of a system's agency must be made relative to a reference frame. We support this claim by presenting a philosophical argument that each of the essential properties of agency proposed by Barandiaran et al. (2009) and Moreno (2018) are themselves frame-dependent. We conclude that any basic science of agency requires frame-dependence, and discuss the implications of this claim for reinforcement learning.

2025-02-05

ArXiv (preprint)

Optimizing Return Distributions with Distributional Dynamic Programming

Bernardo Avila Pires

Mark Rowland

Diana Borsa

Zhaohan Daniel Guo

Andre Barreto

David Abel

Remi Munos

Will Dabney

We introduce distributional dynamic programming (DP) methods for optimizing statistical functionals of the return distribution, with standar… (see more)d reinforcement learning as a special case. Previous distributional DP methods could optimize the same class of expected utilities as classic DP. To go beyond expected utilities, we combine distributional DP with stock augmentation, a technique previously introduced for classic DP in the context of risk-sensitive RL, where the MDP state is augmented with a statistic of the rewards obtained so far (since the first time step). We find that a number of recently studied problems can be formulated as stock-augmented return distribution optimization, and we show that we can use distributional DP to solve them. We analyze distributional value and policy iteration, with bounds and a study of what objectives these distributional DP methods can or cannot optimize. We describe a number of applications outlining how to use distributional DP to solve different stock-augmented return distribution optimization problems, for example maximizing conditional value-at-risk, and homeostatic regulation. To highlight the practical potential of stock-augmented return distribution optimization and distributional DP, we combine the core ideas of distributional value iteration with the deep RL agent DQN, and empirically evaluate it for solving instances of the applications discussed.

2025-01-21

ArXiv (preprint)

A Unifying Framework for Action-Conditional Self-Predictive Reinforcement Learning

Zhaohan Daniel Guo

Bernardo Avila Pires

Yunhao Tang

Clare Lyle

Mark Rowland

Nicolas Heess

Diana Borsa

Arthur Guez

Will Dabney

2025-01-21

aistats.org/AISTATS/2025/Conference (poster)