Harley Wiltzer

Convergence Theorems for Entropy-Regularized and Distributional Reinforcement Learning

Yash Jhaveri

Patrick Shafto

2025-10-09

arXiv (preprint)

Zero-Shot Constraint Satisfaction with Forward-Backward Representations

Adriana Hugessen

Cyrus Neary

Amy Zhang

Glen Berseth

Traditionally, constrained policy optimization with Reinforcement Learning (RL) requires learning a new policy from scratch for any new environment, goal or cost function, with limited generalization to new tasks and constraints. Given the sample inefficiency of many common deep RL methods, this procedure can be impractical for many real-world scenarios, particularly when constraints or tasks are changing. As an alternative, in the unconstrained setting, various works have sought to pre-train representations from offline datasets to accelerate policy optimization upon specification of a reward. Such methods can permit faster adaptation to new tasks in a given environment, dramatically improving sample efficiency. Recently, zero-shot policy optimization has been explored by leveraging a particular

2025-07-01

rl-conference.cc/RLC/2025/Workshop/RLBrew (published)

Tractable Representations for Convergent Approximation of Distributional HJB Equations

Julie Alhosh

2025-03-07

arXiv (preprint)

Tractable Representations for Convergent Approximation of Distributional HJB Equations

Julie Alhosh

2025-03-01

arXiv (published)

Non-Adversarial Inverse Reinforcement Learning via Successor Feature Matching

Sanjiban Choudhury

2025-01-22

ICLR.cc/2025/Conference (poster)

Non-Adversarial Inverse Reinforcement Learning via Successor Feature Matching

Sanjiban Choudhury

In inverse reinforcement learning (IRL), an agent seeks to replicate expert demonstrations through interactions with the environment. Traditionally, IRL is treated as an adversarial game, where an adversary searches over reward models, and a learner optimizes the reward through repeated RL procedures. This game-solving approach is both computationally expensive and difficult to stabilize. In this work, we propose a novel approach to IRL by direct policy optimization: exploiting a linear factorization of the return as the inner product of successor features and a reward vector, we design an IRL algorithm by policy gradient descent on the gap between the learner and expert features. Our non-adversarial method does not require learning a reward function and can be solved seamlessly with existing actor-critic RL algorithms. Remarkably, our approach works in state-only settings without expert action labels, a setting which behavior cloning (BC) cannot solve. Empirical results demonstrate that our method learns from as few as a single expert demonstration and achieves improved performance on various control tasks.

2024-11-11

arXiv (preprint)

Action Gaps and Advantages in Continuous-Time Distributional Reinforcement Learning

Patrick Shafto

Yash Jhaveri

2024-09-25

NeurIPS.cc/2024/Conference (poster)

Simplifying Constraint Inference with Inverse Reinforcement Learning

Adriana Hugessen

Glen Berseth

2024-09-25

NeurIPS.cc/2024/Conference (poster)

Revisiting Successor Features for Inverse Reinforcement Learning

Sanjiban Choudhury

2024-06-17

ICML.cc/2024/Workshop/MFHAIA (poster)

A Distributional Analogue to the Successor Representation

Jesse Farebrother

Arthur Gretton

Yunhao Tang

Andre Barreto

Will Dabney

Mark Rowland

This paper contributes a new approach for distributional reinforcement learning which elucidates a clean separation of transition structure and reward in the learning process. Analogous to how the successor representation (SR) describes the expected consequences of behaving according to a given policy, our distributional successor measure (SM) describes the distributional consequences of this behaviour. We formulate the distributional SM as a distribution over distributions and provide theory connecting it with distributional and model-based reinforcement learning. Moreover, we propose an algorithm that learns the distributional SM from data by minimizing a two-level maximum mean discrepancy. Key to our method are a number of algorithmic techniques that are independently valuable for learning generative models of state. As an illustration of the usefulness of the distributional SM, we show that it enables zero-shot risk-sensitive policy evaluation in a way that was not previously possible.

2024-05-01

ICML.cc/2024/Conference (spotlight)

Policy Optimization in a Noisy Neighborhood: On Return Landscapes in Continuous Control

Nathan Rahn

Pierluca D'Oro

Pierre-Luc Bacon

Deep reinforcement learning agents for continuous control are known to exhibit significant instability in their performance over time. In this work, we provide a fresh perspective on these behaviors by studying the return landscape: the mapping between a policy and a return. We find that popular algorithms traverse noisy neighborhoods of this landscape, in which a single update to the policy parameters leads to a wide range of returns. By taking a distributional view of these returns, we map the landscape, characterizing failure-prone regions of policy space and revealing a hidden dimension of policy quality. We show that the landscape exhibits surprising structure by finding simple paths in parameter space which improve the stability of a policy. To conclude, we develop a distribution-aware procedure which finds such paths, navigating away from noisy neighborhoods in order to improve the robustness of a policy. Taken together, our results provide new insight into the optimization, evaluation, and design of agents.