
Veronica Chelu

PhD - McGill
Principal supervisor
Research topics
Online learning
Reinforcement learning
Deep learning
Computational neuroscience
Optimization

Publications

Functional Acceleration for Policy Mirror Descent
We apply functional acceleration to the Policy Mirror Descent (PMD) general family of algorithms, which cover a wide range of novel and fundamental methods in Reinforcement Learning (RL). Leveraging duality, we propose a momentum-based PMD update. By taking the functional route, our approach is independent of the policy parametrization and applicable to large-scale optimization, covering previous applications of momentum at the level of policy parameters as a special case. We theoretically analyze several properties of this approach and complement with a numerical ablation study, which serves to illustrate the policy optimization dynamics on the value polytope, relative to different algorithmic design choices in this space. We further characterize numerically several features of the problem setting relevant for functional acceleration, and lastly, we investigate the impact of approximation on their learning mechanics.
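The momentum-based PMD update is defined at the functional level, independently of any policy parametrization. As a rough, tabular illustration of the general idea only, the sketch below applies a heavy-ball-style momentum term in the dual (logit) space of a negative-entropy mirror map; the toy MDP, step size `eta`, and momentum coefficient `beta` are illustrative assumptions, not the paper's algorithm or experimental setup.

```python
# Illustrative sketch only: tabular policy mirror descent (negative-entropy
# mirror map => softmax policies) with momentum applied in the dual space.
import numpy as np

def q_pi(P, R, pi, gamma=0.9):
    """Exact Q^pi for a tabular MDP with P[s, a, s'] and R[s, a]."""
    S, A = R.shape
    P_pi = np.einsum("sa,sab->sb", pi, P)   # state-to-state transitions under pi
    r_pi = np.einsum("sa,sa->s", pi, R)     # expected immediate reward under pi
    v = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    return R + gamma * P @ v                # Q[s, a] = R[s, a] + gamma E[V(s')]

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def momentum_pmd(P, R, gamma=0.9, eta=1.0, beta=0.5, iters=100):
    """PMD step in the dual (logit) space plus a heavy-ball momentum term."""
    S, A = R.shape
    z, z_prev = np.zeros((S, A)), np.zeros((S, A))
    for _ in range(iters):
        pi = softmax(z)
        q = q_pi(P, R, pi, gamma)
        z_next = z + eta * q + beta * (z - z_prev)   # momentum on the duals
        z_prev, z = z, z_next
    return softmax(z)

# toy random MDP, for illustration only
rng = np.random.default_rng(0)
S, A = 4, 2
P = rng.dirichlet(np.ones(S), size=(S, A))   # shape (S, A, S)
R = rng.uniform(size=(S, A))
print(momentum_pmd(P, R).round(3))
```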
Recurrent Policies Are Not Enough for Continual Reinforcement Learning
Nathan Samuel de Lara
Continual Reinforcement Learning (CRL) aims to develop algorithms that adapt to non-stationary sequences of tasks. A promising recent approach utilizes Recurrent Neural Networks (RNNs) to learn contextual Markov Decision Process (MDP) embeddings. This enables a reinforcement learning (RL) agent to discern the optimality of actions across diverse tasks. In this study, we examine two critical failure modes in the learning of these contextual MDP embeddings. Specifically, we find that RNNs are prone to catastrophic forgetting, manifesting in two distinct ways: (i) embedding collapse---where agents initially learn a contextual task structure that later collapses to a single task, and (ii) embedding drift---where learning embeddings for new MDPs interferes with embeddings the RNN outputs for previous MDPs in the sequence, leading to suboptimal performance of downstream policy networks conditioned on stale embeddings. We explore the effects of various objective functions and network architectures concerning these failure modes, revealing that one of these modes consistently emerges across different setups.
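As a toy illustration (not code from the paper), the sketch below shows one way the two failure modes could be quantified from hypothetical RNN context embeddings recorded before and after training on a new task: near-identical embeddings across different tasks signal collapse, and movement of an old task's embedding signals drift. The arrays, cosine measure, and threshold are illustrative assumptions.

```python
# Illustrative diagnostics for the two failure modes described above.
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def embedding_collapse(emb, threshold=0.99):
    """Collapse: embeddings of *different* tasks become nearly identical."""
    n = len(emb)
    sims = [cosine(emb[i], emb[j]) for i in range(n) for j in range(i + 1, n)]
    return np.mean(sims) > threshold

def embedding_drift(emb_before, emb_after):
    """Drift: how far each old task's embedding moved after new-task training."""
    return np.array([1.0 - cosine(b, a) for b, a in zip(emb_before, emb_after)])

# hypothetical embeddings for 3 tasks, 16 dimensions each
rng = np.random.default_rng(0)
emb_before = rng.normal(size=(3, 16))
emb_after = emb_before + 0.1 * rng.normal(size=(3, 16))
print(embedding_collapse(emb_after), embedding_drift(emb_before, emb_after).round(3))
```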
Acceleration in Policy Optimization
Tom Zahavy
Arthur Guez
Sebastian Flennerhag
We work towards a unifying paradigm for accelerating policy optimization methods in reinforcement learning (RL) through predictive and adaptive directions of (functional) policy ascent. Leveraging the connection between policy iteration and policy gradient methods, we view policy optimization algorithms as iteratively solving a sequence of surrogate objectives, local lower bounds on the original objective. We define optimism as predictive modelling of the future behavior of a policy, and hindsight adaptation as taking immediate and anticipatory corrective actions to mitigate accumulating errors from overshooting predictions or delayed responses to change. We use this shared lens to jointly express other well-known algorithms, including model-based policy improvement based on forward search, and optimistic meta-learning algorithms. We show connections with Anderson acceleration, Nesterov's accelerated gradient, extra-gradient methods, and linear extrapolation in the update rule. We analyze properties of the formulation, design an optimistic policy gradient algorithm, adaptive via meta-gradient learning, and empirically highlight several design choices pertaining to acceleration, in an illustrative task.
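As a minimal sketch of the kind of optimistic (extrapolated) update discussed above, assuming a softmax policy over a toy bandit: the ascent direction is evaluated at a look-ahead point, in the spirit of the Nesterov/extra-gradient connections mentioned in the abstract. The reward vector, step size `alpha`, and extrapolation coefficient `beta` are illustrative, not the paper's algorithm or experiments.

```python
# Illustrative optimistic policy-gradient step on a 3-armed bandit.
import numpy as np

rewards = np.array([0.2, 0.5, 0.9])   # hypothetical arm means

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def grad_expected_reward(theta):
    """Exact gradient of E_{a~softmax(theta)}[rewards[a]] w.r.t. theta."""
    pi = softmax(theta)
    j = pi @ rewards
    return pi * (rewards - j)

theta, theta_prev = np.zeros(3), np.zeros(3)
alpha, beta = 0.5, 0.7
for _ in range(200):
    lookahead = theta + beta * (theta - theta_prev)   # predictive / optimistic point
    g = grad_expected_reward(lookahead)               # ascent direction at look-ahead
    theta_prev, theta = theta, theta + alpha * g
print(softmax(theta).round(3))                        # mass concentrates on the best arm
```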
Optimism and Adaptivity in Policy Optimization
Tom Zahavy
Arthur Guez
Sebastian Flennerhag
A Generalized Bootstrap Target for Value-Learning, Efficiently Combining Value and Feature Predictions
Selective Credit Assignment
Diana Borsa
Hado Philip van Hasselt
A Generalized Bootstrap Target for Value-Learning, Efficiently Combining Value and Feature Predictions
Estimating value functions is a core component of reinforcement learning algorithms. Temporal difference (TD) learning algorithms use bootstrapping, i.e. they update the value function toward a learning target using value estimates at subsequent time-steps. Alternatively, the value function can be updated toward a learning target constructed by separately predicting successor features (SF)—a policy-dependent model—and linearly combining them with instantaneous rewards. We focus on bootstrapping targets used when estimating value functions, and propose a new backup target, the η-return mixture, which implicitly combines value-predictive knowledge (used by TD methods) with (successor) feature-predictive knowledge—with a parameter η capturing how much to rely on each. We illustrate that incorporating predictive knowledge through an ηγ-discounted SF model makes more efficient use of sampled experience, compared to either extreme, i.e. bootstrapping entirely on the value function estimate, or bootstrapping on the product of separately estimated successor features and instantaneous reward models. We empirically show this approach leads to faster policy evaluation and better control performance, for tabular and nonlinear function approximations, indicating scalability and generality.
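The sketch below is a simplified stand-in for the trade-off described above: a plain convex combination of the TD bootstrap on V and a bootstrap built from separately estimated successor features `psi` and reward weights `w`. It illustrates the interpolation between the two extremes only; it is not the paper's exact ηγ-discounted η-return construction, and all quantities are hypothetical.

```python
# Illustrative mixed bootstrap target interpolating value- and
# feature-predictive knowledge (simplified; not the paper's exact target).
import numpy as np

def mixed_target(r, s_next, V, psi, w, eta, gamma=0.99):
    """Bootstrap target mixing the TD extreme (eta=1) and the SF extreme (eta=0)."""
    td_bootstrap = V[s_next]          # value-predictive bootstrap
    sf_bootstrap = psi[s_next] @ w    # successor features x reward weights
    return r + gamma * (eta * td_bootstrap + (1.0 - eta) * sf_bootstrap)

# hypothetical 5-state, 5-feature tabular quantities
rng = np.random.default_rng(0)
V = rng.normal(size=5)
psi = rng.normal(size=(5, 5))
w = rng.normal(size=5)
print(mixed_target(r=1.0, s_next=2, V=V, psi=psi, w=w, eta=0.5))
```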
Forethought and Hindsight in Credit Assignment
Hado van Hasselt
We address the problem of credit assignment in reinforcement learning and explore fundamental questions regarding the way in which an agent can best use additional computation to propagate new information, by planning with internal models of the world to improve its predictions. Particularly, we work to understand the gains and peculiarities of planning employed as forethought via forward models or as hindsight operating with backward models. We establish the relative merits, limitations and complementary properties of both planning mechanisms in carefully constructed scenarios. Further, we investigate the best use of models in planning, primarily focusing on the selection of states in which predictions should be (re)-evaluated. Lastly, we discuss the issue of model estimation and highlight a spectrum of methods that stretch from explicit environment-dynamics predictors to more abstract planner-aware models.
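A minimal tabular sketch contrasting the two planning directions discussed above, under assumed toy forward and backward models: forethought simulates where a state leads and updates that state's value, while hindsight propagates new information at a state back to its likely predecessors. The chain MDP, the model functions, and the step size are illustrative assumptions, not the paper's setup.

```python
# Illustrative forethought (forward-model) vs. hindsight (backward-model) updates.
import numpy as np

def forward_planning_update(V, s, fwd_model, alpha=0.1, gamma=0.99):
    """Forethought: use a forward model (s -> r, s') to update V(s)."""
    r, s_next = fwd_model(s)
    V[s] += alpha * (r + gamma * V[s_next] - V[s])

def backward_planning_update(V, s, bwd_model, alpha=0.1, gamma=0.99):
    """Hindsight: push new information at s back to predecessor states."""
    for s_prev, r in bwd_model(s):
        V[s_prev] += alpha * (r + gamma * V[s] - V[s_prev])

# hypothetical 3-state chain 0 -> 1 -> 2, reward 1 on reaching state 2
fwd = lambda s: (1.0 if s == 1 else 0.0, min(s + 1, 2))
bwd = lambda s: [(s - 1, 1.0 if s == 2 else 0.0)] if s > 0 else []

V = np.zeros(3)
backward_planning_update(V, 2, bwd)   # information at state 2 reaches state 1
forward_planning_update(V, 0, fwd)    # state 0 looks ahead to state 1
print(V.round(3))
```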