Marc Gendron-Bellemare

adrien.taiga@mila.quebec

Harley Wiltzer

PhD - McGill University

Principal supervisor :

PhD - McGill University

Co-supervisor :

PhD - Université de Montréal

Principal supervisor :

pierluca.doro@mila.quebec

schwarzm@mila.quebec

Nate Rahn

PhD - McGill University

Co-supervisor :

Doina Precup

nathan.rahn@mila.quebec

Pierluca D'Oro

PhD - Université de Montréal

Principal supervisor :

Pierre-Luc Bacon

PhD - Université de Montréal

Principal supervisor :

Publications

Proto-Value Networks: Scaling Representation Learning with Auxiliary Tasks

Jesse Farebrother

Joshua Greaves

Rishabh Agarwal

Charline Le Lan

Ross Goroshin

Auxiliary tasks improve the representations learned by deep reinforcement learning agents. Analytically, their effect is reasonably well-und… (see more)erstood; in practice, how-ever, their primary use remains in support of a main learning objective, rather than as a method for learning representations. This is perhaps surprising given that many auxiliary tasks are defined procedurally, and hence can be treated as an essentially infinite source of information about the environment. Based on this observation, we study the effectiveness of auxiliary tasks for learning rich representations, focusing on the setting where the number of tasks and the size of the agent’s network are simultaneously increased. For this purpose, we derive a new family of auxiliary tasks based on the successor measure. These tasks are easy to implement and have appealing theoretical properties. Combined with a suitable off-policy learning rule, the result is a representation learning algorithm that can be understood as extending Mahadevan & Maggioni (2007)’s proto-value functions to deep reinforcement learning – accordingly, we call the resulting object proto-value networks. Through a series of experiments on the Arcade Learning Environment, we demonstrate that proto-value networks produce rich features that may be used to obtain performance comparable to established algorithms, using only linear approximation and a small number (~4M) of interactions with the environment’s reward function.

2023-02-01

ICLR.cc/2023/Conference (poster)

Sample-Efficient Reinforcement Learning by Breaking the Replay Ratio Barrier

Pierluca D'Oro

Max Schwarzer

Evgenii Nikishin

Pierre-Luc Bacon

Increasing the replay ratio, the number of updates of an agent's parameters per environment interaction, is an appealing strategy for improv… (see more)ing the sample efficiency of deep reinforcement learning algorithms. In this work, we show that fully or partially resetting the parameters of deep reinforcement learning agents causes better replay ratio scaling capabilities to emerge. We push the limits of the sample efficiency of carefully-modified algorithms by training them using an order of magnitude more updates than usual, significantly improving their performance in the Atari 100k and DeepMind Control Suite benchmarks. We then provide an analysis of the design choices required for favorable replay ratio scaling to be possible and discuss inherent limits and tradeoffs.

2023-02-01

ICLR.cc/2023/Conference (notable)

Bigger, Better, Faster: Human-level Atari with human-level efficiency

Max Schwarzer

Johan Samir Obando Ceron

Rishabh Agarwal

We introduce a value-based RL agent, which we call BBF, that achieves super-human performance in the Atari 100K benchmark. BBF relies on sca… (see more)ling the neural networks used for value estimation, as well as a number of other design choices that enable this scaling in a sample-efficient manner. We conduct extensive analyses of these design choices and provide insights for future work. We end with a discussion about updating the goalposts for sample-efficient RL research on the ALE. We make our code and data publicly available at https://github.com/google-research/google-research/tree/master/bigger_better_faster.

2023-01-01

ICML (published)

The Small Batch Size Anomaly in Multistep Deep Reinforcement Learning

Johan Samir Obando Ceron

2023-01-01

Tiny Papers @ ICLR (published)

The Statistical Benefits of Quantile Temporal-Difference Learning for Value Estimation

Mark Rowland

Yunhao Tang

Clare Lyle

Remi Munos

Will Dabney

We study the problem of temporal-difference-based policy evaluation in reinforcement learning. In particular, we analyse the use of a distri… (see more)butional reinforcement learning algorithm, quantile temporal-difference learning (QTD), for this task. We reach the surprising conclusion that even if a practitioner has no interest in the return distribution beyond the mean, QTD (which learns predictions about the full distribution of returns) may offer performance superior to approaches such as classical TD learning, which predict only the mean return, even in the tabular setting.

2023-01-01

ICML (published)

The Statistical Benefits of Quantile Temporal-Difference Learning for Value Estimation

Mark Rowland

Yunhao Tang

Clare Lyle

Remi Munos

Will Dabney

2023-01-01

ICML (published)

Variance Double-Down: The Small Batch Size Anomaly in Multistep Deep Reinforcement Learning

Johan Samir Obando Ceron

In deep reinforcement learning, multi-step learning is almost unavoidable to achieve state-of-the-art performance. However, the increased va… (see more)riance that multistep learning brings makes it difficult to increase the update horizon beyond relatively small numbers. In this paper, we report the counterintuitive finding that decreasing the batch size parameter improves the performance of many standard deep RL agents that use multi-step learning. It is well-known that gradient variance decreases with increasing batch sizes, so obtaining improved performance by increasing variance on two fronts is a rather surprising finding. We conduct a broad set of experiments to better understand what we call the variance doubledown phenomenon.

2022-12-09

NeurIPS.cc/2022/Workshop/DeepRL (unknown)

Reincarnating Reinforcement Learning: Reusing Prior Computation to Accelerate Progress

Rishabh Agarwal

Max Schwarzer

Metrics and continuity in reinforcement learning

Charline Le Lan

2021-05-18

Proceedings of the AAAI Conference on Artificial Intelligence (published)

arxiv.org

Contrastive Behavioral Similarity Embeddings for Generalization in Reinforcement Learning

Rishabh Agarwal

Marlos C. Machado

Reinforcement learning methods trained on few environments rarely learn policies that generalize to unseen environments. To improve generali… (see more)zation, we incorporate the inherent sequential structure in reinforcement learning into the representation learning process. This approach is orthogonal to recent approaches, which rarely exploit this structure explicitly. Specifically, we introduce a theoretically motivated policy similarity metric (PSM) for measuring behavioral similarity between states. PSM assigns high similarity to states for which the optimal policies in those states as well as in future states are similar. We also present a contrastive representation learning procedure to embed any state similarity metric, which we instantiate with PSM to obtain policy similarity embeddings (PSEs). We demonstrate that PSEs improve generalization on diverse benchmarks, including LQR with spurious correlations, a jumping task from pixels, and Distracting DM Control Suite.

2021-01-01

ICLR (published)

Deep Reinforcement Learning at the Edge of the Statistical Precipice

Rishabh Agarwal

Max Schwarzer

Deep reinforcement learning (RL) algorithms are predominantly evaluated by comparing their relative performance on a large suite of tasks. M… (see more)ost published results on deep RL benchmarks compare point estimates of aggregate performance such as mean and median scores across tasks, ignoring the statistical uncertainty implied by the use of a finite number of training runs. Beginning with the Arcade Learning Environment (ALE), the shift towards computationally-demanding benchmarks has led to the practice of evaluating only a small number of runs per task, exacerbating the statistical uncertainty in point estimates. In this paper, we argue that reliable evaluation in the few run deep RL regime cannot ignore the uncertainty in results without running the risk of slowing down progress in the field. We illustrate this point using a case study on the Atari 100k benchmark, where we find substantial discrepancies between conclusions drawn from point estimates alone versus a more thorough statistical analysis. With the aim of increasing the field's confidence in reported results with a handful of runs, we advocate for reporting interval estimates of aggregate performance and propose performance profiles to account for the variability in results, as well as present more robust and efficient aggregate metrics, such as interquartile mean scores, to achieve small uncertainty in results. Using such statistical tools, we scrutinize performance evaluations of existing algorithms on other widely used RL benchmarks including the ALE, Procgen, and the DeepMind Control Suite, again revealing discrepancies in prior comparisons. Our findings call for a change in how we evaluate performance in deep RL, for which we present a more rigorous evaluation methodology, accompanied with an open-source library rliable, to prevent unreliable results from stagnating the field. This work received an outstanding paper award at NeurIPS 2021.

Autonomous navigation of stratospheric balloons using reinforcement learning

S. Candido

Jun Gong

Marlos C. Machado

Subhodeep Moitra

Sameera S. Ponda

Ziyun Wang

2020-12-01

Nature (published)