Pablo Samuel Castro

Matthieu. Geist

The core challenge of offline reinforcement learning (RL) is dealing with the (potentially catastrophic) extrapolation error induced by the … (voir plus)distribution shift between the history dataset and the desired policy. A large portion of prior work tackles this challenge by implicitly/explicitly regularizing the learning policy towards the behavior policy, which is hard to estimate reliably in practice. In this work, we propose to regularize towards the Q-function of the behavior policy instead of the behavior policy itself, under the premise that the Q-function can be estimated more reliably and easily by a SARSA-style estimate and handles the extrapolation error more straightforwardly. We propose two algorithms taking advantage of the estimated Q-function through regularizations, and demonstrate they exhibit strong performance on the D4RL benchmarks.

2023-07-25

ArXiv (prépublication)

arxiv.org

Discovering the Electron Beam Induced Transition Rates for Silicon Dopants in Graphene with Deep Neural Networks in the STEM

Kevin M Roccapriore

Max Schwarzer

Joshua Greaves

Jesse Farebrother

Rishabh Agarwal

Colton Bishop

Maxim Ziatdinov

Igor Mordatch

Ekin Dogus Cubuk

Aaron Courville

Sergei V Kalinin

2023-07-22

Microscopy and Microanalysis (publié)

Bigger, Better, Faster: Human-level Atari with human-level efficiency

Max Schwarzer

Johan Samir Obando Ceron

Aaron Courville

Rishabh Agarwal

We introduce a value-based RL agent, which we call BBF, that achieves super-human performance in the Atari 100K benchmark. BBF relies on sca… (voir plus)ling the neural networks used for value estimation, as well as a number of other design choices that enable this scaling in a sample-efficient manner. We conduct extensive analyses of these design choices and provide insights for future work. We end with a discussion about updating the goalposts for sample-efficient RL research on the ALE. We make our code and data publicly available at https://github.com/google-research/google-research/tree/master/bigger_better_faster.

2023-07-03

Proceedings of the 40th International Conference on Machine Learning (publié)

A Kernel Perspective on Behavioural Metrics for Markov Decision Processes

Tyler Kastner

Prakash Panangaden

Mark Rowland

We present a novel perspective on behavioural metrics for Markov decision processes via the use of positive definite kernels. We define a ne… (voir plus)w metric under this lens that is provably equivalent to the recently introduced MICo distance (Castro et al., 2021). The kernel perspective enables us to provide new theoretical results, including value-function bounds and low-distortion finite-dimensional Euclidean embeddings, which are crucial when using behavioural metrics for reinforcement learning representations. We complement our theory with strong empirical results that demonstrate the effectiveness of these methods in practice.

2023-06-19

TMLR (accepté)

Proto-Value Networks: Scaling Representation Learning with Auxiliary Tasks

Jesse Farebrother

Joshua Greaves

Rishabh Agarwal

Charline Le Lan

Ross Goroshin

Auxiliary tasks improve the representations learned by deep reinforcement learning agents. Analytically, their effect is reasonably well-und… (voir plus)erstood; in practice, how-ever, their primary use remains in support of a main learning objective, rather than as a method for learning representations. This is perhaps surprising given that many auxiliary tasks are defined procedurally, and hence can be treated as an essentially infinite source of information about the environment. Based on this observation, we study the effectiveness of auxiliary tasks for learning rich representations, focusing on the setting where the number of tasks and the size of the agent’s network are simultaneously increased. For this purpose, we derive a new family of auxiliary tasks based on the successor measure. These tasks are easy to implement and have appealing theoretical properties. Combined with a suitable off-policy learning rule, the result is a representation learning algorithm that can be understood as extending Mahadevan & Maggioni (2007)’s proto-value functions to deep reinforcement learning – accordingly, we call the resulting object proto-value networks. Through a series of experiments on the Arcade Learning Environment, we demonstrate that proto-value networks produce rich features that may be used to obtain performance comparable to established algorithms, using only linear approximation and a small number (~4M) of interactions with the environment’s reward function.

2023-02-01

ICLR.cc/2023/Conference (poster)

Bigger, Better, Faster: Human-level Atari with human-level efficiency

Max Schwarzer

Johan Samir Obando Ceron

Aaron Courville

Rishabh Agarwal

2023-01-01

ICML (publié)

The Dormant Neuron Phenomenon in Deep Reinforcement Learning

Ghada Sokar

Rishabh Agarwal

Utku Evci

In this work we identify the dormant neuron phenomenon in deep reinforcement learning, where an agent's network suffers from an increasing n… (voir plus)umber of inactive neurons, thereby affecting network expressivity. We demonstrate the presence of this phenomenon across a variety of algorithms and environments, and highlight its effect on learning. To address this issue, we propose a simple and effective method (ReDo) that Recycles Dormant neurons throughout training. Our experiments demonstrate that ReDo maintains the expressive power of networks by reducing the number of dormant neurons and results in improved performance.

2023-01-01

ICML (publié)

The Small Batch Size Anomaly in Multistep Deep Reinforcement Learning

Johan Samir Obando Ceron

2023-01-01

Tiny Papers @ ICLR (published)

Variance Double-Down: The Small Batch Size Anomaly in Multistep Deep Reinforcement Learning

Johan Samir Obando Ceron

In deep reinforcement learning, multi-step learning is almost unavoidable to achieve state-of-the-art performance. However, the increased va… (voir plus)riance that multistep learning brings makes it difficult to increase the update horizon beyond relatively small numbers. In this paper, we report the counterintuitive finding that decreasing the batch size parameter improves the performance of many standard deep RL agents that use multi-step learning. It is well-known that gradient variance decreases with increasing batch sizes, so obtaining improved performance by increasing variance on two fronts is a rather surprising finding. We conduct a broad set of experiments to better understand what we call the variance doubledown phenomenon.

2022-12-09

NeurIPS.cc/2022/Workshop/DeepRL (inconnu)