Rishabh Agarwal

Associate Industry Member

Adjunct Professor, McGill University, School of Computer Science

Google DeepMind

Research Topics

Deep Learning

Large Language Models (LLM)

Reinforcement Learning

Website

Google Scholar

Biography

I am a research scientist on the Google DeepMind team in Montréal. I am also an Adjunct Professor at McGill University and an Associate Industry Member at Mila - Quebec Artificial Intelligence Institute. I finished my PhD at Mila under the guidance of Aaron Courville and Marc Bellemare. Previously, I spent a year on Geoffrey Hinton's amazing team at Google Brain, Toronto. Earlier, I graduated in Computer Science and Engineering from IIT Bombay.

My research mainly revolves around language models and deep reinforcement learning (RL), and has received an outstanding paper award at NeurIPS.

Current Students

Morgane Moss

PhD - Université de Montréal

Principal supervisor:

Aaron Courville

Blog Posts

November 25, 2022

Beyond Tabula Rasa: Reincarnating Reinforcement Learning

Max Schwarzer

Rishabh Agarwal

Read the article

Publications

DR3: Value-Based Deep Reinforcement Learning Requires Explicit Regularization

Aviral Kumar

Rishabh Agarwal

Tengyu Ma

Aaron Courville

George Tucker

Sergey Levine

Despite overparameterization, deep networks trained via supervised learning are surprisingly easy to optimize and exhibit excellent generalization. One hypothesis to explain this is that overparameterized deep networks enjoy the benefits of implicit regularization induced by stochastic gradient descent, which favors parsimonious solutions that generalize well on test inputs. It is reasonable to surmise that deep reinforcement learning (RL) methods could also benefit from this effect. In this paper, we discuss how the implicit regularization effect of SGD seen in supervised learning could in fact be harmful in the offline deep RL setting, leading to poor generalization and degenerate feature representations. Our theoretical analysis shows that when existing models of implicit regularization are applied to temporal difference learning, the resulting derived regularizer favors degenerate solutions with excessive aliasing, in stark contrast to the supervised learning case. We back up these findings empirically, showing that feature representations learned by a deep network value function trained via bootstrapping can indeed become degenerate, aliasing the representations for state-action pairs that appear on either side of the Bellman backup. To address this issue, we derive the form of this implicit regularizer and, inspired by this derivation, propose a simple and effective explicit regularizer, called DR3, that counteracts the undesirable effects of this implicit regularizer. When combined with existing offline RL methods, DR3 substantially improves performance and stability, alleviating unlearning in Atari 2600 games, D4RL domains, and robotic manipulation from images.

2022-01-27

ICLR.cc/2022/Conference (spotlight)

openreview.net
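The abstract above does not spell out the form of the DR3 regularizer. The sketch below assumes it penalizes the dot product between the learned feature vectors of the two state-action pairs on either side of a Bellman backup, which is the aliasing the abstract describes. The function names, the coefficient `c0`, and the toy feature vectors are hypothetical and chosen only for illustration.

```python
def feature_dot_penalty(features, next_features):
    """Mean dot product between features of (s, a) and (s', a') pairs.

    Assumed form of the explicit regularizer: a large value means the two
    sides of the Bellman backup have similar (aliased) representations.
    """
    dots = [
        sum(x * y for x, y in zip(f, g))
        for f, g in zip(features, next_features)
    ]
    return sum(dots) / len(dots)


def regularized_loss(td_loss, features, next_features, c0=0.03):
    """TD loss plus the weighted penalty; c0 is a hypothetical coefficient."""
    return td_loss + c0 * feature_dot_penalty(features, next_features)


# Toy batch of two transitions with 2-dimensional features.
features = [[1.0, 0.0], [0.0, 2.0]]
next_features = [[1.0, 1.0], [3.0, 0.0]]
penalty = feature_dot_penalty(features, next_features)  # (1.0 + 0.0) / 2

# Fully aliased features (identical on both sides) give a larger penalty.
aliased_penalty = feature_dot_penalty([[1.0, 0.0]], [[1.0, 0.0]])
```

In a real agent the features would come from the penultimate layer of the value network, and the penalty would be added to whatever offline RL objective is already in use.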

Reincarnating Reinforcement Learning: Reusing Prior Computation to Accelerate Progress

Bellemare Marc-Emmanuel

Learning tabula rasa, that is without any prior knowledge, is the prevalent workflow in reinforcement learning (RL) research. However, RL systems, when applied to large-scale settings, rarely operate tabula rasa. Such large-scale systems undergo multiple design or algorithmic changes during their development cycle and use ad hoc approaches for incorporating these changes without re-training from scratch, which would have been prohibitively expensive. Additionally, the inefficiency of deep RL typically excludes researchers without access to industrial-scale resources from tackling computationally-demanding problems. To address these issues, we present reincarnating RL as an alternative workflow or class of problem settings, where prior computational work (e.g., learned policies) is reused or transferred between design iterations of an RL agent, or from one RL agent to another. As a step towards enabling reincarnating RL from any agent to any other agent, we focus on the specific setting of efficiently transferring an existing sub-optimal policy to a standalone value-based RL agent. We find that existing approaches fail in this setting and propose a simple algorithm to address their limitations. Equipped with this algorithm, we demonstrate reincarnating RL's gains over tabula rasa RL on Atari 2600 games, a challenging locomotion task, and the real-world problem of navigating stratospheric balloons. Overall, this work argues for an alternative approach to RL research, which we believe could significantly improve real-world RL adoption and help democratize it further. Open-sourced code and trained agents at https://agarwl.github.io/reincarnating_rl.

2021-12-31

Advances in Neural Information Processing Systems 35 (NeurIPS 2022) (published)

doi.org

openreview.net

Behavior Predictive Representations for Generalization in Reinforcement Learning

Siddhant Agarwal

Aaron Courville

Rishabh Agarwal

Deep reinforcement learning (RL) agents trained on a few environments often struggle to generalize to unseen environments, even when such environments are semantically equivalent to training environments. Such agents learn representations that overfit the characteristics of the training environments. We posit that generalization can be improved by assigning similar representations to scenarios with similar sequences of long-term optimal behavior. To do so, we propose behavior predictive representations (BPR) that capture long-term optimal behavior. BPR trains an agent to predict latent state representations multiple steps into the future such that these representations can predict the optimal behavior at the future steps. We demonstrate that BPR provides large gains on a jumping task from pixels, a problem designed to test generalization.

2021-12-12

NeurIPS.cc/2021/Workshop/DeepRL (unknown)

openreview.net

Contrastive Behavioral Similarity Embeddings for Generalization in Reinforcement Learning

Rishabh Agarwal

Marlos C. Machado

Pablo Samuel Castro

Bellemare Marc-Emmanuel

Reinforcement learning methods trained on few environments rarely learn policies that generalize to unseen environments. To improve generalization, we incorporate the inherent sequential structure in reinforcement learning into the representation learning process. This approach is orthogonal to recent approaches, which rarely exploit this structure explicitly. Specifically, we introduce a theoretically motivated policy similarity metric (PSM) for measuring behavioral similarity between states. PSM assigns high similarity to states for which the optimal policies in those states as well as in future states are similar. We also present a contrastive representation learning procedure to embed any state similarity metric, which we instantiate with PSM to obtain policy similarity embeddings (PSEs). We demonstrate that PSEs improve generalization on diverse benchmarks, including LQR with spurious correlations, a jumping task from pixels, and Distracting DM Control Suite.

2020-12-31

ICLR (published)

openreview.net

Deep Reinforcement Learning at the Edge of the Statistical Precipice

Bellemare Marc-Emmanuel

Deep reinforcement learning (RL) algorithms are predominantly evaluated by comparing their relative performance on a large suite of tasks. Most published results on deep RL benchmarks compare point estimates of aggregate performance such as mean and median scores across tasks, ignoring the statistical uncertainty implied by the use of a finite number of training runs. Beginning with the Arcade Learning Environment (ALE), the shift towards computationally-demanding benchmarks has led to the practice of evaluating only a small number of runs per task, exacerbating the statistical uncertainty in point estimates. In this paper, we argue that reliable evaluation in the few run deep RL regime cannot ignore the uncertainty in results without running the risk of slowing down progress in the field. We illustrate this point using a case study on the Atari 100k benchmark, where we find substantial discrepancies between conclusions drawn from point estimates alone versus a more thorough statistical analysis. With the aim of increasing the field's confidence in reported results with a handful of runs, we advocate for reporting interval estimates of aggregate performance and propose performance profiles to account for the variability in results, as well as present more robust and efficient aggregate metrics, such as interquartile mean scores, to achieve small uncertainty in results. Using such statistical tools, we scrutinize performance evaluations of existing algorithms on other widely used RL benchmarks including the ALE, Procgen, and the DeepMind Control Suite, again revealing discrepancies in prior comparisons. Our findings call for a change in how we evaluate performance in deep RL, for which we present a more rigorous evaluation methodology, accompanied with an open-source library rliable, to prevent unreliable results from stagnating the field.

2020-12-31

Advances in Neural Information Processing Systems 34 (NeurIPS 2021) (published)

doi.org

openreview.net
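To make the interquartile mean and interval estimates mentioned in the abstract above concrete, here is a minimal standard-library sketch. It is not the rliable API: the function names, the plain (unstratified) percentile bootstrap, and the toy scores are assumptions made for illustration.

```python
import random
from statistics import mean


def interquartile_mean(scores):
    """Mean of the middle 50% of scores (drops the bottom and top 25%)."""
    ordered = sorted(scores)
    cut = int(len(ordered) * 0.25)  # number of scores trimmed from each end
    return mean(ordered[cut:len(ordered) - cut])


def bootstrap_interval(scores, statistic, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for `statistic` over resampled scores."""
    rng = random.Random(seed)
    estimates = sorted(
        statistic(rng.choices(scores, k=len(scores)))
        for _ in range(n_resamples)
    )
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical normalized scores pooled over runs and tasks, with one outlier.
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 9.0]
iqm = interquartile_mean(scores)    # averages 0.3, 0.4, 0.5, 0.6
plain_mean = mean(scores)           # pulled upward by the 9.0 outlier
low, high = bootstrap_interval(scores, interquartile_mean)
```

The single outlier dominates the plain mean but is discarded by the interquartile mean, which is the robustness property the abstract refers to. A faithful evaluation would resample runs separately within each task rather than pooling all scores as done here.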

Revisiting Fundamentals of Experience Replay

William Fedus

Prajit Ramachandran

Rishabh Agarwal

Yoshua Bengio

Hugo Larochelle

Mark Rowland

Will Dabney

Experience replay is central to off-policy algorithms in deep reinforcement learning (RL), but there remain significant gaps in our understanding. We therefore present a systematic and extensive analysis of experience replay in Q-learning methods, focusing on two fundamental properties: the replay capacity and the ratio of learning updates to experience collected (replay ratio). Our additive and ablative studies upend conventional wisdom around experience replay: greater capacity is found to substantially increase the performance of certain algorithms, while leaving others unaffected. Counterintuitively, we show that theoretically ungrounded, uncorrected n-step returns are uniquely beneficial while other techniques confer limited benefit for sifting through larger memory. Separately, by directly controlling the replay ratio we contextualize previous observations in the literature and empirically measure its importance across a variety of deep RL algorithms. Finally, we conclude by testing a set of hypotheses on the nature of these performance benefits.

2020-11-20

Proceedings of the 37th International Conference on Machine Learning (published)

doi.org

proceedings.mlr.press
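The two quantities studied in the abstract above can be written down in a few lines. The sketch below assumes the standard definition of an uncorrected n-step target (discounted rewards plus a bootstrapped value, with no off-policy correction) and takes the replay ratio as defined in the abstract: learning updates per unit of experience collected. The function names and numbers are hypothetical.

```python
def n_step_target(rewards, bootstrap_value, gamma):
    """Uncorrected n-step target: discounted rewards plus a bootstrapped tail.

    No importance-sampling correction is applied, even though replayed
    rewards may come from an older policy.
    """
    target = 0.0
    for k, reward in enumerate(rewards):
        target += (gamma ** k) * reward
    return target + (gamma ** len(rewards)) * bootstrap_value


def replay_ratio(gradient_updates, environment_steps):
    """Learning updates performed per environment transition collected."""
    return gradient_updates / environment_steps


# Three replayed rewards followed by a bootstrapped value estimate.
target = n_step_target([1.0, 0.0, 2.0], bootstrap_value=10.0, gamma=0.5)
# 1.0 + 0.0 + 0.25 * 2.0 + 0.125 * 10.0 = 2.75

# One gradient update for every four environment steps.
ratio = replay_ratio(gradient_updates=1, environment_steps=4)
```

With `len(rewards) == 1` the target reduces to the usual one-step Q-learning target.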
