Portrait of Rishabh Agarwal

Rishabh Agarwal

Associate Industry Member
Adjunct Professor, McGill University, School of Computer Science
Google DeepMind
Research Topics
Deep Learning
Large Language Models (LLM)
Reinforcement Learning

Biography

I am a research scientist in the Google DeepMind Team in Montréal. I am also an Adjunct Professor at McGill University and an Associate Industry Member at Mila - Quebec Artificial Intelligence Institute. I finished my PhD at Mila under the guidance of Aaron Courville and Marc Bellemare. Previously, I spent a year at Geoffrey Hinton's amazing team in Google Brain, Toronto. Earlier, I graduated in Computer Science and Engineering from IIT Bombay.

My research work mainly revolves around language models and deep reinforcement learning (RL), and includes an outstanding paper award at NeurIPS.

Current Students

PhD - Université de Montréal
Principal supervisor :

Publications

V-STaR: Training Verifiers for Self-Taught Reasoners
Common self-improvement approaches for large language models (LLMs), such as STaR, iteratively fine-tune LLMs on self-generated solutions to… (see more) improve their problem-solving ability. However, these approaches discard the large amounts of incorrect solutions generated during this process, potentially neglecting valuable information in such solutions. To address this shortcoming, we propose V-STaR that utilizes both the correct and incorrect solutions generated during the self-improvement process to train a verifier using DPO that judges correctness of model-generated solutions. This verifier is used at inference time to select one solution among many candidate solutions. Running V-STaR for multiple iterations results in progressively better reasoners and verifiers, delivering a 4% to 17% test accuracy improvement over existing self-improvement and verification approaches on common code generation and math reasoning benchmarks with LLaMA2 models.
V-STaR: Training Verifiers for Self-Taught Reasoners
Common self-improvement approaches for large language models (LLMs), such as STaR, iteratively fine-tune LLMs on self-generated solutions to… (see more) improve their problem-solving ability. However, these approaches discard the large amounts of incorrect solutions generated during this process, potentially neglecting valuable information in such solutions. To address this shortcoming, we propose V-STaR that utilizes both the correct and incorrect solutions generated during the self-improvement process to train a verifier using DPO that judges correctness of model-generated solutions. This verifier is used at inference time to select one solution among many candidate solutions. Running V-STaR for multiple iterations results in progressively better reasoners and verifiers, delivering a 4% to 17% test accuracy improvement over existing self-improvement and verification approaches on common code generation and math reasoning benchmarks with LLaMA2 models.
Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models
Avi Singh
John D Co-Reyes
Piyush Patil
Xavier Garcia
Peter J. Liu
James Harrison
Jaehoon Lee
Aaron T Parisi
Abhishek Kumar
A. Alemi
Alex Rizkowsky
Azade Nova
Ben Adlam
Bernd Bohnet
Hanie Sedghi
Gamaleldin Fathy Elsayed
Igor Mordatch … (see 21 more)
Isabelle Simpson
Izzeddin Gur
Jasper Snoek
Jeffrey Pennington
Jiri Hron
Kathleen Kenealy
Kevin Swersky
Kshiteej Mahajan
Laura Culp
Lechao Xiao
Maxwell Bileschi
Noah Constant
Roman Novak
Rosanne Liu
Tris Brian Warkentin
Yundi Qian
Ethan Dyer
Behnam Neyshabur
Jascha Sohl-Dickstein
Yamini Bansal
Noah Fiedel
Fine-tuning language models~(LMs) on human-generated data remains a prevalent practice. However, the performance of such models is often lim… (see more)ited by the quantity and diversity of high-quality human data. In this paper, we explore whether we can go beyond human data on tasks where we have access to scalar feedback, for example, on math problems where one can verify correctness. To do so, we investigate a simple self-training method based on expectation-maximization, which we call ReST
Learning Silicon Dopant Transitions in Graphene using Scanning Transmission Electron Microscopy
Joshua Greaves
Kevin Roccapriore
Ekin Dogus Cubuk
Sergei Kalinin
Igor Mordatch
We introduce a machine learning approach to determine the transition rates of silicon atoms on a single layer of carbon atoms, when stimulat… (see more)ed by the electron beam of a scanning transmission electron microscope (STEM). Our method is data-centric, leveraging data collected on a STEM. The data samples are processed and filtered to produce symbolic representations, which we use to train a neural network to predict transition rates. These rates are then applied to guide a single silicon atom throughout the lattice to pre-determined target destinations. We present empirical analyses that demonstrate the efficacy and generality of our approach.
Learning Silicon Dopant Transitions in Graphene using Scanning Transmission Electron Microscopy
Joshua Greaves
Kevin Roccapriore
Ekin Dogus Cubuk
Sergei Kalinin
Igor Mordatch
We introduce a machine learning approach to determine the transition rates of silicon atoms on a single layer of carbon atoms, when stimulat… (see more)ed by the electron beam of a scanning transmission electron microscope (STEM). Our method is data-centric, leveraging data collected on a STEM. The data samples are processed and filtered to produce symbolic representations, which we use to train a neural network to predict transition rates. These rates are then applied to guide a single silicon atom throughout the lattice to pre-determined target destinations. We present empirical analyses that demonstrate the efficacy and generality of our approach.
Discovering the Electron Beam Induced Transition Rates for Silicon Dopants in Graphene with Deep Neural Networks in the STEM
Kevin M Roccapriore
Joshua Greaves
Colton Bishop
Maxim Ziatdinov
Igor Mordatch
Ekin Dogus Cubuk
Sergei V Kalinin
Bigger, Better, Faster: Human-level Atari with human-level efficiency
We introduce a value-based RL agent, which we call BBF, that achieves super-human performance in the Atari 100K benchmark. BBF relies on sca… (see more)ling the neural networks used for value estimation, as well as a number of other design choices that enable this scaling in a sample-efficient manner. We conduct extensive analyses of these design choices and provide insights for future work. We end with a discussion about updating the goalposts for sample-efficient RL research on the ALE. We make our code and data publicly available at https://github.com/google-research/google-research/tree/master/bigger_better_faster.
Bootstrapped Representations in Reinforcement Learning
Stephen Tu
Mark Rowland
Anna Harutyunyan
Will Dabney
In reinforcement learning (RL), state representations are key to dealing with large or continuous state spaces. While one of the promises of… (see more) deep learning algorithms is to automatically construct features well-tuned for the task they try to solve, such a representation might not emerge from end-to-end training of deep RL agents. To mitigate this issue, auxiliary objectives are often incorporated into the learning process and help shape the learnt state representation. Bootstrapping methods are today's method of choice to make these additional predictions. Yet, it is unclear which features these algorithms capture and how they relate to those from other auxiliary-task-based approaches. In this paper, we address this gap and provide a theoretical characterization of the state representation learnt by temporal difference learning (Sutton, 1988). Surprisingly, we find that this representation differs from the features learned by Monte Carlo and residual gradient algorithms for most transition structures of the environment in the policy evaluation setting. We describe the efficacy of these representations for policy evaluation, and use our theoretical analysis to design new auxiliary learning rules. We complement our theoretical results with an empirical comparison of these learning rules for different cumulant functions on classic domains such as the four-room domain (Sutton et al, 1999) and Mountain Car (Moore, 1990).
Bigger, Better, Faster: Human-level Atari with human-level efficiency
We introduce a value-based RL agent, which we call BBF, that achieves super-human performance in the Atari 100K benchmark. BBF relies on sca… (see more)ling the neural networks used for value estimation, as well as a number of other design choices that enable this scaling in a sample-efficient manner. We conduct extensive analyses of these design choices and provide insights for future work. We end with a discussion about updating the goalposts for sample-efficient RL research on the ALE. We make our code and data publicly available at https://github.com/google-research/google-research/tree/master/bigger_better_faster.
A Novel Stochastic Gradient Descent Algorithm for LearningPrincipal Subspaces
Joshua Greaves
Mark Rowland
Fabian Pedregosa
In this paper, we derive an algorithm that learns a principal subspace from sample entries, can be applied when the approximate subspace i… (see more)s represented by a neural network, and hence can bescaled to datasets with an effectively infinite number of rows and columns. Our method consistsin defining a loss function whose minimizer is the desired principal subspace, and constructing agradient estimate of this loss whose bias can be controlled.
Investigating Multi-task Pretraining and Generalization in Reinforcement Learning
Proto-Value Networks: Scaling Representation Learning with Auxiliary Tasks
Auxiliary tasks improve the representations learned by deep reinforcement learning agents. Analytically, their effect is reasonably well-und… (see more)erstood; in practice, how-ever, their primary use remains in support of a main learning objective, rather than as a method for learning representations. This is perhaps surprising given that many auxiliary tasks are defined procedurally, and hence can be treated as an essentially infinite source of information about the environment. Based on this observation, we study the effectiveness of auxiliary tasks for learning rich representations, focusing on the setting where the number of tasks and the size of the agent’s network are simultaneously increased. For this purpose, we derive a new family of auxiliary tasks based on the successor measure. These tasks are easy to implement and have appealing theoretical properties. Combined with a suitable off-policy learning rule, the result is a representation learning algorithm that can be understood as extending Mahadevan & Maggioni (2007)’s proto-value functions to deep reinforcement learning – accordingly, we call the resulting object proto-value networks. Through a series of experiments on the Arcade Learning Environment, we demonstrate that proto-value networks produce rich features that may be used to obtain performance comparable to established algorithms, using only linear approximation and a small number (~4M) of interactions with the environment’s reward function.