Rishabh Agarwal

Stop Regressing: Training Value Functions via Classification for Scalable Deep RL

Jesse Farebrother

Jordi Orbay

Quan Vuong

Adrien Ali Taiga

Yevgen Chebotar

Ted Xiao

Alex Irpan

Sergey Levine

Aleksandra Faust

Aviral Kumar

Rishabh Agarwal

2024-05-01

ICML.cc/2024/Conference (présentation orale)

Stop Regressing: Training Value Functions via Classification for Scalable Deep RL

Jesse Farebrother

Jordi Orbay

Quan Ho Vuong

Adrien Ali Taiga

Yevgen Chebotar

Ted Xiao

A. Irpan

Sergey Levine

Aleksandra Faust

Aviral Kumar

Rishabh Agarwal

Value functions are a central component of deep reinforcement learning (RL). These functions, parameterized by neural networks, are trained … (voir plus)using a mean squared error regression objective to match bootstrapped target values. However, scaling value-based RL methods that use regression to large networks, such as high-capacity Transformers, has proven challenging. This difficulty is in stark contrast to supervised learning: by leveraging a cross-entropy classification loss, supervised methods have scaled reliably to massive networks. Observing this discrepancy, in this paper, we investigate whether the scalability of deep RL can also be improved simply by using classification in place of regression for training value functions. We demonstrate that value functions trained with categorical cross-entropy significantly improves performance and scalability in a variety of domains. These include: single-task RL on Atari 2600 games with SoftMoEs, multi-task RL on Atari with large-scale ResNets, robotic manipulation with Q-transformers, playing Chess without search, and a language-agent Wordle task with high-capacity Transformers, achieving state-of-the-art results on these domains. Through careful analysis, we show that the benefits of categorical cross-entropy primarily stem from its ability to mitigate issues inherent to value-based RL, such as noisy targets and non-stationarity. Overall, we argue that a simple shift to training value functions with categorical cross-entropy can yield substantial improvements in the scalability of deep RL at little-to-no cost.

2024-03-06

ArXiv (prépublication)

arxiv.org

V-STaR: Training Verifiers for Self-Taught Reasoners

Arian Hosseini

Xingdi Yuan

Nikolay Malkin

Alessandro Sordoni

Rishabh Agarwal

Common self-improvement approaches for large language models (LLMs), such as STaR (Zelikman et al., 2022), iteratively fine-tune LLMs on sel… (voir plus)f-generated solutions to improve their problem-solving ability. However, these approaches discard the large amounts of incorrect solutions generated during this process, potentially neglecting valuable information in such solutions. To address this shortcoming, we propose V-STaR that utilizes both the correct and incorrect solutions generated during the self-improvement process to train a verifier using DPO that judges correctness of model-generated solutions. This verifier is used at inference time to select one solution among many candidate solutions. Running V-STaR for multiple iterations results in progressively better reasoners and verifiers, delivering a 4% to 17% test accuracy improvement over existing self-improvement and verification approaches on common code generation and math reasoning benchmarks with LLaMA2 models.

2024-02-09

ArXiv (prépublication)

arxiv.org

Learning and Controlling Silicon Dopant Transitions in Graphene using Scanning Transmission Electron Microscopy

Max Schwarzer

Jesse Farebrother

Joshua Greaves

Ekin Dogus Cubuk

Rishabh Agarwal

Sergei V. Kalinin

Igor Mordatch

Kevin M Roccapriore

We introduce a machine learning approach to determine the transition dynamics of silicon atoms on a single layer of carbon atoms, when stimu… (voir plus)lated by the electron beam of a scanning transmission electron microscope (STEM). Our method is data-centric, leveraging data collected on a STEM. The data samples are processed and filtered to produce symbolic representations, which we use to train a neural network to predict transition probabilities. These learned transition dynamics are then leveraged to guide a single silicon atom throughout the lattice to pre-determined target destinations. We present empirical analyses that demonstrate the efficacy and generality of our approach.

2023-11-21

ArXiv (prépublication)

arxiv.org

Learning Silicon Dopant Transitions in Graphene using Scanning Transmission Electron Microscopy

Max Schwarzer

Jesse Farebrother

Joshua Greaves

Kevin Roccapriore

Ekin Dogus Cubuk

Rishabh Agarwal

Sergei Kalinin

Igor Mordatch

We introduce a machine learning approach to determine the transition rates of silicon atoms on a single layer of carbon atoms, when stimulat… (voir plus)ed by the electron beam of a scanning transmission electron microscope (STEM). Our method is data-centric, leveraging data collected on a STEM. The data samples are processed and filtered to produce symbolic representations, which we use to train a neural network to predict transition rates. These rates are then applied to guide a single silicon atom throughout the lattice to pre-determined target destinations. We present empirical analyses that demonstrate the efficacy and generality of our approach.

2023-10-27

NeurIPS.cc/2023/Workshop/AI4Mat (spotlight)

Learning Silicon Dopant Transitions in Graphene using Scanning Transmission Electron Microscopy

Max Schwarzer

Jesse Farebrother

Joshua Greaves

Kevin Roccapriore

Ekin Dogus Cubuk

Rishabh Agarwal

Sergei Kalinin

Igor Mordatch

We introduce a machine learning approach to determine the transition rates of silicon atoms on a single layer of carbon atoms, when stimulat… (voir plus)ed by the electron beam of a scanning transmission electron microscope (STEM). Our method is data-centric, leveraging data collected on a STEM. The data samples are processed and filtered to produce symbolic representations, which we use to train a neural network to predict transition rates. These rates are then applied to guide a single silicon atom throughout the lattice to pre-determined target destinations. We present empirical analyses that demonstrate the efficacy and generality of our approach.

2023-10-27

NeurIPS.cc/2023/Workshop/AI4Mat (spotlight)

Discovering the Electron Beam Induced Transition Rates for Silicon Dopants in Graphene with Deep Neural Networks in the STEM

Kevin M Roccapriore

Max Schwarzer

Joshua Greaves

Jesse Farebrother

Rishabh Agarwal

Colton Bishop

Maxim Ziatdinov

Igor Mordatch

Ekin Dogus Cubuk

Sergei V Kalinin

2023-07-22

Microscopy and Microanalysis (publié)

Bigger, Better, Faster: Human-level Atari with human-level efficiency

Max Schwarzer

Johan Samir Obando Ceron

Rishabh Agarwal

We introduce a value-based RL agent, which we call BBF, that achieves super-human performance in the Atari 100K benchmark. BBF relies on sca… (voir plus)ling the neural networks used for value estimation, as well as a number of other design choices that enable this scaling in a sample-efficient manner. We conduct extensive analyses of these design choices and provide insights for future work. We end with a discussion about updating the goalposts for sample-efficient RL research on the ALE. We make our code and data publicly available at https://github.com/google-research/google-research/tree/master/bigger_better_faster.

2023-07-03

Proceedings of the 40th International Conference on Machine Learning (publié)

Bootstrapped Representations in Reinforcement Learning

Charline Le Lan

Stephen Tu

Mark Rowland

Anna Harutyunyan

Rishabh Agarwal

Will Dabney

In reinforcement learning (RL), state representations are key to dealing with large or continuous state spaces. While one of the promises of… (voir plus) deep learning algorithms is to automatically construct features well-tuned for the task they try to solve, such a representation might not emerge from end-to-end training of deep RL agents. To mitigate this issue, auxiliary objectives are often incorporated into the learning process and help shape the learnt state representation. Bootstrapping methods are today's method of choice to make these additional predictions. Yet, it is unclear which features these algorithms capture and how they relate to those from other auxiliary-task-based approaches. In this paper, we address this gap and provide a theoretical characterization of the state representation learnt by temporal difference learning (Sutton, 1988). Surprisingly, we find that this representation differs from the features learned by Monte Carlo and residual gradient algorithms for most transition structures of the environment in the policy evaluation setting. We describe the efficacy of these representations for policy evaluation, and use our theoretical analysis to design new auxiliary learning rules. We complement our theoretical results with an empirical comparison of these learning rules for different cumulant functions on classic domains such as the four-room domain (Sutton et al, 1999) and Mountain Car (Moore, 1990).

2023-07-03

Proceedings of the 40th International Conference on Machine Learning (publié)

A Novel Stochastic Gradient Descent Algorithm for LearningPrincipal Subspaces

Charline Le Lan

Joshua Greaves

Jesse Farebrother

Mark Rowland

Fabian Pedregosa

Rishabh Agarwal

In this paper, we derive an algorithm that learns a principal subspace from sample entries, can be applied when the approximate subspace i… (voir plus)s represented by a neural network, and hence can bescaled to datasets with an effectively infinite number of rows and columns. Our method consistsin defining a loss function whose minimizer is the desired principal subspace, and constructing agradient estimate of this loss whose bias can be controlled.

2023-04-11

Proceedings of The 26th International Conference on Artificial Intelligence and Statistics (publié)

Investigating Multi-task Pretraining and Generalization in Reinforcement Learning

Adrien Ali Taiga

Rishabh Agarwal

Jesse Farebrother

Google Brain

2023-02-01

ICLR.cc/2023/Conference (poster)