Portrait de Rishabh Agarwal

Rishabh Agarwal

Membre industriel associé
Professeur associé, McGill University, École d'informatique
Google DeepMind
Sujets de recherche
Apprentissage par renforcement
Apprentissage profond
Grands modèles de langage (LLM)

Biographie

Je suis chercheur dans l'équipe DeepMind de Google à Montréal, professeur adjoint à l'Université McGill et membre industriel associé à Mila - Institut québécois d'intelligence artificielle. J'ai réalisé mon doctorat au sein de Mila sous la supervision d'Aaron Courville et Marc Bellemare. Avant cela, j'ai eu l'opportunité de travailler pendant un an avec l'équipe de Geoffrey Hinton chez Google Brain, à Toronto. J'ai obtenu mon diplôme en informatique et en ingénierie à l'IIT Bombay.

Mes recherches se concentrent sur les modèles de langage et l'apprentissage par renforcement profond (RL). J'ai eu l'honneur de recevoir un prix pour un article exceptionnel présenté à NeurIPS.

Étudiants actuels

Doctorat - UdeM
Superviseur⋅e principal⋅e :

Publications

Faster, More Efficient RLHF through Off-Policy Asynchronous Learning
Michael Noukhovitch
Shengyi Huang
Sophie Xhonneux
Arian Hosseini
To achieve state-of-the-art chatbots, large language models are finetuned with reinforcement learning (RL), frequently to optimize human fee… (voir plus)dback (RLHF). This process is computationally expensive and can take weeks. Offline approaches, like DPO, learn on a static dataset and are efficient but not performant. The dominant paradigm, online and on-policy---synchronously generating from the model, labelling with a reward model, and learning on feedback from the model's own outputs---is performant but not efficient. Following prior work in the generall deep RL setting, we propose separating the actor and learner in RLHF. This enables the asynchronously generation of new samples while learning on prior samples, thus leading to overall faster training and better scaling. But this requires a novel regime for RLHF, online but off-policy: learning on samples from a previous version of our model. We ask a fundamental question: how much off-policyness can we tolerate for asynchronous training to speed up learning but maintain performance? We find that a contrastive loss, Online DPO, is most robust to off-policy data and that robustness increases with the scale of the policy model. We show even further compute optimizations but demonstrate that they come at a performance cost, giving rise to a trade-off. Finally, we verify our design choices by training LLaMA 3.1 8B with RLHF as a helpful chatbot in half the time of a synchronous run while matching final performance.
Training Language Models to Self-Correct via Reinforcement Learning
Aviral Kumar
Vincent Zhuang
Yi Su
John D Co-Reyes
Avi Singh
Kate Baumli
Shariq Iqbal
Colton Bishop
Rebecca Roelofs
Lei M Zhang
Kay McKinney
Disha Shrivastava
Cosmin Paduraru
George Tucker
Feryal Behbahani
Aleksandra Faust
Self-correction is a highly desirable capability of large language models (LLMs), yet it has consistently been found to be largely ineffecti… (voir plus)ve in modern LLMs. Existing approaches for training self-correction either require multiple models or rely on a more capable model or other forms of supervision. To this end, we develop a multi-turn online reinforcement learning (RL) approach, SCoRe, that significantly improves an LLM's self-correction ability using entirely self-generated data. To build SCoRe, we first show that variants of supervised fine-tuning (SFT) on offline model-generated correction traces are insufficient for instilling self-correction behavior. In particular, we observe that training via SFT either suffers from a distribution mismatch between the training data and the model's own responses or implicitly prefers only a certain mode of correction behavior that is often not effective at test time. SCoRe addresses these challenges by training under the model's own distribution of self-generated correction traces and using appropriate regularization to steer the learning process into learning a self-correction strategy that is effective at test time as opposed to simply fitting high-reward responses for a given prompt. This regularization prescribes running a first phase of RL on a base model to generate a policy initialization that is less susceptible to collapse and then using a reward bonus to amplify self-correction during training. When applied to Gemini 1.0 Pro and 1.5 Flash models, we find that SCoRe achieves state-of-the-art self-correction performance, improving the base models' self-correction by 15.6% and 9.1% respectively on the MATH and HumanEval benchmarks.
Inference-Aware Fine-Tuning for Best-of-N Sampling in Large Language Models
Yinlam Chow
Guy Tennenholtz
Izzeddin Gur
Vincent Zhuang
Bo Dai
Sridhar Thiagarajan
Craig Boutilier
Aviral Kumar
Aleksandra Faust
Recent studies have indicated that effectively utilizing inference-time compute is crucial for attaining better performance from large langu… (voir plus)age models (LLMs). In this work, we propose a novel inference-aware fine-tuning paradigm, in which the model is fine-tuned in a manner that directly optimizes the performance of the inference-time strategy. We study this paradigm using the simple yet effective Best-of-N (BoN) inference strategy, in which a verifier selects the best out of a set of LLM-generated responses. We devise the first imitation learning and reinforcement learning~(RL) methods for BoN-aware fine-tuning, overcoming the challenging, non-differentiable argmax operator within BoN. We empirically demonstrate that our BoN-aware models implicitly learn a meta-strategy that interleaves best responses with more diverse responses that might be better suited to a test-time input -- a process reminiscent of the exploration-exploitation trade-off in RL. Our experiments demonstrate the effectiveness of BoN-aware fine-tuning in terms of improved performance and inference-time compute. In particular, we show that our methods improve the Bo32 performance of Gemma 2B on Hendrycks MATH from 26.8% to 30.8%, and pass@32 from 60.0% to 67.0%, as well as the pass@16 on HumanEval from 61.6% to 67.1%.
Asynchronous RLHF: Faster and More Efficient Off-Policy RL for Language Models
Michael Noukhovitch
Shengyi Huang
Sophie Xhonneux
Arian Hosseini
The dominant paradigm for RLHF is online and on-policy RL: synchronously generating from the large language model (LLM) policy, labelling wi… (voir plus)th a reward model, and learning using feedback on the LLM's own outputs. While performant, this paradigm is computationally inefficient. Inspired by classical deep RL literature, we propose separating generation and learning in RLHF. This enables asynchronous generation of new samples while simultaneously training on old samples, leading to faster training and more compute-optimal scaling. However, asynchronous training relies on an underexplored regime, online but off-policy RLHF: learning on samples from previous iterations of our model. To understand the challenges in this regime, we investigate a fundamental question: how much off-policyness can we tolerate for asynchronous training to speed up learning but maintain performance? Among several RLHF algorithms we tested, we find that online DPO is most robust to off-policy data, and robustness increases with the scale of the policy model. We study further compute optimizations for asynchronous RLHF but find that they come at a performance cost, giving rise to a trade-off. Finally, we verify the scalability of asynchronous RLHF by training LLaMA 3.1 8B on an instruction-following task 40% faster than a synchronous run while matching final performance.
Asynchronous RLHF: Faster and More Efficient Off-Policy RL for Language Models
Michael Noukhovitch
Shengyi Huang
Sophie Xhonneux
Arian Hosseini
The dominant paradigm for RLHF is online and on-policy RL: synchronously generating from the large language model (LLM) policy, labelling wi… (voir plus)th a reward model, and learning using feedback on the LLM's own outputs. While performant, this paradigm is computationally inefficient. Inspired by classical deep RL literature, we propose separating generation and learning in RLHF. This enables asynchronous generation of new samples while simultaneously training on old samples, leading to faster training and more compute-optimal scaling. However, asynchronous training relies on an underexplored regime, online but off-policy RLHF: learning on samples from previous iterations of our model. To understand the challenges in this regime, we investigate a fundamental question: how much off-policyness can we tolerate for asynchronous training to speed up learning but maintain performance? Among several RLHF algorithms we tested, we find that online DPO is most robust to off-policy data, and robustness increases with the scale of the policy model. We study further compute optimizations for asynchronous RLHF but find that they come at a performance cost, giving rise to a trade-off. Finally, we verify the scalability of asynchronous RLHF by training LLaMA 3.1 8B on an instruction-following task 40% faster than a synchronous run while matching final performance.
Faster, More Efficient RLHF through Off-Policy Asynchronous Learning
Michael Noukhovitch
Shengyi Huang
Sophie Xhonneux
Arian Hosseini
To achieve state-of-the-art chatbots, large language models are finetuned with reinforcement learning (RL), frequently to optimize human fee… (voir plus)dback (RLHF). This process is computationally expensive and can take weeks. Offline approaches, like DPO, learn on a static dataset and are efficient but not performant. The dominant paradigm, online and on-policy---synchronously generating from the model, labelling with a reward model, and learning on feedback from the model's own outputs---is performant but not efficient. Following prior work in the generall deep RL setting, we propose separating the actor and learner in RLHF. This enables the asynchronously generation of new samples while learning on prior samples, thus leading to overall faster training and better scaling. But this requires a novel regime for RLHF, online but off-policy: learning on samples from a previous version of our model. We ask a fundamental question: how much off-policyness can we tolerate for asynchronous training to speed up learning but maintain performance? We find that a contrastive loss, Online DPO, is most robust to off-policy data and that robustness increases with the scale of the policy model. We show even further compute optimizations but demonstrate that they come at a performance cost, giving rise to a trade-off. Finally, we verify our design choices by training LLaMA 3.1 8B with RLHF as a helpful chatbot in half the time of a synchronous run while matching final performance.
Not All LLM Reasoners Are Created Equal
Arian Hosseini
Daniel Toyama
Not All LLM Reasoners Are Created Equal
Arian Hosseini
Daniel Toyama
We study the depth of grade-school math (GSM) problem-solving capabilities of LLMs. To this end, we evaluate their performance on pairs of e… (voir plus)xisting math word problems together so that the answer to the second problem depends on correctly answering the first problem. Our findings reveal a significant reasoning gap in most LLMs, that is performance difference between solving the compositional pairs and solving each question independently. This gap is more pronounced in smaller, more cost-efficient, and math-specialized models. Moreover, instruction-tuning recipes and code generation have varying effects across LLM sizes, while finetuning on GSM can lead to task overfitting. Our analysis indicates that large reasoning gaps are not because of test-set leakage, but due to distraction from additional context and poor second-hop reasoning. Overall, LLMs exhibit systematic differences in their reasoning abilities, despite what their performance on standard benchmarks indicates.
Many-Shot In-Context Learning
Avi Singh
Lei M Zhang
Bernd Bohnet
Luis Rosias
Stephanie C.Y. Chan
Biao Zhang
Ankesh Anand
Zaheer Abbas
Azade Nova
John D Co-Reyes
Eric Chu
Feryal Behbahani
Aleksandra Faust
Large language models (LLMs) excel at few-shot in-context learning (ICL) -- learning from a few examples provided in context at inference, w… (voir plus)ithout any weight updates. Newly expanded context windows allow us to investigate ICL with hundreds or thousands of examples – the many-shot regime. Going from few-shot to many-shot, we observe significant performance gains across a wide variety of generative and discriminative tasks. While promising, many-shot ICL can be bottlenecked by the available amount of human-generated outputs. To mitigate this limitation, we explore two new settings: (1) "Reinforced ICL" that uses model-generated chain-of-thought rationales in place of human rationales, and (2) "Unsupervised ICL" where we remove rationales from the prompt altogether, and prompts the model only with domain-specific inputs. We find that both Reinforced and Unsupervised ICL can be quite effective in the many-shot regime, particularly on complex reasoning tasks. We demonstrate that, unlike few-shot learning, many-shot learning is effective at overriding pretraining biases, can learn high-dimensional functions with numerical inputs, and performs comparably to supervised fine-tuning. Finally, we reveal the limitations of next-token prediction loss as an indicator of downstream ICL performance.
Training Language Models to Self-Correct via Reinforcement Learning
Aviral Kumar
Vincent Zhuang
Yi Su
John D Co-Reyes
Avi Singh
Kate Baumli
Shariq N Iqbal
Colton Bishop
Rebecca Roelofs
Lei M Zhang
Kay McKinney
Disha Shrivastava
Cosmin Paduraru
George Tucker
Feryal Behbahani
Aleksandra Faust
Self-correction is a highly desirable capability of large language models (LLMs), yet it has consistently been found to be largely ineffecti… (voir plus)ve in modern LLMs. Current methods for training self-correction typically depend on either multiple models, a more advanced model, or additional forms of supervision. To address these shortcomings, we develop a multi-turn online reinforcement learning (RL) approach, SCoRe, that significantly improves an LLM's self-correction ability using entirely self-generated data. To build SCoRe, we first show that variants of supervised fine-tuning (SFT) on offline model-generated correction traces are often insufficient for instilling self-correction behavior. In particular, we observe that training via SFT falls prey to either a distribution mismatch between mistakes made by the data-collection policy and the model's own responses, or to behavior collapse, where learning implicitly prefers only a certain mode of correction behavior that is often not effective at self-correction on test problems. SCoRe addresses these challenges by training under the model's own distribution of self-generated correction traces and using appropriate regularization to steer the learning process into learning a self-correction behavior that is effective at test time as opposed to fitting high-reward responses for a given prompt. This regularization process includes an initial phase of multi-turn RL on a base model to generate a policy initialization that is less susceptible to collapse, followed by using a reward bonus to amplify self-correction. With Gemini 1.0 Pro and 1.5 Flash models, we find that SCoRe achieves state-of-the-art self-correction performance, improving the base models' self-correction by 15.6% and 9.1% respectively on MATH and HumanEval.
Training Language Models to Self-Correct via Reinforcement Learning
Aviral Kumar
Vincent Zhuang
Yi Su
John D Co-Reyes
Avi Singh
Kate Baumli
Shariq N Iqbal
Colton Bishop
Rebecca Roelofs
Lei M Zhang
Kay McKinney
Disha Shrivastava
Cosmin Paduraru
George Tucker
Feryal Behbahani
Aleksandra Faust
Self-correction is a highly desirable capability of large language models (LLMs), yet it has consistently been found to be largely ineffecti… (voir plus)ve in modern LLMs. Current methods for training self-correction typically depend on either multiple models, a more advanced model, or additional forms of supervision. To address these shortcomings, we develop a multi-turn online reinforcement learning (RL) approach, SCoRe, that significantly improves an LLM's self-correction ability using entirely self-generated data. To build SCoRe, we first show that variants of supervised fine-tuning (SFT) on offline model-generated correction traces are often insufficient for instilling self-correction behavior. In particular, we observe that training via SFT falls prey to either a distribution mismatch between mistakes made by the data-collection policy and the model's own responses, or to behavior collapse, where learning implicitly prefers only a certain mode of correction behavior that is often not effective at self-correction on test problems. SCoRe addresses these challenges by training under the model's own distribution of self-generated correction traces and using appropriate regularization to steer the learning process into learning a self-correction behavior that is effective at test time as opposed to fitting high-reward responses for a given prompt. This regularization process includes an initial phase of multi-turn RL on a base model to generate a policy initialization that is less susceptible to collapse, followed by using a reward bonus to amplify self-correction. With Gemini 1.0 Pro and 1.5 Flash models, we find that SCoRe achieves state-of-the-art self-correction performance, improving the base models' self-correction by 15.6% and 9.1% respectively on MATH and HumanEval.
Gemma 2: Improving Open Language Models at a Practical Size
Gemma Team Morgane Riviere
Shreya Pathak
Pier Giuseppe Sessa
Cassidy Hardin
Surya Bhupatiraju
L'eonard Hussenot
Thomas Mesnard
Bobak Shahriari
Alexandre Ram'e
Johan Ferret
Peter Liu
Pouya Dehghani Tafti
Abe Friesen
Michelle Casbon
Sabela Ramos
Ravin Kumar
Charline Le Lan
Sammy Jerome
Anton Tsitsulin
Nino Vieillard … (voir 175 de plus)
Piotr Stańczyk
Sertan Girgin
Nikola Momchev
Matt Hoffman
Shantanu Thakoor
Jean-Bastien Grill
Behnam Neyshabur
Alanna Walton
Aliaksei Severyn
Alicia Parrish
Aliya Ahmad
Allen Hutchison
Alvin Abdagic
Amanda Carl
Amy Shen
Andy Brock
Andy Coenen
Anthony Laforge
Antonia Paterson
Ben Bastian
Bilal Piot
Boxi Wu
Brandon Royal
Charlie Chen
Chintu Kumar
Chris Perry
Christoper A. Welty
Christopher A. Choquette-Choo
Danila Sinopalnikov
David Weinberger
Dimple Vijaykumar
Dominika Rogozi'nska
D. Herbison
Elisa Bandy
Emma Wang
Eric Noland
Erica Moreira
Evan Senter
Evgenii Eltyshev
Francesco Visin
Gabriel Rasskin
Gary Wei
Glenn Cameron
Gus Martins
Hadi Hashemi
Hanna Klimczak-Pluci'nska
Harleen Batra
Harsh Dhand
Ivan Nardini
Jacinda Mein
Jack Zhou
James Svensson
Jeff Stanway
Jetha Chan
Jin Zhou
Joana Carrasqueira
Joana Iljazi
Jocelyn Becker
Joe Fernandez
Joost Van Amersfoort
Josh Gordon
Josh Lipschultz
Joshua Newlan
Junsong Ji
Kareem Mohamed
Kartikeya Badola
Kat Black
Katie Millican
Keelin McDonell
Kelvin Nguyen
Kiranbir Sodhia
Kish Greene
Lars Lowe Sjoesund
Lauren Usui
Laurent Sifre
L. Heuermann
Leti-cia Lago
Lilly McNealus
Livio Baldini Soares
Logan Kilpatrick
Lucas Dixon
Luciano Martins
Machel Reid
Manvinder Singh
Mark Iverson
Martin Gorner
Mat Velloso
Mateo Wirth
Matt Davidow
Matt Miller
Matthew Rahtz
Matthew Watson
Meg Risdal
Mehran Kazemi
Michael Moynihan
Ming Zhang
Minsuk Kahng
Minwoo Park
Mofi Rahman
Mohit Khatwani
Natalie Dao
Nenshad Bardoliwalla
N. Devanathan
Neta Dumai
Nilay Chauhan
O. Wahltinez
Pankil Botarda
Parker Barnes
Paul R. Barham
Paul Michel
Peng-chong Jin
Petko Georgiev
Phil Culliton
Pradeep Kuppala
Ramona Comanescu
Ramona Merhej
Reena Jana
R. Rokni
Ryan Mullins
Samaneh Saadat
S. M. Carthy
Sarah Perrin
S'ebastien M. R. Arnold
Se-bastian Krause
Shengyang Dai
S. Garg
Shruti Sheth
S. Ronstrom
Susan Chan
Timothy Jordan
Ting Yu
Tom Eccles
Tom Hennigan
Tomas Kocisky
Tulsee Doshi
Vihan Jain
Vikas Yadav
Vilobh Meshram
Vishal Dharmadhikari
Warren Barkley
Wei Wei
Wenming Ye
Woohyun Han
Woosuk Kwon
Xiang Xu
Zhe Shen
Zhitao Gong
Zichuan Wei
Victor Cotruta
Phoebe Kirk
Anand Rao
Minh Giang
Ludovic Peran
Tris Brian Warkentin
Eli Collins
Joelle Barral
Zoubin Ghahramani
Raia Hadsell
D. Sculley
Jeanine Banks
Anca Dragan
Slav Petrov
Oriol Vinyals
Jeffrey Dean
Demis Hassabis
Koray Kavukcuoglu
Clément Farabet
Elena Buchatskaya
Sebastian Borgeaud
Noah Fiedel
Armand Joulin
Kathleen Kenealy
Robert Dadashi
Alek Andreev
In this work, we introduce Gemma 2, a new addition to the Gemma family of lightweight, state-of-the-art open models, ranging in scale from 2… (voir plus) billion to 27 billion parameters. In this new version, we apply several known technical modifications to the Transformer architecture, such as interleaving local-global attentions (Beltagy et al., 2020a) and group-query attention (Ainslie et al., 2023). We also train the 2B and 9B models with knowledge distillation (Hinton et al., 2015) instead of next token prediction. The resulting models deliver the best performance for their size, and even offer competitive alternatives to models that are 2-3 times bigger. We release all our models to the community.