Portrait of Rishabh Agarwal

Rishabh Agarwal

Associate Industry Member
Adjunct Professor, McGill University, School of Computer Science
Google DeepMind
Research Topics
Deep Learning
Large Language Models (LLM)
Reinforcement Learning

Biography

I am a research scientist in the Google DeepMind Team in Montréal. I am also an Adjunct Professor at McGill University and an Associate Industry Member at Mila - Quebec Artificial Intelligence Institute. I finished my PhD at Mila under the guidance of Aaron Courville and Marc Bellemare. Previously, I spent a year at Geoffrey Hinton's amazing team in Google Brain, Toronto. Earlier, I graduated in Computer Science and Engineering from IIT Bombay.

My research work mainly revolves around language models and deep reinforcement learning (RL), and includes an outstanding paper award at NeurIPS.

Current Students

PhD - Université de Montréal
Principal supervisor :

Publications

Gemma 3 Technical Report
Gemma Team Aishwarya Kamath
Johan Ferret
Shreya Pathak
Nino Vieillard
Ramona Merhej
Sarah Perrin
Tatiana Matejovicova
Alexandre Ram'e
Morgane Rivière
Louis Rouillard
Thomas Mesnard
Geoffrey Cideron
Jean-Bastien Grill
Sabela Ramos
Edouard Yvinec
Michelle Casbon
Etienne Pot
Ivo Penchev
Gael Liu
Francesco Visin … (see 190 more)
Kathleen Kenealy
Lucas Beyer
Xiaohai Zhai
Anton Tsitsulin
Róbert Busa-Fekete
Alex Feng
Noveen Sachdeva
Benjamin Coleman
Yi Gao
Basil Mustafa
Iain Barr
Emilio Parisotto
David Tian
Matan Eyal
Colin Cherry
Jan-Thorsten Peter
Danila Sinopalnikov
Surya Bhupatiraju
Mehran Kazemi
Dan Malkin
Ravin Kumar
David Vilar
Idan Brusilovsky
Jiaming Luo
Andreas Steiner
Abe Friesen
Abhanshu Sharma
Abheesht Sharma
Adi Mayrav Gilady
Adrian Goedeckemeyer
Alaa Saade
Alexander Kolesnikov
Alexei Bendebury
Alvin Abdagic
Amit Vadi
Andr'as Gyorgy
André Susano Pinto
Anil Das
Ankur Bapna
Antoine Miech
Antoine Yang
Antonia Paterson
Ashish Shenoy
Ayan Chakrabarti
Bilal Piot
Boxi Wu
Bobak Shahriari
Bryce Petrini
Charlie Chen
Charline Le Lan
Christopher A. Choquette-Choo
CJ Carey
Cormac Brick
Daniel Deutsch
Danielle Eisenbud
Dee Cattle
Derek Cheng
Dimitris Paparas
Divyashree Shivakumar Sreepathihalli
Doug Reid
Dustin Tran
Dustin Zelle
Eric Noland
Erwin Huizenga
Eugene Kharitonov
Frederick Liu
Gagik Amirkhanyan
Glenn Cameron
Hadi Hashemi
Hanna Klimczak-Pluci'nska
Harman Singh
Harsh Mehta
Harshal Tushar Lehri
Hussein Hazimeh
Ian Ballantyne
Idan Szpektor
Ivan Nardini
Jean Pouget-Abadie
Jetha Chan
Joe Stanton
J. Michael Wieting
Jonathan Lai
Jordi Orbay
Joe Fernandez
Joshua Newlan
Junsong Ji
Jyotinder Singh
Kat Black
Kathy Yu
Kevin Hui
Kiran N. Vodrahalli
Klaus Greff
Linhai Qiu
Marcella Valentine
Marina Coelho
Marvin Ritter
Matt Hoffman
Matthew Watson
Mayank Chaturvedi
Michael Moynihan
Min Ma
Nabila Babar
Natasha Noy
Nathan Byrd
Nick Roy
Nikola Momchev
Nilay Chauhan
Oskar Bunyan
Pankil Botarda
Paul Caron
Paul Kishan Rubenstein
Phil Culliton
Philipp Schmid
Pier Giuseppe Sessa
Pingmei Xu
Piotr Stańczyk
Pouya Dehghani Tafti
Rakesh Shivanna
Renjie Wu
Renke Pan
R. Rokni
Rob Willoughby
Rohith Vallu
Ryan Mullins
Sammy Jerome
Sara Smoot
Sertan Girgin
Shariq Iqbal
Shashir Reddy
Shruti Sheth
Siim Põder
Sijal Bhatnagar
S. Panyam
Sivan Eiger
Susan Zhang
Tianqi Liu
Trevor Yacovone
T. Liechty
Uday Kalra
Utku Evci
Vedant Misra
Vincent Roseberry
Vladimir Feinberg
Vlad Kolesnikov
Woohyun Han
Woosuk Kwon
X. T. Chen
Yinlam Chow
Yuvein Zhu
Zichuan Wei
Z. Egyed
Victor Cotruta
Minh Giang
Phoebe Kirk
Anand Rao
Jessica Lo
Erica Moreira
Luiz GUStavo Martins
Omar Sanseviero
Lucas Gonzalez
Zach Gleicher
Tris Brian Warkentin
Seyed Vahab Mirrokni
Evan Senter
Eli Collins
Joelle Barral
Zoubin Ghahramani
Raia Hadsell
Yossi Matias
D. Sculley
Slav Petrov
Noah Fiedel
Noam M. Shazeer
Oriol Vinyals
Jeffrey Dean
Demis Hassabis
Koray Kavukcuoglu
Clément Farabet
Elena Buchatskaya
Jean-Baptiste Alayrac
Rohan Anil
Dmitry Lepikhin
Sebastian Borgeaud
Olivier Bachem
Armand Joulin
Alek Andreev
Cassidy Hardin
Robert Dadashi
L'eonard Hussenot
Gemma 3 Technical Report
Gemma Team Aishwarya Kamath
Johan Ferret
Shreya Pathak
Nino Vieillard
Ramona Merhej
Sarah Perrin
Tatiana Matejovicova
Alexandre Ram'e
Morgane Rivière
Louis Rouillard
Thomas Mesnard
Geoffrey Cideron
Jean-Bastien Grill
Sabela Ramos
Edouard Yvinec
Michelle Casbon
Etienne Pot
Ivo Penchev
Gael Liu
Francesco Visin … (see 190 more)
Kathleen Kenealy
Lucas Beyer
Xiaohai Zhai
Anton Tsitsulin
Róbert Busa-Fekete
Alex Feng
Noveen Sachdeva
Benjamin Coleman
Yi Gao
Basil Mustafa
Iain Barr
Emilio Parisotto
David Tian
Matan Eyal
Colin Cherry
Jan-Thorsten Peter
Danila Sinopalnikov
Surya Bhupatiraju
Mehran Kazemi
Dan Malkin
Ravin Kumar
David Vilar
Idan Brusilovsky
Jiaming Luo
Andreas Steiner
Abe Friesen
Abhanshu Sharma
Abheesht Sharma
Adi Mayrav Gilady
Adrian Goedeckemeyer
Alaa Saade
Alexander Kolesnikov
Alexei Bendebury
Alvin Abdagic
Amit Vadi
Andr'as Gyorgy
André Susano Pinto
Anil Das
Ankur Bapna
Antoine Miech
Antoine Yang
Antonia Paterson
Ashish Shenoy
Ayan Chakrabarti
Bilal Piot
Boxi Wu
Bobak Shahriari
Bryce Petrini
Charlie Chen
Charline Le Lan
Christopher A. Choquette-Choo
CJ Carey
Cormac Brick
Daniel Deutsch
Danielle Eisenbud
Dee Cattle
Derek Cheng
Dimitris Paparas
Divyashree Shivakumar Sreepathihalli
Doug Reid
Dustin Tran
Dustin Zelle
Eric Noland
Erwin Huizenga
Eugene Kharitonov
Frederick Liu
Gagik Amirkhanyan
Glenn Cameron
Hadi Hashemi
Hanna Klimczak-Pluci'nska
Harman Singh
Harsh Mehta
Harshal Tushar Lehri
Hussein Hazimeh
Ian Ballantyne
Idan Szpektor
Ivan Nardini
Jean Pouget-Abadie
Jetha Chan
Joe Stanton
J. Michael Wieting
Jonathan Lai
Jordi Orbay
Joe Fernandez
Joshua Newlan
Junsong Ji
Jyotinder Singh
Kat Black
Kathy Yu
Kevin Hui
Kiran N. Vodrahalli
Klaus Greff
Linhai Qiu
Marcella Valentine
Marina Coelho
Marvin Ritter
Matt Hoffman
Matthew Watson
Mayank Chaturvedi
Michael Moynihan
Min Ma
Nabila Babar
Natasha Noy
Nathan Byrd
Nick Roy
Nikola Momchev
Nilay Chauhan
Oskar Bunyan
Pankil Botarda
Paul Caron
Paul Kishan Rubenstein
Phil Culliton
Philipp Schmid
Pier Giuseppe Sessa
Pingmei Xu
Piotr Stańczyk
Pouya Dehghani Tafti
Rakesh Shivanna
Renjie Wu
Renke Pan
R. Rokni
Rob Willoughby
Rohith Vallu
Ryan Mullins
Sammy Jerome
Sara Smoot
Sertan Girgin
Shariq Iqbal
Shashir Reddy
Shruti Sheth
Siim Põder
Sijal Bhatnagar
S. Panyam
Sivan Eiger
Susan Zhang
Tianqi Liu
Trevor Yacovone
T. Liechty
Uday Kalra
Utku Evci
Vedant Misra
Vincent Roseberry
Vladimir Feinberg
Vlad Kolesnikov
Woohyun Han
Woosuk Kwon
X. T. Chen
Yinlam Chow
Yuvein Zhu
Zichuan Wei
Z. Egyed
Victor Cotruta
Minh Giang
Phoebe Kirk
Anand Rao
Jessica Lo
Erica Moreira
Luiz GUStavo Martins
Omar Sanseviero
Lucas Gonzalez
Zach Gleicher
Tris Brian Warkentin
Seyed Vahab Mirrokni
Evan Senter
Eli Collins
Joelle Barral
Zoubin Ghahramani
Raia Hadsell
Yossi Matias
D. Sculley
Slav Petrov
Noah Fiedel
Noam M. Shazeer
Oriol Vinyals
Jeffrey Dean
Demis Hassabis
Koray Kavukcuoglu
Clément Farabet
Elena Buchatskaya
Jean-Baptiste Alayrac
Rohan Anil
Dmitry Lepikhin
Sebastian Borgeaud
Olivier Bachem
Armand Joulin
Alek Andreev
Cassidy Hardin
Robert Dadashi
L'eonard Hussenot
We introduce Gemma 3, a multimodal addition to the Gemma family of lightweight open models, ranging in scale from 1 to 27 billion parameters… (see more). This version introduces vision understanding abilities, a wider coverage of languages and longer context - at least 128K tokens. We also change the architecture of the model to reduce the KV-cache memory that tends to explode with long context. This is achieved by increasing the ratio of local to global attention layers, and keeping the span on local attention short. The Gemma 3 models are trained with distillation and achieve superior performance to Gemma 2 for both pre-trained and instruction finetuned versions. In particular, our novel post-training recipe significantly improves the math, chat, instruction-following and multilingual abilities, making Gemma3-4B-IT competitive with Gemma2-27B-IT and Gemma3-27B-IT comparable to Gemini-1.5-Pro across benchmarks. We release all our models to the community.
Gemma 3 Technical Report
Gemma Team Aishwarya Kamath
Johan Ferret
Shreya Pathak
Nino Vieillard
Ramona Merhej
Sarah Perrin
Tatiana Matejovicova
Alexandre Ram'e
Morgane Rivière
Louis Rouillard
Thomas Mesnard
Geoffrey Cideron
Jean-Bastien Grill
Sabela Ramos
Edouard Yvinec
Michelle Casbon
Etienne Pot
Ivo Penchev
Gael Liu
Francesco Visin … (see 190 more)
Kathleen Kenealy
Lucas Beyer
Xiaohai Zhai
Anton Tsitsulin
Róbert Busa-Fekete
Alex Feng
Noveen Sachdeva
Benjamin Coleman
Yi Gao
Basil Mustafa
Iain Barr
Emilio Parisotto
David Tian
Matan Eyal
Colin Cherry
Jan-Thorsten Peter
Danila Sinopalnikov
Surya Bhupatiraju
Mehran Kazemi
Dan Malkin
Ravin Kumar
David Vilar
Idan Brusilovsky
Jiaming Luo
Andreas Steiner
Abe Friesen
Abhanshu Sharma
Abheesht Sharma
Adi Mayrav Gilady
Adrian Goedeckemeyer
Alaa Saade
Alexander Kolesnikov
Alexei Bendebury
Alvin Abdagic
Amit Vadi
Andr'as Gyorgy
André Susano Pinto
Anil Das
Ankur Bapna
Antoine Miech
Antoine Yang
Antonia Paterson
Ashish Shenoy
Ayan Chakrabarti
Bilal Piot
Boxi Wu
Bobak Shahriari
Bryce Petrini
Charlie Chen
Charline Le Lan
Christopher A. Choquette-Choo
CJ Carey
Cormac Brick
Daniel Deutsch
Danielle Eisenbud
Dee Cattle
Derek Cheng
Dimitris Paparas
Divyashree Shivakumar Sreepathihalli
Doug Reid
Dustin Tran
Dustin Zelle
Eric Noland
Erwin Huizenga
Eugene Kharitonov
Frederick Liu
Gagik Amirkhanyan
Glenn Cameron
Hadi Hashemi
Hanna Klimczak-Pluci'nska
Harman Singh
Harsh Mehta
Harshal Tushar Lehri
Hussein Hazimeh
Ian Ballantyne
Idan Szpektor
Ivan Nardini
Jean Pouget-Abadie
Jetha Chan
Joe Stanton
J. Michael Wieting
Jonathan Lai
Jordi Orbay
Joe Fernandez
Joshua Newlan
Junsong Ji
Jyotinder Singh
Kat Black
Kathy Yu
Kevin Hui
Kiran N. Vodrahalli
Klaus Greff
Linhai Qiu
Marcella Valentine
Marina Coelho
Marvin Ritter
Matt Hoffman
Matthew Watson
Mayank Chaturvedi
Michael Moynihan
Min Ma
Nabila Babar
Natasha Noy
Nathan Byrd
Nick Roy
Nikola Momchev
Nilay Chauhan
Oskar Bunyan
Pankil Botarda
Paul Caron
Paul Kishan Rubenstein
Phil Culliton
Philipp Schmid
Pier Giuseppe Sessa
Pingmei Xu
Piotr Stańczyk
Pouya Dehghani Tafti
Rakesh Shivanna
Renjie Wu
Renke Pan
R. Rokni
Rob Willoughby
Rohith Vallu
Ryan Mullins
Sammy Jerome
Sara Smoot
Sertan Girgin
Shariq Iqbal
Shashir Reddy
Shruti Sheth
Siim Põder
Sijal Bhatnagar
S. Panyam
Sivan Eiger
Susan Zhang
Tianqi Liu
Trevor Yacovone
T. Liechty
Uday Kalra
Utku Evci
Vedant Misra
Vincent Roseberry
Vladimir Feinberg
Vlad Kolesnikov
Woohyun Han
Woosuk Kwon
X. T. Chen
Yinlam Chow
Yuvein Zhu
Zichuan Wei
Z. Egyed
Victor Cotruta
Minh Giang
Phoebe Kirk
Anand Rao
Jessica Lo
Erica Moreira
Luiz GUStavo Martins
Omar Sanseviero
Lucas Gonzalez
Zach Gleicher
Tris Brian Warkentin
Seyed Vahab Mirrokni
Evan Senter
Eli Collins
Joelle Barral
Zoubin Ghahramani
Raia Hadsell
Yossi Matias
D. Sculley
Slav Petrov
Noah Fiedel
Noam M. Shazeer
Oriol Vinyals
Jeffrey Dean
Demis Hassabis
Koray Kavukcuoglu
Clément Farabet
Elena Buchatskaya
Jean-Baptiste Alayrac
Rohan Anil
Dmitry Lepikhin
Sebastian Borgeaud
Olivier Bachem
Armand Joulin
Alek Andreev
Cassidy Hardin
Robert Dadashi
L'eonard Hussenot
Gemma 3 Technical Report
Gemma Team Aishwarya Kamath
Johan Ferret
Shreya Pathak
Nino Vieillard
Ramona Merhej
Sarah Perrin
Tatiana Matejovicova
Alexandre Ram'e
Morgane Rivière
Louis Rouillard
Thomas Mesnard
Geoffrey Cideron
Jean-Bastien Grill
Sabela Ramos
Edouard Yvinec
Michelle Casbon
Etienne Pot
Ivo Penchev
Gael Liu
Francesco Visin … (see 190 more)
Kathleen Kenealy
Lucas Beyer
Xiaohai Zhai
Anton Tsitsulin
Róbert Busa-Fekete
Alex Feng
Noveen Sachdeva
Benjamin Coleman
Yi Gao
Basil Mustafa
Iain Barr
Emilio Parisotto
David Tian
Matan Eyal
Colin Cherry
Jan-Thorsten Peter
Danila Sinopalnikov
Surya Bhupatiraju
Mehran Kazemi
Dan Malkin
Ravin Kumar
David Vilar
Idan Brusilovsky
Jiaming Luo
Andreas Steiner
Abe Friesen
Abhanshu Sharma
Abheesht Sharma
Adi Mayrav Gilady
Adrian Goedeckemeyer
Alaa Saade
Alexander Kolesnikov
Alexei Bendebury
Alvin Abdagic
Amit Vadi
Andr'as Gyorgy
André Susano Pinto
Anil Das
Ankur Bapna
Antoine Miech
Antoine Yang
Antonia Paterson
Ashish Shenoy
Ayan Chakrabarti
Bilal Piot
Boxi Wu
Bobak Shahriari
Bryce Petrini
Charlie Chen
Charline Le Lan
Christopher A. Choquette-Choo
CJ Carey
Cormac Brick
Daniel Deutsch
Danielle Eisenbud
Dee Cattle
Derek Cheng
Dimitris Paparas
Divyashree Shivakumar Sreepathihalli
Doug Reid
Dustin Tran
Dustin Zelle
Eric Noland
Erwin Huizenga
Eugene Kharitonov
Frederick Liu
Gagik Amirkhanyan
Glenn Cameron
Hadi Hashemi
Hanna Klimczak-Pluci'nska
Harman Singh
Harsh Mehta
Harshal Tushar Lehri
Hussein Hazimeh
Ian Ballantyne
Idan Szpektor
Ivan Nardini
Jean Pouget-Abadie
Jetha Chan
Joe Stanton
J. Michael Wieting
Jonathan Lai
Jordi Orbay
Joe Fernandez
Joshua Newlan
Junsong Ji
Jyotinder Singh
Kat Black
Kathy Yu
Kevin Hui
Kiran N. Vodrahalli
Klaus Greff
Linhai Qiu
Marcella Valentine
Marina Coelho
Marvin Ritter
Matt Hoffman
Matthew Watson
Mayank Chaturvedi
Michael Moynihan
Min Ma
Nabila Babar
Natasha Noy
Nathan Byrd
Nick Roy
Nikola Momchev
Nilay Chauhan
Oskar Bunyan
Pankil Botarda
Paul Caron
Paul Kishan Rubenstein
Phil Culliton
Philipp Schmid
Pier Giuseppe Sessa
Pingmei Xu
Piotr Stańczyk
Pouya Dehghani Tafti
Rakesh Shivanna
Renjie Wu
Renke Pan
R. Rokni
Rob Willoughby
Rohith Vallu
Ryan Mullins
Sammy Jerome
Sara Smoot
Sertan Girgin
Shariq Iqbal
Shashir Reddy
Shruti Sheth
Siim Põder
Sijal Bhatnagar
S. Panyam
Sivan Eiger
Susan Zhang
Tianqi Liu
Trevor Yacovone
T. Liechty
Uday Kalra
Utku Evci
Vedant Misra
Vincent Roseberry
Vladimir Feinberg
Vlad Kolesnikov
Woohyun Han
Woosuk Kwon
X. T. Chen
Yinlam Chow
Yuvein Zhu
Zichuan Wei
Z. Egyed
Victor Cotruta
Minh Giang
Phoebe Kirk
Anand Rao
Jessica Lo
Erica Moreira
Luiz GUStavo Martins
Omar Sanseviero
Lucas Gonzalez
Zach Gleicher
Tris Brian Warkentin
Seyed Vahab Mirrokni
Evan Senter
Eli Collins
Joelle Barral
Zoubin Ghahramani
Raia Hadsell
Yossi Matias
D. Sculley
Slav Petrov
Noah Fiedel
Noam M. Shazeer
Oriol Vinyals
Jeffrey Dean
Demis Hassabis
Koray Kavukcuoglu
Clément Farabet
Elena Buchatskaya
Jean-Baptiste Alayrac
Rohan Anil
Dmitry Lepikhin
Sebastian Borgeaud
Olivier Bachem
Armand Joulin
Alek Andreev
Cassidy Hardin
Robert Dadashi
L'eonard Hussenot
We introduce Gemma 3, a multimodal addition to the Gemma family of lightweight open models, ranging in scale from 1 to 27 billion parameters… (see more). This version introduces vision understanding abilities, a wider coverage of languages and longer context - at least 128K tokens. We also change the architecture of the model to reduce the KV-cache memory that tends to explode with long context. This is achieved by increasing the ratio of local to global attention layers, and keeping the span on local attention short. The Gemma 3 models are trained with distillation and achieve superior performance to Gemma 2 for both pre-trained and instruction finetuned versions. In particular, our novel post-training recipe significantly improves the math, chat, instruction-following and multilingual abilities, making Gemma3-4B-IT competitive with Gemma2-27B-IT and Gemma3-27B-IT comparable to Gemini-1.5-Pro across benchmarks. We release all our models to the community.
Asynchronous RLHF: Faster and More Efficient Off-Policy RL for Language Models
Michael Noukhovitch
Shengyi Huang
Sophie Xhonneux
Arian Hosseini
The dominant paradigm for RLHF is online and on-policy RL: synchronously generating from the large language model (LLM) policy, labelling wi… (see more)th a reward model, and learning using feedback on the LLM's own outputs. While performant, this paradigm is computationally inefficient. Inspired by classical deep RL literature, we propose separating generation and learning in RLHF. This enables asynchronous generation of new samples while simultaneously training on old samples, leading to faster training and more compute-optimal scaling. However, asynchronous training relies on an underexplored regime, online but off-policy RLHF: learning on samples from previous iterations of our model. To understand the challenges in this regime, we investigate a fundamental question: how much off-policyness can we tolerate for asynchronous training to speed up learning but maintain performance? Among several RLHF algorithms we tested, we find that online DPO is most robust to off-policy data, and robustness increases with the scale of the policy model. We study further compute optimizations for asynchronous RLHF but find that they come at a performance cost, giving rise to a trade-off. Finally, we verify the scalability of asynchronous RLHF by training LLaMA 3.1 8B on an instruction-following task 40% faster than a synchronous run while matching final performance.
Faster, More Efficient RLHF through Off-Policy Asynchronous Learning
Michael Noukhovitch
Shengyi Huang
Sophie Xhonneux
Arian Hosseini
To achieve state-of-the-art chatbots, large language models are finetuned with reinforcement learning (RL), frequently to optimize human fee… (see more)dback (RLHF). This process is computationally expensive and can take weeks. Offline approaches, like DPO, learn on a static dataset and are efficient but not performant. The dominant paradigm, online and on-policy---synchronously generating from the model, labelling with a reward model, and learning on feedback from the model's own outputs---is performant but not efficient. Following prior work in the generall deep RL setting, we propose separating the actor and learner in RLHF. This enables the asynchronously generation of new samples while learning on prior samples, thus leading to overall faster training and better scaling. But this requires a novel regime for RLHF, online but off-policy: learning on samples from a previous version of our model. We ask a fundamental question: how much off-policyness can we tolerate for asynchronous training to speed up learning but maintain performance? We find that a contrastive loss, Online DPO, is most robust to off-policy data and that robustness increases with the scale of the policy model. We show even further compute optimizations but demonstrate that they come at a performance cost, giving rise to a trade-off. Finally, we verify our design choices by training LLaMA 3.1 8B with RLHF as a helpful chatbot in half the time of a synchronous run while matching final performance.
Generating Complex Question Decompositions in the Face of Distribution Shifts.
Kelvin Han
Claire Gardent
Marah Ihab Abdin
Jyoti Aneja
Hany Hassan Awadalla
Ammar Ahmed Awadallah
Ahmad Awan
Nguyen Bach
Amit Bahree
Arash Bakhtiari
Jianmin Bao
Harkirat Singh Behl
Alon Benhaim
Misha Bilenko
Johan Bjorck
Sébastien Bubeck
Martin Cai
Qin Cai
Vishrav Chaudhary
Dong Chen … (see 342 more)
Weizhu Chen
Yen-Chun Chen 0001
Yi-ling Chen
Hao Cheng
Parul Chopra
Xiyang Dai
Matthew Dixon
Ronen Eldan
Victor Fragoso
Jianfeng Gao
Mei Gao
Min Gao
Amit Garg
Allison Del Giorno
Abhishek Goswami
S. Gunasekar
Emman Haider
Jun-heng Hao
Russell J. Hewett
Wen-Wei Hu
Jamie Huynh
Dan Iter
Sam Ade Jacobs
Mojan Javaheripi
Xin Jin
Nikos Karampatziakis
Piero Kauffmann
Mahoud Khademi
Dongwoo Kim
Young Jin Kim
Lev Kurilenko
James R. Lee
Yin Tat Lee
Yuanzhi Li
Yunsheng Li
Chen Liang
Lars Lidén
Xihui
Zeqi Lin
Ce Lin
Liyuan Liu
Mengchen Liu
Liu Weishung
Xiaodong Liu
Chong Liu
Piyush Luo
Ali Madan
David Mahmoudzadeh
Matt Majercak
Caio Mazzola
César Teodoro
Arindam Mendes
Hardik Mitra
Anh Modi
Brandon Nguyen
Norick Barun
Daniel Patra
Thomas Perez-Becker
Portet Reid
Heyang Pryzant
Marko Qin
Liliang Radmilac
Gustavo Ren
Corby de Rosa
Sambudha Rosset
Roy Olatunji
Olli Ruwase
Amin Saarikivi
Adil Saied
Michael Salim
Shital Santacroce
Ning Shah
Shang Hiteshi
Yelong Sharma
Swadheen Shen
Xia Shukla
Masahiro Song
Andrea Tanaka
Praneetha Tupini
Michael Wu
Bin Wyatt
Can Xiao
Jiahang Xu
Weijiang Xu
Jilong Xu
Sonali Xue
Fan Yadav
Jianwei Yang
Yifan Yang
Ziyi Yang
Donghan Yang
Yu Lu
Chenruidong Yuan
Cyril Zhang
Jianwen Zhang
Zhang
Li Lyna
Yi Zhang
Yue Zhang
Yunan Zhang 0001
Zhang Xiren
Zhou
Phi-3
Priyanka Agrawal
Chris Alberti
Fantine Huot
Joshua Maynez
Ji Ma
Kuzman Ganchev
Viraat Aryabumi
John Dang
Dwarak Talupuru
Saurabh Dash
David Cairuz
Hangyu Lin
Bharat Venkitesh
Madeline Smith
Jon Ander Campos
Yi Chern Tan
Kelly Marchisio
Max Bartolo
Sebastian Ruder
Acyr F. Locatelli
Julia Kreutzer
Nick Frosst
Aidan Gomez
Phil Blunsom
Marzieh Fadaee
Tom B. Brown
Benjamin Mann
Nick Ryder
Melanie Subbiah
Jared Kaplan
Prafulla Dhariwal
Arvind Neelakantan
Pranav Shyam
Girish Sastry
Amanda Askell
Sandhini Agarwal
Ariel Herbert-Voss
Gretchen Krueger
T. Henighan
Rewon Child
Aditya Ramesh
Daniel M. Ziegler
Jeffrey Wu
Clemens Winter
Chris Hesse
Mark Chen
Eric Sigler
Ma-teusz Litwin
Scott Gray
Benjamin Chess
J. Clark
Christopher Berner
Sam McCandlish
Alec Radford
Ilya Sutskever
Dario Amodei Gemma Team
Morgane Rivière
Shreya Pathak Pier
Giuseppe Sessa
Cassidy Hardin
Surya Bhupati-raju
L'eonard Hussenot
Thomas Mesnard
Bobak Shahriari
Alexandre Ramé
Johan Ferret
Peter Liu
Pouya Dehghani Tafti
Abe Friesen
Michelle Casbon
Sabela Ramos
Ravin Kumar
Charline Le Lan
Sammy Jerome
Anton Tsitsulin
Nino Vieillard
Piotr Stańczyk
Sertan Girgin
Nikola Momchev
Matt Hoffman
Shantanu Thakoor
Jean-Bastien Grill
Behnam Neyshabur
Olivier Bachem
Alanna Wal-ton
Aliaksei Severyn
Alicia Parrish
Aliya Ah-mad
Allen Hutchison
Alvin Abdagic
Amanda Carl
Amy Shen
Andy Brock
Andy Coenen
Anthony Laforge
Antonia Paterson
Ben Bastian
Bilal Piot
Boxi Wu
Brandon Royal
Charlie Chen
Chintu Kumar
Chris Perry
Christoper A. Welty
Christopher A. Choquette-Choo
Danila Sinopalnikov
David Wein-berger
Dimple Vijaykumar
Dominika Rogozi´nska
D. Herbison
Elisa Bandy
Emma Wang
Eric Noland
Erica Moreira
Evan Senter
Evgenii Elty-shev
Francesco Visin
Gabriel Rasskin
Gary Wei
Glenn Cameron
Gus Martins
Hadi Hashemi
Hanna Klimczak-Pluci´nska
Harleen Batra
Harsh Dhand
Ivan Nardini
Jacinda Mein
Jack Zhou
James Svens-son
Jeff Stanway
Jetha Chan
J. Zhou
Joana Carrasqueira
Joana Iljazi
Jocelyn Becker
Joe Fer-nandez
Joost Van Amersfoort
Josh Gordon
Josh Lipschultz
Joshua Newlan
Junsong Ji
Kareem Mo-hamed
Kartikeya Badola
Kat Black
Katie Mil-lican
Keelin McDonell
Kelvin Nguyen
Kiranbir Sodhia
Kish Greene
Lars Lowe Sjoesund
Lauren Usui
Laurent Sifre
L. Heuermann
Leti-cia Lago
Lilly McNealus
Livio Baldini
Soares Logan
Lucas Kilpatrick
Luciano Dixon
Martins Machel
Manvinder Reid
Mark Singh
Martin Görner Iverson
Mateo Wirth Mat Velloso
Matt Davi-dow
Matt Miller
Matthew Rahtz
Matthew Watson
Meg Risdal
Mehran Kazemi
Michael Moynihan
Ming Zhang
Minsuk Kahng
Minwoo Park
Mofi Rahman
Mohit Khatwani
Natalie Dao
Nenshad Bardoliwalla
N. Devanathan
Neta Dumai
Nilay Chauhan
O. Wahltinez
Pankil Botarda
Parker Barnes
Paul R. Barham
Paul Michel
Peng-chong Jin
Petko Georgiev
Phil Culliton
Pradeep Kup-pala
Ramona Comanescu
Ramona Merhej
Reena Jana
R. Rokni
Ryan Mullins
Samaneh Saadat
S. M. Carthy
Sarah Cogan
Sarah Perrin
S'ebastien M. R. Arnold
Se-bastian Krause
Shengyang Dai
S. Garg
Shruti Sheth
S. Ronstrom
Susan Chan
Timothy Jordan
Bing Yu
Tom Eccles
Tom Hennigan
Tomas Kocisky
Tulsee Doshi
Vihan Jain
Vikas Yadav
Vilobh Meshram
Vishal Dharmadhikari
Warren Barkley
Wei Wei
Wenming Ye
Woohyun Han
Woosuk Kwon
Xiang Xu
Zhe Shen
Zhitao Gong
Zichuan Wei
Victor Cotruta
Phoebe Kirk
Anand Rao
Minh Giang
Ludovic Peran
Tris Brian Warkentin
Eli Collins
Joelle Barral
Zoubin Ghahramani
Raia Hadsell
D. Sculley
Jeanine Banks
Anca Dragan
Inference-Aware Fine-Tuning for Best-of-N Sampling in Large Language Models
Yinlam Chow
Guy Tennenholtz
Izzeddin Gur
Vincent Zhuang
Bo Dai
Sridhar Thiagarajan
Craig Boutilier
Aviral Kumar
Aleksandra Faust
Recent studies have indicated that effectively utilizing inference-time compute is crucial for attaining better performance from large langu… (see more)age models (LLMs). In this work, we propose a novel inference-aware fine-tuning paradigm, in which the model is fine-tuned in a manner that directly optimizes the performance of the inference-time strategy. We study this paradigm using the simple yet effective Best-of-N (BoN) inference strategy, in which a verifier selects the best out of a set of LLM-generated responses. We devise the first imitation learning and reinforcement learning~(RL) methods for BoN-aware fine-tuning, overcoming the challenging, non-differentiable argmax operator within BoN. We empirically demonstrate that our BoN-aware models implicitly learn a meta-strategy that interleaves best responses with more diverse responses that might be better suited to a test-time input -- a process reminiscent of the exploration-exploitation trade-off in RL. Our experiments demonstrate the effectiveness of BoN-aware fine-tuning in terms of improved performance and inference-time compute. In particular, we show that our methods improve the Bo32 performance of Gemma 2B on Hendrycks MATH from 26.8% to 30.8%, and pass@32 from 60.0% to 67.0%, as well as the pass@16 on HumanEval from 61.6% to 67.1%.
On the Analysis and Distillation of Emergent Outlier Properties in Pre-trained Language Models
Tianyang Zhao
Kunwar Yashraj Singh
Srikar Appalaraju
Peng Tang
Ying Nian Wu
Li Erran Li
Li
Nino Vieillard
Yongchao Zhou
Piotr Stańczyk
Sabela Ramos Garea
Matthieu Geist
Rohan Anil
Andrew M. Dai
Melvin Orhan Firat
Dmitry Lepikhin
Alexandre Passos
Siamak Shakeri
Emanuel Taropa … (see 478 more)
Paige Bailey
Zhifeng Chen
Eric Chu
Jonathan H. Clark
Laurent El
Yanping Huang
K. Meier-Hellstern
Gaurav Mishra
Erica Moreira
Mark Omernick
Kevin Robinson
Sebastian Ruder
Yi Tay
Kefan Xiao
Yuanzhong Xu
Yujing Zhang
Gustavo Hernández Abrego
Junwhan Ahn
Jacob Austin
Paul R. Barham
Jan Botha
James Bradbury
Siddhartha Brahma
Kevin Brooks
M. Catasta
Yong Cheng
Colin Cherry
Christopher A. Choquette-Choo
Aakanksha Chowdhery
Clé-ment Crepy
Shachi Dave
Mostafa Dehghani
Sunipa Dev
Jacob Devlin
Mark Díaz
Nan Du
Ethan Dyer
Vladimir Feinberg
Fangxiaoyu Feng
Vlad Fienber
Markus Freitag
Xavier Garcia
Sebastian Gehrmann
Lucas Gonzalez
Guy Gur-Ari
Steven Hand
Hadi Hashemi
Le Hou
Joshua Howland
Andrea Hu
Jeffrey Hui
Jeremy Hur-witz
Michael Acheson Isard
Abe Ittycheriah
Matthew Jagiel-ski
Wenhao Jia
Kathleen Kenealy
M. Krikun
Sneha Kudugunta 0001
Chang Lan
Kather-ine Lee
Benjamin Lee
Music Eric Li
Wei Li
YaGuang Li
Li Jian
Hyeontaek Li
Hanzhao Lim
Zhongtao Lin
Liu Frederick
Marcello Liu
Aroma Maggioni
Mahendru Joshua
Vedant Maynez
Maysam Misra
Moussalem Zachary
John Nado
E. Nham
Andrew Ni
Alicia Nys-trom
Marie Parrish
M. Pellat
Polacek Alex
Reiner Polozov
Siyuan Pope
Emily Qiao
Reif Bryan
Parker Richter
Alex Riley
Castro Ros
Aurko Roy
Brennan Saeta
Rajkumar Samuel
Renee Shelby
Ambrose Slone
Daniel Smilkov
David R. So
Daniel Sohn
Simon Tokumine
Dasha Valter
Haim-ing Bao
Mo Bavarian
Jeff Belgum
Ir-wan Bello
Jake Berdine
Gabriel Bernadett-Shapiro
Christopher Berner
Lenny Bogdonoff
Oleg Boiko
Madelaine Boyd
Anna-Luisa Brakman
Greg Brock-man
Tim Brooks
M. Brundage
Kevin Button
Trevor Cai
Rosie Campbell
Andrew Cann
Brittany Carey
Chelsea Carlson
Rory Carmichael
Brooke Chan
Che Chang
Fotis Chantzis
Derek Chen
Sully Chen
Ruby Chen
Jason Chen
Mark Chen
Benjamin Chess
Chester Cho
Hyung Casey Chu
Won Chung
Dave Cummings
Jeremiah Currier
Yunxing Dai
Tarun Goel
Gabriel Gogineni
Rapha Goh
Jonathan Gontijo-Lopes
Morgan Gordon
Scott Grafstein
Ryan Gray
Joshua Greene
Shixiang Shane Gross
Yufei Gu
Chris Guo
Jesse Hallacy
Jeff Han
Harris Yuchen
Mike He
Johannes Heaton
C. Heidecke
Alan Hesse
Wade Hickey
Peter Hickey
Hoeschele Brandon
Kenny Houghton
Shengli Hsu
Xin Hu
Joost Hu
Shantanu Huizinga
Shawn Jain
Jain Joanne
Angela Jang
Roger Jiang
Haozhun Jiang
Denny Jin
Shino Jin
Billie Jomoto
Hee-woo Jonn
Tomer Jun
Łukasz Kaftan
Ali Kaiser
Ingmar Ka-mali
Kanitscheider
Nitish Shirish
Keskar Tabarak
Logan Khan
J. Kilpatrick
Kim Christina
Yongjik Kim
Jan Hendrik Kim
Jamie Kirch-ner
Matt Kiros
Daniel Knight
Kokotajlo Łukasz
A. Kondraciuk
Aris Kondrich
Kyle Kon-stantinidis
Gretchen Kosic
Vishal Krueger
Michael Kuo
Ikai Lampe
Teddy Lan
Jan Lee
Jade Leike
Daniel Leung
Chak Ming Levy
Li Rachel
Molly Lim
Stephanie Lin
Mateusz Lin
Theresa Litwin
Ryan Lopez
Patricia Lowe
Lue Anna
Kim Makanju
S. Malfacini
Todor Manning
Yaniv Markov
Bianca Markovski
Katie Martin
Andrew Mayer
Bob Mayne
Scott Mayer McGrew
Christine McKinney
Paul McLeavey
McMillan Jake
David McNeil
Aalok Medina
Jacob Mehta
Luke Menick
Andrey Metz
Pamela Mishchenko
Vinnie Mishkin
Evan Monaco
Daniel Morikawa
Tong Mossing
Mira Mu
Oleg Murati
David Murk
Ashvin Mély
Reiichiro Nair
Rajeev Nakano
Nayak Arvind
Richard Neelakantan
Hyeonwoo Ngo
Noh Long
Cullen Ouyang
Jakub O’Keefe
Alex Pachocki
J. Paino
Ashley Palermo
Pantuliano
Carl Ross
Bob Rotsted
Henri Roussez
Nick Ry-der
Mario Saltarelli
Ted Sanders
Shibani Santurkar
Girish Sastry
Heather Schmidt
David Schnurr
John Schulman
Daniel Selsam
Kyla Sheppard
Toki Sherbakov
Jessica Shieh
Sarah Shoker
Pranav Shyam
Szymon Sidor
Eric Sigler
Maddie Simens
Jordan Sitkin
Katarina Slama
Ian Sohl
Benjamin D. Sokolowsky
Yang Song
Natalie Staudacher
Clemens Winter
Samuel Wolrich
Hannah Wong
Lauren Workman
Sherwin Wu
Michael Wu
Kai Xiao
Tao Xu
Sarah Yoo
Kevin Yu
Qim-ing Yuan
Wojciech Zaremba
Rowan G. Zellers
Chong Zhang
Marvin Zhang
Tianhao Shengjia Zhao
Ouyang Long
Jeff Wu
Xu Jiang
Diogo Almeida
C. Wainwright
Pamela Mishkin
Sandhini Agarwal
Alex Ray
Jacob Hilton
Fraser Kelton
Luke Miller
Amanda Askell
Peter Welinder
Paul F. Christiano
Jan Leike
Ryan Lowe. 2022
Adam Paszke
Sam Gross
Francisco Massa
Adam Lerer
Gregory Chanan
Trevor Killeen
Ze-Bin Lin
Natalia Gimelshein
L. Antiga
Alban Desmaison
Andreas Köpf
Edward Yang
Zachary DeVito
Martin Raison
A. Tejani
Sasank Chilamkurthy
Benoit Steiner
Giovanni Puccetti
Anna Rogers
Aleksandr Drozd
Felice
Dell’Orletta. 2022. Outlier
Alec Radford
Jong Wook Kim
Chris Hallacy
Aditya Ramesh
Gabriel Goh
Girish Sas-try
J. Clark
Rewon Child
David Luan
Victor Sanh
Alex Webson
Colin Raffel
Stephen H. Bach
Lintang A. Sutawika
Zaid Alyafeai
Antoine Chaffin
Arnaud Stiegler
Arun Raja
Manan Dey
Saiful Bari
Canwen Xu
Urmish Thakker
Shanya Sharma Sharma
Eliza Szczechla
Taewoon Kim 0002
Gunjan Chhablani
Ni-hal Nayak
Debajyoti Datta
Mike Jonathan Chang
Tian-Jian Jiang
Han Wang
Matteo Manica
Sheng Shen
Zheng-Xin Yong
Harshit Pandey
Rachel Bawden
Thomas Wang
Trishala Neeraj
Jos Rozen
Abheesht Sharma
Thibault Févry
Jason Alan Fries
Ryan Teehan
Teven Le Scao
Stella Biderman
Leo Gao
Thomas Wolf 0008
A. M. R. 2022
Multi-task
Richard Socher
Alex Perelygin
Jean Wu
Jason Chuang
Christopher D Manning
Andrew Ng
Christopher Potts
Recursive
Aarohi Srivastava
Abhinav Rastogi
Abhishek Rao
Abu Awal
Md. Shoeb
Abubakar Abid
Adam Fisch
Adam R. Brown
Adam Santoro
Aditya Gupta
Adrià Garriga-Alonso
Agnieszka Kluska
Aitor Lewkowycz
Akshat Agarwal
Alethea Power
Alex Warstadt
Alexander W. Kocurek
Ali Safaya
Ali Tazarv
Alice Xiang
Alicia Parrish
Allen Nie
Aman Hussain
Amanda Dsouza
Ameet Rahane
Anantharaman S. Iyer
Anders Johan Andreassen
Andrea Madotto
Andrea Santilli
Andreas Stuhlmüller
Andrew La
Andrew Lampinen
Andy Zou
Angela Jiang
Angelica Chen
Anh Vuong
Animesh Gupta
Anna Gottardi
Antonio Norelli
Anu Venkatesh
Arash Gholamidavoodi
Arfa Tabassum
Arul Menezes
Arun Kirubara-jan
Asher Mullokandov
Ashish Sabharwal
Austin Herrick
Avia Efrat
Aykut Erdem
Ayla Karaka¸s
Ryan Roberts
Bao Sheng Loe
Barret Zoph
Bartłomiej Bojanowski
Batuhan Özyurt
Behnam Hedayatnia
Behnam Neyshabur
Benjamin Inden
Benno Stein
Berk Ekmekci
Bill Yuchen
Blake Lin
Bryan Howald
Cameron Orinion
Cameron Diao
Catherine Dour
Cedrick Stinson
César Argueta
Chandan Ferri
Charles Singh
Chenlin Rathkopf
Chitta Meng
C. Baral
Chris Wu
Chris Callison-Burch
Christopher Waites
Christo-pher D Voigt
Cindy Potts
E. RamirezClara
Clemencia Rivera
Colin Siro
Court-ney Raffel
Cristina Ashcraft
Damien Garbacea
Sileo Dan
Dan Garrette
Dan Hendrycks
Dan Kilman
C. Roth
C. Daniel Freeman
Daniel Khashabi
Daniel Levy
Daniel Moseguí González
Danielle Perszyk
Danny Hernandez
Danqi Chen
Training Language Models to Self-Correct via Reinforcement Learning
Aviral Kumar
Vincent Zhuang
Yi Su
John D Co-Reyes
Avi Singh
Kate Baumli
Shariq Iqbal
Colton Bishop
Rebecca Roelofs
Lei M Zhang
Kay McKinney
Disha Shrivastava
Cosmin Paduraru
George Tucker
Feryal Behbahani
Aleksandra Faust
Self-correction is a highly desirable capability of large language models (LLMs), yet it has consistently been found to be largely ineffecti… (see more)ve in modern LLMs. Existing approaches for training self-correction either require multiple models or rely on a more capable model or other forms of supervision. To this end, we develop a multi-turn online reinforcement learning (RL) approach, SCoRe, that significantly improves an LLM's self-correction ability using entirely self-generated data. To build SCoRe, we first show that variants of supervised fine-tuning (SFT) on offline model-generated correction traces are insufficient for instilling self-correction behavior. In particular, we observe that training via SFT either suffers from a distribution mismatch between the training data and the model's own responses or implicitly prefers only a certain mode of correction behavior that is often not effective at test time. SCoRe addresses these challenges by training under the model's own distribution of self-generated correction traces and using appropriate regularization to steer the learning process into learning a self-correction strategy that is effective at test time as opposed to simply fitting high-reward responses for a given prompt. This regularization prescribes running a first phase of RL on a base model to generate a policy initialization that is less susceptible to collapse and then using a reward bonus to amplify self-correction during training. When applied to Gemini 1.0 Pro and 1.5 Flash models, we find that SCoRe achieves state-of-the-art self-correction performance, improving the base models' self-correction by 15.6% and 9.1% respectively on the MATH and HumanEval benchmarks.
Asynchronous RLHF: Faster and More Efficient Off-Policy RL for Language Models
Michael Noukhovitch
Shengyi Huang
Sophie Xhonneux
Arian Hosseini
The dominant paradigm for RLHF is online and on-policy RL: synchronously generating from the large language model (LLM) policy, labelling wi… (see more)th a reward model, and learning using feedback on the LLM's own outputs. While performant, this paradigm is computationally inefficient. Inspired by classical deep RL literature, we propose separating generation and learning in RLHF. This enables asynchronous generation of new samples while simultaneously training on old samples, leading to faster training and more compute-optimal scaling. However, asynchronous training relies on an underexplored regime, online but off-policy RLHF: learning on samples from previous iterations of our model. To understand the challenges in this regime, we investigate a fundamental question: how much off-policyness can we tolerate for asynchronous training to speed up learning but maintain performance? Among several RLHF algorithms we tested, we find that online DPO is most robust to off-policy data, and robustness increases with the scale of the policy model. We study further compute optimizations for asynchronous RLHF but find that they come at a performance cost, giving rise to a trade-off. Finally, we verify the scalability of asynchronous RLHF by training LLaMA 3.1 8B on an instruction-following task 40% faster than a synchronous run while matching final performance.
Faster, More Efficient RLHF through Off-Policy Asynchronous Learning
Michael Noukhovitch
Shengyi Huang
Sophie Xhonneux
Arian Hosseini
To achieve state-of-the-art chatbots, large language models are finetuned with reinforcement learning (RL), frequently to optimize human fee… (see more)dback (RLHF). This process is computationally expensive and can take weeks. Offline approaches, like DPO, learn on a static dataset and are efficient but not performant. The dominant paradigm, online and on-policy---synchronously generating from the model, labelling with a reward model, and learning on feedback from the model's own outputs---is performant but not efficient. Following prior work in the generall deep RL setting, we propose separating the actor and learner in RLHF. This enables the asynchronously generation of new samples while learning on prior samples, thus leading to overall faster training and better scaling. But this requires a novel regime for RLHF, online but off-policy: learning on samples from a previous version of our model. We ask a fundamental question: how much off-policyness can we tolerate for asynchronous training to speed up learning but maintain performance? We find that a contrastive loss, Online DPO, is most robust to off-policy data and that robustness increases with the scale of the policy model. We show even further compute optimizations but demonstrate that they come at a performance cost, giving rise to a trade-off. Finally, we verify our design choices by training LLaMA 3.1 8B with RLHF as a helpful chatbot in half the time of a synchronous run while matching final performance.