Rishabh Agarwal

Mehran Kazemi

Dan Malkin

Ravin Kumar

David Vilar

Idan Brusilovsky

Jiaming Luo

Andreas Steiner

Abe Friesen

Abhanshu Sharma

Abheesht Sharma

Adi Mayrav Gilady

Adrian Goedeckemeyer

Alaa Saade

Alexander Kolesnikov

Alexei Bendebury

Alvin Abdagic

Amit Vadi

András György

André Susano Pinto

Anil Das

Ankur Bapna

Antoine Miech

Antoine Yang

Antonia Paterson

Ashish Shenoy

Ayan Chakrabarti

Bilal Piot

Boxi Wu

Bobak Shahriari

Bryce Petrini

Charlie Chen

Charline Le Lan

Christopher A. Choquette-Choo

CJ Carey

Cormac Brick

Daniel Deutsch

Danielle Eisenbud

Dee Cattle

Derek Cheng

Dimitris Paparas

Divyashree Shivakumar Sreepathihalli

Doug Reid

Dustin Tran

Dustin Zelle

Eric Noland

Erwin Huizenga

Eugene Kharitonov

Frederick Liu

Gagik Amirkhanyan

Glenn Cameron

Hadi Hashemi

Hanna Klimczak-Plucińska

Harman Singh

Harsh Mehta

Harshal Tushar Lehri

Hussein Hazimeh

Ian Ballantyne

Idan Szpektor

Ivan Nardini

Jean Pouget-Abadie

Jetha Chan

Joe Stanton

J. Michael Wieting

Jonathan Lai

Jordi Orbay

Joe Fernandez

Joshua Newlan

Junsong Ji

Jyotinder Singh

Kat Black

Kathy Yu

Kevin Hui

Kiran N. Vodrahalli

Klaus Greff

Linhai Qiu

Marcella Valentine

Marina Coelho

Marvin Ritter

Matt Hoffman

Matthew Watson

Mayank Chaturvedi

Michael Moynihan

Min Ma

Nabila Babar

Natasha Noy

Nathan Byrd

Nick Roy

Nikola Momchev

Nilay Chauhan

Oskar Bunyan

Pankil Botarda

Paul Caron

Paul Kishan Rubenstein

Phil Culliton

Philipp Schmid

Pier Giuseppe Sessa

Pingmei Xu

Piotr Stańczyk

Pouya Dehghani Tafti

Rakesh Shivanna

Renjie Wu

Renke Pan

R. Rokni

Rob Willoughby

Rohith Vallu

Ryan Mullins

Sammy Jerome

Sara Smoot

Sertan Girgin

Shariq Iqbal

Shashir Reddy

Shruti Sheth

Siim Põder

Sijal Bhatnagar

S. Panyam

Sivan Eiger

Susan Zhang

Tianqi Liu

Trevor Yacovone

T. Liechty

Uday Kalra

Utku Evci

Vedant Misra

Vincent Roseberry

Vladimir Feinberg

Vlad Kolesnikov

Woohyun Han

Woosuk Kwon

X. T. Chen

Yinlam Chow

Yuvein Zhu

Zichuan Wei

Z. Egyed

Victor Cotruta

Minh Giang

Phoebe Kirk

Anand Rao

Jessica Lo

Erica Moreira

Luiz Gustavo Martins

Omar Sanseviero

Lucas Gonzalez

Zach Gleicher

Tris Brian Warkentin

Seyed Vahab Mirrokni

Evan Senter

Eli Collins

Joelle Barral

Zoubin Ghahramani

Raia Hadsell

Yossi Matias

D. Sculley

Slav Petrov

Noah Fiedel

Noam M. Shazeer

Oriol Vinyals

Jeffrey Dean

Demis Hassabis

Koray Kavukcuoglu

Clément Farabet

Elena Buchatskaya

Jean-Baptiste Alayrac

Rohan Anil

Dmitry Lepikhin

Sebastian Borgeaud

Olivier Bachem

Armand Joulin

Alek Andreev

Cassidy Hardin

Robert Dadashi

Léonard Hussenot

2025-03-25

ArXiv (preprint)

Gemma 3 Technical Report

Gemma Team

Aishwarya Kamath

Johan Ferret

Shreya Pathak

Nino Vieillard

Ramona Merhej

Sarah Perrin

Tatiana Matejovicova

Alexandre Ramé

Morgane Rivière

Louis Rouillard

Thomas Mesnard

Geoffrey Cideron

Jean-Bastien Grill

Sabela Ramos

Edouard Yvinec

Michelle Casbon

Etienne Pot

Ivo Penchev

Gael Liu

Francesco Visin

Kathleen Kenealy

Lucas Beyer

Xiaohai Zhai

Anton Tsitsulin

Róbert Busa-Fekete

Alex Feng

Noveen Sachdeva

Benjamin Coleman

Yi Gao

Basil Mustafa

Iain Barr

Emilio Parisotto

David Tian

Matan Eyal

Colin Cherry

Jan-Thorsten Peter

Danila Sinopalnikov

Surya Bhupatiraju

Mehran Kazemi

Dan Malkin

Ravin Kumar

David Vilar

Idan Brusilovsky

Jiaming Luo

Andreas Steiner

Abe Friesen

Abhanshu Sharma

Abheesht Sharma

Adi Mayrav Gilady

Adrian Goedeckemeyer

Alaa Saade

Alexander Kolesnikov

Alexei Bendebury

Alvin Abdagic

Amit Vadi

András György

André Susano Pinto

Anil Das

Ankur Bapna

Antoine Miech

Antoine Yang

Antonia Paterson

Ashish Shenoy

Ayan Chakrabarti

Bilal Piot

Boxi Wu

Bobak Shahriari

Bryce Petrini

Charlie Chen

Charline Le Lan

Christopher A. Choquette-Choo

CJ Carey

Cormac Brick

Daniel Deutsch

Danielle Eisenbud

Dee Cattle

Derek Cheng

Dimitris Paparas

Divyashree Shivakumar Sreepathihalli

Doug Reid

Dustin Tran

Dustin Zelle

Eric Noland

Erwin Huizenga

Eugene Kharitonov

Frederick Liu

Gagik Amirkhanyan

Glenn Cameron

Hadi Hashemi

Hanna Klimczak-Plucińska

Harman Singh

Harsh Mehta

Harshal Tushar Lehri

Hussein Hazimeh

Ian Ballantyne

Idan Szpektor

Ivan Nardini

Jean Pouget-Abadie

Jetha Chan

Joe Stanton

J. Michael Wieting

Jonathan Lai

Jordi Orbay

Joe Fernandez

Joshua Newlan

Junsong Ji

Jyotinder Singh

Kat Black

Kathy Yu

Kevin Hui

Kiran N. Vodrahalli

Klaus Greff

Linhai Qiu

Marcella Valentine

Marina Coelho

Marvin Ritter

Matt Hoffman

Matthew Watson

Mayank Chaturvedi

Michael Moynihan

Min Ma

Nabila Babar

Natasha Noy

Nathan Byrd

Nick Roy

Nikola Momchev

Nilay Chauhan

Oskar Bunyan

Pankil Botarda

Paul Caron

Paul Kishan Rubenstein

Phil Culliton

Philipp Schmid

Pier Giuseppe Sessa

Pingmei Xu

Piotr Stańczyk

Pouya Dehghani Tafti

Rakesh Shivanna

Renjie Wu

Renke Pan

R. Rokni

Rob Willoughby

Rohith Vallu

Ryan Mullins

Sammy Jerome

Sara Smoot

Sertan Girgin

Shariq Iqbal

Shashir Reddy

Shruti Sheth

Siim Põder

Sijal Bhatnagar

S. Panyam

Sivan Eiger

Susan Zhang

Tianqi Liu

Trevor Yacovone

T. Liechty

Uday Kalra

Utku Evci

Vedant Misra

Vincent Roseberry

Vladimir Feinberg

Vlad Kolesnikov

Woohyun Han

Woosuk Kwon

X. T. Chen

Yinlam Chow

Yuvein Zhu

Zichuan Wei

Z. Egyed

Victor Cotruta

Minh Giang

Phoebe Kirk

Anand Rao

Jessica Lo

Erica Moreira

Luiz Gustavo Martins

Omar Sanseviero

Lucas Gonzalez

Zach Gleicher

Tris Brian Warkentin

Seyed Vahab Mirrokni

Evan Senter

Eli Collins

Joelle Barral

Zoubin Ghahramani

Raia Hadsell

Yossi Matias

D. Sculley

Slav Petrov

Noah Fiedel

Noam M. Shazeer

Oriol Vinyals

Jeffrey Dean

Demis Hassabis

Koray Kavukcuoglu

Clément Farabet

Elena Buchatskaya

Jean-Baptiste Alayrac

Rohan Anil

Dmitry Lepikhin

Sebastian Borgeaud

Olivier Bachem

Armand Joulin

Alek Andreev

Cassidy Hardin

Robert Dadashi

Léonard Hussenot

We introduce Gemma 3, a multimodal addition to the Gemma family of lightweight open models, ranging in scale from 1 to 27 billion parameters. This version introduces vision understanding abilities, a wider coverage of languages and longer context - at least 128K tokens. We also change the architecture of the model to reduce the KV-cache memory that tends to explode with long context. This is achieved by increasing the ratio of local to global attention layers, and keeping the span on local attention short. The Gemma 3 models are trained with distillation and achieve superior performance to Gemma 2 for both pre-trained and instruction finetuned versions. In particular, our novel post-training recipe significantly improves the math, chat, instruction-following and multilingual abilities, making Gemma3-4B-IT competitive with Gemma2-27B-IT and Gemma3-27B-IT comparable to Gemini-1.5-Pro across benchmarks. We release all our models to the community.

2025-03-25

ArXiv (preprint)
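
The Gemma 3 abstract above attributes the KV-cache saving to using more local attention layers per global layer and keeping the local span short. The following is a rough back-of-the-envelope sketch of that arithmetic, not the report's own accounting. The layer counts, window size, head count, head dimension and 2-byte values are all hypothetical illustration values, and `kv_cache_bytes` is a made-up helper name.

```python
def kv_cache_bytes(num_global_layers, num_local_layers, context_len, window,
                   num_kv_heads=8, head_dim=128, bytes_per_value=2):
    """Approximate KV-cache size in bytes for a single sequence.

    Assumption: a global layer caches keys and values for every token in the
    context, while a local (sliding-window) layer only caches the most recent
    `window` tokens. All default sizes are hypothetical.
    """
    # Keys and values (factor 2) for one token in one layer.
    per_token_per_layer = 2 * num_kv_heads * head_dim * bytes_per_value
    global_tokens = context_len
    local_tokens = min(window, context_len)
    cached_tokens = (num_global_layers * global_tokens
                     + num_local_layers * local_tokens)
    return cached_tokens * per_token_per_layer


if __name__ == "__main__":
    context = 128 * 1024  # 128K tokens, the context length named in the abstract
    window = 1024         # hypothetical local attention span
    for label, n_global, n_local in [("all global", 48, 0),
                                     ("1 local : 1 global", 24, 24),
                                     ("5 local : 1 global", 8, 40)]:
        gib = kv_cache_bytes(n_global, n_local, context, window) / 2**30
        print(f"{label:>20}: {gib:.2f} GiB")
```

Under these assumptions the cache is dominated by the global layers, so moving layers from global to local shrinks it almost in proportion to the number of global layers removed.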

Asynchronous RLHF: Faster and More Efficient Off-Policy RL for Language Models

Michael Noukhovitch

Shengyi Huang

Sophie Xhonneux

Arian Hosseini

The dominant paradigm for RLHF is online and on-policy RL: synchronously generating from the large language model (LLM) policy, labelling with a reward model, and learning using feedback on the LLM's own outputs. While performant, this paradigm is computationally inefficient. Inspired by classical deep RL literature, we propose separating generation and learning in RLHF. This enables asynchronous generation of new samples while simultaneously training on old samples, leading to faster training and more compute-optimal scaling. However, asynchronous training relies on an underexplored regime, online but off-policy RLHF: learning on samples from previous iterations of our model. To understand the challenges in this regime, we investigate a fundamental question: how much off-policyness can we tolerate for asynchronous training to speed up learning but maintain performance? Among several RLHF algorithms we tested, we find that online DPO is most robust to off-policy data, and robustness increases with the scale of the policy model. We study further compute optimizations for asynchronous RLHF but find that they come at a performance cost, giving rise to a trade-off. Finally, we verify the scalability of asynchronous RLHF by training LLaMA 3.1 8B on an instruction-following task 40% faster than a synchronous run while matching final performance.

2025-01-22

ICLR.cc/2025/Conference (poster)
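
The abstract above describes an "online but off-policy" regime in which the learner trains on samples produced by earlier versions of the policy. The toy sketch below only illustrates that bookkeeping and is not the paper's system: it runs sequentially, whereas a real setup would run generation and learning concurrently. `simulate_async_rlhf` and its `staleness` parameter are hypothetical names introduced here.

```python
from collections import deque


def simulate_async_rlhf(num_updates, staleness):
    """Return (learner_version, generating_version) pairs, one per update.

    Assumption: the generator is allowed to run `staleness` batches ahead of
    the learner, so each batch is tagged with the policy version that
    produced it and consumed in first-in, first-out order.
    """
    policy_version = 0
    pending = deque()
    # Pre-fill the queue so the learner never has to wait for generation.
    for _ in range(staleness):
        pending.append(policy_version)
    log = []
    for _ in range(num_updates):
        pending.append(policy_version)     # generator samples with current weights
        batch_version = pending.popleft()  # learner takes the oldest batch
        log.append((policy_version, batch_version))
        policy_version += 1                # one gradient update
    return log


if __name__ == "__main__":
    for learner_v, batch_v in simulate_async_rlhf(num_updates=5, staleness=2):
        print(f"update on policy v{learner_v} uses samples from v{batch_v}")
```

With `staleness=0` every batch comes from the current policy, which corresponds to the synchronous on-policy case. Larger values trade freshness of the data for keeping both generation and training busy.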

Faster, More Efficient RLHF through Off-Policy Asynchronous Learning

Michael Noukhovitch

Shengyi Huang

Sophie Xhonneux

Arian Hosseini

To achieve state-of-the-art chatbots, large language models are finetuned with reinforcement learning (RL), frequently to optimize human feedback (RLHF). This process is computationally expensive and can take weeks. Offline approaches, like DPO, learn on a static dataset and are efficient but not performant. The dominant paradigm, online and on-policy (synchronously generating from the model, labelling with a reward model, and learning on feedback from the model's own outputs), is performant but not efficient. Following prior work in the general deep RL setting, we propose separating the actor and learner in RLHF. This enables the asynchronous generation of new samples while learning on prior samples, thus leading to overall faster training and better scaling. But this requires a novel regime for RLHF, online but off-policy: learning on samples from a previous version of our model. We ask a fundamental question: how much off-policyness can we tolerate for asynchronous training to speed up learning but maintain performance? We find that a contrastive loss, Online DPO, is most robust to off-policy data and that robustness increases with the scale of the policy model. We show even further compute optimizations but demonstrate that they come at a performance cost, giving rise to a trade-off. Finally, we verify our design choices by training LLaMA 3.1 8B with RLHF as a helpful chatbot in half the time of a synchronous run while matching final performance.

2025-01-22

ICLR.cc/2025/Conference (poster)

Generating Complex Question Decompositions in the Face of Distribution Shifts

Kelvin Han

Claire Gardent

2025-01-01

NAACL (Long Papers) (published)

Inference-Aware Fine-Tuning for Best-of-N Sampling in Large Language Models

Yinlam Chow

Guy Tennenholtz

Izzeddin Gur

Vincent Zhuang

Bo Dai

Sridhar Thiagarajan

Craig Boutilier

Aviral Kumar

Aleksandra Faust

Recent studies have indicated that effectively utilizing inference-time compute is crucial for attaining better performance from large language models (LLMs). In this work, we propose a novel inference-aware fine-tuning paradigm, in which the model is fine-tuned in a manner that directly optimizes the performance of the inference-time strategy. We study this paradigm using the simple yet effective Best-of-N (BoN) inference strategy, in which a verifier selects the best out of a set of LLM-generated responses. We devise the first imitation learning and reinforcement learning (RL) methods for BoN-aware fine-tuning, overcoming the challenging, non-differentiable argmax operator within BoN. We empirically demonstrate that our BoN-aware models implicitly learn a meta-strategy that interleaves best responses with more diverse responses that might be better suited to a test-time input, a process reminiscent of the exploration-exploitation trade-off in RL. Our experiments demonstrate the effectiveness of BoN-aware fine-tuning in terms of improved performance and inference-time compute. In particular, we show that our methods improve the Bo32 performance of Gemma 2B on Hendrycks MATH from 26.8% to 30.8%, and pass@32 from 60.0% to 67.0%, as well as the pass@16 on HumanEval from 61.6% to 67.1%.

2025-01-01

ICLR (published)
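
The Best-of-N strategy named in the abstract above (a verifier picks the best of N sampled responses) can be sketched in a few lines. This is a minimal illustration of the inference-time procedure only, not the paper's BoN-aware fine-tuning methods. `best_of_n`, `generate` and `verifier` are hypothetical names, and the toy generator and scorer stand in for an LLM and a learned verifier.

```python
import random
from typing import Callable


def best_of_n(prompt: str,
              generate: Callable[[str], str],
              verifier: Callable[[str, str], float],
              n: int) -> str:
    """Sample n candidate responses and return the one the verifier scores highest."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate(prompt) for _ in range(n)]
    scores = [verifier(prompt, candidate) for candidate in candidates]
    # The argmax over candidates is the non-differentiable step the abstract mentions.
    best_index = max(range(n), key=lambda i: scores[i])
    return candidates[best_index]


if __name__ == "__main__":
    rng = random.Random(0)

    def toy_generate(prompt: str) -> str:
        # Stand-in for sampling from a language model.
        return str(rng.randint(0, 9))

    def toy_verifier(prompt: str, response: str) -> float:
        # Stand-in for a learned verifier: answers closer to 4 score higher.
        return -abs(int(response) - 4)

    print(best_of_n("What is 2 + 2?", toy_generate, toy_verifier, n=8))
```

Because selection happens only after sampling, the base model is never told that N samples will be drawn. That gap is what the abstract's inference-aware fine-tuning is described as addressing.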

On the Analysis and Distillation of Emergent Outlier Properties in Pre-trained Language Models

Tianyang Zhao

Kunwar Yashraj Singh

Srikar Appalaraju

Peng Tang

Ying Nian Wu

Li Erran Li

Li

Nino Vieillard

Yongchao Zhou

Piotr Stańczyk

Sabela Ramos Garea

Matthieu Geist

Rohan Anil

Andrew M. Dai

Melvin Orhan Firat

Dmitry Lepikhin

Alexandre Passos

Siamak Shakeri

Emanuel Taropa … (see 478 more)

Paige Bailey

Zhifeng Chen

Eric Chu

Jonathan H. Clark

Laurent El

Yanping Huang

K. Meier-Hellstern

Gaurav Mishra

Erica Moreira

Mark Omernick

Kevin Robinson

Sebastian Ruder

Yi Tay

Kefan Xiao

Yuanzhong Xu

Yujing Zhang

Gustavo Hernández Abrego

Junwhan Ahn

Jacob Austin

Paul R. Barham

Jan Botha

James Bradbury

Siddhartha Brahma

Kevin Brooks

M. Catasta

Yong Cheng

Colin Cherry

Christopher A. Choquette-Choo

Aakanksha Chowdhery

Clé-ment Crepy

Shachi Dave

Mostafa Dehghani

Sunipa Dev

Jacob Devlin

Mark Díaz

Nan Du

Ethan Dyer

Vladimir Feinberg

Fangxiaoyu Feng

Vlad Fienber

Markus Freitag

Xavier Garcia

Sebastian Gehrmann

Lucas Gonzalez

Guy Gur-Ari

Steven Hand

Hadi Hashemi

Le Hou

Joshua Howland

Andrea Hu

Jeffrey Hui

Jeremy Hur-witz

Michael Acheson Isard

Abe Ittycheriah

Matthew Jagiel-ski

Wenhao Jia

Kathleen Kenealy

M. Krikun

Sneha Kudugunta 0001

Chang Lan

Kather-ine Lee

Benjamin Lee

Music Eric Li

Wei Li

YaGuang Li

Li Jian

Hyeontaek Li

Hanzhao Lim

Zhongtao Lin

Liu Frederick

Marcello Liu

Aroma Maggioni

Mahendru Joshua

Vedant Maynez

Maysam Misra

Moussalem Zachary

John Nado

E. Nham

Andrew Ni

Alicia Nys-trom

Marie Parrish

M. Pellat

Polacek Alex

Reiner Polozov

Siyuan Pope

Emily Qiao

Reif Bryan

Parker Richter

Alex Riley

Castro Ros

Aurko Roy

Brennan Saeta

Rajkumar Samuel

Renee Shelby

Ambrose Slone

Daniel Smilkov

David R. So

Daniel Sohn

Simon Tokumine

Dasha Valter

Haim-ing Bao

Mo Bavarian

Jeff Belgum

Ir-wan Bello

Jake Berdine

Gabriel Bernadett-Shapiro

Christopher Berner

Lenny Bogdonoff

Oleg Boiko

Madelaine Boyd

Anna-Luisa Brakman

Greg Brock-man

Tim Brooks

M. Brundage

Kevin Button

Trevor Cai

Rosie Campbell

Andrew Cann

Brittany Carey

Chelsea Carlson

Rory Carmichael

Brooke Chan

Che Chang

Fotis Chantzis

Derek Chen

Sully Chen

Ruby Chen

Jason Chen

Mark Chen

Benjamin Chess

Chester Cho

Hyung Casey Chu

Won Chung

Dave Cummings

Jeremiah Currier

Yunxing Dai

Tarun Goel

Gabriel Gogineni

Rapha Goh

Jonathan Gontijo-Lopes

Morgan Gordon

Scott Grafstein

Ryan Gray

Joshua Greene

Shixiang Shane Gross

Yufei Gu

Chris Guo

Jesse Hallacy

Jeff Han

Harris Yuchen

Mike He

Johannes Heaton

C. Heidecke

Alan Hesse

Wade Hickey

Peter Hickey

Hoeschele Brandon

Kenny Houghton

Shengli Hsu

Xin Hu

Joost Hu

Shantanu Huizinga

Shawn Jain

Jain Joanne

Angela Jang

Roger Jiang

Haozhun Jiang

Denny Jin

Shino Jin

Billie Jomoto

Hee-woo Jonn

Tomer Jun

Łukasz Kaftan

Ali Kaiser

Ingmar Ka-mali

Kanitscheider

Nitish Shirish

Keskar Tabarak

Logan Khan

J. Kilpatrick

Kim Christina

Yongjik Kim

Jan Hendrik Kim

Jamie Kirch-ner

Matt Kiros

Daniel Knight

Kokotajlo Łukasz

A. Kondraciuk

Aris Kondrich

Kyle Kon-stantinidis

Gretchen Kosic

Vishal Krueger

Michael Kuo

Ikai Lampe

Teddy Lan

Jan Lee

Jade Leike

Daniel Leung

Chak Ming Levy

Li Rachel

Molly Lim

Stephanie Lin

Mateusz Lin

Theresa Litwin

Ryan Lopez

Patricia Lowe

Lue Anna

Kim Makanju

S. Malfacini

Todor Manning

Yaniv Markov

Bianca Markovski

Katie Martin

Andrew Mayer

Bob Mayne

Scott Mayer McGrew

Christine McKinney

Paul McLeavey

McMillan Jake

David McNeil

Aalok Medina

Jacob Mehta

Luke Menick

Andrey Metz

Pamela Mishchenko

Vinnie Mishkin

Evan Monaco

Daniel Morikawa

Tong Mossing

Mira Mu

Oleg Murati

David Murk

Ashvin Mély

Reiichiro Nair

Rajeev Nakano

Nayak Arvind

Richard Neelakantan

Hyeonwoo Ngo

Noh Long

Cullen Ouyang

Jakub O’Keefe

Alex Pachocki

J. Paino

Ashley Palermo

Pantuliano

Carl Ross

Bob Rotsted

Henri Roussez

Nick Ry-der

Mario Saltarelli

Ted Sanders

Shibani Santurkar

Girish Sastry

Heather Schmidt

David Schnurr

John Schulman

Daniel Selsam

Kyla Sheppard

Toki Sherbakov

Jessica Shieh

Sarah Shoker

Pranav Shyam

Szymon Sidor

Eric Sigler

Maddie Simens

Jordan Sitkin

Katarina Slama

Ian Sohl

Benjamin D. Sokolowsky

Yang Song

Natalie Staudacher

Clemens Winter

Samuel Wolrich

Hannah Wong

Lauren Workman

Sherwin Wu

Michael Wu

Kai Xiao

Tao Xu

Sarah Yoo

Kevin Yu

Qim-ing Yuan

Wojciech Zaremba

Rowan G. Zellers

Chong Zhang

Marvin Zhang

Tianhao Shengjia Zhao

Ouyang Long

Jeff Wu

Xu Jiang

Diogo Almeida

C. Wainwright

Pamela Mishkin

Sandhini Agarwal

Alex Ray

Jacob Hilton

Fraser Kelton

Luke Miller

Amanda Askell

Peter Welinder

Paul F. Christiano

Jan Leike

Ryan Lowe. 2022

Adam Paszke

Sam Gross

Francisco Massa

Adam Lerer

Gregory Chanan

Trevor Killeen

Ze-Bin Lin

Natalia Gimelshein

L. Antiga

Alban Desmaison

Andreas Köpf

Edward Yang

Zachary DeVito

Martin Raison

A. Tejani

Sasank Chilamkurthy

Benoit Steiner

Giovanni Puccetti

Anna Rogers

Aleksandr Drozd

Felice

Dell’Orletta. 2022. Outlier

Alec Radford

Jong Wook Kim

Chris Hallacy

Aditya Ramesh

Gabriel Goh

Girish Sas-try

J. Clark

Rewon Child

David Luan

Victor Sanh

Alex Webson

Colin Raffel

Stephen H. Bach

Lintang A. Sutawika

Zaid Alyafeai

Antoine Chaffin

Arnaud Stiegler

Arun Raja

Manan Dey

Saiful Bari

Canwen Xu

Urmish Thakker

Shanya Sharma Sharma

Eliza Szczechla

Taewoon Kim 0002

Gunjan Chhablani

Ni-hal Nayak

Debajyoti Datta

Mike Jonathan Chang

Tian-Jian Jiang

Han Wang

Matteo Manica

Sheng Shen

Zheng-Xin Yong

Harshit Pandey

Rachel Bawden

Thomas Wang

Trishala Neeraj

Jos Rozen

Abheesht Sharma

Thibault Févry

Jason Alan Fries

Ryan Teehan

Teven Le Scao

Stella Biderman

Leo Gao

Thomas Wolf 0008

A. M. R. 2022

Multi-task

Richard Socher

Alex Perelygin

Jean Wu

Jason Chuang

Christopher D Manning

Andrew Ng

Christopher Potts

Recursive

Aarohi Srivastava

Abhinav Rastogi

Abhishek Rao

Abu Awal

Md. Shoeb

Abubakar Abid

Adam Fisch

Adam R. Brown

Adam Santoro

Aditya Gupta

Adrià Garriga-Alonso

Agnieszka Kluska

Aitor Lewkowycz

Akshat Agarwal

Alethea Power

Alex Warstadt

Alexander W. Kocurek

Ali Safaya

Ali Tazarv

Alice Xiang

Alicia Parrish

Allen Nie

Aman Hussain

Amanda Dsouza

Ameet Rahane

Anantharaman S. Iyer

Anders Johan Andreassen

Andrea Madotto

Andrea Santilli

Andreas Stuhlmüller

Andrew La

Andrew Lampinen

Andy Zou

Angela Jiang

Angelica Chen

Anh Vuong

Animesh Gupta

Anna Gottardi

Antonio Norelli

Anu Venkatesh

Arash Gholamidavoodi

Arfa Tabassum

Arul Menezes

Arun Kirubara-jan

Asher Mullokandov

Ashish Sabharwal

Austin Herrick

Avia Efrat

Aykut Erdem

Ayla Karaka¸s

Ryan Roberts

Bao Sheng Loe

Barret Zoph

Bartłomiej Bojanowski

Batuhan Özyurt

Behnam Hedayatnia

Behnam Neyshabur

Benjamin Inden

Benno Stein

Berk Ekmekci

Bill Yuchen

Blake Lin

Bryan Howald

Cameron Orinion

Cameron Diao

Catherine Dour

Cedrick Stinson

César Argueta

Chandan Ferri

Charles Singh

Chenlin Rathkopf

Chitta Meng

C. Baral

Chris Wu

Chris Callison-Burch

Christopher Waites

Christo-pher D Voigt

Cindy Potts

E. RamirezClara

Clemencia Rivera

Colin Siro

Courtney Raffel

Cristina Ashcraft

Damien Garbacea

Sileo Dan

Dan Garrette

Dan Hendrycks

Dan Kilman

C. Roth

C. Daniel Freeman

Daniel Khashabi

Daniel Levy

Daniel Moseguí González

Danielle Perszyk

Danny Hernandez

Danqi Chen

2025-01-01

NAACL (Long Papers) (published)

Training Language Models to Self-Correct via Reinforcement Learning

Aviral Kumar

Vincent Zhuang

Yi Su

John D Co-Reyes

Avi Singh

Kate Baumli

Shariq Iqbal

Colton Bishop

Rebecca Roelofs

Lei M Zhang

Kay McKinney

Disha Shrivastava

Cosmin Paduraru

George Tucker

Doina Precup

Feryal Behbahani

Aleksandra Faust

Self-correction is a highly desirable capability of large language models (LLMs), yet it has consistently been found to be largely ineffective in modern LLMs. Existing approaches for training self-correction either require multiple models or rely on a more capable model or other forms of supervision. To this end, we develop a multi-turn online reinforcement learning (RL) approach, SCoRe, that significantly improves an LLM's self-correction ability using entirely self-generated data. To build SCoRe, we first show that variants of supervised fine-tuning (SFT) on offline model-generated correction traces are insufficient for instilling self-correction behavior. In particular, we observe that training via SFT either suffers from a distribution mismatch between the training data and the model's own responses or implicitly prefers only a certain mode of correction behavior that is often not effective at test time. SCoRe addresses these challenges by training under the model's own distribution of self-generated correction traces and using appropriate regularization to steer the learning process into learning a self-correction strategy that is effective at test time as opposed to simply fitting high-reward responses for a given prompt. This regularization prescribes running a first phase of RL on a base model to generate a policy initialization that is less susceptible to collapse and then using a reward bonus to amplify self-correction during training. When applied to Gemini 1.0 Pro and 1.5 Flash models, we find that SCoRe achieves state-of-the-art self-correction performance, improving the base models' self-correction by 15.6% and 9.1% respectively on the MATH and HumanEval benchmarks.

2025-01-01

ICLR (published)

Asynchronous RLHF: Faster and More Efficient Off-Policy RL for Language Models

Michael Noukhovitch

Shengyi Huang

Sophie Xhonneux

Arian Hosseini

The dominant paradigm for RLHF is online and on-policy RL: synchronously generating from the large language model (LLM) policy, labelling with a reward model, and learning using feedback on the LLM's own outputs. While performant, this paradigm is computationally inefficient. Inspired by classical deep RL literature, we propose separating generation and learning in RLHF. This enables asynchronous generation of new samples while simultaneously training on old samples, leading to faster training and more compute-optimal scaling. However, asynchronous training relies on an underexplored regime, online but off-policy RLHF: learning on samples from previous iterations of our model. To understand the challenges in this regime, we investigate a fundamental question: how much off-policyness can we tolerate for asynchronous training to speed up learning but maintain performance? Among several RLHF algorithms we tested, we find that online DPO is most robust to off-policy data, and robustness increases with the scale of the policy model. We study further compute optimizations for asynchronous RLHF but find that they come at a performance cost, giving rise to a trade-off. Finally, we verify the scalability of asynchronous RLHF by training LLaMA 3.1 8B on an instruction-following task 40% faster than a synchronous run while matching final performance.

2024-10-23

ArXiv (preprint)

Faster, More Efficient RLHF through Off-Policy Asynchronous Learning

Michael Noukhovitch

Shengyi Huang

Sophie Xhonneux

Arian Hosseini

To achieve state-of-the-art chatbots, large language models are finetuned with reinforcement learning (RL), frequently to optimize human feedback (RLHF). This process is computationally expensive and can take weeks. Offline approaches, like DPO, learn on a static dataset and are efficient but not performant. The dominant paradigm, online and on-policy (synchronously generating from the model, labelling with a reward model, and learning on feedback from the model's own outputs), is performant but not efficient. Following prior work in the general deep RL setting, we propose separating the actor and learner in RLHF. This enables the asynchronous generation of new samples while learning on prior samples, thus leading to overall faster training and better scaling. But this requires a novel regime for RLHF, online but off-policy: learning on samples from a previous version of our model. We ask a fundamental question: how much off-policyness can we tolerate for asynchronous training to speed up learning but maintain performance? We find that a contrastive loss, Online DPO, is most robust to off-policy data and that robustness increases with the scale of the policy model. We show even further compute optimizations but demonstrate that they come at a performance cost, giving rise to a trade-off. Finally, we verify our design choices by training LLaMA 3.1 8B with RLHF as a helpful chatbot in half the time of a synchronous run while matching final performance.

2024-10-10

NeurIPS.cc/2024/Workshop/FITML (poster)