Rishabh Agarwal

Yi Su

John D Co-Reyes

Avi Singh

Kate Baumli

Shariq Iqbal

Colton Bishop

Rebecca Roelofs

Lei M Zhang

Kay McKinney

Disha Shrivastava

Cosmin Paduraru

George Tucker

Doina Precup

Feryal Behbahani

Aleksandra Faust

Self-correction is a highly desirable capability of large language models (LLMs), yet it has consistently been found to be largely ineffective in modern LLMs. Existing approaches for training self-correction either require multiple models or rely on a more capable model or other forms of supervision. To this end, we develop a multi-turn online reinforcement learning (RL) approach, SCoRe, that significantly improves an LLM's self-correction ability using entirely self-generated data. To build SCoRe, we first show that variants of supervised fine-tuning (SFT) on offline model-generated correction traces are insufficient for instilling self-correction behavior. In particular, we observe that training via SFT either suffers from a distribution mismatch between the training data and the model's own responses or implicitly prefers only a certain mode of correction behavior that is often not effective at test time. SCoRe addresses these challenges by training under the model's own distribution of self-generated correction traces and using appropriate regularization to steer the learning process into learning a self-correction strategy that is effective at test time as opposed to simply fitting high-reward responses for a given prompt. This regularization prescribes running a first phase of RL on a base model to generate a policy initialization that is less susceptible to collapse and then using a reward bonus to amplify self-correction during training. When applied to Gemini 1.0 Pro and 1.5 Flash models, we find that SCoRe achieves state-of-the-art self-correction performance, improving the base models' self-correction by 15.6% and 9.1% respectively on the MATH and HumanEval benchmarks.

2025-01-01

ICLR (published)

Inference-Aware Fine-Tuning for Best-of-N Sampling in Large Language Models

Yinlam Chow

Guy Tennenholtz

Izzeddin Gur

Vincent Zhuang

Bo Dai

Sridhar Thiagarajan

Craig Boutilier

Aviral Kumar

Aleksandra Faust

Recent studies have indicated that effectively utilizing inference-time compute is crucial for attaining better performance from large language models (LLMs). In this work, we propose a novel inference-aware fine-tuning paradigm, in which the model is fine-tuned in a manner that directly optimizes the performance of the inference-time strategy. We study this paradigm using the simple yet effective Best-of-N (BoN) inference strategy, in which a verifier selects the best out of a set of LLM-generated responses. We devise the first imitation learning and reinforcement learning (RL) methods for BoN-aware fine-tuning, overcoming the challenging, non-differentiable argmax operator within BoN. We empirically demonstrate that our BoN-aware models implicitly learn a meta-strategy that interleaves best responses with more diverse responses that might be better suited to a test-time input -- a process reminiscent of the exploration-exploitation trade-off in RL. Our experiments demonstrate the effectiveness of BoN-aware fine-tuning in terms of improved performance and inference-time compute. In particular, we show that our methods improve the Bo32 performance of Gemma 2B on Hendrycks MATH from 26.8% to 30.8%, and pass@32 from 60.0% to 67.0%, as well as the pass@16 on HumanEval from 61.6% to 67.1%.

2024-12-18

ArXiv (preprint)
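
As a rough illustration of the Best-of-N strategy described in the abstract above, the following minimal sketch scores a set of candidate responses with a verifier and keeps the top one. The names `best_of_n` and `toy_verifier`, and the toy scoring rule, are hypothetical and are not taken from the paper; a real verifier would be a learned model, and the BoN-aware fine-tuning methods themselves are not shown here.

```python
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], verifier: Callable[[str], float]) -> str:
    """Return the candidate with the highest verifier score.

    Python's max() keeps the first item on ties, so selection is deterministic.
    """
    if not candidates:
        raise ValueError("need at least one candidate response")
    return max(candidates, key=verifier)


def toy_verifier(response: str) -> float:
    """Hypothetical stand-in for a learned verifier: rewards answers ending in '4'."""
    return 1.0 if response.strip().endswith("4") else 0.0


# Three sampled responses to the prompt "What is 2 + 2?" (made-up data).
samples = ["2 + 2 = 5", "2 + 2 = 4", "2 + 2 = 22"]
best = best_of_n(samples, toy_verifier)
```

The `max` call is the argmax that the abstract refers to as non-differentiable: gradients cannot flow through the selection step, which is why fine-tuning a model for this inference strategy needs dedicated methods.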

Asynchronous RLHF: Faster and More Efficient Off-Policy RL for Language Models

Shengyi Huang

The dominant paradigm for RLHF is online and on-policy RL: synchronously generating from the large language model (LLM) policy, labelling with a reward model, and learning using feedback on the LLM's own outputs. While performant, this paradigm is computationally inefficient. Inspired by classical deep RL literature, we propose separating generation and learning in RLHF. This enables asynchronous generation of new samples while simultaneously training on old samples, leading to faster training and more compute-optimal scaling. However, asynchronous training relies on an underexplored regime, online but off-policy RLHF: learning on samples from previous iterations of our model. To understand the challenges in this regime, we investigate a fundamental question: how much off-policyness can we tolerate for asynchronous training to speed up learning but maintain performance? Among several RLHF algorithms we tested, we find that online DPO is most robust to off-policy data, and robustness increases with the scale of the policy model. We study further compute optimizations for asynchronous RLHF but find that they come at a performance cost, giving rise to a trade-off. Finally, we verify the scalability of asynchronous RLHF by training LLaMA 3.1 8B on an instruction-following task 40% faster than a synchronous run while matching final performance.

2024-10-23

ArXiv (preprint)

Faster, More Efficient RLHF through Off-Policy Asynchronous Learning

Shengyi Huang

To achieve state-of-the-art chatbots, large language models are finetuned with reinforcement learning (RL), frequently to optimize human feedback (RLHF). This process is computationally expensive and can take weeks. Offline approaches, like DPO, learn on a static dataset and are efficient but not performant. The dominant paradigm, online and on-policy (synchronously generating from the model, labelling with a reward model, and learning on feedback from the model's own outputs), is performant but not efficient. Following prior work in the general deep RL setting, we propose separating the actor and learner in RLHF. This enables the asynchronous generation of new samples while learning on prior samples, thus leading to overall faster training and better scaling. But this requires a novel regime for RLHF, online but off-policy: learning on samples from a previous version of our model. We ask a fundamental question: how much off-policyness can we tolerate for asynchronous training to speed up learning but maintain performance? We find that a contrastive loss, Online DPO, is most robust to off-policy data and that robustness increases with the scale of the policy model. We show even further compute optimizations but demonstrate that they come at a performance cost, giving rise to a trade-off. Finally, we verify our design choices by training LLaMA 3.1 8B with RLHF as a helpful chatbot in half the time of a synchronous run while matching final performance.

2024-10-10

NeurIPS.cc/2024/Workshop/FITML (poster)

Not All LLM Reasoners Are Created Equal

Daniel Toyama

2024-10-09

NeurIPS.cc/2024/Workshop/Sys2-Reasoning (poster)

Not All LLM Reasoners Are Created Equal

Daniel Toyama

We study the depth of grade-school math (GSM) problem-solving capabilities of LLMs. To this end, we evaluate their performance on pairs of existing math word problems chained together so that the answer to the second problem depends on correctly answering the first problem. Our findings reveal a significant reasoning gap in most LLMs, that is, a performance difference between solving the compositional pairs and solving each question independently. This gap is more pronounced in smaller, more cost-efficient, and math-specialized models. Moreover, instruction-tuning recipes and code generation have varying effects across LLM sizes, while finetuning on GSM can lead to task overfitting. Our analysis indicates that large reasoning gaps are not because of test-set leakage, but due to distraction from additional context and poor second-hop reasoning. Overall, LLMs exhibit systematic differences in their reasoning abilities, despite what their performance on standard benchmarks indicates.

2024-10-02

ArXiv (preprint)

Many-Shot In-Context Learning

Avi Singh

Lei M Zhang

Bernd Bohnet

Stephanie C.Y. Chan

Luis Rosias

Biao Zhang

Ankesh Anand

Zaheer Abbas

Azade Nova

John D Co-Reyes

Eric Chu

Feryal Behbahani

Aleksandra Faust

Hugo Larochelle

Large language models (LLMs) excel at few-shot in-context learning (ICL) -- learning from a few examples provided in context at inference, without any weight updates. Newly expanded context windows allow us to investigate ICL with hundreds or thousands of examples – the many-shot regime. Going from few-shot to many-shot, we observe significant performance gains across a wide variety of generative and discriminative tasks. While promising, many-shot ICL can be bottlenecked by the available amount of human-generated outputs. To mitigate this limitation, we explore two new settings: (1) "Reinforced ICL" that uses model-generated chain-of-thought rationales in place of human rationales, and (2) "Unsupervised ICL" where we remove rationales from the prompt altogether, and prompt the model only with domain-specific inputs. We find that both Reinforced and Unsupervised ICL can be quite effective in the many-shot regime, particularly on complex reasoning tasks. We demonstrate that, unlike few-shot learning, many-shot learning is effective at overriding pretraining biases, can learn high-dimensional functions with numerical inputs, and performs comparably to supervised fine-tuning. Finally, we reveal the limitations of next-token prediction loss as an indicator of downstream ICL performance.

2024-09-25

NeurIPS.cc/2024/Conference (spotlight)

Training Language Models to Self-Correct via Reinforcement Learning

Aviral Kumar

Vincent Zhuang

Yi Su

John D Co-Reyes

Avi Singh

Kate Baumli

Shariq N Iqbal

Colton Bishop

Rebecca Roelofs

Lei M Zhang

Kay McKinney

Disha Shrivastava

Cosmin Paduraru

George Tucker

Doina Precup

Feryal Behbahani

Aleksandra Faust

Self-correction is a highly desirable capability of large language models (LLMs), yet it has consistently been found to be largely ineffective in modern LLMs. Current methods for training self-correction typically depend on either multiple models, a more advanced model, or additional forms of supervision. To address these shortcomings, we develop a multi-turn online reinforcement learning (RL) approach, SCoRe, that significantly improves an LLM's self-correction ability using entirely self-generated data. To build SCoRe, we first show that variants of supervised fine-tuning (SFT) on offline model-generated correction traces are often insufficient for instilling self-correction behavior. In particular, we observe that training via SFT falls prey to either a distribution mismatch between mistakes made by the data-collection policy and the model's own responses, or to behavior collapse, where learning implicitly prefers only a certain mode of correction behavior that is often not effective at self-correction on test problems. SCoRe addresses these challenges by training under the model's own distribution of self-generated correction traces and using appropriate regularization to steer the learning process into learning a self-correction behavior that is effective at test time as opposed to fitting high-reward responses for a given prompt. This regularization process includes an initial phase of multi-turn RL on a base model to generate a policy initialization that is less susceptible to collapse, followed by using a reward bonus to amplify self-correction. With Gemini 1.0 Pro and 1.5 Flash models, we find that SCoRe achieves state-of-the-art self-correction performance, improving the base models' self-correction by 15.6% and 9.1% respectively on MATH and HumanEval.

2024-09-19

ArXiv (preprint)

Gemma 2: Improving Open Language Models at a Practical Size

Gemma Team

Morgane Riviere

Shreya Pathak

Pier Giuseppe Sessa

Cassidy Hardin

Surya Bhupatiraju

Léonard Hussenot

Thomas Mesnard

Bobak Shahriari

Alexandre Ramé

Johan Ferret

Peter Liu

Pouya Dehghani Tafti

Abe Friesen

Michelle Casbon

Sabela Ramos

Ravin Kumar

Charline Le Lan

Sammy Jerome

Anton Tsitsulin

Nino Vieillard

Piotr Stańczyk

Sertan Girgin

Nikola Momchev

Matt Hoffman

Shantanu Thakoor

Jean-Bastien Grill

Behnam Neyshabur

Alanna Walton

Aliaksei Severyn

Alicia Parrish

Aliya Ahmad

Allen Hutchison

Alvin Abdagic

Amanda Carl

Amy Shen

Andy Brock

Andy Coenen

Anthony Laforge

Antonia Paterson

Ben Bastian

Bilal Piot

Boxi Wu

Brandon Royal

Charlie Chen

Chintu Kumar

Chris Perry

Christoper A. Welty

Christopher A. Choquette-Choo

Danila Sinopalnikov

David Weinberger

Dimple Vijaykumar

Dominika Rogozińska

D. Herbison

Elisa Bandy

Emma Wang

Eric Noland

Erica Moreira

Evan Senter

Evgenii Eltyshev

Francesco Visin

Gabriel Rasskin

Gary Wei

Glenn Cameron

Gus Martins

Hadi Hashemi

Hanna Klimczak-Plucińska

Harleen Batra

Harsh Dhand

Ivan Nardini

Jacinda Mein

Jack Zhou

James Svensson

Jeff Stanway

Jetha Chan

Jin Zhou

Joana Carrasqueira

Joana Iljazi

Jocelyn Becker

Joe Fernandez

Joost Van Amersfoort

Josh Gordon

Josh Lipschultz

Joshua Newlan

Junsong Ji

Kareem Mohamed

Kartikeya Badola

Kat Black

Katie Millican

Keelin McDonell

Kelvin Nguyen

Kiranbir Sodhia

Kish Greene

Lars Lowe Sjoesund

Lauren Usui

Laurent Sifre

L. Heuermann

Leticia Lago

Lilly McNealus

Livio Baldini Soares

Logan Kilpatrick

Lucas Dixon

Luciano Martins

Machel Reid

Manvinder Singh

Mark Iverson

Martin Gorner

Mat Velloso

Mateo Wirth

Matt Davidow

Matt Miller

Matthew Rahtz

Matthew Watson

Meg Risdal

Mehran Kazemi

Michael Moynihan

Ming Zhang

Minsuk Kahng

Minwoo Park

Mofi Rahman

Mohit Khatwani

Natalie Dao

Nenshad Bardoliwalla

N. Devanathan

Neta Dumai

Nilay Chauhan

O. Wahltinez

Pankil Botarda

Parker Barnes

Paul R. Barham

Paul Michel

Peng-chong Jin

Petko Georgiev

Phil Culliton

Pradeep Kuppala

Ramona Comanescu

Ramona Merhej

Reena Jana

R. Rokni

Ryan Mullins

Samaneh Saadat

S. M. Carthy

Sarah Perrin

Sébastien M. R. Arnold

Sebastian Krause

Shengyang Dai

S. Garg

Shruti Sheth

S. Ronstrom

Susan Chan

Timothy Jordan

Ting Yu

Tom Eccles

Tom Hennigan

Tomas Kocisky

Tulsee Doshi

Vihan Jain

Vikas Yadav

Vilobh Meshram

Vishal Dharmadhikari

Warren Barkley

Wei Wei

Wenming Ye

Woohyun Han

Woosuk Kwon

Xiang Xu

Zhe Shen

Zhitao Gong

Zichuan Wei

Victor Cotruta

Phoebe Kirk

Anand Rao

Minh Giang

Ludovic Peran

Tris Brian Warkentin

Eli Collins

Joelle Barral

Zoubin Ghahramani

Raia Hadsell

D. Sculley

Jeanine Banks

Anca Dragan

Slav Petrov

Oriol Vinyals

Jeffrey Dean

Demis Hassabis

Koray Kavukcuoglu

Clément Farabet

Elena Buchatskaya

Sebastian Borgeaud

Noah Fiedel

Armand Joulin

Kathleen Kenealy

Robert Dadashi

Alek Andreev

In this work, we introduce Gemma 2, a new addition to the Gemma family of lightweight, state-of-the-art open models, ranging in scale from 2 billion to 27 billion parameters. In this new version, we apply several known technical modifications to the Transformer architecture, such as interleaving local-global attentions (Beltagy et al., 2020a) and group-query attention (Ainslie et al., 2023). We also train the 2B and 9B models with knowledge distillation (Hinton et al., 2015) instead of next token prediction. The resulting models deliver the best performance for their size, and even offer competitive alternatives to models that are 2-3 times bigger. We release all our models to the community.

2024-07-31

ArXiv (preprint)

V-STaR: Training Verifiers for Self-Taught Reasoners

Xingdi Yuan

Common self-improvement approaches for large language models (LLMs), such as STaR (Zelikman et al., 2022), iteratively fine-tune LLMs on self-generated solutions to improve their problem-solving ability. However, these approaches discard the large amounts of incorrect solutions generated during this process, potentially neglecting valuable information in such solutions. To address this shortcoming, we propose V-STaR that utilizes both the correct and incorrect solutions generated during the self-improvement process to train a verifier using DPO that judges correctness of model-generated solutions. This verifier is used at inference time to select one solution among many candidate solutions. Running V-STaR for multiple iterations results in progressively better reasoners and verifiers, delivering a 4% to 17% test accuracy improvement over existing self-improvement and verification approaches on common code generation and math reasoning benchmarks with LLaMA2 models.

2024-07-10

colmweb.org/COLM/2024/Conference (accepted)