Rishabh Agarwal

Lajanugen Logeswaran

Jaekyeom Kim

Hao Peng

Moontae Lee

Honglak Lee

Lu Wang

Step-by-step verifiers, also known as process reward models (PRMs), are a key ingredient for test-time scaling, but training them requires expensive step-level supervision. This work aims to build data-efficient PRMs as verbalized step-wise reward models that verify every step in the solution by generating a verification chain-of-thought (CoT). We propose ThinkPRM, a long CoT verifier fine-tuned on orders of magnitude fewer process labels than those required by discriminative PRMs. Our approach capitalizes on the inherent reasoning abilities of long CoT models, and outperforms LLM-as-a-Judge and discriminative verifiers, using only 1% of the process labels in PRM800K, across several challenging benchmarks. Specifically, ThinkPRM beats the baselines on ProcessBench, MATH-500, and AIME ’24 under best-of-N selection and reward-guided search. In an out-of-domain evaluation over subsets of GPQA-Diamond and LiveCodeBench, our PRM surpasses discriminative verifiers trained with the full PRM800K by 8% and 4.5%, respectively. Lastly, under the same token budget, ThinkPRM scales up verification compute more effectively compared to LLM-as-a-Judge, outperforming it by 7.2% on a subset of ProcessBench. This work highlights the value of generative, long CoT PRMs that can scale test-time compute for verification while requiring minimal supervision for training.

2026-03-08

Transactions on Machine Learning Research (accepted)

Shape of Thought: When Distribution Matters More than Correctness in Reasoning Tasks

Abhranil Chandra

Ayush Agrawal

Arian Hosseini

Sebastian Fischmeister

Navin Goyal

Aaron Courville

2025-12-23

ArXiv (preprint)

Reward-Guided Prompt Evolving in Reinforcement Learning for LLMs

Ziyu Ye

Tianqi Liu

Rishabh Joshi

Sarmishta Velury

Quoc V Le

Qijun Tan

Yuan Liu

2025-10-05

Proceedings of the 42nd International Conference on Machine Learning (published)

proceedings.mlr.press

Putting the Value Back in RL: Better Test-Time Scaling by Unifying LLM Reasoners With Verifiers

Morgane M Moss

2025-07-06

Conference on Language Modeling (accepted)

Learning and Controlling Silicon Dopant Transitions in Graphene using Scanning Transmission Electron Microscopy

Max Schwarzer

Jesse Farebrother

Joshua Greaves

Ekin Dogus Cubuk

Aaron Courville

Marc-Emmanuel Bellemare

Sergei Kalinin

Igor Mordatch

Pablo Samuel Castro

Kevin M Roccapriore

We introduce a machine learning approach to determine the transition dynamics of silicon atoms on a single layer of carbon atoms, when stimulated by the electron beam of a scanning transmission electron microscope (STEM). Our method is data-centric, leveraging data collected on a STEM. The data samples are processed and filtered to produce symbolic representations, which we use to train a neural network to predict transition probabilities. These learned transition dynamics are then leveraged to guide a single silicon atom throughout the lattice to pre-determined target destinations. We present empirical analyses that demonstrate the efficacy and generality of our approach.

2025-05-19

Advanced Materials Interfaces (published)

Gemma 3 Technical Report

Gemma Team

Aishwarya Kamath

Johan Ferret

Shreya Pathak

Nino Vieillard

Ramona Merhej

Sarah Perrin

Tatiana Matejovicova

Alexandre Ramé

Morgane Rivière

Louis Rouillard

Thomas Mesnard

Geoffrey Cideron

Jean-Bastien Grill

Sabela Ramos

Edouard Yvinec

Michelle Casbon

Etienne Pot

Ivo Penchev

Gael Liu

Francesco Visin

Kathleen Kenealy

Lucas Beyer

Xiaohai Zhai

Anton Tsitsulin

Róbert Busa-Fekete

Alex Feng

Noveen Sachdeva

Benjamin Coleman

Yi Gao

Basil Mustafa

Iain Barr

Emilio Parisotto

David Tian

Matan Eyal

Colin Cherry

Jan-Thorsten Peter

Danila Sinopalnikov

Surya Bhupatiraju

Mehran Kazemi

Dan Malkin

Ravin Kumar

David Vilar

Idan Brusilovsky

Jiaming Luo

Andreas Steiner

Abe Friesen

Abhanshu Sharma

Abheesht Sharma

Adi Mayrav Gilady

Adrian Goedeckemeyer

Alaa Saade

Alexander Kolesnikov

Alexei Bendebury

Alvin Abdagic

Amit Vadi

András György

André Susano Pinto

Anil Das

Ankur Bapna

Antoine Miech

Antoine Yang

Antonia Paterson

Ashish Shenoy

Ayan Chakrabarti

Bilal Piot

Boxi Wu

Bobak Shahriari

Bryce Petrini

Charlie Chen

Charline Le Lan

Christopher A. Choquette-Choo

CJ Carey

Cormac Brick

Daniel Deutsch

Danielle Eisenbud

Dee Cattle

Derek Cheng

Dimitris Paparas

Divyashree Shivakumar Sreepathihalli

Doug Reid

Dustin Tran

Dustin Zelle

Eric Noland

Erwin Huizenga

Eugene Kharitonov

Frederick Liu

Gagik Amirkhanyan

Glenn Cameron

Hadi Hashemi

Hanna Klimczak-Plucińska

Harman Singh

Harsh Mehta

Harshal Tushar Lehri

Hussein Hazimeh

Ian Ballantyne

Idan Szpektor

Ivan Nardini

Jean Pouget-Abadie

Jetha Chan

Joe Stanton

J. Michael Wieting

Jonathan Lai

Jordi Orbay

Joe Fernandez

Joshua Newlan

Junsong Ji

Jyotinder Singh

Kat Black

Kathy Yu

Kevin Hui

Kiran N. Vodrahalli

Klaus Greff

Linhai Qiu

Marcella Valentine

Marina Coelho

Marvin Ritter

Matt Hoffman

Matthew Watson

Mayank Chaturvedi

Michael Moynihan

Min Ma

Nabila Babar

Natasha Noy

Nathan Byrd

Nick Roy

Nikola Momchev

Nilay Chauhan

Oskar Bunyan

Pankil Botarda

Paul Caron

Paul Kishan Rubenstein

Phil Culliton

Philipp Schmid

Pier Giuseppe Sessa

Pingmei Xu

Piotr Stańczyk

Pouya Dehghani Tafti

Rakesh Shivanna

Renjie Wu

Renke Pan

R. Rokni

Rob Willoughby

Rohith Vallu

Ryan Mullins

Sammy Jerome

Sara Smoot

Sertan Girgin

Shariq Iqbal

Shashir Reddy

Shruti Sheth

Siim Põder

Sijal Bhatnagar

S. Panyam

Sivan Eiger

Susan Zhang

Tianqi Liu

Trevor Yacovone

T. Liechty

Uday Kalra

Utku Evci

Vedant Misra

Vincent Roseberry

Vladimir Feinberg

Vlad Kolesnikov

Woohyun Han

Woosuk Kwon

X. T. Chen

Yinlam Chow

Yuvein Zhu

Zichuan Wei

Z. Egyed

Victor Cotruta

Minh Giang

Phoebe Kirk

Anand Rao

Jessica Lo

Erica Moreira

Luiz Gustavo Martins

Omar Sanseviero

Lucas Gonzalez

Zach Gleicher

Tris Brian Warkentin

Seyed Vahab Mirrokni

Evan Senter

Eli Collins

Joelle Barral

Zoubin Ghahramani

Raia Hadsell

Yossi Matias

D. Sculley

Slav Petrov

Noah Fiedel

Noam M. Shazeer

Oriol Vinyals

Jeffrey Dean

Demis Hassabis

Koray Kavukcuoglu

Clément Farabet

Elena Buchatskaya

Jean-Baptiste Alayrac

Rohan Anil

Dmitry Lepikhin

Sebastian Borgeaud

Olivier Bachem

Armand Joulin

Alek Andreev

Cassidy Hardin

Robert Dadashi

Léonard Hussenot

We introduce Gemma 3, a multimodal addition to the Gemma family of lightweight open models, ranging in scale from 1 to 27 billion parameters. This version introduces vision understanding abilities, a wider coverage of languages and longer context (at least 128K tokens). We also change the architecture of the model to reduce the KV-cache memory that tends to explode with long context. This is achieved by increasing the ratio of local to global attention layers, and keeping the span on local attention short. The Gemma 3 models are trained with distillation and achieve superior performance to Gemma 2 for both pre-trained and instruction finetuned versions. In particular, our novel post-training recipe significantly improves the math, chat, instruction-following and multilingual abilities, making Gemma3-4B-IT competitive with Gemma2-27B-IT and Gemma3-27B-IT comparable to Gemini-1.5-Pro across benchmarks. We release all our models to the community.

2025-03-24

ArXiv (preprint)

Asynchronous RLHF: Faster and More Efficient Off-Policy RL for Language Models

Shengyi Huang

The dominant paradigm for RLHF is online and on-policy RL: synchronously generating from the large language model (LLM) policy, labelling with a reward model, and learning using feedback on the LLM's own outputs. While performant, this paradigm is computationally inefficient. Inspired by classical deep RL literature, we propose separating generation and learning in RLHF. This enables asynchronous generation of new samples while simultaneously training on old samples, leading to faster training and more compute-optimal scaling. However, asynchronous training relies on an underexplored regime, online but off-policy RLHF: learning on samples from previous iterations of our model which give a worse training signal. We tackle the fundamental challenge in this regime: how much off-policyness can we tolerate for asynchronous training to speed up learning but maintain performance? Among several RLHF algorithms we test, online DPO is found to be most robust to off-policy data, and robustness increases with the scale of the policy model. We study further compute optimizations for asynchronous RLHF but find that they come at a performance cost, giving rise to a trade-off. We verify the scalability of asynchronous RLHF by training a general-purpose chatbot from LLaMA 3.1 8B on an instruction-following task ~40% faster than a synchronous run while matching final performance. Finally, we extend our results to math and reasoning to demonstrate asynchronous RL can finetune Rho 1B on GSM8k ~70% faster while matching synchronous accuracy.

2025-01-21

International Conference on Learning Representations (poster)

Inference-Aware Fine-Tuning for Best-of-N Sampling in Large Language Models

Yinlam Chow

Guy Tennenholtz

Izzeddin Gur

Vincent Zhuang

Bo Dai

Sridhar Thiagarajan

Craig Boutilier

Aviral Kumar

Aleksandra Faust

Recent studies have indicated that effectively utilizing inference-time compute is crucial for attaining better performance from large language models (LLMs). In this work, we propose a novel inference-aware fine-tuning paradigm, in which the model is fine-tuned in a manner that directly optimizes the performance of the inference-time strategy. We study this paradigm using the simple yet effective Best-of-N (BoN) inference strategy, in which a verifier selects the best out of a set of LLM-generated responses. We devise the first imitation learning and reinforcement learning (RL) methods for BoN-aware fine-tuning, overcoming the challenging, non-differentiable argmax operator within BoN. We empirically demonstrate that our BoN-aware models implicitly learn a meta-strategy that interleaves best responses with more diverse responses that might be better suited to a test-time input, a process reminiscent of the exploration-exploitation trade-off in RL. Our experiments demonstrate the effectiveness of BoN-aware fine-tuning in terms of improved performance and inference-time compute. In particular, we show that our methods improve the Bo32 performance of Gemma 2B on Hendrycks MATH from 26.8% to 30.8%, and pass@32 from 60.0% to 67.0%, as well as the pass@16 on HumanEval from 61.6% to 67.1%.

2024-12-31

ICLR (published)

On the Analysis and Distillation of Emergent Outlier Properties in Pre-trained Language Models

Tianyang Zhao

Kunwar Yashraj Singh

Srikar Appalaraju

Peng Tang

Ying Nian Wu

Li Erran Li

Li

Nino Vieillard

Yongchao Zhou

Piotr Stańczyk

Sabela Ramos Garea

Matthieu Geist

Rohan Anil

Andrew M. Dai

Melvin Orhan Firat

Dmitry Lepikhin

Alexandre Passos

Siamak Shakeri

Emanuel Taropa

Paige Bailey

Zhifeng Chen

Eric Chu

Jonathan H. Clark

Laurent El

Yanping Huang

K. Meier-Hellstern

Gaurav Mishra

Erica Moreira

Mark Omernick

Kevin Robinson

Sebastian Ruder

Yi Tay

Kefan Xiao

Yuanzhong Xu

Yujing Zhang

Gustavo Hernández Abrego

Junwhan Ahn

Jacob Austin

Paul R. Barham

Jan Botha

James Bradbury

Siddhartha Brahma

Kevin Brooks

M. Catasta

Yong Cheng

Colin Cherry

Christopher A. Choquette-Choo

Aakanksha Chowdhery

Clément Crepy

Shachi Dave

Mostafa Dehghani

Sunipa Dev

Jacob Devlin

Mark Díaz

Nan Du

Ethan Dyer

Vladimir Feinberg

Fangxiaoyu Feng

Vlad Fienber

Markus Freitag

Xavier Garcia

Sebastian Gehrmann

Lucas Gonzalez

Guy Gur-Ari

Steven Hand

Hadi Hashemi

Le Hou

Joshua Howland

Andrea Hu

Jeffrey Hui

Jeremy Hurwitz

Michael Acheson Isard

Abe Ittycheriah

Matthew Jagielski

Wenhao Jia

Kathleen Kenealy

M. Krikun

Sneha Kudugunta

Chang Lan

Katherine Lee

Benjamin Lee

Music Eric Li

Wei Li

YaGuang Li

Li Jian

Hyeontaek Li

Hanzhao Lim

Zhongtao Lin

Liu Frederick

Marcello Liu

Aroma Maggioni

Mahendru Joshua

Vedant Maynez

Maysam Misra

Moussalem Zachary

John Nado

E. Nham

Andrew Ni

Alicia Nystrom

Marie Parrish

M. Pellat

Polacek Alex

Reiner Polozov

Siyuan Pope

Emily Qiao

Reif Bryan

Parker Richter

Alex Riley

Castro Ros

Aurko Roy

Brennan Saeta

Rajkumar Samuel

Renee Shelby

Ambrose Slone

Daniel Smilkov

David R. So

Daniel Sohn

Simon Tokumine

Dasha Valter

Haiming Bao

Mo Bavarian

Jeff Belgum

Irwan Bello

Jake Berdine

Gabriel Bernadett-Shapiro

Christopher Berner

Lenny Bogdonoff

Oleg Boiko

Madelaine Boyd

Anna-Luisa Brakman

Greg Brockman

Tim Brooks

M. Brundage

Kevin Button

Trevor Cai

Rosie Campbell

Andrew Cann

Brittany Carey

Chelsea Carlson

Rory Carmichael

Brooke Chan

Che Chang

Fotis Chantzis

Derek Chen

Sully Chen

Ruby Chen

Jason Chen

Mark Chen

Benjamin Chess

Chester Cho

Hyung Casey Chu

Won Chung

Dave Cummings

Jeremiah Currier

Yunxing Dai

Tarun Goel

Gabriel Gogineni

Rapha Goh

Jonathan Gontijo-Lopes

Morgan Gordon

Scott Grafstein

Ryan Gray

Joshua Greene

Shixiang Shane Gross

Yufei Gu

Chris Guo

Jesse Hallacy

Jeff Han

Harris Yuchen

Mike He

Johannes Heaton

C. Heidecke

Alan Hesse

Wade Hickey

Peter Hickey

Hoeschele Brandon

Kenny Houghton

Shengli Hsu

Xin Hu

Joost Hu

Shantanu Huizinga

Shawn Jain

Jain Joanne

Angela Jang

Roger Jiang

Haozhun Jiang

Denny Jin

Shino Jin

Billie Jomoto

Hee-woo Jonn

Tomer Jun

Łukasz Kaftan

Ali Kaiser

Ingmar Kamali

Kanitscheider

Nitish Shirish

Keskar Tabarak

Logan Khan

J. Kilpatrick

Kim Christina

Yongjik Kim

Jan Hendrik Kim

Jamie Kirchner

Matt Kiros

Daniel Knight

Kokotajlo Łukasz

A. Kondraciuk

Aris Kondrich

Kyle Konstantinidis

Gretchen Kosic

Vishal Krueger

Michael Kuo

Ikai Lampe

Teddy Lan

Jan Lee

Jade Leike

Daniel Leung

Chak Ming Levy

Li Rachel

Molly Lim

Stephanie Lin

Mateusz Lin

Theresa Litwin

Ryan Lopez

Patricia Lowe

Lue Anna

Kim Makanju

S. Malfacini

Todor Manning

Yaniv Markov

Bianca Markovski

Katie Martin

Andrew Mayer

Bob Mayne

Scott Mayer McGrew

Christine McKinney

Paul McLeavey

McMillan Jake

David McNeil

Aalok Medina

Jacob Mehta

Luke Menick

Andrey Metz

Pamela Mishchenko

Vinnie Mishkin

Evan Monaco

Daniel Morikawa

Tong Mossing

Mira Mu

Oleg Murati

David Murk

Ashvin Mély

Reiichiro Nair

Rajeev Nakano

Nayak Arvind

Richard Neelakantan

Hyeonwoo Ngo

Noh Long

Cullen Ouyang

Jakub O’Keefe

Alex Pachocki

J. Paino

Ashley Palermo

Pantuliano

Carl Ross

Bob Rotsted

Henri Roussez

Nick Ryder

Mario Saltarelli

Ted Sanders

Shibani Santurkar

Girish Sastry

Heather Schmidt

David Schnurr

John Schulman

Daniel Selsam

Kyla Sheppard

Toki Sherbakov

Jessica Shieh

Sarah Shoker

Pranav Shyam

Szymon Sidor

Eric Sigler

Maddie Simens

Jordan Sitkin

Katarina Slama

Ian Sohl

Benjamin D. Sokolowsky

Yang Song

Natalie Staudacher

Clemens Winter

Samuel Wolrich

Hannah Wong

Lauren Workman

Sherwin Wu

Michael Wu

Kai Xiao

Tao Xu

Sarah Yoo

Kevin Yu

Qiming Yuan

Wojciech Zaremba

Rowan G. Zellers

Chong Zhang

Marvin Zhang

Tianhao Shengjia Zhao

Ouyang Long

Jeff Wu

Xu Jiang

Diogo Almeida

C. Wainwright

Pamela Mishkin

Sandhini Agarwal

Alex Ray

Jacob Hilton

Fraser Kelton

Luke Miller

Amanda Askell

Peter Welinder

Paul F. Christiano

Jan Leike

Ryan Lowe

Adam Paszke

Sam Gross

Francisco Massa

Adam Lerer

Gregory Chanan

Trevor Killeen

Ze-Bin Lin

Natalia Gimelshein

L. Antiga

Alban Desmaison

Andreas Köpf

Edward Yang

Zachary DeVito

Martin Raison

A. Tejani

Sasank Chilamkurthy

Benoit Steiner

Giovanni Puccetti

Anna Rogers

Aleksandr Drozd

Felice Dell’Orletta

Alec Radford

Jong Wook Kim

Chris Hallacy

Aditya Ramesh

Gabriel Goh

Girish Sastry

J. Clark

Rewon Child

David Luan

Victor Sanh

Alex Webson

Colin Raffel

Stephen H. Bach

Lintang A. Sutawika

Zaid Alyafeai

Antoine Chaffin

Arnaud Stiegler

Arun Raja

Manan Dey

Saiful Bari

Canwen Xu

Urmish Thakker

Shanya Sharma Sharma

Eliza Szczechla

Taewoon Kim

Gunjan Chhablani

Nihal Nayak

Debajyoti Datta

Mike Jonathan Chang

Tian-Jian Jiang

Han Wang

Matteo Manica

Sheng Shen

Zheng-Xin Yong

Harshit Pandey

Rachel Bawden

Thomas Wang

Trishala Neeraj

Jos Rozen

Abheesht Sharma

Thibault Févry

Jason Alan Fries

Ryan Teehan

Teven Le Scao

Stella Biderman

Leo Gao

Thomas Wolf

A. M. R.

Richard Socher

Alex Perelygin

Jean Wu

Jason Chuang

Christopher D Manning

Andrew Ng

Christopher Potts

Aarohi Srivastava

Abhinav Rastogi

Abhishek Rao

Abu Awal

Md. Shoeb

Abubakar Abid

Adam Fisch

Adam R. Brown

Adam Santoro

Aditya Gupta

Adrià Garriga-Alonso

Agnieszka Kluska

Aitor Lewkowycz

Akshat Agarwal

Alethea Power

Alex Warstadt

Alexander W. Kocurek

Ali Safaya

Ali Tazarv

Alice Xiang

Alicia Parrish

Allen Nie

Aman Hussain

Amanda Dsouza

Ameet Rahane

Anantharaman S. Iyer

Anders Johan Andreassen

Andrea Madotto

Andrea Santilli

Andreas Stuhlmüller

Andrew La

Andrew Lampinen

Andy Zou

Angela Jiang

Angelica Chen

Anh Vuong

Animesh Gupta

Anna Gottardi

Antonio Norelli

Anu Venkatesh

Arash Gholamidavoodi

Arfa Tabassum

Arul Menezes

Arun Kirubarajan

Asher Mullokandov

Ashish Sabharwal

Austin Herrick

Avia Efrat

Aykut Erdem

Ayla Karakaş

Ryan Roberts

Bao Sheng Loe

Barret Zoph

Bartłomiej Bojanowski

Batuhan Özyurt

Behnam Hedayatnia

Behnam Neyshabur

Benjamin Inden

Benno Stein

Berk Ekmekci

Bill Yuchen

Blake Lin

Bryan Howald

Cameron Orinion

Cameron Diao

Catherine Dour

Cedrick Stinson

César Argueta

Chandan Ferri

Charles Singh

Chenlin Rathkopf

Chitta Meng

C. Baral

Chris Wu

Chris Callison-Burch

Christopher Waites

Christopher D Voigt

Cindy Potts

E. RamirezClara

Clemencia Rivera

Colin Siro

Courtney Raffel

Cristina Ashcraft

Damien Garbacea

Sileo Dan

Dan Garrette

Dan Hendrycks

Dan Kilman

C. Roth

C. Daniel Freeman

Daniel Khashabi

Daniel Levy

Daniel Moseguí González

Danielle Perszyk

Danny Hernandez

Danqi Chen

2024-12-31

NAACL (Long Papers) (published)

Training Language Models to Self-Correct via Reinforcement Learning

Aviral Kumar

Vincent Zhuang

Yi Su

John D Co-Reyes

Avi Singh

Kate Baumli

Shariq Iqbal

Colton Bishop

Rebecca Roelofs

Lei M Zhang

Kay McKinney

Disha Shrivastava

Cosmin Paduraru

George Tucker

Doina Precup

Feryal Behbahani

Aleksandra Faust

Self-correction is a highly desirable capability of large language models (LLMs), yet it has consistently been found to be largely ineffective in modern LLMs. Existing approaches for training self-correction either require multiple models or rely on a more capable model or other forms of supervision. To this end, we develop a multi-turn online reinforcement learning (RL) approach, SCoRe, that significantly improves an LLM's self-correction ability using entirely self-generated data. To build SCoRe, we first show that variants of supervised fine-tuning (SFT) on offline model-generated correction traces are insufficient for instilling self-correction behavior. In particular, we observe that training via SFT either suffers from a distribution mismatch between the training data and the model's own responses or implicitly prefers only a certain mode of correction behavior that is often not effective at test time. SCoRe addresses these challenges by training under the model's own distribution of self-generated correction traces and using appropriate regularization to steer the learning process into learning a self-correction strategy that is effective at test time as opposed to simply fitting high-reward responses for a given prompt. This regularization prescribes running a first phase of RL on a base model to generate a policy initialization that is less susceptible to collapse and then using a reward bonus to amplify self-correction during training. When applied to Gemini 1.0 Pro and 1.5 Flash models, we find that SCoRe achieves state-of-the-art self-correction performance, improving the base models' self-correction by 15.6% and 9.1% respectively on the MATH and HumanEval benchmarks.

2024-12-31

ICLR (published)

Not All LLM Reasoners Are Created Equal

Daniel Toyama

We study the depth of grade-school math (GSM) problem-solving capabilities of LLMs. To this end, we evaluate their performance on pairs of existing math word problems together so that the answer to the second problem depends on correctly answering the first problem. Our findings reveal a significant reasoning gap in most LLMs, that is, a performance difference between solving the compositional pairs and solving each question independently. This gap is more pronounced in smaller, more cost-efficient, and math-specialized models. Moreover, instruction-tuning recipes and code generation have varying effects across LLM sizes, while finetuning on GSM can lead to task overfitting. Our analysis indicates that large reasoning gaps are not because of test-set leakage, but due to distraction from additional context and poor second-hop reasoning. Overall, LLMs exhibit systematic differences in their reasoning abilities, despite what their performance on standard benchmarks indicates.

2024-10-08

NeurIPS.cc/2024/Workshop/Sys2-Reasoning (poster)

Many-Shot In-Context Learning

Avi Singh

Lei M Zhang

Bernd Bohnet

Stephanie C.Y. Chan

Luis Rosias

Biao Zhang

Ankesh Anand

Zaheer Abbas

Azade Nova

John D Co-Reyes

Eric Chu

Feryal Behbahani

Aleksandra Faust

Hugo Larochelle

Large language models (LLMs) excel at few-shot in-context learning (ICL): learning from a few examples provided in context at inference, without any weight updates. Newly expanded context windows allow us to investigate ICL with hundreds or thousands of examples, the many-shot regime. Going from few-shot to many-shot, we observe significant performance gains across a wide variety of generative and discriminative tasks. While promising, many-shot ICL can be bottlenecked by the available amount of human-generated outputs. To mitigate this limitation, we explore two new settings: (1) "Reinforced ICL" that uses model-generated chain-of-thought rationales in place of human rationales, and (2) "Unsupervised ICL" where we remove rationales from the prompt altogether, and prompt the model only with domain-specific inputs. We find that both Reinforced and Unsupervised ICL can be quite effective in the many-shot regime, particularly on complex reasoning tasks. We demonstrate that, unlike few-shot learning, many-shot learning is effective at overriding pretraining biases, can learn high-dimensional functions with numerical inputs, and performs comparably to supervised fine-tuning. Finally, we reveal the limitations of next-token prediction loss as an indicator of downstream ICL performance.

2024-09-24

NeurIPS.cc/2024/Conference (spotlight)