Thomas Mesnard

Bobak Shahriari

Alexandre Ramé

Johan Ferret

Peter Liu

Pouya Dehghani Tafti

Abe Friesen

Michelle Casbon

Sabela Ramos

Ravin Kumar

Charline Le Lan

Sammy Jerome

Anton Tsitsulin

Nino Vieillard

Piotr Stańczyk

Sertan Girgin

Nikola Momchev

Matt Hoffman

Shantanu Thakoor

Jean-Bastien Grill

Behnam Neyshabur

Alanna Walton

Aliaksei Severyn

Alicia Parrish

Aliya Ahmad

Allen Hutchison

Alvin Abdagic

Amanda Carl

Amy Shen

Andy Brock

Andy Coenen

Anthony Laforge

Antonia Paterson

Ben Bastian

Bilal Piot

Boxi Wu

Brandon Royal

Charlie Chen

Chintu Kumar

Chris Perry

Christopher A. Welty

Christopher A. Choquette-Choo

Danila Sinopalnikov

David Weinberger

Dimple Vijaykumar

Dominika Rogozińska

D. Herbison

Elisa Bandy

Emma Wang

Eric Noland

Erica Moreira

Evan Senter

Evgenii Eltyshev

Francesco Visin

Gabriel Rasskin

Gary Wei

Glenn Cameron

Gus Martins

Hadi Hashemi

Hanna Klimczak-Plucińska

Harleen Batra

Harsh Dhand

Ivan Nardini

Jacinda Mein

Jack Zhou

James Svensson

Jeff Stanway

Jetha Chan

Jin Zhou

Joana Carrasqueira

Joana Iljazi

Jocelyn Becker

Joe Fernandez

Joost Van Amersfoort

Josh Gordon

Josh Lipschultz

Joshua Newlan

Junsong Ji

Kareem Mohamed

Kartikeya Badola

Kat Black

Katie Millican

Keelin McDonell

Kelvin Nguyen

Kiranbir Sodhia

Kish Greene

Lars Lowe Sjoesund

Lauren Usui

Laurent Sifre

L. Heuermann

Leticia Lago

Lilly McNealus

Livio Baldini Soares

Logan Kilpatrick

Lucas Dixon

Luciano Martins

Machel Reid

Manvinder Singh

Mark Iverson

Martin Gorner

Mat Velloso

Mateo Wirth

Matt Davidow

Matt Miller

Matthew Rahtz

Matthew Watson

Meg Risdal

Mehran Kazemi

Michael Moynihan

Ming Zhang

Minsuk Kahng

Minwoo Park

Mofi Rahman

Mohit Khatwani

Natalie Dao

Nenshad Bardoliwalla

N. Devanathan

Neta Dumai

Nilay Chauhan

O. Wahltinez

Pankil Botarda

Parker Barnes

Paul R. Barham

Paul Michel

Pengchong Jin

Petko Georgiev

Phil Culliton

Pradeep Kuppala

Ramona Comanescu

Ramona Merhej

Reena Jana

R. Rokni

Rishabh Agarwal

Ryan Mullins

Samaneh Saadat

S. M. Carthy

Sarah Perrin

Sébastien M. R. Arnold

Sebastian Krause

Shengyang Dai

S. Garg

Shruti Sheth

S. Ronstrom

Susan Chan

Timothy Jordan

Ting Yu

Tom Eccles

Tom Hennigan

Tomas Kocisky

Tulsee Doshi

Vihan Jain

Vikas Yadav

Vilobh Meshram

Vishal Dharmadhikari

Warren Barkley

Wei Wei

Wenming Ye

Woohyun Han

Woosuk Kwon

Xiang Xu

Zhe Shen

Zhitao Gong

Zichuan Wei

Victor Cotruta

Phoebe Kirk

Anand Rao

Minh Giang

Ludovic Peran

Tris Brian Warkentin

Eli Collins

Joelle Barral

Zoubin Ghahramani

Raia Hadsell

D. Sculley

Jeanine Banks

Anca Dragan

Slav Petrov

Oriol Vinyals

Jeffrey Dean

Demis Hassabis

Koray Kavukcuoglu

Clément Farabet

Elena Buchatskaya

Sebastian Borgeaud

Noah Fiedel

Armand Joulin

Kathleen Kenealy

Robert Dadashi

Alek Andreev

In this work, we introduce Gemma 2, a new addition to the Gemma family of lightweight, state-of-the-art open models, ranging in scale from 2 billion to 27 billion parameters. In this new version, we apply several known technical modifications to the Transformer architecture, such as interleaving local-global attentions (Beltagy et al., 2020a) and group-query attention (Ainslie et al., 2023). We also train the 2B and 9B models with knowledge distillation (Hinton et al., 2015) instead of next token prediction. The resulting models deliver the best performance for their size, and even offer competitive alternatives to models that are 2-3 times bigger. We release all our models to the community.

2024-07-31

ArXiv (preprint)

Nash Learning from Human Feedback

Remi Munos

Michal Valko

Daniele Calandriello

Mohammad Gheshlaghi Azar

Mark Rowland

Zhaohan Daniel Guo

Yunhao Tang

Matthieu Geist

Côme Fiegel

Andrea Michi

Marco Selvi

Sertan Girgin

Nikola Momchev

Olivier Bachem

Daniel J Mankowitz

Bilal Piot

Reinforcement learning from human feedback (RLHF) has emerged as the main paradigm for aligning large language models (LLMs) with human preferences. Traditionally, RLHF involves the initial step of learning a reward model from pairwise human feedback, i.e., expressed as preferences between pairs of text generations. Subsequently, the LLM's policy is fine-tuned to maximize the reward through a reinforcement learning algorithm. In this study, we introduce an alternative pipeline for the fine-tuning of LLMs using pairwise human feedback. Our approach entails the initial learning of a pairwise preference model, which is conditioned on two inputs (instead of a single input in the case of a reward model) given a prompt, followed by the pursuit of a policy that consistently generates responses preferred over those generated by any competing policy, thus defining the Nash equilibrium of this preference model. We term this approach Nash learning from human feedback (NLHF). In the context of a tabular policy representation, we present a novel algorithmic solution, Nash-MD, founded on the principles of mirror descent. This algorithm produces a sequence of policies, with the last iteration converging to the regularized Nash equilibrium. Additionally, we explore parametric representations of policies and introduce gradient descent algorithms for deep-learning architectures. We illustrate the effectiveness of our approach by presenting experimental results on a text summarization task. We believe NLHF offers a compelling avenue for fine-tuning LLMs and enhancing the alignment of LLMs with human preferences.

2024-05-01

ICML.cc/2024/Conference (spotlight)

openreview.net

RecurrentGemma: Moving Past Transformers for Efficient Open Language Models

Aleksandar Botev

Soham De

Samuel L. Smith

Anushan Fernando

George-Cristian Muraru

Ruba Haroun

Leonard Berrada

Razvan Pascanu

Pier Giuseppe Sessa

Robert Dadashi

Léonard Hussenot

Johan Ferret

Sertan Girgin

Olivier Bachem

Alek Andreev

Kathleen Kenealy

Cassidy Hardin

Surya Bhupatiraju

Shreya Pathak

Laurent Sifre

Morgane Rivière

Mihir Kale

J Christopher Love

Juliette Love

Pouya Dehghani Tafti

Armand Joulin

Noah Fiedel

Evan Senter

Yutian Chen

Srivatsan Srinivasan

Guillaume Desjardins

David Mark Budden

Arnaud Doucet

Sharad Mandyam Vikram

Adam Paszke

Trevor Gale

Sebastian Borgeaud

Charlie Chen

Andy Brock

Antonia Paterson

Jenny Brennan

Meg Risdal

Raj Gundluru

N. Devanathan

Paul Mooney

Nilay Chauhan

Phil Culliton

Luiz Gustavo Martins

Elisa Bandy

David W. Huntsperger

Glenn Cameron

Arthur Zucker

Tris Brian Warkentin

Ludovic Peran

Minh Giang

Zoubin Ghahramani

Clément Farabet

Koray Kavukcuoglu

Demis Hassabis

Raia Hadsell

Yee Whye Teh

Nando de Freitas

We introduce RecurrentGemma, a family of open language models which uses Google's novel Griffin architecture. Griffin combines linear recurrences with local attention to achieve excellent performance on language. It has a fixed-sized state, which reduces memory use and enables efficient inference on long sequences. We provide two sizes of models, containing 2B and 9B parameters, and provide pre-trained and instruction tuned variants for both. Our models achieve comparable performance to similarly-sized Gemma baselines despite being trained on fewer tokens.

2024-04-01

arXiv (published)

Nash Learning from Human Feedback

Remi Munos

Michal Valko

Daniele Calandriello

Mohammad Gheshlaghi Azar

Mark Rowland

Zhaohan Daniel Guo

Yunhao Tang

Matthieu Geist

Andrea Michi

Marco Selvi

Sertan Girgin

Nikola Momchev

Olivier Bachem

Daniel J Mankowitz

Bilal Piot

Reinforcement learning from human feedback (RLHF) has emerged as the main paradigm for aligning large language models (LLMs) with human preferences. Typically, RLHF involves the initial step of learning a reward model from human feedback, often expressed as preferences between pairs of text generations produced by a pre-trained LLM. Subsequently, the LLM's policy is fine-tuned by optimizing it to maximize the reward model through a reinforcement learning algorithm. However, an inherent limitation of current reward models is their inability to fully represent the richness of human preferences and their dependency on the sampling distribution. In this study, we introduce an alternative pipeline for the fine-tuning of LLMs using pairwise human feedback. Our approach entails the initial learning of a preference model, which is conditioned on two inputs given a prompt, followed by the pursuit of a policy that consistently generates responses preferred over those generated by any competing policy, thus defining the Nash equilibrium of this preference model. We term this approach Nash learning from human feedback (NLHF). In the context of a tabular policy representation, we present a novel algorithmic solution, Nash-MD, founded on the principles of mirror descent. This algorithm produces a sequence of policies, with the last iteration converging to the regularized Nash equilibrium. Additionally, we explore parametric representations of policies and introduce gradient descent algorithms for deep-learning architectures. To demonstrate the effectiveness of our approach, we present experimental results involving the fine-tuning of an LLM for a text summarization task. We believe NLHF offers a compelling avenue for preference learning and policy optimization with the potential of advancing the field of aligning LLMs with human preferences.

2023-12-01

ArXiv (preprint)

Ghost Units Yield Biologically Plausible Backprop in Deep Neural Networks

Gaetan Vignoud

João Sacramento

Walter Senn

Yoshua Bengio

2018-09-05

2018 Conference on Cognitive Computational Neuroscience (published)

Generalization of Equilibrium Propagation to Vector Field Dynamics

The biological plausibility of the backpropagation algorithm has long been doubted by neuroscientists. Two major reasons are that neurons would need to send two different types of signal in the forward and backward phases, and that pairs of neurons would need to communicate through symmetric bidirectional connections. We present a simple two-phase learning procedure for fixed point recurrent networks that addresses both these issues. In our model, neurons perform leaky integration and synaptic weights are updated through a local mechanism. Our learning method generalizes Equilibrium Propagation to vector field dynamics, relaxing the requirement of an energy function. As a consequence of this generalization, the algorithm does not compute the true gradient of the objective function, but rather approximates it at a precision which is proven to be directly related to the degree of symmetry of the feedforward and feedback weights. We show experimentally that our algorithm optimizes the objective function.

2018-08-14

ArXiv (preprint)