Doina Precup

Arushi Jain

PhD - McGill

PhD - McGill

Postdoctorate - McGill

Elaine Lau

Research Master's - McGill

Jonathan Lebensold

Alumni collaborator - McGill

Bachelor's - McGill

Ray Luo

PhD - McGill

G McCracken

PhD - McGill

Nazanin Mohammadi Sepahvand

PhD - McGill

Shahrad Mohammadzadeh

Research Master's - McGill

Gabriela Moisescu-Pareja

Research collaborator - McGill

PhD - UdeM

Charles Onu

PhD - McGill

PhD - McGill

Nate Rahn

PhD - McGill

Sahand Rezaei-Shoshtari

PhD - McGill

PhD - McGill

samiemandana@gmail.com

PhD - McGill

PhD - McGill

Nishanth Anand Vemgal

PhD - McGill

PhD - McGill

PhD - McGill

Research intern - McGill

Zihan Wang

PhD - McGill

Steve Wen

Research Master's - McGill

Zijing Wu

PhD - McGill

PhD - McGill

Harry Zhao

PhD - McGill

Skipper: Combining Spatial and Temporal Abstraction to Improve Generalization

Blog posts

February 22, 2024

by

Mingde Harry Zhao

Safa Alver

Harm van Seijen

Romain Laroche

Doina Precup

Yoshua Bengio

Publications

Policy Gradient Methods in the Presence of Symmetries and State Abstractions

Prakash Panangaden

Sahand Rezaei-Shoshtari

Rosie Zhao

David Meger

Nash Learning from Human Feedback

Rémi Munos

Michal Valko

Daniele Calandriello

Mohammad Gheshlaghi Azar

Mark Rowland

Zhaohan Daniel Guo

Yunhao Tang

Matthieu Geist

Thomas Mesnard

Andrea Michi

Marco Selvi

Sertan Girgin

Nikola Momchev

Olivier Bachem

Daniel J Mankowitz

Bilal Piot

Reinforcement learning from human feedback (RLHF) has emerged as the main paradigm for aligning large language models (LLMs) with human preferences. Typically, RLHF involves the initial step of learning a reward model from human feedback, often expressed as preferences between pairs of text generations produced by a pre-trained LLM. Subsequently, the LLM's policy is fine-tuned by optimizing it to maximize the reward model through a reinforcement learning algorithm. However, an inherent limitation of current reward models is their inability to fully represent the richness of human preferences and their dependency on the sampling distribution. In this study, we introduce an alternative pipeline for the fine-tuning of LLMs using pairwise human feedback. Our approach entails the initial learning of a preference model, which is conditioned on two inputs given a prompt, followed by the pursuit of a policy that consistently generates responses preferred over those generated by any competing policy, thus defining the Nash equilibrium of this preference model. We term this approach Nash learning from human feedback (NLHF). In the context of a tabular policy representation, we present a novel algorithmic solution, Nash-MD, founded on the principles of mirror descent. This algorithm produces a sequence of policies, with the last iteration converging to the regularized Nash equilibrium. Additionally, we explore parametric representations of policies and introduce gradient descent algorithms for deep-learning architectures. To demonstrate the effectiveness of our approach, we present experimental results involving the fine-tuning of an LLM for a text summarization task. We believe NLHF offers a compelling avenue for preference learning and policy optimization with the potential of advancing the field of aligning LLMs with human preferences.

2023-12-01

ArXiv (preprint)

Learning domain-invariant classifiers for infant cry sounds

Charles Onu

Hemanth K. Sheetha

Arsenii Gorin

2023-11-30

ArXiv (preprint)

MUDiff: Unified Diffusion for Complete Molecule Generation

Chenqing Hua

Sitao Luan

Minkai Xu

Zhitao Ying

Rex Ying

Jie Fu

Stefano Ermon

2023-11-18

logconference.io/LOG/2023/Conference (poster)

DGFN: Double Generative Flow Networks

Elaine Lau

Nikhil Murali Vemgal

Emmanuel Bengio

2023-10-27

NeurIPS.cc/2023/Workshop/GenBio (poster)

Finding Increasingly Large Extremal Graphs with AlphaZero and Tabu Search

Abbas Mehrabian

Ankit Anand

Hyunjik Kim

Nicolas Sonnerat

Matej Balog

Gheorghe Comanici

Tudor Berariu

Andrew Lee

Anian Ruoss

Anna Bulanova

Daniel Toyama

Sam Blackwell

Bernardino Romera Paredes

Petar Veličković

Laurent Orseau

Joonkyung Lee

Anurag Murty Naredla

Adam Zsolt Wagner

2023-10-27

NeurIPS.cc/2023/Workshop/MATH-AI (poster)

Forecaster: Towards Temporally Abstract Tree-Search Planning from Pixels

Thomas Jiralerspong

Flemming Kondrup

Khimya Khetarpal

The ability to plan at many different levels of abstraction enables agents to envision the long-term repercussions of their decisions and thus enables sample-efficient learning. This becomes particularly beneficial in complex environments with high-dimensional state spaces such as pixels, where the goal is distant and the reward sparse. We introduce Forecaster, a deep hierarchical reinforcement learning approach which plans over high-level goals leveraging a temporally abstract world model. Forecaster learns an abstract model of its environment by modelling the transition dynamics at an abstract level and training a world model on such transitions. It then uses this world model to choose optimal high-level goals through a tree-search planning procedure. It additionally trains a low-level policy that learns to reach those goals. Our method covers not only building world models with longer horizons, but also planning with such models in downstream tasks. We empirically demonstrate Forecaster's potential in both single-task learning and generalization to new tasks in the AntMaze domain.

2023-10-27

NeurIPS.cc/2023/Workshop/GenPlan (published)

A cry for help: Early detection of brain injury in newborns

Charles Onu

Samantha Latremouille

Arsenii Gorin

Junhao Wang

Uchenna Ekwochi

P. Ubuane

O. Kehinde

Muhammad A. Salisu

Datonye Briggs

Yoshua Bengio

2023-10-12

ArXiv (preprint)