Mixtures of Experts Unlock Parameter Scaling for Deep RL
Johan Samir Obando Ceron
Ghada Sokar
Timon Willi
Clare Lyle
Jesse Farebrother
Jakob Nicolaus Foerster
The recent rapid progress in (self) supervised learning models is in large part predicted by empirical scaling laws: a model's performance s… (voir plus)cales proportionally to its size. Analogous scaling laws remain elusive for reinforcement learning domains, however, where increasing the parameter count of a model often hurts its final performance. In this paper, we demonstrate that incorporating Mixture-of-Expert (MoE) modules, and in particular Soft MoEs (Puigcerver et al., 2023), into value-based networks results in more parameter-scalable models, evidenced by substantial performance increases across a variety of training regimes and model sizes. This work thus provides strong empirical evidence towards developing scaling laws for reinforcement learning.
Nash Learning from Human Feedback
Remi Munos
Michal Valko
Daniele Calandriello
Mohammad Gheshlaghi Azar
Mark Rowland
Zhaohan Daniel Guo
Yunhao Tang
Matthieu Geist
Thomas Mesnard
Côme Fiegel
Andrea Michi
Marco Selvi
Sertan Girgin
Nikola Momchev
Olivier Bachem
Daniel J Mankowitz
Bilal Piot
Reinforcement learning from human feedback (RLHF) has emerged as the main paradigm for aligning large language models (LLMs) with human pref… (voir plus)erences. Traditionally, RLHF involves the initial step of learning a reward model from pairwise human feedback, i.e., expressed as preferences between pairs of text generations. Subsequently, the LLM's policy is fine-tuned to maximize the reward through a reinforcement learning algorithm. In this study, we introduce an alternative pipeline for the fine-tuning of LLMs using pairwise human feedback. Our approach entails the initial learning of a pairwise preference model, which is conditioned on two inputs (instead of a single input in the case of a reward model) given a prompt, followed by the pursuit of a policy that consistently generates responses preferred over those generated by any competing policy, thus defining the Nash equilibrium of this preference model. We term this approach Nash learning from human feedback (NLHF). In the context of a tabular policy representation, we present a novel algorithmic solution, Nash-MD, founded on the principles of mirror descent. This algorithm produces a sequence of policies, with the last iteration converging to the regularized Nash equilibrium. Additionally, we explore parametric representations of policies and introduce gradient descent algorithms for deep-learning architectures. We illustrate the effectiveness of our approach by presenting experimental results on a text summarization task. We believe NLHF offers a compelling avenue for fine-tuning LLMs and enhancing the alignment of LLMs with human preferences.
Nash Learning from Human Feedback
Remi Munos
Michal Valko
Daniele Calandriello
Mohammad Gheshlaghi Azar
Mark Rowland
Zhaohan Daniel Guo
Yunhao Tang
Matthieu Geist
Thomas Mesnard
Côme Fiegel
Andrea Michi
Marco Selvi
Sertan Girgin
Nikola Momchev
Olivier Bachem
Daniel J Mankowitz
Bilal Piot
Reinforcement learning from human feedback (RLHF) has emerged as the main paradigm for aligning large language models (LLMs) with human pref… (voir plus)erences. Traditionally, RLHF involves the initial step of learning a reward model from pairwise human feedback, i.e., expressed as preferences between pairs of text generations. Subsequently, the LLM’s policy is fine-tuned to maximize the reward through a reinforcement learning algorithm. In this study, we introduce an alternative pipeline for the fine-tuning of LLMs using pairwise human feedback. Our approach entails the initial learning of a pairwise preference model, which is conditioned on two inputs (instead of a single input in the case of a reward model) given a prompt, followed by the pursuit of a policy that consistently generates responses preferred over those generated by any competing policy, thus defining the Nash equilibrium of this preference model. We term this approach Nash learning from human feedback (NLHF). In the context of a tabular policy representation, we present a novel algorithmic solution, Nash-MD, founded on the principles of mirror descent. This algorithm produces a sequence of policies, with the last iteration converging to the regularized Nash equilibrium. Additionally, we explore parametric representations of policies and introduce gradient descent algorithms for deep-learning architectures. We illustrate the effectiveness of our approach by presenting experimental results on a text summarization task. We believe NLHF offers a compelling avenue for fine-tuning LLMs and enhancing the alignment of LLMs with human preferences.
Nilearn and Big Data Facilitates Transdiagnostic Brain Biomarkers
Hao-Ting Wang
Natasha Clarke
Quentin Dessain
Fraçois Paugam
Overcoming boundaries: Interdisciplinary challenges and opportunities in cognitive neuroscience
Arnaud Brignol
Anita Paas
Luis Sotelo-Castro
David St-Onge
Emily B.J. Coffey
Patient-Centered Surgical Care for Children in Low and Lower-Middle Income Countries (LMICs) - A Systematic Scoping Review of the Literature
Riya Sawhney
Kacylia Roy Proulx
Ayla Gerk
Elena Guadagno
Performance of generative pre-trained transformers (GPTs) in Certification Examination of the College of Family Physicians of Canada
Mehdi Mousavi
Shabnam Shafiee
Jason M Harley
Introduction The application of large language models such as generative pre-trained transformers (GPTs) has been promising in medical educa… (voir plus)tion, and its performance has been tested for different medical exams. This study aims to assess the performance of GPTs in responding to a set of sample questions of short-answer management problems (SAMPs) from the certification exam of the College of Family Physicians of Canada (CFPC). Method Between August 8th and 25th, 2023, we used GPT-3.5 and GPT-4 in five rounds to answer a sample of 77 SAMPs questions from the CFPC website. Two independent certified family physician reviewers scored AI-generated responses twice: first, according to the CFPC answer key (ie, CFPC score), and second, based on their knowledge and other references (ie, Reviews’ score). An ordinal logistic generalised estimating equations (GEE) model was applied to analyse repeated measures across the five rounds. Result According to the CFPC answer key, 607 (73.6%) lines of answers by GPT-3.5 and 691 (81%) by GPT-4 were deemed accurate. Reviewer’s scoring suggested that about 84% of the lines of answers provided by GPT-3.5 and 93% of GPT-4 were correct. The GEE analysis confirmed that over five rounds, the likelihood of achieving a higher CFPC Score Percentage for GPT-4 was 2.31 times more than GPT-3.5 (OR: 2.31; 95% CI: 1.53 to 3.47; p0.001). Similarly, the Reviewers’ Score percentage for responses provided by GPT-4 over 5 rounds were 2.23 times more likely to exceed th
Performance of generative pre-trained transformers (GPTs) in Certification Examination of the College of Family Physicians of Canada
Mehdi Mousavi
Shabnam Shafiee
Jason M Harley
Introduction The application of large language models such as generative pre-trained transformers (GPTs) has been promising in medical educa… (voir plus)tion, and its performance has been tested for different medical exams. This study aims to assess the performance of GPTs in responding to a set of sample questions of short-answer management problems (SAMPs) from the certification exam of the College of Family Physicians of Canada (CFPC). Method Between August 8th and 25th, 2023, we used GPT-3.5 and GPT-4 in five rounds to answer a sample of 77 SAMPs questions from the CFPC website. Two independent certified family physician reviewers scored AI-generated responses twice: first, according to the CFPC answer key (ie, CFPC score), and second, based on their knowledge and other references (ie, Reviews’ score). An ordinal logistic generalised estimating equations (GEE) model was applied to analyse repeated measures across the five rounds. Result According to the CFPC answer key, 607 (73.6%) lines of answers by GPT-3.5 and 691 (81%) by GPT-4 were deemed accurate. Reviewer’s scoring suggested that about 84% of the lines of answers provided by GPT-3.5 and 93% of GPT-4 were correct. The GEE analysis confirmed that over five rounds, the likelihood of achieving a higher CFPC Score Percentage for GPT-4 was 2.31 times more than GPT-3.5 (OR: 2.31; 95% CI: 1.53 to 3.47; p0.001). Similarly, the Reviewers’ Score percentage for responses provided by GPT-4 over 5 rounds were 2.23 times more likely to exceed th
Position: Application-Driven Innovation in Machine Learning
Alan Aspuru-Guzik
Sara Beery
Bistra Dilkina
Priya L. Donti
Marzyeh Ghassemi
Hannah Kerner
Claire Monteleoni
Esther Rolf
Milind Tambe
Adam White
No Representation, No Trust: Connecting Representation, Collapse, and Trust Issues in PPO
Skander Moalla
Andrea Miele
Daniil Pyatko
Caglar Gulcehre
Reinforcement learning (RL) is inherently rife with non-stationarity since the states and rewards the agent observes during training depend … (voir plus)on its changing policy. Therefore, networks in deep RL must be capable of adapting to new observations and fitting new targets. However, previous works have observed that networks trained under non-stationarity exhibit an inability to continue learning, termed loss of plasticity, and eventually a collapse in performance. For off-policy deep value-based RL methods, this phenomenon has been correlated with a decrease in representation rank and the ability to fit random targets, termed capacity loss. Although this correlation has generally been attributed to neural network learning under non-stationarity, the connection to representation dynamics has not been carefully studied in on-policy policy optimization methods. In this work, we empirically study representation dynamics in Proximal Policy Optimization (PPO) on the Atari and MuJoCo environments, revealing that PPO agents are also affected by feature rank deterioration and capacity loss. We show that this is aggravated by stronger non-stationarity, ultimately driving the actor's performance to collapse, regardless of the performance of the critic. We ask why the trust region, specific to methods like PPO, cannot alleviate or prevent the collapse and find a connection between representation collapse and the degradation of the trust region, one exacerbating the other. Finally, we present Proximal Feature Optimization (PFO), a novel auxiliary loss that, along with other interventions, shows that regularizing the representation dynamics mitigates the performance collapse of PPO agents.
Robust Data-driven Prescriptiveness Optimization
Mehran Poursoltani
Angelos Georghiou
Sarah Frank-Wolfe: Methods for Constrained Optimization with Best Rates and Practical Features
Aleksandr Beznosikov
David Dobre