Doina Precup

thomas.jiralerspong@mila.quebec

Arushi Jain

Doctorat - McGill

Maîtrise recherche - UdeM

Superviseur⋅e principal⋅e :

Yoshua Bengio

Doctorat - McGill

Postdoctorat - McGill

flemming.kondrup@mila.quebec

Elaine Lau

Maîtrise recherche - McGill

elaine.lau@mila.quebec

Doctorat - McGill

Sitao Luan

Doctorat - McGill

Ray Luo

Doctorat - McGill

Superviseur⋅e principal⋅e :

Xujie Si

luo.ziyan@mila.quebec

shahrad.mohammadzadeh@mila.quebec

G McCracken

Doctorat - McGill

Doctorat - McGill

Shahrad Mohammadzadeh

Collaborateur·rice de recherche - McGill

Superviseur⋅e principal⋅e :

Reihaneh Rabbany

padideh.nouri@mila.quebec

Gabriela Moisescu-Pareja

Maîtrise recherche - McGill

moisescg@mila.quebec

Girdhar Neil Girdhar

Collaborateur·rice de recherche - McGill

neil.girdhar@mila.quebec

Padideh Nouri

Maîtrise recherche - UdeM

Sarath Chandar Anbil Parthipan

Charles Onu

Doctorat - McGill

Doctorat - McGill

Co-superviseur⋅e :

Nate Rahn

Doctorat - McGill

Superviseur⋅e principal⋅e :

Marc Gendron-Bellemare

nathan.rahn@mila.quebec

Janarthanan Rajendran

Postdoctorat - UdeM

Superviseur⋅e principal⋅e :

janarthanan.rajendran@mila.quebec

sahand.rezaei-shoshtari@mila.quebec

Sahand Rezaei-Shoshtari

Doctorat - McGill

Co-superviseur⋅e :

David Meger

Doctorat - McGill

Co-superviseur⋅e :

Doctorat - McGill

Nishanth Anand Vemgal

Doctorat - McGill

Maîtrise recherche - McGill

nikhil-murali.vemgal@mila.quebec

Doctorat - McGill

Doctorat - McGill

Stagiaire de recherche - McGill

keyu.wang@mila.quebec

guangyuan.wang@mila.quebec

Guangyuan Wang

Stagiaire de recherche - McGill

Steve Wen

Baccalauréat - McGill

steve.wen@mila.quebec

Shuyuan Zhang

Doctorat - McGill

shuyuan.zhang@mila.quebec

Harry Zhao

Doctorat - McGill

Co-superviseur⋅e :

Billets de blogue

Generic thumbnail for Mila Blog articles.

22 février 2024

Skipper : combiner l’abstraction spatiale et temporelle afin d’améliorer la généralisation

par

Mingde Harry Zhao

Safa Alver

Harm van Seijen

Romain Laroche

Doina Precup

Yoshua Bengio

Lire l'article

Publications

QGFN: Controllable Greediness with Action Values

Elaine Lau

Stephen Zhewen Lu

Ling Pan

Emmanuel Bengio

Generative Flow Networks (GFlowNets; GFNs) are a family of reward/energy-based generative methods for combinatorial objects, capable of gene… (voir plus)rating diverse and high-utility samples. However, biasing GFNs towards producing high-utility samples is non-trivial. In this work, we leverage connections between GFNs and reinforcement learning (RL) and propose to combine the GFN policy with an action-value estimate,

2024-06-17

ICML.cc/2024/Workshop/SPIGM (poster)

Code as Reward: Empowering Reinforcement Learning with VLMs

David Venuto

Mohammad Sami Nur Islam

Martin Klissarov

Sherry Yang

Ankit Anand

2024-05-01

ICML.cc/2024/Conference (spotlight)

Gintare Karolina Dziugaite

Mixtures of Experts Unlock Parameter Scaling for Deep RL

Johan Samir Obando Ceron

Ghada Sokar

Timon Willi

Clare Lyle

Jesse Farebrother

Jakob Nicolaus Foerster

Pablo Samuel Castro

The recent rapid progress in (self) supervised learning models is in large part predicted by empirical scaling laws: a model's performance s… (voir plus)cales proportionally to its size. Analogous scaling laws remain elusive for reinforcement learning domains, however, where increasing the parameter count of a model often hurts its final performance. In this paper, we demonstrate that incorporating Mixture-of-Expert (MoE) modules, and in particular Soft MoEs (Puigcerver et al., 2023), into value-based networks results in more parameter-scalable models, evidenced by substantial performance increases across a variety of training regimes and model sizes. This work thus provides strong empirical evidence towards developing scaling laws for reinforcement learning.

2024-05-01

ICML.cc/2024/Conference (spotlight)

Nash Learning from Human Feedback

Remi Munos

Michal Valko

Daniele Calandriello

Mohammad Gheshlaghi Azar

Mark Rowland

Zhaohan Daniel Guo

Yunhao Tang

Matthieu Geist

Thomas Mesnard

Côme Fiegel

Andrea Michi

Marco Selvi

Sertan Girgin

Nikola Momchev

Olivier Bachem

Daniel J Mankowitz

Bilal Piot

Reinforcement learning from human feedback (RLHF) has emerged as the main paradigm for aligning large language models (LLMs) with human pref… (voir plus)erences. Traditionally, RLHF involves the initial step of learning a reward model from pairwise human feedback, i.e., expressed as preferences between pairs of text generations. Subsequently, the LLM's policy is fine-tuned to maximize the reward through a reinforcement learning algorithm. In this study, we introduce an alternative pipeline for the fine-tuning of LLMs using pairwise human feedback. Our approach entails the initial learning of a pairwise preference model, which is conditioned on two inputs (instead of a single input in the case of a reward model) given a prompt, followed by the pursuit of a policy that consistently generates responses preferred over those generated by any competing policy, thus defining the Nash equilibrium of this preference model. We term this approach Nash learning from human feedback (NLHF). In the context of a tabular policy representation, we present a novel algorithmic solution, Nash-MD, founded on the principles of mirror descent. This algorithm produces a sequence of policies, with the last iteration converging to the regularized Nash equilibrium. Additionally, we explore parametric representations of policies and introduce gradient descent algorithms for deep-learning architectures. We illustrate the effectiveness of our approach by presenting experimental results on a text summarization task. We believe NLHF offers a compelling avenue for fine-tuning LLMs and enhancing the alignment of LLMs with human preferences.

2024-05-01

ICML.cc/2024/Conference (spotlight)

Discrete Probabilistic Inference as Control in Multi-path Environments

Tristan Deleu

Padideh Nouri

Nikolay Malkin

Yoshua Bengio

We consider the problem of sampling from a discrete and structured distribution as a sequential decision problem, where the objective is to … (voir plus)find a stochastic policy such that objects are sampled at the end of this sequential process proportionally to some predefined reward. While we could use maximum entropy Reinforcement Learning (MaxEnt RL) to solve this problem for some distributions, it has been shown that in general, the distribution over states induced by the optimal policy may be biased in cases where there are multiple ways to generate the same object. To address this issue, Generative Flow Networks (GFlowNets) learn a stochastic policy that samples objects proportionally to their reward by approximately enforcing a conservation of flows across the whole Markov Decision Process (MDP). In this paper, we extend recent methods correcting the reward in order to guarantee that the marginal distribution induced by the optimal MaxEnt RL policy is proportional to the original reward, regardless of the structure of the underlying MDP. We also prove that some flow-matching objectives found in the GFlowNet literature are in fact equivalent to well-established MaxEnt RL algorithms with a corrected reward. Finally, we study empirically the performance of multiple MaxEnt RL and GFlowNet algorithms on multiple problems involving sampling from discrete distributions.

2024-04-26

auai.org/UAI/2024/Conference (poster)

Conditions on Preference Relations that Guarantee the Existence of Optimal Policies

Jonathan Colaco Carr

Prakash Panangaden

2024-04-18

Proceedings of The 27th International Conference on Artificial Intelligence and Statistics (publié)

arxiv.org

On learning history-based policies for controlling Markov decision processes

Gandharv Patil

Aditya Mahajan

2024-04-18

Proceedings of The 27th International Conference on Artificial Intelligence and Statistics (publié)

On the Privacy of Selection Mechanisms with Gaussian Noise

Jonathan Lebensold

Borja Balle

Report Noisy Max and Above Threshold are two classical differentially private (DP) selection mechanisms. Their output is obtained by adding … (voir plus)noise to a sequence of low-sensitivity queries and reporting the identity of the query whose (noisy) answer satisfies a certain condition. Pure DP guarantees for these mechanisms are easy to obtain when Laplace noise is added to the queries. On the other hand, when instantiated using Gaussian noise, standard analyses only yield approximate DP guarantees despite the fact that the outputs of these mechanisms lie in a discrete space. In this work, we revisit the analysis of Report Noisy Max and Above Threshold with Gaussian noise and show that, under the additional assumption that the underlying queries are bounded, it is possible to provide pure ex-ante DP bounds for Report Noisy Max and pure ex-post DP bounds for Above Threshold. The resulting bounds are tight and depend on closed-form expressions that can be numerically evaluated using standard methods. Empirically we find these lead to tighter privacy accounting in the high privacy, low data regime. Further, we propose a simple privacy filter for composing pure ex-post DP guarantees, and use it to derive a fully adaptive Gaussian Sparse Vector Technique mechanism. Finally, we provide experiments on mobility and energy consumption datasets demonstrating that our Sparse Vector Technique is practically competitive with previous approaches and requires less hyper-parameter tuning.

2024-04-18

Proceedings of The 27th International Conference on Artificial Intelligence and Statistics (publié)

arxiv.org

Offline Multitask Representation Learning for Reinforcement Learning

Haque Ishfaq

Thanh Nguyen-Tang

Songtao Feng

Raman Arora

Mengdi Wang

Ming Yin 0003