Doina Precup

guangyuan.wang@mila.quebec

Guangyuan Wang

Research Intern - McGill University

Haque Ishfaq

PhD - McGill University

PhD - McGill University

huanghow@mila.quebec

Janarthanan Rajendran

Postdoctorate - Université de Montréal

Principal supervisor:

Sarath Chandar Anbil Parthipan

janarthanan.rajendran@mila.quebec

jonathan.colaco-carr@mila.quebec

Jaume Minano Masip

PhD - McGill University

masipmij@mila.quebec

Jesse Farebrother

PhD - McGill University

Principal supervisor:

Marc Gendron-Bellemare

Research Master's - McGill University

Principal supervisor:

Prakash Panangaden

Jonathan Lebensold

PhD - McGill University

Research Intern - McGill University

keyu.wang@mila.quebec

Kushal Arora

PhD - McGill University

Principal supervisor:

Lynn Cherif

Research Master's - McGill University

Co-supervisor:

Khimya Khetarpal

lynn.cherif@mila.quebec

Mohammad Sami Nur Islam Islam

Mandana Samiei

PhD - McGill University

Co-supervisor:

PhD - McGill University

delvermm@mila.quebec

Martin Klissarov

PhD - McGill University

Harry Zhao

PhD - McGill University

Co-supervisor:

Research Intern - McGill University

mohammad-sami-nur.islam@mila.quebec

nathan.de-lara@mila.quebec

Nathan de Lara

Research Intern - McGill University

Nate Rahn

PhD - McGill University

Principal supervisor:

Marc Gendron-Bellemare

nathan.rahn@mila.quebec

Girdhar Neil Girdhar

Research Collaborator - McGill University

neil.girdhar@mila.quebec

Nikhil Vemgal

Research Master's - McGill University

nikhil-murali.vemgal@mila.quebec

padideh.nouri@mila.quebec

Nishanth Anand Vemgal

PhD - McGill University

Research Master's - Université de Montréal

PhD - McGill University

Ray Chua

PhD - McGill University

Co-supervisor:

Blake Richards

chuaraym@mila.quebec

Riashat Islam

PhD - McGill University

Safa Alver

PhD - McGill University

alversaf@mila.quebec

Sahand Rezaei-Shoshtari

PhD - McGill University

Co-supervisor:

sahand.rezaei-shoshtari@mila.quebec

PhD - McGill University

PhD - McGill University

Co-supervisor:

shahrad.mohammadzadeh@mila.quebec

fujimots@mila.quebec

Shahrad Mohammadzadeh

Research Collaborator - McGill University

Principal supervisor:

Reihaneh Rabbany

PhD - McGill University

PhD - McGill University

shuyuan.zhang@mila.quebec

Sitao Luan

PhD - McGill University

Steve Wen

Bachelor's - McGill University

steve.wen@mila.quebec

Sumana Basu

PhD - McGill University

Co-supervisor:

Adriana Romero Soriano

Research Master's - Université de Montréal

Principal supervisor:

Yoshua Bengio

thomas.jiralerspong@mila.quebec

PhD - McGill University

cheluver@mila.quebec

Wesley Chung

PhD - McGill University

Principal supervisor:

chungwes@mila.quebec

Ray Luo

PhD - McGill University

Principal supervisor:

Xujie Si

luo.ziyan@mila.quebec

Skipper: Combining Spatial and Temporal Abstraction to Improve Generalization

Blog posts

February 22, 2024

by

Mingde Harry Zhao

Safa Alver

Harm van Seijen

Romain Laroche

Doina Precup

Yoshua Bengio

Publications

QGFN: Controllable Greediness with Action Values

Elaine Lau

Stephen Zhewen Lu

Ling Pan

Emmanuel Bengio

Generative Flow Networks (GFlowNets; GFNs) are a family of reward/energy-based generative methods for combinatorial objects, capable of generating diverse and high-utility samples. However, biasing GFNs towards producing high-utility samples is non-trivial. In this work, we leverage connections between GFNs and reinforcement learning (RL) and propose to combine the GFN policy with an action-value estimate,

2024-02-07

ArXiv (preprint)

Effective Protein-Protein Interaction Exploration with PPIretrieval

Chenqing Hua

Connor Coley

Guy Wolf

Shuangjia Zheng

2024-02-06

ArXiv (preprint)

Consciousness-Inspired Spatio-Temporal Abstractions for Better Generalization in Reinforcement Learning

Harry Zhao

Mingde Zhao

Safa Alver

Harm van Seijen

Romain Laroche

Yoshua Bengio

Inspired by human conscious planning, we propose Skipper, a model-based reinforcement learning framework utilizing spatio-temporal abstractions to generalize better in novel situations. It automatically decomposes the given task into smaller, more manageable subtasks, and thus enables sparse decision-making and focused computation on the relevant parts of the environment. The decomposition relies on the extraction of an abstracted proxy problem represented as a directed graph, in which vertices and edges are learned end-to-end from hindsight. Our theoretical analyses provide performance guarantees under appropriate assumptions and establish where our approach is expected to be helpful. Generalization-focused experiments validate Skipper’s significant advantage in zero-shot generalization, compared to some existing state-of-the-art hierarchical planning methods.

2024-01-16

ICLR.cc/2024/Conference (poster)

Provable and Practical: Efficient Exploration in Reinforcement Learning via Langevin Monte Carlo

Haque Ishfaq

Qingfeng Lan

Pan Xu

A. Rupam Mahmood

Animashree Anandkumar

Kamyar Azizzadenesheli

We present a scalable and effective exploration strategy based on Thompson sampling for reinforcement learning (RL). One of the key shortcomings of existing Thompson sampling algorithms is the need to perform a Gaussian approximation of the posterior distribution, which is not a good surrogate in most practical settings. We instead directly sample the Q function from its posterior distribution, by using Langevin Monte Carlo, an efficient type of Markov Chain Monte Carlo (MCMC) method. Our method only needs to perform noisy gradient descent updates to learn the exact posterior distribution of the Q function, which makes our approach easy to deploy in deep RL. We provide a rigorous theoretical analysis for the proposed method and demonstrate that, in the linear Markov decision process (linear MDP) setting, it has a regret bound of

2024-01-16

ICLR.cc/2024/Conference (poster)

Connecting Weighted Automata, Tensor Networks and Recurrent Neural Networks through Spectral Learning

Tianyu Li

Guillaume Rabusseau

2024-01-01

Mach. Learn. (published)

Policy Gradient Methods in the Presence of Symmetries and State Abstractions

Prakash Panangaden

Sahand Rezaei-Shoshtari

Rosie Zhao

Nash Learning from Human Feedback

Rémi Munos

Michal Valko

Daniele Calandriello

M. G. Azar

Mark Rowland

Zhaohan Daniel Guo

Yunhao Tang

Matthieu Geist

Thomas Mesnard

Andrea Michi

Marco Selvi

Sertan Girgin

Nikola Momchev

Olivier Bachem

Daniel J Mankowitz

Bilal Piot

Reinforcement learning from human feedback (RLHF) has emerged as the main paradigm for aligning large language models (LLMs) with human preferences. Typically, RLHF involves the initial step of learning a reward model from human feedback, often expressed as preferences between pairs of text generations produced by a pre-trained LLM. Subsequently, the LLM's policy is fine-tuned by optimizing it to maximize the reward model through a reinforcement learning algorithm. However, an inherent limitation of current reward models is their inability to fully represent the richness of human preferences and their dependency on the sampling distribution. In this study, we introduce an alternative pipeline for the fine-tuning of LLMs using pairwise human feedback. Our approach entails the initial learning of a preference model, which is conditioned on two inputs given a prompt, followed by the pursuit of a policy that consistently generates responses preferred over those generated by any competing policy, thus defining the Nash equilibrium of this preference model. We term this approach Nash learning from human feedback (NLHF). In the context of a tabular policy representation, we present a novel algorithmic solution, Nash-MD, founded on the principles of mirror descent. This algorithm produces a sequence of policies, with the last iteration converging to the regularized Nash equilibrium. Additionally, we explore parametric representations of policies and introduce gradient descent algorithms for deep-learning architectures. To demonstrate the effectiveness of our approach, we present experimental results involving the fine-tuning of an LLM for a text summarization task. We believe NLHF offers a compelling avenue for preference learning and policy optimization with the potential of advancing the field of aligning LLMs with human preferences.

2023-12-01

ArXiv (preprint)

Learning domain-invariant classifiers for infant cry sounds

Charles Onu

Hemanth K. Sheetha

Arsenii Gorin

2023-11-30

ArXiv (preprint)

MUDiff: Unified Diffusion for Complete Molecule Generation

Chenqing Hua

Sitao Luan

Minkai Xu

Rex Ying

Zhitao Ying

Jie Fu

Stefano Ermon

2023-11-18

logconference.io/LOG/2023/Conference (poster)

DGFN: Double Generative Flow Networks

Elaine Lau

Nikhil Murali Vemgal

Emmanuel Bengio

2023-10-27

NeurIPS.cc/2023/Workshop/GenBio (poster)

Finding Increasingly Large Extremal Graphs with AlphaZero and Tabu Search

Abbas Mehrabian

Ankit Anand

Hyunjik Kim

Nicolas Sonnerat

Matej Balog

Gheorghe Comanici

Tudor Berariu

Andrew Lee

Anian Ruoss

Anna Bulanova

Daniel Toyama

Sam Blackwell

Bernardino Romera Paredes

Petar Veličković

Laurent Orseau

Joonkyung Lee

Anurag Murty Naredla

Adam Zsolt Wagner

2023-10-27

NeurIPS.cc/2023/Workshop/MATH-AI (poster)

Forecaster: Towards Temporally Abstract Tree-Search Planning from Pixels

Thomas Jiralerspong

Flemming Kondrup

Khimya Khetarpal

The ability to plan at many different levels of abstraction enables agents to envision the long-term repercussions of their decisions and thus enables sample-efficient learning. This becomes particularly beneficial in complex environments with high-dimensional state spaces such as pixels, where the goal is distant and the reward sparse. We introduce Forecaster, a deep hierarchical reinforcement learning approach which plans over high-level goals leveraging a temporally abstract world model. Forecaster learns an abstract model of its environment by modelling the transition dynamics at an abstract level and training a world model on such transitions. It then uses this world model to choose optimal high-level goals through a tree-search planning procedure. It additionally trains a low-level policy that learns to reach those goals. Our method covers not only building world models with longer horizons but also planning with such models in downstream tasks. We empirically demonstrate Forecaster's potential in both single-task learning and generalization to new tasks in the AntMaze domain.

2023-10-27

NeurIPS.cc/2023/Workshop/GenPlan (published)