
Pierre-Luc Bacon

Core Academic Member
Facebook CIFAR AI Chair
Assistant Professor, Université de Montréal, Department of Computer Science and Operations Research

Biography

Pierre-Luc Bacon is an assistant professor at Université de Montréal in the Department of Computer Science and Operations Research (DIRO). He is also a core academic member of Mila – Quebec Artificial Intelligence Institute and IVADO, and holds a Facebook CIFAR AI Chair. Bacon leads a research group that investigates the challenges posed by the curse of the horizon in reinforcement learning and optimal control.

Current Students

Diego Calanzone
Collaborating Researcher - University of Trento
diego.calanzone@mila.quebec
Léa Côté-Turcotte
Master's Research - Université de Montréal
lea.cote-turcotte@mila.quebec
Pierluca D'Oro
PhD - Université de Montréal
pierluca.doro@mila.quebec
Esther Derman
Postdoctorate - Université de Montréal
esther.derman@mila.quebec
Simon Dufort-Labbé
PhD - Université de Montréal
simon.dufort-labbe@mila.quebec
Mahan Fathi
Master's Research - Université de Montréal
mahan.fathi@mila.quebec
Arielle Gazzé
Research Intern - Université de Montréal
arielle.gazze@mila.quebec
Clement Gehring
Postdoctorate - Université de Montréal
clement.gehring@mila.quebec
Niki Howe
PhD - Université de Montréal
howeniko@mila.quebec
David Kanaa
Collaborating Alumni
kanaadjs@mila.quebec
Michel Ma
PhD - Université de Montréal
michel.ma@mila.quebec
Sobhan Mohammadpour
Master's Research - Université de Montréal
sobhan.mohammadpour@mila.quebec
Aneri Muni
PhD - Université de Montréal
aneri.muni@mila.quebec
Tianwei Ni
PhD - Université de Montréal
tianwei.ni@mila.quebec
Evgenii Nikishin
PhD - Université de Montréal
evgenii.nikishin@mila.quebec
Anushree Rankawat
PhD - Université de Montréal
anushree.rankawat@mila.quebec
Samy Rasmy
Research Intern - Université de Montréal
samy.rasmy@mila.quebec
Julien Roy
PhD - Polytechnique Montréal
royjulie@mila.quebec
Vincent Taboga
PhD - Polytechnique Montréal
vincent.taboga@mila.quebec
Justin Veilleux
Research Intern - Université de Montréal
justin.veilleux@mila.quebec

Publications

Maxwell's Demon at Work: Efficient Pruning by Leveraging Saturation of Neurons
Simon Dufort-Labbé
Pierluca D'Oro
Evgenii Nikishin
Razvan Pascanu
Aristide Baratin
Do Transformer World Models Give Better Policy Gradients?
Michel Ma
Tianwei Ni
Clement Gehring
Pierluca D'Oro
Bridging State and History Representations: Understanding Self-Predictive RL
Tianwei Ni
Benjamin Eysenbach
Erfan SeyedSalehi
Michel Ma
Clement Gehring
Representations are at the core of all deep reinforcement learning (RL) methods for both Markov decision processes (MDPs) and partially observable Markov decision processes (POMDPs). Many representation learning methods and theoretical frameworks have been developed to understand what constitutes an effective representation. However, the relationships between these methods and the shared properties among them remain unclear. In this paper, we show that many of these seemingly distinct methods and frameworks for state and history abstractions are, in fact, based on a common idea of self-predictive abstraction. Furthermore, we provide theoretical insights into the widely adopted objectives and optimization, such as the stop-gradient technique, in learning self-predictive representations. These findings together yield a minimalist algorithm to learn self-predictive representations for states and histories. We validate our theories by applying our algorithm to standard MDPs, MDPs with distractors, and POMDPs with sparse rewards. These findings culminate in a set of preliminary guidelines for RL practitioners.
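
The stop-gradient objective the abstract refers to can be made concrete in a few lines. Below is a minimal sketch, not the paper's implementation: an encoder and a latent transition model are trained so that the predicted next latent matches a detached (stop-gradient) encoding of the next observation. All module names and dimensions are illustrative.

```python
import torch
import torch.nn as nn

# Minimal self-predictive representation loss: the encoder is trained so
# that its latents are predictable by a latent transition model, with the
# prediction target passed through a stop-gradient (the optimization
# detail the paper analyzes).

class SelfPredictiveModel(nn.Module):
    def __init__(self, obs_dim, act_dim, latent_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim))
        self.transition = nn.Sequential(
            nn.Linear(latent_dim + act_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim))

    def loss(self, obs, act, next_obs):
        z = self.encoder(obs)                        # z_t = phi(s_t)
        z_pred = self.transition(torch.cat([z, act], dim=-1))
        z_target = self.encoder(next_obs).detach()   # stop-gradient on target
        return ((z_pred - z_target) ** 2).mean()

model = SelfPredictiveModel(obs_dim=8, act_dim=2)
opt = torch.optim.Adam(model.parameters(), lr=3e-4)
obs, act, next_obs = torch.randn(32, 8), torch.randn(32, 2), torch.randn(32, 8)
opt.zero_grad(); model.loss(obs, act, next_obs).backward(); opt.step()
```
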
Course Correcting Koopman Representations
Mahan Fathi
Clement Gehring
Jonathan Pilault
David Kanaa
Ross Goroshin
Decoupling regularization from the action space
Sobhan Mohammadpour
Regularized reinforcement learning (RL), particularly the entropy-regularized kind, has gained traction in optimal control and inverse RL. While standard unregularized RL methods remain unaffected by changes in the number of actions, we show that such changes can severely impact their regularized counterparts. This paper demonstrates the importance of decoupling the regularizer from the action space: that is, maintaining a consistent level of regularization regardless of how many actions are involved, to avoid over-regularization. Although the problem can be avoided by introducing a task-specific temperature parameter, this is often undesirable and cannot solve the problem when action spaces are state-dependent. In the state-dependent action context, different states with varying action spaces are regularized inconsistently. We introduce two solutions: a static temperature selection approach and a dynamic counterpart, universally applicable where this problem arises. Implementing these changes improves performance on the DeepMind control suite in static and dynamic temperature regimes and on a biological design task.
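
To make the decoupling idea concrete, here is a minimal sketch (an illustration, not the paper's code): the entropy bonus is divided by log |A|, the maximum entropy of a categorical distribution over |A| actions, so the regularization level stays comparable as the number of available actions changes. The paper's actual static and dynamic temperature schemes may differ from this simple normalizer.

```python
import math
import torch

# Entropy-regularized objective with a temperature scaled so that a uniform
# policy contributes the same bonus regardless of how many actions exist.
# The log|A| normalizer is one simple static choice.

def regularized_objective(logits, q_values, base_temperature=0.1):
    num_actions = logits.shape[-1]
    # Max entropy of a categorical over n actions is log(n); dividing by it
    # keeps the regularization level comparable across action spaces.
    temperature = base_temperature / math.log(num_actions)
    policy = torch.softmax(logits, dim=-1)
    entropy = -(policy * torch.log(policy + 1e-8)).sum(-1)
    return (policy * q_values).sum(-1) + temperature * entropy

logits = torch.randn(4, 10)      # 10 available actions in these states
q_values = torch.randn(4, 10)
print(regularized_objective(logits, q_values))
```
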
Motif: Intrinsic Motivation from Artificial Intelligence Feedback
Martin Klissarov
Pierluca D'Oro
Shagun Sodhani
Roberta Raileanu
Amy Zhang
Mikael Henaff
Exploring rich environments and evaluating one's actions without prior knowledge is immensely challenging. In this paper, we propose Motif, a general method to interface such prior knowledge from a Large Language Model (LLM) with an agent. Motif is based on the idea of grounding LLMs for decision-making without requiring them to interact with the environment: it elicits preferences from an LLM over pairs of captions to construct an intrinsic reward, which is then used to train agents with reinforcement learning. We evaluate Motif's performance and behavior on the challenging, open-ended and procedurally-generated NetHack game. Surprisingly, by only learning to maximize its intrinsic reward, Motif achieves a higher game score than an algorithm directly trained to maximize the score itself. When combining Motif's intrinsic reward with the environment reward, our method significantly outperforms existing approaches and makes progress on tasks where no advancements have ever been made without demonstrations. Finally, we show that Motif mostly generates intuitive human-aligned behaviors which can be steered easily through prompt modifications, while scaling well with the LLM size and the amount of information given in the prompt.
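
A hedged sketch of the reward-model phase this abstract describes: given caption pairs already labeled by an LLM, a small reward model is fit with the standard Bradley-Terry logistic loss, and its scalar output can then serve as an intrinsic reward. The feature dimension and network below are illustrative assumptions, not Motif's actual architecture.

```python
import torch
import torch.nn as nn

# An LLM has labeled pairs of event captions with a preference (1 if
# caption A is judged better, 0 if B). A small reward model is fit to
# those preferences; caption featurization is stubbed out here.

reward_model = nn.Sequential(nn.Linear(384, 128), nn.ReLU(), nn.Linear(128, 1))
opt = torch.optim.Adam(reward_model.parameters(), lr=1e-4)

def preference_loss(feats_a, feats_b, llm_prefers_a):
    r_a = reward_model(feats_a).squeeze(-1)
    r_b = reward_model(feats_b).squeeze(-1)
    # Bradley-Terry model: P(A preferred) = sigmoid(r_A - r_B),
    # fit to the LLM's judgments with a logistic loss.
    return nn.functional.binary_cross_entropy_with_logits(
        r_a - r_b, llm_prefers_a)

feats_a, feats_b = torch.randn(64, 384), torch.randn(64, 384)
labels = torch.randint(0, 2, (64,)).float()
opt.zero_grad(); preference_loss(feats_a, feats_b, labels).backward(); opt.step()

# At RL time, reward_model(caption_features) is added to (or replaces)
# the environment reward.
```
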
Maximum entropy GFlowNets with soft Q-learning
Sobhan Mohammadpour
Emmanuel Bengio
Block-State Transformers
Jonathan Pilault
Mahan Fathi
Orhan Firat
Ross Goroshin
Double Gumbel Q-Learning
David Yu-Tung Hui
Policy Optimization in a Noisy Neighborhood: On Return Landscapes in Continuous Control
Nathan Rahn
Pierluca D'Oro
Harley Wiltzer
Deep reinforcement learning agents for continuous control are known to exhibit significant instability in their performance over time. In this work, we provide a fresh perspective on these behaviors by studying the return landscape: the mapping between a policy and a return. We find that popular algorithms traverse noisy neighborhoods of this landscape, in which a single update to the policy parameters leads to a wide range of returns. By taking a distributional view of these returns, we map the landscape, characterizing failure-prone regions of policy space and revealing a hidden dimension of policy quality. We show that the landscape exhibits surprising structure by finding simple paths in parameter space which improve the stability of a policy. To conclude, we develop a distribution-aware procedure which finds such paths, navigating away from noisy neighborhoods in order to improve the robustness of a policy. Taken together, our results provide new insight into the optimization, evaluation, and design of agents.
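
The landscape measurement can be illustrated with a simple sketch, under the assumption that small random parameter perturbations stand in for single policy updates; evaluate_return below is a hypothetical placeholder for a full environment rollout, not the paper's evaluation protocol.

```python
import torch

# Around a fixed policy, sample many small parameter perturbations and
# record the return of each perturbed policy. The spread of the resulting
# return distribution characterizes how "noisy" the neighborhood is.

def evaluate_return(params):             # placeholder for an env rollout
    return float(-(params ** 2).sum())   # toy stand-in for a true return

def neighborhood_returns(params, n_samples=100, scale=0.01):
    returns = []
    for _ in range(n_samples):
        perturbed = params + scale * torch.randn_like(params)
        returns.append(evaluate_return(perturbed))
    return torch.tensor(returns)

params = torch.randn(256)
rets = neighborhood_returns(params)
# A large std relative to the mean marks a failure-prone region.
print(f"mean return {rets.mean().item():.3f}, std {rets.std().item():.3f}")
```
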
When Do Transformers Shine in RL? Decoupling Memory from Credit Assignment
Tianwei Ni
Michel Ma
Benjamin Eysenbach
Reinforcement learning (RL) algorithms face two distinct challenges: learning effective representations of past and present observations, and determining how actions influence future returns. Both challenges involve modeling long-term dependencies. The Transformer architecture has been very successful at solving problems that involve long-term dependencies, including in the RL domain. However, the underlying reason for the strong performance of Transformer-based RL methods remains unclear: is it because they learn effective memory, or because they perform effective credit assignment? After introducing formal definitions of memory length and credit assignment length, we design simple configurable tasks to measure these distinct quantities. Our empirical results reveal that Transformers can enhance the memory capability of RL algorithms, scaling up to tasks that require memorizing observations far in the past; however, they do not improve long-term credit assignment.
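
The notion of a task with a controllable memory length can be illustrated with a toy environment, written here in the spirit of the paper's diagnostics rather than as its actual benchmark: the agent sees a cue at the first step and is rewarded only for reproducing it exactly memory_length steps later, so memory demands grow while credit assignment stays short.

```python
import random

# Toy configurable memory task: a binary cue is shown once, then blank
# observations follow; reward depends on recalling the cue after exactly
# `memory_length` steps. The reward immediately follows the deciding
# action, keeping the credit assignment length at one.

class CueRecallEnv:
    def __init__(self, memory_length):
        self.memory_length = memory_length

    def reset(self):
        self.t = 0
        self.cue = random.randint(0, 1)
        return self.cue                  # cue only visible at step 0

    def step(self, action):
        self.t += 1
        done = self.t >= self.memory_length
        reward = float(done and action == self.cue)
        return 0, reward, done           # blank observations afterwards

env = CueRecallEnv(memory_length=50)
obs = env.reset()
```
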
Goal-conditioned GFlowNets for Controllable Multi-Objective Molecular Design
Julien Roy
Emmanuel Bengio
In recent years, in-silico molecular design has received much attention from the machine learning community. When designing a new compound for pharmaceutical applications, there are usually multiple properties of such molecules that need to be optimized: binding energy to the target, synthesizability, toxicity, EC50, and so on. While previous approaches have employed a scalarization scheme to turn the multi-objective problem into a preference-conditioned single objective, it has been established that this kind of reduction may produce solutions that tend to slide towards the extreme points of the objective space when presented with a problem that exhibits a concave Pareto front. In this work, we experiment with an alternative formulation of goal-conditioned molecular generation to obtain a more controllable conditional model that can uniformly explore solutions along the entire Pareto front.
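
The contrast between scalarization and goal-conditioning can be sketched as follows (illustrative reward functions, not the paper's GFlowNet training code): a scalarized reward collapses the objectives with preference weights, while a goal-conditioned reward peaks near a requested point in objective space, so sweeping the goal traces out the whole Pareto front, including its concave parts.

```python
import numpy as np

# Objective values here are placeholders for molecular properties such as
# binding affinity or synthesizability, normalized to [0, 1].

def scalarized_reward(objectives, weights):
    # Preference-conditioned reduction: can miss concave Pareto regions.
    return float(np.dot(objectives, weights))

def goal_conditioned_reward(objectives, goal, tolerance=0.1):
    # High reward only near the requested point in objective space, so
    # varying the goal uniformly explores solutions along the front.
    return float(np.exp(-np.sum((objectives - goal) ** 2) / tolerance ** 2))

objectives = np.array([0.7, 0.4])        # e.g. [affinity, synthesizability]
print(scalarized_reward(objectives, np.array([0.5, 0.5])))
print(goal_conditioned_reward(objectives, goal=np.array([0.8, 0.3])))
```
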