Portrait of Pierre-Luc Bacon

Pierre-Luc Bacon

Core Academic Member
Canada CIFAR AI Chair
Assistant Professor, Université de Montréal, Department of Computer Science and Operations Research
Research Topics
Reinforcement learning

Biography

Pierre-Luc Bacon is an associate professor in the Department of Computer Science and Operations Research at the Université de Montréal. He is also a member of Mila – Quebec Artificial Intelligence Institute and of IVADO, and holds a Facebook-CIFAR chair. He leads a research group working on the challenge posed by the curse of horizon in reinforcement learning and optimal control.

Current Students

Research Intern - UdeM
Master's Research - UdeM
Research Intern - UdeM
PhD - UdeM
Co-supervisor:
Postdoctorate - UdeM
Co-supervisor:
PhD - UdeM
Master's Research - UdeM
Master's Research - UdeM
Postdoctorate - UdeM
Collaborating Alumni
Research Collaborator - Université de Montréal
PhD - UdeM
Master's Research - UdeM
PhD - UdeM
PhD - UdeM
Co-supervisor:
PhD - UdeM
Research Intern - UdeM
PhD - Polytechnique
Principal supervisor:
Postdoctorate - Polytechnique
Principal supervisor:
Research Intern - UdeM

Publications

Double Gumbel Q-Learning.
David Yu-Tung Hui
Policy Optimization in a Noisy Neighborhood: On Return Landscapes in Continuous Control
Nathan Rahn
Pierluca D'Oro
Harley Wiltzer
Deep reinforcement learning agents for continuous control are known to exhibit significant instability in their performance over time. In this work, we provide a fresh perspective on these behaviors by studying the return landscape: the mapping between a policy and a return. We find that popular algorithms traverse noisy neighborhoods of this landscape, in which a single update to the policy parameters leads to a wide range of returns. By taking a distributional view of these returns, we map the landscape, characterizing failure-prone regions of policy space and revealing a hidden dimension of policy quality. We show that the landscape exhibits surprising structure by finding simple paths in parameter space which improve the stability of a policy. To conclude, we develop a distribution-aware procedure which finds such paths, navigating away from noisy neighborhoods in order to improve the robustness of a policy. Taken together, our results provide new insight into the optimization, evaluation, and design of agents.
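As a rough illustration of the return-landscape idea described above, the sketch below estimates the return distribution of a single policy parameter vector on a toy task. The 1-D point-mass environment and the linear Gaussian policy are hypothetical stand-ins, not the paper's setup; a wide or heavy-tailed spread of returns is what the abstract calls a noisy neighborhood.

```python
# Rough illustration (not the authors' code): estimate the return distribution
# around one policy parameter vector. The point-mass task and linear Gaussian
# policy are hypothetical stand-ins for continuous-control agents.
import numpy as np

def rollout(theta, horizon=200, seed=None):
    """One episode of a toy point-mass task with a linear stochastic policy."""
    rng = np.random.default_rng(seed)
    pos, vel, ret = 1.0, 0.0, 0.0
    for _ in range(horizon):
        obs = np.array([pos, vel])
        action = float(obs @ theta) + 0.1 * rng.standard_normal()
        vel += 0.05 * np.clip(action, -1.0, 1.0)
        pos += 0.05 * vel
        ret += -pos ** 2                      # reward: stay near the origin
    return ret

def return_distribution(theta, n_rollouts=100):
    """Empirical return distribution of a single parameter vector."""
    return np.array([rollout(theta, seed=i) for i in range(n_rollouts)])

theta = np.array([-1.0, -0.5])                # one fixed point in parameter space
rets = return_distribution(theta)
# A wide spread or heavy lower tail signals a "noisy neighborhood".
print(f"mean={rets.mean():.2f}  std={rets.std():.2f}  worst={rets.min():.2f}")
```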
When Do Transformers Shine in RL? Decoupling Memory from Credit Assignment
Tianwei Ni
Michel Ma
Benjamin Eysenbach
Reinforcement learning (RL) algorithms face two distinct challenges: learning effective representations of past and present observations, and determining how actions influence future returns. Both challenges involve modeling long-term dependencies. The Transformer architecture has been very successful to solve problems that involve long-term dependencies, including in the RL domain. However, the underlying reason for the strong performance of Transformer-based RL methods remains unclear: is it because they learn effective memory, or because they perform effective credit assignment? After introducing formal definitions of memory length and credit assignment length, we design simple configurable tasks to measure these distinct quantities. Our empirical results reveal that Transformers can enhance the memory capability of RL algorithms, scaling up to tasks that require memorizing observations
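To make the notion of a configurable memory length concrete, here is a minimal hypothetical task (not the paper's benchmark) in which the only reward arrives at the end of the episode and depends on a cue shown a fixed number of steps earlier, so solving it requires memory of exactly that length while credit assignment stays short.

```python
# Hypothetical task (not the paper's benchmark): the final decision depends on a
# cue shown memory_length steps earlier, so success requires memory of exactly
# that length; the reward is revealed only at the last step.
import random

def run_episode(agent_act, memory_length=50):
    cue = random.choice([0, 1])                         # shown only at the first step
    observations = [cue] + [-1] * (memory_length - 1)   # -1 means "blank"
    history = []
    for obs in observations:                            # agent sees one step at a time
        history.append(obs)
    action = agent_act(history)                         # final step: reproduce the cue
    return 1.0 if action == cue else 0.0                # only reward of the episode

recall_agent = lambda history: history[0]               # perfect memory
memoryless_agent = lambda history: random.choice([0, 1])
print(sum(run_episode(recall_agent) for _ in range(200)) / 200)      # ~1.0
print(sum(run_episode(memoryless_agent) for _ in range(200)) / 200)  # ~0.5
```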
Goal-conditioned GFlowNets for Controllable Multi-Objective Molecular Design
Julien Roy
Emmanuel Bengio
In recent years, in-silico molecular design has received much attention from the machine learning community. When designing a new compound for pharmaceutical applications, there are usually multiple properties of such molecules that need to be optimised: binding energy to the target, synthesizability, toxicity, EC50, and so on. While previous approaches have employed a scalarization scheme to turn the multi-objective problem into a preference-conditioned single objective, it has been established that this kind of reduction may produce solutions that tend to slide towards the extreme points of the objective space when presented with a problem that exhibits a concave Pareto front. In this work we experiment with an alternative formulation of goal-conditioned molecular generation to obtain a more controllable conditional model that can uniformly explore solutions along the entire Pareto front.
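The claim about concave Pareto fronts can be checked numerically. The small sketch below (hypothetical numbers, maximization convention, not from the paper) shows that for every preference weight, a weighted-sum scalarization selects one of the two extreme points of a concave front and never its interior.

```python
# Numeric illustration (hypothetical front, not from the paper) of why linear
# scalarization slides to the extremes on a concave Pareto front: for every
# preference weight w, the maximizer of w*f1 + (1-w)*f2 is an endpoint.
import numpy as np

f1 = np.linspace(0.0, 1.0, 101)
f2 = (1.0 - f1) ** 2          # concave trade-off between two objectives (to maximize)

for w in np.linspace(0.1, 0.9, 9):
    best = np.argmax(w * f1 + (1.0 - w) * f2)
    print(f"w={w:.1f} -> selected point (f1={f1[best]:.2f}, f2={f2[best]:.2f})")
# Every weight selects f1=0.00 or f1=1.00: the interior of the front is never found.
```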
Sample-Efficient Reinforcement Learning by Breaking the Replay Ratio Barrier
Pierluca D'Oro
Max Schwarzer
Evgenii Nikishin
Increasing the replay ratio, the number of updates of an agent's parameters per environment interaction, is an appealing strategy for improving the sample efficiency of deep reinforcement learning algorithms. In this work, we show that fully or partially resetting the parameters of deep reinforcement learning agents causes better replay ratio scaling capabilities to emerge. We push the limits of the sample efficiency of carefully-modified algorithms by training them using an order of magnitude more updates than usual, significantly improving their performance in the Atari 100k and DeepMind Control Suite benchmarks. We then provide an analysis of the design choices required for favorable replay ratio scaling to be possible and discuss inherent limits and tradeoffs.
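As a schematic of the two knobs discussed in the abstract, the following sketch trains a tabular Q-learner on a tiny chain MDP with a configurable replay ratio and periodic full parameter resets. The environment, learner, and all numbers are hypothetical stand-ins for the deep RL agents studied in the paper, not the authors' implementation.

```python
# Minimal runnable sketch (not the authors' code): replay ratio = updates per
# environment step; periodic resets reinitialize the learner but keep the buffer.
import random
from collections import deque

N_STATES, N_ACTIONS, GOAL = 6, 2, 5            # small chain: move right to reach GOAL

def env_step(s, a):
    s2 = min(s + 1, GOAL) if a == 1 else max(s - 1, 0)
    return s2, (1.0 if s2 == GOAL else 0.0), s2 == GOAL

def train(env_steps=4000, replay_ratio=4, reset_every=1500,
          eps=0.2, lr=0.1, gamma=0.9):
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    buffer, s = deque(maxlen=5000), 0
    for t in range(env_steps):
        a = (random.randrange(N_ACTIONS) if random.random() < eps
             else max(range(N_ACTIONS), key=lambda x: q[s][x]))
        s2, r, done = env_step(s, a)
        buffer.append((s, a, r, s2, done))
        s = 0 if done else s2
        # Replay ratio: several updates drawn from the buffer per environment step.
        for _ in range(replay_ratio):
            bs, ba, br, bs2, bdone = random.choice(buffer)
            target = br + (0.0 if bdone else gamma * max(q[bs2]))
            q[bs][ba] += lr * (target - q[bs][ba])
        # Periodic full reset of the learner's parameters; the replay buffer is kept.
        if (t + 1) % reset_every == 0:
            q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    return q

print(train()[0])   # Q-values at the start state after training
```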
Block-State Transformers
Mahan Fathi
Jonathan Pilault
Orhan Firat
Ross Goroshin
Options of Interest: Temporal Abstraction with Interest Functions
Martin Klissarov
Maxime Chevalier-Boisvert
Temporal abstraction refers to the ability of an agent to use behaviours of controllers which act for a limited, variable amount of time. The options framework describes such behaviours as consisting of a subset of states in which they can initiate, an internal policy and a stochastic termination condition. However, much of the subsequent work on option discovery has ignored the initiation set, because of difficulty in learning it from data. We provide a generalization of initiation sets suitable for general function approximation, by defining an interest function associated with an option. We derive a gradient-based learning algorithm for interest functions, leading to a new interest-option-critic architecture. We investigate how interest functions can be leveraged to learn interpretable and reusable temporal abstractions. We demonstrate the efficacy of the proposed approach through quantitative and qualitative results, in both discrete and continuous environments.
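To illustrate the role of an interest function, the sketch below (illustrative only, not the interest-option-critic implementation) uses a sigmoid of state features as a per-option interest in (0, 1) that reweights a policy over options, smoothly generalizing a hard initiation set; all features and parameters are hypothetical.

```python
# Illustrative sketch (not the interest-option-critic code): an interest function
# as a smooth, state-dependent weight that modulates which option is selected,
# in place of a hard initiation set. All numbers below are hypothetical.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def option_selection_probs(state_features, interest_params, pi_omega_logits):
    """P(option | s) proportional to interest_o(s) * pi_Omega(o | s)."""
    interests = sigmoid(state_features @ interest_params)   # one value per option
    pi_omega = np.exp(pi_omega_logits - pi_omega_logits.max())
    pi_omega /= pi_omega.sum()                               # softmax policy over options
    weighted = interests * pi_omega
    return weighted / weighted.sum()

phi = np.array([1.0, -0.5])                   # hypothetical state features
interest_params = np.array([[2.0, -3.0],      # columns: one option each
                            [-1.0, 0.5]])
pi_omega_logits = np.array([0.0, 0.0])        # uniform policy over options
print(option_selection_probs(phi, interest_params, pi_omega_logits))
# Option 0 is strongly "interested" in this state, so it is selected most of the time.
```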