
Pierre-Luc Bacon

Core Academic Member
Canada CIFAR AI Chair
Assistant Professor, Université de Montréal, Department of Computer Science and Operations Research
Research Topics
Reinforcement Learning

Biography

Pierre-Luc Bacon is an assistant professor at Université de Montréal in the Department of Computer Science and Operations Research (DIRO). He is also a core academic member of Mila – Quebec Artificial Intelligence Institute and IVADO, and holds a Facebook CIFAR AI Chair. Bacon leads a research group that investigates the challenges posed by the curse of the horizon in reinforcement learning and optimal control.

Current Students

Collaborating researcher - Concordia University
Collaborating researcher - ÉTS
PhD - Université de Montréal
Professional Master's - Université de Montréal
Collaborating Alumni - Université de Montréal
Co-supervisor :
Master's Research - Polytechnique Montréal
Principal supervisor :
Master's Research - Université de Montréal
Collaborating Alumni - Université de Montréal
PhD - Université de Montréal
PhD - Université de Montréal
Collaborating Alumni
PhD - Université de Montréal
Postdoctorate - McGill University
Principal supervisor :
Master's Research - Université de Montréal
Principal supervisor :
PhD - Université de Montréal
PhD - Université de Montréal
Master's Research - Université de Montréal
PhD - Université de Montréal
Master's Research - Université de Montréal
PhD - Université de Montréal
PhD - Université de Montréal
PhD - Université de Montréal
Postdoctorate - Université de Montréal
PhD - Université de Montréal
Collaborating Alumni - Polytechnique Montréal
Principal supervisor :
Postdoctorate - Université de Montréal
Principal supervisor :
Master's Research - Université de Montréal

Publications

State Entropy Regularization for Robust Reinforcement Learning
Yonatan Ashlag
Uri Koren
Mirco Mutti
Shie Mannor
State entropy regularization has empirically shown better exploration and sample complexity in reinforcement learning (RL). However, its theoretical guarantees have not been studied. In this paper, we show that state entropy regularization improves robustness to structured and spatially correlated perturbations. These types of variation are common in transfer learning but often overlooked by standard robust RL methods, which typically focus on small, uncorrelated changes. We provide a comprehensive characterization of these robustness properties, including formal guarantees under reward and transition uncertainty, as well as settings where the method performs poorly. Much of our analysis contrasts state entropy with the widely used policy entropy regularization, highlighting their different benefits. Finally, from a practical standpoint, we illustrate that compared with policy entropy, the robustness advantages of state entropy are more sensitive to the number of rollouts used for policy evaluation.
Understanding Behavioral Metric Learning: A Large-Scale Study on Distracting Reinforcement Learning Environments
Understanding the Effectiveness of Learning Behavioral Metrics in Deep Reinforcement Learning
A key approach to state abstraction is approximating behavioral metrics (notably, bisimulation metrics) in the observation space and embedding these learned distances in the representation space. While promising for robustness to task-irrelevant noise, as shown in prior work, accurately estimating these metrics remains challenging, requiring various design choices that create gaps between theory and practice. Prior evaluations focus mainly on final returns, leaving the quality of learned metrics and the source of performance gains unclear. To systematically assess how metric learning works in deep RL, we evaluate five recent approaches. We unify them under isometric embedding, identify key design choices, and benchmark them with baselines across 20 state-based and 14 pixel-based tasks, spanning 250+ configurations with diverse noise settings. Beyond final returns, we introduce the denoising factor to quantify the encoder’s ability to filter distractions. To further isolate the effect of metric learning, we propose an isolated metric estimation setting, where the encoder is influenced solely by the metric loss. Our results show that metric learning improves return and denoising only marginally, as its benefits fade when key design choices, such as layer normalization and self-prediction loss, are incorporated into the baseline. We also find that commonly used benchmarks (e.g., grayscale videos, varying state-based Gaussian noise dimensions) add little difficulty, while Gaussian noise with random projection and pixel-based Gaussian noise remain challenging even for the best methods. Finally, we release an open-source, modular codebase to improve reproducibility and support future research on metric learning in deep RL.
Network Sparsity Unlocks the Scaling Potential of Deep Reinforcement Learning
Guozheng Ma
Li Li
Zilin Wang
Li Shen
Dacheng Tao
Effectively scaling up deep reinforcement learning models has proven notoriously difficult due to network pathologies during training, motivating various targeted interventions such as periodic reset and architectural advances such as layer normalization. Instead of pursuing more complex modifications, we show that introducing static network sparsity alone can unlock further scaling potential beyond their dense counterparts with state-of-the-art architectures. This is achieved through simple one-shot random pruning, where a predetermined percentage of network weights are randomly removed once before training. Our analysis reveals that, in contrast to naively scaling up dense DRL networks, such sparse networks achieve both higher parameter efficiency for network expressivity and stronger resistance to optimization challenges like plasticity loss and gradient interference. We further extend our evaluation to visual and streaming RL scenarios, demonstrating the consistent benefits of network sparsity.
Scaling Trends in Language Model Robustness
Nikolaus H. R. Howe
Ian R. McKenzie
Oskar John Hollinsworth
Michał Zając
Tom Tseng
Aaron David Tucker
Adam Gleave
Mol-MoE: Training Preference-Guided Routers for Molecule Generation
Recent advances in language models have enabled framing molecule generation as sequence modeling. However, existing approaches often rely on single-objective reinforcement learning, limiting their applicability to real-world drug design, where multiple competing properties must be optimized. Traditional multi-objective reinforcement learning (MORL) methods require costly retraining for each new objective combination, making rapid exploration of trade-offs impractical. To overcome these limitations, we introduce Mol-MoE, a mixture-of-experts (MoE) architecture that enables efficient test-time steering of molecule generation without retraining. Central to our approach is a preference-based router training objective that incentivizes the router to combine experts in a way that aligns with user-specified trade-offs. This provides improved flexibility in exploring the chemical property space at test time, facilitating rapid trade-off exploration. Benchmarking against state-of-the-art methods, we show that Mol-MoE achieves superior sample quality and steerability.
Maxwell's Demon at Work: Efficient Pruning by Leveraging Saturation of Neurons
MaestroMotif: Skill Design from Artificial Intelligence Feedback
Describing skills in natural language has the potential to provide an accessible way to inject human knowledge about decision-making into an AI system. We present MaestroMotif, a method for AI-assisted skill design, which yields high-performing and adaptable agents. MaestroMotif leverages the capabilities of Large Language Models (LLMs) to effectively create and reuse skills. It first uses an LLM's feedback to automatically design rewards corresponding to each skill, starting from their natural language description. Then, it employs an LLM's code generation abilities, together with reinforcement learning, for training the skills and combining them to implement complex behaviors specified in language. We evaluate MaestroMotif using a suite of complex tasks in the NetHack Learning Environment (NLE), demonstrating that it surpasses existing approaches in both performance and usability.