
Pierre-Luc Bacon

Core Academic Member
Canada CIFAR AI Chair
Assistant Professor, Université de Montréal, Department of Computer Science and Operations Research
Research Topics
Reinforcement Learning

Biography

Pierre-Luc Bacon is an assistant professor at Université de Montréal in the Department of Computer Science and Operations Research (DIRO). He is also a core academic member of Mila – Quebec Artificial Intelligence Institute and IVADO, and holds a Facebook CIFAR AI Chair. Bacon leads a research group that investigates the challenges posed by the curse of the horizon in reinforcement learning and optimal control.

Current Students

Collaborating researcher - Concordia University
Collaborating researcher - ÉTS
PhD - Université de Montréal
Professional Master's - Université de Montréal
Collaborating Alumni - Université de Montréal
Master's Research - Polytechnique Montréal
Master's Research - Université de Montréal
Collaborating Alumni - Université de Montréal
PhD - Université de Montréal
PhD - Université de Montréal
Collaborating Alumni
PhD - Université de Montréal
Postdoctorate - McGill University
Master's Research - Université de Montréal
PhD - Université de Montréal
PhD - Université de Montréal
Master's Research - Université de Montréal
PhD - Université de Montréal
Master's Research - Université de Montréal
PhD - Université de Montréal
PhD - Université de Montréal
PhD - Université de Montréal
Postdoctorate - Université de Montréal
PhD - Université de Montréal
Collaborating Alumni - Polytechnique Montréal
Postdoctorate - Université de Montréal
Master's Research - Université de Montréal

Publications

What Makes Value Learning Efficient in Residual Reinforcement Learning?
Guozheng Ma
Li Li
Haoyu Wang
Zixuan Liu
Dacheng Tao
Residual reinforcement learning (RL) enables stable online refinement of expressive pretrained policies by freezing the base and learning only bounded corrections. However, value learning in residual RL poses unique challenges that remain poorly understood. In this work, we identify two key bottlenecks: cold start pathology, where the critic lacks knowledge of the value landscape around the base policy, and structural scale mismatch, where the residual contribution is dwarfed by the base action. Through systematic investigation, we uncover the mechanisms underlying these bottlenecks, revealing that simple yet principled solutions suffice: base-policy transitions serve as an essential value anchor for implicit warmup, and critic normalization effectively restores representation sensitivity for discerning value differences. Based on these insights, we propose DAWN (Data-Anchored Warmup and Normalization), a minimal approach targeting efficient value learning in residual RL. By addressing these bottlenecks, DAWN demonstrates substantial efficiency gains across diverse benchmarks, policy architectures, and observation modalities.
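
A minimal PyTorch-style sketch of the two ingredients named in the abstract, a normalized critic and residual action composition on top of a frozen base policy; the module layout, the tanh bound, and the scale factor are illustrative assumptions, not the paper's implementation.

import torch
import torch.nn as nn

class NormalizedCritic(nn.Module):
    # Q-network with layer normalization so that small residual actions
    # still yield distinguishable value estimates (illustrative choice).
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.LayerNorm(hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.LayerNorm(hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))

def residual_act(base_policy, residual_policy, obs, scale=0.1):
    # Final action = frozen base action + bounded learned correction.
    with torch.no_grad():
        a_base = base_policy(obs)
    a_res = scale * torch.tanh(residual_policy(obs))
    return a_base + a_res

The "value anchor" would then amount to filling the replay buffer with transitions collected by the frozen base policy alone before residual updates begin, so the critic is warmed up around the base policy's value landscape.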
Reward Redistribution for CVaR MDPs using a Bellman Operator on L-infinity
Tail-end risk measures such as static conditional value-at-risk (CVaR) are used in safety-critical applications to prevent rare, yet catastrophic events. Unlike risk-neutral objectives, the static CVaR of the return depends on entire trajectories without admitting a recursive Bellman decomposition in the underlying Markov decision process. A classical resolution relies on state augmentation with a continuous variable. However, unless restricted to a specialized class of admissible value functions, this formulation induces sparse rewards and degenerate fixed points. In this work, we propose a novel formulation of the static CVaR objective based on augmentation. Our alternative approach leads to a Bellman operator with: (1) dense per-step rewards; (2) contracting properties on the full space of bounded value functions. Building on this theoretical foundation, we develop risk-averse value iteration and model-free Q-learning algorithms that rely on discretized augmented states. We further provide convergence guarantees and approximation error bounds due to discretization. Empirical results demonstrate that our algorithms successfully learn CVaR-sensitive policies and achieve effective performance-safety trade-offs.
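
For reference, the static CVaR of the return mentioned in the abstract is commonly written in the Rockafellar-Uryasev form below; this is standard background notation, not the paper's new operator.

\[
\mathrm{CVaR}_\alpha(R) \;=\; \max_{b \in \mathbb{R}} \Big\{\, b - \tfrac{1}{\alpha}\, \mathbb{E}\big[(b - R)_+\big] \Big\},
\qquad R = \sum_{t} \gamma^t r_t, \quad (x)_+ := \max(x, 0).
\]

In common formulations of the classical augmentation, a running budget $b_t$ is carried alongside the state and updated as $b_{t+1} = (b_t - r_t)/\gamma$, with a penalty paid only at the end of the trajectory; this is the source of the sparse rewards and degenerate fixed points that the proposed dense-reward operator is designed to avoid.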
Long-Horizon Model-Based Offline Reinforcement Learning Without Conservatism
Discovery of Sustainable Refrigerants through Physics-Informed RL Fine-Tuning of Sequence Models
Most refrigerants currently used in air-conditioning systems, such as hydrofluorocarbons, are potent greenhouse gases and are being phased down. Large-scale molecular screening has been applied to the search for alternatives, but in practice only about 300 refrigerants are known, and only a few additional candidates have been suggested without experimental validation. This scarcity of reliable data limits the effectiveness of purely data-driven methods. We present Refgen, a generative pipeline that integrates machine learning with physics-grounded inductive biases. Alongside fine-tuning for valid molecular generation, Refgen incorporates predictive models for critical properties, equations of state, thermochemical polynomials, and full vapor compression cycle simulations. These models enable reinforcement learning fine-tuning under thermodynamic constraints, enforcing consistency and guiding discovery toward molecules that balance efficiency, safety, and environmental impact. By embedding physics into the learning process, Refgen leverages scarce data effectively and enables de novo refrigerant discovery beyond the known set of compounds.
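
A hypothetical sketch of what a physics-grounded reward for RL fine-tuning could look like; the property interface (check_validity, simulate_cycle_cop, predict_gwp, predict_flammability) and the weights are invented placeholders for the kinds of components the abstract describes, not Refgen's actual API.

def refrigerant_reward(smiles, props, w_cop=1.0, w_gwp=1.0, w_safety=1.0):
    # props is assumed to bundle the predictive models described above:
    # equations of state, thermochemical polynomials, and a cycle simulator.
    if not props.check_validity(smiles):           # invalid molecule: flat penalty
        return -1.0
    cop = props.simulate_cycle_cop(smiles)         # vapor-compression efficiency
    gwp = props.predict_gwp(smiles)                # global warming potential
    flammable = props.predict_flammability(smiles) # safety risk score
    # Reward efficiency, penalize environmental impact and safety risk.
    return w_cop * cop - w_gwp * gwp - w_safety * flammable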
The Three Regimes of Offline-to-Online Reinforcement Learning
Offline-to-online reinforcement learning (RL) has emerged as a practical paradigm that leverages offline datasets for pretraining and online interactions for fine-tuning. However, its empirical behavior is highly inconsistent: design choices for online fine-tuning that work well in one setting can fail completely in another. We propose a stability-plasticity principle that can explain this inconsistency: we should preserve the knowledge of the pretrained policy or the offline dataset during online fine-tuning, whichever is better, while maintaining sufficient plasticity. This perspective identifies three regimes of online fine-tuning, each requiring distinct stability properties. We validate this framework through a large-scale empirical study, finding that the results strongly align with its predictions in 45 of 63 cases. This work provides a principled framework for guiding design choices in offline-to-online RL based on the relative performance of the offline dataset and the pretrained policy.
Planning with Unified Multimodal Models
Zhilong Zhang
Yang Yu
With the powerful reasoning capabilities of large language models (LLMs) and vision-language models (VLMs), many recent works have explored using them for decision-making. However, most of these approaches rely solely on language-based reasoning, which limits their ability to reason and make informed decisions. Recently, a promising new direction has emerged with unified multimodal models (UMMs), which support both multimodal inputs and outputs. We believe such models have greater potential for decision-making by enabling reasoning through generated visual content. To this end, we propose Uni-Plan, a planning framework built on UMMs. Within this framework, a single model simultaneously serves as the policy, dynamics model, and value function. In addition, to avoid hallucinations in dynamics predictions, we present a novel approach, self-discriminated filtering, in which the generative model serves as a self-discriminator to filter out invalid dynamics predictions. Experiments on long-horizon planning tasks show that Uni-Plan substantially improves success rates compared to VLM-based methods, while also showing strong data scalability, requiring no expert demonstrations and achieving better performance under the same training-data size. This work lays a foundation for future research in reasoning and decision-making with UMMs.
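
An illustrative sketch of self-discriminated filtering as described in the abstract: the same unified model is queried once as a dynamics model and once as a discriminator over its own predictions. The methods predict_next and score_validity are assumed placeholders, not the paper's actual interface.

def self_discriminated_step(umm, obs, action, n_samples=4, threshold=0.5):
    # Sample several candidate next observations from the model used as dynamics.
    candidates = [umm.predict_next(obs, action) for _ in range(n_samples)]
    # Re-query the same model as a discriminator to score each prediction.
    scored = [(umm.score_validity(obs, action, c), c) for c in candidates]
    valid = [c for score, c in scored if score >= threshold]
    # Keep the plausible predictions; fall back to the best-scored one if none pass.
    return valid if valid else [max(scored, key=lambda x: x[0])[1]]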
Stable Gradients for Stable Learning at Scale in Deep Reinforcement Learning
Roger Creus Castanyer
Johan Obando-Ceron
Scaling deep reinforcement learning networks is challenging and often results in degraded performance, yet the root causes of this failure mode remain poorly understood. Several recent works have proposed mechanisms to address this, but they are often complex and fail to highlight the causes underlying this difficulty. In this work, we conduct a series of empirical analyses which suggest that the combination of non-stationarity with gradient pathologies, due to suboptimal architectural choices, underlies the challenges of scale. We propose a series of direct interventions that stabilize gradient flow, enabling robust performance across a range of network depths and widths. Our interventions are simple to implement and compatible with well-established algorithms, and result in an effective mechanism that enables strong performance even at large scales. We validate our findings on a variety of agents and suites of environments.
State Entropy Regularization for Robust Reinforcement Learning
Yonatan Ashlag
Uri Koren
Mirco Mutti
Shie Mannor
State entropy regularization has been shown empirically to improve exploration and sample complexity in reinforcement learning (RL). However, its theoretical guarantees have not been studied. In this paper, we show that state entropy regularization improves robustness to structured and spatially correlated perturbations. These types of variation are common in transfer learning but often overlooked by standard robust RL methods, which typically focus on small, uncorrelated changes. We provide a comprehensive characterization of these robustness properties, including formal guarantees under reward and transition uncertainty, as well as settings where the method performs poorly. Much of our analysis contrasts state entropy with the widely used policy entropy regularization, highlighting their different benefits. Finally, from a practical standpoint, we illustrate that compared with policy entropy, the robustness advantages of state entropy are more sensitive to the number of rollouts used for policy evaluation.
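
For context, the state entropy regularized objective discussed above is usually written in the generic form below, in contrast with policy entropy regularization, which instead adds the entropy of the action distribution at visited states; this is background notation rather than the paper's specific setup.

\[
\max_\pi \;\; \mathbb{E}_\pi\!\left[\sum_{t} \gamma^t r_t\right] \;+\; \tau\, \mathcal{H}\!\left(d^\pi\right),
\qquad
\mathcal{H}\!\left(d^\pi\right) = -\sum_{s} d^\pi(s) \log d^\pi(s),
\]

where $d^\pi$ is the state-occupancy distribution induced by $\pi$ and $\tau > 0$ is the regularization weight.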
What Matters when Modeling Human Behavior using Imitation Learning?
As AI systems become increasingly embedded in human decision-making processes, aligning their behavior with human values is critical to ensuring safe and trustworthy deployment. A central approach to AI alignment, imitation learning (IL), trains a learner to directly mimic desirable human behaviors from expert demonstrations. However, standard IL methods assume that (1) experts act to optimize expected returns; and (2) expert policies are Markovian. Both assumptions are inconsistent with empirical findings from behavioral economics, according to which humans (1) are risk-sensitive and (2) make decisions based on past experience. In this work, we examine the implications of risk sensitivity for IL and show that standard approaches do not capture all optimal policies under risk-sensitive decision criteria. By characterizing these expert policies, we identify key limitations of existing IL algorithms in replicating expert performance in risk-sensitive settings. Our findings underscore the need for new IL frameworks that account for both risk-aware preferences and temporal dependencies to faithfully align AI behavior with human experts.
Discrete Compositional Generation via General Soft Operators and Robust Reinforcement Learning
A major bottleneck in scientific discovery consists of narrowing an exponentially large set of objects, such as proteins or molecules, to a small set of promising candidates with desirable properties. While this process can rely on expert knowledge, recent methods leverage reinforcement learning (RL) guided by a proxy reward function to enable this filtering. By employing various forms of entropy regularization, these methods aim to learn samplers that generate diverse candidates that are highly rated by the proxy function. In this work, we make two main contributions. First, we show that these methods are liable to generate overly diverse, suboptimal candidates in large search spaces. To address this issue, we introduce a novel unified operator that combines several regularized RL operators into a general framework that better targets peakier sampling distributions. Second, we offer a novel, robust RL perspective of this filtering process. The regularization can be interpreted as robustness to a compositional form of uncertainty in the proxy function (i.e., the true evaluation of a candidate differs from the proxy's evaluation). Our analysis leads us to a novel, easy-to-use algorithm we name trajectory general mellowmax (TGM): we show it identifies higher-quality, diverse candidates than baselines in both synthetic and real-world tasks. Code: https://github.com/marcojira/tgm.
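
As background for the operator family named above, the standard (state-level) mellowmax operator can be sketched as follows; the paper's trajectory general mellowmax generalizes this idea, so the snippet is illustrative background rather than the proposed algorithm.

import numpy as np

def mellowmax(values, omega):
    # Soft maximum: approaches the mean as omega -> 0 and the hard max
    # as omega -> infinity. Shift by the max for numerical stability.
    values = np.asarray(values, dtype=float)
    m = values.max()
    return m + np.log(np.mean(np.exp(omega * (values - m)))) / omega

print(mellowmax([1.0, 2.0, 3.0], omega=5.0))  # between the mean and the max, pulled toward the max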
Understanding Behavioral Metric Learning: A Large-Scale Study on Distracting Reinforcement Learning Environments
A key approach to state abstraction is approximating behavioral metrics (notably, bisimulation metrics) in the observation space and embedding these learned distances in the representation space. While promising for robustness to task-irrelevant noise, as shown in prior work, accurately estimating these metrics remains challenging, requiring various design choices that create gaps between theory and practice. Prior evaluations focus mainly on final returns, leaving the quality of learned metrics and the source of performance gains unclear. To systematically assess how metric learning works in deep reinforcement learning (RL), we evaluate five recent approaches, unified conceptually as isometric embeddings with varying design choices. We benchmark them with baselines across 20 state-based and 14 pixel-based tasks, spanning 370 task configurations with diverse noise settings. Beyond final returns, we introduce the evaluation of a denoising factor to quantify the encoder's ability to filter distractions. To further isolate the effect of metric learning, we propose and evaluate an isolated metric estimation setting, in which the encoder is influenced solely by the metric loss. Finally, we release an open-source, modular codebase to improve reproducibility and support future research on metric learning in deep RL.
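
For reference, the bisimulation metric mentioned above is typically defined, in its standard form, as the fixed point of the operator below; the surveyed methods differ in how they approximate it.

\[
d(s, s') \;=\; \max_{a \in \mathcal{A}} \Big(\, \big|R(s,a) - R(s',a)\big| \;+\; \gamma\, W_1\!\big(P(\cdot \mid s, a),\, P(\cdot \mid s', a);\, d\big) \Big),
\]

where $W_1(\cdot, \cdot\,; d)$ is the 1-Wasserstein distance computed with respect to $d$ itself; isometric-embedding approaches train an encoder $\phi$ so that $\|\phi(s) - \phi(s')\|$ matches an estimate of $d(s, s')$.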
Network Sparsity Unlocks the Scaling Potential of Deep Reinforcement Learning
Guozheng Ma
Zilin Wang
Li Shen
Dacheng Tao
Effectively scaling up deep reinforcement learning models has proven notoriously difficult due to network pathologies during training, motivating various targeted interventions such as periodic reset and architectural advances such as layer normalization. Instead of pursuing more complex modifications, we show that introducing static network sparsity alone can unlock scaling potential beyond that of dense counterparts with state-of-the-art architectures. This is achieved through simple one-shot random pruning, where a predetermined percentage of network weights are randomly removed once before training. Our analysis reveals that, in contrast to naively scaling up dense DRL networks, such sparse networks achieve both higher parameter efficiency for network expressivity and stronger resistance to optimization challenges like plasticity loss and gradient interference. We further extend our evaluation to visual and streaming RL scenarios, demonstrating the consistent benefits of network sparsity.
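
A minimal sketch of the one-shot random pruning described above, using PyTorch's built-in pruning utility; the 90% sparsity level and the restriction to linear layers are illustrative assumptions.

import torch.nn as nn
import torch.nn.utils.prune as prune

def one_shot_random_prune(model: nn.Module, sparsity: float = 0.9) -> nn.Module:
    # Randomly zero out a fixed fraction of weights once, before training starts.
    for module in model.modules():
        if isinstance(module, nn.Linear):
            prune.random_unstructured(module, name="weight", amount=sparsity)
    return model

The pruning call reparameterizes each layer's weight as a dense tensor multiplied by a fixed binary mask, so the sparsity pattern stays static throughout training unless pruning is applied again.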