Portrait of Pierre-Luc Bacon

Pierre-Luc Bacon

Core Academic Member
Canada CIFAR AI Chair
Assistant Professor, Université de Montréal, Department of Computer Science and Operations Research
Research Topics
Reinforcement Learning

Biography

Pierre-Luc Bacon is an assistant professor at Université de Montréal in the Department of Computer Science and Operations Research (DIRO). He is also a core academic member of Mila – Quebec Artificial Intelligence Institute and IVADO, and holds a Facebook CIFAR AI Chair. Bacon leads a research group that investigates the challenges posed by the curse of the horizon in reinforcement learning and optimal control.

Current Students

Collaborating researcher - Concordia University
Collaborating researcher - ÉTS
PhD - Université de Montréal
Professional Master's - Université de Montréal
Collaborating Alumni - Université de Montréal
Co-supervisor :
Master's Research - Polytechnique Montréal
Principal supervisor :
Master's Research - Université de Montréal
Collaborating Alumni - Université de Montréal
PhD - Université de Montréal
PhD - Université de Montréal
Collaborating Alumni
PhD - Université de Montréal
Postdoctorate - McGill University
Principal supervisor :
Master's Research - Université de Montréal
Principal supervisor :
PhD - Université de Montréal
PhD - Université de Montréal
Master's Research - Université de Montréal
PhD - Université de Montréal
Master's Research - Université de Montréal
PhD - Université de Montréal
PhD - Université de Montréal
PhD - Université de Montréal
Postdoctorate - Université de Montréal
PhD - Université de Montréal
Collaborating Alumni - Polytechnique Montréal
Principal supervisor :
Postdoctorate - Université de Montréal
Principal supervisor :
Master's Research - Université de Montréal

Publications

Discovery of Sustainable Refrigerants through Physics-Informed RL Fine-Tuning of Sequence Models
Most refrigerants currently used in air-conditioning systems, such as hydrofluorocarbons, are potent greenhouse gases and are being phased d… (see more)own. Large-scale molecular screening has been applied to the search for alternatives, but in practice only about 300 refrigerants are known, and only a few additional candidates have been suggested without experimental validation. This scarcity of reliable data limits the effectiveness of purely data-driven methods. We present Refgen, a generative pipeline that integrates machine learning with physics-grounded inductive biases. Alongside fine-tuning for valid molecular generation, Refgen incorporates predictive models for critical properties, equations of state, thermochemical polynomials, and full vapor compression cycle simulations. These models enable reinforcement learning fine-tuning under thermodynamic constraints, enforcing consistency and guiding discovery toward molecules that balance efficiency, safety, and environmental impact. By embedding physics into the learning process, Refgen leverages scarce data effectively and enables de novo refrigerant discovery beyond the known set of compounds.
Stable Gradients for Stable Learning at Scale in Deep Reinforcement Learning
Roger Creus Castanyer
Johan Obando-Ceron
Scaling deep reinforcement learning networks is challenging and often results in degraded performance, yet the root causes of this failure m… (see more)ode remain poorly understood. Several recent works have proposed mechanisms to address this, but they are often complex and fail to highlight the causes underlying this difficulty. In this work, we conduct a series of empirical analyses which suggest that the combination of non-stationarity with gradient pathologies, due to suboptimal architectural choices, underlie the challenges of scale. We propose a series of direct interventions that stabilize gradient flow, enabling robust performance across a range of network depths and widths. Our interventions are simple to implement and compatible with well-established algorithms, and result in an effective mechanism that enables strong performance even at large scales. We validate our findings on a variety of agents and suites of environments.
Network Sparsity Unlocks the Scaling Potential of Deep Reinforcement Learning
Guozheng Ma
Li Li
Zilin Wang
Li Shen
Dacheng Tao
Effectively scaling up deep reinforcement learning models has proven notoriously difficult due to network pathologies during training, motiv… (see more)ating various targeted interventions such as periodic reset and architectural advances such as layer normalization. Instead of pursuing more complex modifications, we show that introducing static network sparsity alone can unlock further scaling potential beyond their dense counterparts with state-of-the-art architectures. This is achieved through simple one-shot random pruning, where a predetermined percentage of network weights are randomly removed once before training. Our analysis reveals that, in contrast to naively scaling up dense DRL networks, such sparse networks achieve both higher parameter efficiency for network expressivity and stronger resistance to optimization challenges like plasticity loss and gradient interference. We further extend our evaluation to visual and streaming RL scenarios, demonstrating the consistent benefits of network sparsity.
Network Sparsity Unlocks the Scaling Potential of Deep Reinforcement Learning
Guozheng Ma
Li Li
Zilin Wang
Li Shen
Dacheng Tao
Effectively scaling up deep reinforcement learning models has proven notoriously difficult due to network pathologies during training, motiv… (see more)ating various targeted interventions such as periodic reset and architectural advances such as layer normalization. Instead of pursuing more complex modifications, we show that introducing static network sparsity alone can unlock further scaling potential beyond their dense counterparts with state-of-the-art architectures. This is achieved through simple one-shot random pruning, where a predetermined percentage of network weights are randomly removed once before training. Our analysis reveals that, in contrast to naively scaling up dense DRL networks, such sparse networks achieve both higher parameter efficiency for network expressivity and stronger resistance to optimization challenges like plasticity loss and gradient interference. We further extend our evaluation to visual and streaming RL scenarios, demonstrating the consistent benefits of network sparsity.
Discrete Compositional Generation via General Soft Operators and Robust Reinforcement Learning
A major bottleneck in scientific discovery consists of narrowing an exponentially large set of objects, such as proteins or molecules, to a … (see more)small set of promising candidates with desirable properties. While this process can rely on expert knowledge, recent methods leverage reinforcement learning (RL) guided by a proxy reward function to enable this filtering. By employing various forms of entropy regularization, these methods aim to learn samplers that generate diverse candidates that are highly rated by the proxy function. In this work, we make two main contributions. First, we show that these methods are liable to generate overly diverse, suboptimal candidates in large search spaces. To address this issue, we introduce a novel unified operator that combines several regularized RL operators into a general framework that better targets peakier sampling distributions. Secondly, we offer a novel, robust RL perspective of this filtering process. The regularization can be interpreted as robustness to a compositional form of uncertainty in the proxy function (i.e., the true evaluation of a candidate differs from the proxy's evaluation). Our analysis leads us to a novel, easy-to-use algorithm we name trajectory general mellowmax (TGM): we show it identifies higher quality, diverse candidates than baselines in both synthetic and real-world tasks. Code: https://github.com/marcojira/tgm.
Robust Reinforcement Learning for Discrete Compositional Generation via General Soft Operators
A major bottleneck in scientific discovery involves narrowing a large combinatorial set of objects, such as proteins or molecules, to a smal… (see more)l set of promising candidates. While this process largely relies on expert knowledge, recent methods leverage reinforcement learning (RL) to enhance this filtering. They achieve this by estimating proxy reward functions from available datasets and using regularization to generate more diverse candidates. These reward functions are inherently uncertain, raising a particularly salient challenge for scientific discovery. In this work, we show that existing methods, often framed as sampling proportional to a reward function, are inadequate and yield suboptimal candidates, especially in large search spaces. To remedy this issue, we take a robust RL approach and introduce a unified operator that seeks robustness to the uncertainty of the proxy reward function. This general operator targets peakier sampling distributions while encompassing known soft RL operators. It also leads us to a novel algorithm that identifies higher-quality, diverse candidates in both synthetic and real-world tasks. Ultimately, our work offers a new, flexible perspective on discrete compositional generation tasks. Code: https://github.com/marcojira/tgm.
Stable Gradients for Stable Learning at Scale in Deep Reinforcement Learning
Scaling deep reinforcement learning networks is challenging and often results in degraded performance, yet the root causes of this failure m… (see more)ode remain poorly understood. Several recent works have proposed mechanisms to address this, but they are often complex and fail to highlight the causes underlying this difficulty. In this work, we conduct a series of empirical analyses which suggest that the combination of non-stationarity with gradient pathologies, due to suboptimal architectural choices, underlie the challenges of scale. We propose a series of direct interventions that stabilize gradient flow, enabling robust performance across a range of network depths and widths. Our interventions are simple to implement and compatible with well-established algorithms, and result in an effective mechanism that enables strong performance even at large scales. We validate our findings on a variety of agents and suites of environments.
Stable Gradients for Stable Learning at Scale in Deep Reinforcement Learning
Scaling deep reinforcement learning networks is challenging and often results in degraded performance, yet the root causes of this failure m… (see more)ode remain poorly understood. Several recent works have proposed mechanisms to address this, but they are often complex and fail to highlight the causes underlying this difficulty. In this work, we conduct a series of empirical analyses which suggest that the combination of non-stationarity with gradient pathologies, due to suboptimal architectural choices, underlie the challenges of scale. We propose a series of direct interventions that stabilize gradient flow, enabling robust performance across a range of network depths and widths. Our interventions are simple to implement and compatible with well-established algorithms, and result in an effective mechanism that enables strong performance even at large scales. We validate our findings on a variety of agents and suites of environments.
What Matters when Modeling Human Behavior using Imitation Learning?
As AI systems become increasingly embedded in human decision-making process, aligning their behavior with human values is critical to ensuri… (see more)ng safe and trustworthy deployment. A central approach to AI Alignment called Imitation Learning (IL), trains a learner to directly mimic desirable human behaviors from expert demonstrations. However, standard IL methods assume that (1) experts act to optimize expected returns; (2) expert policies are Markovian. Both assumptions are inconsistent with empirical findings from behavioral economics, according to which humans are (1) risk-sensitive; and (2) make decisions based on past experience. In this work, we examine the implications of risk sensitivity for IL and show that standard approaches do not capture all optimal policies under risk-sensitive decision criteria. By characterizing these expert policies, we identify key limitations of existing IL algorithms in replicating expert performance in risk-sensitive settings. Our findings underscore the need for new IL frameworks that account for both risk-aware preferences and temporal dependencies to faithfully align AI behavior with human experts.
State Entropy Regularization for Robust Reinforcement Learning
Uri Koren
Yonatan Ashlag
Mirco Mutti
Shie Mannor
State entropy regularization has empirically shown better exploration and sample complexity in reinforcement learning (RL). However, its the… (see more)oretical guarantees have not been studied. In this paper, we show that state entropy regularization improves robustness to structured and spatially correlated perturbations. These types of variation are common in transfer learning but often overlooked by standard robust RL methods, which typically focus on small, uncorrelated changes. We provide a comprehensive characterization of these robustness properties, including formal guarantees under reward and transition uncertainty, as well as settings where the method performs poorly. Much of our analysis contrasts state entropy with the widely used policy entropy regularization, highlighting their different benefits. Finally, from a practical standpoint, we illustrate that compared with policy entropy, the robustness advantages of state entropy are more sensitive to the number of rollouts used for policy evaluation.
Robust Reinforcement Learning for Discrete Compositional Generation via General Soft Operators
A major bottleneck in scientific discovery involves narrowing a large combinatorial set of objects, such as proteins or molecules, to a smal… (see more)l set of promising candidates. While this process largely relies on expert knowledge, recent methods leverage reinforcement learning (RL) to enhance this filtering. They achieve this by estimating proxy reward functions from available datasets and using regularization to generate more diverse candidates. These reward functions are inherently uncertain, raising a particularly salient challenge for scientific discovery. In this work, we show that existing methods, often framed as sampling proportional to a reward function, are inadequate and yield suboptimal candidates, especially in large search spaces. To remedy this issue, we take a robust RL approach and introduce a unified operator that seeks robustness to the uncertainty of the proxy reward function. This general operator targets peakier sampling distributions while encompassing known soft RL operators. It also leads us to a novel algorithm that identifies higher-quality, diverse candidates in both synthetic and real-world tasks. Ultimately, our work offers a new, flexible perspective on discrete compositional generation tasks. Code: https://github.com/marcojira/tgm.
Stable Gradients for Stable Learning at Scale in Deep Reinforcement Learning
Scaling deep reinforcement learning networks is challenging and often results in degraded performance, yet the root causes of this failure m… (see more)ode remain poorly understood. Several recent works have proposed mechanisms to address this, but they are often complex and fail to highlight the causes underlying this difficulty. In this work, we conduct a series of empirical analyses which suggest that the combination of non-stationarity with gradient pathologies, due to suboptimal architectural choices, underlie the challenges of scale. We propose a series of direct interventions that stabilize gradient flow, enabling robust performance across a range of network depths and widths. Our interventions are simple to implement and compatible with well-established algorithms, and result in an effective mechanism that enables strong performance even at large scales. We validate our findings on a variety of agents and suites of environments.