Pierre-Luc Bacon

Understanding the Effectiveness of Learning Behavioral Metrics in Deep Reinforcement Learning

Ziyan Luo

Tianwei Ni

A key approach to state abstraction is approximating behavioral metrics (notably, bisimulation metrics) in the observation space, and embed … (see more)these learned distances in the representation space. While promising for robustness to task-irrelevant noise shown in prior work, accurately estimating these metrics remains challenging, requiring various design choices that create gaps between theory and practice. Prior evaluations focus mainly on final returns, leaving the quality of learned metrics and the source of performance gains unclear. To systematically assess how metric learning works in deep RL, we evaluate five recent approaches. We unify them under isometric embedding, identify key design choices, and benchmark them with baselines across 20 state-based and 14 pixel-based tasks, spanning 250+ configurations with diverse noise settings. Beyond final returns, we introduce the denoising factor to quantify the encoder’s ability to filter distractions. To further isolate the effect of metric learning, we propose an isolated metric estimation setting, where the encoder is influenced solely by the metric loss. Our results show that metric learning improves return and denoising only marginally, as its benefits fade when key design choices, such as layer normalization and self-prediction loss, are incorporated into the baseline. We also find that commonly used benchmarks (e.g., grayscale videos, varying state-based Gaussian noise dimensions) add little difficulty, while Gaussian noise with random projection and pixel-based Gaussian noise remain challenging even for the best methods. Finally, we release an open-source, modular codebase to improve reproducibility and support future research on metric learning in deep RL.

2025-05-09

rl-conference.cc/RLC/2025/Conference (accepted)

openreview.net

Understanding the Effectiveness of Learning Behavioral Metrics in Deep Reinforcement Learning

Ziyan Luo

Tianwei Ni

Pierre-Luc Bacon

Doina Precup

Xujie Si

A key approach to state abstraction is approximating behavioral metrics (notably, bisimulation metrics) in the observation space, and embed … (see more)these learned distances in the representation space. While promising for robustness to task-irrelevant noise shown in prior work, accurately estimating these metrics remains challenging, requiring various design choices that create gaps between theory and practice. Prior evaluations focus mainly on final returns, leaving the quality of learned metrics and the source of performance gains unclear. To systematically assess how metric learning works in deep RL, we evaluate five recent approaches. We unify them under isometric embedding, identify key design choices, and benchmark them with baselines across 20 state-based and 14 pixel-based tasks, spanning 250+ configurations with diverse noise settings. Beyond final returns, we introduce the denoising factor to quantify the encoder’s ability to filter distractions. To further isolate the effect of metric learning, we propose an isolated metric estimation setting, where the encoder is influenced solely by the metric loss. Our results show that metric learning improves return and denoising only marginally, as its benefits fade when key design choices, such as layer normalization and self-prediction loss, are incorporated into the baseline. We also find that commonly used benchmarks (e.g., grayscale videos, varying state-based Gaussian noise dimensions) add little difficulty, while Gaussian noise with random projection and pixel-based Gaussian noise remain challenging even for the best methods. Finally, we release an open-source, modular codebase to improve reproducibility and support future research on metric learning in deep RL.

2025-05-09

rl-conference.cc/RLC/2025/Conference (published)

openreview.net

Mol-MoE: Training Preference-Guided Routers for Molecule Generation

Diego Calanzone

Pierluca D'Oro

Pierre-Luc Bacon

Recent advances in language models have enabled framing molecule generation as sequence modeling. However, existing approaches often rely on… (see more) single-objective reinforcement learning, limiting their applicability to real-world drug design, where multiple competing properties must be optimized. Traditional multi-objective reinforcement learning (MORL) methods require costly retraining for each new objective combination, making rapid exploration of trade-offs impractical. To overcome these limitations, we introduce Mol-MoE, a mixture-of-experts (MoE) architecture that enables efficient test-time steering of molecule generation without retraining. Central to our approach is a preference-based router training objective that incentivizes the router to combine experts in a way that aligns with user-specified trade-offs. This provides improved flexibility in exploring the chemical property space at test time, facilitating rapid trade-off exploration. Benchmarking against state-of-the-art methods, we show that Mol-MoE achieves superior sample quality and steerability.

2025-03-05

ICLR.cc/2025/Workshop/GEM (published)

doi.org

openreview.net

Maxwell's Demon at Work: Efficient Pruning by Leveraging Saturation of Neurons

Simon Dufort-Labbé

Pierluca D'Oro

Evgenii Nikishin

Razvan Pascanu

Irina Rish

Pierre-Luc Bacon

Aristide Baratin

2025-02-13

TMLR (accepted)

doi.org

openreview.net

Mol-MoE: Training Preference-Guided Routers for Molecule Generation

Diego Calanzone

Pierluca D'Oro

Pierre-Luc Bacon

Recent advances in language models have enabled framing molecule generation as sequence modeling. However, existing approaches often rely on… (see more) single-objective reinforcement learning, limiting their applicability to real-world drug design, where multiple competing properties must be optimized. Traditional multi-objective reinforcement learning (MORL) methods require costly retraining for each new objective combination, making rapid exploration of trade-offs impractical. To overcome these limitations, we introduce Mol-MoE, a mixture-of-experts (MoE) architecture that enables efficient test-time steering of molecule generation without retraining. Central to our approach is a preference-based router training objective that incentivizes the router to combine experts in a way that aligns with user-specified trade-offs. This provides improved flexibility in exploring the chemical property space at test time, facilitating rapid trade-off exploration. Benchmarking against state-of-the-art methods, we show that Mol-MoE achieves superior sample quality and steerability.

2025-02-08

ArXiv (preprint)

arxiv.org

MaestroMotif: Skill Design from Artificial Intelligence Feedback

Martin Klissarov

Mikael Henaff

Roberta Raileanu

Shagun Sodhani

Pascal Vincent

Amy Zhang

Pierre-Luc Bacon

Doina Precup

Marlos C. Machado

Pierluca D'Oro

Describing skills in natural language has the potential to provide an accessible way to inject human knowledge about decision-making into an… (see more) AI system. We present MaestroMotif, a method for AI-assisted skill design, which yields high-performing and adaptable agents. MaestroMotif leverages the capabilities of Large Language Models (LLMs) to effectively create and reuse skills. It first uses an LLM's feedback to automatically design rewards corresponding to each skill, starting from their natural language description. Then, it employs an LLM's code generation abilities, together with reinforcement learning, for training the skills and combining them to implement complex behaviors specified in language. We evaluate MaestroMotif using a suite of complex tasks in the NetHack Learning Environment (NLE), demonstrating that it surpasses existing approaches in both performance and usability.

2025-01-22

ICLR.cc/2025/Conference (oral)

doi.org

openreview.net

MaestroMotif: Skill Design from Artificial Intelligence Feedback

Martin Klissarov

Mikael Henaff

Roberta Raileanu

Shagun Sodhani

Pascal Vincent

Amy Zhang

Pierre-Luc Bacon

Doina Precup

Marlos C. Machado

Pierluca D'Oro

Describing skills in natural language has the potential to provide an accessible way to inject human knowledge about decision-making into an… (see more) AI system. We present MaestroMotif, a method for AI-assisted skill design, which yields high-performing and adaptable agents. MaestroMotif leverages the capabilities of Large Language Models (LLMs) to effectively create and reuse skills. It first uses an LLM's feedback to automatically design rewards corresponding to each skill, starting from their natural language description. Then, it employs an LLM's code generation abilities, together with reinforcement learning, for training the skills and combining them to implement complex behaviors specified in language. We evaluate MaestroMotif using a suite of complex tasks in the NetHack Learning Environment (NLE), demonstrating that it surpasses existing approaches in both performance and usability.

2024-12-11

ArXiv (preprint)

doi.org

arxiv.org

MaestroMotif: Skill Design from Artificial Intelligence Feedback

Martin Klissarov

Mikael Henaff

Roberta Raileanu

Shagun Sodhani

Pascal Vincent

Amy Zhang

Pierre-Luc Bacon

Doina Precup

Marlos C. Machado

Pierluca D'Oro

Describing skills in natural language has the potential to provide an accessible way to inject human knowledge about decision-making into an… (see more) AI system. We present MaestroMotif, a method for AI-assisted skill design, which yields high-performing and adaptable agents. MaestroMotif leverages the capabilities of Large Language Models (LLMs) to effectively create and reuse skills. It first uses an LLM's feedback to automatically design rewards corresponding to each skill, starting from their natural language description. Then, it employs an LLM's code generation abilities, together with reinforcement learning, for training the skills and combining them to implement complex behaviors specified in language. We evaluate MaestroMotif using a suite of complex tasks in the NetHack Learning Environment (NLE), demonstrating that it surpasses existing approaches in both performance and usability.

2024-12-11

ArXiv (preprint)

arxiv.org

Neural differential equations for temperature control in buildings under demand response programs

Vincent Taboga

Clement Gehring

Mathieu Le Cam

Hanane Dagdougui

Pierre-Luc Bacon

2024-08-01

Applied Energy (published)

doi.org

Neural differential equations for temperature control in buildings under demand response programs

Vincent Taboga

Clement Gehring

Mathieu Le Cam

Hanane Dagdougui

Pierre-Luc Bacon

2024-08-01

Applied Energy (published)

doi.org

Effects of Scale on Language Model Robustness

Nikolaus H. R. Howe

Ian R. McKenzie

Oskar John Hollinsworth

Michał Zając

Tom Tseng

Aaron David Tucker

Pierre-Luc Bacon

Adam Gleave

Language models exhibit scaling laws, whereby increasing model and dataset size yields predictable decreases in negative log likelihood, unl… (see more)ocking a dazzling array of capabilities. This phenomenon spurs many companies to train ever larger models in pursuit of ever improved performance. Yet, these models are vulnerable to adversarial inputs such as ``jailbreaks'' and prompt injections that induce models to perform undesired behaviors, posing a growing risk as models become more capable. Prior work indicates that computer vision models become more robust with model and data scaling, raising the question: does language model robustness also improve with scale? We study this question empirically in the classification setting, finding that without explicit defense training, larger models tend to be modestly more robust on most tasks, though the effect is not reliable. Even with the advantage conferred by scale, undefended models remain easy to attack in absolute terms, and we thus turn our attention to explicitly training models for adversarial robustness, which we show to be a much more compute-efficient defense than scaling model size alone. In this setting, we also observe that adversarially trained larger models generalize faster and better to modified attacks not seen during training when compared with smaller models. Finally, we analyze the offense/defense balance of increasing compute, finding parity in some settings and an advantage for offense in others, suggesting that adversarial training alone is not sufficient to solve robustness, even at greater model scales.

2024-07-25

ArXiv (preprint)

arxiv.org

Scaling Trends in Language Model Robustness

Nikolaus H. R. Howe

Ian R. McKenzie

Oskar John Hollinsworth

Michał Zając

Tom Tseng

Aaron David Tucker

Pierre-Luc Bacon

Adam Gleave

2024-07-25

ArXiv (preprint)

arxiv.org

AI Advantage

Mila AI Policy Fellowship

Strategic Priorities

AI Advantage

Mila AI Policy Fellowship

Pierre-Luc Bacon

Biography

Current Students

Blog Posts

Publications

AI Advantage

Mila AI Policy Fellowship

Strategic Priorities

AI Advantage

Mila AI Policy Fellowship

Popular keywords:

Pierre-Luc Bacon

Biography

Current Students

Blog Posts

Publications