Doina Precup

Mohammad Sami Nur Islam Islam

Jesse Farebrother

Doctorat - McGill

Superviseur⋅e principal⋅e :

Marc Gendron-Bellemare

Doctorat - McGill

Superviseur⋅e principal⋅e :

Eilif Benjamin Muller

Doctorat - McGill

Doctorat - McGill

Maîtrise recherche - McGill

Nazanin Mohammadi Sepahvand

Google Scholar

Arushi Jain

Doctorat - McGill

Doctorat - McGill

Postdoctorat - McGill

Google Scholar

Elaine Lau

Maîtrise recherche - McGill

Jonathan Lebensold

Collaborateur·rice alumni - McGill

Baccalauréat - McGill

Ray Luo

Doctorat - McGill

Superviseur⋅e principal⋅e :

G McCracken

Doctorat - McGill

Google Scholar

Doctorat - McGill

Shahrad Mohammadzadeh

Maîtrise recherche - McGill

Superviseur⋅e principal⋅e :

Gabriela Moisescu-Pareja

Maîtrise recherche - McGill

Padideh Nouri

Doctorat - UdeM

Co-superviseur⋅e :

Charles Onu

Doctorat - McGill

Doctorat - McGill

Co-superviseur⋅e :

Nate Rahn

Doctorat - McGill

Superviseur⋅e principal⋅e :

Marc Gendron-Bellemare

Sahand Rezaei-Shoshtari

Doctorat - McGill

Co-superviseur⋅e :

Doctorat - McGill

Co-superviseur⋅e :

Doctorat - McGill

Co-superviseur⋅e :

Blake Richards

samiemandana@gmail.com

Doctorat - McGill

Nishanth Anand Vemgal

Doctorat - McGill

Doctorat - McGill

Doctorat - McGill

Co-superviseur⋅e :

Samira Ebrahimi Kahou

Stagiaire de recherche - McGill

Steve Wen

Maîtrise recherche - McGill

Co-superviseur⋅e :

Gregory Dudek

Skipper : combiner l’abstraction spatiale et temporelle afin d’améliorer la généralisation

Zijing Wu

Doctorat - McGill

Co-superviseur⋅e :

Doctorat - McGill

Harry Zhao

Doctorat - McGill

Co-superviseur⋅e :

Billets de blogue

Generic thumbnail for Mila Blog articles.

22 février 2024

par

Mingde Harry Zhao

Safa Alver

Harm van Seijen

Romain Laroche

Doina Precup

Yoshua Bengio

Lire l'article

Publications

Cracking the Code of Action: A Generative Approach to Affordances for Reinforcement Learning

Lynn Cherif

Flemming Kondrup

David Venuto

Ankit Anand

Khimya Khetarpal

Agents that can autonomously navigate the web through a graphical user interface (GUI) using a unified action space (e.g., mouse and keyboar… (voir plus)d actions) can require very large amounts of domain-specific expert demonstrations to achieve good performance. Low sample efficiency is often exacerbated in sparse-reward and large-action-space environments, such as a web GUI, where only a few actions are relevant in any given situation. In this work, we consider the low-data regime, with limited or no access to expert behavior. To enable sample-efficient learning, we explore the effect of constraining the action space through intent-based affordances -- i.e., considering in any situation only the subset of actions that achieve a desired outcome. We propose **Code as Generative Affordances**

2025-03-05

ICLR.cc/2025/Workshop/DL4C (publié)

Exploring Sparse Adapters for Scalable Merging of Parameter Efficient Experts

Samin Yeasar Arnob

Zhan Su

Minseon Kim

Oleksiy Ostapenko

Lucas Caccia

Alessandro Sordoni

Merging parameter-efficient task experts has recently gained growing attention as a way to build modular architectures that can be rapidly a… (voir plus)dapted on the fly for specific downstream tasks, without requiring additional fine-tuning. Typically, LoRA (Low-Rank Adaptation) serves as the foundational building block of such parameter-efficient modular architectures, leveraging low-rank weight structures to reduce the number of trainable parameters. In this paper, we study the properties of sparse adapters, which train only a subset of weights in the base neural network, as potential building blocks of modular architectures. First, we propose a simple method for training highly effective sparse adapters, which is conceptually simpler than existing methods in the literature and surprisingly outperforms both LoRA and full fine-tuning in our setting. Next, we investigate the merging properties of these sparse adapters by merging adapters for up to 20 natural language processing tasks, thus scaling beyond what is usually studied in the literature. Our findings demonstrate that sparse adapters yield superior in-distribution performance post-merging compared to LoRA or full model merging. Achieving strong held-out performance remains a challenge for all methods considered.

2025-03-05

ICLR.cc/2025/Workshop/MCDC (accepté)

Exploring Sparse Adapters for Scalable Merging of Parameter Efficient Experts

Samin Yeasar Arnob

Zhan Su

Minseon Kim

Oleksiy Ostapenko

Lucas Caccia

Alessandro Sordoni

2025-03-05

ICLR.cc/2025/Workshop/MCDC (accepté)

Partial Models for Building Adaptive Model-Based Reinforcement Learning Agents

Safa Alver

Ali Rahimi-Kalahroudi

In neuroscience, one of the key behavioral tests for determining whether a subject of study exhibits model-based behavior is to study its ad… (voir plus)aptiveness to local changes in the environment. In reinforcement learning, however, recent studies have shown that modern model-based agents display poor adaptivity to such changes. The main reason for this is that modern agents are typically designed to improve sample efficiency in single task settings and thus do not take into account the challenges that can arise in other settings. In local adaptation settings, one particularly important challenge is in quickly building and maintaining a sufficiently accurate model after a local change. This is challenging for deep model-based agents as their models and replay buffers are monolithic structures lacking distribution shift handling capabilities. In this study, we show that the conceptually simple idea of partial models can allow deep model-based agents to overcome this challenge and thus allow for building locally adaptive model-based agents. By modeling the different parts of the state space through different models, the agent can not only maintain a model that is accurate across the state space, but it can also quickly adapt it in the presence of a local change in the environment. We demonstrate this by showing that the use of partial models in agents such as deep Dyna-Q, PlaNet and Dreamer can allow for them to effectively adapt to the local changes in their environments.

2025-02-17

Proceedings of The 3rd Conference on Lifelong Learning Agents (publié)

Agency Is Frame-Dependent

David Abel

Andre Barreto

Michael Bowling

Will Dabney

Shi Dong

Steven Hansen

Anna Harutyunyan

Khimya Khetarpal

Clare Lyle

Razvan Pascanu

Georgios Piliouras

Jonathan Richens

Mark Rowland

Tom Schaul

Satinder Singh

Agency is a system's capacity to steer outcomes toward a goal, and is a central topic of study across biology, philosophy, cognitive science… (voir plus), and artificial intelligence. Determining if a system exhibits agency is a notoriously difficult question: Dennett (1989), for instance, highlights the puzzle of determining which principles can decide whether a rock, a thermostat, or a robot each possess agency. We here address this puzzle from the viewpoint of reinforcement learning by arguing that agency is fundamentally frame-dependent: Any measurement of a system's agency must be made relative to a reference frame. We support this claim by presenting a philosophical argument that each of the essential properties of agency proposed by Barandiaran et al. (2009) and Moreno (2018) are themselves frame-dependent. We conclude that any basic science of agency requires frame-dependence, and discuss the implications of this claim for reinforcement learning.

2025-02-06

ArXiv (prépublication)

Agency Is Frame-Dependent

David Abel

Andre Barreto

Michael Bowling

Will Dabney

Shi Dong

Steven Hansen

A. Harutyunyan

Khimya Khetarpal

Clare Lyle

Razvan Pascanu

Georgios Piliouras

Jonathan Richens

Mark Rowland

Tom Schaul

Satinder Singh

2025-02-06

ArXiv (prépublication)

Langevin Soft Actor-Critic: Efficient Exploration through Uncertainty-Driven Critic Learning

Haque Ishfaq

Guangyuan Wang

Mohammad Sami Nur Islam

Existing actor-critic algorithms, which are popular for continuous control reinforcement learning (RL) tasks, suffer from poor sample effici… (voir plus)ency due to lack of principled exploration mechanism within them. Motivated by the success of Thompson sampling for efficient exploration in RL, we propose a novel model-free RL algorithm, Langevin Soft Actor Critic (LSAC), which prioritizes enhancing critic learning through uncertainty estimation over policy optimization. LSAC employs three key innovations: approximate Thompson sampling through distributional Langevin Monte Carlo (LMC) based

2025-01-22

ICLR.cc/2025/Conference (poster)

MaestroMotif: Skill Design from Artificial Intelligence Feedback

Martin Klissarov

Mikael Henaff

Roberta Raileanu

Shagun Sodhani

Pascal Vincent

Amy Zhang

Pierre-Luc Bacon

Marlos C. Machado

Pierluca D'Oro

Describing skills in natural language has the potential to provide an accessible way to inject human knowledge about decision-making into an… (voir plus) AI system. We present MaestroMotif, a method for AI-assisted skill design, which yields high-performing and adaptable agents. MaestroMotif leverages the capabilities of Large Language Models (LLMs) to effectively create and reuse skills. It first uses an LLM's feedback to automatically design rewards corresponding to each skill, starting from their natural language description. Then, it employs an LLM's code generation abilities, together with reinforcement learning, for training the skills and combining them to implement complex behaviors specified in language. We evaluate MaestroMotif using a suite of complex tasks in the NetHack Learning Environment (NLE), demonstrating that it surpasses existing approaches in both performance and usability.

2025-01-22

ICLR.cc/2025/Conference (présentation orale)

Selective Unlearning via Representation Erasure Using Domain Adversarial Training

Nazanin Mohammadi Sepahvand

Eleni Triantafillou

Hugo Larochelle

Gintare Karolina Dziugaite

James J. Clark

Daniel M. Roy

When deploying machine learning models in the real world, we often face the challenge of “unlearning” specific data points or subsets a… (voir plus)fter training. Inspired by Domain-Adversarial Training of Neural Networks (DANN), we propose a novel algorithm,SURE, for targeted unlearning.SURE treats the process as a domain adaptation problem, where the “forget set” (data to be removed) and a validation set from the same distribution form two distinct domains. We train a domain classifier to discriminate between representations from the forget and validation sets.Using a gradient reversal strategy similar to DANN, we perform gradient updates to the representations to “fool” the domain classifier and thus obfuscate representations belonging to the forget set. Simultaneously, gradient descent is applied to the retain set (original training data minus the forget set) to preserve its classification performance. Unlike other unlearning approaches whose training objectives are built based on model outputs, SURE directly manipulates the representations.This is key to ensure robustness against a set of more powerful attacks than currently considered in the literature, that aim to detect which examples were unlearned through access to learned embeddings. Our thorough experiments reveal that SURE has a better unlearning quality to utility trade-off compared to other standard unlearning techniques for deep neural networks.

2025-01-22

ICLR.cc/2025/Conference (poster)

Selective Unlearning via Representation Erasure Using Domain Adversarial Training

Nazanin Mohammadi Sepahvand

Eleni Triantafillou

Hugo Larochelle

Gintare Karolina Dziugaite

James J. Clark

Daniel M. Roy

2025-01-22

ICLR.cc/2025/Conference (poster)

Training Language Models to Self-Correct via Reinforcement Learning

Aviral Kumar

Vincent Zhuang

Rishabh Agarwal

Yi Su

John D Co-Reyes

Avi Singh

Kate Baumli

Shariq Iqbal

Colton Bishop

Rebecca Roelofs

Lei M Zhang

Kay McKinney

Disha Shrivastava

Cosmin Paduraru

George Tucker

Feryal Behbahani

Aleksandra Faust

Self-correction is a highly desirable capability of large language models (LLMs), yet it has consistently been found to be largely ineffecti… (voir plus)ve in modern LLMs. Existing approaches for training self-correction either require multiple models or rely on a more capable model or other forms of supervision. To this end, we develop a multi-turn online reinforcement learning (RL) approach, SCoRe, that significantly improves an LLM's self-correction ability using entirely self-generated data. To build SCoRe, we first show that variants of supervised fine-tuning (SFT) on offline model-generated correction traces are insufficient for instilling self-correction behavior. In particular, we observe that training via SFT either suffers from a distribution mismatch between the training data and the model's own responses or implicitly prefers only a certain mode of correction behavior that is often not effective at test time. SCoRe addresses these challenges by training under the model's own distribution of self-generated correction traces and using appropriate regularization to steer the learning process into learning a self-correction strategy that is effective at test time as opposed to simply fitting high-reward responses for a given prompt. This regularization prescribes running a first phase of RL on a base model to generate a policy initialization that is less susceptible to collapse and then using a reward bonus to amplify self-correction during training. When applied to Gemini 1.0 Pro and 1.5 Flash models, we find that SCoRe achieves state-of-the-art self-correction performance, improving the base models' self-correction by 15.6% and 9.1% respectively on the MATH and HumanEval benchmarks.

2025-01-01

ICLR (publié)

Fairness in Reinforcement Learning with Bisimulation Metrics

Sahand Rezaei-Shoshtari

Hanna Yurchyk

Scott Fujimoto

David Meger

Ensuring long-term fairness is crucial when developing automated decision making systems, specifically in dynamic and sequential environment… (voir plus)s. By maximizing their reward without consideration of fairness, AI agents can introduce disparities in their treatment of groups or individuals. In this paper, we establish the connection between bisimulation metrics and group fairness in reinforcement learning. We propose a novel approach that leverages bisimulation metrics to learn reward functions and observation dynamics, ensuring that learners treat groups fairly while reflecting the original problem. We demonstrate the effectiveness of our method in addressing disparities in sequential decision making problems through empirical evaluation on a standard fairness benchmark consisting of lending and college admission scenarios.

2024-12-22

ArXiv (prépublication)