Doina Precup

Samin Yeasar Arnob

PhD - McGill University

Sumana Basu

Collaborating Alumni - McGill University

Co-supervisor :

Adriana Romero Soriano

Collaborating Alumni - McGill University

Raymond Chua

PhD - McGill University

Co-supervisor :

PhD - McGill University

Principal supervisor :

David Meger

Jonathan Colaço Carr

Master's Research - McGill University

Principal supervisor :

Prakash Panangaden

Élodie Coté-Gauthier

Collaborating researcher - McGill University

Co-supervisor :

Isabeau Prémont-Schwarz

Franco Del Balso

Collaborating researcher - Université de Montréal

Jesse Farebrother

PhD - McGill University

Principal supervisor :

Marc Gendron-Bellemare

PhD - McGill University

Principal supervisor :

Collaborating researcher - Birla Institute of Technology

Jonathan Hu

Master's Research - McGill University

Howard Huang

PhD - McGill University

Haque Ishfaq

Collaborating Alumni - McGill University

Mohammad Sami Nur Islam Islam

Master's Research - McGill University

Hangzhan Jin

PhD - Polytechnique Montréal

Martin Klissarov

PhD - McGill University

Postdoctorate - McGill University

Jonathan Lebensold

Collaborating Alumni - McGill University

Collaborating Alumni - McGill University

Ray Luo

PhD - McGill University

Principal supervisor :

G McCracken

PhD - McGill University

Nazanin Mohammadi Sepahvand

Collaborating Alumni - McGill University

Shahrad Mohammadzadeh

Master's Research - McGill University

Principal supervisor :

Gabriela Moisescu-Pareja

Collaborating researcher - McGill University

Co-supervisor :

Irina Rish

Padideh Nouri

PhD - Université de Montréal

Co-supervisor :

PhD - McGill University

Co-supervisor :

Research Intern - McGill University

Nate Rahn

PhD - McGill University

Principal supervisor :

Marc Gendron-Bellemare

Manoosh Samiei

PhD - McGill University

Co-supervisor :

PhD - McGill University

Co-supervisor :

PhD - McGill University

Nishanth Anand Vemgal

PhD - McGill University

PhD - McGill University

Co-supervisor :

Samira Ebrahimi Kahou

Research Intern - McGill University

Zihan Wang

PhD - McGill University

Skipper: Combining Spatial and Temporal Abstraction for Better Generalization

Steve Wen

Master's Research - McGill University

Co-supervisor :

Gregory Dudek

Zijing Wu

PhD - McGill University

Principal supervisor :

PhD - McGill University

Harry Zhao

Collaborating Alumni - McGill University

Co-supervisor :

Blog Posts

February 22, 2024

Mingde Harry Zhao

Safa Alver

Harm van Seijen

Romain Laroche

Doina Precup

Yoshua Bengio

Publications

Hierarchical Integration of Predictive Representations of State from General Value Functions

Sonny Jones

Patrick M. Pilarski

Ashley N Dalrymple

In this work, we investigate how predictive representations of state in the form of continually learned General Value Functions (GVFs) interact with downstream policy networks. Intelligent agents deployed in real-world environments need to adapt to changing conditions in their environment. Adapting to one’s environment requires a model or representation of the environment on which to base decision-making. Models that take the form of predictions and GVFs have been shown to provide temporally abstracted predictive representations of state that can forecast useful elements of an agent's or environment's future behaviour. While GVFs have been concretely deployed in rehabilitation and robotic domains, existing approaches treat predictions as input features into model frameworks, without examining or comparing how best to integrate them into downstream learning processes. In this work, we compare multiple strategies for integrating observations and GVF predictions into another learning architecture: 1) actual observations solely in the input layer, 2) predictions solely in the input layer, 3) actual observations and predictions in the input layer, and 4) actual observations in the input layer and predictions in the later latent representations. We evaluate these strategies in a rehabilitation setting, using GVFs to learn predictive representations of kinetic and kinematic signals collected from wearable sensors on the lower limb during ambulation across varied terrains, and policy networks to classify walking terrain.

2026-06-09

rl-conference.cc/RLC/2026/Workshop/RL_in_Big_Worlds (poster)

Using Reward Uncertainty to Induce Diverse Behaviour in Reinforcement Learning

Anthony GX-Chen

Ankit Anand

Gheorghe Comanici

Zaheer Abbas

Eser Aygün

David Smalling

Shibl Mourad

Andre Barreto

Mark Rowland

Classical reinforcement learning (RL) typically seeks a deterministic policy that maximizes the expected sum of a scalar reward. Yet, modern applications such as language model fine-tuning or scientific discovery demand diversity. Existing remedies such as entropy regularization or diversity bonuses often require fragile trade-offs that sacrifice performance for stochasticity or rely on heuristic metrics that can misalign policy rankings. We argue that diversity is more naturally understood as the rational response to uncertainty in the reward. When the reward function is not perfectly known (as is the case with ambiguous preferences or imperfect reward models), committing to a single action can be sub-optimal. Building on this, we propose a fundamental reformulation of the RL objective by replacing the scalar reward with a distribution over reward functions, and applying a non-linear objective over sets of actions. The result is a framework in which calibrated behavioural diversity emerges naturally, remains controllable through the reward function distribution, and is obtained without sacrificing expected reward. Focusing on the contextual bandit setting, we derive a principled gradient estimator for this objective and prove that our formulation naturally generalizes both vanilla policy gradient and more recently developed action-set approaches. Our empirical results demonstrate that this framework offers a robust and theoretically grounded alternative for complex RL tasks where the traditional formulation of the problem fails to induce the desired breadth of agent behaviour.

2026-06-01

arXiv (preprint)

The schema spectrum: Emergent structures and levels of abstraction in AI and the brain

Mandana Samiei

Blake A. Richards

2026-05-31

Neuron (published)

Reinforcement Learning with Pairwise Preferences in Long-Term Decision Problems

Jonathan Colaço Carr

Prakash Panangaden

Benjamin Van Roy

Reinforcement learning problems typically define the goal as maximizing the expected value of a scalar reward function. But pairwise preferences are often easier to specify than scalar rewards, and they express certain goals that scalar rewards cannot. Methods for reinforcement learning with pairwise preferences have thus received growing interest. Unfortunately, these methods are inefficient in problems with long time horizons, and they lack guarantees on the performance of Markov policies relative to history-dependent policies, which bridge the theory and practice of reinforcement learning. We therefore propose the Markov decision contest as a new problem model for reinforcement learning with pairwise preferences. We prove that stationary Markov policies are optimal among all history-dependent policies, that solving a Markov decision contest exactly is in P, and that a simple iterative algorithm converges to an optimal policy at a sublinear rate. Lastly, in a set of high-dimensional decision problems with long time horizons, we show that our approximate algorithm is significantly more learning-efficient than prior work.

2026-05-28

arXiv (preprint)

A systematic review of human-LLM interactions in computational thinking empirical studies

Yimei Zhang

You Song

Reihaneh Rabbany

Maria Cutumisu

2026-05-11

Computer Science Education (published)

Rotation-Preserving Supervised Fine-Tuning

Mohammad Hamdaqa

Supervised fine-tuning (SFT) improves in-domain performance but can degrade out-of-domain (OOD) generalization. Prior work suggests that this degradation is related to changes in dominant singular subspaces of pretrained weight matrices. However, directly identifying loss-sensitive directions with Hessian or Fisher information is computationally expensive at LLM scale. In this work, we propose preserving projected rotations in pretrained singular subspaces as an efficient proxy for Fisher-sensitive directions, which we call Rotation-Preserving Supervised Fine-Tuning (RPSFT). RPSFT penalizes changes in the projected top-

2026-05-07

arXiv (preprint)

Nazanin Mohammadi Sepahvand

Detoxifying LLMs via Representation Erasure-Based Preference Optimization

Eleni Triantafillou

Hugo Larochelle

Gintare Karolina Dziugaite

Daniel M. Roy

Large language models (LLMs) trained on web-scale data can produce toxic outputs, raising concerns for safe deployment. Prior defenses, based on applications of DPO, NPO, and similar algorithms, reduce the likelihood of harmful continuations, but not robustly so: they are vulnerable to adversarial prompting and easily undone by fine-tuning-based relearning attacks. Indeed, research has shown that these edits to the model are superficial: linear probing reveals that harmful "directions" remain present in representations. To address this, we propose Representation Erasure-based Preference Optimization (REPO), reformulating detoxification as a token-level preference problem. Using a novel objective with preference data, we force the representations of toxic continuations to converge toward their benign counterparts. Our mechanistic analysis reveals that this granular approach is critical: unlike baselines, REPO induces deep, localized edits to toxicity-encoding neurons while preserving general model utility. Exhaustive evaluations show that REPO achieves state-of-the-art robustness, stopping sophisticated threats, including relearning attacks and enhanced GCG jailbreaks, where existing representation- and output-based methods fail.

2026-02-23

arXiv (unknown)

Fluid-Agent Reinforcement Learning

Shishir Sharma

Theodore J. Perkins

The primary focus of multi-agent reinforcement learning (MARL) has been to study interactions among a fixed number of agents embedded in an environment. However, in the real world, the number of agents is neither fixed nor known a priori. Moreover, an agent can decide to create other agents (for example, a cell may divide, or a company may spin off a division). In this paper, we propose a framework that allows agents to create other agents; we call this a fluid-agent environment. We present game-theoretic solution concepts for fluid-agent games and empirically evaluate the performance of several MARL algorithms within this framework. Our experiments include fluid variants of established benchmarks such as Predator-Prey and Level-Based Foraging, where agents can dynamically spawn, as well as a new environment we introduce that highlights how fluidity can unlock novel solution strategies beyond those observed in fixed-population settings. We demonstrate that this framework yields agent teams that adjust their size dynamically to match environmental demands.

2026-02-15

arXiv (preprint)

Pregnancy AI: Development and Internal Validation of an Artificial Intelligence Tool to Predict Live Births in ICSI and IVF Cycles Using Clinical Features and Embryo Images

Jaume Minano Masip

Penelope Borduas

Isaac-Jacques Kadoch

Simon Phillips

Daniel Dufort

2026-02-11

Medicina (published)

Affordances Enable Partial World Modeling with LLMs

Khimya Khetarpal

Gheorghe Comanici

Jonathan Richens

Jeremy Shar

Fei Xia

Laurent Orseau

Aleksandra Faust

2026-02-10

arXiv (preprint)

Deep neural networks divide and conquer dihedral multiplication

Sihui Wei

Gavin McCracken

Gabriela Moisescu-Pareja

Harley Wiltzer

Irina Rish

Jonathan Love

We find that multilayer perceptrons and transformers both universally learn an instantiation of the same divide-and-conquer algorithm that requires only a logarithmic number of neural representations to solve dihedral multiplication. Clustering neurons based on similar activation behaviour reveals remarkably clear structure: each neural representation corresponds to a Cayley graph. To our knowledge, this is the first work that fully characterizes and describes all neural representations that are learnable on a dataset, while prior work on group multiplications studied neuron-level behaviour, or preliminarily investigated cluster behaviour. Thus, we can understand the algorithm networks universally learn at three levels of abstraction: 1) Neurons activate on coset or approximate coset structure of the dihedral group. 2) Groups of neurons together form neural representations that act to divide the dataset into different subproblems, being Cayley graphs, where the equivalence class of the answer is computed. 3) The global algorithm then linearly combines each neural representation (subproblem) together at the logits. This work provides a deep case study, gives the community a very well understood toy model for interpretability, and takes steps toward proving the conjecture that DNNs will divide and conquer all group multiplication tasks.

2025-12-31

International Conference on Machine Learning (Accept (regular))

Learning from Pairwise Preferences in Long-Term Decision Problems

Jonathan Colaço Carr

Prakash Panangaden

Benjamin Van Roy

Agents that can beat or tie any other under a model of pairwise preference have strong guarantees for both user satisfaction and overall social welfare. However, searching for these agents in long-term decision problems is not computationally tractable with current approaches, which require the size of an agent's policy to increase with the problem length. We introduce the Markov decision contest, a model of learning from general preferences in long-term (infinite-horizon) decision problems. Within this model, we prove that agents only need a stationary Markov policy in order to be optimal (that is, to beat or tie any agent with a history-dependent policy); that the problem of finding an optimal policy is in P; and that a simple iterative algorithm (which we call Hedged Policy Iteration) converges to an optimal policy at a sublinear rate. In a suite of high-dimensional experiments, we demonstrate that Hedged Policy Iteration scales well to function approximation. Lastly, we present a near approximation of Hedged Policy Iteration, called HPI-Clip, which matches the performance of Proximal Policy Optimization on reward-based tasks while outperforming it on tasks with non-transitive preferences. These results show that learning from pairwise preferences in long-term decision problems can be far more tractable than what is known from prior work.

2025-12-31

International Conference on Machine Learning (Accept (regular))