Aaron Courville

Anirudh Buvanesh

PhD - UdeM

Principal supervisor:

Laurent Charlin

anirudb1102@gmail.com

Razvan Ciuca

Research Master's - Université de Montréal

Alexandre Diz Ganito

Research Master's - UdeM

Juan Duque

PhD - UdeM

PhD - UdeM

PhD - UdeM

Uday Kapur

Professional Master's - UdeM

Amr Khalifa

PhD - UdeM

andrei.nicolicioiu@gmail.com

Samuel Lavoie

PhD - UdeM

Zhixuan Lin

PhD - UdeM

PhD - UdeM

Principal supervisor:

PhD - UdeM

PhD - UdeM

Co-supervisor:

Rishabh Agarwal

Andrei Nicolicioiu

PhD - UdeM

Evgenii Nikishin

Alumni collaborator - UdeM

Principal supervisor:

PhD - UdeM

Michell Mercedes Payano Perez

PhD - UdeM

Co-supervisor:

Research intern - UdeM

Dereck Piché

Research Master's - UdeM

pichedereck@gmail.com

Esra'a Saleh

PhD - UdeM

Principal supervisor:

Research Master's - UdeM

Principal supervisor:

Anna (Cheng-Zhi) Huang

PhD - UdeM

Principal supervisor:

(Rex) Devon Hjelm

Yusong Wu

PhD - UdeM

Principal supervisor:

Anna (Cheng-Zhi) Huang

Xiaofeng Zhang

PhD - UdeM

Dinghuai Zhang

PhD - UdeM

Co-supervisor:

Yoshua Bengio

Hattie Zhou

PhD - UdeM

Principal supervisor:

Hugo Larochelle

Publications

Towards Sustainable Investment Policies Informed by Opponent Shaping

Juan Agustin Duque

Addressing climate change requires global coordination, yet rational economic actors often prioritize immediate gains over collective welfare, resulting in social dilemmas. InvestESG is a recently proposed multi-agent simulation that captures the dynamic interplay between investors and companies under climate risk. We provide a formal characterization of the conditions under which InvestESG exhibits an intertemporal social dilemma, deriving theoretical thresholds at which individual incentives diverge from collective welfare. Building on this, we apply Advantage Alignment, a scalable opponent shaping algorithm shown to be effective in general-sum games, to influence agent learning in InvestESG. We offer theoretical insights into why Advantage Alignment systematically favors socially beneficial equilibria by biasing learning dynamics toward cooperative outcomes. Our results demonstrate that strategically shaping the learning processes of economic agents can result in better outcomes that could inform policy mechanisms to better align market incentives with long-term sustainability goals.

2025-06-23

rl-conference.cc/RLC/2025/Workshop/CoCoMARL (poster)

Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Thinking

Sangmin Bae

Yujin Kim

Reza Bayat

Sungnyun Kim

Jiyoun Ha

Tal Schuster

Adam Fisch

Hrayr Harutyunyan

Ziwei Ji

Se-Young Yun

Scaling language models unlocks impressive capabilities, but the accompanying computational and memory demands make both training and deployment expensive. Existing efficiency efforts typically target either parameter sharing or adaptive computation, leaving open the question of how to attain both simultaneously. We introduce Mixture-of-Recursions (MoR), a unified framework that combines the two axes of efficiency inside a single Recursive Transformer. MoR reuses a shared stack of layers across recursion steps to achieve parameter efficiency, while lightweight routers enable adaptive token-level thinking by dynamically assigning recursion depth to tokens, thereby focusing quadratic attention computation only where it is most useful. Further enhancing its efficiency, MoR incorporates a recursion-wise key-value caching mechanism that eliminates redundant memory access across recursion steps by selectively storing only the key-value caches for designated tokens. Across pretraining runs at model scales ranging from 135M to 1.7B parameters, MoR forms a new Pareto frontier: at equal training FLOPs and smaller model sizes, it significantly lowers validation perplexity and improves few-shot accuracy, while delivering higher throughput compared with vanilla and existing recursive baselines. These gains demonstrate that MoR is an effective path towards large-model quality without incurring large-model cost.

2025-06-11

ICML.cc/2025/Workshop/ES-FoMo-III (published)

Stable Gradients for Stable Learning at Scale in Deep Reinforcement Learning

Roger Creus Castanyer

Lu Liu

Scaling deep reinforcement learning networks is challenging and often results in degraded performance, yet the root causes of this failure mode remain poorly understood. Several recent works have proposed mechanisms to address this, but they are often complex and fail to highlight the causes underlying this difficulty. In this work, we conduct a series of empirical analyses which suggest that the combination of non-stationarity with gradient pathologies, due to suboptimal architectural choices, underlies the challenges of scale. We propose a series of direct interventions that stabilize gradient flow, enabling robust performance across a range of network depths and widths. Our interventions are simple to implement and compatible with well-established algorithms, and result in an effective mechanism that enables strong performance even at large scales. We validate our findings on a variety of agents and suites of environments.

2025-06-01

arXiv (published)

arxiv.org

Learning and Controlling Silicon Dopant Transitions in Graphene using Scanning Transmission Electron Microscopy

Max Schwarzer

Jesse Farebrother

Joshua Greaves

Ekin Dogus Cubuk

Rishabh Agarwal

Marc Gendron-Bellemare

Sergei Kalinin

Igor Mordatch

Kevin M Roccapriore

We introduce a machine learning approach to determine the transition dynamics of silicon atoms on a single layer of carbon atoms, when stimulated by the electron beam of a scanning transmission electron microscope (STEM). Our method is data-centric, leveraging data collected on a STEM. The data samples are processed and filtered to produce symbolic representations, which we use to train a neural network to predict transition probabilities. These learned transition dynamics are then leveraged to guide a single silicon atom throughout the lattice to pre-determined target destinations. We present empirical analyses that demonstrate the efficacy and generality of our approach.

2025-05-20

Advanced Materials Interfaces (published)

arxiv.org

FLAM: Frame-Wise Language-Audio Modeling

Yusong Wu

Christos Tsirigotis

Ke Chen

Anna (Cheng-Zhi) Huang

Oriol Nieto

Prem Seetharaman

Justin Salamon

2025-05-01

ICML.cc/2025/Conference (poster)

Measure gradients, not activations! Enhancing neuronal activity in deep reinforcement learning

Jiashun Liu

Zihao Wu

Ling Pan

2025-05-01

arXiv (published)

arxiv.org

Mitigating Plasticity Loss in Continual Reinforcement Learning by Reducing Churn

Hongyao Tang

Glen Berseth

Plasticity, or the ability of an agent to adapt to new tasks, environments, or distributions, is crucial for continual learning. In this paper, we study the loss of plasticity in deep continual RL from the lens of churn: network output variability for out-of-batch data induced by mini-batch training. We demonstrate that (1) the loss of plasticity is accompanied by the exacerbation of churn due to the gradual rank decrease of the Neural Tangent Kernel (NTK) matrix; (2) reducing churn helps prevent rank collapse and adjusts the step size of regular RL gradients adaptively. Moreover, we introduce Continual Churn Approximated Reduction (C-CHAIN) and demonstrate it improves learning performance and outperforms baselines in a diverse range of continual learning environments on OpenAI Gym Control, ProcGen, DeepMind Control Suite, and MinAtar benchmarks.

2025-05-01

ICML.cc/2025/Conference (poster)

The Courage to Stop: Overcoming Sunk Cost Fallacy in Deep Reinforcement Learning

Jiashun Liu

Ling Pan

Off-policy deep reinforcement learning (RL) typically leverages replay buffers for reusing past experiences during learning. This can help improve sample efficiency when the collected data is informative and aligned with the learning objectives; when that is not the case, it can have the effect of "polluting" the replay buffer with data which can exacerbate optimization challenges in addition to wasting environment interactions due to wasteful sampling. We argue that sampling these uninformative and wasteful transitions can be avoided by addressing the sunk cost fallacy, which, in the context of deep RL, is the tendency towards continuing an episode until termination. To address this, we propose learn to stop (LEAST), a lightweight mechanism that enables strategic early episode termination based on Q-value and gradient statistics, which helps agents recognize when to terminate unproductive episodes early. We demonstrate that our method improves learning efficiency on a variety of RL algorithms, evaluated on both the MuJoCo and DeepMind Control Suite benchmarks.

2025-05-01

ICML.cc/2025/Conference (poster)

The Impact of On-Policy Parallelized Data Collection on Deep Reinforcement Learning Networks

Walter Mayor

The use of parallel actors for data collection has been an effective technique used in reinforcement learning (RL) algorithms. The manner in which data is collected in these algorithms, controlled via the number of parallel environments and the rollout length, induces a form of bias-variance trade-off; the number of training passes over the collected data, on the other hand, must strike a balance between sample efficiency and overfitting. We conduct an empirical analysis of these trade-offs on PPO, one of the most popular RL algorithms that uses parallel actors, and establish connections to network plasticity and, more generally, optimization stability. We examine its impact on network architectures, as well as the hyper-parameter sensitivity when scaling data. Our analyses indicate that larger dataset sizes can increase final performance across a variety of settings, and that scaling parallel environments is more effective than increasing rollout lengths. These findings highlight the critical role of data collection strategies in improving agent performance.

2025-05-01

ICML.cc/2025/Conference (poster)

VinePPO: Refining Credit Assignment in RL Training of LLMs

Amirhossein Kazemnejad

Milad Aghajohari

Large language models (LLMs) are increasingly applied to complex reasoning tasks that require executing several complex steps before receiving any reward. Properly assigning credit to these steps is essential for enhancing model performance. Proximal Policy Optimization (PPO), a common reinforcement learning (RL) algorithm used for LLM finetuning, employs value networks to tackle credit assignment. However, recent approaches achieve strong results without it, raising questions about the efficacy of value networks in practice. In this work, we systematically evaluate the efficacy of value networks and reveal their significant shortcomings in reasoning-heavy LLM tasks, showing that they often produce poor estimates of expected return and barely outperform a random baseline when comparing alternative steps. This motivates our key question: Can improved credit assignment enhance RL training for LLMs? To address this, we propose VinePPO, a straightforward approach that leverages the flexibility of language environments to compute unbiased Monte Carlo-based estimates. Our method consistently outperforms PPO and other baselines across MATH and GSM8K datasets in less wall-clock time (up to 3.0x). Crucially, it achieves higher test accuracy for a given training accuracy, capturing more generalization signal per sample. These results emphasize the importance of accurate credit assignment in RL training of LLMs.

2025-05-01

ICML.cc/2025/Conference (poster)

Advantage Alignment Algorithms

Juan Agustin Duque

Milad Aghajohari

Tim Cooijmans

2025-01-22

ICLR.cc/2025/Conference (oral presentation)

Asynchronous RLHF: Faster and More Efficient Off-Policy RL for Language Models

Shengyi Huang

The dominant paradigm for RLHF is online and on-policy RL: synchronously generating from the large language model (LLM) policy, labelling with a reward model, and learning using feedback on the LLM's own outputs. While performant, this paradigm is computationally inefficient. Inspired by classical deep RL literature, we propose separating generation and learning in RLHF. This enables asynchronous generation of new samples while simultaneously training on old samples, leading to faster training and more compute-optimal scaling. However, asynchronous training relies on an underexplored regime, online but off-policy RLHF: learning on samples from previous iterations of our model. To understand the challenges in this regime, we investigate a fundamental question: how much off-policyness can we tolerate for asynchronous training to speed up learning but maintain performance? Among several RLHF algorithms we tested, we find that online DPO is most robust to off-policy data, and robustness increases with the scale of the policy model. We study further compute optimizations for asynchronous RLHF but find that they come at a performance cost, giving rise to a trade-off. Finally, we verify the scalability of asynchronous RLHF by training LLaMA 3.1 8B on an instruction-following task 40% faster than a synchronous run while matching final performance.

2025-01-22

ICLR.cc/2025/Conference (poster)