Aaron Courville

Anirudh Buvanesh

Doctorat - UdeM

Superviseur⋅e principal⋅e :

Laurent Charlin

Razvan Ciuca

Maîtrise recherche - Université de Montréal

Juan Duque

Doctorat - UdeM

Doctorat - UdeM

Doctorat - UdeM

Uday Kapur

Doctorat - UdeM

Amr Khalifa

Doctorat - UdeM

Samuel Lavoie

Doctorat - UdeM

Zhixuan Lin

Doctorat - UdeM

Google Scholar

Ahmed Masry

Collaborateur·rice de recherche - N/A

Google Scholar

Alan Milligan

Doctorat - UdeM

Superviseur⋅e principal⋅e :

Doctorat - UdeM

Superviseur⋅e principal⋅e :

Doctorat - UdeM

Doctorat - UdeM

Co-superviseur⋅e :

Doctorat - UdeM

Doctorat - UdeM

Doctorat - UdeM

Co-superviseur⋅e :

Collaborateur·rice de recherche - UdeM

Dereck Piché

Maîtrise recherche - UdeM

Khaled Rouissi

Maîtrise recherche - UdeM

Esra'a Saleh

Doctorat - UdeM

Superviseur⋅e principal⋅e :

Glen Berseth

Google Scholar

Vedant Shah

Doctorat - UdeM

Doctorat - UdeM

Yusong Wu

Doctorat - UdeM

Superviseur⋅e principal⋅e :

Anna (Cheng-Zhi) Huang

Sujin yun

Doctorat - UdeM

Doctorat - UdeM

Doctorat - UdeM

Co-superviseur⋅e :

Yoshua Bengio

Hattie Zhou

Doctorat - UdeM

Superviseur⋅e principal⋅e :

Hugo Larochelle

Publications

The Intricate Dance of Prompt Complexity, Quality, Diversity, and Consistency in T2I Models

Xiaofeng Zhang

Michal Drozdzal

Adriana Romero Soriano

2025-10-22

ArXiv (prépublication)

Simplicial Embeddings Improve Sample Efficiency in Actor-Critic Agents

Samuel Lavoie

Recent works have proposed accelerating the wall-clock training time of actor-critic methods via the use of large-scale environment parallel… (voir plus)ization; unfortunately, these can sometimes still require large number of environment interactions to achieve a desired level of performance. Noting that well-structured representations can improve the generalization and sample efficiency of deep reinforcement learning (RL) agents, we propose the use of simplicial embeddings: lightweight representation layers that constrain embeddings to simplicial structures. This geometric inductive bias results in sparse and discrete features that stabilize critic bootstrapping and strengthen policy gradients. When applied to FastTD3, FastSAC, and PPO, simplicial embeddings consistently improve sample efficiency and final performance across a variety of continuous- and discrete-control environments, without any loss in runtime speed.

2025-10-15

ArXiv (prépublication)

Simplicial Embeddings Improve Sample Efficiency in Actor-Critic Agents

Samuel Lavoie

2025-10-15

ArXiv (prépublication)

The Markovian Thinker

Milad Aghajohari

Kamran Chitsaz

Amirhossein Kazemnejad

Reinforcement learning (RL) has recently become a strong recipe for training reasoning LLMs that produce long chains of thought (LongCoT). Y… (voir plus)et the standard RL"thinking environment", where the state is the prompt plus all prior reasoning tokens, makes the state unbounded and forces attention-based policies to pay quadratic compute as thoughts lengthen. We revisit the environment itself. We propose Markovian Thinking, a paradigm in which the policy advances reasoning while conditioning on a constant-size state, decoupling thinking length from context size. As an immediate consequence this yields linear compute with constant memory. We instantiate this idea with Delethink, an RL environment that structures reasoning into fixed-size chunks. Within each chunk, the model thinks as usual; at the boundary, the environment resets the context and reinitializes the prompt with a short carryover. Through RL, the policy learns to write a textual state near the end of each chunk sufficient for seamless continuation of reasoning after reset. Trained in this environment, an R1-Distill 1.5B model reasons in 8K-token chunks yet thinks up to 24K tokens, matching or surpassing LongCoT-RL trained with a 24K budget. With test-time scaling, Delethink continues to improve where LongCoT plateaus. The effect of linear compute is substantial: we empirically estimate at 96K average thinking length LongCoT-RL costs 27 H100-months vs. 7 for Delethink. Analysis at RL initialization shows off-the-shelf reasoning models (1.5B-120B) often sample Markovian traces zero-shot across diverse benchmarks, providing positive samples that make RL effective at scale. Our results show that redesigning the thinking environment is a powerful lever: it enables very long reasoning without quadratic overhead and opens a path toward efficient, scalable reasoning LLMs.

2025-10-08

ArXiv (prépublication)

The Markovian Thinker

Milad Aghajohari

Kamran Chitsaz

Amirhossein Kazemnejad

2025-10-08

ArXiv (prépublication)

FLAM: Frame-Wise Language-Audio Modeling

Yusong Wu

Christos Tsirigotis

Ke Chen

Anna (Cheng-Zhi) Huang

Oriol Nieto

Prem Seetharaman

Justin Salamon

Recent multi-modal audio-language models (ALMs) excel at text-audio retrieval but struggle with frame-wise audio understanding. Prior works … (voir plus)use temporal-aware labels or unsupervised training to improve frame-wise capabilities, but they still lack fine-grained labeling capability to pinpoint when an event occurs. While traditional sound event detection models can precisely localize events, they are limited to pre-defined categories, making them ineffective for real-world scenarios with out-of-distribution events. In this work, we introduce FLAM, an open-vocabulary contrastive audio-language model capable of localizing specific sound events. FLAM employs a memory-efficient and calibrated frame-wise objective with logit adjustment to address spurious correlations, such as event dependencies and label imbalances during training. To enable frame-wise supervision, we leverage a large-scale dataset with diverse audio events, LLM-generated captions and simulation. Experimental results and case studies demonstrate that FLAM significantly improves the open-vocabulary localization capability while maintaining strong performance in global retrieval and downstream tasks.

2025-10-06

Proceedings of the 42nd International Conference on Machine Learning (publié)

Mitigating Plasticity Loss in Continual Reinforcement Learning by Reducing Churn

Hongyao Tang

Glen Berseth

Plasticity, or the ability of an agent to adapt to new tasks, environments, or distributions, is crucial for continual learning. In this pap… (voir plus)er, we study the loss of plasticity in deep continual RL from the lens of churn: network output variability for out-of-batch data induced by mini-batch training. We demonstrate that (1) the loss of plasticity is accompanied by the exacerbation of churn due to the gradual rank decrease of the Neural Tangent Kernel (NTK) matrix; (2) reducing churn helps prevent rank collapse and adjusts the step size of regular RL gradients adaptively. Moreover, we introduce Continual Churn Approximated Reduction (C-CHAIN) and demonstrate it improves learning performance and outperforms baselines in a diverse range of continual learning environments on OpenAI Gym Control, ProcGen, DeepMind Control Suite, and MinAtar benchmarks.

2025-10-06

Proceedings of the 42nd International Conference on Machine Learning (publié)

doi.org

openreview.net

Mitigating Plasticity Loss in Continual Reinforcement Learning by Reducing Churn

Hongyao Tang

Glen Berseth

Plasticity, or the ability of an agent to adapt to new tasks, environments, or distributions, is crucial for continual learning. In this pap… (voir plus)er, we study the loss of plasticity in deep continual RL from the lens of churn: network output variability induced by the data in each training batch. We demonstrate that (1) the loss of plasticity is accompanied by the exacerbation of churn due to the gradual rank decrease of the Neural Tangent Kernel (NTK) matrix; (2) reducing churn helps prevent rank collapse and adjusts the step size of regular RL gradients adaptively. Moreover, we introduce Continual Churn Approximated Reduction (C-CHAIN) and demonstrate it improves learning performance and outperforms baselines in a diverse range of continual learning environments on OpenAI Gym Control, ProcGen, DeepMind Control Suite, and MinAtar benchmarks.

2025-10-06

Proceedings of the 42nd International Conference on Machine Learning (publié)

The Courage to Stop: Overcoming Sunk Cost Fallacy in Deep Reinforcement Learning

Jiashun Liu

Ling Pan

Off-policy deep reinforcement learning (RL) agents typically leverage replay buffers for reusing past experiences during learning. This can … (voir plus)help sample efficiency when the collected data is informative and aligned with the learning objectives; when that is not the case, it has the effect of “polluting” the replay buffer with data that can exacerbate optimization challenges in addition to wasting environment interactions due to redundant sampling. We argue that sampling these uninformative and wasteful transitions can be avoided by addressing the sunk cost fallacy which, in the context of deep RL, is the tendency towards continuing an episode until termination. To address this, we propose the learn to stop (LEAST) mechanism which uses statistics based on

2025-10-06

Proceedings of the 42nd International Conference on Machine Learning (publié)

The Courage to Stop: Overcoming Sunk Cost Fallacy in Deep Reinforcement Learning

Jiashun Liu

Ling Pan

Off-policy deep reinforcement learning (RL) typically leverages replay buffers for reusing past experiences during learning. This can help i… (voir plus)mprove sample efficiency when the collected data is informative and aligned with the learning objectives; when that is not the case, it can have the effect of"polluting"the replay buffer with data which can exacerbate optimization challenges in addition to wasting environment interactions due to wasteful sampling. We argue that sampling these uninformative and wasteful transitions can be avoided by addressing the sunk cost fallacy, which, in the context of deep RL, is the tendency towards continuing an episode until termination. To address this, we propose learn to stop (LEAST), a lightweight mechanism that enables strategic early episode termination based on Q-value and gradient statistics, which helps agents recognize when to terminate unproductive episodes early. We demonstrate that our method improves learning efficiency on a variety of RL algorithms, evaluated on both the MuJoCo and DeepMind Control Suite benchmarks.

2025-10-06

Proceedings of the 42nd International Conference on Machine Learning (publié)

doi.org

openreview.net

The Impact of On-Policy Parallelized Data Collection on Deep Reinforcement Learning Networks

Walter Mayor

The use of parallel actors for data collection has been an effective technique used in reinforcement learning (RL) algorithms. The manner in… (voir plus) which data is collected in these algorithms, controlled via the number of parallel environments and the rollout length, induces a form of bias-variance trade-off; the number of training passes over the collected data, on the other hand, must strike a balance between sample efficiency and overfitting. We conduct an empirical analysis of these trade-offs on PPO, one of the most popular RL algorithms that uses parallel actors, and establish connections to network plasticity and, more generally, optimization stability. We examine its impact on network architectures, as well as the hyper-parameter sensitivity when scaling data. Our analyses indicate that larger dataset sizes can increase final performance across a variety of settings, and that scaling parallel environments is more effective than increasing rollout lengths. These findings highlight the critical role of data collection strategies in improving agent performance.

2025-10-06

Proceedings of the 42nd International Conference on Machine Learning (publié)

The Impact of On-Policy Parallelized Data Collection on Deep Reinforcement Learning Networks

Walter Mayor