Publications

A Theoretical Justification for Asymmetric Actor-Critic Algorithms

Damien Ernst

In reinforcement learning for partially observable environments, many successful algorithms have been developed within the asymmetric learning paradigm. This paradigm leverages additional state information available at training time for faster learning. Although the proposed learning objectives are usually theoretically sound, these methods still lack a precise theoretical justification for their potential benefits. We propose such a justification for asymmetric actor-critic algorithms with linear function approximators by adapting a finite-time convergence analysis to this setting. The resulting finite-time bound reveals that the asymmetric critic eliminates error terms arising from aliasing in the agent state.

2025-10-05

Proceedings of the 42nd International Conference on Machine Learning (published)

doi.org

proceedings.mlr.press

Towards a Formal Theory of Representational Compositionality

2025-10-05

Proceedings of the 42nd International Conference on Machine Learning (published)

proceedings.mlr.press

Towards a Mechanistic Explanation of Diffusion Model Generalization

Matthew Niedoba

Berend Zwartsenberg

Kevin Patrick Murphy

Frank N. Wood

We propose a mechanism for diffusion generalization based on local denoising operations. Through analysis of network and empirical denoisers, we identify local inductive biases in diffusion models. We demonstrate that local denoising operations can be used to approximate the optimal diffusion denoiser. Using a collection of patch-based, local empirical denoisers, we construct a denoiser which approximates the generalization behaviour of diffusion model denoisers over forward and reverse diffusion processes.

2025-10-05

Proceedings of the 42nd International Conference on Machine Learning (published)

doi.org

proceedings.mlr.press

When to retrain a machine learning model

Florence Regol

Leo Schwinn

Kyle Sprague

Mark J. Coates

Thomas Markovich

A significant challenge in maintaining real-world machine learning models is responding to the continuous and unpredictable evolution of data. Most practitioners are faced with the difficult question: when should I retrain or update my machine learning model? This seemingly straightforward problem is particularly challenging for three reasons: 1) decisions must be made based on very limited information - we usually have access to only a few examples, 2) the nature, extent, and impact of the distribution shift are unknown, and 3) it involves specifying a cost ratio between retraining and poor performance, which can be hard to characterize. Existing works address certain aspects of this problem, but none offer a comprehensive solution. Distribution shift detection falls short as it cannot account for the cost trade-off; the scarcity of the data, paired with its unusual structure, makes it a poor fit for existing offline reinforcement learning methods, and the online learning formulation overlooks key practical considerations. To address this, we present a principled formulation of the retraining problem and propose an uncertainty-based method that makes decisions by continually forecasting the evolution of model performance evaluated with a bounded metric. Our experiments, addressing classification tasks, show that the method consistently outperforms existing baselines on 7 datasets. We thoroughly assess its robustness to varying cost trade-off values and mis-specified cost trade-offs.

2025-10-05

Proceedings of the 42nd International Conference on Machine Learning (published)

proceedings.mlr.press

Democratizing Game Modding with GenAI: A Case Study of StarCharM, a Stardew Valley Character Maker

Hamid Zand Miralvand

Mohammad Ronagh Nikghalb

Mohammad Darandeh

Abidullah Khan

Ian Arawjo

Jinghui Cheng

Game modding offers unique and personalized gaming experiences, but the technical complexity of creating mods often limits participation to skilled users. We envision a future where every player can create personalized mods for their games. To explore this space, we designed StarCharM, a GenAI-based non-player character (NPC) creator for Stardew Valley. Our tool enables players to iteratively create new NPC mods, requiring minimal user input while allowing for fine-grained adjustments through user control. We conducted a user study with ten Stardew Valley players who had varied mod usage experiences to understand the impacts of StarCharM and provide insights into how GenAI tools may reshape modding, particularly in NPC creation. Participants expressed excitement in bringing their character ideas to life, although they noted challenges in generating rich content to fulfill complex visions. While they believed GenAI tools like StarCharM can foster a more diverse modding community, some voiced concerns about diminished originality and community engagement that may come with such technology. Our findings provided implications and guidelines for the future of GenAI-powered modding tools and co-creative modding practices.

2025-10-04

Proceedings of the ACM on Human-Computer Interaction (published)

doi.org

arxiv.org

Just-in-time Episodic Feedback Hinter: Leveraging Offline Knowledge to Improve LLM Agents Adaptation

Hadi Nekoei

A. Jaiswal

Patrice Béchard

Oleh Shliazhko

Orlando Marquez Ayala

Mathieu Reymond

Massimo Caccia

Alexandre Drouin

A. Chandar

Alexandre Lacoste

2025-10-04

ArXiv (preprint)

doi.org

arxiv.org

Refactoring with LLMs: Bridging Human Expertise and Machine Understanding

Yonnel Chen Kuang Piao

Jean Carlors Paul

Leuson Da Silva

Arghavan Moradi Dakhel

Mohammad Hamdaqa

Foutse Khomh

2025-10-03

ArXiv (preprint)

doi.org

arxiv.org

Capacity Planning in Stable Matching

Federico Bobbio

Margarida Carvalho

Andrea Lodi

Ignacio Rios

Alfredo Torrico

We introduce the problem of jointly increasing school capacities and finding a student-optimal assignment in the expanded market. Due to the impossibility of efficiently solving the problem with classical methods, we generalize existent mathematical programming formulations of stability constraints to our setting, most of which result in integer quadratically-constrained programs. In addition, we propose a novel mixed-integer linear programming formulation that is exponentially large on the problem size. We show that its stability constraints can be separated by exploiting the objective function, leading to an effective cutting-plane algorithm. We conclude the theoretical analysis of the problem by discussing some mechanism properties. On the computational side, we evaluate the performance of our approaches in a detailed study, and we find that our cutting-plane method outperforms our generalization of existing mixed-integer approaches. We also propose two heuristics that are effective for large instances of the problem. Finally, we use the Chilean school choice system data to demonstrate the impact of capacity planning under stability conditions. Our results show that each additional seat can benefit multiple students and that we can effectively target the assignment of previously unassigned students or improve the assignment of several students through improvement chains. These insights empower the decision-maker in tuning the matching algorithm to provide a fair application-oriented solution.

2025-10-02

Operations Research (published)

doi.org

arxiv.org

Improving GUI Grounding with Explicit Position-to-Coordinate Mapping

Christopher Pal

2025-10-02

ArXiv (preprint)

doi.org

arxiv.org

Malice in Agentland: Down the Rabbit Hole of Backdoors in the AI Supply Chain

Léo Boisvert

Abhay Puri

Chandra Kiran Reddy Evuru

Nazanin Sepahvand

Nicolas Chapados

Quentin Cappart

Alexandre Lacoste

Krishnamurthy (DJ) Dvijotham

Alexandre Drouin

The practice of fine-tuning AI agents on data from their own interactions, such as web browsing or tool use, is a strong general recipe for improving agentic capabilities, but it also introduces a critical security vulnerability within the AI supply chain. In this work, we show that adversaries can easily poison the data collection pipeline to embed hard-to-detect backdoors that are triggered by specific target phrases, such that when the agent encounters these triggers, it performs an unsafe or malicious action. We formalize and validate three realistic threat models targeting different layers of the supply chain: 1) direct poisoning of fine-tuning data, where an attacker controls a fraction of the training traces; 2) environmental poisoning, where malicious instructions are injected into webpages scraped or tools called while creating training data; and 3) supply chain poisoning, where a pre-backdoored base model is fine-tuned on clean data to improve its agentic capabilities. Our results are stark: by poisoning as few as 2% of the collected traces, an attacker can embed a backdoor causing an agent to leak confidential user information with over 80% success when a specific trigger is present. This vulnerability holds across all three threat models. Furthermore, we demonstrate that prominent safeguards, including two guardrail models and one weight-based defense, fail to detect or prevent the malicious behavior. These findings highlight an urgent threat to agentic AI development and underscore the critical need for rigorous security vetting of data collection processes and end-to-end model supply chains.

2025-10-02

arXiv (preprint)

doi.org

arxiv.org

Asymmetric Proximal Policy Optimization: mini-critics boost LLM reasoning

Jiashun Liu

Johan Samir Obando Ceron

Han Lu

Yancheng He

Weixun Wang

Wenbo Su

Bo Zheng

Pablo Samuel Castro

Aaron Courville

Ling Pan

Most recent RL for LLMs (RL4LLM) methods avoid explicit critics, replacing them with average advantage baselines. This shift is largely pragmatic: conventional value functions are computationally expensive to train at LLM scale and often fail under sparse rewards and long reasoning horizons. We revisit this bottleneck from an architectural perspective and introduce Asymmetric Proximal Policy Optimization (AsyPPO), a simple and scalable framework that restores the critic's role while remaining efficient in large-model settings. AsyPPO employs a set of lightweight mini-critics, each trained on disjoint prompt shards. This design encourages diversity while preserving calibration, reducing value-estimation bias. Beyond robust estimation, AsyPPO leverages inter-critic uncertainty to refine the policy update: (i) masking advantages in states where critics agree and gradients add little learning signal, and (ii) filtering high-divergence states from entropy regularization, suppressing spurious exploration. After training on open-source data with only 5,000 samples, AsyPPO consistently improves learning stability and performance across multiple benchmarks over strong baselines, such as GRPO, achieving performance gains of more than six percent on Qwen3-4b-Base and about three percent on Qwen3-8b-Base and Qwen3-14b-Base over classic PPO, without additional tricks. These results highlight the importance of architectural innovations for scalable, efficient algorithms.

2025-10-01

ArXiv (preprint)

doi.org

arxiv.org

Genetic contribution to asthma informs acute chest syndrome pathophysiology and risk stratification

Sara El Aouhel

Vanessa Bellegarde

Stennio Da Silva Faria

Tristan St-Laurent

Estelle Lecluze

Anne-Laure Pham Hung d’Alexandry d’Orengiani

F. Galactéros

Pablo Bartolucci

Marc-André Legault

Guillaume Lettre

Thomas Pincez

2025-10-01

medRxiv (preprint)

doi.org
