Publications

A Theoretical Justification for Asymmetric Actor-Critic Algorithms

Damien Ernst

In reinforcement learning for partially observable environments, many successful algorithms have been developed within the asymmetric learni… (voir plus)ng paradigm. This paradigm leverages additional state information available at training time for faster learning. Although the proposed learning objectives are usually theoretically sound, these methods still lack a precise theoretical justification for their potential benefits. We propose such a justification for asymmetric actor-critic algorithms with linear function approximators by adapting a finite-time convergence analysis to this setting. The resulting finite-time bound reveals that the asymmetric critic eliminates error terms arising from aliasing in the agent state.

2025-10-05

Proceedings of the 42nd International Conference on Machine Learning (publié)

doi.org

proceedings.mlr.press

Towards a Formal Theory of Representational Compositionality

2025-10-05

Proceedings of the 42nd International Conference on Machine Learning (publié)

proceedings.mlr.press

Towards a Mechanistic Explanation of Diffusion Model Generalization

Matthew Niedoba

Berend Zwartsenberg

Kevin Patrick Murphy

Frank N. Wood

We propose a mechanism for diffusion generalization based on local denoising operations. Through analysis of network and empirical denoisers… (voir plus), we identify local inductive biases in diffusion models. We demonstrate that local denoising operations can be used to approximate the optimal diffusion denoiser. Using a collection of patch-based, local empirical denoisers, we construct a denoiser which approximates the generalization behaviour of diffusion model denoisers over forward and reverse diffusion processes.

2025-10-05

Proceedings of the 42nd International Conference on Machine Learning (publié)

doi.org

proceedings.mlr.press

UI-Vision: A Desktop-centric GUI Benchmark for Visual Perception and Interaction

Shravan Nayak

Xiangru Jian

Kevin Qinghong Lin

Juan A. Rodriguez

Montek Kalsi

M. Tamer Özsu

Christopher Pal

Sai Rajeswar

Human Annotator

Autonomous agents that navigate Graphical User Interfaces (GUIs) to automate tasks like document editing and file management can greatly enh… (voir plus)ance computer workflows. While existing research focuses on online settings, desktop environments, critical for many professional and everyday tasks, remain underexplored due to data collection challenges and licensing issues. We introduce UI-Vision, the first comprehensive, license-permissive benchmark for offline, fine-grained evaluation of computer use agents in real-world desktop environments. Unlike online benchmarks, UI-Vision provides: (i) dense, high-quality annotations of human demonstrations, including bounding boxes, UI labels, and action trajectories (clicks, drags, and keyboard inputs) across 83 software applications, and (ii) three fine-to-coarse grained tasks—Element Grounding, Layout Grounding, and Action Prediction—with well-defined metrics to rigorously evaluate agents’ performance in desktop environments. Our evaluation reveals critical limitations in state-of-the-art models like UI-TARS-72B, including issues with understanding professional software, spatial reasoning, and complex actions like drag-and-drop. These findings highlight the challenges in developing fully autonomous computer-use agents. With UI-Vision, we aim to advance the development of more capable agents for real-world desktop tasks.

2025-10-05

Proceedings of the 42nd International Conference on Machine Learning (publié)

doi.org

proceedings.mlr.press

When to retrain a machine learning model

Florence Regol

Leo Schwinn

Kyle Sprague

Mark J. Coates

Thomas Markovich

A significant challenge in maintaining real-world machine learning models is responding to the continuous and unpredictable evolution of dat… (voir plus)a. Most practitioners are faced with the difficult question: when should I retrain or update my machine learning model? This seemingly straightforward problem is particularly challenging for three reasons: 1) decisions must be made based on very limited information - we usually have access to only a few examples, 2) the nature, extent, and impact of the distribution shift are unknown, and 3) it involves specifying a cost ratio between retraining and poor performance, which can be hard to characterize. Existing works address certain aspects of this problem, but none offer a comprehensive solution. Distribution shift detection falls short as it cannot account for the cost trade-off; the scarcity of the data, paired with its unusual structure, makes it a poor fit for existing offline reinforcement learning methods, and the online learning formulation overlooks key practical considerations. To address this, we present a principled formulation of the retraining problem and propose an uncertainty-based method that makes decisions by continually forecasting the evolution of model performance evaluated with a bounded metric. Our experiments, addressing classification tasks, show that the method consistently outperforms existing baselines on 7 datasets. We thoroughly assess its robustness to varying cost trade-off values and mis-specified cost trade-offs.

2025-10-05

Proceedings of the 42nd International Conference on Machine Learning (publié)

proceedings.mlr.press

Democratizing Game Modding with GenAI: A Case Study of StarCharM, a Stardew Valley Character Maker

Hamid Zand Miralvand

Mohammad Ronagh Nikghalb

Mohammad Darandeh

Abidullah Khan

Ian Arawjo

Jinghui Cheng

Game modding offers unique and personalized gaming experiences, but the technical complexity of creating mods often limits participation to … (voir plus)skilled users. We envision a future where every player can create personalized mods for their games. To explore this space, we designed StarCharM, a GenAI-based non-player character (NPC) creator for Stardew Valley. Our tool enables players to iteratively create new NPC mods, requiring minimal user input while allowing for fine-grained adjustments through user control. We conducted a user study with ten Stardew Valley players who had varied mod usage experiences to understand the impacts of StarCharM and provide insights into how GenAI tools may reshape modding, particularly in NPC creation. Participants expressed excitement in bringing their character ideas to life, although they noted challenges in generating rich content to fulfill complex visions. While they believed GenAI tools like StarCharM can foster a more diverse modding community, some voiced concerns about diminished originality and community engagement that may come with such technology. Our findings provided implications and guidelines for the future of GenAI-powered modding tools and co-creative modding practices.

2025-10-04

Proceedings of the ACM on Human-Computer Interaction (publié)

doi.org

arxiv.org

Refactoring with LLMs: Bridging Human Expertise and Machine Understanding

Yonnel Chen Kuang Piao

Jean Carlors Paul

Leuson Da Silva

Arghavan Moradi Dakhel

Mohammad Hamdaqa

Foutse Khomh

2025-10-03

ArXiv (prépublication)

doi.org

arxiv.org

Capacity Planning in Stable Matching

Federico Bobbio

Margarida Carvalho

Andrea Lodi

Ignacio Rios

Alfredo Torrico

We introduce the problem of jointly increasing school capacities and finding a student-optimal assignment in the expanded market. Due to the… (voir plus) impossibility of efficiently solving the problem with classical methods, we generalize existent mathematical programming formulations of stability constraints to our setting, most of which result in integer quadratically-constrained programs. In addition, we propose a novel mixed-integer linear programming formulation that is exponentially large on the problem size. We show that its stability constraints can be separated by exploiting the objective function, leading to an effective cutting-plane algorithm. We conclude the theoretical analysis of the problem by discussing some mechanism properties. On the computational side, we evaluate the performance of our approaches in a detailed study, and we find that our cutting-plane method outperforms our generalization of existing mixed-integer approaches. We also propose two heuristics that are effective for large instances of the problem. Finally, we use the Chilean school choice system data to demonstrate the impact of capacity planning under stability conditions. Our results show that each additional seat can benefit multiple students and that we can effectively target the assignment of previously unassigned students or improve the assignment of several students through improvement chains. These insights empower the decision-maker in tuning the matching algorithm to provide a fair application-oriented solution.

2025-10-02

Operations Research (publié)

doi.org

arxiv.org

Asymmetric Proximal Policy Optimization: mini-critics boost LLM reasoning

Jiashun Liu

Johan Samir Obando Ceron

Han Lu

Yancheng He

Weixun Wang

Wenbo Su

Bo Zheng

Pablo Samuel Castro

Aaron Courville

Ling Pan

Most recent RL for LLMs (RL4LLM) methods avoid explicit critics, replacing them with average advantage baselines. This shift is largely prag… (voir plus)matic: conventional value functions are computationally expensive to train at LLM scale and often fail under sparse rewards and long reasoning horizons. We revisit this bottleneck from an architectural perspective and introduce Asymmetric Proximal Policy Optimization (AsyPPO), a simple and scalable framework that restores the critics role while remaining efficient in large-model settings. AsyPPO employs a set of lightweight mini-critics, each trained on disjoint prompt shards. This design encourages diversity while preserving calibration, reducing value-estimation bias. Beyond robust estimation, AsyPPO leverages inter-critic uncertainty to refine the policy update: (i) masking advantages in states where critics agree and gradients add little learning signal, and (ii) filtering high-divergence states from entropy regularization, suppressing spurious exploration. After training on open-source data with only 5,000 samples, AsyPPO consistently improves learning stability and performance across multiple benchmarks over strong baselines, such as GRPO, achieving performance gains of more than six percent on Qwen3-4b-Base and about three percent on Qwen3-8b-Base and Qwen3-14b-Base over classic PPO, without additional tricks. These results highlight the importance of architectural innovations for scalable, efficient algorithms.

2025-10-01

ArXiv (prépublication)

doi.org

arxiv.org

Genetic contribution to asthma informs acute chest syndrome pathophysiology and risk stratification

Sara El Aouhel

Vanessa Bellegarde

Stennio Da

Silva Faria

Tristan St-Laurent

Estelle Lecluze

Anne-Laure Pham Hung d’Alexandry d’Orengiani

F. Galactéros

Pablo Bartolucci

Marc-André Legault

Guillaume Lettre

Thomas Pincez

2025-10-01

medRxiv (prépublication)

doi.org

GRACE: A Language Model Framework for Explainable Inverse Reinforcement Learning

Silvia Sapora

R Devon Hjelm

Alexander T Toshev

Omar Attia

Bogdan Mazoure

Inverse Reinforcement Learning aims to recover reward models from expert demonstrations, but traditional methods yield"black-box"models that… (voir plus) are difficult to interpret and debug. In this work, we introduce GRACE (Generating Rewards As CodE), a method for using Large Language Models within an evolutionary search to reverse-engineer an interpretable, code-based reward function directly from expert trajectories. The resulting reward function is executable code that can be inspected and verified. We empirically validate GRACE on the BabyAI and AndroidWorld benchmarks, where it efficiently learns highly accurate rewards, even in complex, multi-task settings. Further, we demonstrate that the resulting reward leads to strong policies, compared to both competitive Imitation Learning and online RL approaches with ground-truth rewards. Finally, we show that GRACE is able to build complex reward APIs in multi-task setups.

2025-10-01

ArXiv (prépublication)

doi.org

arxiv.org

Attention-Based Multi-Agent RL for Multi-Machine Tending Using Mobile Robots

Abdalwhab Bakheet Mohamed Abdalwhab

Giovanni Beltrame

Samira Ebrahimi Kahou

David St-Onge

Robotics can help address the growing worker shortage challenge of the manufacturing industry. As such, machine tending is a task collaborat… (voir plus)ive robots can tackle that can also greatly boost productivity. Nevertheless, existing robotics systems deployed in that sector rely on a fixed single-arm setup, whereas mobile robots can provide more flexibility and scalability. We introduce a multi-agent multi-machine-tending learning framework using mobile robots based on multi-agent reinforcement learning (MARL) techniques, with the design of a suitable observation and reward. Moreover, we integrate an attention-based encoding mechanism into the Multi-Agent Proximal Policy Optimization (MAPPO) algorithm to boost its performance for machine-tending scenarios. Our model (AB-MAPPO) outperforms MAPPO in this new challenging scenario in terms of task success, safety, and resource utilization. Furthermore, we provided an extensive ablation study to support our design decisions.

2025-09-30

AI (publié)

doi.org

Désinformation 2.0 : quand l’IA brouille nos ondes

Publications du Fellowship en politiques de l'IA

Mila sur Udemy

Publications

Désinformation 2.0 : quand l’IA brouille nos ondes

Publications du Fellowship en politiques de l'IA

Mila sur Udemy

Mots-clés populaires:

Publications