Publications

Multi-Agent Matrix Games with Individual learners: How Exploration-Exploitation Strategies Impact the Emergence of Coordination

Julien Armand

Tommy Chien-Hsuan Lin

Maxime Heuillet

Audrey Durand

Coordination between independent learning agents in a multi-agent environment is an important problem where AI systems may impact each other… (see more)s learning process. In this paper, we study how individual agents converge to optimal equilibrium in multi-agent where coordination is necessary to achieve optimality. Specifically, we cover the case of coordination to maximize every individual payoffs and coordination to maximize the collective payoff (cooperation). We study the emergence of such coordination behaviours in two-players matrix games with unknown payoff matrices and noisy bandit feedback. We consider five different environments along with widely used deterministic and stochastic bandit strategies. We study how different learning strategies and observation noise influence convergence to the optimal equilibrium. Our results indicate that coordination often emerge more easily from interactions between deterministic agents, especially when they follow the same learning behaviour. However, stochastic learning strategies appear to be more robust in the presence of many optimal joint actions. Overall, noisy observations often help stabilizing learning behaviours.

2025-06-22

rl-conference.cc/RLC/2025/Workshop/CoCoMARL (poster)

openreview.net

Opening the Scope of Openness in AI

Tamara Paris

AJung Moon

Jin L.C. Guo

2025-06-22

Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency (published)

doi.org

arxiv.org

Relative Explanations for Contextual Problems with Endogenous Uncertainty: An Application to Competitive Facility Location

Jasone Ram'irez-Ayerbe

Emma Frejinger

2025-06-22

ArXiv (preprint)

arxiv.org

A Survey of State Representation Learning for Deep Reinforcement Learning

Ayoub Echchahed

Pablo Samuel Castro

Representation learning methods are an important tool for addressing the challenges posed by complex observations spaces in sequential decis… (see more)ion making problems. Recently, many methods have used a wide variety of types of approaches for learning meaningful state representations in reinforcement learning, allowing better sample efficiency, generalization, and performance. This survey aims to provide a broad categorization of these methods within a model-free online setting, exploring how they tackle the learning of state representations differently. We categorize the methods into six main classes, detailing their mechanisms, benefits, and limitations. Through this taxonomy, our aim is to enhance the understanding of this field and provide a guide for new researchers. We also discuss techniques for assessing the quality of representations, and detail relevant future directions.

2025-06-22

TMLR (accepted)

openreview.net

The challenge of hidden gifts in multi-agent reinforcement learning

Dane Malenfant

Blake Aaron Richards

Cooperation between people is not always obvious. Sometimes we benefit from actions that others have taken even when we are unaware that the… (see more)y took those actions. For example, if your neighbor chooses not to take a parking spot in front of your house when you are not there, you can benefit, even without being aware that they took this action. These “hidden gifts” represent an interesting challenge for multi-agent reinforcement learning (MARL), since assigning credit to your own actions correctly when the beneficial actions of others are hidden is non-trivial. Here, we study the impact of hidden gifts with a very simple MARL task. In this task, agents in a grid-world environment have individual doors to unlock in order to obtain individual rewards. As well, if all the agents unlock their door the group receives a larger collective reward. However, there is only one key for all of the doors, such that the collective reward can only be obtained when the agents drop the key for others after they use it. Notably, there is nothing to indicate to an agent that the other agents have dropped the key, thus the act of dropping the key for others is a “hidden gift”. We show that several different state-of-the-art RL algorithms, including MARL algorithms, fail to learn how to obtain the collective reward in this simple task. Interestingly, we find that independent model-free policy gradient agents can solve the task when we provide them with information about their action history, but MARL agents still cannot solve the task with action history. Finally, we derive a correction term for these independent agents, inspired by learning aware approaches, which reduces the variance in learning and helps them to converge to collective success more reliably. These results show how credit assignment in multi-agent settings can be particularly challenging in the presence of “hidden gifts”, and demonstrate that learning awareness can benefit these settings

2025-06-22

rl-conference.cc/RLC/2025/Workshop/CoCoMARL (poster)

openreview.net

Towards Sustainable Investment Policies Informed by Opponent Shaping

Juan Agustin Duque

Addressing climate change requires global coordination, yet rational economic actors often prioritize immediate gains over collective welfar… (see more)e, resulting in social dilemmas. InvestESG is a recently proposed multi-agent simulation that captures the dynamic interplay between investors and companies under climate risk. We provide a formal characterization of the conditions under which InvestESG exhibits an intertemporal social dilemma, deriving theoretical thresholds at which individual incentives diverge from collective welfare. Building on this, we apply Advantage Alignment, a scalable opponent shaping algorithm shown to be effective in general-sum games, to influence agent learning in InvestESG. We offer theoretical insights into why Advantage Alignment systematically favors socially beneficial equilibria by biasing learning dynamics toward cooperative outcomes. Our results demonstrate that strategically shaping the learning processes of economic agents can result in better outcomes that could inform policy mechanisms to better align market incentives with long-term sustainability goals.

2025-06-22

rl-conference.cc/RLC/2025/Workshop/CoCoMARL (poster)

openreview.net

Rethinking Full Finetuning from Pretraining Checkpoints in Active Learning for African Languages

Bonaventure F. P. Dossou

Ines Arous

Jackie CK Cheung

2025-06-21

aclweb.org/ACL/2025/SRW (poster)

openreview.net

Revisiting Laplacian Representations for Value Function Approximation in Deep RL

Priyesh Vijayan

Padideh Nouri

Rishav

A. Chandar

Yash Chandak

Mathieu Reymond

S Ebrahimi Kahou

Doina Precup

Proto-value functions (PVFs) introduced Laplacian embeddings as an effective feature basis for value-function approximation; however, their … (see more)utility remained limited to small, fully known state spaces. Recent work has scaled Laplacian embeddings to high-dimensional inputs, using them for reward shaping and option discovery in goal-directed tasks, yet only as auxiliary signals, rather than directly using them as features for value functions. In this paper, we learn Laplacian eigenvectors online and employ them as features for Q-learning in 23 Atari games. We empirically demonstrate that these online–learned embeddings substantially improve model-free RL in large, high-dimensional domains. We demonstrate that enriching state representations with action embeddings yields additional gains under both behavior-policy and uniform-random policies. Additionally, we introduce the Fusion architecture, which augments the representation with useful inductive bias at the embedding level. To assess the usefulness of each embedding used in the Fusion architecture, we use Shapley values analysis.

2025-06-21

rl-conference.cc/RLC/2025/Workshop/IBRL (published)

openreview.net

Training PPO-Clip with Parallelized Data Generation: A Case of Fixed-Point Convergence

Homayoun Honari

Roger Creus Castanyer

Pablo Samuel Castro

Glen Berseth

In recent years, with the increase in the compute power of GPUs, parallelized data collection has become the dominant approach for training … (see more)reinforcement learning (RL) agents. Proximal Policy Optimization (PPO) is one of the widely-used on-policy methods for training RL agents. In this paper, we focus on the training behavior of PPO-Clip with the increase in the number of parallel environments. In particular, we show that as we increase the amount of data used to train PPO-Clip, the optimized policy would converge to a fixed distribution. We use the results to study the behavior of PPO-Clip in two case studies: the effect of change in the minibatch size and the effect of increase in the number of parallel environments versus the increase in the rollout lengths. The experiments show that settings with high-return PPO runs result in slower convergence to the fixed-distribution and higher consecutive KL divergence changes. Our results aim to offer a better understanding for the prediction of the performance of PPO with the scaling of the parallel environments.

2025-06-21

rl-conference.cc/RLC/2025/Workshop/IBRL (published)

openreview.net

Neuromorphic hierarchical modular reservoirs

Filip Milisav

Andrea I Luppi

Laura E Suarez

Guillaume Lajoie

Bratislav Misic

Modularity is a fundamental principle of brain organization, reflected in the presence of segregated sub-networks that enable specialized in… (see more)formation processing. These small, densely connected modules are often nested within larger, higher-order modules, giving rise to a hierarchical modular architecture. This structure is posited to balance information segregation in specialized neuronal communities and global integration via intermodular communication. Yet, how hierarchical modularity shapes network function remains unclear. Here we introduce a simple blockmodeling framework for generating and comparing multi-level hierarchical modular networks and implement them as recurrent neural network reservoirs to evaluate their computational capacity. We show that hierarchical modular networks enhance memory capacity, support multitasking, and give rise to a broader range of temporal dynamics compared to strictly modular and random networks. These functional advantages can be traced to topological features enriched in hierarchical modular networks, which include reciprocal and cyclic network motifs. To test whether the computational advantages of hierarchical modularity subsist in empirical human brain structural connectivity patterns, we develop a novel hierarchical modularity-preserving network null model, allowing us to isolate the positive effect of empirical hierarchical modularity patterns on memory capacity. To evaluate the biomimetic validity of connectome-informed reservoir dynamics, we compare reservoir timescales to empirical brain timescales derived from MEG data and find that hierarchical modularity contributes to shaping brain-like neural timescales. Altogether, across multiple benchmarks, these results show that hierarchical modularity endows networks with computationally advantageous properties, providing insight into the relationship between neural network structure and function with potential applications for the design of neuromorphic computing architectures.

2025-06-20

bioRxiv (preprint)

doi.org

Behavioral Suite Analysis of Self-Supervised Learning in Atari

Somjit Nath

Rishav

Gopeshh Subbaraj

D. Nowrouzezahrai

S Ebrahimi Kahou

2025-06-19

rl-conference.cc/RLC/2025/Workshop/RLVG (accepted)

openreview.net

A deep generative model for deciphering cellular dynamics and in silico drug discovery in complex diseases

Yumin Zheng

Jonas C. Schupp

Taylor Adams

Geremy Clair

Aurelien Justet

Farida Ahangari

Xiting Yan

Paul Hansen

Marianne Carlon

Emanuela Cortesi

Marie Vermant

Robin Vos

Laurens J. De Sadeleer

Ivan O. Rosas

Ricardo Pineda

John Sembrat

Melanie Königshoff

John E. McDonough

Bart M. Vanaudenaerde

Wim A. Wuyts … (see 2 more)

Naftali Kaminski

Jun Ding

Human diseases are characterized by intricate cellular dynamics. Single-cell transcriptomics provides critical insights, yet a persistent ga… (see more)p remains in computational tools for detailed disease progression analysis and targeted in silico drug interventions. Here we introduce UNAGI, a deep generative neural network tailored to analyse time-series single-cell transcriptomic data. This tool captures the complex cellular dynamics underlying disease progression, enhancing drug perturbation modelling and screening. When applied to a dataset from patients with idiopathic pulmonary fibrosis, UNAGI learns disease-informed cell embeddings that sharpen our understanding of disease progression, leading to the identification of potential therapeutic drug candidates. Validation using proteomics reveals the accuracy of UNAGI’s cellular dynamics analysis, and the use of the fibrotic cocktail-treated human precision-cut lung slices confirms UNAGI’s predictions that nifedipine, an antihypertensive drug, may have anti-fibrotic effects on human tissues. UNAGI’s versatility extends to other diseases, including COVID, demonstrating adaptability and confirming its broader applicability in decoding complex cellular dynamics beyond idiopathic pulmonary fibrosis, amplifying its use in the quest for therapeutic solutions across diverse pathological landscapes.

2025-06-19

Nature Biomedical Engineering (published)

doi.org

TRAIL: Responsible AI for Professionals and Leaders

Mila Ventures Founder in Residence

AI Advantage: Productivity in Public Service

Publications

TRAIL: Responsible AI for Professionals and Leaders

Mila Ventures Founder in Residence

AI Advantage: Productivity in Public Service

Popular keywords:

Publications