Publications

Revisiting Laplacian Representations for Value Function Approximation in Deep RL

Priyesh Vijayan

Padideh Nouri

Rishav

A. Chandar

Yash Chandak

Mathieu Reymond

S Ebrahimi Kahou

Doina Precup

Proto-value functions (PVFs) introduced Laplacian embeddings as an effective feature basis for value-function approximation; however, their utility remained limited to small, fully known state spaces. Recent work has scaled Laplacian embeddings to high-dimensional inputs, using them for reward shaping and option discovery in goal-directed tasks, yet only as auxiliary signals, rather than directly using them as features for value functions. In this paper, we learn Laplacian eigenvectors online and employ them as features for Q-learning in 23 Atari games. We empirically demonstrate that these online-learned embeddings substantially improve model-free RL in large, high-dimensional domains. We demonstrate that enriching state representations with action embeddings yields additional gains under both behavior-policy and uniform-random policies. Additionally, we introduce the Fusion architecture, which augments the representation with useful inductive bias at the embedding level. To assess the usefulness of each embedding used in the Fusion architecture, we use Shapley values analysis.

2025-06-21

rl-conference.cc/RLC/2025/Workshop/IBRL (published)

openreview.net

Training PPO-Clip with Parallelized Data Generation: A Case of Fixed-Point Convergence

Homayoun Honari

Roger Creus Castanyer

Pablo Samuel Castro

Glen Berseth

In recent years, with the increase in the compute power of GPUs, parallelized data collection has become the dominant approach for training reinforcement learning (RL) agents. Proximal Policy Optimization (PPO) is one of the widely used on-policy methods for training RL agents. In this paper, we focus on the training behavior of PPO-Clip with the increase in the number of parallel environments. In particular, we show that as we increase the amount of data used to train PPO-Clip, the optimized policy would converge to a fixed distribution. We use the results to study the behavior of PPO-Clip in two case studies: the effect of change in the minibatch size and the effect of increase in the number of parallel environments versus the increase in the rollout lengths. The experiments show that settings with high-return PPO runs result in slower convergence to the fixed-distribution and higher consecutive KL divergence changes. Our results aim to offer a better understanding for the prediction of the performance of PPO with the scaling of the parallel environments.

2025-06-21

rl-conference.cc/RLC/2025/Workshop/IBRL (published)

openreview.net

Neuromorphic hierarchical modular reservoirs

Filip Milisav

Andrea I Luppi

Laura E Suarez

Guillaume Lajoie

Bratislav Misic

Modularity is a fundamental principle of brain organization, reflected in the presence of segregated sub-networks that enable specialized information processing. These small, densely connected modules are often nested within larger, higher-order modules, giving rise to a hierarchical modular architecture. This structure is posited to balance information segregation in specialized neuronal communities and global integration via intermodular communication. Yet, how hierarchical modularity shapes network function remains unclear. Here we introduce a simple blockmodeling framework for generating and comparing multi-level hierarchical modular networks and implement them as recurrent neural network reservoirs to evaluate their computational capacity. We show that hierarchical modular networks enhance memory capacity, support multitasking, and give rise to a broader range of temporal dynamics compared to strictly modular and random networks. These functional advantages can be traced to topological features enriched in hierarchical modular networks, which include reciprocal and cyclic network motifs. To test whether the computational advantages of hierarchical modularity subsist in empirical human brain structural connectivity patterns, we develop a novel hierarchical modularity-preserving network null model, allowing us to isolate the positive effect of empirical hierarchical modularity patterns on memory capacity. To evaluate the biomimetic validity of connectome-informed reservoir dynamics, we compare reservoir timescales to empirical brain timescales derived from MEG data and find that hierarchical modularity contributes to shaping brain-like neural timescales.
Altogether, across multiple benchmarks, these results show that hierarchical modularity endows networks with computationally advantageous properties, providing insight into the relationship between neural network structure and function with potential applications for the design of neuromorphic computing architectures.

2025-06-20

bioRxiv (preprint)

doi.org

Behavioral Suite Analysis of Self-Supervised Learning in Atari

Somjit Nath

Rishav

Gopeshh Subbaraj

D. Nowrouzezahrai

S Ebrahimi Kahou

2025-06-19

rl-conference.cc/RLC/2025/Workshop/RLVG (accepted)

openreview.net

A deep generative model for deciphering cellular dynamics and in silico drug discovery in complex diseases

Yumin Zheng

Jonas C. Schupp

Taylor Adams

Geremy Clair

Aurelien Justet

Farida Ahangari

Xiting Yan

Paul Hansen

Marianne Carlon

Emanuela Cortesi

Marie Vermant

Robin Vos

Laurens J. De Sadeleer

Iván O. Rosas

Ricardo Pineda

John Sembrat

Melanie Königshoff

John E. McDonough

Bart M. Vanaudenaerde

Wim A. Wuyts … (see 2 more)

Naftali Kaminski

Jun Ding

Human diseases are characterized by intricate cellular dynamics. Single-cell transcriptomics provides critical insights, yet a persistent gap remains in computational tools for detailed disease progression analysis and targeted in silico drug interventions. Here we introduce UNAGI, a deep generative neural network tailored to analyse time-series single-cell transcriptomic data. This tool captures the complex cellular dynamics underlying disease progression, enhancing drug perturbation modelling and screening. When applied to a dataset from patients with idiopathic pulmonary fibrosis, UNAGI learns disease-informed cell embeddings that sharpen our understanding of disease progression, leading to the identification of potential therapeutic drug candidates. Validation using proteomics reveals the accuracy of UNAGI’s cellular dynamics analysis, and the use of the fibrotic cocktail-treated human precision-cut lung slices confirms UNAGI’s predictions that nifedipine, an antihypertensive drug, may have anti-fibrotic effects on human tissues. UNAGI’s versatility extends to other diseases, including COVID, demonstrating adaptability and confirming its broader applicability in decoding complex cellular dynamics beyond idiopathic pulmonary fibrosis, amplifying its use in the quest for therapeutic solutions across diverse pathological landscapes.

2025-06-19

Nature Biomedical Engineering (published)

doi.org

Discrete Compositional Generation via General Soft Operators and Robust Reinforcement Learning

Marco Jiralerspong

Esther Derman

Danilo Vucetic

Nikolay Malkin

Bilun Sun

Tianyu Zhang

Pierre-Luc Bacon

Gauthier Gidel

A major bottleneck in scientific discovery consists of narrowing an exponentially large set of objects, such as proteins or molecules, to a small set of promising candidates with desirable properties. While this process can rely on expert knowledge, recent methods leverage reinforcement learning (RL) guided by a proxy reward function to enable this filtering. By employing various forms of entropy regularization, these methods aim to learn samplers that generate diverse candidates that are highly rated by the proxy function. In this work, we make two main contributions. First, we show that these methods are liable to generate overly diverse, suboptimal candidates in large search spaces. To address this issue, we introduce a novel unified operator that combines several regularized RL operators into a general framework that better targets peakier sampling distributions. Secondly, we offer a novel, robust RL perspective of this filtering process. The regularization can be interpreted as robustness to a compositional form of uncertainty in the proxy function (i.e., the true evaluation of a candidate differs from the proxy's evaluation). Our analysis leads us to a novel, easy-to-use algorithm we name trajectory general mellowmax (TGM): we show it identifies higher quality, diverse candidates than baselines in both synthetic and real-world tasks. Code: https://github.com/marcojira/tgm.

2025-06-19

ArXiv (preprint)

doi.org

arxiv.org

Human-AI Alignment of Learning Trajectories in Video Games: a continual RL benchmark proposal

Yann Harel

Lune P Bellec

François Paugam

Hugo Delhaye

Audrey Durand

We propose a design for a continual reinforcement learning (CRL) benchmark called GHAIA, centered on human-AI alignment of learning trajectories in structured video game environments. Using Super Mario Bros. as a case study, gameplay is decomposed into short, annotated scenes organized into diverse task sequences based on gameplay patterns and difficulty. Evaluation protocols measure both plasticity and stability, with flexible revisit and pacing schedules. A key innovation is the inclusion of high-resolution human gameplay data collected under controlled conditions, enabling direct comparison of human and agent learning. In addition to adapting classical CRL metrics like forgetting and backward transfer, we introduce semantic transfer metrics capturing learning over groups of scenes sharing similar game patterns. We demonstrate the feasibility of our approach on human and agent data, and discuss key aspects of the first release for community input.

2025-06-19

rl-conference.cc/RLC/2025/Workshop/RLVG (accepted)

openreview.net

SlepNet: Spectral Subgraph Representation Learning for Neural Dynamics

Siddharth Viswanath

Rahul Singh

Yanlei Zhang

J. Adam Noah

Joy Hirsch

Smita Krishnaswamy

Graph neural networks have been useful in machine learning on graph-structured data, particularly for node classification and some types of graph classification tasks. However, they have had limited use in representing patterning of signals over graphs. Patterning of signals over graphs and in subgraphs carries important information in many domains including neuroscience. Neural signals are spatiotemporally patterned, high dimensional and difficult to decode. Graph signal processing and associated GCN models utilize the graph Fourier transform and are unable to efficiently represent spatially or spectrally localized signal patterning on graphs. Wavelet transforms have shown promise here, but offer non-canonical representations and cannot be tightly confined to subgraphs. Here we propose SlepNet, a novel GCN architecture that uses Slepian bases rather than graph Fourier harmonics. In SlepNet, the Slepian harmonics optimally concentrate signal energy on specifically relevant subgraphs that are automatically learned with a mask. Thus, they can produce canonical and highly resolved representations of neural activity, focusing energy of harmonics on areas of the brain which are activated. We evaluated SlepNet across three fMRI datasets, spanning cognitive and visual tasks, and two traffic dynamics datasets, comparing its performance against conventional GNNs and graph signal processing constructs. SlepNet outperforms the baselines in all datasets. Moreover, the extracted representations of signal patterns from SlepNet offer more resolution in distinguishing between similar patterns, and thus represent brain signaling transients as informative trajectories. Here we have shown that these extracted trajectory representations can be used for other downstream untrained tasks. Thus we establish that SlepNet is useful both for prediction and representation learning in spatiotemporal data.

2025-06-19

ArXiv (preprint)

doi.org

arxiv.org

Sparse-Reg: Improving Sample Complexity in Offline Reinforcement Learning using Sparsity

Samin Yeasar Arnob

Scott Fujimoto

Doina Precup

In this paper, we investigate the use of small datasets in the context of offline reinforcement learning (RL). While many common offline RL benchmarks employ datasets with over a million data points, many offline RL applications rely on considerably smaller datasets. We show that offline RL algorithms can overfit on small datasets, resulting in poor performance. To address this challenge, we introduce "Sparse-Reg": a regularization technique based on sparsity to mitigate overfitting in offline reinforcement learning, enabling effective learning in limited data settings and outperforming state-of-the-art baselines in continuous control.

2025-06-19

ArXiv (preprint)

doi.org

arxiv.org

An Empirical Study of Sensitive Information in Logs

Roozbeh Aghili

Heng Li

Foutse Khomh

2025-06-18

Proceedings of the ACM on Software Engineering (published)

doi.org

arxiv.org

Next-Token Prediction Should be Ambiguity-Sensitive: A Meta-Learning Perspective

The rapid adaptation ability of auto-regressive foundation models is often attributed to the diversity of their pre-training data. This is because, from a Bayesian standpoint, minimizing prediction error in such settings requires integrating over all plausible latent hypotheses consistent with observations. While this behavior is desirable in principle, it often proves too ambitious in practice: under high ambiguity, the number of plausible latent alternatives makes Bayes-optimal prediction computationally intractable. Cognitive science has long recognized this limitation, suggesting that under such conditions, heuristics or information-seeking strategies are preferable to exhaustive inference. Translating this insight to next-token prediction, we hypothesize that low- and high-ambiguity predictions pose different computational demands, making ambiguity-agnostic next-token prediction a detrimental inductive bias. To test this, we introduce MetaHMM, a synthetic sequence meta-learning benchmark with rich compositional structure and a tractable Bayesian oracle. We show that Transformers indeed struggle with high-ambiguity predictions across model sizes. Motivated by cognitive theories, we propose a method to convert pre-trained models into Monte Carlo predictors that decouple task inference from token prediction. Preliminary results show substantial gains in ambiguous contexts through improved capacity allocation and test-time scalable inference, though challenges remain.

2025-06-18

ArXiv (preprint)

doi.org

arxiv.org

Visual symbolic mechanisms: Emergent symbol processing in vision language models

Rim Assouel

Declan Campbell

Taylor Webb

To accurately process a visual scene, observers must bind features together to represent individual objects. This capacity is necessary, for instance, to distinguish an image containing a red square and a blue circle from an image containing a blue square and a red circle. Recent work has found that language models solve this 'binding problem' via a set of symbol-like, content-independent indices, but it is unclear whether similar mechanisms are employed by vision language models (VLMs). This question is especially relevant, given the persistent failures of VLMs on tasks that require binding. Here, we identify a set of emergent symbolic mechanisms that support binding in VLMs via a content-independent, spatial indexing scheme. Moreover, we find that binding errors can be traced directly to failures in these mechanisms. Taken together, these results shed light on the mechanisms that support symbol-like processing in VLMs, and suggest possible avenues for addressing the persistent binding failures exhibited by these models.

2025-06-17

ArXiv (preprint)

doi.org

arxiv.org
