Publications

AI Agent Safety is a Reinforcement Learning Problem

Reginald McLean

Tabitha Edith Lee

Montaser Mohammedalamen

Kevin Roice

Glen Berseth

Patrick M. Pilarski

Marlos C. Machado

Alyssa Lefaivre Škopac

Benjamin Rosman

With the rapid advancement and deployment of Agentic AI, our scientific understanding of their capabilities and limitations has not kept pace, leading to cases where AI agents cause harm. We argue that many of these safety limitations are not novel problems. Instead, the safety challenges currently facing AI agents can be seen as instances of problems the reinforcement learning (RL) community has studied rigorously for decades. The core of this argument concerns the problem formulation of AI agents. AI agents are designed to solve sequential decision-making problems: problems with long-term objectives in which actions have delayed consequences. To model such problems, the setup is one in which the agent receives observations and feedback on its progress, and then takes actions. This is precisely the formulation of the RL problem. In this paper, we formalize this problem equivalence, which we then leverage to argue that AI agent safety is a reinforcement learning problem: the failure modes currently observed in deployed AI agents are structural instances of problems RL has formalized for decades, and the RL safety literature provides principled tools to diagnose and address them. We conclude with a call for deliberate collaboration between the RL and AI agent research communities: AI agent researchers gain access to principled frameworks, while RL researchers gain a class of real-world problems that could expose fundamental gaps in current RL benchmarks and theory.

2026-05-22

AIWILD @ International Conference on Machine Learning (published)

openreview.net

Consistent Identification of Top-$K$ Nodes in Noisy Networks

Hui Shen

Eric D. Kolaczyk

Identifying the most influential nodes in a network, typically using centrality measures, is a central task in applied network analysis. However, real-world networks are often constructed from noisy or incomplete data, which can distort rankings and lead to errors in identifying the true top-

2026-05-22

arXiv (preprint)

doi.org

arxiv.org

ReCode: Unify Plan and Action for Universal Granularity Control

Zhaoyang Yu

Jiayi Zhang

Huixue Su

Yufan Zhao

Yifan Wu

Mingyi Deng

Jinyu Xiang

Yizhang Lin

Fanqi Kong

Lingxiao Tang

Yuyu Luo

Bang Liu

Chenglin Wu

Real-world tasks require decisions at varying granularities, and humans excel at this by leveraging a unified cognitive representation where planning is fundamentally understood as a high-level form of action. However, current Large Language Model (LLM)-based agents lack this crucial capability to operate fluidly across decision granularities. This limitation stems from existing paradigms that enforce a rigid separation between high-level planning and low-level action, which impairs dynamic adaptability and limits generalization. We propose **ReCode** (**Re**cursive **Code** Generation), a novel paradigm that addresses this limitation by unifying planning and action within a single code representation. In this representation, ReCode treats high-level plans as abstract placeholder functions, which the agent then recursively decomposes into finer-grained sub-functions until reaching primitive actions. This recursive approach dissolves the rigid boundary between plan and action, enabling the agent to dynamically control its decision granularity. Furthermore, the recursive structure inherently generates rich, multi-granularity training data, enabling models to learn hierarchical decision-making processes. Extensive experiments show ReCode significantly surpasses advanced baselines in inference performance and demonstrates exceptional data efficiency in training, validating our core insight that unifying planning and action through recursive code generation is a powerful and effective approach to achieving universal granularity control.

2026-05-22

AIWILD @ International Conference on Machine Learning (published)

doi.org

openreview.net

WebArena-Pro: A Heterogeneous, Multimodal, Reproducible Benchmark for Web Agents

Imene Kerboua

Fatemeh Pesaran zadeh

Xing Han Lu

Weijian Qi

Alexander Miller

Junyi Song

Yunjia Tian

Dongjin Kang

Seyeon Choi

Marzia Nouri

Ewen Gueguen

Matteo Boglioni

Fengyuan Liu

Zeyi Liao

Mengqi Yuan

Yue Li

Alexandre Lacoste

Alexandre Drouin

Spandana Gella

Huan Sun

Gunhee Kim

Siva Reddy

Web agents powered by large language and vision-language models are increasingly applied to realistic browser work that spans heterogeneous applications, multimodal content, and stateful workflows. However, existing reproducible web-agent benchmarks cover only a small number of web applications drawn from a few software categories, and restrict modality to text and vision. Live benchmarks broaden site coverage but sacrifice reproducibility, since pages and data drift between runs. Moreover, existing benchmarks do not meaningfully evaluate whether agents can understand and use audio and video content embedded within web tasks. To address these gaps, we introduce WebArena-Pro, a benchmark comprising 300 tasks across 20 self-hosted web applications in six domain categories, spanning distinct interface conventions, workflows, and data models. Across the evaluated agents, the best performance is achieved by Gemini 3.1 Pro, which attains 37.0% success under a 50-step budget, while open-source models' performance does not exceed 27.7% success. Among reproducible, human-curated web agent benchmarks, WebArena-Pro provides the broadest application coverage and the most comprehensive multimodal support to date. The benchmark treats audio and video as core observations alongside text and vision, with dedicated actions for extracting information from each. WebArena-Pro runs each task in isolation and supports reproducible, parallel evaluation. Tasks are authored through a dedicated annotator interface, filtered by LLM-assisted triage, and finally validated by humans before release.

2026-05-22

AIWILD @ International Conference on Machine Learning (published)

openreview.net

A Pragmatic Approach to Learned Indexing in RocksDB: Targeted Optimizations with Minimal System Modification

Shubham Vashisth

Olivier Michaud

Bettina Kemme

Oana Balmau

Learned indexes have emerged as a promising alternative to traditional index structures, offering higher throughput and lower memory usage by approximating the cumulative key distribution function with lightweight models. Despite these benefits, adoption in production systems remains limited, partly because learned indexes that support concurrency and persistence as effectively as, e.g., the B+-Tree, do not yet exist, while many research prototypes introduce substantial complexity. In this paper, we investigate whether off-the-shelf learned indexes can be integrated into a production database with minimal storage-engine redesign. Using RocksDB as a case study, we exploit its separation between in-memory Memtables and immutable on-disk files to deploy specialized indexes at each level. We show that directly applying existing learned indexes is insufficient under write-heavy workloads because frequent Memtable replacement prevents models from fully adapting. To address this, we introduce a reuse mechanism that preserves structural knowledge across Memtable instances. At the storage level, we replace RocksDB's disk index with a learned index without modifying the storage layer or read path. We further adapt a read-only learned index to be block-aware, enabling worst-case single-I/O lookups. We implement these techniques in MountDB, an extension of RocksDB. Experiments on large-scale workloads with diverse data distributions and access patterns show up to 1.5X higher write throughput and 2.1X higher read throughput than state-of-the-art systems, demonstrating that established learned indexes can be integrated into production systems with minimal overhead and substantial performance benefits.

2026-05-21

arXiv (preprint)

doi.org

arxiv.org

Cell type transcriptomic modules reveal shared molecular mechanisms in Alzheimer’s and Parkinson’s disease

Anwesha Bhattacharya

Edward A. Fon

Alain Dagher

Yasser Iturria-Medina

Jo Anne Stratton

L. M. Hodgson

David A Bennett

Historically, Alzheimer's disease (AD) and Parkinson's disease (PD) have been investigated as two distinct disorders of the brain. However, a few similarities in neuropathology and clinical symptoms have been documented over the years. Traditional single-gene-centric studies, such as differential gene expression analyses, have struggled to unravel the molecular basis for the observed pathological links between AD and PD. To address this, we tailor a latent factor framework to analyze synchronous gene co-expression at sub-cell-type resolution. Utilizing large single-nucleus transcriptomics datasets in AD (70,634 nuclei) and PD (340,902 nuclei) from postmortem human brains, we systematically extract and juxtapose disease-critical molecular signatures in the brain. Our transcriptomic analysis reveals shared molecular programs between AD and PD that systematically localize to specific glial and neuronal cell types. In neurons, convergent gene groups in AD and PD relate to cytoskeletal dynamics and mitochondrial stress mechanisms. Similarly, overlapping gene groups in microglia modules implicate T cell activation mechanisms and synapse pruning pathways. In parallel, AD- and PD-associated genes in astrocytes are involved in heavy metal processing, while oligodendrocytes highlight convergent dysregulation in myelin synthesis. In addition, our analysis reveals that APOE, an AD GWAS gene, has disease-predictive roles in PD-associated gene modules. Conversely, SNCA, a PD GWAS gene, emerges within AD-associated gene modules. Our multi-module, sub-cell-type approach offers unique insights into the molecular basis of shared neuropathology in AD and PD.

2026-05-20

GigaScience (published)

doi.org

RiT: Vanilla Diffusion Transformers Suffice in Representation Space

Le Zhang

Ning Mang

Aishwarya Agrawal

Flow matching with …

2026-05-20

arXiv (preprint)

doi.org

arxiv.org

Sex-specific hormone-sensitive regulatory architecture in adolescence as a scaffold for depression vulnerability

Gladi Thng

Michel Garcia-Miranda

Kailu Song

Anjali Chawla

Reine Khoury

Minh Nguyen

Gabriella Frosi

Matthew Suderman

David Liao

Natalina Salmaso

Tie Yuan Zhang

Pan Wong Tak

Yashar Zeighami

Jun Ding

Corina Nagy

2026-05-20

bioRxiv (preprint)

doi.org

External validation of cough-based algorithms for pulmonary tuberculosis screening from the CODA TB DREAM challenge using cough data from Peru

Alexandra J. Zimmer

Patricia Espinoza-Lopez

Vijay Ravi

Solveig K. Sieberts

Samira Abbasgholizadeh Rahimi

Madhukar Pai

César Ugarte-Gil

Simon Grandjean Lapierre

The COugh Diagnostic Algorithm for Tuberculosis (CODA TB) DREAM Challenge recently evaluated the performance of artificial intelligence (AI) algorithms for tuberculosis (TB) screening using cough sounds. Eleven AI models were developed using a dataset of 733,756 cough sounds collected from 2143 adults from seven countries. This study evaluates the CODA Challenge AI models with an external independent cough dataset from Peru. Cough recordings from 303 coughing adults were collected from health facilities in Lima, Peru. The AUCs of the models ranged from 0.480 to 0.615, showing a decrease in performance compared to their performance when internally validated using the CODA Challenge, which ranged from 0.689 to 0.743. The best performing model in the CODA Challenge was also the best performing model in this external validation. Sub-group analyses revealed that models performed better in older (≥ 35 years) populations and among people with prior TB. The external validation revealed limitations in the generalizability of the CODA Challenge models to other settings. While some models showed promise, the overall performance decline highlights the need for continued model validation on external datasets. It also underscores the importance of developing context-specific models to account for population-specific factors that influence cough characteristics and TB prevalence.

2026-05-19

Scientific Reports (published)

doi.org

Mem-$π$: Adaptive Memory through Learning When and What to Generate

Xiaoqiang Wang

Chao Wang

Hadi Nekoei

Christopher Pal

Alexandre Lacoste

Spandana Gella

Bang Liu

Perouz Taslakian

We present Mem-…

2026-05-19

arXiv (preprint)

doi.org

arxiv.org

Model Stealing Through the Lens of Model Multiplicity

Eliott Baltz

Satoshi Hara

Ulrich Aivodji

Model stealing attacks, where adversaries create high-fidelity surrogate models, are a significant threat to the intellectual property of machine learning services. Conventional wisdom suggests these surrogates could provide adversaries with economic leverage comparable to the original service providers. This paper challenges this assumption by evaluating model stealing attacks beyond mere fidelity to the target model. Because query-based extraction provides only partial supervision of the target's input-output behavior, the surrogate is not uniquely identified: many near-optimal surrogates can achieve comparable fidelity while differing in deployment-relevant properties. Instead of performing a classic learning-based model stealing attack, we compute the Rashomon set (i.e., the set of almost-equally-accurate models) of surrogate models, and evaluate its diversity using multiplicity metrics (ambiguity, discrepancy, and Rashomon capacity) and group fairness metrics. Our experiments on real-world datasets reveal that despite exhibiting similar fidelity to the target model, surrogate models can display significant variance in other critical performance metrics. These findings cast doubt on the presumed equivalence between high-fidelity surrogates and the target model in practical deployment scenarios.

2026-05-19

DMP @ Canadian Conference on Artificial Intelligence and Conference on Robots and Vision (oral)

openreview.net

Representations in vision and language converge in a shared, multidimensional space of perceived similarities

Katerina M. Simkova

Adrien Doerig

Clayton Hickey

Ian Charest

Humans can effortlessly describe what they see, yet establishing a shared representational format between vision and language remains a significant challenge. Emerging evidence suggests that human brain representations in both vision and language are well predicted by semantic feature spaces obtained from large language models (LLMs). This raises the possibility that sensory systems converge in their inherent ability to transform their inputs onto a shared, embedding-like representational space. However, it remains unclear how such a space manifests in human behavior. To investigate this, 63 participants performed behavioral similarity judgments separately on 100 natural scene images and 100 corresponding sentence captions from the Natural Scenes Dataset. We found that visual and linguistic similarity judgments not only converge at the behavioral level but also predict a remarkably similar network of functional magnetic resonance imaging brain responses evoked by viewing the natural scene images. Furthermore, computational models trained to map images onto LLM embeddings outperformed both category-trained and AlexNet controls in predicting the behavioral similarity structure. These findings demonstrate that human visual and linguistic similarity judgments are grounded in a shared, modality-agnostic representational structure that mirrors how the visual system encodes experience. The convergence between sensory and artificial systems observed here suggests a common capacity in how conceptual representations are formed: not as arbitrary products of first-order, modality-specific input, but as structured representations that reflect the stable, relational properties of the external world.

2026-05-19

Journal of Vision (published)

doi.org
