Publications

AI Agent Safety is a Reinforcement Learning Problem
Reginald McLean
Montaser Mohammedalamen
Kevin Roice
Patrick M. Pilarski
Marlos C. Machado
Alyssa Lefaivre Škopac
Benjamin Rosman
With the rapid advancement and deployment of Agentic AI, our scientific understanding of capabilities and limitations has not kept pace, leading to cases where AI agents cause harm. We argue that many of these safety limitations are not novel problems. Instead, the safety challenges currently facing AI agents can be seen as instances of problems the reinforcement learning (RL) community has studied rigorously for decades. The core of this argument concerns the problem formulation of AI agents. AI agents are designed to solve sequential decision-making problems: problems with long-term objectives in which actions have delayed consequences. To model these types of problems, the problem is set up such that the agent receives observations and feedback on its progress, and then takes actions. This is precisely the formulation of the RL problem. In this paper, we formalize the problem equivalence, which we then leverage to argue that AI Agent safety is a reinforcement learning problem: the failure modes currently observed in deployed AI agents are structural instances of problems RL has formalized for decades, and the RL safety literature provides principled tools to diagnose and address them. We conclude with a call for deliberate collaboration between the RL and AI agent research communities: AI agent researchers gain access to principled frameworks, while RL researchers gain a class of real-world problems that could expose fundamental gaps in current RL benchmarks and theory.
Consistent Identification of Top-$K$ Nodes in Noisy Networks
Hui Shen
Eric D. Kolaczyk
Identifying the most influential nodes in a network, typically using centrality measures, is a central task in applied network analysis. However, real-world networks are often constructed from noisy or incomplete data, which can distort rankings and lead to errors in identifying the true top-
ReCode: Unify Plan and Action for Universal Granularity Control
Zhaoyang Yu
Jiayi Zhang
Huixue Su
Yufan Zhao
Yifan Wu
Mingyi Deng
Jinyu Xiang
Yizhang Lin
Fanqi Kong
Lingxiao Tang
Yuyu Luo
Chenglin Wu
Real-world tasks require decisions at varying granularities, and humans excel at this by leveraging a unified cognitive representation where planning is fundamentally understood as a high-level form of action. However, current Large Language Model (LLM)-based agents lack this crucial capability to operate fluidly across decision granularities. This limitation stems from existing paradigms that enforce a rigid separation between high-level planning and low-level action, which impairs dynamic adaptability and limits generalization. We propose **ReCode** (**Re**cursive **Code** Generation), a novel paradigm that addresses this limitation by unifying planning and action within a single code representation. In this representation, ReCode treats high-level plans as abstract placeholder functions, which the agent then recursively decomposes into finer-grained sub-functions until reaching primitive actions. This recursive approach dissolves the rigid boundary between plan and action, enabling the agent to dynamically control its decision granularity. Furthermore, the recursive structure inherently generates rich, multi-granularity training data, enabling models to learn hierarchical decision-making processes. Extensive experiments show ReCode significantly surpasses advanced baselines in inference performance and demonstrates exceptional data efficiency in training, validating our core insight that unifying planning and action through recursive code generation is a powerful and effective approach to achieving universal granularity control.
WebArena-Pro: A Heterogeneous, Multimodal, Reproducible Benchmark for Web Agents
Fatemeh Pesaran zadeh
Weijian Qi
Alexander Miller
Junyi Song
Yunjia Tian
Dongjin Kang
Seyeon Choi
Ewen Gueguen
Zeyi Liao
Mengqi Yuan
Alexandre Lacoste
Huan Sun … (see 2 more)
Gunhee Kim
Web agents powered by large language and vision-language models are increasingly applied to realistic browser work that spans heterogeneous applications, multimodal content, and stateful workflows. However, existing reproducible web-agent benchmarks cover only a small number of web applications drawn from a few software categories, and restrict modality to text and vision. Live benchmarks broaden site coverage but sacrifice reproducibility, since pages and data drift between runs. Moreover, existing benchmarks do not meaningfully evaluate whether agents can understand and use audio and video content embedded within web tasks. To address these gaps, we introduce WebArena-Pro, a benchmark comprising 300 tasks across 20 self-hosted web applications in six domain categories, spanning distinct interface conventions, workflows, and data models. Across the evaluated agents, the best performance is achieved by Gemini 3.1 Pro, which attains 37.0% success under a 50-step budget, while open-source models' performance does not exceed 27.7% success. Among reproducible, human-curated web agent benchmarks, WebArena-Pro provides the broadest application coverage and the most comprehensive multimodal support to date. The benchmark treats audio and video as core observations alongside text and vision, with dedicated actions for extracting information from each. WebArena-Pro runs each task in isolation and supports reproducible, parallel evaluation. Tasks are authored through a dedicated annotator interface, filtered by LLM-assisted triage, and finally validated by humans before release.
A Pragmatic Approach to Learned Indexing in RocksDB: Targeted Optimizations with Minimal System Modification
Olivier Michaud
Bettina Kemme
Learned indexes have emerged as a promising alternative to traditional index structures, offering higher throughput and lower memory usage by approximating the cumulative key distribution function with lightweight models. Despite these benefits, adoption in production systems remains limited, partly because learned indexes that support concurrency and persistence as effectively as, e.g., the B+-Tree, do not yet exist, while many research prototypes introduce substantial complexity. In this paper, we investigate whether off-the-shelf learned indexes can be integrated into a production database with minimal storage-engine redesign. Using RocksDB as a case study, we exploit its separation between in-memory Memtables and immutable on-disk files to deploy specialized indexes at each level. We show that directly applying existing learned indexes is insufficient under write-heavy workloads because frequent Memtable replacement prevents models from fully adapting. To address this, we introduce a reuse mechanism that preserves structural knowledge across Memtable instances. At the storage level, we replace RocksDB's disk index with a learned index without modifying the storage layer or read path. We further adapt a read-only learned index to be block-aware, enabling worst-case single-I/O lookups. We implement these techniques in MountDB, an extension of RocksDB. Experiments on large-scale workloads with diverse data distributions and access patterns show up to 1.5X higher write throughput and 2.1X higher read throughput than state-of-the-art systems, demonstrating that established learned indexes can be integrated into production systems with minimal overhead and substantial performance benefits.
Cell type transcriptomic modules reveal shared molecular mechanisms in Alzheimer’s and Parkinson’s disease
Edward A. Fon
Alain Dagher
Yasser Iturria-Medina
Jo Anne Stratton
L. M. Hodgson
David A Bennett
Historically, Alzheimer's disease (AD) and Parkinson's disease (PD) have been investigated as two distinct disorders of the brain. However, a few similarities in neuropathology and clinical symptoms have been documented over the years. Traditional single-gene-centric studies, such as differential gene expression analyses, have struggled to unravel the molecular basis for the observed pathological links between AD and PD. To address this, we tailor a latent factor framework to analyze synchronous gene co-expression at sub-cell-type resolution. Utilizing large, single-nucleus transcriptomics datasets in AD (70,634 nuclei) and PD (340,902 nuclei) from postmortem human brains, we systematically extract and juxtapose disease-critical molecular signatures in the brain. Our transcriptomic analysis reveals shared molecular programs between AD and PD that systematically localize to specific glial and neuronal cell types. In neurons, convergent gene groups in AD and PD relate to cytoskeletal dynamics and mitochondrial stress mechanisms. Similarly, overlapping gene groups in microglia modules implicate T cell activation mechanisms and synapse pruning pathways. In parallel, AD- and PD-associated genes in astrocytes are involved in heavy metal processing; oligodendrocytes highlight convergent dysregulation in myelin synthesis. In addition, our analysis reveals that APOE, an AD GWAS gene, has disease-predictive roles in PD-associated gene modules. Conversely, SNCA, a PD GWAS gene, emerges within AD-associated gene modules. Our multi-module sub-cell-type approach offers unique insights into the molecular basis of shared neuropathology in AD and PD.
RiT: Vanilla Diffusion Transformers Suffice in Representation Space
Flow matching with …
Sex-specific hormone-sensitive regulatory architecture in adolescence as a scaffold for depression vulnerability
Gladi Thng
Michel Garcia-Miranda
Kailu Song
Anjali Chawla
Reine Khoury
Minh Nguyen
Gabriella Frosi
Matthew Suderman
David Liao
Natalina Salmaso
Tie Yuan Zhang
Pan Wong Tak
Yashar Zeighami
Corina Nagy
External validation of cough-based algorithms for pulmonary tuberculosis screening from the CODA TB DREAM challenge using cough data from Peru
Alexandra J. Zimmer
Patricia Espinoza-Lopez
Vijay Ravi
Solveig K. Sieberts
Samira Abbasgholizadeh Rahimi
Madhukar Pai
César Ugarte-Gil
Simon Grandjean Lapierre
The COugh Diagnostic Algorithm for Tuberculosis (CODA TB) DREAM Challenge recently evaluated the performance of artificial intelligence (AI) algorithms for tuberculosis (TB) screening using cough sounds. Eleven AI models were developed using a dataset of 733,756 cough sounds collected from 2143 adults from seven countries. This study evaluates the CODA Challenge AI models with an external independent cough dataset from Peru. Cough recordings from 303 coughing adults were collected from health facilities in Lima, Peru. The AUCs of the models ranged from 0.480 to 0.615, showing a decrease in performance compared to their performance when internally validated using the CODA Challenge, which ranged from 0.689 to 0.743. The best performing model in the CODA Challenge was also the best performing model in this external validation. Sub-group analyses revealed that models performed better in older (≥ 35 years) populations and among people with prior TB. The external validation revealed limitations in the generalizability of the CODA Challenge models to other settings. While some models showed promise, the overall performance decline highlights the need for continued model validation on external datasets. It also underscores the importance of developing context-specific models to account for population-specific factors that influence cough characteristics and TB prevalence.
Mem-$π$: Adaptive Memory through Learning When and What to Generate
Chao Wang
Christopher Pal
Alexandre Lacoste
We present Mem-…
Model Stealing Through the Lens of Model Multiplicity
Model stealing attacks, where adversaries create high-fidelity surrogate models, are a significant threat to the intellectual property of machine learning services. Conventional wisdom suggests these surrogates could provide adversaries with economic leverage comparable to the original service providers. This paper challenges this assumption by evaluating model stealing attacks beyond mere fidelity to the target model. Because query-based extraction provides only partial supervision of the target's input-output behavior, the surrogate is not uniquely identified: many near-optimal surrogates can achieve comparable fidelity while differing in deployment-relevant properties. Instead of performing a classic learning-based model stealing attack, we compute the Rashomon Set (i.e., the set of almost-equally-accurate models) of surrogate models, and evaluate its diversity using multiplicity metrics (ambiguity, discrepancy, and Rashomon capacity) and group fairness metrics. Our experiments on real-world datasets reveal that despite exhibiting similar fidelity to the target model, surrogate models can display significant variances in other critical performance metrics. These findings cast doubt on the presumed equivalence between high-fidelity surrogates and the target model in practical deployment scenarios.
Representations in vision and language converge in a shared, multidimensional space of perceived similarities
Katerina M. Simkova
Adrien Doerig
Clayton Hickey
Humans can effortlessly describe what they see, yet establishing a shared representational format between vision and language remains a significant challenge. Emerging evidence suggests that human brain representations in both vision and language are well predicted by semantic feature spaces obtained from large language models (LLMs). This raises the possibility that sensory systems converge in their inherent ability to transform their inputs onto a shared, embedding-like representational space. However, it remains unclear how such a space manifests in human behavior. To investigate this, 63 participants performed behavioral similarity judgments separately on 100 natural scene images and 100 corresponding sentence captions from the Natural Scenes Dataset. We found that visual and linguistic similarity judgments not only converge at the behavioral level but also predict a remarkably similar network of functional magnetic resonance imaging brain responses evoked by viewing the natural scene images. Furthermore, computational models trained to map images onto LLM embeddings outperformed both category-trained and AlexNet controls in predicting the behavioral similarity structure. These findings demonstrate that human visual and linguistic similarity judgments are grounded in a shared, modality-agnostic representational structure that mirrors how the visual system encodes experience. The convergence between sensory and artificial systems observed here suggests a common capacity for how conceptual representations are formed: not as arbitrary products of first-order, modality-specific input, but as structured representations that reflect the stable, relational properties of the external world.