Publications

Challenges in Using LLM Agents to Validate Agent Governance

Héber Hwang Arcolezi

The increasing deployment of Large Language Models (LLMs) as autonomous agents has intensified the need for credible and trustworthy methods to evaluate governance interventions. Motivated by recent research, this work considers the use of LLM and agent-based simulations to evaluate AI agent governance mechanisms before real-world deployment. While conceptually appealing, this approach introduces various challenges. We examine three such problems: (1) obtaining ground truth for validation, (2) determining whether observed behaviors represent actual agent operations or simulation artifacts, and (3) obtaining consent for data use, and addressing ethical concerns about computational surrogates replacing real users. We also outline considerations based on documented limitations, aiming to catalyze workshop discussion on trustworthy and reliable evaluation methods for agent governance.

2026-05-08

PoliSim @ ACM Conference on Human Factors in Computing Systems (published)

openreview.net

Actor-Critic Algorithm for Dynamic Expectile and CVaR

Yudong Luo

Erick Delage

Optimizing dynamic risk with stochastic policies is challenging in both policy updates and value learning. The former typically requires transition perturbation, while the latter may rely on model-based approaches. To address these challenges, we propose a surrogate policy gradient without transition perturbation under softmax policy parameterization. We further develop model-free value learning methods for dynamic expectile and conditional value-at-risk by leveraging elicitability. Finally, inspired by Expected SARSA and Expected Policy Gradient, a model-free off-policy actor-critic algorithm is constructed. Empirical results in domains with verifiable risk-averse behavior show that our algorithm can learn risk-averse policies and consistently outperforms other existing methods.

2026-05-07

arXiv (preprint)

doi.org

arxiv.org
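
The abstract above leans on elicitability: a statistic is elicitable when it is the minimizer of an expected loss, which is what allows it to be estimated from samples alone. For the expectile, that loss is an asymmetrically weighted squared error. The following is a minimal sketch of that general property only, not the authors' algorithm; the function names `expectile_loss` and `expectile` and the sample values are hypothetical, chosen for illustration.

```python
def expectile_loss(q, samples, tau):
    """Asymmetric squared loss; its minimizer over q is the tau-expectile."""
    total = 0.0
    for x in samples:
        # Samples at or above q are weighted by tau, samples below by 1 - tau.
        weight = tau if x >= q else 1.0 - tau
        total += weight * (x - q) ** 2
    return total / len(samples)


def expectile(samples, tau, iters=200):
    """Estimate the tau-expectile (0 < tau < 1) by fixed-point iteration.

    Setting the derivative of the loss to zero gives
    q = sum(w_i * x_i) / sum(w_i), where the weights w_i depend on q.
    """
    q = sum(samples) / len(samples)  # start from the mean (the 0.5-expectile)
    for _ in range(iters):
        weights = [tau if x > q else 1.0 - tau for x in samples]
        q = sum(w * x for w, x in zip(weights, samples)) / sum(weights)
    return q


# Hypothetical sample of returns, for illustration only.
returns = [1.0, 2.0, 3.0, 4.0]
neutral = expectile(returns, 0.5)   # coincides with the sample mean
cautious = expectile(returns, 0.1)  # tau < 0.5 emphasizes low returns
```

With `tau = 0.5` the weights are symmetric and the estimate reduces to the mean; lowering `tau` pulls the estimate toward the worse outcomes, which is the sense in which an expectile can encode risk aversion over returns.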

A Call to Lagrangian Action: Learning Population Mechanics from Temporal Snapshots

Vincent Guan

Lazar Atanackovic

Kirill Neklyudov

The population dynamics of molecules, cells, and organisms are governed by a number of unknown forces. In the last decade, population dynamics have predominantly been modeled with Wasserstein gradient flows. However, since gradient flows minimize free energy, they fail to capture important dynamical properties, such as periodicity. In this work, we propose a change in perspective by considering dynamics that minimize a population-level action under a damped Wasserstein Lagrangian. By deriving the corresponding Hamiltonian equations of motion, we formalize Wasserstein Lagrangian Mechanics, a structured class of second-order dynamics that encompasses classical mechanics, quantum mechanics, and gradient flows. We then propose WLM as the first algorithm that learns these second-order dynamics from observed marginals, without specifying the Lagrangian. By directly learning the population mechanics, WLM can both forecast and interpolate unseen marginals, and outperforms existing gradient flow and flow matching methods across a wide range of dynamics, including vortex dynamics, embryonic development, and flocking.

2026-05-07

arXiv (preprint)

doi.org

arxiv.org

Estimation of head motion in structural MRI and its impact on cortical morphometry

Charles Bricout

Samira Ebrahimi Kahou

Sylvain Bouix

Motion-related artifacts are inevitable in Magnetic Resonance Imaging (MRI) and can bias automated neuroanatomical metrics such as cortical thickness. These biases can interfere with statistical analysis, which is a major concern as motion has been shown to be more prominent in certain populations such as children or individuals with ADHD. Manual review cannot objectively quantify motion in anatomical scans, and existing quantitative automated approaches often require specialized hardware or custom acquisition protocols. Here, we train a 3D convolutional neural network to estimate a summary motion metric in retrospective routine research scans by leveraging a large training dataset of synthetically motion-corrupted volumes. We validate our method with one held-out site from our training cohort and with 14 fully independent datasets, including one with manual ratings, achieving a Spearman rank correlation of 0.71 vs. manual labels. We also tested the correlation of our predicted motion score with morphometric measurements known to be impacted by motion, achieving significant correlation on most datasets. Furthermore, our predicted motion correlates with subject age in line with prior studies. Our approach shows good generalization across scanner brands and protocols, enabling objective, scalable motion assessment in structural MRI studies without prospective motion correction. Finally, we provide empirical evidence that our motion estimator significantly improves model fitness when studying cortical thickness and volume. Our final model is made openly and freely available through “Agitation,” a tool usable as a CLI and a Python package, and integrated in Nipoppy and Boutiques. By providing reliable motion estimates, our method offers researchers a tool to assess and account for potential biases in cortical morphometric analyses.

2026-05-07

Frontiers in Neuroscience (published)

doi.org

Neurobagel: building an international network for distributed data discovery

Michelle Wang

Jean-Baptiste Poline

Yaroslav O. Halchenko

Jan G. Bjaalie

Katie M. Lavigne

Jeffrey Grethe

Max A. Laansma

Barbara Strasser-Kirchweger

Emile d’Angremont

David N. Kennedy

Neda Jahanshad

Sean N. Hatton

Nikhil Bhagwat

Tristan Glatard

Brent McPherson

Satrajit Ghosh

Gabriel Devenyi

Stéphane Lehéricy

Vincent Taschereau‐Dumouchel

Florian Hutzler

Sebastian Urchs

Michael Hanke

Christopher J. Markiewicz

Russell A. Poldrack

Francis Jeanson

Eva van Heese

David Keator

Camille Maumet

M. Mallar Chakravarty

Franco Pestilli

Julia-Katharina Pfarr

Erin W Dickie

Alyssa Dai

Arman Jahanpour

Mathieu Dugré

Lyuba Zehl

Ysbrand van der Werf

Paul Thompson

International data privacy regulations impede the pooling of research data for collaborative analysis. We introduce Neurobagel, a federated network enabling cohort discovery across locally governed, access-controlled datasets. Through intuitive graphical tools and a decentralized query infrastructure, Neurobagel facilitates harmonization, control, and discovery of data according to local regulations. Today, Neurobagel is deployed by consortia and data platforms in Europe, North America, Asia, and Australia, supporting diverse and evolving regulatory frameworks.

2026-05-07

MetArXiv (preprint)

doi.org

RNAGenScape: Property-Guided, Optimized Generation of mRNA Sequences with Manifold Langevin Dynamics.

Danqi Liao

Chen Liu

Xingzhi Sun

Dié Tang

Haochen Wang

Scott Youlten

Srikar Krishna Gopinath

Haejeong Lee

Ethan C. Strayer

Antonio J. Giraldez

Smita Krishnaswamy

Generating property-optimized mRNA sequences is central to applications such as vaccine design and protein replacement therapy, but remains challenging due to limited data, complex sequence-function relationships, and the narrow space of biologically viable sequences. Generative methods that drift away from the data manifold can yield sequences that fail to fold, translate poorly, or are otherwise nonfunctional. We present RNAGenScape, a property-guided manifold Langevin dynamics framework for mRNA sequence generation that operates directly on a learned manifold of real data. By performing iterative local optimization constrained to this manifold, RNAGenScape preserves biological viability, accesses reliable guidance, and avoids excursions into nonfunctional regions of the ambient sequence space. The framework integrates three components: (1) an autoencoder jointly trained with a property predictor to learn a property-organized latent manifold, (2) a denoising autoencoder that projects updates back onto the manifold, and (3) a property-guided Langevin dynamics procedure that performs optimization along the manifold. Across three real-world mRNA datasets spanning two orders of magnitude in size, RNAGenScape increases median property gain by up to 148% and success rate by up to 30% while ensuring biological viability of generated sequences, and achieves competitive inference efficiency relative to existing generative approaches.

2026-05-07

arXiv (published)

pubmed.ncbi.nlm.nih.gov

Rotation-Preserving Supervised Fine-Tuning

Mohammad Hamdaqa

Supervised fine-tuning (SFT) improves in-domain performance but can degrade out-of-domain (OOD) generalization. Prior work suggests that this degradation is related to changes in dominant singular subspaces of pretrained weight matrices. However, directly identifying loss-sensitive directions with Hessian or Fisher information is computationally expensive at LLM scale. In this work, we propose preserving projected rotations in pretrained singular subspaces as an efficient proxy for Fisher-sensitive directions, which we call Rotation-Preserving Supervised Fine-Tuning (RPSFT). RPSFT penalizes changes in the projected top-

2026-05-07

arXiv (preprint)

doi.org

openreview.net

Simply the best – A systematic evaluation approach for third-party libraries based on mobile app quality attributes

Rubén Saborido

Rémy Raes

Rodrigo Morales

Romain Rouvoy

Foutse Khomh

Yann-Gaël Guéhéneuc

Mobile device applications (apps) are complex because they rely on integrating multiple third-party libraries (TPLs). Yet, TPLs ease app development by offering implementations of specific functionality. For example, app developers often use advertising libraries to generate revenue, integrate social networking libraries to simplify login, or include crash reporting libraries to monitor/report crashes in their apps. However, there are multiple TPLs with similar functionalities from which to choose, and developers often cannot foresee all the consequences of using these libraries in their apps. The sizes of apps grow with the addition and usage of TPLs, and so does the number of required permissions and resource consumption. Thus, TPLs may degrade the quality of apps and developers need help measuring and comparing them. We propose EQuAT, an approach for Evaluating Quality Attributes of TPLs that eases the comparison of TPLs. EQuAT takes as input minimal apps that integrate TPLs and playable scenarios to simulate user interaction while exercising a particular functionality of the included TPL. By collecting quality metrics and comparing them using plots, we provide app developers with a systematic approach to rank TPLs based on their preferences. We show how EQuAT helps developers make informed decisions about which libraries to integrate into their apps by validating them against nine TPLs across three categories.

2026-05-07

Empirical Software Engineering (published)

doi.org

TCMIIES: A Browser-Based LLM-Powered Intelligent Information Extraction System for Academic Literature

Hanqing Zhao

The exponential growth of academic publications has created an urgent need for automated tools capable of extracting structured knowledge from unstructured scientific texts. While large language models (LLMs) have demonstrated remarkable capabilities in natural language understanding and information extraction, existing solutions often require specialized infrastructure, programming expertise, or fine-tuned domain-specific models that create barriers for researchers in specialized fields. This paper presents TCMIIES, a browser-based, zero-installation platform that leverages commercial LLM APIs to perform structured information extraction from academic literature. The system employs a novel schema-guided prompting framework with automatic system prompt generation, enabling researchers to define custom extraction schemas through an intuitive graphical interface without any programming. TCMIIES features a pure front-end architecture that ensures data privacy by processing all information locally in the browser, supports five major LLM providers, implements concurrent batch processing with automatic retry mechanisms, and provides intelligent field mapping for Chinese academic databases including CNKI and Wanfang. We demonstrate the system's effectiveness through comprehensive evaluation across multiple extraction scenarios in Traditional Chinese Medicine research, achieving structured output compliance rates exceeding 94% and information extraction accuracy comparable to domain-expert annotation. The system represents a practical, accessible solution that bridges the gap between advanced LLM capabilities and domain-specific academic information extraction needs, particularly for researchers in specialized fields who require flexible, privacy-preserving, and cost-effective extraction tools.

2026-05-07

arXiv (preprint)

doi.org

arxiv.org

Agentick: A Unified Benchmark for General Sequential Decision-Making Agents

Roger Creus Castanyer

Pablo Samuel Castro

Glen Berseth

AI agent research spans a wide spectrum: from RL agents that learn from scratch to foundation model agents that leverage pre-trained knowledge, yet no unified benchmark enables fair comparison across these approaches. We present Agentick, a benchmark for sequential decision-making agents designed to evaluate RL, LLM, VLM, hybrid, and human agents on common ground and to power research on the fundamental challenges of sequential decision-making. Agentick provides 37 procedurally generated tasks across six capability categories, four difficulty levels, and five observation modalities, all exposed through a single Gymnasium-compatible interface. The benchmark ships with a Coding API, oracle reference policies for all tasks, pre-built SFT datasets, a composable agent harness, and a live leaderboard. An evaluation spanning 27 configurations and over 90,000 episodes reveals that no single approach dominates: GPT-5 mini leads overall at 0.309 oracle-normalized score while PPO dominates planning and multi-agent tasks; the reasoning harness multiplies LLM performance by 3-10x; and ASCII observations consistently outperform natural language. These findings highlight the substantial room for improvement that remains across all agent paradigms. Agentick's capability-decomposed, multi-modal design provides the empirical infrastructure needed to drive progress toward general autonomous agents, both as an evaluation framework and as a training ground for RL post-training of foundation models in truly sequential environments.

2026-05-06

arXiv (preprint)

doi.org

arxiv.org

Beyond the total NIHSS score: association between impaired level of consciousness and early neurological deterioration in mild large vessel occlusion stroke

Qiangze Ji

Liangliang Sun

Zenghui Liu

Hanqing Zhao

Jing Yu

Ying Zhang

Kaiyue Duan

Lili Guo

Qiuyi Zhang

2026-05-06

Neuroradiology (published)

doi.org

Diversity Curves for Graph Representation Learning

Katharina Limbeck

Nadja Häusermann

Martin Carrasco

Guy Wolf

Bastian Rieck

Graph-level representations are crucial tools for characterising structural differences between graphs. However, comparing graphs with different cardinalities, even when sampled from the same underlying distribution, remains challenging. Unsupervised tasks in particular require interpretable, scalable, and reliable size-aware graph representations. Our work addresses these issues by tracking the structural diversity of a graph across coarsening levels. The resulting graph embeddings, which we denote diversity curves, are interpretable by construction, efficient, and directly comparable across coarsening hierarchies. Specifically, we track the spread of graphs, a novel isometry invariant that is inherently well-suited for encoding the metric diversity and geometry of graphs. We utilise edge contraction coarsening and prove that this improves expressivity, thus leading to more powerful graph-level representations than structural descriptors alone. Demonstrating their utility over a range of baseline methods in practice, we use diversity curves to (i) cluster and visualise simulated graphs across varying sizes, (ii) distinguish the geometry of single-cell graphs, (iii) compare the structure of molecular graph datasets, and (iv) characterise geometric shapes.

2026-05-06

arXiv (preprint)

doi.org

arxiv.org
