Publications

Challenges in Using LLM Agents to Validate Agent Governance
Héber Hwang Arcolezi
The increasing deployment of Large Language Models (LLMs) as autonomous agents has intensified the need for credible and trustworthy methods to evaluate governance interventions. Motivated by recent research, this work considers the use of LLM and agent-based simulations to evaluate AI agent governance mechanisms before real-world deployment. While conceptually appealing, this approach introduces various challenges. We examine three such problems: (1) obtaining ground truth for validation, (2) determining whether observed behaviors represent actual agent operations or simulation artifacts, and (3) obtaining consent for data use, and addressing ethical concerns about computational surrogates replacing real users. We also outline considerations based on documented limitations, aiming to catalyze workshop discussion on trustworthy and reliable evaluation methods for agent governance.
Actor-Critic Algorithm for Dynamic Expectile and CVaR
Optimizing dynamic risk with stochastic policies is challenging in both policy updates and value learning. The former typically requires transition perturbation, while the latter may rely on model-based approaches. To address these challenges, we propose a surrogate policy gradient without transition perturbation under softmax policy parameterization. We further develop model-free value learning methods for dynamic expectile and conditional value-at-risk by leveraging elicitability. Finally, inspired by Expected SARSA and Expected Policy Gradient, a model-free off-policy actor-critic algorithm is constructed. Empirical results in domains with verifiable risk-averse behavior show that our algorithm can learn a risk-averse policy and consistently outperforms other existing methods.
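For readers unfamiliar with the two risk measures named in this abstract, the following is a minimal sketch of how an expectile and a lower-tail CVaR can be estimated from a batch of sampled returns. It is not the paper's actor-critic algorithm: the function names `expectile` and `lower_cvar`, the bisection solver, and the sample data are illustrative assumptions. The expectile is computed from the first-order condition of its asymmetric squared loss, which is the property that makes it elicitable.

```python
import math


def expectile(samples, tau, tol=1e-10):
    """Empirical tau-expectile: the q minimizing the asymmetric squared loss
    mean(|tau - 1{x <= q}| * (x - q)^2). Solved by bisection on the
    first-order condition tau*E[(X-q)+] = (1-tau)*E[(q-X)+]."""
    lo, hi = min(samples), max(samples)

    def first_order(q):
        upside = sum(max(x - q, 0.0) for x in samples)
        downside = sum(max(q - x, 0.0) for x in samples)
        # Decreasing in q, so the root can be bracketed by [lo, hi].
        return tau * upside - (1.0 - tau) * downside

    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if first_order(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def lower_cvar(samples, alpha):
    """Empirical lower-tail CVaR: mean of the worst alpha-fraction of returns."""
    k = max(1, math.ceil(alpha * len(samples)))
    worst = sorted(samples)[:k]
    return sum(worst) / k


# Hypothetical returns with one rare, large loss.
returns = [-10.0, 0.0, 1.0, 2.0, 3.0]
risk_neutral = expectile(returns, 0.5)   # equals the sample mean
risk_averse = expectile(returns, 0.1)    # pulled toward the loss
tail_mean = lower_cvar(returns, 0.4)     # mean of the two worst returns
```

At `tau = 0.5` the expectile coincides with the mean, and lowering `tau` weights losses more heavily, which is the sense in which a dynamic expectile objective is risk-averse.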
A Call to Lagrangian Action: Learning Population Mechanics from Temporal Snapshots
Kirill Neklyudov
The population dynamics of molecules, cells, and organisms are governed by a number of unknown forces. In the last decade, population dynamics have predominantly been modeled with Wasserstein gradient flows. However, since gradient flows minimize free energy, they fail to capture important dynamical properties, such as periodicity. In this work, we propose a change in perspective by considering dynamics that minimize a population-level action under a damped Wasserstein Lagrangian. By deriving the corresponding Hamiltonian equations of motion, we formalize Wasserstein Lagrangian Mechanics, a structured class of second-order dynamics that encompasses classical mechanics, quantum mechanics, and gradient flows. We then propose WLM as the first algorithm that learns these second-order dynamics from observed marginals, without specifying the Lagrangian. By directly learning the population mechanics, WLM can both forecast and interpolate unseen marginals, and outperforms existing gradient flow and flow matching methods across a wide range of dynamics, including vortex dynamics, embryonic development, and flocking.
Estimation of head motion in structural MRI and its impact on cortical morphometry
Motion-related artifacts are inevitable in Magnetic Resonance Imaging (MRI) and can bias automated neuroanatomical metrics such as cortical thickness. These biases can interfere with statistical analysis, which is a major concern as motion has been shown to be more prominent in certain populations such as children or individuals with ADHD. Manual review cannot objectively quantify motion in anatomical scans, and existing quantitative automated approaches often require specialized hardware or custom acquisition protocols. Here, we train a 3D convolutional neural network to estimate a summary motion metric in retrospective routine research scans by leveraging a large training dataset of synthetically motion-corrupted volumes. We validate our method with one held-out site from our training cohort and with 14 fully independent datasets, including one with manual ratings, achieving a Spearman rank correlation of 0.71 vs. manual labels. We also tested the correlation of our predicted motion score with morphometric measurements known to be impacted by motion, achieving significant correlation on most datasets. Furthermore, our predicted motion correlates with subject age in line with prior studies. Our approach shows good generalization across scanner brands and protocols, enabling objective, scalable motion assessment in structural MRI studies without prospective motion correction. Finally, we provide empirical evidence that our motion estimator significantly improves model fit when studying cortical thickness and volume. Our final model is made openly and freely available through “Agitation,” a tool usable as a CLI and a Python package, and integrated in Nipoppy and Boutiques. By providing reliable motion estimates, our method offers researchers a tool to assess and account for potential biases in cortical morphometric analyses.
Neurobagel: building an international network for distributed data discovery
Michelle Wang
Jean-Baptiste Poline
Yaroslav O. Halchenko
Jan G. Bjaalie
Katie M. Lavigne
Jeffrey Grethe
Max A. Laansma
Barbara Strasser-Kirchweger
Emile d’Angremont
David N. Kennedy
Neda Jahanshad
Sean N. Hatton
Nikhil Bhagwat
Tristan Glatard
Brent McPherson
Satrajit Ghosh
Gabriel Devenyi
Stéphane Lehéricy
Vincent Taschereau‐Dumouchel
Florian Hutzler
Sebastian Urchs
Michael Hanke
Christopher J. Markiewicz
Russell A. Poldrack
Francis Jeanson
Eva van Heese
David Keator
Camille Maumet
M. Mallar Chakravarty
Franco Pestilli
Julia-Katharina Pfarr
Erin W Dickie
Alyssa Dai
Arman Jahanpour
Mathieu Dugré
Lyuba Zehl
Ysbrand van der Werf
Paul Thompson
International data privacy regulations impede the pooling of research data for collaborative analysis. We introduce Neurobagel, a federated network enabling cohort discovery across locally governed, access-controlled datasets. Through intuitive graphical tools and a decentralized query infrastructure, Neurobagel facilitates harmonization, control, and discovery of data according to local regulations. Today, Neurobagel is deployed by consortia and data platforms in Europe, North America, Asia, and Australia, supporting diverse and evolving regulatory frameworks.
RNAGenScape: Property-Guided, Optimized Generation of mRNA Sequences with Manifold Langevin Dynamics
Danqi Liao
Chen Liu
Xingzhi Sun
Dié Tang
Haochen Wang
Scott Youlten
Srikar Krishna Gopinath
Haejeong Lee
Ethan C. Strayer
Antonio J. Giraldez
Generating property-optimized mRNA sequences is central to applications such as vaccine design and protein replacement therapy, but remains challenging due to limited data, complex sequence-function relationships, and the narrow space of biologically viable sequences. Generative methods that drift away from the data manifold can yield sequences that fail to fold, translate poorly, or are otherwise nonfunctional. We present RNAGenScape, a property-guided manifold Langevin dynamics framework for mRNA sequence generation that operates directly on a learned manifold of real data. By performing iterative local optimization constrained to this manifold, RNAGenScape preserves biological viability, accesses reliable guidance, and avoids excursions into nonfunctional regions of the ambient sequence space. The framework integrates three components: (1) an autoencoder jointly trained with a property predictor to learn a property-organized latent manifold, (2) a denoising autoencoder that projects updates back onto the manifold, and (3) a property-guided Langevin dynamics procedure that performs optimization along the manifold. Across three real-world mRNA datasets spanning two orders of magnitude in size, RNAGenScape increases median property gain by up to 148% and success rate by up to 30% while ensuring biological viability of generated sequences, and achieves competitive inference efficiency relative to existing generative approaches.
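The loop described in components (2) and (3) can be illustrated with a toy sketch: take a gradient step on a property, add Langevin noise, then project back onto the manifold. Everything concrete here is an assumption made for illustration and is not taken from RNAGenScape: the manifold is the unit circle, the projection is plain normalization (standing in for the denoising autoencoder), and the property is simply the first coordinate.

```python
import math
import random


def project_to_circle(z):
    # Toy stand-in for the denoising projection: snap back to the unit circle.
    norm = math.hypot(z[0], z[1])
    return (z[0] / norm, z[1] / norm)


def property_gradient(z):
    # Hypothetical property f(z) = z[0]; its gradient is constant.
    return (1.0, 0.0)


def guided_langevin(z0, steps=200, step_size=0.05, temperature=1e-3, seed=0):
    """Property-guided Langevin updates, each followed by a manifold projection."""
    rng = random.Random(seed)
    noise_scale = math.sqrt(2.0 * step_size * temperature)
    z = project_to_circle(z0)
    for _ in range(steps):
        g = property_gradient(z)
        # Gradient ascent on the property plus Gaussian exploration noise.
        proposal = (
            z[0] + step_size * g[0] + noise_scale * rng.gauss(0.0, 1.0),
            z[1] + step_size * g[1] + noise_scale * rng.gauss(0.0, 1.0),
        )
        # Projection keeps the iterate on the manifold after every update.
        z = project_to_circle(proposal)
    return z


start = (0.0, 1.0)
final = guided_langevin(start)
```

In this toy setting the iterate never leaves the circle, and the property value rises from 0 at `start` toward its on-manifold maximum near `(1, 0)`.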
Rotation-Preserving Supervised Fine-Tuning
Supervised fine-tuning (SFT) improves in-domain performance but can degrade out-of-domain (OOD) generalization. Prior work suggests that this degradation is related to changes in dominant singular subspaces of pretrained weight matrices. However, directly identifying loss-sensitive directions with Hessian or Fisher information is computationally expensive at LLM scale. In this work, we propose preserving projected rotations in pretrained singular subspaces as an efficient proxy for Fisher-sensitive directions, which we call Rotation-Preserving Supervised Fine-Tuning (RPSFT). RPSFT penalizes changes in the projected top-
Simply the best – A systematic evaluation approach for third-party libraries based on mobile app quality attributes
Rubén Saborido
Rémy Raes
Rodrigo Morales
Romain Rouvoy
Yann-Gaël Guéhéneuc
Mobile device applications (apps) are complex because they rely on integrating multiple third-party libraries (TPLs). Yet, TPLs ease app development by offering implementations of specific functionality. For example, app developers often use advertising libraries to generate revenue, integrate social networking libraries to simplify login, or include crash reporting libraries to monitor/report crashes in their apps. However, there are multiple TPLs with similar functionalities from which to choose, and developers often cannot foresee all the consequences of using these libraries in their apps. The sizes of apps grow with the addition and usage of TPLs, and so does the number of required permissions and resource consumption. Thus, TPLs may degrade the quality of apps and developers need help measuring and comparing them. We propose EQuAT, an approach for Evaluating Quality Attributes of TPLs that eases the comparison of TPLs. EQuAT takes as input minimal apps that integrate TPLs and playable scenarios to simulate user interaction while exercising a particular functionality of the included TPL. By collecting quality metrics and comparing them using plots, we provide app developers with a systematic approach to rank TPLs based on their preferences. We show how EQuAT helps developers make informed decisions about which libraries to integrate into their apps by validating them against nine TPLs across three categories.
TCMIIES: A Browser-Based LLM-Powered Intelligent Information Extraction System for Academic Literature
The exponential growth of academic publications has created an urgent need for automated tools capable of extracting structured knowledge from unstructured scientific texts. While large language models (LLMs) have demonstrated remarkable capabilities in natural language understanding and information extraction, existing solutions often require specialized infrastructure, programming expertise, or fine-tuned domain-specific models that create barriers for researchers in specialized fields. This paper presents TCMIIES, a browser-based, zero-installation platform that leverages commercial LLM APIs to perform structured information extraction from academic literature. The system employs a novel schema-guided prompting framework with automatic system prompt generation, enabling researchers to define custom extraction schemas through an intuitive graphical interface without any programming. TCMIIES features a pure front-end architecture that ensures data privacy by processing all information locally in the browser, supports five major LLM providers, implements concurrent batch processing with automatic retry mechanisms, and provides intelligent field mapping for Chinese academic databases including CNKI and Wanfang. We demonstrate the system's effectiveness through comprehensive evaluation across multiple extraction scenarios in Traditional Chinese Medicine research, achieving structured output compliance rates exceeding 94% and information extraction accuracy comparable to domain-expert annotation. The system represents a practical, accessible solution that bridges the gap between advanced LLM capabilities and domain-specific academic information extraction needs, particularly for researchers in specialized fields who require flexible, privacy-preserving, and cost-effective extraction tools.
Agentick: A Unified Benchmark for General Sequential Decision-Making Agents
Roger Creus Castanyer
AI agent research spans a wide spectrum: from RL agents that learn from scratch to foundation model agents that leverage pre-trained knowledge, yet no unified benchmark enables fair comparison across these approaches. We present Agentick, a benchmark for sequential decision-making agents designed to evaluate RL, LLM, VLM, hybrid, and human agents on common ground and to power research on the fundamental challenges of sequential decision-making. Agentick provides 37 procedurally generated tasks across six capability categories, four difficulty levels, and five observation modalities, all exposed through a single Gymnasium-compatible interface. The benchmark ships with a Coding API, oracle reference policies for all tasks, pre-built SFT datasets, a composable agent harness, and a live leaderboard. An evaluation spanning 27 configurations and over 90,000 episodes reveals that no single approach dominates: GPT-5 mini leads overall at 0.309 oracle-normalized score while PPO dominates planning and multi-agent tasks; the reasoning harness multiplies LLM performance by 3-10x; and ASCII observations consistently outperform natural language. These findings highlight the substantial room for improvement that remains across all agent paradigms. Agentick's capability-decomposed, multi-modal design provides the empirical infrastructure needed to drive progress toward general autonomous agents, both as an evaluation framework and as a training ground for RL post-training of foundation models in truly sequential environments.
Beyond the total NIHSS score: association between impaired level of consciousness and early neurological deterioration in mild large vessel occlusion stroke
Qiangze Ji
Liangliang Sun
Zenghui Liu
Jing Yu
Kaiyue Duan
Lili Guo
Qiuyi Zhang
Diversity Curves for Graph Representation Learning
Nadja Häusermann
Martin Carrasco
Bastian Rieck
Graph-level representations are crucial tools for characterising structural differences between graphs. However, comparing graphs with different cardinalities, even when sampled from the same underlying distribution, remains challenging. Unsupervised tasks in particular require interpretable, scalable, and reliable size-aware graph representations. Our work addresses these issues by tracking the structural diversity of a graph across coarsening levels. The resulting graph embeddings, which we denote diversity curves, are interpretable by construction, efficient, and directly comparable across coarsening hierarchies. Specifically, we track the spread of graphs, a novel isometry invariant that is inherently well-suited for encoding the metric diversity and geometry of graphs. We utilise edge contraction coarsening and prove that this improves expressivity, thus leading to more powerful graph-level representations than structural descriptors alone. Demonstrating their utility over a range of baseline methods in practice, we use diversity curves to (i) cluster and visualise simulated graphs across varying sizes, (ii) distinguish the geometry of single-cell graphs, (iii) compare the structure of molecular graph datasets, and (iv) characterise geometric shapes.