Publications

DASB - Discrete Audio and Speech Benchmark
Jarod Duret
Darius Petermann
Anastasia Kuznetsova
Discrete audio tokens have recently gained considerable attention for their potential to bridge audio and language processing, enabling multimodal language models that can both generate and understand audio. However, preserving key information such as phonetic content, speaker identity, and paralinguistic cues remains a major challenge. Identifying the optimal tokenizer and configuration is further complicated by inconsistent evaluation settings across existing studies. To address this, we introduce the Discrete Audio and Speech Benchmark (DASB), a comprehensive framework for benchmarking discrete audio tokens across speech, general audio, and music domains on a range of discriminative and generative tasks. Our results show that discrete representations are less robust than continuous ones and require careful tuning of factors such as model architecture, data size, learning rate, and capacity. Semantic tokens generally outperform acoustic tokens, but a gap remains between discrete tokens and continuous features, highlighting the need for further research. DASB codes, evaluation setup, and leaderboards are publicly available at https://poonehmousavi.github.io/DASB-website/.
Early detection of common reed (Phragmites australis) using unoccupied aerial vehicles and deep learning
EXPRESS: Climate Communications in IPOs: Unpacking the Influence of Climate Disclosure Volume, Sender, and Message Characteristics
Alok R. Saboo
Ritesh Adhyapak
Climate disclosures have emerged as a prominent communication tool for firms facing growing pressure to address climate challenges, yet their impact on firm performance remains unclear. This study proposes a nonlinear (U-shaped) relationship between climate disclosure volume and IPO firm performance, grounded in a damage-limitation logic. At low to moderate levels, disclosures amplify risk salience and proprietary costs, damaging valuations. At higher levels, offsetting benefits related to information, stewardship, and climate-friendly reputation outweigh these costs. Using multi-sourced data from 1,586 IPO firms, a BERT-based large language model to identify climate-related text in prospectuses, and econometric methods that address endogeneity, the authors find support for the proposed U-shaped relationship. The research further demonstrates that sender characteristics (underwriter reputation, customer concentration, and market orientation) and message characteristics (discretionary disclosure and message clarity) moderate the nonlinear relationship. Post-hoc analyses decomposing disclosure content reveal that climate risk disclosures damage valuations. In contrast, climate risk-management disclosures (governance, strategy, and metrics/targets) generate positive effects, suggesting that disclosure effectiveness depends on both volume and content composition. These effects persist in the long-term performance of firms. The findings provide actionable insights for firms developing disclosure strategies and policymakers encouraging climate-related communication.
A Mechanistic Analysis of Looped Reasoning Language Models
Hugh Blayney
Álvaro Arroyo
Johan Obando-Ceron
Michael M. Bronstein
Xiaowen Dong
Reasoning has become a central capability in large language models. Recent research has shown that reasoning performance can be improved by looping an LLM's layers in the latent dimension, resulting in looped reasoning language models. Despite promising results, few works have investigated how their internal dynamics differ from those of standard feedforward models. In this paper, we conduct a mechanistic analysis of the latent states in looped language models, focusing in particular on how the stages of inference observed in feedforward models compare to those observed in looped ones. To this end, we analyze cyclic recurrence and show that for many of the studied models each layer in the cycle converges to a distinct fixed point; consequently, the recurrent block follows a consistent cyclic trajectory in the latent space. We provide evidence that as these fixed points are reached, attention-head behavior stabilizes, leading to constant behavior across recurrences. Empirically, we discover that recurrent blocks learn stages of inference that closely mirror those of feedforward models, repeating these stages in depth with each iteration. We study how recurrent block size, input injection, and normalization influence the emergence and stability of these cyclic fixed points. We believe these findings help translate mechanistic insights into practical guidance for architectural design.
Towards Autonomous Mechanistic Reasoning in Virtual Cells
Yunhui Jang
Lu Zhu
Jake Fawkes
Alisandra Kaye Denton
Emmanuel Noutahi
Large language models (LLMs) have recently gained significant attention as a promising approach to accelerate scientific discovery. However, their application in open-ended scientific domains such as biology remains limited, primarily due to the lack of factually grounded and actionable explanations. To address this, we introduce a structured explanation formalism for virtual cells that represents biological reasoning as mechanistic action graphs, enabling systematic verification and falsification. Building upon this, we propose VCR-Agent, a multi-agent framework that integrates biologically grounded knowledge retrieval with a verifier-based filtering approach to generate and validate mechanistic reasoning autonomously. Using this framework, we release the VC-TRACES dataset, which consists of verified mechanistic explanations derived from the Tahoe-100M atlas. Empirically, we demonstrate that training with these explanations improves factual precision and provides a more effective supervision signal for downstream gene expression prediction. These results underscore the importance of reliable mechanistic reasoning for virtual cells, achieved through the synergy of multi-agent and rigorous verification.
Towards Brain MRI Foundation Models for the Clinic: Findings from the FOMO25 Challenge
Asbjørn Munk
Stefano Cerri
Vardan Nersesjan
Christian Hedeager Krag
Jakob Ambsdorf
Pedro García
Julia Machnio
Peirong Liu
Suhyun Ahn
Nasrin Akbari
Yasmina Al Khalil
Kimberly Amador
Sina Amirrajab
Meritxell Bach Cuadra
Ujjwal Baid
Bhakti Baheti
Jaume Banús
Kamil Barbierik
Christoph Brune
步岩松
Baptiste Callard
Yuhan Chen
Corentin Dancette
Peter Drotár
Prasad Dutande
Nils D. Forkert
Saurabh K. Garg
Jakub Gazda
Matej Gazda
Benoît Gérin
Partha Ghosh
Weikang Gong
Pedro M. Gordaliza
Sam Hashemi
Tobias Heimann
Fucang Jia
Jiexin Jiang
Chris Kang
Seung Kwan Kang
Mohammad Khazaei
Julien Khlaut
Petros Koutsouvelis
Jae Sung Lee
Yuchong Li
Mengye Lyu
Mingchen Ma
Anant Madabhushi
Klaus H. Maier-Hein
Pierre Manceron
Andrés Martínez Mora
Moona Mazher
Felix Meister
Nataliia Molchanova
Steven A. Niederer
Leonard Nürnberg
Jinah Park
Abdul Qayyum
Jonas Richiardi
Antoine Saporta
Branislav Setlak
Ning Shen
Constantin Ulrich
Puru Vaish
Vibujithan Vigneshwaran
Leroy Volmer
Zihao Wang
Siqi Wei
Anthony Winder
Jelmer M. Wolterink
Maxence Wynen
Chang YANG
Si Young Yie
Mostafa Mehdipour Ghazi
Akshay Pai
Espen Jimenez‐Solem
Sebastian Nørgaard Llambias
Mikael Boesen
Michael Eriksen Benros
Juan Eugenio Iglesias
Mads Nielsen
Clinical deployment of automated brain MRI analysis faces a fundamental challenge: clinical data is heterogeneous and noisy, and high-quality labels are prohibitively costly to obtain. Self-supervised learning (SSL) can address this by leveraging the vast amounts of unlabeled data produced in clinical workflows to train robust foundation models that adapt out-of-domain with minimal supervision. However, the development of foundation models for brain MRI has been limited by small pretraining datasets and in-domain benchmarking focused on high-quality, research-grade data. To address this gap, we organized the FOMO25 challenge as a satellite event at MICCAI 2025. FOMO25 provided participants with a large pretraining dataset, FOMO60K, and evaluated models on data sourced directly from clinical workflows in few-shot and out-of-domain settings. Tasks covered infarct classification, meningioma segmentation, and brain age regression, and considered both models trained on FOMO60K (method track) and models trained on any data (open track). Nineteen foundation models from sixteen teams were evaluated using a standardized containerized pipeline. Results show that (a) self-supervised pretraining improves generalization on clinical data under domain shift, with the strongest models trained out-of-domain surpassing supervised baselines trained in-domain; (b) no single pretraining objective benefits all tasks: MAE favors segmentation, while hybrid reconstruction-contrastive objectives favor classification; and (c) strong performance was achieved by small pretrained models, and scaling model size and training duration did not yield reliable benefits.
White and Gray Matter Multiple Sclerosis Spinal Cord Lesion Characteristics and Individualized Tissue Damage Assessment Using 7 T T1 Mapping
Nilser Laines-Medina
Samira Mchinda
Benoit Testud
Arnaud Le Troter
Lauriane Pini
Bertrand Audoin
Jean Pelletier
Sarah Demortière
Virginie Callot
The aim of this exploratory study was to demonstrate how 7 T MP2RAGE T1 mapping can be used to evaluate spinal cord (SC) tissue damage and lesion characteristics in multiple sclerosis (MS) at both subregional and individual levels. Fifteen patients with relapsing-remitting MS (pwRRMS; mean disease duration = 32 ± 24.9 mo) and 15 age-matched healthy controls (HC) underwent 7 T cervical 3D MP2RAGE imaging with submillimetric spatial resolution. Automatic SC and lesion segmentations were obtained and manually corrected when necessary. Images were registered to the AMU7T template space to extract T1 values from specific regions of interest (ROIs), including white matter (WM) tracts: corticospinal (CST), lateral sensory (LST), posterior sensory (PST), ventral motor (VMT), and gray matter (GM) subregions: ventral, intermediate, and dorsal. Individual Z-score maps were computed and used to derive a global index of tissue impairment (patient-specific Z-score barplot) for lesion and normal appearing tissues (NAT). Finally, MS lesions were further characterized by their relative lesion load (RLL%), frequency maps, and topography across ROIs. Lesions were predominantly located in the posterior half of the cord, with GM showing the highest RLL. However, no lesions were observed exclusively in GM. An increasing gradient in T1 values was observed, with T1_HC 0.01). Mixed GM-WM lesions exhibited higher T1 values and larger volumes than WM-only lesions. Elevated T1 values
EIAN: Explicit Interaction-aware Attention Network for Interpretable Event Modeling
Jiping Zhang
Hua Zhu
Hong Huang
Yi Zhou
Kehan Yin
Event sequences are integral to domains such as e-commerce, social networks, and healthcare. Traditional point process models, like Poisson and Hawkes processes, are foundational but limited by rigid parametric assumptions, constraining their flexibility in complex real-world scenarios. Neural point processes offer a more adaptable alternative, but typically perform implicit sequence modeling, which does not fully exploit critical event interaction patterns and limits transparency. To address these challenges, we introduce the Explicit Interaction-aware Attention Network (EIAN), a novel model that enhances event modeling by explicitly capturing both intra-type and cross-type event interactions. Specifically, EIAN employs two key components: an intra-type temporal encoder that preserves the unique temporal dynamics within each event type, and a cross-type interaction decoder that highlights interactions across event types. Furthermore, two temporal encoding mechanisms are integrated into the interaction decoder to handle irregular inter-event intervals in diverse temporal scenarios. Extensive experiments show that EIAN consistently outperforms existing models in predictive performance and provides deeper insights into event interaction patterns, advancing both flexibility and interpretability. Our code is available at https://github.com/CGCL-codes/EIAN.git.
Forecasting Developer Environments with GenAI: A Research Perspective
Raula Gaikovina Kula
Christoph Treude
Xing Hu
Sebastian Baltes
Earl T. Barr
Kelly Blincoe
Fabio Calefato
J Chen
Marc Cheong
Youmei Fan
Daniel M. Germán
Marco Gerosa
Jin L.C. Guo
Shinpei Hayashi
Robert Hirschfeld
Reid Holmes
Yintong Huo
Takashi Kobayashi
Michele Lanza
Zhongxin Liu
Olivier Nourry
Nicole Novielli
Denys Poshyvanyk
Shinobu Saito
Kazumasa Shimari
Igor Steinmacher
Mairieli Wessel
Markus Wagner
Annie Vella
Laurie Williams
Xin Xia
Generative Artificial Intelligence (GenAI) models are achieving remarkable performance in various tasks, including code generation, testing, code review, and program repair. The ability to increase the level of abstraction away from writing code has the potential to change the Human-AI interaction within the integrated development environment (IDE). To explore the impact of GenAI on IDEs, 33 experts from the Software Engineering, Artificial Intelligence, and Human-Computer Interaction domains gathered to discuss challenges and opportunities at Shonan Meeting 222, a four-day intensive research meeting. Four themes emerged as areas of interest for researchers and practitioners.
TAPNext++: What's Next for Tracking Any Point (TAP)?
Sebastian Jung
Martin Sundermeyer
Carl Doersch
David Joseph Tan
Rudolph Triebel
Federico Tombari
Tracking-Any-Point (TAP) models aim to track any point through a video, which is a crucial task in AR/XR and robotics applications. The recently introduced TAPNext approach proposes an end-to-end, recurrent transformer architecture to track points frame-by-frame in a purely online fashion -- demonstrating competitive performance at minimal latency. However, we show that TAPNext struggles with longer video sequences and also frequently fails to re-detect query points that reappear after being occluded or leaving the frame. In this work, we present TAPNext++, a model that tracks points in sequences that are orders of magnitude longer while preserving the low memory and compute footprint of the architecture. We train the recurrent video transformer using several data-driven solutions, including training on long 1024-frame sequences enabled by sequence parallelism techniques. We highlight that re-detection performance is a blind spot in the current literature and introduce a new metric, Re-Detection Average Jaccard (
Taxonomy and Consistency Analysis of Safety Benchmarks for AI Agents
Miles Q. Li
Benjamin C. M. Fung
Boyang Li
Heba Ismail
Farkhund Iqbal
The rapid deployment of LLM-based autonomous agents has introduced safety risks that extend far beyond traditional LLM concerns, prompting a proliferation of safety benchmarks since late 2023. However, these benchmarks have developed independently, with inconsistent threat models, incompatible metrics, and overlapping yet incomplete risk coverage. We present the first systematic analysis dedicated to agent safety benchmarks as evaluation instruments. We catalog 40 behavioral agent-safety benchmarks (2023-2026), plus 5 adjacent evaluator, defense, and dataset artifacts, propose a six-axis taxonomy of benchmark evaluation methodology, and apply it across the corpus to characterize how methodological choices shape safety conclusions. A coverage matrix reveals broad risk coverage but limited methodological convergence, while the taxonomy analysis shows a behavioral-benchmark core concentrated in sandboxed, constrained, and often safety-only evaluation. Across the landscape, we find that benchmark choice can yield contradictory safety conclusions, coverage counts often overstate evaluation depth, environment fidelity systematically shapes reported safety, the field disproportionately tests externally imposed rather than agent-internal risks, metric fragmentation limits comparison, and robustness remains effectively unbenchmarked. We ground these claims with a cross-benchmark consistency check, with 95% confidence intervals and Kendall's W concordance analysis, finding no evidence of ranking concordance across evaluation dimensions (W = 0.10, p = 0.94). We release structured metadata, full taxonomy codings, risk annotations, and all experimental artifacts, and propose minimum reporting standards for future benchmarks.
Active search generation for nanophotonic design in the small data regime
Yuri Grinberg
Dan Kushnir
Yanlei Zhang
Dan-Xia Xu