Publications

Safe Domain Randomization via Uncertainty-Aware Out-of-Distribution Detection and Policy Adaptation
Deploying reinforcement learning (RL) policies in the real world involves significant challenges, including distribution shifts, safety concerns, and the impracticality of direct interactions during policy refinement. Existing methods, such as domain randomization (DR) and off-dynamics RL, enhance policy robustness through direct interaction with the target domain, an inherently unsafe practice. We propose Uncertainty-Aware RL (UARL), a novel framework that prioritizes safety during training by addressing Out-Of-Distribution (OOD) detection and policy adaptation without requiring direct interaction with the target domain. UARL employs an ensemble of critics to quantify policy uncertainty and incorporates progressive environmental randomization to prepare the policy for diverse real-world conditions. By iteratively refining the policy over high-uncertainty regions of the state space in simulated environments, UARL enhances robust generalization to the target domain without explicitly training on it. We evaluate UARL on MuJoCo benchmarks and a quadrupedal robot, demonstrating its effectiveness in reliable OOD detection, improved performance, and enhanced sample efficiency compared to baselines.
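As a rough illustration of the uncertainty signal described above, the sketch below scores a state-action pair by the disagreement of an ensemble of critics and flags it as out-of-distribution when that disagreement exceeds a calibration threshold. The critic interface, threshold value, and function names are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def ood_score(critics, state, action):
    """Disagreement (standard deviation) of an ensemble of Q-value critics."""
    q_values = np.array([critic(state, action) for critic in critics])
    return q_values.std()

def is_out_of_distribution(critics, state, action, threshold=0.5):
    """Flag a state-action pair whose ensemble disagreement exceeds a
    calibration threshold; such high-uncertainty regions would receive
    further randomized training in simulation."""
    return ood_score(critics, state, action) > threshold
```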
Towards fair decentralized benchmarking of healthcare AI algorithms with the Federated Tumor Segmentation (FeTS) challenge
Maximilian Zenk
Ujjwal Baid
Sarthak Pati
Akis Linardos
Brandon Edwards
Micah Sheller
Patrick Foley
Alejandro Aristizabal
David Zimmerer
Alexey Gruzdev
Jason Martin
Russell T. Shinohara
Annika Reinke
Fabian Isensee
Santhosh Parampottupadam
Kaushal Parekh
Ralf Floca
Hasan Kassem
Bhakti Baheti
Siddhesh Thakur
Verena Chung
Kaisar Kushibar
Karim Lekadir
Meirui Jiang
Youtan Yin
Hongzheng Yang
Quande Liu
Cheng Chen
Qi Dou
Pheng-Ann Heng
Xiaofan Zhang
Shaoting Zhang
Muhammad Irfan Khan
Mohammad Ayyaz Azeem
Mojtaba Jafaritadi
Esa Alhoniemi
Elina Kontio
Suleiman A. Khan
Leon Mächler
Ivan Ezhov
Florian Kofler
Suprosanna Shit
Johannes C. Paetzold
Timo Loehr
Benedikt Wiestler
Himashi Peiris
Kamlesh Pawar
Shenjun Zhong
Zhaolin Chen
Munawar Hayat
Gary Egan
Mehrtash Harandi
Ece Isik Polat
Gorkem Polat
Altan Kocyigit
Alptekin Temizel
Anup Tuladhar
Lakshay Tyagi
Raissa Souza
Nils D. Forkert
Pauline Mouches
Matthias Wilms
Vishruth Shambhat
Akansh Maurya
Shubham Subhas Danannavar
Rohit Kalla
Vikas Kumar Anand
Ganapathy Krishnamurthi
Sahil Nalawade
Chandan Ganesh
Ben Wagner
Divya Reddy
Yudhajit Das
Fang F. Yu
Baowei Fei
Ananth J. Madhuranthakam
Joseph Maldjian
Gaurav Singh
Jianxun Ren
Wei Zhang
Ning An
Qingyu Hu
Youjia Zhang
Ying Zhou
Vasilis Siomos
Giacomo Tarroni
Jonathan Passerrat-Palmbach
Ambrish Rawat
Giulio Zizzo
Swanand Ravindra Kadhe
Jonathan P. Epperlein
Stefano Braghin
Yuan Wang
Renuga Kanagavelu
Qingsong Wei
Yechao Yang
Yong Liu
Krzysztof Kotowski
Szymon Adamski
Bartosz Machura
Wojciech Malara
Lukasz Zarudzki
Jakub Nalepa
Yaying Shi
Hongjian Gao
Salman Avestimehr
Yonghong Yan
Agus S. Akbar
Ekaterina Kondrateva
Hua Yang
Zhaopei Li
Hung-Yu Wu
Johannes Roth
Camillo Saueressig
Alexandre Milesi
Quoc D. Nguyen
Nathan J. Gruenhagen
Tsung-Ming Huang
Jun Ma
Har Shwinder H. Singh
Nai-Yu Pan
Dingwen Zhang
Ramy A. Zeineldin
Michal Futrega
Yading Yuan
Gian Marco Conte
Xue Feng
Quan D. Pham
Yong Xia
Zhifan Jiang
Huan Minh Luu
Mariia Dobko
Alexandre Carré
Bair Tuchinov
Hassan Mohy-ud-Din
Saruar Alam
Anup Singh
Nameeta Shah
Weichung Wang
Chiharu Sako
Michel Bilello
Satyam Ghodasara
Suyash Mohan
Christos Davatzikos
Evan Calabrese
Jeffrey Rudie
Javier Villanueva-Meyer
Soonmee Cha
Christopher Hess
John Mongan
Madhura Ingalhalikar
Manali Jadhav
Umang Pandey
Jitender Saini
Raymond Y. Huang
Ken Chang
Minh-Son To
Sargam Bhardwaj
Chee Chong
Marc Agzarian
Michal Kozubek
Filip Lux
Jan Michálek
Petr Matula
Miloš Keřkovský
Tereza Kopřivová
Marek Dostál
Václav Vybíhal
Marco C. Pinho
James Holcomb
Marie Metz
Rajan Jain
Matthew D. Lee
Yvonne W. Lui
Pallavi Tiwari
Ruchika Verma
Rohan Bareja
Ipsa Yadav
Jonathan Chen
Yuriy Gusev
Krithika Bhuvaneshwar
Anousheh Sayah
Camelia Bencheqroun
Anas Belouali
Subha Madhavan
Rivka R. Colen
Aikaterini Kotrotsou
Philipp Vollmuth
Gianluca Brugnara
Chandrakanth J. Preetha
Felix Sahm
Martin Bendszus
Wolfgang Wick
Abhishek Mahajan
Carmen Balaña
Jaume Capellades
Josep Puig
Yoon Seong Choi
Seung-Koo Lee
Jong Hee Chang
Sung Soo Ahn
Hassan F. Shaykh
Alejandro Herrera-Trujillo
Maria Trujillo
William Escobar
Ana Abello
Jose Bernal
Jhon Gómez
Pamela LaMontagne
Daniel S. Marcus
Mikhail Milchenko
Arash Nazeri
Bennett A. Landman
Karthik Ramadass
Kaiwen Xu
Silky Chotai
Lola B. Chambless
Akshitkumar Mistry
Reid C. Thompson
Ashok Srinivasan
Jayapalli R. Bapuraj
Arvind Rao
Nicholas Wang
Ota Yoshiaki
Toshio Moritani
Sevcan Turk
Joonsang Lee
Snehal Prabhudesai
John Garrett
Matthew Larson
Robert Jeraj
Hongwei Li
Hao Li
Tobias Weiss
Michael Weller
Andrea Bink
Bertrand Pouymayou
Sonam Sharma
Tzu-Chi Tseng
Saba Adabi
Alexandre Xavier Falcão
Samuel B. Martins
Bernardo C. A. Teixeira
Flávia Sprenger
David Menotti
Diego R. Lucio
Simone P. Niclou
Olivier Keunen
Ann-Christin Hau
Enrique Pelaez
Heydy Franco-Maldonado
Francis Loayza
Sebastian Quevedo
Richard McKinley
Johannes Slotboom
Piotr Radojewski
Raphael Meier
Roland Wiest
Johannes Trenkler
Josef Pichler
Georg Necker
Andreas Haunschmidt
Stephan Meckel
Pamela Guevara
Esteban Torche
Cristobal Mendoza
Franco Vera
Elvis Ríos
Eduardo López
Sergio A. Velastin
Stephen Baek
Yusung Kim
Heba Ismael
Bryan Allen
John M. Buatti
Peter Zampakis
Vasileios Panagiotopoulos
Panagiotis Tsiganos
Sotiris Alexiou
Ilias Haliassos
Evangelia I. Zacharaki
Konstantinos Moustakas
Christina Kalogeropoulou
Dimitrios M. Kardamakis
Bing Luo
Laila M. Poisson
Ning Wen
Mahdi A. L. Loutfi
David Fortin
Martin Lepage
Fanny Morón
Jacob Mandel
Gaurav Shukla
Spencer Liem
Gregory S. Alexandre
Joseph Lombardo
Joshua D. Palmer
Adam E. Flanders
Adam P. Dicker
Godwin Ogbole
Dotun Oyekunle
Olubunmi Odafe-Oyibotha
Babatunde Osobu
Mustapha Shu’aibu Hikima
Mayowa Soneye
Farouk Dako
Adeleye Dorcas
Derrick Murcia
Eric Fu
Rourke Haas
John A. Thompson
David Ryan Ormond
Stuart Currie
Kavi Fatania
Russell Frood
Amber L. Simpson
Jacob J. Peoples
Ricky Hu
Danielle Cutler
Fabio Y. Moraes
Anh Tran
Mohammad Hamghalam
Michael A. Boss
James Gimpel
Deepak Kattil Veettil
Kendall Schmidt
Lisa Cimino
Cynthia Price
Brian Bialecki
Sailaja Marella
Charles Apgar
Andras Jakab
Marc-André Weber
Errol Colak
Jens Kleesiek
John Freymann
Justin Kirby
Lena Maier-Hein
Jake Albrecht
Peter Mattson
Alexandros Karargyris
Prashant Shah
Bjoern Menze
Klaus Maier-Hein
Spyridon Bakas
Computational competitions are the standard for benchmarking medical image analysis algorithms, but they typically use small curated test datasets acquired at a few centers, leaving a gap to the reality of diverse multicentric patient data. To this end, the Federated Tumor Segmentation (FeTS) Challenge represents the paradigm for real-world algorithmic performance evaluation. The FeTS challenge is a competition to benchmark (i) federated learning aggregation algorithms and (ii) state-of-the-art segmentation algorithms, across multiple international sites. Weight aggregation and client selection techniques were compared using a multicentric brain tumor dataset in realistic federated learning simulations, yielding benefits for adaptive weight aggregation and efficiency gains through client sampling. Quantitative performance evaluation of state-of-the-art segmentation algorithms on data distributed internationally across 32 institutions yielded good generalization on average, although the worst-case performance revealed data-specific modes of failure. Similar multi-site setups can help validate the real-world utility of healthcare AI algorithms in the future.
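For readers unfamiliar with the aggregation side of such challenges, the sketch below shows a FedAvg-style round: a subset of collaborating sites is sampled, and their model updates are averaged with weights proportional to local dataset size. The function names, sampling fraction, and weighting scheme are generic assumptions, not the specific algorithms benchmarked in FeTS.

```python
import random

def sample_clients(clients, fraction=0.5, seed=None):
    """Client selection: pick a subset of collaborating sites for this round."""
    rng = random.Random(seed)
    k = max(1, int(len(clients) * fraction))
    return rng.sample(clients, k)

def aggregate(updates, num_examples):
    """Weight aggregation: average parameter dicts (name -> array),
    weighting each site by the number of local training examples."""
    total = sum(num_examples)
    weights = [n / total for n in num_examples]
    return {
        name: sum(w * update[name] for w, update in zip(weights, updates))
        for name in updates[0]
    }
```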
Adaptive Computation Pruning for the Forgetting Transformer
The recently proposed Forgetting Transformer (FoX) incorporates a forget gate into softmax attention and has shown consistently better or on-par performance compared to the standard RoPE-based Transformer. Notably, many attention heads in FoX tend to forget quickly, causing their output at each timestep to rely primarily on local context. Based on this observation, we propose Adaptive Computation Pruning (ACP) for FoX, a method that dynamically prunes computations involving input-output dependencies that are strongly decayed by the forget gate. In particular, our method performs provably safe pruning via a dynamically set pruning threshold that guarantees the pruned attention weights are negligible. We apply ACP to language model pretraining with FoX and show it consistently reduces the number of FLOPs and memory accesses in softmax attention by around 70% across different model sizes and context lengths, resulting in a roughly 50% to 70% reduction in attention runtime (or a 2–3× speedup).
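A minimal sketch of the pruning idea, assuming access to the per-head forget-gate values: query-key pairs whose accumulated decay falls below a threshold contribute negligibly to the attention output and can be skipped. The single-head shapes, the threshold, and the dense mask (rather than block-wise pruning inside the attention kernel) are simplifications for illustration, not the paper's exact algorithm.

```python
import torch

def acp_keep_mask(forget_gates, log_threshold=-10.0):
    """forget_gates: (seq_len,) values in (0, 1) for one head.
    Returns a boolean (seq_len, seq_len) mask; True = keep the (query, key) pair."""
    log_f = torch.log(forget_gates)              # per-step log decay
    cum = torch.cumsum(log_f, dim=0)             # prefix sums of log decay
    decay = cum.unsqueeze(1) - cum.unsqueeze(0)  # decay[i, j] = sum_{k=j+1..i} log f_k
    causal = torch.tril(torch.ones_like(decay, dtype=torch.bool))
    return causal & (decay >= log_threshold)     # prune strongly decayed pairs
```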
AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories
Amirhossein Kazemnejad
Karolina Stanczak
Peter Shaw
Web agents enable users to perform tasks on web browsers through natural language interaction. Evaluating web agent trajectories is an important problem, since it helps us determine whether the agent successfully completed the tasks. Rule-based methods are widely used for this purpose, but they are challenging to extend to new tasks and may not always recognize successful trajectories. We may achieve higher accuracy through human evaluation, but the process would be substantially slower and more expensive. Automatic evaluations with LLMs may avoid the challenges of designing new rules and manually annotating trajectories, enabling faster and more cost-effective evaluation. However, it is unclear how effective they are at evaluating web agents. To this end, we propose AgentRewardBench, the first benchmark to assess the effectiveness of LLM judges for evaluating web agents. AgentRewardBench contains 1302 trajectories across 5 benchmarks and 4 LLMs. Each trajectory in AgentRewardBench is reviewed by an expert, who answers questions pertaining to the success, side effects, and repetitiveness of the agent. Using our benchmark, we evaluate 12 LLM judges and find that no single LLM excels across all benchmarks. We also find that the rule-based evaluation used by common benchmarks tends to underreport the success rate of web agents, highlighting a key weakness of rule-based evaluation and the need to develop more flexible automatic evaluations. We release the benchmark at: https://agent-reward-bench.github.io
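The sketch below illustrates the kind of LLM-judge evaluation the benchmark assesses: a trajectory is serialized into a prompt and the judge returns structured verdicts on success, side effects, and repetitiveness. The prompt wording and the call_llm helper are hypothetical placeholders rather than the benchmark's actual interface.

```python
import json

JUDGE_PROMPT = """You are evaluating a web agent.
Task: {goal}
Trajectory (observations and actions):
{steps}
Answer in JSON with boolean fields: success, side_effects, repetitive."""

def judge_trajectory(call_llm, goal, steps):
    """Ask an LLM judge whether the trajectory completed the task."""
    prompt = JUDGE_PROMPT.format(goal=goal, steps="\n".join(steps))
    return json.loads(call_llm(prompt))  # e.g. {"success": true, "side_effects": false, ...}
```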
BigCharts-R1: Enhanced Chart Reasoning with Visual Reinforcement Finetuning
Ahmed Masry
Abhay Puri
Masoud Hashemi
Juan A. Rodriguez
Khyati Mahajan
Vikas Yadav
Sathwik Tejaswi Madhusudhan
Alexandre Piché
David Vazquez
Enamul Hoque
Perouz Taslakian
Sai Rajeswar
BiXSE: Improving Dense Retrieval via Probabilistic Graded Relevance Distillation
Joao Monteiro
Perouz Taslakian
Neural sentence embedding models for dense retrieval typically rely on binary relevance labels, treating query-document pairs as either relevant or irrelevant. However, real-world relevance often exists on a continuum, and recent advances in large language models (LLMs) have made it feasible to scale the generation of fine-grained graded relevance labels. In this work, we propose BiXSE, a simple and effective pointwise training method that optimizes binary cross-entropy (BCE) over LLM-generated graded relevance scores. BiXSE interprets these scores as probabilistic targets, enabling granular supervision from a single labeled query-document pair per query. Unlike pairwise or listwise losses that require multiple annotated comparisons per query, BiXSE achieves strong performance with reduced annotation and compute costs by leveraging in-batch negatives. Extensive experiments across sentence embedding (MMTEB) and retrieval benchmarks (BEIR, TREC-DL) show that BiXSE consistently outperforms softmax-based contrastive learning (InfoNCE), and matches or exceeds strong pairwise ranking baselines when trained on LLM-supervised data. BiXSE offers a robust, scalable alternative for training dense retrieval models as graded relevance supervision becomes increasingly accessible.
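A minimal sketch of the pointwise objective described above, assuming normalized query and document embeddings: cosine similarities are trained with binary cross-entropy against LLM-generated graded relevance targets in [0, 1], with the other in-batch documents treated as zero-relevance negatives. The similarity scale and target handling are simplifying assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def bixse_loss(query_emb, doc_emb, graded_targets, scale=20.0):
    """query_emb, doc_emb: (B, d); graded_targets: (B,) relevance scores in [0, 1]."""
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    logits = scale * q @ d.T              # (B, B) similarity matrix
    targets = torch.zeros_like(logits)    # in-batch negatives get target 0
    idx = torch.arange(len(q))
    targets[idx, idx] = graded_targets    # graded positives on the diagonal
    return F.binary_cross_entropy_with_logits(logits, targets)
```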
Boosting LLM Reasoning via Spontaneous Self-Correction
Tengyu Xu
Xuewei Wang
Zhengxing Chen
Di Jin
Liang Tan
Yen-Ting Lin
Zishun Yu
Zhuokai Zhao
Si-Yuan Wang
Yun He
Sinong Wang
Han Fang
Chen Zhu
While large language models (LLMs) have demonstrated remarkable success on a broad range of tasks, math reasoning remains a challenging one. One of the approaches for improving math reasoning is self-correction, which designs self-improving loops to let the model correct its own mistakes. However, existing self-correction approaches treat corrections as standalone post-generation refinements, relying on extra prompt and system designs to elicit self-corrections, instead of performing real-time, spontaneous self-corrections in a single pass. To address this, we propose SPOC, a spontaneous self-correction approach that enables LLMs to generate interleaved solutions and verifications in a single inference pass, with generation dynamically terminated based on verification outcomes, thereby effectively scaling inference-time compute. SPOC considers a multi-agent perspective by assigning dual roles -- solution proposer and verifier -- to the same model. We adopt a simple yet effective approach to generate synthetic data for fine-tuning, enabling the model to develop capabilities for self-verification and multi-agent collaboration. We further improve its solution proposal and verification accuracy through online reinforcement learning. Experiments on mathematical reasoning benchmarks show that SPOC significantly improves performance. Notably, SPOC boosts the accuracy of Llama-3.1-8B and 70B Instruct models, achieving gains of 8.8% and 11.6% on MATH500, 10.0% and 20.0% on AMC23, and 3.3% and 6.7% on AIME24, respectively.
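The sketch below gives a rough picture of the interleaved solve-and-verify decoding described above: the model alternates solution and verification segments and stops as soon as its own verifier accepts. The control tokens, the generate helper, and the round budget are illustrative assumptions, not the exact format SPOC is fine-tuned on.

```python
def spoc_generate(generate, problem, max_rounds=4):
    """Alternate solution and verification segments in one pass, stopping
    once the model's own verification accepts the latest solution."""
    transcript = problem
    solution = ""
    for _ in range(max_rounds):
        solution = generate(transcript + "\n[SOLUTION]\n")
        verdict = generate(transcript + "\n[SOLUTION]\n" + solution + "\n[VERIFY]\n")
        transcript += "\n[SOLUTION]\n" + solution + "\n[VERIFY]\n" + verdict
        if "correct" in verdict.lower():   # verifier accepts, terminate generation
            break
    return solution
```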
Clinical test cases for model-based dose calculation algorithm commissioning, QA and benchmarking, for 192Ir HDR brachytherapy of gynecologic cancers
Vasiliki Peppa
Maude Robitaille
F. Akbari
R. M. Thomson
F. Mourtada
G. P. Fonseca
KA Gifford
JL Horton
TA Wareing
Purpose: To develop clinically relevant test cases for commissioning Model-Based Dose Calculation Algorithms (MBDCAs) for 192Ir High Dose Rate (HDR) gynecologic brachytherapy following the workflow proposed by the TG-186 report and the WGDCAB report 372. Acquisition and Validation Methods: Two cervical cancer intracavitary HDR brachytherapy patient models were created, using either uniformly structured regions or realistic segmentation. The computed tomography (CT) images of the models were converted to DICOM CT images via MATLAB and imported into two Treatment Planning Systems (TPSs) with MBDCA capability. The clinical segmentation was expanded to include additional organs at risk. The actual clinical treatment plan was generally maintained, with the source replaced by a generic 192Ir HDR source. Dose to medium in medium calculations were performed using the MBDCA option of each TPS, and three different Monte Carlo (MC) simulation codes. MC results agreed within statistical uncertainty, while comparisons between MBDCA and MC dose distributions highlighted both strengths and limitations of the studied MBDCAs, suggesting potential approaches to overcome the challenges. Data Format and Usage Notes: The datasets for the developed cases are available online at http://doi.org/10.5281/zenodo.15720996. The DICOM files include the treatment plan for each case, TPS, and the corresponding reference MC dose data. The package also contains a TPS- and case-specific user guide for commissioning the MBDCAs, and files needed to replicate the MC simulations. Potential Applications: The provided datasets and proposed methodology offer a commissioning framework for TPSs using MBDCAs, and serve as a benchmark for brachytherapy researchers using MC methods. They also facilitate intercomparisons of MBDCA performance and provide a quality assurance resource for evaluating future TPS software updates.
DoomArena: A framework for Testing AI Agents Against Evolving Security Threats
Abhay Puri
Gabriel Huang
Mihir Bansal
Chandra Kiran Reddy Evuru
Avinandan Bose
Maryam Fazel
Alexandre Lacoste
Jason Stanley
Krishnamurthy Dj Dvijotham
We present DoomArena, a security evaluation framework for AI agents. DoomArena is designed on three principles: 1) It is a plug-in framework and integrates easily into realistic agentic frameworks like BrowserGym (for web agents) and
A Dynamic Security Pattern Selection Framework Using Deep Reinforcement Learning
Saeid Jamshidi
Amin Nikanjam
Kawser Wazed Nafi
The rapid expansion of the Internet of Things (IoT) has brought transformative benefits across various domains and introduced significant security challenges, especially in resource-constrained edge gateways. This paper proposes an innovative Intrusion Detection System (IDS) powered by Deep Reinforcement Learning (DRL) to dynamically detect and mitigate network threats by selecting IoT security patterns. Leveraging adaptive IoT security patterns, the system addresses diverse attack scenarios (e.g., Distributed Denial of Service (DDoS), DoS GoldenEye, DoS Hulk, and Port Scanning) with significant efficiency. The system achieves an average detection accuracy of 97% and demonstrates reduced response times and efficient resource utilization, making it well-suited for edge gateways. The experimental evaluations validate the proposed model's ability to enhance security while optimizing CPU and memory usage, reducing energy consumption, and lowering carbon emissions. Furthermore, its adaptability to evolving cyber threats and alignment with green computing principles highlight its potential to support secure and sustainable IoT networks.
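As a rough sketch of the decision step described above, the snippet below maps a vector of traffic features to one of several candidate mitigation patterns with an epsilon-greedy policy over a small Q-network. The feature dimension, network size, and pattern list are assumptions for illustration, not the paper's architecture.

```python
import random
import torch
import torch.nn as nn

PATTERNS = ["allow", "rate_limit", "block_source", "drop_syn_flood"]

q_net = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, len(PATTERNS)))

def select_pattern(features, epsilon=0.1):
    """Epsilon-greedy choice of a security pattern from traffic features."""
    if random.random() < epsilon:
        return random.choice(PATTERNS)                  # explore
    with torch.no_grad():
        q_values = q_net(torch.as_tensor(features, dtype=torch.float32))
    return PATTERNS[int(q_values.argmax())]             # exploit the learned policy
```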
Exploring Sparse Adapters for Scalable Merging of Parameter Efficient Experts
Merging parameter-efficient task experts has recently gained growing attention as a way to build modular architectures that can be rapidly adapted on the fly for specific downstream tasks, without requiring additional fine-tuning. Typically, LoRA (Low-Rank Adaptation) serves as the foundational building block of such parameter-efficient modular architectures, leveraging low-rank weight structures to reduce the number of trainable parameters. In this paper, we study the properties of sparse adapters, which train only a subset of weights in the base neural network, as potential building blocks of modular architectures. First, we propose a simple method for training highly effective sparse adapters, which is conceptually simpler than existing methods in the literature and surprisingly outperforms both LoRA and full fine-tuning in our setting. Next, we investigate the merging properties of these sparse adapters by merging adapters for up to 20 natural language processing tasks, thus scaling beyond what is usually studied in the literature. Our findings demonstrate that sparse adapters yield superior in-distribution performance post-merging compared to LoRA or full model merging. Achieving strong held-out performance remains a challenge for all methods considered.
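A minimal sketch of what a sparse adapter and its merging could look like: each adapter stores a trained delta on only a small subset of base-weight positions, and merging sums those mostly non-overlapping deltas onto the base weights. The random mask selection and data layout are simplifications, not the training method proposed in the paper.

```python
import torch

def make_sparse_adapter(base_weight, density=0.01):
    """Pick a sparse subset of positions to train; return (mask, delta)."""
    mask = torch.rand_like(base_weight) < density
    delta = torch.zeros_like(base_weight)   # the masked entries get trained elsewhere
    return mask, delta

def merge_adapters(base_weight, adapters):
    """Apply several task-specific sparse deltas to a single base weight matrix."""
    merged = base_weight.clone()
    for mask, delta in adapters:
        merged += mask * delta              # only the masked entries are modified
    return merged
```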
Fast Controlled Generation from Language Models with Adaptive Weighted Rejection Sampling
Ben Lipkin
Jacob Hoover Vigly
João Loula
David R. MacIver
Lei Du
Jason Eisner
Ryan Cotterell
Vikash Mansinghka
Alexander K. Lew
Tim Vieira
The dominant approach to generating from language models subject to some constraint is locally constrained decoding (LCD), incrementally sampling tokens at each time step such that the constraint is never violated. Typically, this is achieved through token masking: looping over the vocabulary and excluding non-conforming tokens. There are two important problems with this approach. (i) Evaluating the constraint on every token can be prohibitively expensive---LM vocabularies often exceed 100,000 tokens. (ii) LCD can distort the global distribution over strings, sampling tokens based only on local information, even if they lead down dead-end paths. This work introduces a new algorithm that addresses both these problems. First, to avoid evaluating a constraint on the full vocabulary at each step of generation, we propose an adaptive rejection sampling algorithm that typically requires orders of magnitude fewer constraint evaluations. Second, we show how this algorithm can be extended to produce low-variance, unbiased estimates of importance weights at a very small additional cost---estimates that can be soundly used within previously proposed sequential Monte Carlo algorithms to correct for the myopic behavior of local constraint enforcement. Through extensive empirical evaluation in text-to-SQL, molecular synthesis, goal inference, pattern matching, and JSON domains, we show that our approach is superior to state-of-the-art baselines, supporting a broader class of constraints and improving both runtime and performance. Additional theoretical and empirical analyses show that our method's runtime efficiency is driven by its dynamic use of computation, scaling with the divergence between the unconstrained and constrained LM, and as a consequence, runtime improvements are greater for better models.
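The snippet below sketches the core rejection-sampling idea in the abstract under simplified interfaces: draw a token from the unconstrained next-token distribution, evaluate the constraint only on that token, and resample from the renormalized remainder on rejection, so the number of constraint checks adapts to how much the constraint disagrees with the LM. The probs array and allowed predicate are assumptions for illustration; the importance-weight estimation described in the abstract is omitted.

```python
import numpy as np

def sample_with_rejection(probs, allowed, rng=None):
    """probs: unconstrained next-token distribution (1-D array);
    allowed(token) lazily checks the constraint. Returns (token, num_checks)."""
    rng = np.random.default_rng() if rng is None else rng
    probs = probs.astype(float).copy()
    checks = 0
    while probs.sum() > 0:
        token = int(rng.choice(len(probs), p=probs / probs.sum()))
        checks += 1
        if allowed(token):      # constraint evaluated only on sampled tokens
            return token, checks
        probs[token] = 0.0      # reject this token and renormalize
    raise ValueError("no token satisfies the constraint")
```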