Publications

Rational Retrieval Acts: Leveraging Pragmatic Reasoning to Improve Sparse Retrieval

Gabriel Ben-Zenou

Benjamin Piwowarski

Habiboulaye Amadou-Boubacar

Current sparse neural information retrieval (IR) methods, and to a lesser extent more traditional models such as BM25, do not take into account the document collection and the complex interplay between different term weights when representing a single document. In this paper, we show how Rational Speech Acts (RSA), a linguistics framework used to minimize the number of features to be communicated when identifying an object in a set, can be adapted to the IR case -- and in particular to the high number of potential features (here, tokens). RSA dynamically modulates token-document interactions by considering the influence of other documents in the dataset, better contrasting document representations. Experiments show that incorporating RSA consistently improves multiple sparse retrieval models and achieves state-of-the-art performance on out-of-domain datasets from the BEIR benchmark. https://github.com/arthur-75/Rational-Retrieval-Acts
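The contrastive effect described in this abstract can be illustrated with the textbook RSA recursion applied to a toy token-document weight matrix. This is only a minimal sketch under stated assumptions: the function names, the `alpha` rationality parameter, and the toy weights are hypothetical, and the paper's actual formulation may differ.

```python
def literal_listener(weights):
    """L0(d | t): normalize each token's weights over all documents."""
    n_docs, n_tokens = len(weights), len(weights[0])
    col_sums = [sum(weights[d][t] for d in range(n_docs)) for t in range(n_tokens)]
    return [
        [weights[d][t] / col_sums[t] if col_sums[t] > 0 else 0.0
         for t in range(n_tokens)]
        for d in range(n_docs)
    ]


def pragmatic_speaker(weights, alpha=1.0):
    """S1(t | d) proportional to L0(d | t) ** alpha, normalized over tokens."""
    reweighted = []
    for row in literal_listener(weights):
        powered = [p ** alpha if p > 0 else 0.0 for p in row]
        z = sum(powered)
        reweighted.append([p / z if z > 0 else 0.0 for p in powered])
    return reweighted


# Toy collection (hypothetical): rows are documents, columns are tokens.
# Token 0 appears in both documents; tokens 1 and 2 are each unique to one.
weights = [
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
]
s1 = pragmatic_speaker(weights)
```

In this toy case the shared token ends up with a lower weight than the token unique to each document, which is the kind of collection-aware contrast the abstract refers to.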

2025-07-12

International ACM SIGIR Conference on Research and Development in Information Retrieval (published)

doi.org

arxiv.org

Prompt4Trust: A Reinforcement Learning Prompt Augmentation Framework for Clinically-Aligned Confidence Calibration in Multimodal Large Language Models

Anita Kriz

Elizabeth Laura Janes

Xing Shen

Tal Arbel

Multimodal large language models (MLLMs) hold considerable promise for applications in healthcare. However, their deployment in safety-critical settings is hindered by two key limitations: (i) sensitivity to prompt design, and (ii) a tendency to generate incorrect responses with high confidence. As clinicians may rely on a model's stated confidence to gauge the reliability of its predictions, it is especially important that when a model expresses high confidence, it is also highly accurate. We introduce Prompt4Trust, the first reinforcement learning (RL) framework for prompt augmentation targeting confidence calibration in MLLMs. A lightweight LLM is trained to produce context-aware auxiliary prompts that guide a downstream task MLLM to generate responses in which the expressed confidence more accurately reflects predictive accuracy. Unlike conventional calibration techniques, Prompt4Trust specifically prioritizes aspects of calibration most critical for safe and trustworthy clinical decision-making. Beyond improvements driven by this clinically motivated calibration objective, our proposed method also improves task accuracy, achieving state-of-the-art medical visual question answering (VQA) performance on the PMC-VQA benchmark, which is composed of multiple-choice questions spanning diverse medical imaging modalities. Moreover, our framework trained with a small downstream task MLLM showed promising zero-shot generalization to larger MLLMs in our experiments, suggesting the potential for scalable calibration without the associated computational costs. This work demonstrates the potential of automated yet human-aligned prompt engineering for improving the trustworthiness of MLLMs in safety-critical settings. Our codebase can be found at https://github.com/xingbpshen/prompt4trust.

2025-07-11

arXiv (preprint)

doi.org

arxiv.org

RL, but don't do anything I wouldn't do

Michael K. Cohen

Marcus Hutter

Yoshua Bengio

Stuart Russell

In reinforcement learning (RL), if the agent's reward differs from the designers' true utility, even only rarely, the state distribution resulting from the agent's policy can be very bad, in theory and in practice. When RL policies would devolve into undesired behavior, a common countermeasure is KL regularization to a trusted policy ("Don't do anything I wouldn't do"). All current cutting-edge language models are RL agents that are KL-regularized to a "base policy" that is purely predictive. Unfortunately, we demonstrate that when this base policy is a Bayesian predictive model of a trusted policy, the KL constraint is no longer reliable for controlling the behavior of an advanced RL agent. We demonstrate this theoretically using algorithmic information theory, and while systems today are too weak to exhibit this theorized failure precisely, we RL-finetune a language model and find evidence that our formal results are plausibly relevant in practice. We also propose a theoretical alternative that avoids this problem by replacing the "Don't do anything I wouldn't do" principle with "Don't do anything I mightn't do".
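For readers unfamiliar with KL regularization to a base policy, the following minimal sketch shows the standard KL-penalized objective over a small discrete action set, together with its well-known closed-form maximizer. The function names, `beta`, and the toy numbers are hypothetical illustrations and are not taken from the paper.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as lists of probabilities."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf  # p puts mass where q has none
            total += pi * math.log(pi / qi)
    return total


def kl_regularized_objective(policy, base, rewards, beta):
    """Expected reward minus beta times KL(policy || base)."""
    expected_reward = sum(pi * r for pi, r in zip(policy, rewards))
    return expected_reward - beta * kl_divergence(policy, base)


def optimal_policy(base, rewards, beta):
    """Closed-form maximizer: proportional to base * exp(reward / beta)."""
    unnorm = [b * math.exp(r / beta) for b, r in zip(base, rewards)]
    z = sum(unnorm)
    return [u / z for u in unnorm]


# Toy setup (hypothetical): two actions, a uniform base policy.
base = [0.5, 0.5]
rewards = [1.0, 0.0]
beta = 1.0

greedy_ish = [0.9, 0.1]
best = optimal_policy(base, rewards, beta)
score_base = kl_regularized_objective(base, base, rewards, beta)
score_greedy = kl_regularized_objective(greedy_ish, base, rewards, beta)
score_best = kl_regularized_objective(best, base, rewards, beta)
```

A larger `beta` keeps the optimized policy closer to the base policy, while a smaller `beta` lets reward dominate.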

2025-07-10

Proceedings of the Forty-first Conference on Uncertainty in Artificial Intelligence (published)

proceedings.mlr.press

Multivariate Time-Series Anomaly Detection with Contaminated Data: Application to Physiological Signals

Thi Kieu Khanh Ho

Narges Armanfard

2025-07-10

Proceedings of the Forty-first Conference on Uncertainty in Artificial Intelligence (published)

doi.org

proceedings.mlr.press

scE2TM improves single-cell embedding interpretability and reveals cellular perturbation signatures

Hegang Chen

Yuyin Lu

Yifan Zhao

Zhiming Dai

Fu Lee Wang

Qing Li

Yanghui Rao

Yuemei Li

2025-07-10

arXiv (preprint)

doi.org

arxiv.org

Discrete Feynman-Kac Correctors

Mohsin Hasan

Viktor Ohanesian

Artem Gazizov

Yoshua Bengio

Alán Aspuru-Guzik

Roberto Bondesan

Marta Skreta

Kirill Neklyudov

Discrete diffusion models have recently emerged as a promising alternative to the autoregressive approach for generating discrete sequences. Sample generation via gradual denoising or demasking processes allows them to capture hierarchical non-sequential interdependencies in the data. These custom processes, however, do not assume a flexible control over the distribution of generated samples. We propose Discrete Feynman-Kac Correctors, a framework that allows for controlling the generated distribution of discrete masked diffusion models at inference time. We derive Sequential Monte Carlo (SMC) algorithms that, given a trained discrete diffusion model, control the temperature of the sampled distribution (i.e. perform annealing), sample from the product of marginals of several diffusion processes (e.g. differently conditioned processes), and sample from the product of the marginal with an external reward function, producing likely samples from the target distribution that also have high reward. Notably, our framework does not require any training of additional models or fine-tuning of the original model. We illustrate the utility of our framework in several applications including: efficient sampling from the annealed Boltzmann distribution of the Ising model, improving the performance of language models for code generation and amortized learning, as well as reward-tilted protein sequence generation.

2025-07-08

AI4MATH @ International Conference on Machine Learning (poster)

doi.org

openreview.net

Instilling Parallel Reasoning into Language Models

Matthew Macfarlane

Minseon Kim

Nebojsa Jojic

Weijia Xu

Lucas Caccia

Xingdi Yuan

Wanru Zhao

Zhengyan Shi

Alessandro Sordoni

Sequential chain-of-thought reasoning significantly improves the performance of large language models (LLMs) on complex tasks. However, sequential reasoning has structural limitations: long chains are expensive due to attention's quadratic complexity, and multiple diverse strategies cannot be considered simultaneously. To address this, we propose a method that instills parallel reasoning capabilities in LLMs by distilling parallel reasoning traces from a teacher model. This approach enables models to decompose problems, explore diverse strategies via concurrent reasoning traces, and aggregate trace outputs for the final answer. Evaluating on a variety of math and puzzle benchmarks such as MATH 500, AIME and Countdown, we show our approach can decompose parallelizable problems, and that the performance scales with the number of parallel traces. The resulting model can dynamically allocate reasoning strategies based on problem complexity, outperforming standard sampling methods.

2025-07-08

ICML.cc/2025/Workshop/AI4MATH (poster)

openreview.net

MedRiskEval: Medical Risk Evaluation Benchmark of Language Models, On the Importance of User Perspectives in Healthcare Settings

Jean-Philippe Corbeil

Minseon Kim

Alessandro Sordoni

Francois Beaulieu

Paul Vozila

As the performance of large language models (LLMs) continues to advance, their adoption in the medical domain is increasing. However, existing risk evaluations have largely focused on general safety benchmarks. In medical applications, LLMs may be used by a wide range of users, from general users and patients to clinicians, with diverse levels of expertise, and the model's outputs can have a direct impact on human health, which raises serious safety concerns. In this paper, we introduce MedRiskEval, a medical risk evaluation benchmark tailored to the medical domain. To fill the gap in previous benchmarks that only focused on the clinician perspective, we introduce a new patient-oriented dataset called PatientSafetyBench containing 466 samples across 5 critical risk categories. Leveraging our new benchmark alongside existing datasets, we evaluate a variety of open- and closed-source LLMs. To the best of our knowledge, this work establishes an initial foundation for safer deployment of LLMs in healthcare.

2025-07-08

arXiv (preprint)

doi.org

arxiv.org

Learning Minimal Neural Specifications

Chuqin Geng

Zhaoyue Wang

Haolin Ye

Xujie Si

2025-07-07

Proceedings of the International Conference on Neuro-symbolic Systems (published)

proceedings.mlr.press

Safe Domain Randomization via Uncertainty-Aware Out-of-Distribution Detection and Policy Adaptation

Deploying reinforcement learning (RL) policies in the real world involves significant challenges, including distribution shifts, safety concerns, and the impracticality of direct interactions during policy refinement. Existing methods, such as domain randomization (DR) and off-dynamics RL, enhance policy robustness through direct interaction with the target domain, an inherently unsafe practice. We propose Uncertainty-Aware RL (UARL), a novel framework that prioritizes safety during training by addressing Out-Of-Distribution (OOD) detection and policy adaptation without requiring direct interactions in the target domain. UARL employs an ensemble of critics to quantify policy uncertainty and incorporates progressive environmental randomization to prepare the policy for diverse real-world conditions. By iteratively refining over high-uncertainty regions of the state space in simulated environments, UARL enhances robust generalization to the target domain without explicitly training on it. We evaluate UARL on MuJoCo benchmarks and a quadrupedal robot, demonstrating its effectiveness in reliable OOD detection, improved performance, and enhanced sample efficiency compared to baselines.

2025-07-07

arXiv (preprint)

doi.org

arxiv.org

Steering Large Language Model Activations in Sparse Spaces

Reza Bayat

Ali Rahimi-Kalahroudi

Mohammad Pezeshki

A. Chandar

P. Vincent

2025-07-07

Conference on Language Modeling (accepted)

doi.org

openreview.net

Towards fair decentralized benchmarking of healthcare AI algorithms with the Federated Tumor Segmentation (FeTS) challenge

Maximilian Zenk

Ujjwal Baid

Sarthak Pati

Akis Linardos

Brandon Edwards

Micah Sheller

Patrick Foley

Alejandro Aristizabal

David Zimmerer

Alexey Gruzdev

Jason Martin

Russell T. Shinohara

Annika Reinke

Fabian Isensee

Santhosh Parampottupadam

Kaushal Parekh

Ralf Floca

Hasan Kassem

Bhakti Baheti

Siddhesh Thakur

Verena Chung

Kaisar Kushibar

Karim Lekadir

Meirui Jiang

Youtan Yin

Hongzheng Yang

Quande Liu

Cheng Chen

Qi Dou

Pheng-Ann Heng

Xiaofan Zhang

Shaoting Zhang

Muhammad Irfan Khan

Mohammad Ayyaz Azeem

Mojtaba Jafaritadi

Esa Alhoniemi

Elina Kontio

Suleiman A. Khan

Leon Mächler

Ivan Ezhov

Florian Kofler

Suprosanna Shit

Johannes C. Paetzold

Timo Loehr

Benedikt Wiestler

Himashi Peiris

Kamlesh Pawar

Shenjun Zhong

Zhaolin Chen

Munawar Hayat

Gary Egan

Mehrtash Harandi

Ece Isik Polat

Gorkem Polat

Altan Kocyigit

Alptekin Temizel

Anup Tuladhar

Lakshay Tyagi

Raissa Souza

Nils D. Forkert

Pauline Mouches

Matthias Wilms

Vishruth Shambhat

Akansh Maurya

Shubham Subhas Danannavar

Rohit Kalla

Vikas Kumar Anand

Ganapathy Krishnamurthi

Sahil Nalawade

Chandan Ganesh

Ben Wagner

Divya Reddy

Yudhajit Das

Fang F. Yu

Baowei Fei

B. Fei

Ananth J. Madhuranthakam

Joseph Maldjian

Gaurav Singh

Jianxun Ren

Wei Zhang

Ning An

Qingyu Hu

Youjia Zhang

Ying Zhou

Vasilis Siomos

Giacomo Tarroni

Jonathan Passerrat-Palmbach

Ambrish Rawat

Giulio Zizzo

Swanand Ravindra Kadhe

Jonathan P. Epperlein

Stefano Braghin

Yong Wang

Renuga Kanagavelu

Qingsong Wei

Yechao Yang

Yang Liu

Krzysztof Kotowski

Szymon Adamski

Bartosz Machura

Wojciech Malara

Lukasz Zarudzki

Jakub Nalepa

Yaying Shi

Hongjian Gao

Salman Avestimehr

Yonghong Yan

Agus S. Akbar

Ekaterina Kondrateva

Hua Yang

Zhaopei Li

Hung-Yu Wu

Johannes Roth

Camillo Saueressig

Alexandre Milesi

Quoc D. Nguyen

Nathan J. Gruenhagen

Tsung-Ming Huang

Jun Ma

Har Shwinder H. Singh

Nai-Yu Pan

Dingwen Zhang

Ramy A. Zeineldin

Michal Futrega

Yading Yuan

Gian Marco Conte

GM Conte

Xue Feng

Quan D. Pham

Yong Xia

Zhifan Jiang

Huan Minh Luu

Mariia Dobko

Alexandre Carré

Bair Tuchinov

Hassan Mohy-ud-Din

Saruar Alam

Anup Singh

Nameeta Shah

Weichung Wang

Chiharu Sako

Michel Bilello

Satyam Ghodasara

Suyash Mohan

Christos Davatzikos

Evan Calabrese

Jeffrey Rudie

Javier Villanueva-Meyer

S. Cha

Soonmee Cha

Christopher Hess

John Mongan

Madhura Ingalhalikar

Manali Jadhav

Umang Pandey

Jitender Saini

Raymond Y. Huang

Ken Chang

Minh-Son To

Sargam Bhardwaj

Chee Chong

Marc Agzarian

Michal Kozubek

Filip Lux

Jan Michálek

Petr Matula

Miloš Keřkovský

Tereza Kopřivová

Marek Dostál

Václav Vybíhal

Marco C. Pinho

James Holcomb

Marie Metz

Rajan Jain

Matthew D. Lee

Yvonne W. Lui

Pallavi Tiwari

Ruchika Verma

Rohan Bareja

Ipsa Yadav

Jonathan Chen

Neeraj Kumar

Yuriy Gusev

Krithika Bhuvaneshwar

Anousheh Sayah

Camelia Bencheqroun

Anas Belouali

Subha Madhavan

Rivka R. Colen

Aikaterini Kotrotsou

Philipp Vollmuth

Gianluca Brugnara

Chandrakanth J. Preetha

Felix Sahm

Martin Bendszus

Wolfgang Wick

Abhishek Mahajan

Carmen Balaña

Jaume Capellades

Josep Puig

Yoon Seong Choi

Seung-Koo Lee

Jong Hee Chang

Sung Soo Ahn

Hassan F. Shaykh

Alejandro Herrera-Trujillo

Maria Trujillo

William Escobar

Ana Abello

José Bernal

Jhon Gómez

Pamela LaMontagne

Daniel S. Marcus

Mikhail Milchenko

Arash Nazeri

Bennett A. Landman

Karthik Ramadass

Kaiwen Xu

Silky Chotai

Lola B. Chambless

Akshitkumar Mistry

Reid C. Thompson

Ashok Srinivasan

Jayapalli R. Bapuraj

J. Rajiv Bapuraj

Arvind Rao

Nicholas Wang

Ota Yoshiaki

Toshio Moritani

Sevcan Turk

Joonsang Lee

Snehal Prabhudesai

John Garrett

Matthew Larson

Robert Jeraj

Hongwei Li

Hao Li

Tobias Weiss

Michael Weller

Andrea Bink

Bertrand Pouymayou

Sonam Sharma

Tzu-Chi Tseng

Saba Adabi

Alexandre Xavier Falcão

Samuel B. Martins

Bernardo C. A. Teixeira

Flávia Sprenger

David Menotti

Diego R. Lucio

Simone P. Niclou

Olivier Keunen

Ann-Christin Hau

Enrique Pelaez

Heydy Franco-Maldonado

Francis Loayza

Sebastian Quevedo

Richard McKinley

Johannes Slotboom

Piotr Radojewski

Raphael Meier

Roland Wiest

Johannes Trenkler

Josef Pichler

Georg Necker

Andreas Haunschmidt

Stephan Meckel

Pamela Guevara

Esteban Torche

Cristobal Mendoza

Franco Vera

Elvis Ríos

Eduardo López

Sergio A. Velastin

Joseph Choi

Stephen Baek

Yusung Kim

Heba Ismael

Bryan Allen

John M. Buatti

Peter Zampakis

Vasileios Panagiotopoulos

Panagiotis Tsiganos

Sotiris Alexiou

Ilias Haliassos

Evangelia I. Zacharaki

Konstantinos Moustakas

Christina Kalogeropoulou

Dimitrios M. Kardamakis

Bing Luo

Laila M. Poisson

Ning Wen

Martin Vallières

Mahdi A. L. Loutfi

David Fortin

Martin Lepage

Fanny Morón

Jacob Mandel

Gaurav Shukla

Spencer Liem

Gregory S. Alexandre

Joseph Lombardo

Joshua D. Palmer

Adam E. Flanders

Adam P. Dicker

Godwin Ogbole

Dotun Oyekunle

Olubunmi Odafe-Oyibotha

Babatunde Osobu

Mustapha Shu’aibu Hikima

Mayowa Soneye

Farouk Dako

Adeleye Dorcas

Derrick Murcia

Eric Fu

Rourke Haas

John A. Thompson

David Ryan Ormond

Stuart Currie

Kavi Fatania

Russell Frood

Amber L. Simpson

Jacob J. Peoples

Ricky Hu

Danielle Cutler

Fabio Y. Moraes

Anh Tran

Mohammad Hamghalam

Michael A. Boss

James Gimpel

Deepak Kattil Veettil

Kendall Schmidt

Lisa Cimino

Cynthia Price

Brian Bialecki

Sailaja Marella

Charles Apgar

Andras Jakab

Marc-André Weber

Errol Colak

Jens Kleesiek

John Freymann

Justin Kirby

Lena Maier-Hein

Jake Albrecht

Peter Mattson

Alexandros Karargyris

Prashant Shah

Bjoern Menze

Klaus Maier-Hein

Spyridon Bakas

Computational competitions are the standard for benchmarking medical image analysis algorithms, but they typically use small curated test datasets acquired at a few centers, leaving a gap to the reality of diverse multicentric patient data. To this end, the Federated Tumor Segmentation (FeTS) Challenge represents the paradigm for real-world algorithmic performance evaluation. The FeTS challenge is a competition to benchmark (i) federated learning aggregation algorithms and (ii) state-of-the-art segmentation algorithms, across multiple international sites. Weight aggregation and client selection techniques were compared using a multicentric brain tumor dataset in realistic federated learning simulations, yielding benefits for adaptive weight aggregation, and efficiency gains through client sampling. Quantitative performance evaluation of state-of-the-art segmentation algorithms on data distributed internationally across 32 institutions yielded good generalization on average, although the worst-case performance revealed data-specific modes of failure. Similar multi-site setups can help validate the real-world utility of healthcare AI algorithms in the future.

2025-07-07

Nature Communications (published)

doi.org
