Publications

Cracking the Code of Action: A Generative Approach to Affordances for Reinforcement Learning
Agents that can autonomously navigate the web through a graphical user interface (GUI) using a unified action space (e.g., mouse and keyboard actions) can require very large amounts of domain-specific expert demonstrations to achieve good performance. Low sample efficiency is often exacerbated in sparse-reward and large-action-space environments, such as a web GUI, where only a few actions are relevant in any given situation. In this work, we consider the low-data regime, with limited or no access to expert behavior. To enable sample-efficient learning, we explore the effect of constraining the action space through …
Design of Ligand-Binding Proteins with Atomic Flow Matching
Junqi Liu
Shaoning Li
Zhi Yang
Foundation models for generalizable electrocardiogram interpretation: comparison of supervised and self-supervised electrocardiogram foundation models
Achille Sowa
Jacques Delfrate
Olivier Tastet
Denis Corbin
Merve Kulbay
Derman Ozdemir
Marie-Jeanne Noël
François-Christophe Marois-Blanchet
François Harvey
Surbhi Sharma
Minhaj Ansari
I-Min Chiu
Valentina Dsouza
Sam F. Friedman
Michael Chassé
Brian J. Potter
Jonathan Afilalo
Pierre Adil Elias
Gilbert Jabbour
Mourad Bahani
Marie-Pierre Dubé
Patrick M. Boyle
Neal A. Chatterjee
Joshua Barrios
Geoffrey H. Tison
David Ouyang
Mahnaz Maddah
Shaan Khurshid
Julia Cadrin-Tourigny
Rafik Tadros
Robert Avram
The 12-lead electrocardiogram (ECG) remains a cornerstone of cardiac diagnostics, yet existing artificial intelligence (AI) solutions for automated interpretation often lack generalizability, remain closed-source, and are primarily trained using supervised learning, limiting their adaptability across diverse clinical settings. To address these challenges, we developed and compared two open-source foundational ECG models: DeepECG-SSL, a self-supervised learning model, and DeepECG-SL, a supervised learning model. Both models were trained on over 1 million ECGs using a standardized preprocessing pipeline and automated free-text extraction from ECG reports to predict 77 cardiac conditions. DeepECG-SSL was pretrained using self-supervised contrastive learning and masked lead modeling. The models were evaluated on six multilingual private healthcare systems and four public datasets for ECG interpretation across 77 diagnostic categories. Fairness analyses assessed performance disparities across age and sex groups, and resource utilization was also investigated. DeepECG-SSL achieved AUROCs of 0.990 (95% CI 0.990, 0.990) on the internal dataset, 0.981 (95% CI 0.981, 0.981) on external public datasets, and 0.983 (95% CI 0.983, 0.983) on external private datasets, while DeepECG-SL demonstrated AUROCs of 0.992 (95% CI 0.992, 0.992), 0.980 (95% CI 0.980, 0.980), and 0.983 (95% CI 0.983, 0.983), respectively. Fairness analyses revealed minimal disparities (true positive rate and false positive rate differences < 0.010) across age and sex groups. For digital biomarker prediction with limited labeled data (Long QT syndrome (LQTS) classification, 5-year atrial fibrillation prediction, and left ventricular ejection fraction (LVEF) classification), DeepECG-SSL outperformed DeepECG-SL in predicting 5-year atrial fibrillation risk (N=132,050; AUROC 0.742 vs. 0.720; Δ=0.022; P<0.001), identifying reduced LVEF ≤40% (N=25,252; 0.928 vs. 0.900; Δ=0.028; P<0.001), and classifying LQTS subtypes (N=127; 0.931 vs. 0.853; Δ=0.078; P=0.026). By releasing model weights, preprocessing tools, and validation code, we aim to support robust, data-efficient AI diagnostics across diverse clinical environments. This study establishes self-supervised learning as a promising paradigm for ECG analysis, particularly in settings with limited annotated data, enhancing accessibility, generalizability, and fairness in AI-driven cardiac diagnostics.

Can self-supervised learning (SSL) yield ECG-based AI foundation models with enhanced performance, fairness, privacy, and generalizability compared to traditional supervised learning (SL) approaches? Our evaluation of DeepECG-SL and DeepECG-SSL across seven external health center datasets and four international publicly accessible datasets demonstrated that while both models achieve comparable diagnostic accuracy for ECG interpretation, SSL outperforms SL on novel tasks with smaller datasets. We validated DeepECG-SL and DeepECG-SSL across public and private datasets and demonstrated that the SSL model had superior generalizability. By addressing fairness, privacy, and efficiency, and by open-sourcing our models, we advance ethical, adaptable AI for equitable, real-world ECG diagnostics.

Graphical abstract: DeepECG-SL and DeepECG-SSL, two open-source AI models for 12-lead ECG interpretation, were trained on over 1 million ECGs. DeepECG-SSL, which uses self-supervised contrastive learning and masked lead modeling, outperformed DeepECG-SL on digital-biomarker tasks with limited labels: predicting 5-year atrial fibrillation risk, identifying reduced LVEF, and classifying long QT syndrome subtypes. Both models achieved high diagnostic accuracy with minimal fairness disparities across age and sex. Validated on ten external datasets, our work provides a robust, reproducible framework for equitable, efficient ECG-based cardiac diagnostics.
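To make the masked-lead-modeling pretext task concrete, here is a minimal sketch of the general idea: randomly zero out entire leads of a 12-lead ECG and train a network to reconstruct them. The shapes, mask ratio, and `model` interface are illustrative assumptions, not the released DeepECG-SSL code.

```python
# Illustrative sketch of masked lead modeling as a self-supervised pretext
# task; shapes and the model interface are assumptions, not DeepECG-SSL code.
import torch
import torch.nn as nn

def masked_lead_loss(ecg: torch.Tensor, model: nn.Module, mask_ratio: float = 0.25):
    """ecg: (batch, 12, samples). Zero out random leads, reconstruct the
    full signal, and score reconstruction error on the masked leads only."""
    batch, leads, _ = ecg.shape
    lead_mask = torch.rand(batch, leads, 1, device=ecg.device) < mask_ratio
    reconstruction = model(ecg.masked_fill(lead_mask, 0.0))
    masked = lead_mask.expand_as(ecg)
    return ((reconstruction - ecg) ** 2)[masked].mean()
```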
From Intuition to Understanding: Using AI Peers to Overcome Physics Misconceptions
Ruben Weijers
Denton Wu
Hannah Betts
Tamara Jacod
Yuxiang Guan
Kushal Dev
Toshali Goel
William Delooze
Ying Wu
Generative AI has the potential to transform the personalization and accessibility of education. However, it raises serious concerns about accuracy and about helping students become independent critical thinkers. In this study, we designed a helpful yet fallible AI "Peer" to help students correct fundamental physics misconceptions related to Newtonian mechanics. In contrast to approaches that seek near-perfect accuracy to create an authoritative AI tutor or teacher, we directly inform students that this AI can answer up to 40% of questions incorrectly. In a randomized controlled trial with 165 students, those who engaged in targeted dialogue with the AI Peer achieved post-test scores that were, on average, 10.5 percentage points higher than those of a control group that discussed physics history, with a normalized gain more than 20 percentage points higher. Qualitative feedback indicated that 91% of the treatment group's AI interactions were rated as helpful. Furthermore, by comparing student performance on pre- and post-test questions about the same concept, along with experts' annotations of the AI interactions, we find initial evidence suggesting that the improvement in performance does not depend on the correctness of the AI. With further research, the AI Peer paradigm described here could open new possibilities for how we learn, adapt to, and grow with AI.
A Generative Approach to LLM Harmfulness Detection with Red Flag Tokens
Most safety training methods for large language models (LLMs) based on fine-tuning rely on dramatically changing the output distribution of the model when faced with a harmful request, shifting it from an unsafe answer to a refusal to respond. These methods inherently compromise model capabilities and may leave auto-regressive models vulnerable to attacks that raise the likelihood of an initial affirmative-response token. To avoid this, we propose to expand the model's vocabulary with a special token we call a *red flag token* (…
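As a rough illustration of the mechanism, the sketch below adds a hypothetical `<rf>` special token to a standard Hugging Face causal LM and reads off the probability the model assigns to it at the next generation step. The token name and base model are assumptions, and the safety fine-tuning that would teach the model to emit the token inside harmful continuations is omitted.

```python
# Illustrative sketch of the red-flag-token idea: extend the vocabulary with
# a special token and monitor its probability during generation. The token
# name and base model are assumptions; the fine-tuning step is omitted.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Add the special token and grow the embedding matrix to match; the model
# would then be fine-tuned to emit <rf> within harmful continuations.
tokenizer.add_special_tokens({"additional_special_tokens": ["<rf>"]})
model.resize_token_embeddings(len(tokenizer))
rf_id = tokenizer.convert_tokens_to_ids("<rf>")

def red_flag_probability(prompt: str) -> float:
    """Probability mass the (fine-tuned) model puts on <rf> at the next step."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]
    return torch.softmax(logits, dim=-1)[rf_id].item()
```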
Learning Decision Trees as Amortized Structure Inference
Learning to Defer for Causal Discovery with Imperfect Experts
Integrating expert knowledge, e.g. from large language models, into causal discovery algorithms can be challenging when the knowledge is not guaranteed to be correct. Expert recommendations may contradict data-driven results, and their reliability can vary significantly depending on the domain or specific query. Existing methods based on soft constraints or inconsistencies in predicted causal relationships fail to account for these variations in expertise. To remedy this, we propose L2D-CD, a method for gauging the correctness of expert recommendations and optimally combining them with data-driven causal discovery results. By adapting learning-to-defer (L2D) algorithms for pairwise causal discovery (CD), we learn a deferral function that selects whether to rely on classical causal discovery methods using numerical data or on expert recommendations based on textual meta-data. We evaluate L2D-CD on the canonical Tübingen pairs dataset and demonstrate its superior performance compared to both the causal discovery method and the expert used in isolation. Moreover, our approach identifies domains where the expert's performance is strong or weak. Finally, we outline a strategy for generalizing this approach to causal discovery on graphs with more than two variables, paving the way for further research in this area.
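A minimal sketch of the learning-to-defer idea for pairwise causal discovery, under the assumption that we have per-pair meta-features and held-out correctness labels for both the data-driven method and the expert; the feature design and the logistic-regression deferral function are illustrative, not the paper's exact construction.

```python
# Illustrative sketch of learning-to-defer for pairwise causal discovery;
# features and models are placeholder assumptions, not the paper's setup.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_deferral(features: np.ndarray, cd_correct: np.ndarray,
                 expert_correct: np.ndarray) -> LogisticRegression:
    """features: (N, d) per-pair meta-features; *_correct: boolean arrays.
    Train only on pairs where exactly one source is right, so the label
    'defer to expert' is unambiguous."""
    mask = cd_correct != expert_correct
    return LogisticRegression().fit(features[mask], expert_correct[mask])

def predict_direction(x_feat, cd_pred, expert_pred, deferral, threshold=0.5):
    """Use the expert's causal direction when deferral confidence is high."""
    p_defer = deferral.predict_proba(x_feat.reshape(1, -1))[0, 1]
    return expert_pred if p_defer > threshold else cd_pred
```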
Mol-MoE: Training Preference-Guided Routers for Molecule Generation
Recent advances in language models have enabled framing molecule generation as sequence modeling. However, existing approaches often rely on single-objective reinforcement learning, limiting their applicability to real-world drug design, where multiple competing properties must be optimized. Traditional multi-objective reinforcement learning (MORL) methods require costly retraining for each new objective combination, making rapid exploration of trade-offs impractical. To overcome these limitations, we introduce Mol-MoE, a mixture-of-experts (MoE) architecture that enables efficient test-time steering of molecule generation without retraining. Central to our approach is a preference-based router training objective that incentivizes the router to combine experts in a way that aligns with user-specified trade-offs. This provides improved flexibility in exploring the chemical property space at test time, facilitating rapid trade-off exploration. Benchmarking against state-of-the-art methods, we show that Mol-MoE achieves superior sample quality and steerability.
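A minimal sketch of test-time preference-guided routing, assuming one expert language model per property with a shared vocabulary; the router architecture and the convex combination of next-token logits are illustrative choices, not the exact Mol-MoE implementation.

```python
# Illustrative sketch of a preference-guided router over per-property expert
# LMs with a shared vocabulary; architecture details are assumptions.
import torch
import torch.nn as nn

class PreferenceRouter(nn.Module):
    def __init__(self, num_experts: int, num_properties: int, hidden: int = 64):
        super().__init__()
        # Maps a user preference vector over properties to expert mixing weights.
        self.net = nn.Sequential(
            nn.Linear(num_properties, hidden), nn.ReLU(),
            nn.Linear(hidden, num_experts),
        )

    def forward(self, preferences: torch.Tensor) -> torch.Tensor:
        return torch.softmax(self.net(preferences), dim=-1)  # (B, E)

def mixed_next_token_logits(expert_logits: torch.Tensor,
                            weights: torch.Tensor) -> torch.Tensor:
    # expert_logits: (E, B, V); weights: (B, E) -> convex combination (B, V).
    return torch.einsum("be,ebv->bv", weights, expert_logits)
```

At inference, the user-supplied preference vector alone determines the mixing weights, which is how new trade-offs can be explored with no retraining.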
Preference Optimization for Concept Bottleneck Models
Tianyue H. Zhang
Mateo Espinosa Zarlenga
Concept Bottleneck Models (CBMs) propose to enhance the trustworthiness of AI systems by constraining their decisions to a set of human-understandable concepts. However, CBMs typically assume that datasets contain accurate concept labels, an assumption often violated in practice, which we show can significantly degrade performance (by 25% in some cases). To address this, we introduce the Concept Preference Optimization (CPO) objective, a new loss function based on Direct Preference Optimization, which effectively mitigates the negative impact of concept mislabeling on CBM performance. We analyze key properties of the CPO objective, showing that it directly optimizes for the concept's posterior distribution, and contrast it with Binary Cross Entropy (BCE), showing that CPO is inherently less sensitive to concept noise. We empirically confirm our analysis, finding that CPO consistently outperforms BCE on three real-world datasets, with and without added label noise.
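To contrast the two objectives at the level of a single concept vector, here is an illustrative sketch: BCE trusts every label individually, while a DPO-style preference loss only asks that the annotated concept vector be more likely than a corrupted alternative, which blunts the impact of individual flipped labels. The pairing scheme (`c_disp` as a noise-corrupted copy of the labels) and the omission of a reference model are assumptions for illustration, not the paper's exact objective.

```python
# Illustrative comparison of BCE with a DPO-style concept preference loss;
# the dispreferred labeling c_disp (e.g., a noise-corrupted copy) and the
# omitted reference-model term are assumptions, not the paper's objective.
import torch
import torch.nn.functional as F

def bce_concept_loss(logits, labels):
    # Trusts every concept label; confident mistakes incur unbounded loss.
    return F.binary_cross_entropy_with_logits(logits, labels.float())

def cpo_style_loss(logits, c_pref, c_disp, beta: float = 1.0):
    # Log-likelihood of a full concept vector under independent Bernoullis.
    def loglik(c):
        return -F.binary_cross_entropy_with_logits(
            logits, c.float(), reduction="none").sum(-1)
    # Only the margin between preferred and dispreferred labelings matters,
    # so a single flipped concept shifts the loss by a bounded amount.
    margin = loglik(c_pref) - loglik(c_disp)
    return -F.logsigmoid(beta * margin).mean()
```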
Refining Answer Distributions for Improved Large Language Model Reasoning
Soumyasundar Pal
Didier Chételat
Yingxue Zhang
Mark J. Coates
Large Language Models (LLMs) have exhibited an impressive capability to perform reasoning tasks, especially if they are encouraged to generate a sequence of intermediate steps. Reasoning performance can be improved by suitably combining multiple LLM responses, generated either in parallel in a single query or via sequential interactions with LLMs throughout the reasoning process. Existing strategies for combination, such as self-consistency and progressive-hint prompting, make inefficient use of the LLM responses. We present Refined Answer Distributions, a novel and principled algorithmic framework to enhance the reasoning capabilities of LLMs. Our approach can be viewed as an iterative sampling strategy for forming a Monte Carlo approximation of an underlying distribution of answers, with the goal of identifying the mode, the most likely answer. Empirical evaluation on several reasoning benchmarks demonstrates the superiority of the proposed approach.
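As a rough sketch of the iterative Monte Carlo idea (not the paper's exact algorithm), one can repeatedly sample answers, feed a summary of the current empirical distribution back as a hint for subsequent rounds, and return the mode; `query_llm` is a hypothetical single-sample interface, and the hint format is an illustrative choice.

```python
# Illustrative sketch of iterative mode-seeking over an answer distribution;
# query_llm is a hypothetical function returning one sampled answer string.
from collections import Counter

def refine_answer_distribution(question, query_llm, rounds=3, samples_per_round=8):
    """Iteratively sample answers, feeding back the current empirical
    distribution as a hint, and return the modal answer."""
    counts = Counter()
    hint = ""
    for _ in range(rounds):
        for _ in range(samples_per_round):
            counts[query_llm(question + hint)] += 1
        # Summarize the current answer distribution as a soft hint for the
        # next round of sampling (one possible instantiation of the idea).
        top = counts.most_common(3)
        hint = "\nCandidate answers so far: " + ", ".join(a for a, _ in top)
    return counts.most_common(1)[0][0]  # the mode: the most likely answer
```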
Rethinking Anti-Misinformation AI
This paper takes a position on how anti-misinformation AI work should be developed for the online misinformation context. We observe that the current literature is dominated by works that produce more information for users to process, and that this function faces various challenges in bringing meaningful effects to reality. We draw on anti-misinformation insights from other domains to suggest a redirection of the existing line of work, and we identify an under-explored opportunity that AI is well positioned to help explore.
Rethinking Hallucinations: Correctness, Consistency, and Prompt Multiplicity
Large language models (LLMs) are known to "hallucinate" by generating false or misleading outputs. Hallucinations pose various harms, from erosion of trust to widespread misinformation. Existing hallucination evaluation, however, focuses only on "correctness" and often overlooks "consistency", which is necessary to distinguish and address these harms. To bridge this gap, we introduce _prompt multiplicity_, a framework for quantifying consistency through prompt sensitivity. Our analysis reveals significant multiplicity (over 50% inconsistency in benchmarks like Med-HALT), suggesting that hallucination-related harms have been severely underestimated. Furthermore, we study the role of consistency in hallucination detection and mitigation. We find that: (a) detection techniques capture consistency, not correctness, and (b) mitigation techniques like RAG can introduce additional inconsistencies. By integrating prompt multiplicity into hallucination evaluation, we provide an improved framework of potential harms and uncover critical limitations in current detection and mitigation strategies.
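A minimal sketch of one way to quantify prompt multiplicity, assuming a set of semantically equivalent paraphrases and a hypothetical `query_llm` sampling function; disagreement with the modal answer is an illustrative choice of consistency metric, not necessarily the paper's.

```python
# Illustrative sketch of quantifying consistency via prompt sensitivity;
# paraphrases and query_llm are placeholder assumptions.
from collections import Counter

def multiplicity_score(paraphrases, query_llm, samples_per_prompt=1):
    """Return the fraction of answers disagreeing with the modal answer
    across semantically equivalent prompts (0.0 = fully consistent)."""
    answers = []
    for prompt in paraphrases:
        for _ in range(samples_per_prompt):
            answers.append(query_llm(prompt))
    mode_count = Counter(answers).most_common(1)[0][1]
    return 1.0 - mode_count / len(answers)
```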