Publications

Foundation models for generalizable electrocardiogram interpretation: comparison of supervised and self-supervised electrocardiogram foundation models

Alexis Nolin-Lapalme

Achille Sowa

Jacques Delfrate

Olivier Tastet

Denis Corbin

Merve Kulbay

Derman Ozdemir

Marie-Jeanne Noël

François-Christophe Marois-Blanchet

François Harvey

Surbhi Sharma

Minhaj Ansari

I-Min Chiu

Valentina Dsouza

Sam F. Friedman

Michael Chassé

Brian J. Potter

Jonathan Afilalo

Pierre Adil Elias

Gilbert Jabbour

Mourad Bahani

Marie-Pierre Dubé

Patrick M. Boyle

Neal A. Chatterjee

Joshua Barrios

Geoffrey H. Tison

David Ouyang

Mahnaz Maddah

Shaan Khurshid

Julia Cadrin-Tourigny

Rafik Tadros

Julie Hussin

Robert Avram

The 12-lead electrocardiogram (ECG) remains a cornerstone of cardiac diagnostics, yet existing artificial intelligence (AI) solutions for automated interpretation often lack generalizability, remain closed-source, and are primarily trained using supervised learning, limiting their adaptability across diverse clinical settings. To address these challenges, we developed and compared two open-source foundational ECG models: DeepECG-SSL, a self-supervised learning model, and DeepECG-SL, a supervised learning model. Both models were trained on over 1 million ECGs using a standardized preprocessing pipeline and automated free-text extraction from ECG reports to predict 77 cardiac conditions. DeepECG-SSL was pretrained using self-supervised contrastive learning and masked lead modeling. The models were evaluated on six multilingual private healthcare systems and four public datasets for ECG interpretation across 77 diagnostic categories. Fairness analyses assessed disparities in performance across age and sex groups, and resource utilization was also investigated. DeepECG-SSL achieved AUROCs of 0.990 (95% CI 0.990, 0.990) on the internal dataset, 0.981 (95% CI 0.981, 0.981) on external public datasets, and 0.983 (95% CI 0.983, 0.983) on external private datasets, while DeepECG-SL demonstrated AUROCs of 0.992 (95% CI 0.992, 0.992), 0.980 (95% CI 0.980, 0.980), and 0.983 (95% CI 0.983, 0.983), respectively. Fairness analyses revealed minimal disparities (true positive rate and false positive rate differences <0.010) across age and sex groups. In digital biomarker prediction with limited labeled data (long QT syndrome (LQTS) classification, 5-year atrial fibrillation prediction, and left ventricular ejection fraction (LVEF) classification), DeepECG-SSL outperformed DeepECG-SL in predicting 5-year atrial fibrillation risk (N=132,050; AUROC 0.742 vs. 0.720; Δ=0.022; P<0.001), identifying reduced LVEF ≤40% (N=25,252; 0.928 vs. 0.900; Δ=0.028; P<0.001), and classifying LQTS subtypes (N=127; 0.931 vs. 0.853; Δ=0.078; P=0.026). By releasing model weights, preprocessing tools, and validation code, we aim to support robust, data-efficient AI diagnostics across diverse clinical environments. This study establishes self-supervised learning as a promising paradigm for ECG analysis, particularly in settings with limited annotated data, enhancing accessibility, generalizability, and fairness in AI-driven cardiac diagnostics. Can self-supervised learning (SSL) yield ECG-based AI foundational models with enhanced performance, fairness, privacy, and generalizability compared to traditional supervised learning (SL) approaches? Our evaluation of DeepECG-SL and DeepECG-SSL across seven external health center datasets and four international publicly accessible datasets demonstrated that while both models achieve comparable diagnostic accuracy for ECG interpretation, SSL outperforms SL on novel tasks with smaller datasets. We validated DeepECG-SL and DeepECG-SSL across public and private datasets and demonstrated that the SSL model had superior generalizability. By addressing fairness, privacy, and efficiency, and by open-sourcing our models, we advance ethical, adaptable AI for equitable, real-world ECG diagnostics. Graphical abstract: DeepECG-SL and DeepECG-SSL, two open-source AI models for 12-lead ECG interpretation, were trained on over 1 million ECGs. DeepECG-SSL, utilizing self-supervised contrastive learning and masked lead modeling, outperformed DeepECG-SL in utilizing digital biomarkers to predict atrial fibrillation risk, reduced LVEF, and long QT syndrome subtypes, while both models achieved high diagnostic accuracy with minimal fairness disparities across age and sex. Validated on ten external datasets, our work provides a robust, reproducible framework for equitable, efficient ECG-based cardiac diagnostics.
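The fairness figures in the abstract are reported as differences in true positive rate and false positive rate between demographic groups. A minimal sketch of how such gaps could be computed is given here; the function names (`tpr_fpr`, `rate_gaps`) and the toy data are hypothetical, and the study's released validation code may define the metric differently.

```python
from typing import Sequence


def tpr_fpr(y_true: Sequence[int], y_pred: Sequence[int]) -> tuple[float, float]:
    """Return (true positive rate, false positive rate) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


def rate_gaps(y_true, y_pred, groups) -> tuple[float, float]:
    """Largest between-group difference in TPR and in FPR."""
    tprs, fprs = [], []
    for g in sorted(set(groups)):
        idx = [i for i, x in enumerate(groups) if x == g]
        tpr, fpr = tpr_fpr([y_true[i] for i in idx], [y_pred[i] for i in idx])
        tprs.append(tpr)
        fprs.append(fpr)
    return max(tprs) - min(tprs), max(fprs) - min(fprs)


# Toy data, made up for illustration only (not from the study).
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]
sex = ["F", "F", "F", "F", "M", "M", "M", "M"]
tpr_gap, fpr_gap = rate_gaps(y_true, y_pred, sex)
print(f"TPR gap = {tpr_gap:.3f}, FPR gap = {fpr_gap:.3f}")
```

Under this assumed definition, a reported disparity below 0.010 would mean both gaps stay under that threshold for every pair of groups compared.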

2025-03-04

medRxiv (preprint)

doi.org

From Intuition to Understanding: Using AI Peers to Overcome Physics Misconceptions

Ruben Weijers

Denton Wu

Hannah Betts

Tamara Jacod

Yuxiang Guan

Vidya Sujaya

Kushal Dev

Toshali Goel

William Delooze

Reihaneh Rabbany

Ying Wu

Jean-François Godbout

Kellin Pelrine

Generative AI has the potential to transform personalization and accessibility of education. However, it raises serious concerns about accuracy and helping students become independent critical thinkers. In this study, we designed a helpful yet fallible AI "Peer" to help students correct fundamental physics misconceptions related to Newtonian mechanics concepts. In contrast to approaches that seek near-perfect accuracy to create an authoritative AI tutor or teacher, we directly inform students that this AI can answer up to 40% of questions incorrectly. In a randomized controlled trial with 165 students, those who engaged in targeted dialogue with the AI Peer achieved post-test scores that were, on average, 10.5 percentage points higher (with over 20 percentage points higher normalized gain) than a control group that discussed physics history. Qualitative feedback indicated that 91% of the treatment group's AI interactions were rated as helpful. Furthermore, by comparing student performance on pre- and post-test questions about the same concept, along with experts' annotations of the AI interactions, we find initial evidence suggesting the improvement in performance does not depend on the correctness of the AI. With further research, the AI Peer paradigm described here could open new possibilities for how we learn, adapt to, and grow with AI.
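The abstract reports a "normalized gain" without stating its formula. A minimal sketch follows, assuming the definition commonly used in physics education research, (post - pre) / (max - pre); the function name and the example scores are hypothetical, and the paper may compute the quantity differently.

```python
def normalized_gain(pre: float, post: float, max_score: float = 100.0) -> float:
    """Fraction of the available improvement that was actually achieved.

    Assumed definition: (post - pre) / (max_score - pre).
    """
    if pre >= max_score:
        raise ValueError("pre-test score leaves no room for improvement")
    return (post - pre) / (max_score - pre)


# Made-up scores for illustration: a student moving from 40% to 70%
# realizes half of the 60 points that were still available.
gain = normalized_gain(pre=40.0, post=70.0)
print(gain)
```

Normalizing by the remaining headroom lets students with different pre-test scores be compared on the same scale.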

2025-03-04

ICLR.cc/2025/Workshop/Bi-Align (poster)

doi.org

openreview.net

A Generative Approach to LLM Harmfulness Detection with Red Flag Tokens

Most safety training methods for large language models (LLMs) based on fine-tuning rely on dramatically changing the output distribution of the model when faced with a harmful request, shifting it from an unsafe answer to a refusal to respond. These methods inherently compromise model capabilities and might make auto-regressive models vulnerable to attacks that make an initial affirmative-response token likely. To avoid that, we propose to expand the model's vocabulary with a special token we call a *red flag token* (

2025-03-04

ICLR.cc/2025/Workshop/BuildingTrust (accepted)

openreview.net

Learning Decision Trees as Amortized Structure Inference

Mohammed Mahfoud

Ghait Boukachab

Michał Koziarski

Alex Hernández-García

Stefan Bauer

Yoshua Bengio

Nikolay Malkin

2025-03-04

ICLR.cc/2025/Workshop/FPI (poster)

doi.org

openreview.net

Learning to Defer for Causal Discovery with Imperfect Experts

Sara Magliacane

Valentina Zantedeschi

Alexandre Drouin

Integrating expert knowledge, e.g. from large language models, into causal discovery algorithms can be challenging when the knowledge is not guaranteed to be correct. Expert recommendations may contradict data-driven results, and their reliability can vary significantly depending on the domain or specific query. Existing methods based on soft constraints or inconsistencies in predicted causal relationships fail to account for these variations in expertise. To remedy this, we propose L2D-CD, a method for gauging the correctness of expert recommendations and optimally combining them with data-driven causal discovery results. By adapting learning-to-defer (L2D) algorithms for pairwise causal discovery (CD), we learn a deferral function that selects whether to rely on classical causal discovery methods using numerical data or expert recommendations based on textual meta-data. We evaluate L2D-CD on the canonical Tübingen pairs dataset and demonstrate its superior performance compared to both the causal discovery method and the expert used in isolation. Moreover, our approach identifies domains where the expert's performance is strong or weak. Finally, we outline a strategy for generalizing this approach to causal discovery on graphs with more than two variables, paving the way for further research in this area.

2025-03-04

ICLR.cc/2025/Workshop/LLM_Reason_and_Plan (published)

doi.org

openreview.net

Mol-MoE: Training Preference-Guided Routers for Molecule Generation

Diego Calanzone

Pierluca D'Oro

Pierre-Luc Bacon

Recent advances in language models have enabled framing molecule generation as sequence modeling. However, existing approaches often rely on single-objective reinforcement learning, limiting their applicability to real-world drug design, where multiple competing properties must be optimized. Traditional multi-objective reinforcement learning (MORL) methods require costly retraining for each new objective combination, making rapid exploration of trade-offs impractical. To overcome these limitations, we introduce Mol-MoE, a mixture-of-experts (MoE) architecture that enables efficient test-time steering of molecule generation without retraining. Central to our approach is a preference-based router training objective that incentivizes the router to combine experts in a way that aligns with user-specified trade-offs. This provides improved flexibility in exploring the chemical property space at test time, facilitating rapid trade-off exploration. Benchmarking against state-of-the-art methods, we show that Mol-MoE achieves superior sample quality and steerability.

2025-03-04

ICLR.cc/2025/Workshop/GEM (published)

doi.org

openreview.net

Preference Optimization for Concept Bottleneck Models

Emiliano Penaloza

Tianyue H. Zhang

Laurent Charlin

Mateo Espinosa Zarlenga

Concept Bottleneck Models (CBMs) propose to enhance the trustworthiness of AI systems by constraining their decisions on a set of human-understandable concepts. However, CBMs typically assume that datasets contain accurate concept labels, an assumption often violated in practice, which we show can significantly degrade performance (by 25% in some cases). To address this, we introduce the Concept Preference Optimization (CPO) objective, a new loss function based on Direct Preference Optimization, which effectively mitigates the negative impact of concept mislabeling on CBM performance. We provide an analysis of some key properties of the CPO objective, showing that it directly optimizes for the concept’s posterior distribution, and contrast it against Binary Cross Entropy (BCE), where we show CPO is inherently less sensitive to concept noise. We empirically confirm our analysis, finding that CPO consistently outperforms BCE on three real-world datasets with and without added label noise.

2025-03-04

ICLR.cc/2025/Workshop/Bi-Align (oral presentation)

openreview.net

Refining Answer Distributions for Improved Large Language Model Reasoning

Soumyasundar Pal

Didier Chételat

Yingxue Zhang

Mark J. Coates

Large Language Models (LLMs) have exhibited an impressive capability to perform reasoning tasks, especially if they are encouraged to generate a sequence of intermediate steps. Reasoning performance can be improved by suitably combining multiple LLM responses, generated either in parallel in a single query, or via sequential interactions with LLMs throughout the reasoning process. Existing strategies for combination, such as self-consistency and progressive-hint-prompting, make inefficient use of the LLM responses. We present Refined Answer Distributions, a novel and principled algorithmic framework to enhance the reasoning capabilities of LLMs. Our approach can be viewed as an iterative sampling strategy for forming a Monte Carlo approximation of an underlying distribution of answers, with the goal of identifying the mode, i.e., the most likely answer. Empirical evaluation on several reasoning benchmarks demonstrates the superiority of the proposed approach.
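To make the idea of a Monte Carlo approximation of an answer distribution concrete, the sketch below estimates the distribution and its mode from a batch of sampled answers by simple counting. This is plain majority voting in the style of the self-consistency baseline, not the Refined Answer Distributions algorithm itself; the function name and the sample answers are hypothetical.

```python
from collections import Counter


def estimate_mode(answers: list[str]) -> tuple[str, dict[str, float]]:
    """Empirical answer distribution and its mode from sampled answers."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    total = len(answers)
    # Relative frequencies approximate the underlying answer probabilities.
    dist = {answer: count / total for answer, count in counts.items()}
    mode = max(dist, key=dist.get)
    return mode, dist


# Made-up final answers extracted from five sampled reasoning chains.
samples = ["42", "42", "41", "42", "40"]
mode, dist = estimate_mode(samples)
print(mode, dist[mode])
```

With few samples the empirical mode can be wrong when two answers have similar probability, which is the kind of inefficiency the abstract attributes to existing combination strategies.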

2025-03-04

ICLR.cc/2025/Workshop/LLM_Reason_and_Plan (published)

openreview.net

Rethinking Anti-Misinformation AI

This paper takes a position on how anti-misinformation AI works should be developed for the online misinformation context. We observe that the current literature is dominated by works that produce more information for users to process and that this function faces various challenges in bringing meaningful effects to reality. We use anti-misinformation insights from other domains to suggest a redirection of the existing line of work and identify an under-explored opportunity AI can facilitate exploring.

2025-03-04

ICLR.cc/2025/Workshop/HAIC (published)

openreview.net

Rethinking Hallucinations: Correctness, Consistency, and Prompt Multiplicity

Prakhar Ganesh

Reza Shokri

Golnoosh Farnadi

Large language models (LLMs) are known to "hallucinate" by generating false or misleading outputs. Hallucinations pose various harms, from erosion of trust to widespread misinformation. Existing hallucination evaluation, however, focuses only on "correctness" and often overlooks "consistency", necessary to distinguish and address these harms. To bridge this gap, we introduce _prompt multiplicity_, a framework for quantifying consistency through prompt sensitivity. Our analysis reveals significant multiplicity (over 50% inconsistency in benchmarks like Med-HALT), suggesting that hallucination-related harms have been severely underestimated. Furthermore, we study the role of consistency in hallucination detection and mitigation. We find that: (a) detection techniques capture consistency, not correctness, and (b) mitigation techniques like RAG can introduce additional inconsistencies. By integrating prompt multiplicity into hallucination evaluation, we provide an improved framework of potential harms and uncover critical limitations in current detection and mitigation strategies.

2025-03-04

ICLR.cc/2025/Workshop/BuildingTrust (accepted)

openreview.net

Scaling Deep Learning Solutions for Transition Path Sampling

Jungyoon Lee

Michael Plainer

Yuanqi Du

Lars Holdijk

Rob Brekelmans

Carla P Gomes

Dominique Beaini

Kirill Neklyudov

Transition path sampling (TPS) is an important method for studying rare events, such as those that occur in chemical reactions or protein folding. These events occur so infrequently that traditional simulations are often impractical, and even recent machine-learning approaches struggle to address this issue for larger systems. In this paper, we propose using modern deep learning techniques to improve the scalability of TPS methods significantly. We highlight the need for better evaluations in the existing literature. We start by formulating TPS as a sampling problem over an unnormalized target density and introduce relevant evaluation metrics to assess the effectiveness of TPS solutions from this perspective. To develop a scalable approach, we explore several design choices, including a problem-informed neural network architecture, simulated annealing, the integration of prior knowledge into the sampling process, and attention mechanisms. Finally, we conduct a comprehensive empirical study and compare these design choices with other recently developed deep-learning methods for rare event sampling.

2025-03-04

ICLR.cc/2025/Workshop/GEM (published)

openreview.net

Societal Alignment Frameworks Can Improve LLM Alignment

Karolina Stanczak

Nicholas Meade

Mehar Bhatia

Hattie Zhou

Konstantin Böttinger

Jeremy Barnes

Jason Stanley

Jessica Montgomery

Richard Zemel

Nicolas Papernot

Nicolas Chapados

Denis Therien

Timothy P Lillicrap

Ana Marasovic

Sylvie Delacroix

Gillian K. Hadfield

Siva Reddy

Recent progress in large language models (LLMs) has focused on producing responses that meet human expectations and align with shared values, a process termed alignment. However, aligning LLMs remains challenging due to the inherent disconnect between the complexity of human values and the narrow nature of the technological approaches designed to address them. Current alignment methods often lead to misspecified objectives, reflecting the broader issue of incomplete contracts: the impracticality of specifying a contract between a model developer and the model that accounts for every scenario in LLM alignment. In this paper, we argue that improving LLM alignment requires incorporating insights from societal alignment frameworks, including social, economic, and contractual alignment, and discuss potential solutions drawn from these domains. Given the role of uncertainty within societal alignment frameworks, we then investigate how it manifests in LLM alignment. We end our discussion by offering an alternative view on LLM alignment, framing the underspecified nature of its objectives as an opportunity rather than attempting to perfect their specification. Beyond technical improvements in LLM alignment, we discuss the need for participatory alignment interface designs.

2025-03-04

Bi-Align @ International Conference on Learning Representations (poster)

doi.org

openreview.net
