Publications

Low Compute Unlearning via Sparse Representations

Vedant Shah

Frederik Träuble

Ashish Malik

Hugo Larochelle

Michael Curtis Mozer

Sanjeev Arora

Yoshua Bengio

Anirudh Goyal

Machine unlearning, which involves erasing knowledge about a forget set from a trained model, can prove to be costly and infeasible using existing techniques. We propose a low-compute unlearning technique based on a discrete representational bottleneck. We show that the proposed technique efficiently unlearns the forget set and incurs negligible damage to the model's performance on the rest of the dataset. We evaluate the proposed technique on the problem of class unlearning using four datasets: CIFAR-10, CIFAR-100, LACUNA-100 and ImageNet-1k. We compare the proposed technique to SCRUB, a state-of-the-art approach which uses knowledge distillation for unlearning. Across all four datasets, the proposed technique performs as well as, if not better than, SCRUB while incurring almost no computational cost.

2025-09-15

TMLR (accepted)

openreview.net

Pseudo-Asynchronous Local SGD: Robust and Efficient Data-Parallel Training

Hiroki Naganuma

Xinzhi Zhang

Man-Chung Yue

Ioannis Mitliagkas

Russell J. Hewett

Philipp Andre Witte

Yin Tat Lee

2025-09-15

TMLR (accepted)

openreview.net

Building a General SimCLR Self-Supervised Foundation Model Across Neurological Diseases to Advance 3D Brain MRI Diagnoses

3D structural Magnetic Resonance Imaging (MRI) brain scans are commonly acquired in clinical settings to monitor a wide range of neurological conditions, including neurodegenerative disorders and stroke. While deep learning models have shown promising results analyzing 3D MRI across a number of brain imaging tasks, most are highly tailored for specific tasks with limited labeled data, and are not able to generalize across tasks and/or populations. The development of self-supervised learning (SSL) has enabled the creation of large medical foundation models that leverage diverse, unlabeled datasets ranging from healthy to diseased data, showing significant success in 2D medical imaging applications. However, even the very few foundation models for 3D brain MRI that have been developed remain limited in resolution, scope, or accessibility. In this work, we present a general, high-resolution SimCLR-based SSL foundation model for 3D brain structural MRI, pre-trained on 18,759 patients (44,958 scans) from 11 publicly available datasets spanning diverse neurological diseases. We compare our model to Masked Autoencoders (MAE), as well as two supervised baselines, on four diverse downstream prediction tasks in both in-distribution and out-of-distribution settings. Our fine-tuned SimCLR model outperforms all other models across all tasks. Notably, our model still achieves superior performance when fine-tuned using only 20% of labeled training samples for predicting Alzheimer's disease. We use publicly available code and data, and release our trained model at https://github.com/emilykaczmarek/3D-Neuro-SimCLR, contributing a broadly applicable and accessible foundation model for clinical brain MRI analysis.

2025-09-12

arXiv (preprint)

arxiv.org

SSL-AD: Spatiotemporal Self-Supervised Learning for Generalizability and Adaptability Across Alzheimer's Prediction Tasks and Datasets

Alzheimer's disease is a progressive, neurodegenerative disorder that causes memory loss and cognitive decline. While there has been extensive research in applying deep learning models to Alzheimer's prediction tasks, these models remain limited by lack of available labeled data, poor generalization across datasets, and inflexibility to varying numbers of input scans and time intervals between scans. In this study, we adapt three state-of-the-art temporal self-supervised learning (SSL) approaches for 3D brain MRI analysis, and add novel extensions designed to handle variable-length inputs and learn robust spatial features. We aggregate four publicly available datasets comprising 3,161 patients for pre-training, and show the performance of our model across multiple Alzheimer's prediction tasks including diagnosis classification, conversion detection, and future conversion prediction. Importantly, our SSL model implemented with temporal order prediction and contrastive learning outperforms supervised learning on six out of seven downstream tasks. It demonstrates adaptability and generalizability across tasks and number of input images with varying time intervals, highlighting its capacity for robust performance across clinical applications. We release our code and model publicly at https://github.com/emilykaczmarek/SSL-AD.

2025-09-12

arXiv (preprint)

arxiv.org

Fused Lasso Improves Accuracy of Co-occurrence Network Inference in Grouped Samples

Daniel Agyapong

Briana H. Beatty

Peter G. Kennedy

Toby Dylan Hocking

Co-occurrence network inference algorithms have significantly advanced our understanding of microbiome communities. However, these algorithms typically analyze microbial associations within samples collected from a single environmental niche, often capturing only static snapshots rather than dynamic microbial processes. Previous studies have commonly grouped samples from different environmental niches together without fully considering how microbial communities adapt their associations when faced with varying ecological conditions. Our study addresses this limitation by explicitly investigating both spatial and temporal dynamics of microbial communities. We analyzed publicly available microbiome abundance data across multiple locations and time points, to evaluate algorithm performance in predicting microbial associations using our proposed Same-All Cross-validation (SAC) framework. SAC evaluates algorithms in two distinct scenarios: training and testing within the same environmental niche (Same), and training and testing on combined data from multiple environmental niches (All). To overcome the limitations of conventional algorithms, we propose fuser, an algorithm that, while not entirely new in machine learning, is novel for microbiome community network inference. It retains subsample-specific signals while simultaneously sharing relevant information across environments during training. Unlike standard approaches that infer a single generalized network from combined data, fuser generates distinct, environment-specific predictive networks. Our results demonstrate that fuser achieves comparable predictive performance to existing algorithms such as glmnet when evaluated within homogeneous environments (Same), and notably reduces test error compared to baseline algorithms in cross-environment (All) scenarios.

2025-09-11

arXiv (preprint)

arxiv.org

OpenFake: An Open Dataset and Platform Toward Large-Scale Deepfake Detection

Akshatha Arodi

Gaétan Marceau Caron

Jean-François Godbout

Reihaneh Rabbany

Deepfakes, synthetic media created using advanced AI techniques, have intensified the spread of misinformation, particularly in politically sensitive contexts. Existing deepfake detection datasets are often limited, relying on outdated generation methods, low realism, or single-face imagery, restricting the effectiveness for general synthetic image detection. By analyzing social media posts, we identify multiple modalities through which deepfakes propagate misinformation. Furthermore, our human perception study demonstrates that recently developed proprietary models produce synthetic images increasingly indistinguishable from real ones, complicating accurate identification by the general public. Consequently, we present a comprehensive, politically-focused dataset specifically crafted for benchmarking detection against modern generative models. This dataset contains three million real images paired with descriptive captions, which are used for generating 963k corresponding high-quality synthetic images from a mix of proprietary and open-source models. Recognizing the continual evolution of generative techniques, we introduce an innovative crowdsourced adversarial platform, where participants are incentivized to generate and submit challenging synthetic images. This ongoing community-driven initiative ensures that deepfake detection methods remain robust and adaptive, proactively safeguarding public discourse from sophisticated misinformation threats.

2025-09-11

arXiv (preprint)

arxiv.org

FairFLRep: Fairness aware fault localization and repair of Deep Neural Networks

Moses Openja

Paolo Arcaini

Foutse Khomh

Fuyuki Ishikawa

2025-09-10

ACM Transactions on Software Engineering and Methodology (published)

doi.org

arxiv.org

EPISeg: Automated segmentation of the spinal cord on echo planar images using open-access multi-center data

Rohan Banerjee

Merve Kaptan

Alexandra Tinnermann

Ali Khatibi

Alice Dabbagh

Christian Büchel

Christian W. Kündig

Christine S. W. Law

Dario Pfyffer

David J. Lythgoe

Dimitra Tsivaka

Dimitri Van De Ville

Falk Eippert

Fauziyya Muhammad

Gary H. Glover

Gergely David

Grace Haynes

Jan Haaker

Jonathan C. W. Brooks

Jürgen Finsterbusch

Katherine T. Martucci

Kimberly J. Hemmerling

Mahdi Mobarak-Abadi

Mark A. Hoggarth

Matthew A. Howard

Molly G. Bright

Nawal Kinany

Olivia S. Kowalczyk

Patrick Freund

Robert L. Barry

Sean Mackey

Shahabeddin Vahdat

Simon Schading

Stephen B. McMahon

Todd Parish

Véronique Marchand-Pauvert

Yufen Chen

Zachary A. Smith

Kenneth A. Weber

Benjamin De Leener

Julien Cohen-Adad

Functional magnetic resonance imaging (fMRI) of the spinal cord is relevant for studying sensation, movement, and autonomic function. Preprocessing of spinal cord fMRI data involves segmentation of the spinal cord on gradient-echo echo planar imaging (EPI) images. Current automated segmentation methods do not work well on these data, due to the low spatial resolution, susceptibility artifacts causing distortions and signal drop-out, ghosting, and motion-related artifacts. Consequently, this segmentation task demands a considerable amount of manual effort which takes time and is prone to user bias. In this work, we (i) gathered a multi-center dataset of spinal cord gradient-echo EPI with ground-truth segmentations and shared it on OpenNeuro https://openneuro.org/datasets/ds005143/versions/1.3.0, and (ii) developed a deep learning-based model, EPISeg, for the automatic segmentation of the spinal cord on gradient-echo EPI data. We observe a significant improvement in terms of segmentation quality compared to other available spinal cord segmentation models. Our model is resilient to different acquisition protocols as well as commonly observed artifacts in fMRI data. The training code is available at https://github.com/sct-pipeline/fmri-segmentation/, and the model has been integrated into the Spinal Cord Toolbox as a command-line tool.

2025-09-09

Imaging Neuroscience (published)

doi.org

RL Fine-Tuning Heals OOD Forgetting in SFT

Hangzhan Jin

Sicheng Lyu

Mohammad Hamdaqa

2025-09-08

arXiv (preprint)

arxiv.org

An AI system to help scientists write expert-level empirical software

Eser Aygün

Anastasiya Belyaeva

Gheorghe Comanici

Marc Coram

Hao Cui

Jake Garrison

Renee Johnston

Anton Kast

Cory Y. McLean

Peter C. Norgaard

Zahra Shamsi

David Smalling

James Thompson

Subhashini Venugopalan

Brian P Williams

Chujun He

Sarah Martinson

Martyna Plomecka

Lai Wei

Yuchen Zhou

Qian-Ze Zhu

Matthew Abraham

Erica Brand

Anna Bulanova

Jeff Cardille

Chris Co

Scott Ellsworth

Grace Joseph

Malcolm Kane

Ryan K. Krueger

Johan Kartiwa

D. Liebling

Jan-Matthis Lueckmann

Paul Raccuglia

Xuefei Wang

Katherine Chou

James Manyika

Yossi Matias

J.C. Platt

Lizzie Dorfman

Shibl Mourad

Michael P. Brenner

The cycle of scientific discovery is frequently bottlenecked by the slow, manual creation of software to support computational experiments. To address this, we present an AI system that creates expert-level scientific software whose goal is to maximize a quality metric. The system uses a Large Language Model (LLM) and Tree Search (TS) to systematically improve the quality metric and intelligently navigate the large space of possible solutions. The system achieves expert-level results when it explores and integrates complex research ideas from external sources. The effectiveness of tree search is demonstrated across a wide range of benchmarks. In bioinformatics, it discovered 40 novel methods for single-cell data analysis that outperformed the top human-developed methods on a public leaderboard. In epidemiology, it generated 14 models that outperformed the CDC ensemble and all other individual models for forecasting COVID-19 hospitalizations. Our method also produced state-of-the-art software for geospatial analysis, neural activity prediction in zebrafish, time series forecasting and numerical solution of integrals. By devising and implementing novel solutions to diverse tasks, the system represents a significant step towards accelerating scientific progress.

2025-09-08

arXiv (preprint)

arxiv.org

Discrete Audio Tokens: More Than a Survey!

Pooneh Mousavi

Gallil Maimon

Adel Moumen

Darius Petermann

Jiatong Shi

Haibin Wu

Haici Yang

Anastasia Kuznetsova

Artem Ploujnikov

Ricard Marxer

Bhuvana Ramabhadran

Benjamin Elizalde

Loren Lugosch

Jinyu Li

Cem Subakan

Phil Woodland

Minje Kim

Hung-yi Lee

Shinji Watanabe

Yossi Adi

Mirco Ravanelli

Discrete audio tokens are compact representations that aim to preserve perceptual quality, phonetic content, and speaker characteristics while enabling efficient storage and inference, as well as competitive performance across diverse downstream tasks. They provide a practical alternative to continuous features, enabling the integration of speech and audio into modern large language models (LLMs). As interest in token-based audio processing grows, various tokenization methods have emerged, and several surveys have reviewed the latest progress in the field. However, existing studies often focus on specific domains or tasks and lack a unified comparison across various benchmarks. This paper presents a systematic review and benchmark of discrete audio tokenizers, covering three domains: speech, music, and general audio. We propose a taxonomy of tokenization approaches based on encoder-decoder, quantization techniques, training paradigm, streamability, and application domains. We evaluate tokenizers on multiple benchmarks for reconstruction, downstream performance, and acoustic language modeling, and analyze trade-offs through controlled ablation studies. Our findings highlight key limitations, practical considerations, and open challenges, providing insight and guidance for future research in this rapidly evolving area. For more information, including our main results and tokenizer database, please refer to our website: https://poonehmousavi.github.io/dates-website/.

2025-09-07

TMLR (accepted)

doi.org

openreview.net

Imagining Alternatives: Towards High-Resolution 3D Counterfactual Medical Image Generation via Language Guidance

Mohamed Mohamed

Brennan Nichyporuk

Douglas Arnold

Tal Arbel

Vision-language models have demonstrated impressive capabilities in generating 2D images under various conditions; however, the impressive performance of these models in 2D is largely enabled by extensive, readily available pretrained foundation models. Critically, comparable pretrained foundation models do not exist for 3D, significantly limiting progress in this domain. As a result, the potential of vision-language models to produce high-resolution 3D counterfactual medical images conditioned solely on natural language descriptions remains completely unexplored. Addressing this gap would enable powerful clinical and research applications, such as personalized counterfactual explanations, simulation of disease progression scenarios, and enhanced medical training by visualizing hypothetical medical conditions in realistic detail. Our work takes a meaningful step toward addressing this challenge by introducing a framework capable of generating high-resolution 3D counterfactual medical images of synthesized patients guided by free-form language prompts. We adapt state-of-the-art 3D diffusion models with enhancements from Simple Diffusion and incorporate augmented conditioning to improve text alignment and image quality. To our knowledge, this represents the first demonstration of a language-guided native-3D diffusion model applied specifically to neurological imaging data, where faithful three-dimensional modeling is essential to represent the brain's three-dimensional structure. Through results on two distinct neurological MRI datasets, our framework successfully simulates varying counterfactual lesion loads in Multiple Sclerosis (MS), and cognitive states in Alzheimer's disease, generating high-quality images while preserving subject fidelity in synthetically generated medical images. Our results lay the groundwork for prompt-driven disease progression analysis within 3D medical imaging.

2025-09-07

arXiv (preprint)

arxiv.org
