Publications

UNLEARNING GEO-CULTURAL STEREOTYPES IN MULTILINGUAL LLMS

Alireza Dehghanpour Farashah

Aditi Khandelwal

As multilingual generative models become more widely used, most safety and fairness evaluation techniques still focus on English-language resources, while overlooking important cross-cultural factors. This limitation raises concerns about fairness and safety, particularly regarding geoculturally situated stereotypes that hinder the models’ global inclusivity. In this work, we present preliminary findings on the impact of stereotype unlearning across languages, specifically in English, French, and Hindi. Using an adapted version of the SeeGULL dataset, we analyze how unlearning stereotypes in one language influences other languages within multilingual large language models. Our study evaluates two model families, Llama-3.1-8B and Aya-Expanse-8B, to assess whether unlearning in one linguistic context transfers across languages, potentially mitigating or exacerbating biases in multilingual settings.

2025-03-05

ICLR.cc/2025/Workshop/BuildingTrust (accepted)

openreview.net

WebMMU: A Benchmark for Multimodal Multilingual Website Understanding and Code Generation

Rabiul Awal

Mahsa Massoud

Zichao Li

Aarash Feizi

Suyuchen Wang

Chris Pal

Aishwarya Agrawal

David Vazquez

Siva Reddy

Juan A. Rodriguez

Perouz Taslakian

Spandana Gella

Sai Rajeswar

Understanding diverse web data and automating web development presents an exciting challenge for agentic AI. While existing benchmarks address isolated web-based tasks, such as website-based Visual Question Answering (VQA) and UI-to-code generation, they lack a unified evaluation suite for assessing web agents that interact with and reason about web environments. We introduce WebMMU, a large-scale benchmark for evaluating AI-driven web agents across multilingual website VQA, HTML/CSS/JavaScript code editing, and sketch-to-code generation. WebMMU provides a comprehensive evaluation suite with real-world website data, multi-step reasoning tasks, and functional UI understanding. Benchmarking state-of-the-art multimodal models on WebMMU reveals significant limitations in web-based reasoning, layout understanding, and structured code generation, particularly in preserving UI hierarchy, handling multilingual content, and producing robust, functional code. While most existing models are optimized for English-only settings, WebMMU highlights the challenges of cross-lingual adaptation in real-world web development. These findings expose critical gaps in current models’ ability to understand website structures, execute user instructions, and generate high-quality web code, underscoring the need for more advanced multimodal reasoning in AI-driven web understanding and development.

2025-03-05

ICLR.cc/2025/Workshop/DL4C (published)

openreview.net

Automated diagnosis of usual interstitial pneumonia on chest CT via the mean curvature of isophotes

Peter Savadjiev

Morteza Rezanejad

Sahir Bhatnagar

David Camirand

Claude Kauffmann

Kaleem Siddiqi

Ronald J Dandurand

Patrick Bourgouin

Carl Chartrand-Lefebvre

Alexandre Semionov

2025-03-04

medRxiv (preprint)

doi.org

AI Automatons: AI Systems Intended to Imitate Humans

Alexandra Olteanu

Solon Barocas

Su Lin Blodgett

Lisa Egede

Alicia DeVrio

Myra Cheng

There is a growing proliferation of AI systems designed to mimic people's behavior, work, abilities, likenesses, or humanness -- systems we dub AI automatons. Individuals, groups, or generic humans are being simulated to produce creative work in their styles, to respond to surveys in their places, to probe how they would use a new system before deployment, to provide users with assistance and companionship, and to anticipate their possible future behavior and interactions with others, just to name a few applications. The research, design, deployment, and availability of such AI systems have, however, also prompted growing concerns about a wide range of possible legal, ethical, and other social impacts. To both 1) facilitate productive discussions about whether, when, and how to design and deploy such systems, and 2) chart the current landscape of existing and prospective AI automatons, we need to tease apart determinant design axes and considerations that can aid our understanding of whether and how various design choices along these axes could mitigate -- or instead exacerbate -- potential adverse impacts that the development and use of AI automatons could give rise to. In this paper, through a synthesis of related literature and extensive examples of existing AI systems intended to mimic humans, we develop a conceptual framework to help foreground key axes of design variations and provide analytical scaffolding to foster greater recognition of the design choices available to developers, as well as the possible ethical implications these choices might have.

2025-03-04

ArXiv (preprint)

arxiv.org

Beyond Cosine Decay: On the effectiveness of Infinite Learning Rate Schedule for Continual Pre-training

Vaibhav Singh

Paul Janson

Paria Mehrbod

Adam Ibrahim

Irina Rish

Eugene Belilovsky

Benjamin Thérien

The ever-growing availability of unlabeled data presents both opportunities and challenges for training artificial intelligence systems. While self-supervised learning (SSL) has emerged as a powerful paradigm for extracting meaningful representations from vast amounts of unlabeled data, existing methods still struggle to adapt to the non-stationary, non-IID nature of real-world data streams without forgetting previously learned knowledge. Recent works have adopted a repeated cosine annealing schedule for large-scale continual pre-training; however, these schedules (1) inherently cause forgetting during the re-warming phase and (2) have not been systematically compared to existing continual SSL methods. In this work, we systematically compare the widely used cosine schedule with the recently proposed infinite learning rate schedule and empirically find the latter to be a more effective alternative. Our extensive empirical evaluation across diverse image and language datasets demonstrates that the infinite learning rate schedule consistently enhances continual pre-training performance compared to a repeated cosine decay without being restricted to a fixed iteration budget. For instance, in a small-scale MAE pre-training setup, it outperforms several strong baselines from the literature. We then scale up our experiments to larger MAE pre-training and autoregressive language model pre-training. Our results show that the infinite learning rate schedule remains effective at scale, surpassing repeated cosine decay for both MAE pre-training and zero-shot LM benchmarks.

2025-03-04

ArXiv (preprint)

arxiv.org

Considerations and recommendations from the ISMRM diffusion study group for preclinical diffusion MRI: Part 2—Ex vivo imaging: Added value and acquisition

Kurt G Schilling

Francesco Grussu

Andrada Ianus

Brian Hansen

Amy F. D. Howard

Rachel L. C. Barrett

Manisha Aggarwal

Stijn Michielse

Fatima Nasrallah

Warda Syeda

Nian Wang

Jelle Veraart

Alard Roebroeck

Andrew F. Bagdasarian

Cornelius Eichner

Farshid Sepehrband

Jan Zimmermann

Lucas Soustelle

Christien Bowman

Benjamin C. Tendler

Andreea Hertanu

Ben Jeurissen

Marleen Verhoye

Lucio Frydman

Yohan van de Looij

David Hike

Jeff F. Dunn

Karla Miller

Bennett Landman

Noam Shemesh

Arthur Anderson

Emilie McKinnon

Shawna Farquharson

Flavio Dell’Acqua

Carlo Pierpaoli

Ivana Drobnjak

Alexander Leemans

Kevin D. Harkins

Maxime Descoteaux

Duan Xu

Hao Huang

Mathieu D. Santin

Samuel C. Grant

Andre Obenaus

Gene S. Kim

Dan Wu

Denis Le Bihan

Stephen J. Blackband

Luisa Ciobanu

Els Fieremans

Ruiliang Bai

Trygve B. Leergaard

Jiangyang Zhang

Tim B. Dyrby

G. Allan Johnson

Julien Cohen-Adad

Matthew D. Budde

Ileana O. Jelescu

The value of preclinical diffusion MRI (dMRI) is substantial. While dMRI enables in vivo non-invasive characterization of tissue, ex vivo dMRI is increasingly being used to probe tissue microstructure and brain connectivity. Ex vivo dMRI has several experimental advantages including higher SNR and spatial resolution compared to in vivo studies, and enabling more advanced diffusion contrasts for improved microstructure and connectivity characterization. Another major advantage of ex vivo dMRI is the direct comparison with histological data, as a crucial methodological validation. However, there are a number of considerations that must be made when performing ex vivo experiments. The steps from tissue preparation, image acquisition and processing, and interpretation of results are complex, with many decisions that not only differ dramatically from in vivo imaging of small animals, but ultimately affect what questions can be answered using the data. This work represents "Part 2" of a three-part series of recommendations and considerations for preclinical dMRI. We describe best practices for dMRI of ex vivo tissue, with a focus on the value that ex vivo imaging adds to the field of dMRI and considerations in ex vivo image acquisition. We first give general considerations and foundational knowledge that must be considered when designing experiments. We briefly describe differences in specimens and models and discuss why some may be more or less appropriate for different studies. We then give guidelines for ex vivo protocols, including tissue fixation, sample preparation, and MR scanning. In each section, we attempt to provide guidelines and recommendations, but also highlight areas for which no guidelines exist (and why), and where future work should lie. An overarching goal herein is to enhance the rigor and reproducibility of ex vivo dMRI acquisitions and analyses, and thereby advance biomedical knowledge.

2025-03-04

Magnetic Resonance in Medicine (published)

doi.org

arxiv.org

EarthView: A Large Scale Remote Sensing Dataset for Self-Supervision

Diego Velazquez

Pau Rodriguez

Sergio Alonso

Josep M. Gonfaus

Jordi Gonzalez

Gerardo Richarte

Javier Marin

Yoshua Bengio

Alexandre Lacoste

This paper presents EarthView, a comprehensive dataset specifically designed for self-supervision on remote sensing data, intended to enhance deep learning applications on Earth monitoring tasks. The dataset spans 15 tera pixels of global remote-sensing data, combining imagery from a diverse range of sources, including NEON, Sentinel, and a novel release of 1m spatial resolution data from Satellogic. Our dataset provides a wide spectrum of image data with varying resolutions, harnessed from different sensors and organized coherently into an accessible HuggingFace dataset in parquet format. This data spans five years, from 2017 to 2022. Accompanying the dataset, we introduce EarthMAE, a tailored Masked Autoencoder, developed to tackle the distinct challenges of remote sensing data. Trained in a self-supervised fashion, EarthMAE effectively processes different data modalities such as hyperspectral, multispectral, topographical data, segmentation maps, and temporal structure. This model helps us show that pre-training on Satellogic data improves performance on downstream tasks. While there is still a gap to fill in MAE for heterogeneous data, we regard this innovative combination of an expansive, diverse dataset and a versatile model adapted for self-supervised learning as a stride forward in deep learning for Earth monitoring.

2025-03-04

2025 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW) (published)

doi.org

arxiv.org

Feynman-Kac Correctors in Diffusion: Annealing, Guidance, and Product of Experts

Marta Skreta

Tara Akhound-Sadegh

Viktor Ohanesian

Roberto Bondesan

Alan Aspuru-Guzik

Arnaud Doucet

Rob Brekelmans

Alexander Tong

Kirill Neklyudov

While score-based generative models are the model of choice across diverse domains, there are limited tools available for controlling inference-time behavior in a principled manner, e.g. for composing multiple pretrained models. Existing classifier-free guidance methods use a simple heuristic to mix conditional and unconditional scores to approximately sample from conditional distributions. However, such methods do not approximate the intermediate distributions, necessitating additional 'corrector' steps. In this work, we provide an efficient and principled method for sampling from a sequence of annealed, geometric-averaged, or product distributions derived from pretrained score-based models. We derive a weighted simulation scheme which we call Feynman-Kac Correctors (FKCs) based on the celebrated Feynman-Kac formula by carefully accounting for terms in the appropriate partial differential equations (PDEs). To simulate these PDEs, we propose Sequential Monte Carlo (SMC) resampling algorithms that leverage inference-time scaling to improve sampling quality. We empirically demonstrate the utility of our methods by proposing amortized sampling via inference-time temperature annealing, improving multi-objective molecule generation using pretrained models, and improving classifier-free guidance for text-to-image generation. Our code is available at https://github.com/martaskrt/fkc-diffusion.

2025-03-04

ArXiv (preprint)

arxiv.org

Hardware Synthesizable Exceptions using Continuations

Paul Teng

Christophe Dubach

2025-03-04

Proceedings of the 30th Asia and South Pacific Design Automation Conference (published)

doi.org

LLM-Safety Evaluations Lack Robustness

Tim Beyer

Sophie Xhonneux

Simon Geisler

Gauthier Gidel

Leo Schwinn

Stephan Günnemann

In this paper, we argue that current safety alignment research efforts for large language models are hindered by many intertwined sources of noise, such as small datasets, methodological inconsistencies, and unreliable evaluation setups. This can, at times, make it impossible to evaluate and compare attacks and defenses fairly, thereby slowing progress. We systematically analyze the LLM safety evaluation pipeline, covering dataset curation, optimization strategies for automated red-teaming, response generation, and response evaluation using LLM judges. At each stage, we identify key issues and highlight their practical impact. We also propose a set of guidelines for reducing noise and bias in evaluations of future attack and defense papers. Lastly, we offer an opposing perspective, highlighting practical reasons for existing limitations. We believe that addressing the outlined problems in future research will improve the field's ability to generate easily comparable results and make measurable progress.

2025-03-04

ArXiv (preprint)

arxiv.org

CrystalGym: A New Benchmark for Materials Discovery Using Reinforcement Learning

Prashant Govindarajan

Mathieu Reymond

Antoine Clavaud

Mariano Phielipp

Santiago Miret

Sarath Chandar

In silico design and optimization of new materials primarily relies on high-accuracy atomic simulators that perform density functional theory (DFT) calculations. While recent works showcase the strong potential of machine learning to accelerate the material design process, they mostly consist of generative approaches that do not use direct DFT signals as feedback to improve training and generation mainly due to DFT's high computational cost. To aid the adoption of direct DFT signals in the materials design loop through online reinforcement learning (RL), we propose CrystalGym, an open-source RL environment for crystalline material discovery. Using CrystalGym, we benchmark value- and policy-based reinforcement learning algorithms for designing various crystals conditioned on target properties. Concretely, we optimize for challenging properties like the band gap, bulk modulus, and density, which are directly calculated from DFT in the environment. While none of the algorithms we benchmark solve all CrystalGym tasks, our extensive experiments and ablations show different sample efficiencies and ease of convergence to optimality for different algorithms and environment settings. Our goal is for CrystalGym to serve as a test bed for reinforcement learning researchers and material scientists to address these real-world design problems with practical applications. Furthermore, we introduce a novel class of challenges for reinforcement learning methods dealing with time-consuming reward signals, paving the way for future interdisciplinary research for machine learning motivated by real-world applications.

2025-03-03

ICLR.cc/2025/Workshop/AI4MAT (spotlight)

openreview.net
