Yoshua Bengio

Biography

For media requests, please write to medias@mila.quebec.

For more information please contact Cassidy MacNeil, Senior Assistant and Operation Lead at cassidy.macneil@mila.quebec.

Yoshua Bengio is recognized worldwide as a leading expert in AI. He is best known for his pioneering work in deep learning, which earned him the 2018 A.M. Turing Award, “the Nobel Prize of computing,” with Geoffrey Hinton and Yann LeCun.

Bengio is a full professor at Université de Montréal, and the founder and scientific advisor of Mila – Quebec Artificial Intelligence Institute. He is also a senior fellow at CIFAR and co-directs its Learning in Machines & Brains program, serves as special advisor and founding scientific director of IVADO, and holds a Canada CIFAR AI Chair.

In 2019, Bengio was awarded the prestigious Killam Prize and in 2022, he was the most cited computer scientist in the world by h-index. He is a Fellow of the Royal Society of London, Fellow of the Royal Society of Canada, Knight of the Legion of Honor of France and Officer of the Order of Canada. In 2023, he was appointed to the UN’s Scientific Advisory Board for Independent Advice on Breakthroughs in Science and Technology.

Concerned about the social impact of AI, Bengio helped draft the Montréal Declaration for the Responsible Development of Artificial Intelligence and continues to raise awareness about the importance of mitigating the potentially catastrophic risks associated with future AI systems.

Current Students

Jamal Abou Haibeh

Collaborating Alumni - McGill University

Berkes Anaïs

Collaborating researcher - Cambridge University

Principal supervisor :

Rim Assouel

PhD - Université de Montréal

Shahana Chatterjee

Collaborating researcher - N/A

Principal supervisor :

PhD - Université de Montréal

Sanghyeok Choi

Collaborating researcher - KAIST

PhD - Université de Montréal

Collaborating Alumni - Université de Montréal

Co-supervisor :

Loubna Benabbou

Desmond Elliott

Independent visiting researcher

Principal supervisor :

PhD - Université de Montréal

Co-supervisor :

PhD - Université de Montréal

Jean-Pierre Falet

PhD - Université de Montréal

PhD

PhD - Université de Montréal

Moksh Jain

PhD - Université de Montréal

PhD - Université de Montréal

Principal supervisor :

Collaborating Alumni - Université de Montréal

Hyeonah Kim

Postdoctorate - Université de Montréal

Principal supervisor :

Minsu Kim

Research Intern - Université de Montréal

Postdoctorate - Université de Montréal

Principal supervisor :

Collaborating Alumni

Song LIU

Collaborating researcher - N/A

Nikolay Malkin

Collaborating researcher - Université de Montréal

Cristian Dragos Manta

PhD - Université de Montréal

Co-supervisor :

Dhanya Sridhar

Sarthak Mittal

PhD - Université de Montréal

Principal supervisor :

Independent visiting researcher - Université de Montréal

Padideh Nouri

PhD - Université de Montréal

Principal supervisor :

Ali Parviz

Collaborating researcher - Ying Wu Coll of Computing

Lena Podina

Collaborating researcher - University of Waterloo

Principal supervisor :

David Rolnick

Nassim Rahaman

Collaborating Alumni - Max-Planck-Institute for Intelligent Systems

Jarrid Rector-Brooks

PhD - Université de Montréal

Danyal REHMAN

Postdoctorate - Université de Montréal

Oli RICHARDSON

Postdoctorate - Université de Montréal

Camille Rochefort-Boulanger

PhD - Université de Montréal

Principal supervisor :

Julie Hussin

Dragos Secrieru

Collaborating Alumni - Université de Montréal

Divya Sharma

Postdoctorate

Co-supervisor :

Mélisande Astrid Crystal Teng

Vincent Taboga

Collaborating Alumni - Polytechnique Montréal

Co-supervisor :

Collaborating Alumni - Université de Montréal

Co-supervisor :

Hugo Larochelle

Ivan Titov

Collaborating researcher

Principal supervisor :

Siva Reddy

Alex Tong

Collaborating Alumni - Université de Montréal

Collaborating Alumni - Université de Montréal

Co-supervisor :

PhD - Université de Montréal

Principal supervisor :

Collaborating researcher

Collaborating researcher - Université de Montréal

Skipper: Combining Spatial and Temporal Abstraction for Better Generalization

Tianyu Zhang

PhD - Université de Montréal

PhD - McGill University

Principal supervisor :

Harry Zhao

Collaborating Alumni - McGill University

Principal supervisor :

Blog Posts

February 22, 2024

Mingde Harry Zhao

Safa Alver

Harm van Seijen

Romain Laroche

Doina Precup

Yoshua Bengio

Scaling in the Service of Reasoning & Model-Based ML

April 4, 2023

Yoshua Bengio

Edward J. Hu

A collaboration between Mila and Relation Therapeutics to discover novel synergistic combinations of drugs in vitro

March 23, 2022

Paul Bertin

Jake P. Taylor-King

Yoshua Bengio

March 15, 2022

Generative Flow Networks

Yoshua Bengio

Publications

Offline Model-Based Optimization: Comprehensive Review

Jiayao Gu

Zixuan Liu

Can Chen

2025-03-20

ArXiv (preprint)

What makes a theory of consciousness unscientific?

Derek H. Arnold

Mark G. Baxter

Tristan A. Bekinschtein

Vincent Taschereau-Dumouchel

James W. Bisley

Jacob Browning

Dean Buonomano

David Carmel

Marisa Carrasco

Peter Carruthers

Olivia Carter

Dorita H. F. Chang

Ian Charest

Mouslim Cherkaoui

Axel Cleeremans

Michael A. Cohen

Philip R. Corlett

Kalina Christoff

Sam Cumming

Cody A. Cushing

Beatrice de Gelder

Felipe De Brigard

Daniel C. Dennett

Nadine Dijkstra

Adrien Doerig

Paul E. Dux

Stephen M. Fleming

Keith Frankish

Chris D. Frith

Sarah Garfinkel

Melvyn A. Goodale

Jacqueline Gottlieb

Jake R. Hanson

Ran R. Hassin

Michael H. Herzog

Cecilia Heyes

Po-Jang Hsieh

Shao-Min Hung

Robert Kentridge

Tomas Knapen

Nikos Konstantinou

Konrad Kording

Timo L. Kvamme

Sze Chai Kwok

Renzo C. Lanfranco

Hakwan Lau

Joseph LeDoux

Alan L. F. Lee

Camilo Libedinsky

Matthew D. Lieberman

Ying-Tung Lin

Ka-Yuet Liu

Maro G. Machizawa

Julio Martinez-Trujillo

Janet Metcalfe

Matthias Michel

Kenneth D. Miller

Partha P. Mitra

Dean Mobbs

Robert M. Mok

Jorge Morales

Myrto Mylopoulos

Brian Odegaard

Charles C.-F. Or

Adrian M. Owen

David Pereplyotchik

Franco Pestilli

Megan A. K. Peters

Ian Phillips

Rosanne L. Rademaker

Dobromir Rahnev

Geraint Rees

Dario L. Ringach

Adina Roskies

Daniela Schiller

Aaron Schurger

D. Samuel Schwarzkopf

Ryan B. Scott

Aaron R. Seitz

Joshua Shepherd

Juha Silvanto

Heleen A. Slagter

Barry C. Smith

Guillermo Solovey

David Soto

Hugo Spiers

Timo Stein

Frank Tong

Peter U. Tse

Jonas Vibell

Sebastian Watzl

Taylor Webb

Josh Weisberg

Thalia Wheatley

Martijn E. Wokke

Michał Klincewicz

Tony Cheng

Michael Schmitz

Miguel Ángel Sebastián

Joel S. Snyder

2025-03-09

Nature Neuroscience (unknown)

Shaping Inductive Bias in Diffusion Models through Frequency-Based Noise Control

Thomas Jiralerspong

Berton Earnshaw

Jason Hartford

Luca Scimeca

Diffusion Probabilistic Models (DPMs) are powerful generative models that have achieved unparalleled success in a number of generative tasks. In this work, we aim to build inductive biases into the training and sampling of diffusion models to better accommodate the target distribution of the data to model. For topologically structured data, we devise a frequency-based noising operator to purposefully manipulate, and set, these inductive biases. We first show that appropriate manipulations of the noising forward process can lead DPMs to focus on particular aspects of the distribution to learn. We show that different datasets necessitate different inductive biases, and that appropriate frequency-based noise control induces increased generative performance compared to standard diffusion. Finally, we demonstrate the possibility of ignoring information at particular frequencies while learning. We show this in an image corruption and recovery task, where we train a DPM to recover the original target distribution after severe noise corruption.

2025-03-05

ICLR.cc/2025/Workshop/DeLTa (poster)

Laurence Perreault-Levasseur

Solving Bayesian Inverse Problems with Diffusion Priors and Off-Policy RL

Glen Berseth

Nikolay Malkin

This paper presents a practical application of Relative Trajectory Balance (RTB), a recently introduced off-policy reinforcement learning (RL) objective that can asymptotically solve Bayesian inverse problems optimally. We extend the original work by using RTB to train conditional diffusion model posteriors from pretrained unconditional priors for challenging linear and non-linear inverse problems in vision and science. We use the objective alongside techniques such as off-policy backtracking exploration to improve training. Importantly, our results show that existing training-free diffusion posterior methods struggle to perform effective posterior inference in latent space due to inherent biases.

2025-03-05

ICLR.cc/2025/Workshop/DeLTa (poster)

AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Document Understanding

Juan A. Rodriguez

Chao Wang

Akshay Kalkunte Suresh

Abhay Puri

Xiangru Jian

Pierre-Andre Noel

Sathwik Tejaswi Madhusudhan

Enamul Hoque

Christopher Pal

Issam H. Laradji

David Vázquez

Perouz Taslakian

Spandana Gella

Sai Rajeswar

Aligning visual features with language embeddings is a key challenge in vision-language models (VLMs). The performance of such models hinges on having a good connector that maps visual features generated by a vision encoder to a shared embedding space with the LLM while preserving semantic similarity. Existing connectors, such as multilayer perceptrons (MLPs), lack inductive bias to constrain visual features within the linguistic structure of the LLM's embedding space, making them data-hungry and prone to cross-modal misalignment. In this work, we propose a novel vision-text alignment method, AlignVLM, that maps visual features to a weighted average of LLM text embeddings. Our approach leverages the linguistic priors encoded by the LLM to ensure that visual features are mapped to regions of the space that the LLM can effectively interpret. AlignVLM is particularly effective for document understanding tasks, where visual and textual modalities are highly correlated. Our extensive experiments show that AlignVLM achieves state-of-the-art performance compared to prior alignment methods, with larger gains on document understanding tasks and under low-resource setups. We provide further analysis demonstrating its efficiency and robustness to noise.

2025-03-04

ICLR.cc/2025/Workshop/Re-Align (poster)

Learning Decision Trees as Amortized Structure Inference

Mohammed Mahfoud

Ghait Boukachab

Michał Koziarski

Stefan Bauer

Nikolay Malkin

2025-03-04

ICLR.cc/2025/Workshop/FPI (poster)

EarthView: A Large Scale Remote Sensing Dataset for Self-Supervision

Diego Velazquez

Pau Rodríguez

Sergio Alonso

Josep M. Gonfaus

Jordi Gonzalez

Gerardo Richarte

Javier Marin

Alexandre Lacoste

This paper presents EarthView, a comprehensive dataset specifically designed for self-supervision on remote sensing data, intended to enhance deep learning applications on Earth monitoring tasks. The dataset spans 15 tera pixels of global remote-sensing data, combining imagery from a diverse range of sources, including NEON, Sentinel, and a novel release of 1m spatial resolution data from Satellogic. Our dataset provides a wide spectrum of image data with varying resolutions, harnessed from different sensors and organized coherently into an accessible HuggingFace dataset in parquet format. This data spans five years, from 2017 to 2022. Accompanying the dataset, we introduce EarthMAE, a tailored Masked Autoencoder, developed to tackle the distinct challenges of remote sensing data. Trained in a self-supervised fashion, EarthMAE effectively processes different data modalities such as hyperspectral, multispectral, topographical data, segmentation maps, and temporal structure. This model helps us show that pre-training on Satellogic data improves performance on downstream tasks. While there is still a gap to fill in MAE for heterogeneous data, we regard this innovative combination of an expansive, diverse dataset and a versatile model adapted for self-supervised learning as a stride forward in deep learning for Earth monitoring.

2025-03-03

2025 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW) (published)

A physics-based data-driven model for CO$_2$ gas diffusion electrodes to drive automated laboratories

Ivan Grega

Félix Therrien

Abhishek Soni

Karry Ocean

Kevan Dettelbach

Ribwar Ahmadi

Mehrdad Mokhtari

Curtis P. Berlinguette

The electrochemical reduction of atmospheric CO…

2025-03-02

ICLR.cc/2025/Workshop/AI4MAT (poster)

OBELiX: a curated dataset of crystal structures and experimentally measured ionic conductivities for lithium solid-state electrolytes

Félix Therrien

Jamal Abou Haibeh

Divya Sharma

Rhiannon Hendley

Leah Wairimu Mungai

Sun Sun

Alain Tchagang

Jiang Su

Samuel Huberman

Hongyu Guo

Homin Shin

OBELiX is a database of 599 synthesized solid electrolyte materials and their experimentally measured room temperature ionic conductivities gathered from literature and curated by domain experts.

2025-02-19

arXiv (preprint)

In-Context Parametric Inference: Point or Distribution Estimators?

Bayesian and frequentist inference are two fundamental paradigms in statistical estimation. Bayesian methods treat hypotheses as random variables, incorporating priors and updating beliefs via Bayes' theorem, whereas frequentist methods assume fixed but unknown hypotheses, relying on estimators like maximum likelihood. While extensive research has compared these approaches, the frequentist paradigm of obtaining point estimates has become predominant in deep learning, as Bayesian inference is challenging due to the computational complexity and the approximation gap of posterior estimation methods. However, a good understanding of trade-offs between the two approaches is lacking in the regime of amortized estimators, where in-context learners are trained to estimate either point values via maximum likelihood or maximum a posteriori estimation, or full posteriors using normalizing flows, score-based diffusion samplers, or diagonal Gaussian approximations, conditioned on observations. To help resolve this, we conduct a rigorous comparative analysis spanning diverse problem settings, from linear models to shallow neural networks, with a robust evaluation framework assessing both in-distribution and out-of-distribution generalization on tractable tasks. Our experiments indicate that amortized point estimators generally outperform posterior inference, though the latter remain competitive in some low-dimensional problems, and we further discuss why this might be the case.

2025-02-16

ArXiv (preprint)

Laurence Perreault-Levasseur

Causal Discovery in Astrophysics: Unraveling Supermassive Black Hole and Galaxy Coevolution

Zehao Jin

Mario Pasquato

Benjamin L. Davis

Tristan Deleu

Yu Luo

Changhyun Cho

Pablo Lemos

Xi Kang

Andrea Valerio Maccio

Yashar Hezaveh

Correlation does not imply causation, but patterns of statistical association between variables can be exploited to infer a causal structure (even with purely observational data) with the burgeoning field of causal discovery. As a purely observational science, astrophysics has much to gain by exploiting these new methods. The supermassive black hole (SMBH)--galaxy interaction has long been constrained by observed scaling relations, that is low-scatter correlations between variables such as SMBH mass and the central velocity dispersion of stars in a host galaxy's bulge. This study, using advanced causal discovery techniques and an up-to-date dataset, reveals a causal link between galaxy properties and dynamically-measured SMBH masses. We apply a score-based Bayesian framework to compute the exact conditional probabilities of every causal structure that could possibly describe our galaxy sample. With the exact posterior distribution, we determine the most likely causal structures and notice a probable causal reversal when separating galaxies by morphology. In elliptical galaxies, bulge properties (built from major mergers) tend to influence SMBH growth, while in spiral galaxies, SMBHs are seen to affect host galaxy properties, potentially through feedback in gas-rich environments. For spiral galaxies, SMBHs progressively quench star formation, whereas in elliptical galaxies, quenching is complete, and the causal connection has reversed. Our findings support theoretical models of hierarchical assembly of galaxies and active galactic nuclei feedback regulating galaxy evolution. Our study suggests the potentiality for further exploration of causal links in astrophysical and cosmological scaling relations, as well as any other observational science.

2025-01-27

The Astrophysical Journal (published)

Action Abstractions for Amortized Sampling

Lena Nehale Ezzine

As trajectories sampled by policies used by reinforcement learning (RL) and generative flow networks (GFlowNets) grow longer, credit assignment and exploration become more challenging, and the long planning horizon hinders mode discovery and generalization. The challenge is particularly pronounced in entropy-seeking RL methods, such as generative flow networks, where the agent must learn to sample from a structured distribution and discover multiple high-reward states, each of which takes many steps to reach. To tackle this challenge, we propose an approach to incorporate the discovery of action abstractions, or high-level actions, into the policy optimization process. Our approach involves iteratively extracting action subsequences commonly used across many high-reward trajectories and `chunking' them into a single action that is added to the action space. In empirical evaluation on synthetic and real-world environments, our approach demonstrates improved sample efficiency performance in discovering diverse high-reward objects, especially on harder exploration problems. We also observe that the abstracted high-order actions are interpretable, capturing the latent structure of the reward landscape of the action space. This work provides a cognitively motivated approach to action abstraction in RL and is the first demonstration of hierarchical planning in amortized sequential sampling.

2025-01-21

International Conference on Learning Representations (poster)