
Yoshua Bengio

Core Academic Member
Canada CIFAR AI Chair
Full Professor, Université de Montréal, Department of Computer Science and Operations Research
Founder and Scientific Advisor, Leadership Team
Research Topics
Causality
Computational Neuroscience
Deep Learning
Generative Models
Graph Neural Networks
Machine Learning Theory
Medical Machine Learning
Molecular Modeling
Natural Language Processing
Probabilistic Models
Reasoning
Recurrent Neural Networks
Reinforcement Learning
Representation Learning

Biography

For media requests, please write to medias@mila.quebec.

For more information, please contact Marie-Josée Beauchamp, Administrative Assistant, at marie-josee.beauchamp@mila.quebec.

Yoshua Bengio is recognized worldwide as a leading expert in AI. He is best known for his pioneering work in deep learning, which earned him the 2018 A.M. Turing Award, often called "the Nobel Prize of computing," shared with Geoffrey Hinton and Yann LeCun.

Bengio is a full professor at Université de Montréal, and the founder and scientific advisor of Mila – Quebec Artificial Intelligence Institute. He is also a senior fellow at CIFAR and co-directs its Learning in Machines & Brains program, serves as special advisor and founding scientific director of IVADO, and holds a Canada CIFAR AI Chair.

In 2019, Bengio was awarded the prestigious Killam Prize and in 2022, he was the most cited computer scientist in the world by h-index. He is a Fellow of the Royal Society of London, Fellow of the Royal Society of Canada, Knight of the Legion of Honor of France and Officer of the Order of Canada. In 2023, he was appointed to the UN’s Scientific Advisory Board for Independent Advice on Breakthroughs in Science and Technology.

Concerned about the social impact of AI, Bengio helped draft the Montréal Declaration for the Responsible Development of Artificial Intelligence and continues to raise awareness about the importance of mitigating the potentially catastrophic risks associated with future AI systems.

Current Students

The list of current students and collaborators includes PhD and Master's students, postdoctoral researchers, research interns, independent visiting researchers, collaborating researchers, and collaborating alumni, supervised or co-supervised mainly at Université de Montréal, with others affiliated with McGill University, the University of Waterloo, Cambridge University, KAIST, Université du Québec à Rimouski, the Max Planck Institute for Intelligent Systems, the Technical University of Munich, RWTH Aachen University, and the Ying Wu College of Computing.

Publications

Dynamic Inference with Neural Interpreters
Nasim Rahaman
Muhammad Waleed Gondal
Shruti Joshi
Peter Vincent Gehler
Francesco Locatello
Bernhard Schölkopf
Modern neural network architectures can leverage large amounts of data to generalize well within the training distribution. However, they are less capable of systematic generalization to data drawn from unseen but related distributions, a feat that is hypothesized to require compositional reasoning and reuse of knowledge. In this work, we present Neural Interpreters, an architecture that factorizes inference in a self-attention network as a system of modules, which we call _functions_. Inputs to the model are routed through a sequence of functions in a way that is end-to-end learned. The proposed architecture can flexibly compose computation along width and depth, and lends itself well to capacity extension after training. To demonstrate the versatility of Neural Interpreters, we evaluate it in two distinct settings: image classification and visual abstract reasoning on Raven Progressive Matrices. In the former, we show that Neural Interpreters perform on par with the vision transformer using fewer parameters, while being transferable to a new task in a sample-efficient manner. In the latter, we find that Neural Interpreters are competitive with respect to the state-of-the-art in terms of systematic generalization.
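As a rough illustration of the routing idea the abstract describes (not the authors' implementation; all names are hypothetical), the sketch below gives each "function" a signature vector and a small MLP, infers a type vector for every input token, and soft-routes tokens to functions by signature/type similarity.

```python
# Minimal, hypothetical sketch of routing tokens through learned "functions"
# by signature/type similarity. Not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyNeuralInterpreter(nn.Module):
    def __init__(self, dim=64, n_functions=4, type_dim=16):
        super().__init__()
        self.type_inference = nn.Linear(dim, type_dim)          # token -> type vector
        self.signatures = nn.Parameter(torch.randn(n_functions, type_dim))
        self.functions = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
            for _ in range(n_functions)
        )

    def forward(self, x):                                        # x: (batch, tokens, dim)
        types = F.normalize(self.type_inference(x), dim=-1)      # (B, T, type_dim)
        sigs = F.normalize(self.signatures, dim=-1)              # (F, type_dim)
        routing = torch.softmax(types @ sigs.t(), dim=-1)        # (B, T, F) soft routing
        outputs = torch.stack([f(x) for f in self.functions], dim=-2)   # (B, T, F, dim)
        return x + (routing.unsqueeze(-1) * outputs).sum(dim=-2)        # weighted mixture

x = torch.randn(2, 10, 64)
print(TinyNeuralInterpreter()(x).shape)  # torch.Size([2, 10, 64])
```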
Explaining by Analogy: Case-based Abductive Natural Language Inference
Ruben Cartuyvels
Graham Spinks
Marie Francine
Existing accounts of explanation emphasise the role of prior experience and analogy in the solution of new problems. However, most of the contemporary models for multi-hop textual inference construct explanations considering each test case in isolation. This paradigm is known to suffer from semantic drift, which causes the construction of spurious explanations leading to wrong predictions. In contrast, we propose an abductive framework for multi-hop inference that adopts the retrieve-reuse-revise paradigm largely studied in case-based reasoning. Specifically, we present ETNA (Explanation by Analogy), a novel model that addresses unseen inference problems by retrieving and adapting prior explanations from similar training examples. We empirically evaluate the case-based abductive framework on downstream commonsense and scientific reasoning tasks. Our experiments demonstrate that ETNA can be effectively integrated with sparse and dense encoding mechanisms or downstream transformers, achieving strong performance when compared to existing explainable approaches. Moreover, we study the impact of the retrieve-reuse-revise paradigm on explainability and semantic drift, showing that it boosts the quality of the constructed explanations, resulting in improved downstream inference performance.
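The "retrieve" step of the retrieve-reuse-revise loop can be illustrated with a plain sparse-retrieval baseline. The snippet below is a generic sketch with made-up data (not the ETNA model): it finds the most similar solved training question via TF-IDF cosine similarity and reuses its explanation as a starting point.

```python
# Hypothetical sketch of the "retrieve" step: find the most similar solved
# training case and reuse its explanation. Not the ETNA implementation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

train_questions = [
    "why does a metal spoon feel colder than a wooden one",
    "why do plants need sunlight to grow",
]
train_explanations = [
    "metal conducts heat away from the hand faster than wood",
    "sunlight provides the energy for photosynthesis",
]

vectorizer = TfidfVectorizer()
train_vecs = vectorizer.fit_transform(train_questions)

def retrieve_explanation(query):
    sims = cosine_similarity(vectorizer.transform([query]), train_vecs)[0]
    return train_explanations[sims.argmax()]   # "reuse"; a "revise" step would adapt it

print(retrieve_explanation("why does a steel handle feel cold"))
```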
Exploring the Wasserstein metric for time-to-event analysis.
Tristan Sylvain
Margaux Luck
Joseph Paul Cohen
Heloise Cardinal
Andrea Lodi
Exploring the Wasserstein metric for survival analysis
Tristan Sylvain
Margaux Luck
Joseph Paul Cohen
Andrea Lodi
Survival analysis is a type of semi-supervised task where the target output (the survival time) is often right-censored. Utilizing this information is a challenge because it is not obvious how to correctly incorporate these censored examples into a model. We study how three categories of loss functions can take advantage of this information: partial likelihood methods, rank methods, and our own classification method based on a Wasserstein metric (WM) and the non-parametric Kaplan-Meier (KM) estimate of the probability density to impute the labels of censored examples. The proposed method predicts the probability distribution of an event, letting us compute survival curves and expected times of survival that are easier to interpret than the rank. We also demonstrate that this approach directly optimizes the expected C-index, which is the most common evaluation metric for survival models.
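For intuition only (assumed helper names and toy data, not the paper's code), the sketch below shows the two ingredients the abstract combines: a Kaplan-Meier survival curve estimated from right-censored data, and a Wasserstein distance between a predicted event-time distribution and an observed event time.

```python
# Sketch: Kaplan-Meier estimate from right-censored data, plus a Wasserstein
# distance between a predicted event-time distribution and a target. Illustrative only.
import numpy as np
from scipy.stats import wasserstein_distance

def kaplan_meier(times, events):
    """Kaplan-Meier survival curve; events: 1 = event observed, 0 = censored."""
    surv, out_t, out_s = 1.0, [], []
    for t in np.unique(times):
        d = np.sum((times == t) & (events == 1))   # events at time t
        n = np.sum(times >= t)                     # subjects still at risk at t
        if d > 0:
            surv *= 1.0 - d / n
            out_t.append(t)
            out_s.append(surv)
    return np.array(out_t), np.array(out_s)

times = np.array([2.0, 3.0, 3.0, 5.0, 8.0, 9.0])
events = np.array([1, 1, 0, 1, 0, 1])
print(kaplan_meier(times, events))

# Wasserstein distance between a predicted distribution over time bins and an
# observed (uncensored) event time treated as a point mass.
bins = np.arange(0.0, 10.0)
pred = np.array([0.0, 0.05, 0.2, 0.3, 0.2, 0.1, 0.05, 0.05, 0.03, 0.02])
print(wasserstein_distance(bins, [5.0], u_weights=pred, v_weights=[1.0]))
```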
Factorizing Declarative and Procedural Knowledge in Structured, Dynamical Environments
Anirudh Goyal
Alex Lamb
Phanideep Gampa
Philippe Beaudoin
Charles Blundell
Sergey Levine
Michael Curtis Mozer
Fast and Slow Learning of Recurrent Independent Mechanisms
Kanika Madan
Nan Rosemary Ke
Anirudh Goyal
Bernhard Schölkopf
Decomposing knowledge into interchangeable pieces promises a generalization advantage when there are changes in distribution. A learning agent interacting with its environment is likely to be faced with situations requiring novel combinations of existing pieces of knowledge. We hypothesize that such a decomposition of knowledge is particularly relevant for being able to generalize in a systematic manner to out-of-distribution changes. To study these ideas, we propose a particular training framework in which we assume that the pieces of knowledge an agent needs and its reward function are stationary and can be re-used across tasks. An attention mechanism dynamically selects which modules can be adapted to the current task, and the parameters of the selected modules are allowed to change quickly as the learner is confronted with variations in what it experiences, while the parameters of the attention mechanisms act as stable, slowly changing, meta-parameters. We focus on pieces of knowledge captured by an ensemble of modules sparsely communicating with each other via a bottleneck of attention. We find that meta-learning the modular aspects of the proposed system greatly helps in achieving faster adaptation in a reinforcement learning setup involving navigation in a partially observed grid world with image-level input. We also find that reversing the role of parameters and meta-parameters does not work nearly as well, suggesting a particular role for fast adaptation of the dynamically selected modules.
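A crude way to picture the fast/slow split the abstract describes is to put the module parameters and the attention (routing) parameters on two different timescales. The sketch below (assumed module names and learning rates, a simplification rather than the authors' meta-learning setup) uses two optimizers: a large learning rate for the modules and a small one for the attention.

```python
# Sketch of two-timescale training: module parameters adapt quickly, the
# attention (routing) parameters change slowly. Names and rates are illustrative.
import torch
import torch.nn as nn

modules = nn.ModuleList(nn.Linear(32, 32) for _ in range(4))   # "pieces of knowledge"
attention = nn.Linear(32, 4)                                   # selects which module to use

fast_opt = torch.optim.Adam(modules.parameters(), lr=1e-2)     # fast adaptation
slow_opt = torch.optim.Adam(attention.parameters(), lr=1e-4)   # slow meta-parameters

def step(x, y):
    weights = torch.softmax(attention(x), dim=-1)              # (batch, 4) routing weights
    outs = torch.stack([m(x) for m in modules], dim=1)         # (batch, 4, 32)
    pred = (weights.unsqueeze(-1) * outs).sum(dim=1)           # weighted mixture of modules
    loss = nn.functional.mse_loss(pred, y)
    fast_opt.zero_grad(); slow_opt.zero_grad()
    loss.backward()
    fast_opt.step(); slow_opt.step()
    return loss.item()

print(step(torch.randn(8, 32), torch.randn(8, 32)))
```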
Flow Network based Generative Models for Non-Iterative Diverse Candidate Generation
Moksh J. Jain
Maksym Korablyov
This paper is about the problem of learning a stochastic policy for generating an object (like a molecular graph) from a sequence of actions, such that the probability of generating an object is proportional to a given positive reward for that object. Whereas standard return maximization tends to converge to a single return-maximizing sequence, there are cases where we would like to sample a diverse set of high-return solutions. These arise, for example, in black-box function optimization when few rounds are possible, each with large batches of queries, where the batches should be diverse, e.g., in the design of new molecules. One can also see this as a problem of approximately converting an energy function to a generative distribution. While MCMC methods can achieve that, they are expensive and generally only perform local exploration. Instead, training a generative policy amortizes the cost of search during training and yields fast generation. Using insights from Temporal Difference learning, we propose GFlowNet, based on a view of the generative process as a flow network, making it possible to handle the tricky case where different trajectories can yield the same final state, e.g., there are many ways to sequentially add atoms to generate some molecular graph. We cast the set of trajectories as a flow and convert the flow consistency equations into a learning objective, akin to the casting of the Bellman equations into Temporal Difference methods. We prove that any global minimum of the proposed objectives yields a policy which samples from the desired distribution, and demonstrate the improved performance and diversity of GFlowNet on a simple domain where there are many modes to the reward function, and on a molecule synthesis task.
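The flow-consistency idea can be shown on a toy directed acyclic graph: learn one flow value per edge so that, at every non-source state, inflow equals terminal reward plus outflow. The snippet below is a minimal tabular sketch under that reading, with made-up states and rewards, not the paper's GFlowNet implementation.

```python
# Tabular sketch of a flow-matching objective on a tiny made-up DAG.
# Not the paper's code; states, edges and rewards are illustrative.
import torch

edges = [(0, 1), (0, 2), (1, 3), (2, 3), (2, 4)]     # state 0 is the source
reward = {1: 0.0, 2: 0.0, 3: 2.0, 4: 1.0}            # states 3 and 4 are terminal
log_flow = torch.zeros(len(edges), requires_grad=True)   # one learnable flow per edge
opt = torch.optim.Adam([log_flow], lr=0.05)
eps = 1e-8

for _ in range(2000):
    flow = log_flow.exp()
    loss = 0.0
    for s, r in reward.items():                       # every non-source state
        inflow = flow[[i for i, (a, b) in enumerate(edges) if b == s]].sum()
        outflow = flow[[i for i, (a, b) in enumerate(edges) if a == s]].sum()
        loss = loss + (torch.log(eps + inflow) - torch.log(eps + r + outflow)) ** 2
    opt.zero_grad(); loss.backward(); opt.step()

# At convergence flow is conserved: flow out of the source sums to the total
# reward (3.0 here) and flow into each terminal state matches its reward.
print({e: round(f, 2) for e, f in zip(edges, log_flow.exp().tolist())})
```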
Invariance Principle Meets Information Bottleneck for Out-of-Distribution Generalization
Kartik Ahuja
Ethan Caballero
Dinghuai Zhang
Jean-Christophe Gagnon-Audet
The invariance principle from causality is at the heart of notable approaches such as invariant risk minimization (IRM) that seek to address out-of-distribution (OOD) generalization failures. Despite the promising theory, invariance principle-based approaches fail in common classification tasks, where invariant (causal) features capture all the information about the label. Are these failures due to the methods failing to capture the invariance? Or is the invariance principle itself insufficient? To answer these questions, we revisit the fundamental assumptions in linear regression tasks, where invariance-based approaches were shown to provably generalize OOD. In contrast to the linear regression tasks, we show that for linear classification tasks we need much stronger restrictions on the distribution shifts, or otherwise OOD generalization is impossible. Furthermore, even with appropriate restrictions on distribution shifts in place, we show that the invariance principle alone is insufficient. We prove that a form of the information bottleneck constraint along with invariance helps address key failures when invariant features capture all the information about the label and also retains the existing success when they do not. We propose an approach that incorporates both of these principles and demonstrate its effectiveness in several experiments.
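One rough way to write the combination the abstract argues for is sketched below. It is an assumed reading, not the paper's exact objective: an IRMv1-style gradient penalty per environment for invariance, plus a variance penalty on the learned representation standing in for the information-bottleneck constraint.

```python
# Sketch combining an IRMv1-style invariance penalty with a variance penalty
# on the representation as a stand-in for the information bottleneck.
# A rough reading of the idea, not the paper's exact objective.
import torch
import torch.nn as nn
import torch.nn.functional as F

featurizer = nn.Linear(10, 16)
classifier = nn.Linear(16, 1)

def irm_penalty(logits, y):
    """IRMv1: squared gradient of the risk w.r.t. a dummy scale on the logits."""
    scale = torch.ones(1, requires_grad=True)
    loss = F.binary_cross_entropy_with_logits(logits * scale, y)
    grad = torch.autograd.grad(loss, scale, create_graph=True)[0]
    return (grad ** 2).sum()

def objective(envs, lam_irm=1.0, lam_ib=0.1):
    total = 0.0
    for x, y in envs:                        # one (x, y) batch per environment
        z = featurizer(x)
        logits = classifier(z).squeeze(-1)
        erm = F.binary_cross_entropy_with_logits(logits, y)
        ib = z.var(dim=0).mean()             # bottleneck term: shrink representation variance
        total = total + erm + lam_irm * irm_penalty(logits, y) + lam_ib * ib
    return total / len(envs)

envs = [(torch.randn(32, 10), torch.randint(0, 2, (32,)).float()) for _ in range(2)]
print(objective(envs).item())
```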
Learning Neural Generative Dynamics for Molecular Conformation Generation
Minkai Xu
Shitong Luo
Jian Peng
We study how to generate molecule conformations (i.e., 3D structures) from a molecular graph. Traditional methods, such as molecular dynamics, sample conformations via computationally expensive simulations. Recently, machine learning methods have shown great potential by training on a large collection of conformation data. Challenges arise from the limited model capacity for capturing complex distributions of conformations and the difficulty in modeling long-range dependencies between atoms. Inspired by the recent progress in deep generative models, in this paper, we propose a novel probabilistic framework to generate valid and diverse conformations given a molecular graph. We propose a method combining the advantages of both flow-based and energy-based models, enjoying: (1) a high model capacity to estimate the multimodal conformation distribution; (2) explicitly capturing the complex long-range dependencies between atoms in the observation space. Extensive experiments demonstrate the superior performance of the proposed method on several benchmarks, including conformation generation and distance modeling tasks, with a significant improvement over existing generative models for molecular conformation sampling.
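Distance-based conformation generation ultimately has to turn predicted inter-atomic distances into 3D coordinates. The toy sketch below (made-up distances, not the paper's model) shows that final distance-geometry step: fit coordinates by gradient descent so that pairwise distances match the targets.

```python
# Toy distance-geometry step: recover 3D coordinates whose pairwise distances
# match a target distance matrix. Illustrative only; not the paper's method.
import torch

target = torch.tensor([[0.0, 1.0, 1.6],
                       [1.0, 0.0, 1.0],
                       [1.6, 1.0, 0.0]])           # made-up target distances
coords = torch.randn(3, 3, requires_grad=True)     # 3 atoms in 3D space
opt = torch.optim.Adam([coords], lr=0.05)
i, j = torch.triu_indices(3, 3, offset=1)          # unique atom pairs (i < j)

for _ in range(500):
    dists = (coords[i] - coords[j]).norm(dim=-1)   # current pairwise distances
    loss = ((dists - target[i, j]) ** 2).sum()
    opt.zero_grad(); loss.backward(); opt.step()

print((coords[i] - coords[j]).norm(dim=-1).detach())   # approaches 1.0, 1.6, 1.0
```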
Machine Learning for Combinatorial Optimization: a Methodological Tour d'Horizon
Andrea Lodi
Antoine Prouvost
Multi-Domain Balanced Sampling Improves Out-of-Distribution Generalization of Chest X-ray Pathology Prediction Models
Enoch Amoatey Tetteh
Joseph D Viviano
Joseph Paul Cohen
Learning models that generalize under different distribution shifts in medical imaging has been a long-standing research challenge. There have been several proposals for efficient and robust visual representation learning among vision research practitioners, especially in the sensitive and critical biomedical domain. In this paper, we propose an idea for out-of-distribution generalization of chest X-ray pathologies that uses a simple balanced batch sampling technique. We observed that balanced sampling between the multiple training datasets improves the performance over baseline models trained without balancing. Code for this work is available on GitHub.
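The balanced batch sampling idea is simple to sketch: draw the same number of examples from each training dataset (domain) for every batch. The snippet below is a generic illustration with made-up dataset names and sizes, not the released code.

```python
# Sketch of multi-domain balanced batch sampling: each batch contains an equal
# number of examples from every training dataset. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)
domain_sizes = {"dataset_a": 1000, "dataset_b": 5000, "dataset_c": 2500}   # made-up sizes

def balanced_batch(batch_size=12):
    per_domain = batch_size // len(domain_sizes)       # equal share per domain
    batch = []
    for name, n in domain_sizes.items():
        idx = rng.choice(n, size=per_domain, replace=False)
        batch.extend((name, int(i)) for i in idx)
    return batch                                       # list of (domain, index) pairs

print(balanced_batch())
```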