Portrait of Yoshua Bengio

Yoshua Bengio

Core Academic Member
Canada CIFAR AI Chair
Full Professor, Université de Montréal, Department of Computer Science and Operations Research Department
Research Topics
Causality
Computational Neuroscience
Deep Learning
Generative Models
Graph Neural Networks
Machine Learning Theory
Medical Machine Learning
Molecular Modeling
Natural Language Processing
Probabilistic Models
Reasoning
Recurrent Neural Networks
Reinforcement Learning
Representation Learning

Biography

*For media requests, please write to medias@mila.quebec.

For more information please contact Cassidy MacNeil, Senior Assistant and Operation Lead at cassidy.macneil@mila.quebec.

Yoshua Bengio is recognized worldwide as a leading expert in AI. He is most known for his pioneering work in deep learning, which earned him the 2018 A.M. Turing Award, “the Nobel Prize of computing,” with Geoffrey Hinton and Yann LeCun.

Bengio is a full professor at Université de Montréal, and the founder and scientific advisor of Mila – Quebec Artificial Intelligence Institute. He is also a senior fellow at CIFAR and co-directs its Learning in Machines & Brains program, serves as special advisor and founding scientific director of IVADO, and holds a Canada CIFAR AI Chair.

In 2019, Bengio was awarded the prestigious Killam Prize and in 2022, he was the most cited computer scientist in the world by h-index. He is a Fellow of the Royal Society of London, Fellow of the Royal Society of Canada, Knight of the Legion of Honor of France and Officer of the Order of Canada. In 2023, he was appointed to the UN’s Scientific Advisory Board for Independent Advice on Breakthroughs in Science and Technology.

Concerned about the social impact of AI, Bengio helped draft the Montréal Declaration for the Responsible Development of Artificial Intelligence and continues to raise awareness about the importance of mitigating the potentially catastrophic risks associated with future AI systems.

Current Students

Collaborating Alumni - McGill University
Collaborating researcher - Cambridge University
Principal supervisor :
PhD - Université de Montréal
Collaborating researcher - N/A
Principal supervisor :
PhD - Université de Montréal
Collaborating researcher - KAIST
PhD - Université de Montréal
Independent visiting researcher
Principal supervisor :
PhD - Université de Montréal
Co-supervisor :
PhD - Université de Montréal
PhD - Université de Montréal
PhD - Université de Montréal
PhD - Université de Montréal
PhD - Université de Montréal
Principal supervisor :
Collaborating Alumni - Université de Montréal
Postdoctorate - Université de Montréal
Principal supervisor :
Collaborating researcher - Université de Montréal
Postdoctorate - Université de Montréal
Principal supervisor :
Collaborating Alumni
Collaborating researcher - s.o.
PhD - Université de Montréal
Co-supervisor :
PhD - Université de Montréal
Principal supervisor :
Independent visiting researcher - Université de Montréal
PhD - Université de Montréal
Principal supervisor :
Collaborating researcher - Ying Wu Coll of Computing
Collaborating researcher - University of Waterloo
Principal supervisor :
PhD - Université de Montréal
Postdoctorate - Université de Montréal
Postdoctorate - Université de Montréal
PhD - Université de Montréal
Principal supervisor :
Postdoctorate
Co-supervisor :
Collaborating Alumni - Université de Montréal
Co-supervisor :
Collaborating researcher
Principal supervisor :
Collaborating Alumni - Université de Montréal
Collaborating Alumni - Université de Montréal
Co-supervisor :
PhD - Université de Montréal
Principal supervisor :
Collaborating researcher - Université de Montréal
Collaborating researcher
Collaborating researcher - Université de Montréal
PhD - Université de Montréal
PhD - McGill University
Principal supervisor :
Collaborating Alumni - McGill University
Principal supervisor :

Publications

GFlowOut: Dropout with Generative Flow Networks
Dianbo Liu
Bonaventure F. P. Dossou
Qianli Shen
Nikolay Malkin
Chris C. Emezue
Bayesian Inference offers principled tools to tackle many critical problems with modern neural networks such as poor calibration and general… (see more)ization, and data inefficiency. However, scaling Bayesian inference to large architectures is challenging and requires restrictive approximations. Monte Carlo Dropout has been widely used as a relatively cheap way for approximate Inference and to estimate uncertainty with deep neural networks. Traditionally, the dropout mask is sampled independently from a fixed distribution. Recent works show that the dropout mask can be viewed as a latent variable, which can be inferred with variational inference. These methods face two important challenges: (a) the posterior distribution over masks can be highly multi-modal which can be difficult to approximate with standard variational inference and (b) it is not trivial to fully utilize sample-dependent information and correlation among dropout masks to improve posterior estimation. In this work, we propose GFlowOut to address these issues. GFlowOut leverages the recently proposed probabilistic framework of Generative Flow Networks (GFlowNets) to learn the posterior distribution over dropout masks. We empirically demonstrate that GFlowOut results in predictive distributions that generalize better to out-of-distribution data, and provide uncertainty estimates which lead to better performance in downstream tasks.
HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution
Eric Nguyen
Michael Poli
Marjan Faizi
Armin W Thomas
Callum Birch-Sykes
Michael Wornow
Aman Patel
Clayton M. Rabideau
Stefano Ermon
Stephen Baccus
Christopher Re
Learning GFlowNets From Partial Episodes For Improved Convergence And Stability
Generative flow networks (GFlowNets) are a family of algorithms for training a sequential sampler of discrete objects under an unnormalized … (see more)target density and have been successfully used for various probabilistic modeling tasks. Existing training objectives for GFlowNets are either local to states or transitions, or propagate a reward signal over an entire sampling trajectory. We argue that these alternatives represent opposite ends of a gradient bias-variance tradeoff and propose a way to exploit this tradeoff to mitigate its harmful effects. Inspired by the TD(
MixupE: Understanding and improving Mixup from directional derivative perspective
Yingtian Zou
Wai Hoh Tang
Hieu Pham
Juho Kannala
Arno Solin
NEURAL NETWORK-BASED SOLVERS FOR PDES
M. Cameron
Ian G Goodfellow
(1) N (x; θ) = Ll+1 ○ σl ○Ll ○ σl−1 ○ . . . ○ σ1 ○L1. The symbol Lk denotes the k’s affine operator of the form Lk(x) = … (see more)Akx + bk, while σk denotes a nonlinear function called an activation function. The activation functions are chosen by the user. The matrices Ak and shift vectors (or bias vectors) bk are encoded into the argument θ: θ = {Ak, bk} l+1 k=1. The term training neural network means finding {Ak, bk} l+1 k=1 such that N (x; θ) satisfies certain conditions. These conditions are described by the loss function chosen by the user. For example, one might want the neural network to assume certain values fj at certain points xj , j = 1, . . . ,N . These points x are called the training data. In this case, a common choice of the loss function is the least squares error:
Stochastic Generative Flow Networks
Supplementary Material for MixupE
Yingtian Zou
Wai Hoh Tang
Hieu Pham
Juho Kannala
Arno Solin
We denote by z = (x,y) the input and output pair where x ∈ X ⊆ R and y ∈ Y ⊆ R . Let fθ(x) ∈ R be the output of the logits (i.e.,… (see more) the last layer before the softmax or sigmoid) of the model parameterized by θ. We use l(θ, z) = h(fθ(x)) − yfθ(x) to denote the loss function. Let g(·) be the activation function. We use x(i) to index i-th element of the vector x and xj to represent j-th variable in a set. The notation list is:
A Theory of Continuous Generative Flow Networks
Generative flow networks (GFlowNets) are amortized variational inference algorithms that are trained to sample from unnormalized target dist… (see more)ributions over compositional objects. A key limitation of GFlowNets until this time has been that they are restricted to discrete spaces. We present a theory for generalized GFlowNets, which encompasses both existing discrete GFlowNets and ones with continuous or hybrid state spaces, and perform experiments with two goals in mind. First, we illustrate critical points of the theory and the importance of various assumptions. Second, we empirically demonstrate how observations about discrete GFlowNets transfer to the continuous case and show strong results compared to non-GFlowNet baselines on several previously studied tasks. This work greatly widens the perspectives for the application of GFlowNets in probabilistic inference and various modeling settings.
Tree Cross Attention
Frederick Tung
Hossein Hajimirsadeghi
Mohamed Osama Ahmed
Cross Attention is a popular method for retrieving information from a set of context tokens for making predictions. At inference time, for e… (see more)ach prediction, Cross Attention scans the full set of
Rethinking Learning Dynamics in RL using Adversarial Networks
Recent years have seen tremendous progress in methods of reinforcement learning. However, most of these approaches have been trained in a st… (see more)raightforward fashion and are generally not robust to adversity, especially in the meta-RL setting. To the best of our knowledge, our work is the first to propose an adversarial training regime for Multi-Task Reinforcement Learning, which requires no manual intervention or domain knowledge of the environments. Our experiments on multiple environments in the Multi-Task Reinforcement learning domain demonstrate that the adversarial process leads to a better exploration of numerous solutions and a deeper understanding of the environment. We also adapt existing measures of causal attribution to draw insights from the skills learned, facilitating easier re-purposing of skills for adaptation to unseen environments and tasks.
Bayesian Dynamic Causal Discovery
Learning the causal structure of observable variables is a central focus for scientific discovery. Bayesian causal discovery methods tackle … (see more)this problem by learning a posterior over the set of admissible graphs that are equally likely given our priors and observations. Existing methods primarily consider observations from static systems and assume the underlying causal structure takes the form of a directed acyclic graph (DAG). In settings with dynamic feedback mechanisms that regulate the trajectories of individual variables, this acyclicity assumption fails unless we account for time. We treat causal discovery in the unrolled causal graph as a problem of sparse identification of a dynamical system. This imposes a natural temporal causal order between variables and captures cyclic feedback loops through time. Under this lens, we propose a new framework for Bayesian causal discovery for dynamical systems and present a novel generative flow network architecture (Dyn-GFN) tailored for this task. Dyn-GFN imposes an edge-wise sparse prior to sequentially build a k -sparse causal graph. Through evaluation on temporal data, our results show that the posterior learned with Dyn-GFN yields improved Bayes coverage of admissible causal structures relative to state of the art Bayesian causal discovery methods.
Object-centric causal representation learning