Yoshua Bengio

Core Academic Member
Canada CIFAR AI Chair
Full Professor, Université de Montréal, Department of Computer Science and Operations Research
Scientific Director, Mila
Observer, Board of Directors, Mila

Biography

For media requests, please write to medias@mila.quebec.

For more information, please contact Julie Mongeau, executive assistant, at julie.mongeau@mila.quebec.

Yoshua Bengio is recognized worldwide as a leading expert in AI. He is best known for his pioneering work in deep learning, which earned him the 2018 A.M. Turing Award, “the Nobel Prize of computing,” shared with Geoffrey Hinton and Yann LeCun.

Bengio is a full professor at Université de Montréal and the founder and scientific director of Mila – Quebec Artificial Intelligence Institute. He is also a senior fellow at CIFAR, where he co-directs the Learning in Machines & Brains program, serves as scientific director of IVADO, and holds a Canada CIFAR AI Chair.

In 2019, Bengio was awarded the prestigious Killam Prize, and in 2022 he became the most cited computer scientist in the world by h-index. He is a Fellow of the Royal Society of London, a Fellow of the Royal Society of Canada, a Knight of the Legion of Honor of France, and an Officer of the Order of Canada. In 2023, he was appointed to the UN’s Scientific Advisory Board for Independent Advice on Breakthroughs in Science and Technology.

Concerned about the social impact of AI, Bengio helped draft the Montréal Declaration for the Responsible Development of Artificial Intelligence and continues to raise awareness about the importance of mitigating the potentially catastrophic risks associated with future AI systems.

Current Students

(Roster of supervised and co-supervised students and researchers, including Professional Master's, Master's Research, PhD, postdoctoral, research intern, and collaborating or visiting researcher positions, primarily at Université de Montréal, with others at McGill University, MIT, École Polytechnique Fédérale de Lausanne, the Technical University of Munich, HKUST, RWTH Aachen University, Université Paris-Saclay, Université du Québec à Rimouski, the Max Planck Institute for Intelligent Systems, and Imperial College London; individual names omitted.)

Publications

VIM: Variational Independent Modules for Video Prediction
Rim Assouel
Lluis Castrejon
Nicolas Ballas
We introduce a variational inference model called VIM, for Variational Independent Modules, for sequential data that learns and infers latent representations as a set of objects and discovers modular causal mechanisms over these objects. These mechanisms - which we call modules - are independently parametrized, define the stochastic transitions of entities and are shared across entities. At each time step, our model infers from a low-level input sequence a high-level sequence of categorical latent variables to select which transition modules to apply to which high-level object. We evaluate this model in video prediction tasks where the goal is to predict multi-modal future events given previous observations. We demonstrate empirically that VIM can model 2D visual sequences in an interpretable way and is able to identify the underlying dynamically instantiated mechanisms of the generation process. We additionally show that the learnt modules can be composed at test time to generalize to out-of-distribution observations.
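To make the selection mechanism concrete, here is a minimal PyTorch sketch (illustrative only, not the authors' code; all names, layer choices, and dimensions are assumptions): each object samples a categorical latent that chooses which shared, independently parametrized transition module updates its state.

```python
import torch
import torch.nn as nn

class ModuleSelector(nn.Module):
    """Toy version of the selection step: each object picks one of several
    shared transition modules via a categorical latent variable."""
    def __init__(self, obj_dim: int, n_modules: int):
        super().__init__()
        # Independently parametrized transition modules, shared across objects.
        self.transitions = nn.ModuleList(
            nn.Sequential(nn.Linear(obj_dim, obj_dim), nn.Tanh())
            for _ in range(n_modules)
        )
        self.selector = nn.Linear(obj_dim, n_modules)  # per-object logits

    def forward(self, objects: torch.Tensor) -> torch.Tensor:
        # objects: (batch, n_objects, obj_dim)
        logits = self.selector(objects)                               # (B, N, M)
        choice = torch.distributions.Categorical(logits=logits).sample()
        # Apply every module, then gather each object's chosen output.
        # (Training would need a relaxed or variational estimator;
        # hard sampling alone is not differentiable.)
        outs = torch.stack([m(objects) for m in self.transitions], dim=2)
        idx = choice.unsqueeze(-1).unsqueeze(-1).expand(-1, -1, 1, objects.size(-1))
        return outs.gather(2, idx).squeeze(2)                         # (B, N, D)

x = torch.randn(4, 3, 16)  # 4 sequences, 3 objects, 16-dim object states
print(ModuleSelector(obj_dim=16, n_modules=5)(x).shape)  # torch.Size([4, 3, 16])
```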
On Neural Architecture Inductive Biases for Relational Tasks
Current deep learning approaches have shown good in-distribution generalization performance, but struggle with out-of-distribution generalization. This is especially true in the case of tasks involving abstract relations like recognizing rules in sequences, as we find in many intelligence tests. Recent work has explored how forcing relational representations to remain distinct from sensory representations, as seems to be the case in the brain, can help artificial systems. Building on this work, we further explore and formalize the advantages afforded by 'partitioned' representations of relations and sensory details, and how this inductive bias can help recompose learned relational structure in newly encountered settings. We introduce a simple architecture based on similarity scores which we name Compositional Relational Network (CoRelNet). Using this model, we investigate a series of inductive biases that ensure abstract relations are learned and represented distinctly from sensory data, and explore their effects on out-of-distribution generalization for a series of relational psychophysics tasks. We find that simple architectural choices can outperform existing models in out-of-distribution generalization. Together, these results show that partitioning relational representations from other information streams may be a simple way to augment existing network architectures' robustness when performing out-of-distribution relational computations.
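As a rough illustration of the similarity-score idea, the sketch below (assumed shapes and layer sizes; not the paper's exact model) lets the classifier see only the pairwise similarity matrix of object embeddings, keeping the relational pathway partitioned from sensory features.

```python
import torch
import torch.nn as nn

class CoRelNetSketch(nn.Module):
    """Decisions are made from the (N x N) similarity matrix alone, so the
    relational pathway never sees raw sensory features directly."""
    def __init__(self, in_dim: int, emb_dim: int, n_objects: int, n_classes: int):
        super().__init__()
        self.encoder = nn.Linear(in_dim, emb_dim)          # sensory embedding
        self.head = nn.Sequential(
            nn.Linear(n_objects * n_objects, 64), nn.ReLU(),
            nn.Linear(64, n_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_objects, in_dim)
        z = self.encoder(x)
        sim = torch.softmax(z @ z.transpose(1, 2), dim=-1)  # (B, N, N) relations
        return self.head(sim.flatten(1))                    # relations only

model = CoRelNetSketch(in_dim=32, emb_dim=16, n_objects=5, n_classes=2)
print(model(torch.randn(8, 5, 32)).shape)  # torch.Size([8, 2])
```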
Temporal Abstractions-Augmented Temporally Contrastive Learning: An Alternative to the Laplacian in RL
Akram Erraqabi
Marlos C. Machado
Harry Zhao
Mingde Zhao
Sainbayar Sukhbaatar
Alessandro Lazaric
Ludovic Denoyer
In reinforcement learning, the graph Laplacian has proved to be a valuable tool in the task-agnostic setting, with applications ranging from skill discovery to reward shaping. Recently, learning the Laplacian representation has been framed as the optimization of a temporally-contrastive objective to overcome its computational limitations in large (or continuous) state spaces. However, this approach requires uniform access to all states in the state space, overlooking the exploration problem that emerges during the representation learning process. In this work, we propose an alternative method that is able to recover, in a non-uniform-prior setting, the expressiveness and the desired properties of the Laplacian representation. We do so by combining the representation learning with a skill-based covering policy, which provides a better training distribution to extend and refine the representation. We also show that a simple augmentation of the representation objective with the learned temporal abstractions improves dynamics-awareness and helps exploration. We find that our method succeeds as an alternative to the Laplacian in the non-uniform setting and scales to challenging continuous control environments. Finally, even if our method is not optimized for skill discovery, the learned skills can successfully solve difficult continuous navigation tasks with sparse rewards, where standard skill discovery approaches are not as effective.
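A hedged sketch of the kind of temporally-contrastive objective described here (the paper's exact loss may differ): embeddings of consecutive states are pulled together, while an orthonormality penalty over the batch plays the repulsive role of the Laplacian eigenvector constraints.

```python
import torch
import torch.nn as nn

def temporal_contrastive_loss(phi_s: torch.Tensor,
                              phi_next: torch.Tensor,
                              beta: float = 1.0) -> torch.Tensor:
    """Attract embeddings of consecutive states; push the batch Gram matrix
    toward the identity so features stay decorrelated (assumed form)."""
    attract = (phi_s - phi_next).pow(2).sum(dim=1).mean()
    gram = phi_s.T @ phi_s / phi_s.size(0)
    repel = (gram - torch.eye(phi_s.size(1))).pow(2).sum()
    return attract + beta * repel

encoder = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 8))
s = torch.randn(128, 4)        # states sampled from trajectories
s_next = torch.randn(128, 4)   # their successors (placeholder data)
loss = temporal_contrastive_loss(encoder(s), encoder(s_next))
loss.backward()
```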
Evaluating Generalization in GFlowNets for Molecule Design
Andrei Cristian Nica
Moksh J. Jain
Emmanuel Bengio
Cheng-Hao Liu
Maksym Korablyov
Michael M. Bronstein
Deep learning bears promise for drug discovery problems such as de novo molecular design. Generating data to train such models is a costly and time-consuming process, given the need for wet-lab experiments or expensive simulations. This problem is compounded by the notorious data-hungriness of machine learning algorithms. In small molecule generation, the recently proposed GFlowNet method has shown good performance in generating diverse high-scoring candidates, and has the interesting advantage of being an off-policy offline method. Finding an appropriate generalization evaluation metric for such models, one predictive of the desired search performance (i.e. finding high-scoring diverse candidates), will help guide online data collection for such an algorithm. In this work, we develop techniques for evaluating GFlowNet performance on a test set, and identify the most promising metric for predicting generalization. We present empirical results on several small-molecule design tasks in drug discovery, for several GFlowNet training setups, and we find a metric strongly correlated with diverse high-scoring batch generation. This metric should be used to identify the best generative model from which to sample batches of molecules to be evaluated.
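The evaluation logic reduces to a correlation check across training runs or checkpoints. A minimal sketch with made-up numbers (the paper's actual metrics and scores are not reproduced here):

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical per-checkpoint records: a candidate test-set metric and the
# search performance we actually care about (diverse high-scoring batches).
test_metric = np.array([0.62, 0.71, 0.55, 0.80, 0.67])
search_perf = np.array([0.40, 0.52, 0.31, 0.66, 0.47])

rho, pval = spearmanr(test_metric, search_perf)
print(f"Spearman rho = {rho:.2f} (p = {pval:.3f})")
# A metric with consistently high rank correlation could be used to pick
# the checkpoint from which to sample molecule batches for evaluation.
```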
Inductive Biases for Relational Tasks
Current deep learning approaches have shown good in-distribution performance but struggle in out-of-distribution settings. This is especially true in the case of tasks involving abstract relations like recognizing rules in sequences, as required in many intelligence tests. In contrast, our brains are remarkably flexible at such tasks, an attribute that is likely linked to anatomical constraints on computations. Inspired by this, recent work has explored how enforcing that relational representations remain distinct from sensory representations can help artificial systems. Building on this work, we further explore and formalize the advantages afforded by 'partitioned' representations of relations and sensory details. We investigate inductive biases that ensure abstract relations are learned and represented distinctly from sensory data across several neural network architectures and show that they outperform existing architectures on out-of-distribution generalization for various relational tasks. These results show that partitioning relational representations from other information streams may be a simple way to augment existing network architectures' robustness when performing relational computations.
A New Era: Intelligent Tutoring Systems Will Transform Online Learning for Millions
Francois St-Hilaire
Dung D. Vu
Antoine Frau
Nathan J. Burns
Farid Faraji
Joseph Potochny
Stephane Robert
Arnaud Roussel
Selene Zheng
Taylor Glazier
Junfel Vincent Romano
Robert Belfer
Muhammad Shayan
Ariella Smofsky
Tommy Delarosbil
Seulmin Ahn
Simon Eden-Walker
Kritika Sony
Ansona Onyi Ching
Sabina Elkins
A. Stepanyan
Adela Matajova
Victor Chen
Hossein Sahraei
Robert Larson
N. Markova
Andrew Barkett
Iulian V. Serban
Ekaterina Kochmar
Tackling Climate Change with Machine Learning
Priya L. Donti
Lynn H. Kaack
Kelly Kochanski
Alexandre Lacoste
Kris Sankaran
Andrew Slavin Ross
Nikola Milojevic-Dupont
Natasha Jaques
Anna Waldman-Brown
Alexandra Luccioni
Evan David Sherwin
S. Karthik Mukkavilli
Konrad Paul Kording
Carla P. Gomes
Andrew Y. Ng
Demis Hassabis
John C. Platt
Felix Creutzig
Jennifer T. Chayes
Climate change is one of the greatest challenges facing humanity, and we, as machine learning (ML) experts, may wonder how we can help. Here we describe how ML can be a powerful tool in reducing greenhouse gas emissions and helping society adapt to a changing climate. From smart grids to disaster management, we identify high impact problems where existing gaps can be filled by ML, in collaboration with other fields. Our recommendations encompass exciting research questions as well as promising business opportunities. We call on the ML community to join the global effort against climate change.
Compositional Attention: Disentangling Search and Retrieval
Sarthak Mittal
Sharath Chandra Raparthy
Multi-head, key-value attention is the backbone of transformer-like model architectures which have proven to be widely successful in recent years. This attention mechanism uses multiple parallel key-value attention blocks (called heads), each performing two fundamental computations: (1) search - selection of a relevant entity from a set via query-key interaction, and (2) retrieval - extraction of relevant features from the selected entity via a value matrix. Standard attention heads learn a rigid mapping between search and retrieval. In this work, we first highlight how this static nature of the pairing can potentially: (a) lead to learning of redundant parameters in certain tasks, and (b) hinder generalization. To alleviate this problem, we propose a novel attention mechanism, called Compositional Attention, that replaces the standard head structure. The proposed mechanism disentangles search and retrieval and composes them in a dynamic, flexible and context-dependent manner. Through a series of numerical experiments, we show that it outperforms standard multi-head attention on a variety of tasks, including some out-of-distribution settings. Through our qualitative analysis, we demonstrate that Compositional Attention leads to dynamic specialization based on the type of retrieval needed. Our proposed mechanism generalizes multi-head attention, allows independent scaling of search and retrieval and is easy to implement in a variety of established network architectures.
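The sketch below conveys the disentangling idea in PyTorch. It is a coarse simplification under assumed shapes, not the paper's implementation: here a retrieval is softly selected per query token from its own features, rather than via the paper's search-conditioned retrieval queries.

```python
import torch
import torch.nn as nn

class CompositionalAttentionSketch(nn.Module):
    """Simplified search/retrieval disentangling: S query-key searches are
    dynamically mixed with R value retrievals via a second softmax, instead
    of the rigid one-search-one-value pairing of standard heads."""
    def __init__(self, dim: int, n_search: int, n_retrieve: int):
        super().__init__()
        self.S, self.R, self.d = n_search, n_retrieve, dim
        self.q = nn.Linear(dim, n_search * dim)
        self.k = nn.Linear(dim, n_search * dim)
        self.v = nn.Linear(dim, n_retrieve * dim)
        self.score = nn.Linear(dim, n_retrieve)   # retrieval logits per token

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, _ = x.shape
        q = self.q(x).view(B, N, self.S, self.d).transpose(1, 2)   # (B,S,N,d)
        k = self.k(x).view(B, N, self.S, self.d).transpose(1, 2)
        v = self.v(x).view(B, N, self.R, self.d).transpose(1, 2)   # (B,R,N,d)
        attn = torch.softmax(q @ k.transpose(-1, -2) / self.d ** 0.5, dim=-1)
        cand = attn.unsqueeze(2) @ v.unsqueeze(1)                  # (B,S,R,N,d)
        w = torch.softmax(self.score(x), dim=-1)                   # (B,N,R)
        w = w.permute(0, 2, 1).unsqueeze(1).unsqueeze(-1)          # (B,1,R,N,1)
        out = (w * cand).sum(dim=2)                                # (B,S,N,d)
        return out.transpose(1, 2).reshape(B, N, self.S * self.d)

x = torch.randn(2, 7, 16)
print(CompositionalAttentionSketch(16, n_search=4, n_retrieve=3)(x).shape)
# torch.Size([2, 7, 64])
```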
Predicting Tactical Solutions to Operational Planning Problems under Imperfect Information
Eric Larsen
Sébastien Lachapelle
This paper offers a methodological contribution at the intersection of machine learning and operations research. Namely, we propose a methodology to quickly predict expected tactical descriptions of operational solutions (TDOSs). The problem we address occurs in the context of two-stage stochastic programming, where the second stage is demanding computationally. We aim to predict at a high speed the expected TDOS associated with the second-stage problem, conditionally on the first-stage variables. This may be used in support of the solution to the overall two-stage problem by avoiding the online generation of multiple second-stage scenarios and solutions. We formulate the tactical prediction problem as a stochastic optimal prediction program, whose solution we approximate with supervised machine learning. The training data set consists of a large number of deterministic operational problems generated by controlled probabilistic sampling. The labels are computed based on solutions to these problems (solved independently and offline), employing appropriate aggregation and subselection methods to address uncertainty. Results on our motivating application on load planning for rail transportation show that deep learning models produce accurate predictions in very short computing time (milliseconds or less). The predictive accuracy is close to the lower bounds calculated based on sample average approximation of the stochastic prediction programs.
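Stripped of the stochastic-programming machinery, the prediction step is plain supervised regression from instance features to aggregated solution descriptors. A self-contained sketch with synthetic placeholder data (all dimensions and names are assumptions):

```python
import torch
import torch.nn as nn

# Synthetic stand-ins: instance features (first-stage information) and
# tactical solution descriptors whose labels would be computed offline
# from independently solved operational problems.
X = torch.randn(1000, 20)
Y = torch.randn(1000, 5)

net = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 5))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for _ in range(200):                      # ordinary regression training loop
    opt.zero_grad()
    loss = nn.functional.mse_loss(net(X), Y)
    loss.backward()
    opt.step()

# At deployment, one forward pass replaces generating and solving many
# second-stage scenarios, which is why predictions take milliseconds.
with torch.no_grad():
    print(net(torch.randn(1, 20)))
```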
From Machine Learning to Robotics: Challenges and Opportunities for Embodied Intelligence
Nicholas Roy
Ingmar Posner
T. Barfoot
Philippe Beaudoin
Jeannette Bohg
Oliver Brock
Isabelle Depatie
Dieter Fox
D. Koditschek
Tomás Lozano-Pérez
Vikash K. Mansinghka
Dorsa Sadigh
Stefan Schaal
G. Sukhatme
Denis Therien
Marc Emile Toussaint
Michiel van de Panne
Comparative Study of Learning Outcomes for Online Learning Platforms
Francois St-Hilaire
Nathan J. Burns
Robert Belfer
Muhammad Shayan
Ariella Smofsky
Dung D. Vu
Antoine Frau
Joseph Potochny
Farid Faraji
Vincent Pavero
Neroli Ko
Ansona Onyi Ching
Sabina Elkins
A. Stepanyan
Adela Matajova
Iulian V. Serban
Ekaterina Kochmar
Meta-learning framework with applications to zero-shot time-series forecasting
Boris Oreshkin
Dmitri Carpov
Can meta-learning discover generic ways of processing time series (TS) from a diverse dataset so as to greatly improve generalization on new TS coming from different datasets? This work provides positive evidence for this using a broad meta-learning framework which we show subsumes many existing meta-learning algorithms. Our theoretical analysis suggests that residual connections act as a meta-learning adaptation mechanism, generating a subset of task-specific parameters based on a given TS input, thus gradually expanding the expressive power of the architecture on-the-fly. The same mechanism is shown via linearization analysis to have the interpretation of a sequential update of the final linear layer. Our empirical results on a wide range of data emphasize the importance of the identified meta-learning mechanisms for successful zero-shot univariate forecasting, suggesting that it is viable to train a neural network on a source TS dataset and deploy it on a different target TS dataset without retraining, resulting in performance that is at least as good as that of state-of-practice univariate forecasting models.
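The residual-connection mechanism highlighted by the analysis can be illustrated with a tiny N-BEATS-flavoured stack (a sketch under assumed dimensions, not the paper's model): each block consumes the backcast residual left by its predecessors, so the computation adapts to the input series on the fly.

```python
import torch
import torch.nn as nn

class ResidualForecaster(nn.Module):
    """Each block explains part of the input window (backcast) and emits a
    partial forecast; later blocks see only the remaining residual."""
    def __init__(self, lookback: int, horizon: int, n_blocks: int = 3):
        super().__init__()
        self.backcasts = nn.ModuleList(
            nn.Linear(lookback, lookback) for _ in range(n_blocks))
        self.forecasts = nn.ModuleList(
            nn.Linear(lookback, horizon) for _ in range(n_blocks))

    def forward(self, window: torch.Tensor) -> torch.Tensor:
        residual, forecast = window, 0.0
        for back, fore in zip(self.backcasts, self.forecasts):
            residual = residual - torch.relu(back(residual))  # remove explained part
            forecast = forecast + fore(residual)              # accumulate forecast
        return forecast

model = ResidualForecaster(lookback=24, horizon=6)
print(model(torch.randn(32, 24)).shape)  # torch.Size([32, 6])
```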