
Yoshua Bengio

Core Academic Member
Canada CIFAR AI Chair
Full Professor, Université de Montréal, Department of Computer Science and Operations Research
Founder and Scientific Advisor, Leadership Team
Research Topics
Causality
Computational Neuroscience
Deep Learning
Generative Models
Graph Neural Networks
Machine Learning Theory
Medical Machine Learning
Molecular Modeling
Natural Language Processing
Probabilistic Models
Reasoning
Recurrent Neural Networks
Reinforcement Learning
Representation Learning

Biography

For media requests, please write to medias@mila.quebec.

For more information, please contact Marie-Josée Beauchamp, Administrative Assistant, at marie-josee.beauchamp@mila.quebec.

Yoshua Bengio is recognized worldwide as a leading expert in AI. He is best known for his pioneering work in deep learning, which earned him the 2018 A.M. Turing Award, “the Nobel Prize of computing,” shared with Geoffrey Hinton and Yann LeCun.

Bengio is a full professor at Université de Montréal, and the founder and scientific advisor of Mila – Quebec Artificial Intelligence Institute. He is also a senior fellow at CIFAR and co-directs its Learning in Machines & Brains program, serves as special advisor and founding scientific director of IVADO, and holds a Canada CIFAR AI Chair.

In 2019, Bengio was awarded the prestigious Killam Prize, and in 2022 he became the most cited computer scientist in the world by h-index. He is a Fellow of the Royal Society of London, a Fellow of the Royal Society of Canada, a Knight of the Legion of Honor of France and an Officer of the Order of Canada. In 2023, he was appointed to the UN’s Scientific Advisory Board for Independent Advice on Breakthroughs in Science and Technology.

Concerned about the social impact of AI, Bengio helped draft the Montréal Declaration for the Responsible Development of Artificial Intelligence and continues to raise awareness about the importance of mitigating the potentially catastrophic risks associated with future AI systems.


Publications

Finding Flatter Minima with SGD
Stanisław Jastrzębski
Zac Kenton
Devansh Arpit
Nicolas Ballas
Asja Fischer
Amos Storkey
It has been discussed that over-parameterized deep neural networks (DNNs) trained using stochastic gradient descent (SGD) with smaller batch sizes generalize better compared with those trained with larger batch sizes. Additionally, model parameters found by small batch size SGD tend to be in flatter regions. We extend these empirical observations and experimentally show that both large learning rate and small batch size contribute towards SGD finding flatter minima that generalize well. Conversely, we find that small learning rates and large batch sizes lead to sharper minima that correlate with poor generalization in DNNs.
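A quick way to see the claimed effect is to train the same small network under the two regimes and probe the loss landscape around each solution. The PyTorch sketch below uses a synthetic task and a random-perturbation sharpness proxy; the model, data, and proxy are illustrative choices, not the authors' experimental setup.

```python
import torch
import torch.nn as nn

def train(lr, batch_size, steps=2000):
    """Train a small MLP on a synthetic task under a given SGD regime."""
    torch.manual_seed(0)
    X = torch.randn(1024, 20)
    y = (X.sum(dim=1) > 0).long()
    model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        idx = torch.randint(0, 1024, (batch_size,))
        loss = loss_fn(model(X[idx]), y[idx])
        opt.zero_grad(); loss.backward(); opt.step()
    return model, X, y, loss_fn

def sharpness(model, X, y, loss_fn, eps=0.01, trials=10):
    """Mean loss increase under small random weight perturbations;
    larger values suggest a sharper minimum."""
    with torch.no_grad():
        base = loss_fn(model(X), y).item()
        rise = 0.0
        for _ in range(trials):
            noise = [eps * torch.randn_like(p) for p in model.parameters()]
            for p, n in zip(model.parameters(), noise): p.add_(n)
            rise += loss_fn(model(X), y).item() - base
            for p, n in zip(model.parameters(), noise): p.sub_(n)
    return rise / trials

# Large learning rate + small batch vs. small learning rate + large batch.
for lr, bs in [(0.1, 16), (0.01, 256)]:
    m, X, y, lf = train(lr, bs)
    print(f"lr={lr}, batch={bs}: sharpness ~ {sharpness(m, X, y, lf):.4f}")
```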
Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning
Sandeep Subramanian
Adam Trischler
A lot of the recent success in natural language processing (NLP) has been driven by distributed vector representations of words trained on large amounts of text in an unsupervised manner. These representations are typically used as general purpose features for words across a range of NLP problems. However, extending this success to learning representations of sequences of words, such as sentences, remains an open problem. Recent work has explored unsupervised as well as supervised learning techniques with different training objectives to learn general purpose fixed-length sentence representations. In this work, we present a simple, effective multi-task learning framework for sentence representations that combines the inductive biases of diverse training objectives in a single model. We train this model on several data sources with multiple training objectives on over 100 million sentences. Extensive experiments demonstrate that sharing a single recurrent sentence encoder across weakly related tasks leads to consistent improvements over previous methods. We present substantial improvements in the context of transfer learning and low-resource settings using our learned general-purpose representations.
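The core design is easy to state in code: one shared recurrent encoder feeding several task-specific heads, with tasks alternated between batches. This is a minimal PyTorch sketch under assumed sizes and toy stand-in tasks ("nli", "sentiment"); it is not the paper's full training pipeline or data sources.

```python
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    def __init__(self, vocab=1000, emb=64, hid=128):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.gru = nn.GRU(emb, hid, batch_first=True)

    def forward(self, tokens):                   # tokens: (batch, seq_len)
        _, h = self.gru(self.embed(tokens))
        return h.squeeze(0)                      # fixed-length sentence vector

encoder = SharedEncoder()
heads = nn.ModuleDict({                          # one lightweight head per task
    "nli": nn.Linear(128, 3),                    # e.g. 3-way entailment labels
    "sentiment": nn.Linear(128, 2),
})
opt = torch.optim.Adam(list(encoder.parameters()) + list(heads.parameters()), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):
    task = ["nli", "sentiment"][step % 2]        # alternate tasks between batches
    tokens = torch.randint(0, 1000, (32, 20))    # stand-in for a real minibatch
    labels = torch.randint(0, heads[task].out_features, (32,))
    loss = loss_fn(heads[task](encoder(tokens)), labels)
    opt.zero_grad(); loss.backward(); opt.step()
```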
MetaGAN: An Adversarial Approach to Few-Shot Learning
Ruixiang ZHANG
Tong Che
Zoubin Ghahramani
Yangqiu Song
In this paper, we propose a conceptually simple and general framework called MetaGAN for few-shot learning problems. Most state-of-the-art few-shot classification models can be integrated with MetaGAN in a principled and straightforward way. By introducing an adversarial generator conditioned on tasks, we augment vanilla few-shot classification models with the ability to discriminate between real and fake data. We argue that this GAN-based approach can help few-shot classifiers to learn a sharper decision boundary, which could generalize better. We show that with our MetaGAN framework, we can extend supervised few-shot learning models to naturally cope with unlabeled data. Different from previous work in semi-supervised few-shot learning, our algorithms can deal with semi-supervision at both the sample level and the task level. We give theoretical justifications of the strength of MetaGAN, and validate the effectiveness of MetaGAN on challenging few-shot image classification benchmarks.
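The central trick, stripped of the meta-learning machinery, is to give an N-way classifier an extra "fake" class and train it against a task-conditioned generator. The sketch below assumes pre-extracted features, a mean-pooled task embedding, and random stand-ins for a few-shot task; it shows the adversarial objective, not the paper's full model.

```python
import torch
import torch.nn as nn

N_WAY, FEAT = 5, 64
classifier = nn.Sequential(nn.Linear(FEAT, 128), nn.ReLU(),
                           nn.Linear(128, N_WAY + 1))      # class N_WAY = "fake"
generator = nn.Sequential(nn.Linear(32 + FEAT, 128), nn.ReLU(),
                          nn.Linear(128, FEAT))            # noise + task embedding
opt_c = torch.optim.Adam(classifier.parameters(), lr=1e-3)
opt_g = torch.optim.Adam(generator.parameters(), lr=1e-3)
ce = nn.CrossEntropyLoss()

for step in range(200):
    # Stand-ins for one few-shot task: support-set features and labels.
    real = torch.randn(N_WAY * 5, FEAT)
    labels = torch.arange(N_WAY).repeat_interleave(5)
    task_emb = real.mean(dim=0, keepdim=True).expand(N_WAY * 5, FEAT)
    fake = generator(torch.cat([torch.randn(N_WAY * 5, 32), task_emb], dim=1))
    fake_label = torch.full((N_WAY * 5,), N_WAY, dtype=torch.long)

    # Classifier/discriminator: real -> true class, generated -> extra class.
    d_loss = ce(classifier(real), labels) + ce(classifier(fake.detach()), fake_label)
    opt_c.zero_grad(); d_loss.backward(); opt_c.step()

    # Generator: produce samples the classifier cannot flag as fake.
    g_loss = -ce(classifier(fake), fake_label)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```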
Sparse Attentive Backtracking: Temporal Credit Assignment Through Reminding
Nan Rosemary Ke
Anirudh Goyal
Olexa Bilaniuk
Jonathan Binas
Michael Curtis Mozer
Learning long-term dependencies in extended temporal sequences requires credit assignment to events far back in the past. The most common method for training recurrent neural networks, back-propagation through time (BPTT), requires credit information to be propagated backwards through every single step of the forward computation, potentially over thousands or millions of time steps. This becomes computationally expensive or even infeasible when used with long sequences. Importantly, biological brains are unlikely to perform such detailed reverse replay over very long sequences of internal states (consider days, months, or years). However, humans are often reminded of past memories or mental states which are associated with the current mental state. We consider the hypothesis that such memory associations between past and present could be used for credit assignment through arbitrarily long sequences, propagating the credit assigned to the current state to the associated past state. Based on this principle, we study a novel algorithm which only back-propagates through a few of these temporal skip connections, realized by a learned attention mechanism that associates current states with relevant past states. We demonstrate in experiments that our method matches or outperforms regular BPTT and truncated BPTT in tasks involving particularly long-term dependencies, but without requiring the biologically implausible backward replay through the whole history of states. Additionally, we demonstrate that the proposed method transfers to longer sequences significantly better than LSTMs trained with BPTT and LSTMs trained with full self-attention.
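A rough rendering of the mechanism: store past hidden states, attend over them with the current state, and let gradients flow only through the few states the attention selects, instead of through every step as in BPTT. Everything below (the cell type, top-k selection, and full detachment of the step-to-step path) is an illustrative simplification of the algorithm, not its exact form.

```python
import torch
import torch.nn as nn

class SparseBacktrackRNN(nn.Module):
    def __init__(self, inp=16, hid=32, topk=3):
        super().__init__()
        self.cell = nn.GRUCell(inp, hid)
        self.query = nn.Linear(hid, hid)
        self.topk = topk

    def forward(self, x):                          # x: (seq_len, batch, inp)
        h = torch.zeros(x.size(1), self.cell.hidden_size)
        memory = []
        for t in range(x.size(0)):
            # Detach the dense step-to-step path (the real method keeps a short
            # truncated-BPTT window; full detachment is the extreme case).
            h = self.cell(x[t], h.detach())
            if memory:
                past = torch.stack(memory)         # (t, batch, hid)
                scores = (past * self.query(h)).sum(-1)
                k = min(self.topk, len(memory))
                top = scores.topk(k, dim=0).indices
                mask = torch.zeros_like(scores).scatter(0, top, 1.0)
                # Only the k selected past states receive gradient: the mask
                # zeroes both the activations and the credit for the rest.
                h = h + (mask.unsqueeze(-1) * past).sum(0) / k
            memory.append(h)
        return h

out = SparseBacktrackRNN()(torch.randn(50, 4, 16))  # final state, shape (4, 32)
```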
Twin Networks: Matching the Future for Sequence Generation
Dmitriy Serdyuk
Nan Rosemary Ke
Adam Trischler
We propose a simple technique for encouraging generative RNNs to plan ahead. We train a "backward" recurrent network to generate a given sequence in reverse order, and we encourage states of the forward model to predict cotemporal states of the backward model. The backward network is used only during training, and plays no role during sampling or inference. We hypothesize that our approach eases modeling of long-term dependencies by implicitly forcing the forward states to hold information about the longer-term future (as contained in the backward states). We show empirically that our approach achieves 9% relative improvement for a speech recognition task, and achieves significant improvement on a COCO caption generation task.
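The regularizer itself is compact enough to sketch directly: run a second RNN over the reversed sequence, re-align its states, and penalize the forward states for failing to predict their cotemporal backward twins. The affine predictor, the detached backward target, and all sizes below are assumptions for illustration.

```python
import torch
import torch.nn as nn

seq, batch, emb, hid = 20, 8, 32, 64
fwd = nn.GRU(emb, hid)
bwd = nn.GRU(emb, hid)              # used only during training
match = nn.Linear(hid, hid)         # predicts the backward state from the forward one

x = torch.randn(seq, batch, emb)    # stand-in for an embedded target sequence
hf, _ = fwd(x)                      # forward states h_1..h_T
hb_rev, _ = bwd(torch.flip(x, dims=[0]))
hb = torch.flip(hb_rev, dims=[0])   # re-align: hb[t] is cotemporal with hf[t]

# Twin penalty: forward states must predict the backward states, which summarize
# the future; detaching hb keeps this a regularizer on the forward model only.
twin_loss = ((match(hf) - hb.detach()) ** 2).mean()
# total_loss = generation_loss + alpha * twin_loss   # alpha: tuning knob
```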
Universal Successor Representations for Transfer Reinforcement Learning
Chen Ma
Junfeng Wen
The objective of transfer reinforcement learning is to generalize from a set of previous tasks to unseen new tasks. In this work, we focus on the transfer scenario where the dynamics among tasks are the same, but their goals differ. Although general value function (Sutton et al., 2011) has been shown to be useful for knowledge transfer, learning a universal value function can be challenging in practice. To attack this, we propose (1) to use universal successor representations (USR) to represent the transferable knowledge and (2) a USR approximator (USRA) that can be trained by interacting with the environment. Our experiments show that USR can be effectively applied to new tasks, and the agent initialized by the trained USRA can achieve the goal considerably faster than random initialization.
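The factorization behind successor representations can be shown with a tiny closed-form example: under a fixed policy, the expected discounted future features psi satisfy psi = (I - gamma * P)^(-1) @ phi, and switching goals changes only the reward weights w_g, never psi. The toy cyclic environment below is an assumption for illustration, not the paper's experimental setup.

```python
import numpy as np

n_states, gamma = 10, 0.9
phi = np.eye(n_states)                      # one-hot state features
P = np.roll(np.eye(n_states), 1, axis=1)    # a fixed policy: deterministic cycle

# Successor features under that policy: psi = (I - gamma * P)^(-1) @ phi.
psi = np.linalg.inv(np.eye(n_states) - gamma * P) @ phi

# Transfer: a new goal changes only the reward weights w_g, not psi.
for goal in (3, 7):
    w_g = np.zeros(n_states)
    w_g[goal] = 1.0                         # reward 1 upon reaching the goal state
    print(f"goal {goal}: V =", (psi @ w_g).round(2))
```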
Dendritic error backpropagation in deep cortical microcircuits
João Sacramento
Rui Ponte Costa
Walter Senn
Animal behaviour depends on learning to associate sensory stimuli with the desired motor command. Understanding how the brain orchestrates the necessary synaptic modifications across different brain areas has remained a longstanding puzzle. Here, we introduce a multi-area neuronal network model in which synaptic plasticity continuously adapts the network towards a global desired output. In this model, synaptic learning is driven by a local dendritic prediction error that arises from a failure to predict the top-down input given the bottom-up activities. Such errors occur at apical dendrites of pyramidal neurons where both long-range excitatory feedback and local inhibitory predictions are integrated. When local inhibition fails to match excitatory feedback, an error occurs which triggers plasticity at bottom-up synapses at basal dendrites of the same pyramidal neurons. We demonstrate the learning capabilities of the model in a number of tasks and show that it approximates the classical error backpropagation algorithm. Finally, complementing this cortical circuit with a disinhibitory mechanism enables attention-like stimulus denoising and generation. Our framework makes several experimental predictions on the function of dendritic integration and cortical microcircuits, is consistent with recent observations of cross-area learning, and suggests a biological implementation of deep learning.
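Numerically, the local rule can be caricatured in a few lines: the apical error at a hidden neuron is the top-down feedback minus what the circuit locally predicts, and that error gates plasticity at the same neuron's basal synapses. The single hidden layer, the sigmoid units, and the feedback weights fixed to W2.T (which makes the rule coincide with backprop) are all simplifying assumptions rather than the paper's dynamical model.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(0.0, 0.5, (20, 5))    # basal (bottom-up) synapses
W2 = rng.normal(0.0, 0.5, (1, 20))    # synapses onto the output area
B = W2.T.copy()                       # top-down feedback weights (aligned here)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for step in range(2000):
    x = rng.normal(size=(5, 1))
    target = np.sin(x.sum(keepdims=True))      # toy regression target

    h = sigmoid(W1 @ x)                        # hidden (pyramidal) activity
    y = W2 @ h                                 # output-area activity

    # Apical error: top-down feedback about the output mismatch that local
    # inhibition failed to predict; zero whenever the output is correct.
    apical_error = B @ (target - y)
    # Local plasticity: the apical error gates basal synapse updates.
    W1 += 0.05 * (apical_error * h * (1.0 - h)) @ x.T
    W2 += 0.05 * (target - y) @ h.T
```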
ObamaNet: Photo-realistic lip-sync from text
Rithesh Kumar
Jose Sotelo
Kundan Kumar
Alexandre De Brébisson
We present ObamaNet, the first architecture that generates both audio and synchronized photo-realistic lip-sync videos from any new text. Contrary to other published lip-sync approaches, ours is only composed of fully trainable neural modules and does not rely on any traditional computer graphics methods. More precisely, we use three main modules: a text-to-speech network based on Char2Wav, a time-delayed LSTM to generate mouth-keypoints synced to the audio, and a network based on Pix2Pix to generate the video frames conditioned on the keypoints.
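The pipeline structure, though not the trained networks, can be sketched module by module. Below, only the time-delayed keypoint LSTM is fleshed out; the TTS and Pix2Pix stages are indicated by comments, and every dimension and the delay length are assumptions for illustration.

```python
import torch
import torch.nn as nn

class KeypointLSTM(nn.Module):
    """Stage 2: audio features -> mouth keypoints, with a time delay so the
    mouth shape can depend on a window of upcoming audio."""
    def __init__(self, audio_dim=80, n_keypoints=20, delay=5):
        super().__init__()
        self.delay = delay
        self.lstm = nn.LSTM(audio_dim, 128, batch_first=True)
        self.out = nn.Linear(128, n_keypoints * 2)     # (x, y) per keypoint

    def forward(self, audio):                          # (batch, T, audio_dim)
        h, _ = self.lstm(audio)
        # Drop the first `delay` outputs: frame t sees audio up to t + delay.
        return self.out(h[:, self.delay:, :])

# Stage 1 (a Char2Wav-style TTS) would produce these audio features from text:
audio = torch.randn(1, 100, 80)
keypoints = KeypointLSTM()(audio)                      # shape (1, 95, 40)
# Stage 3 (a Pix2Pix-style generator) would render frames from the keypoints.
```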
Use machine learning to find energy materials.
Phil De Luna
Jennifer N. Wei
Alán Aspuru-Guzik
E. Sargent
Measuring the tendency of CNNs to Learn Surface Statistical Regularities
Jason Jo
Deep CNNs are known to exhibit the following peculiarity: on the one hand they generalize extremely well to a test set, while on the other hand they are extremely sensitive to so-called adversarial perturbations. The extreme sensitivity of high performance CNNs to adversarial examples casts serious doubt that these networks are learning high level abstractions in the dataset. We are concerned with the following question: How can a deep CNN that does not learn any high level semantics of the dataset manage to generalize so well? The goal of this article is to measure the tendency of CNNs to learn surface statistical regularities of the dataset. To this end, we use Fourier filtering to construct datasets which share the exact same high level abstractions but exhibit qualitatively different surface statistical regularities. For the SVHN and CIFAR-10 datasets, we present two Fourier filtered variants: a low frequency variant and a randomly filtered variant. Each of the Fourier filtering schemes is tuned to preserve the recognizability of the objects. Our main finding is that CNNs exhibit a tendency to latch onto the Fourier image statistics of the training dataset, sometimes exhibiting up to a 28% generalization gap across the various test sets. Moreover, we observe that significantly increasing the depth of a network has a very marginal impact on closing the aforementioned generalization gap. Thus we provide quantitative evidence supporting the hypothesis that deep CNNs tend to learn surface statistical regularities in the dataset rather than higher-level abstract concepts.
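The low-frequency dataset construction is straightforward to reproduce in spirit: Fourier-transform each image, keep only the frequencies near the centre of the shifted spectrum, and invert. The NumPy sketch below does exactly that; the cutoff radius is an illustrative choice, whereas the paper tunes its filtering schemes to preserve object recognizability.

```python
import numpy as np

def low_pass(image, radius=8):
    """Keep only spatial frequencies within `radius` of the spectrum centre."""
    f = np.fft.fftshift(np.fft.fft2(image, axes=(0, 1)), axes=(0, 1))
    h, w = image.shape[:2]
    yy, xx = np.ogrid[:h, :w]
    mask = (yy - h // 2) ** 2 + (xx - w // 2) ** 2 <= radius ** 2
    f = f * (mask[..., None] if image.ndim == 3 else mask)
    return np.real(np.fft.ifft2(np.fft.ifftshift(f, axes=(0, 1)), axes=(0, 1)))

img = np.random.rand(32, 32, 3)      # stand-in for a CIFAR-10 image
filtered = low_pass(img)             # same high-level content, different statistics
```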
Design of an Automatic Vehicle License Plate Recognition System through a Convolutional Neural Network
P. Rajendra
K. Sudheer
Rahul Boadh
TE Campos
BR Babu
M. Varma
Ian J Goodfellow
Aaron
The present work is a study of the practical application of deep learning to the development of a system for automatic recognition of vehicle license plates. These systems, commonly referred to as ALPR (Automatic License Plate Recognition) systems, are able to recognize the license plate content of vehicles from images captured by a camera. The system proposed in this work is based on an image classifier developed through supervised learning techniques with a convolutional neural network. These networks are one of the principal deep learning architectures and are specifically designed to solve artificial vision problems, such as pattern recognition and the classification of images. This paper also examines the basic image processing and segmentation techniques, such as smoothing filters and contour detection, necessary for the proposed system to extract the contents of the license plates for further analysis and classification. This paper demonstrates the feasibility of an ALPR system based on a convolutional neural network, noting the critical importance of designing a network architecture and a training data set appropriate to the problem to be solved.
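The classification stage described above amounts to a small convolutional network applied to segmented plate characters. The sketch below assumes 32x32 grayscale crops and 36 output classes (digits plus letters); the architecture is an illustrative example, not the paper's exact design.

```python
import torch
import torch.nn as nn

char_cnn = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 36),       # 10 digits + 26 letters
)

# A segmented, normalized 32x32 character crop from a detected plate:
crop = torch.randn(1, 1, 32, 32)
logits = char_cnn(crop)
print(logits.argmax(dim=1))          # predicted character class
```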