Portrait of Yoshua Bengio

Yoshua Bengio

Core Academic Member
Canada CIFAR AI Chair
Full Professor, Université de Montréal, Department of Computer Science and Operations Research
Founder and Scientific Advisor, Leadership Team
Research Topics
Medical machine learning
Representation learning
Reinforcement learning
Deep learning
Causality
Generative models
Probabilistic models
Molecular modeling
Computational neuroscience
Reasoning
Graph neural networks
Recurrent neural networks
Machine learning theory
Natural language processing

Biography

For media requests, please write to medias@mila.quebec.

For more information, please contact Marie-Josée Beauchamp, Administrative Assistant, at marie-josee.beauchamp@mila.quebec.

Recognized worldwide as a leading expert in artificial intelligence, Yoshua Bengio is best known for his pioneering work in deep learning, which earned him the 2018 A.M. Turing Award, the "Nobel Prize of computing," shared with Geoffrey Hinton and Yann LeCun. He is a Full Professor at Université de Montréal, Founder and Scientific Advisor of Mila – Quebec Artificial Intelligence Institute, and co-directs, as a Senior Fellow, the Learning in Machines & Brains program of the Canadian Institute for Advanced Research (CIFAR). He also serves as Special Advisor and Founding Scientific Director of IVADO.

In 2018, he was the computer scientist who collected the largest number of new citations worldwide. In 2019, he was awarded the prestigious Killam Prize. Since 2022, he has held the highest h-index in computer science worldwide. He is a Fellow of the Royal Society of London and of the Royal Society of Canada, and an Officer of the Order of Canada.

Concerned about the social impact of AI and the goal of having AI benefit everyone, he has contributed actively to the Montreal Declaration for the Responsible Development of Artificial Intelligence.

Current Students

PhD - UdeM; McGill; University of Waterloo
Master's (research) - UdeM
Postdoctorate - UdeM
Research intern - UdeM
Independent research visitor - UdeM; Technical University of Munich
Research collaborator - UdeM; Cambridge University; KAIST; Ying Wu College of Computing
Alumni collaborator - UdeM; McGill; Max-Planck-Institute for Intelligent Systems

Publications

Interpolation Consistency Training for Semi-Supervised Learning
Vikas Verma
Alex Lamb
Juho Kannala
David Lopez-Paz
Learning Dynamics Model in Reinforcement Learning by Incorporating the Long Term Future
Nan Rosemary Ke
Amanpreet Singh
Ahmed Touati
Anirudh Goyal
Devi Parikh
Dhruv Batra
In model-based reinforcement learning, the agent interleaves between model learning and planning. These two components are inextricably intertwined. If the model is not able to provide sensible long-term prediction, the executed planner would exploit model flaws, which can yield catastrophic failures. This paper focuses on building a model that reasons about the long-term future and demonstrates how to use this for efficient planning and exploration. To this end, we build a latent-variable autoregressive model by leveraging recent ideas in variational inference. We argue that forcing latent variables to carry future information through an auxiliary task substantially improves long-term predictions. Moreover, by planning in the latent space, the planner's solution is ensured to be within regions where the model is valid. An exploration strategy can be devised by searching for unlikely trajectories under the model. Our method achieves higher reward faster compared to baselines on a variety of tasks and environments in both the imitation learning and model-based reinforcement learning settings.
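To make the auxiliary-task idea concrete, here is a minimal, hedged sketch (not the authors' architecture; all module names, sizes, and the simple squared-error losses are illustrative assumptions) of a latent-variable recurrent dynamics model whose latent code is also trained to predict an observation several steps ahead, so that it must carry long-term information:

```python
# Minimal sketch of a latent-variable dynamics model with an auxiliary
# long-term prediction head. Names and losses are illustrative.
import torch
import torch.nn as nn

class LatentDynamics(nn.Module):
    def __init__(self, obs_dim, act_dim, hid=128, z_dim=32, horizon=5):
        super().__init__()
        self.horizon = horizon
        self.rnn = nn.GRUCell(obs_dim + act_dim, hid)
        self.post = nn.Linear(hid, 2 * z_dim)            # q(z_t | h_t): mean and log-variance
        self.decoder = nn.Linear(hid + z_dim, obs_dim)    # one-step reconstruction
        self.future_head = nn.Linear(z_dim, obs_dim)      # auxiliary multi-step prediction

    def forward(self, obs_seq, act_seq):
        # obs_seq: (T, B, obs_dim), act_seq: (T, B, act_dim)
        T, B, _ = obs_seq.shape
        h = obs_seq.new_zeros(B, self.rnn.hidden_size)
        recon_loss, future_loss, kl = 0.0, 0.0, 0.0
        for t in range(T - 1):
            h = self.rnn(torch.cat([obs_seq[t], act_seq[t]], -1), h)
            mu, logvar = self.post(h).chunk(2, -1)
            z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterized sample
            kl = kl + 0.5 * (mu**2 + logvar.exp() - logvar - 1).sum(-1).mean()
            recon_loss = recon_loss + ((self.decoder(torch.cat([h, z], -1)) - obs_seq[t + 1])**2).mean()
            k = min(t + self.horizon, T - 1)               # target several steps ahead
            future_loss = future_loss + ((self.future_head(z) - obs_seq[k])**2).mean()
        return recon_loss + future_loss + 1e-3 * kl

# toy usage on random sequences
model = LatentDynamics(obs_dim=10, act_dim=4)
loss = model(torch.randn(20, 8, 10), torch.randn(20, 8, 4))
```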
Equivalence of Equilibrium Propagation and Recurrent Backpropagation
Benjamin Scellier
Recurrent backpropagation and equilibrium propagation are supervised learning algorithms for fixed-point recurrent neural networks, which differ in their second phase. In the first phase, both algorithms converge to a fixed point that corresponds to the configuration where the prediction is made. In the second phase, equilibrium propagation relaxes to another nearby fixed point corresponding to smaller prediction error, whereas recurrent backpropagation uses a side network to compute error derivatives iteratively. In this work, we establish a close connection between these two algorithms. We show that at every moment in the second phase, the temporal derivatives of the neural activities in equilibrium propagation are equal to the error derivatives computed iteratively by recurrent backpropagation in the side network. This work shows that it is not required to have a side network for the computation of error derivatives and supports the hypothesis that in biological neural networks, temporal derivatives of neural activities may code for error signals.
Maximum Entropy Generators for Energy-Based Models
Rithesh Kumar
Anirudh Goyal
Maximum likelihood estimation of energy-based models is a challenging problem due to the intractability of the log-likelihood gradient. In this work, we propose learning both the energy function and an amortized approximate sampling mechanism using a neural generator network, which provides an efficient approximation of the log-likelihood gradient. The resulting objective requires maximizing entropy of the generated samples, which we perform using recently proposed nonparametric mutual information estimators. Finally, to stabilize the resulting adversarial game, we use a zero-centered gradient penalty derived as a necessary condition from the score matching literature. The proposed technique can generate sharp images with Inception and FID scores competitive with recent GAN techniques, does not suffer from mode collapse, and is competitive with state-of-the-art anomaly detection techniques.
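A rough sketch of that training loop, under explicit simplifications: the nonparametric mutual-information entropy estimator is replaced here by a crude nearest-neighbour log-distance surrogate (a swapped-in stand-in, not the paper's estimator), and the data are placeholders; only the overall structure (energy update with a zero-centered gradient penalty, generator update trading off low energy against high sample entropy) follows the description above:

```python
# Hedged sketch of joint energy-function / generator training.
import torch
import torch.nn as nn

z_dim, x_dim = 16, 2
E = nn.Sequential(nn.Linear(x_dim, 128), nn.ReLU(), nn.Linear(128, 1))      # energy function
G = nn.Sequential(nn.Linear(z_dim, 128), nn.ReLU(), nn.Linear(128, x_dim))  # generator
opt_E = torch.optim.Adam(E.parameters(), lr=1e-4)
opt_G = torch.optim.Adam(G.parameters(), lr=1e-4)

def entropy_surrogate(x):
    # crude stand-in: mean log distance to the nearest other sample in the batch
    d = torch.cdist(x, x) + torch.eye(len(x)) * 1e9
    return d.min(dim=1).values.clamp_min(1e-9).log().mean()

for step in range(1000):
    x_real = torch.randn(64, x_dim)               # placeholder data batch
    x_fake = G(torch.randn(64, z_dim))

    # energy update: push E(real) down, E(fake) up, penalize gradients on real data
    x_real.requires_grad_(True)
    e_real, e_fake = E(x_real).mean(), E(x_fake.detach()).mean()
    grad = torch.autograd.grad(E(x_real).sum(), x_real, create_graph=True)[0]
    gp = (grad.norm(2, dim=1) ** 2).mean()        # zero-centered gradient penalty
    loss_E = e_real - e_fake + 10.0 * gp
    opt_E.zero_grad(); loss_E.backward(); opt_E.step()

    # generator update: low-energy samples with high (surrogate) entropy
    x_fake = G(torch.randn(64, z_dim))
    loss_G = E(x_fake).mean() - entropy_surrogate(x_fake)
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
```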
The Benefits of Over-parameterization at Initialization in Deep ReLU Networks
Devansh Arpit
It has been noted in existing literature that over-parameterization in ReLU networks generally improves performance. While there could be several factors involved behind this, we prove some desirable theoretical properties at initialization which may be enjoyed by ReLU networks. Specifically, it is known that He initialization in deep ReLU networks asymptotically preserves variance of activations in the forward pass and variance of gradients in the backward pass for infinitely wide networks, thus preserving the flow of information in both directions. Our paper goes beyond these results and shows novel properties that hold under He initialization: i) the norm of hidden activation of each layer is equal to the norm of the input, and, ii) the norm of weight gradient of each layer is equal to the product of norm of the input vector and the error at output layer. These results are derived using the PAC analysis framework, and hold true for finitely sized datasets such that the width of the ReLU network only needs to be larger than a certain finite lower bound. As we show, this lower bound depends on the depth of the network and the number of samples, and by the virtue of being a lower bound, over-parameterized ReLU networks are endowed with these desirable properties. For the aforementioned hidden activation norm property under He initialization, we further extend our theory and show that this property holds for a finite width network even when the number of data samples is infinite. Thus we overcome several limitations of existing papers, and show new properties of deep ReLU networks at initialization.
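The activation-norm claim is easy to probe numerically. The following small, self-contained demo (my own illustration, not the paper's experiment) draws He-initialized weights and checks that, for equal-width layers, the activation norm after ten ReLU layers stays close to the input norm, with the agreement improving as the width grows:

```python
# Numerical check of norm preservation under He initialization.
import numpy as np

rng = np.random.default_rng(0)

def relu_forward_norm_ratio(width, depth, trials=20):
    ratios = []
    for _ in range(trials):
        x = rng.standard_normal(width)
        h = x
        for _ in range(depth):
            W = rng.standard_normal((width, width)) * np.sqrt(2.0 / width)  # He init
            h = np.maximum(W @ h, 0.0)
        ratios.append(np.linalg.norm(h) / np.linalg.norm(x))
    return np.mean(ratios)

for width in (64, 256, 1024, 4096):
    print(width, round(relu_forward_norm_ratio(width, depth=10), 3))  # ratio tends toward 1 as width grows
```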
Adversarial Domain Adaptation for Stable Brain-Machine Interfaces
Ali Farshchian
Juan A. Gallego
Joseph Paul Cohen
Lee Miller
Sara Solla
Brain-Machine Interfaces (BMIs) have recently emerged as a clinically viable option to restore voluntary movements after paralysis. These devices are based on the ability to extract information about movement intent from neural signals recorded using multi-electrode arrays chronically implanted in the motor cortices of the brain. However, the inherent loss and turnover of recorded neurons requires repeated recalibrations of the interface, which can potentially alter the day-to-day user experience. The resulting need for continued user adaptation interferes with the natural, subconscious use of the BMI. Here, we introduce a new computational approach that decodes movement intent from a low-dimensional latent representation of the neural data. We implement various domain adaptation methods to stabilize the interface over significantly long times. This includes Canonical Correlation Analysis used to align the latent variables across days; this method requires prior point-to-point correspondence of the time series across domains. Alternatively, we match the empirical probability distributions of the latent variables across days through the minimization of their Kullback-Leibler divergence. These two methods provide a significant and comparable improvement in the performance of the interface. However, implementation of an Adversarial Domain Adaptation Network trained to match the empirical probability distribution of the residuals of the reconstructed neural signals outperforms the two methods based on latent variables, while requiring remarkably few data points to solve the domain adaptation problem.
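As one illustration of the distribution-matching idea, here is a hedged sketch that aligns the latent variables of a later recording day to those of the reference day with a learnable linear map, by minimizing the KL divergence between Gaussian fits of the two sets of latents. The data are placeholders and the Gaussian approximation is a simplification of the empirical-distribution matching described above; it is not the paper's implementation:

```python
# Align day-k latents to day-0 latents by minimizing a Gaussian KL divergence.
import torch

def gaussian_kl(mu_p, cov_p, mu_q, cov_q):
    # KL( N(mu_p, cov_p) || N(mu_q, cov_q) ) for full-covariance Gaussians
    d = mu_p.numel()
    cov_q_inv = torch.linalg.inv(cov_q)
    diff = mu_q - mu_p
    return 0.5 * (torch.trace(cov_q_inv @ cov_p)
                  + diff @ cov_q_inv @ diff - d
                  + torch.logdet(cov_q) - torch.logdet(cov_p))

def fit_stats(z):
    mu = z.mean(0)
    zc = z - mu
    cov = zc.T @ zc / (len(z) - 1) + 1e-4 * torch.eye(z.shape[1])  # jitter for stability
    return mu, cov

z_day0 = torch.randn(500, 8)                       # latents from the reference day (placeholder)
z_dayk = torch.randn(500, 8) @ torch.randn(8, 8)   # drifted latents from a later day (placeholder)

A = torch.eye(8, requires_grad=True)               # learnable linear alignment
b = torch.zeros(8, requires_grad=True)
opt = torch.optim.Adam([A, b], lr=1e-2)
mu0, cov0 = fit_stats(z_day0)

for step in range(500):
    zk = z_dayk @ A.T + b
    muk, covk = fit_stats(zk)
    loss = gaussian_kl(muk, covk, mu0, cov0)       # match day-k statistics to day-0
    opt.zero_grad(); loss.backward(); opt.step()
```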
On Adversarial Mixup Resynthesis
Christopher Beckham
Sina Honari
Alex Lamb
Vikas Verma
Farnoosh Ghadiri
In this paper, we explore new approaches to combining information encoded within the learned representations of auto-encoders. We explore models that are capable of combining the attributes of multiple inputs such that a resynthesised output is trained to fool an adversarial discriminator for real versus synthesised data. Furthermore, we explore the use of such an architecture in the context of semi-supervised learning, where we learn a mixing function whose objective is to produce interpolations of hidden states, or masked combinations of latent representations that are consistent with a conditioned class label. We show quantitative and qualitative evidence that such a formulation is an interesting avenue of research.
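A minimal sketch of this setup, with illustrative layer sizes, random placeholder batches, and a simple convex combination as the mixing function (the paper also considers masked combinations): an autoencoder whose mixed-and-decoded outputs are pushed, adversarially, to look like real inputs:

```python
# Autoencoder with latent mixing and an adversarial discriminator on decoded mixes.
import torch
import torch.nn as nn

x_dim, z_dim = 784, 64
enc = nn.Sequential(nn.Linear(x_dim, 256), nn.ReLU(), nn.Linear(256, z_dim))
dec = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(), nn.Linear(256, x_dim))
disc = nn.Sequential(nn.Linear(x_dim, 256), nn.ReLU(), nn.Linear(256, 1))
opt_ae = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-4)
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

def mix(z1, z2):
    lam = torch.rand(len(z1), 1)                   # random convex combination of codes
    return lam * z1 + (1 - lam) * z2

for step in range(1000):
    x1, x2 = torch.rand(64, x_dim), torch.rand(64, x_dim)   # placeholder batches
    x_mix = dec(mix(enc(x1), enc(x2)))

    # discriminator: real inputs vs decoded mixtures
    loss_d = bce(disc(x1), torch.ones(64, 1)) + bce(disc(x_mix.detach()), torch.zeros(64, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # autoencoder: reconstruct, and make mixtures look real to the discriminator
    recon = ((dec(enc(x1)) - x1) ** 2).mean()
    loss_ae = recon + bce(disc(x_mix), torch.ones(64, 1))
    opt_ae.zero_grad(); loss_ae.backward(); opt_ae.step()
```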
Artificial Intelligence Cytometer in Blood
Geoffrey Hinton
Deep Graph Infomax
Petar Veličković
William Fedus
William L. Hamilton
Pietro Lio
We present Deep Graph Infomax (DGI), a general approach for learning node representations within graph-structured data in an unsupervised manner. DGI relies on maximizing mutual information between patch representations and corresponding high-level summaries of graphs---both derived using established graph convolutional network architectures. The learnt patch representations summarize subgraphs centered around nodes of interest, and can thus be reused for downstream node-wise learning tasks. In contrast to most prior approaches to unsupervised learning with GCNs, DGI does not rely on random walk objectives, and is readily applicable to both transductive and inductive learning setups. We demonstrate competitive performance on a variety of node classification benchmarks, which at times even exceeds the performance of supervised learning.
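The objective can be sketched compactly. The code below (illustrative sizes, a random placeholder graph, a single graph-convolution layer, a mean-pooling readout, and row-shuffled features as the corruption) trains a bilinear discriminator to separate (node embedding, graph summary) pairs of the real graph from pairs built on a corrupted graph; it is a hedged sketch, not the authors' code:

```python
# Sketch of a mutual-information objective between node embeddings and a graph summary.
import torch
import torch.nn as nn

n_nodes, in_dim, hid = 100, 32, 64
X = torch.randn(n_nodes, in_dim)                       # node features (placeholder)
A = (torch.rand(n_nodes, n_nodes) < 0.05).float()      # random adjacency (placeholder)
A = ((A + A.T) > 0).float()
A.fill_diagonal_(1.0)                                  # add self-loops
deg_inv_sqrt = A.sum(1).pow(-0.5)
A_hat = deg_inv_sqrt[:, None] * A * deg_inv_sqrt[None, :]   # symmetric normalization

W = nn.Linear(in_dim, hid)                             # GCN weight
bilinear = nn.Bilinear(hid, hid, 1)                    # discriminator D(h_i, s)
opt = torch.optim.Adam(list(W.parameters()) + list(bilinear.parameters()), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

def encode(x):
    return torch.relu(A_hat @ W(x))                    # one-layer graph convolution

for step in range(200):
    H = encode(X)                                      # positive node embeddings
    H_corrupt = encode(X[torch.randperm(n_nodes)])     # negatives: row-shuffled features
    s = torch.sigmoid(H.mean(0))                       # graph summary (readout)
    s_rep = s.repeat(n_nodes, 1)
    logits = torch.cat([bilinear(H, s_rep), bilinear(H_corrupt, s_rep)], 0).squeeze(-1)
    labels = torch.cat([torch.ones(n_nodes), torch.zeros(n_nodes)])
    loss = bce(logits, labels)                         # lower bound on mutual information
    opt.zero_grad(); loss.backward(); opt.step()
```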
An Empirical Study of Example Forgetting during Deep Neural Network Learning
Mariya Toneva*
Remi Tachet des Combes
Adam Trischler
Inspired by the phenomenon of catastrophic forgetting, we investigate the learning dynamics of neural networks as they train on single classification tasks. Our goal is to understand whether a related phenomenon occurs when data does not undergo a clear distributional shift. We define a “forgetting event” to have occurred when an individual training example transitions from being classified correctly to incorrectly over the course of learning. Across several benchmark data sets, we find that: (i) certain examples are forgotten with high frequency, and some not at all; (ii) a data set’s (un)forgettable examples generalize across neural architectures; and (iii) based on forgetting dynamics, a significant fraction of examples can be omitted from the training data set while still maintaining state-of-the-art generalization performance.
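The bookkeeping behind forgetting events is simple to sketch. In the hedged example below, `model` and a `loader` that yields `(x, y, idx)` with stable example indices are assumed to exist elsewhere; after each epoch one records per-example correctness and then counts correct-to-incorrect transitions:

```python
# Track per-example correctness across epochs and count forgetting events.
import torch

def epoch_correctness(model, loader, device="cpu"):
    """Return a bool tensor, indexed by example id, of per-example correctness."""
    model.eval()
    correct = {}
    with torch.no_grad():
        for x, y, idx in loader:                         # loader assumed to yield example indices
            pred = model(x.to(device)).argmax(dim=1).cpu()
            for i, ok in zip(idx.tolist(), (pred == y).tolist()):
                correct[i] = ok
    out = torch.zeros(max(correct) + 1, dtype=torch.bool)
    for i, ok in correct.items():
        out[i] = ok
    return out

def count_forgetting_events(per_epoch_correct):
    """per_epoch_correct: list of bool tensors, one per epoch, same length each."""
    acc = torch.stack(per_epoch_correct).int()           # (epochs, n_examples)
    transitions = (acc[:-1] == 1) & (acc[1:] == 0)       # correct -> incorrect
    events = transitions.sum(dim=0)                      # forgetting events per example
    never_learned = acc.sum(dim=0) == 0
    return events, never_learned

# toy check: three epochs of correctness for four examples
history = [torch.tensor([1, 0, 1, 1], dtype=torch.bool),
           torch.tensor([0, 0, 1, 1], dtype=torch.bool),
           torch.tensor([1, 1, 0, 1], dtype=torch.bool)]
events, never = count_forgetting_events(history)         # events = [1, 0, 1, 0]
```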
Gradient based sample selection for online continual learning
Rahaf Aljundi
Min Lin
Baptiste Goujaud
A continual learning agent learns online with a non-stationary and never-ending stream of data. The key to such learning process is to overcome the catastrophic forgetting of previously seen data, which is a well known problem of neural networks. To prevent forgetting, a replay buffer is usually employed to store the previous data for the purpose of rehearsal. Previous works often depend on task boundary and i.i.d. assumptions to properly select samples for the replay buffer. In this work, we formulate sample selection as a constraint reduction problem based on the constrained optimization view of continual learning. The goal is to select a fixed subset of constraints that best approximate the feasible region defined by the original constraints. We show that it is equivalent to maximizing the diversity of samples in the replay buffer with parameters gradient as the feature. We further develop a greedy alternative that is cheap and efficient. The advantage of the proposed method is demonstrated by comparing to other alternatives under the continual learning setting. Further comparisons are made against state of the art methods that rely on task boundaries which show comparable or even better results for our method.
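A hedged sketch of the greedy variant, treating each candidate's parameter-gradient vector as its feature and greedily growing a buffer of maximally dissimilar gradients (a max-min cosine-similarity heuristic standing in for the paper's constraint-reduction formulation; the per-sample gradients are random placeholders here):

```python
# Greedy selection of a gradient-diverse replay buffer.
import torch

def greedy_diverse_subset(grads, k):
    """grads: (n, p) per-sample gradient vectors; returns indices of k diverse samples."""
    g = torch.nn.functional.normalize(grads, dim=1)
    chosen = [0]                                         # start from an arbitrary sample
    max_sim = g @ g[0]                                   # running max similarity to the chosen set
    for _ in range(k - 1):
        max_sim[chosen] = float("inf")                   # never re-pick a chosen sample
        nxt = int(torch.argmin(max_sim))                 # most dissimilar to the current buffer
        chosen.append(nxt)
        max_sim = torch.maximum(max_sim, g @ g[nxt])     # update running max similarity
    return chosen

# toy usage: 200 candidates with 1,000-dimensional gradients, keep 20
grads = torch.randn(200, 1000)
buffer_idx = greedy_diverse_subset(grads, k=20)
```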