Portrait de Yoshua Bengio

Yoshua Bengio

Membre académique principal
Chaire en IA Canada-CIFAR
Professeur titulaire, Université de Montréal, Département d'informatique et de recherche opérationnelle
Fondateur et Conseiller scientifique, Équipe de direction
Sujets de recherche
Apprentissage automatique médical
Apprentissage de représentations
Apprentissage par renforcement
Apprentissage profond
Causalité
Modèles génératifs
Modèles probabilistes
Modélisation moléculaire
Neurosciences computationnelles
Raisonnement
Réseaux de neurones en graphes
Réseaux de neurones récurrents
Théorie de l'apprentissage automatique
Traitement du langage naturel

Biographie

*Pour toute demande média, veuillez écrire à medias@mila.quebec.

Pour plus d’information, contactez Marie-Josée Beauchamp, adjointe administrative à marie-josee.beauchamp@mila.quebec.

Reconnu comme une sommité mondiale en intelligence artificielle, Yoshua Bengio s’est surtout distingué par son rôle de pionnier en apprentissage profond, ce qui lui a valu le prix A. M. Turing 2018, le « prix Nobel de l’informatique », avec Geoffrey Hinton et Yann LeCun. Il est professeur titulaire à l’Université de Montréal, fondateur et conseiller scientifique de Mila – Institut québécois d’intelligence artificielle, et codirige en tant que senior fellow le programme Apprentissage automatique, apprentissage biologique de l'Institut canadien de recherches avancées (CIFAR). Il occupe également la fonction de conseiller spécial et directeur scientifique fondateur d’IVADO.

En 2018, il a été l’informaticien qui a recueilli le plus grand nombre de nouvelles citations au monde. En 2019, il s’est vu décerner le prestigieux prix Killam. Depuis 2022, il détient le plus grand facteur d’impact (h-index) en informatique à l’échelle mondiale. Il est fellow de la Royal Society de Londres et de la Société royale du Canada, et officier de l’Ordre du Canada.

Soucieux des répercussions sociales de l’IA et de l’objectif que l’IA bénéficie à tous, il a contribué activement à la Déclaration de Montréal pour un développement responsable de l’intelligence artificielle.

Étudiants actuels

Collaborateur·rice alumni - McGill
Collaborateur·rice alumni - UdeM
Collaborateur·rice de recherche - Cambridge University
Superviseur⋅e principal⋅e :
Doctorat - UdeM
Collaborateur·rice alumni - Université du Québec à Rimouski
Visiteur de recherche indépendant
Co-superviseur⋅e :
Doctorat - UdeM
Collaborateur·rice alumni - UQAR
Collaborateur·rice de recherche - N/A
Superviseur⋅e principal⋅e :
Doctorat - UdeM
Collaborateur·rice de recherche - KAIST
Collaborateur·rice alumni - UdeM
Doctorat - UdeM
Co-superviseur⋅e :
Doctorat - UdeM
Doctorat - UdeM
Doctorat - UdeM
Co-superviseur⋅e :
Stagiaire de recherche - UdeM
Stagiaire de recherche - UdeM
Doctorat
Doctorat - UdeM
Maîtrise recherche - UdeM
Co-superviseur⋅e :
Collaborateur·rice alumni - UdeM
Stagiaire de recherche - UdeM
Collaborateur·rice de recherche - UdeM
Collaborateur·rice alumni - UdeM
Collaborateur·rice alumni - UdeM
Collaborateur·rice alumni - UdeM
Postdoctorat - UdeM
Superviseur⋅e principal⋅e :
Collaborateur·rice alumni - UdeM
Superviseur⋅e principal⋅e :
Collaborateur·rice alumni
Collaborateur·rice alumni - Imperial College London
Doctorat - UdeM
Collaborateur·rice alumni - UdeM
Collaborateur·rice alumni - UdeM
Doctorat - UdeM
Co-superviseur⋅e :
Collaborateur·rice de recherche - UdeM
Doctorat - UdeM
Superviseur⋅e principal⋅e :
Doctorat - UdeM
Superviseur⋅e principal⋅e :
Postdoctorat - UdeM
Superviseur⋅e principal⋅e :
Visiteur de recherche indépendant - UdeM
Doctorat - UdeM
Superviseur⋅e principal⋅e :
Collaborateur·rice de recherche - Ying Wu Coll of Computing
Doctorat - University of Waterloo
Superviseur⋅e principal⋅e :
Collaborateur·rice alumni - Max-Planck-Institute for Intelligent Systems
Doctorat - UdeM
Postdoctorat - UdeM
Visiteur de recherche indépendant - UdeM
Doctorat - UdeM
Superviseur⋅e principal⋅e :
Collaborateur·rice alumni - UdeM
Maîtrise recherche - UdeM
Collaborateur·rice alumni - UdeM
Stagiaire de recherche - UdeM
Maîtrise recherche - UdeM
Visiteur de recherche indépendant - Technical University of Munich
Doctorat - UdeM
Co-superviseur⋅e :
Collaborateur·rice de recherche - RWTH Aachen University (Rheinisch-Westfälische Technische Hochschule Aachen)
Superviseur⋅e principal⋅e :
Postdoctorat - UdeM
Co-superviseur⋅e :
Doctorat - UdeM
Superviseur⋅e principal⋅e :
Collaborateur·rice de recherche - UdeM
Collaborateur·rice alumni - UdeM
Collaborateur·rice de recherche
Collaborateur·rice de recherche - KAIST
Doctorat - McGill
Superviseur⋅e principal⋅e :
Doctorat - UdeM
Superviseur⋅e principal⋅e :
Doctorat - McGill
Superviseur⋅e principal⋅e :

Publications

Compositional generalization in a deep seq2seq model by separating syntax and semantics
Jacob Russin
Jason Jo
R. O’Reilly
Standard methods in deep learning for natural language processing fail to capture the compositional structure of human language that allows … (voir plus)for systematic generalization outside of the training distribution. However, human learners readily generalize in this way, e.g. by applying known grammatical rules to novel words. Inspired by work in neuroscience suggesting separate brain systems for syntactic and semantic processing, we implement a modification to standard approaches in neural machine translation, imposing an analogous separation. The novel model, which we call Syntactic Attention, substantially outperforms standard methods in deep learning on the SCAN dataset, a compositional generalization task, without any hand-engineered features or additional supervision. Our work suggests that separating syntactic from semantic learning may be a useful heuristic for capturing compositional structure.
GradMask: Reduce Overfitting by Regularizing Saliency
Becks Simpson
Francis Dutil
Joseph Paul Cohen
With too few samples or too many model parameters, overfitting can inhibit the ability to generalise predictions to new data. Within medical… (voir plus) imaging, this can occur when features are incorrectly assigned importance such as distinct hospital specific artifacts, leading to poor performance on a new dataset from a different institution without those features, which is undesirable. Most regularization methods do not explicitly penalize the incorrect association of these features to the target class and hence fail to address this issue. We propose a regularization method, GradMask, which penalizes saliency maps inferred from the classifier gradients when they are not consistent with the lesion segmentation. This prevents non-tumor related features to contribute to the classification of unhealthy samples. We demonstrate that this method can improve test accuracy between 1-3% compared to the baseline without GradMask, showing that it has an impact on reducing overfitting.
Reinforced Imitation in Heterogeneous Action Space
Konrad Żołna
Sungjin Ahn
Pedro O. Pinheiro
Imitation learning is an effective alternative approach to learn a policy when the reward function is sparse. In this paper, we consider a c… (voir plus)hallenging setting where an agent and an expert use different actions from each other. We assume that the agent has access to a sparse reward function and state-only expert observations. We propose a method which gradually balances between the imitation learning cost and the reinforcement learning objective. In addition, this method adapts the agent's policy based on either mimicking expert behavior or maximizing sparse reward. We show, through navigation scenarios, that (i) an agent is able to efficiently leverage sparse rewards to outperform standard state-only imitation learning, (ii) it can learn a policy even when its actions are different from the expert, and (iii) the performance of the agent is not bounded by that of the expert, due to the optimized usage of sparse rewards.
Gated Orthogonal Recurrent Units: On Learning to Forget
Li Jing
Caglar Gulcehre
John Peurifoy
Yichen Shen
Max Tegmark
Marin Soljacic
We present a novel recurrent neural network (RNN)–based model that combines the remembering ability of unitary evolution RNNs with the abi… (voir plus)lity of gated RNNs to effectively forget redundant or irrelevant information in its memory. We achieve this by extending restricted orthogonal evolution RNNs with a gating mechanism similar to gated recurrent unit RNNs with a reset gate and an update gate. Our model is able to outperform long short-term memory, gated recurrent units, and vanilla unitary or orthogonal RNNs on several long-term-dependency benchmark tasks. We empirically show that both orthogonal and unitary RNNs lack the ability to forget. This ability plays an important role in RNNs. We provide competitive results along with an analysis of our model on many natural sequential tasks, including question answering, speech spectrum prediction, character-level language modeling, and synthetic tasks that involve long-term dependencies such as algorithmic, denoising, and copying tasks.
Towards Standardization of Data Licenses: The Montreal Data License
Misha Benjamin
P. Gagnon
Alex Shee
This paper provides a taxonomy for the licensing of data in the fields of artificial intelligence and machine learning. The paper's goal is … (voir plus)to build towards a common framework for data licensing akin to the licensing of open source software. Increased transparency and resolving conceptual ambiguities in existing licensing language are two noted benefits of the approach proposed in the paper. In parallel, such benefits may help foster fairer and more efficient markets for data through bringing about clearer tools and concepts that better define how data can be used in the fields of AI and ML. The paper's approach is summarized in a new family of data license language - \textit{the Montreal Data License (MDL)}. Alongside this new license, the authors and their collaborators have developed a web-based tool to generate license language espousing the taxonomies articulated in this paper.
Online continual learning with no task boundaries
Rahaf Aljundi
Min Lin
Baptiste Goujaud
Continual learning is the ability of an agent to learn online with a non-stationary and never-ending stream of data. A key component for suc… (voir plus)h never-ending learning process is to overcome the catastrophic forgetting of previously seen data, a problem that neural networks are well known to suffer from. The solutions developed so far often relax the problem of continual learning to the easier task-incremental setting, where the stream of data is divided into tasks with clear boundaries. In this paper, we break the limits and move to the more challenging online setting where we assume no information of tasks in the data stream. We start from the idea that each learning step should not increase the losses of the previously learned examples through constraining the optimization process. This means that the number of constraints grows linearly with the number of examples, which is a serious limitation. We develop a solution to select a fixed number of constraints that we use to approximate the feasible region defined by the original constraints. We compare our approach against the methods that rely on task boundaries to select a fixed set of examples, and show comparable or even better results, especially when the boundaries are blurry or when the data distributions are imbalanced.
Automated segmentation of cortical layers in BigBrain reveals divergent cortical and laminar thickness gradients in sensory and motor cortices.
Konrad Wagstyl
Stéphanie Larocque
Guillem Cucurull
Claude Lepage
Joseph Paul Cohen
Sebastian Bludau
Nicola Palomero-Gallagher
L. Lewis
Thomas Funck
Hannah Spitzer
Timo Dicksheid
Paul C Fletcher
Karl Zilles
Katrin Amunts
Alan C. Evans
Abstract Large-scale in vivo neuroimaging datasets offer new possibilities for reliable, well-powered measures of interregional structural d… (voir plus)ifferences and biomarkers of pathological changes in a wide variety of neurological and psychiatric diseases. However, so far studies have been structurally and functionally imprecise, being unable to relate pathological changes to specific cortical layers or neurobiological processes. We developed artificial neural networks to segment cortical and laminar surfaces in the BigBrain, a 3D histological model of the human brain. We sought to test whether previously-reported thickness gradients, as measured by MRI, in sensory and motor processing cortices, were present in a histological atlas of cortical thickness, and which cortical layers were contributing to these gradients. Identifying common gradients of cortical organisation enables us to meaningfully relate microstructural, macrostructural and functional cortical parameters. Analysis of thickness gradients across sensory cortices, using our fully segmented six-layered model, was consistent with MRI findings, showing increasing thickness moving up the processing hierarchy. In contrast, fronto-motor cortices showed the opposite pattern with changes in thickness of layers III, V and VI being the primary drivers of these gradients. As well as identifying key differences between sensory and motor gradients, our findings show how the use of this laminar atlas offers insights that will be key to linking single-neuron morphological changes, mesoscale cortical layers and macroscale cortical thickness.
Interpolation Consistency Training for Semi-Supervised Learning
Vikas Verma
Alex Lamb
Juho Kannala
David Lopez-Paz
Learning Dynamics Model in Reinforcement Learning by Incorporating the Long Term Future
Nan Rosemary Ke
Amanpreet Singh
Ahmed Touati
Anirudh Goyal
Devi Parikh
Dhruv Batra
In model-based reinforcement learning, the agent interleaves between model learning and planning. These two components are inextricably inte… (voir plus)rtwined. If the model is not able to provide sensible long-term prediction, the executed planner would exploit model flaws, which can yield catastrophic failures. This paper focuses on building a model that reasons about the long-term future and demonstrates how to use this for efficient planning and exploration. To this end, we build a latent-variable autoregressive model by leveraging recent ideas in variational inference. We argue that forcing latent variables to carry future information through an auxiliary task substantially improves long-term predictions. Moreover, by planning in the latent space, the planner's solution is ensured to be within regions where the model is valid. An exploration strategy can be devised by searching for unlikely trajectories under the model. Our method achieves higher reward faster compared to baselines on a variety of tasks and environments in both the imitation learning and model-based reinforcement learning settings.
Equivalence of Equilibrium Propagation and Recurrent Backpropagation
Benjamin Scellier
Recurrent backpropagation and equilibrium propagation are supervised learning algorithms for fixed-point recurrent neural networks, which di… (voir plus)ffer in their second phase. In the first phase, both algorithms converge to a fixed point that corresponds to the configuration where the prediction is made. In the second phase, equilibrium propagation relaxes to another nearby fixed point corresponding to smaller prediction error, whereas recurrent backpropagation uses a side network to compute error derivatives iteratively. In this work, we establish a close connection between these two algorithms. We show that at every moment in the second phase, the temporal derivatives of the neural activities in equilibrium propagation are equal to the error derivatives computed iteratively by recurrent backpropagation in the side network. This work shows that it is not required to have a side network for the computation of error derivatives and supports the hypothesis that in biological neural networks, temporal derivatives of neural activities may code for error signals.
Maximum Entropy Generators for Energy-Based Models
Rithesh Kumar
Anirudh Goyal
Maximum likelihood estimation of energy-based models is a challenging problem due to the intractability of the log-likelihood gradient. In t… (voir plus)his work, we propose learning both the energy function and an amortized approximate sampling mechanism using a neural generator network, which provides an efficient approximation of the log-likelihood gradient. The resulting objective requires maximizing entropy of the generated samples, which we perform using recently proposed nonparametric mutual information estimators. Finally, to stabilize the resulting adversarial game, we use a zero-centered gradient penalty derived as a necessary condition from the score matching literature. The proposed technique can generate sharp images with Inception and FID scores competitive with recent GAN techniques, does not suffer from mode collapse, and is competitive with state-of-the-art anomaly detection techniques.
The Benefits of Over-parameterization at Initialization in Deep ReLU Networks
Devansh Arpit
It has been noted in existing literature that over-parameterization in ReLU networks generally improves performance. While there could be se… (voir plus)veral factors involved behind this, we prove some desirable theoretical properties at initialization which may be enjoyed by ReLU networks. Specifically, it is known that He initialization in deep ReLU networks asymptotically preserves variance of activations in the forward pass and variance of gradients in the backward pass for infinitely wide networks, thus preserving the flow of information in both directions. Our paper goes beyond these results and shows novel properties that hold under He initialization: i) the norm of hidden activation of each layer is equal to the norm of the input, and, ii) the norm of weight gradient of each layer is equal to the product of norm of the input vector and the error at output layer. These results are derived using the PAC analysis framework, and hold true for finitely sized datasets such that the width of the ReLU network only needs to be larger than a certain finite lower bound. As we show, this lower bound depends on the depth of the network and the number of samples, and by the virtue of being a lower bound, over-parameterized ReLU networks are endowed with these desirable properties. For the aforementioned hidden activation norm property under He initialization, we further extend our theory and show that this property holds for a finite width network even when the number of data samples is infinite. Thus we overcome several limitations of existing papers, and show new properties of deep ReLU networks at initialization.