Yoshua Bengio

Core Academic Member
Canada CIFAR AI Chair
Full Professor, Université de Montréal, Department of Computer Science and Operations Research
Founder and Scientific Advisor, Leadership Team
Research Topics
Medical Machine Learning
Representation Learning
Reinforcement Learning
Deep Learning
Causality
Generative Models
Probabilistic Models
Molecular Modeling
Computational Neuroscience
Reasoning
Graph Neural Networks
Recurrent Neural Networks
Machine Learning Theory
Natural Language Processing

Biography

*For media requests, please write to medias@mila.quebec.

For more information, contact Marie-Josée Beauchamp, administrative assistant, at marie-josee.beauchamp@mila.quebec.

Recognized worldwide as a leading expert in artificial intelligence, Yoshua Bengio is best known for his pioneering work in deep learning, which earned him the 2018 A.M. Turing Award, "the Nobel Prize of computing," with Geoffrey Hinton and Yann LeCun. He is a full professor at Université de Montréal, the founder and scientific advisor of Mila – Quebec Artificial Intelligence Institute, and, as a senior fellow, co-directs the Learning in Machines & Brains program of the Canadian Institute for Advanced Research (CIFAR). He also serves as special advisor and founding scientific director of IVADO.

In 2018, he was the computer scientist who collected the largest number of new citations worldwide. In 2019, he was awarded the prestigious Killam Prize. Since 2022, he has held the highest h-index in computer science worldwide. He is a Fellow of the Royal Society of London and of the Royal Society of Canada, and an Officer of the Order of Canada.

Concerned about the social impact of AI and committed to the goal that AI benefit everyone, he actively contributed to the Montreal Declaration for the Responsible Development of Artificial Intelligence.

Current Students

Alumni Collaborator - McGill
Alumni Collaborator - UdeM
Research Collaborator - Cambridge University
Principal supervisor:
PhD - UdeM
Independent Visiting Researcher - KAIST
Independent Visiting Researcher
Co-supervisor:
PhD - UdeM
Research Collaborator - N/A
Principal supervisor:
PhD - UdeM
Research Collaborator - KAIST
Research Intern - UdeM
Co-supervisor:
PhD - UdeM
Co-supervisor:
PhD - UdeM
PhD - UdeM
Co-supervisor:
Research Intern - UdeM
PhD - UdeM
PhD - UdeM
Principal supervisor:
Alumni Collaborator - UdeM
Postdoctorate - UdeM
Principal supervisor:
Research Collaborator - UdeM
Alumni Collaborator - UdeM
Postdoctorate - UdeM
Principal supervisor:
Alumni Collaborator - UdeM
Alumni Collaborator - UdeM
Principal supervisor:
Alumni Collaborator
PhD - UdeM
Alumni Collaborator - UdeM
PhD - UdeM
Co-supervisor:
Research Collaborator - UdeM
PhD - UdeM
Principal supervisor:
PhD - UdeM
Principal supervisor:
Postdoctorate - UdeM
Principal supervisor:
Independent Visiting Researcher - UdeM
PhD - UdeM
Principal supervisor:
Research Collaborator - Ying Wu Coll of Computing
PhD - University of Waterloo
Principal supervisor:
Alumni Collaborator - Max-Planck-Institute for Intelligent Systems
Research Intern - UdeM
Co-supervisor:
PhD - UdeM
Postdoctorate - UdeM
Independent Visiting Researcher - UdeM
PhD - UdeM
Principal supervisor:
Alumni Collaborator - UdeM
Research Master's - UdeM
Alumni Collaborator - UdeM
Research Master's - UdeM
Independent Visiting Researcher - Technical University of Munich
PhD - UdeM
Co-supervisor:
Postdoctorate - UdeM
Co-supervisor:
PhD - UdeM
Principal supervisor:
Research Collaborator - UdeM
Research Collaborator
Research Collaborator - KAIST
PhD - UdeM
Principal supervisor:
PhD - McGill
Principal supervisor:
PhD - McGill
Principal supervisor:

Publications

Fast and Slow Learning of Recurrent Independent Mechanisms
Kanika Madan
Nan Rosemary Ke
Anirudh Goyal
Bernhard Schölkopf
Decomposing knowledge into interchangeable pieces promises a generalization advantage when there are changes in distribution. A learning agent interacting with its environment is likely to be faced with situations requiring novel combinations of existing pieces of knowledge. We hypothesize that such a decomposition of knowledge is particularly relevant for being able to generalize in a systematic manner to out-of-distribution changes. To study these ideas, we propose a particular training framework in which we assume that the pieces of knowledge an agent needs and its reward function are stationary and can be re-used across tasks. An attention mechanism dynamically selects which modules can be adapted to the current task, and the parameters of the selected modules are allowed to change quickly as the learner is confronted with variations in what it experiences, while the parameters of the attention mechanisms act as stable, slowly changing, meta-parameters. We focus on pieces of knowledge captured by an ensemble of modules sparsely communicating with each other via a bottleneck of attention. We find that meta-learning the modular aspects of the proposed system greatly helps in achieving faster adaptation in a reinforcement learning setup involving navigation in a partially observed grid world with image-level input. We also find that reversing the role of parameters and meta-parameters does not work nearly as well, suggesting a particular role for fast adaptation of the dynamically selected modules.
Flow Network based Generative Models for Non-Iterative Diverse Candidate Generation
Moksh J. Jain
Maksym Korablyov
This paper is about the problem of learning a stochastic policy for generating an object (like a molecular graph) from a sequence of actions, such that the probability of generating an object is proportional to a given positive reward for that object. Whereas standard return maximization tends to converge to a single return-maximizing sequence, there are cases where we would like to sample a diverse set of high-return solutions. These arise, for example, in black-box function optimization when few rounds are possible, each with large batches of queries, where the batches should be diverse, e.g., in the design of new molecules. One can also see this as a problem of approximately converting an energy function to a generative distribution. While MCMC methods can achieve that, they are expensive and generally only perform local exploration. Instead, training a generative policy amortizes the cost of search during training and yields fast generation. Using insights from Temporal Difference learning, we propose GFlowNet, based on a view of the generative process as a flow network, making it possible to handle the tricky case where different trajectories can yield the same final state, e.g., there are many ways to sequentially add atoms to generate some molecular graph. We cast the set of trajectories as a flow and convert the flow consistency equations into a learning objective, akin to the casting of the Bellman equations into Temporal Difference methods. We prove that any global minimum of the proposed objectives yields a policy which samples from the desired distribution, and demonstrate the improved performance and diversity of GFlowNet on a simple domain where there are many modes to the reward function, and on a molecule synthesis task.
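The flow view described in this abstract can be illustrated on a toy graph. The sketch below is not the paper's training procedure: it assumes a hypothetical five-state graph with hand-picked edge flows (the names `edge_flow`, `reward`, and the state labels are all invented for illustration), checks that each non-root state satisfies "inflow = outflow + reward", and then confirms by exact enumeration that a policy choosing children in proportion to edge flow reaches each terminal state with probability proportional to its reward, even though state `x` is reachable by two different trajectories.

```python
from fractions import Fraction

# Hypothetical toy DAG: two different trajectories reach terminal state "x".
# Edge flows are hand-picked here; in the paper they would be learned.
edge_flow = {
    ("s0", "a"): Fraction(1),
    ("s0", "b"): Fraction(3),
    ("a", "x"): Fraction(1),
    ("b", "x"): Fraction(2),
    ("b", "y"): Fraction(1),
}
reward = {"x": Fraction(3), "y": Fraction(1)}  # positive rewards on terminal states


def inflow(state):
    return sum(f for (_, dst), f in edge_flow.items() if dst == state)


def outflow(state):
    return sum(f for (src, _), f in edge_flow.items() if src == state)


def flow_mismatch(state):
    """Inflow minus (outflow + reward); zero when the state is flow-consistent."""
    return inflow(state) - (outflow(state) + reward.get(state, Fraction(0)))


def policy(state):
    """Forward policy: choose each child in proportion to its edge flow."""
    total = outflow(state)
    return {dst: f / total for (src, dst), f in edge_flow.items() if src == state}


def terminal_distribution(state="s0", prob=Fraction(1), dist=None):
    """Sum trajectory probabilities for every terminal state reachable from `state`."""
    dist = {} if dist is None else dist
    children = policy(state)  # empty for terminal states
    if not children:
        dist[state] = dist.get(state, Fraction(0)) + prob
        return dist
    for child, p in children.items():
        terminal_distribution(child, prob * p, dist)
    return dist


mismatches = {s: flow_mismatch(s) for s in ["a", "b", "x", "y"]}
dist = terminal_distribution()
total_reward = sum(reward.values())
for state, p in sorted(dist.items()):
    print(state, p, reward[state] / total_reward)
```

In this toy case the flows are consistent by construction; a learned model would instead be trained to drive the mismatch at every visited state toward zero.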
Invariance Principle Meets Information Bottleneck for Out-of-Distribution Generalization
Kartik Ahuja
Ethan Caballero
Dinghuai Zhang
Jean-Christophe Gagnon-Audet
The invariance principle from causality is at the heart of notable approaches such as invariant risk minimization (IRM) that seek to address out-of-distribution (OOD) generalization failures. Despite the promising theory, invariance principle-based approaches fail in common classification tasks, where invariant (causal) features capture all the information about the label. Are these failures due to the methods failing to capture the invariance? Or is the invariance principle itself insufficient? To answer these questions, we revisit the fundamental assumptions in linear regression tasks, where invariance-based approaches were shown to provably generalize OOD. In contrast to the linear regression tasks, we show that for linear classification tasks we need much stronger restrictions on the distribution shifts, or otherwise OOD generalization is impossible. Furthermore, even with appropriate restrictions on distribution shifts in place, we show that the invariance principle alone is insufficient. We prove that a form of the information bottleneck constraint along with invariance helps address key failures when invariant features capture all the information about the label and also retains the existing success when they do not. We propose an approach that incorporates both of these principles and demonstrate its effectiveness in several experiments.
Learning Neural Generative Dynamics for Molecular Conformation Generation
Minkai Xu
Shitong Luo
Jian Peng
We study how to generate molecule conformations (i.e., 3D structures) from a molecular graph. Traditional methods, such as molecular dynamics, sample conformations via computationally expensive simulations. Recently, machine learning methods have shown great potential by training on a large collection of conformation data. Challenges arise from the limited model capacity for capturing complex distributions of conformations and the difficulty in modeling long-range dependencies between atoms. Inspired by the recent progress in deep generative models, in this paper, we propose a novel probabilistic framework to generate valid and diverse conformations given a molecular graph. We propose a method combining the advantages of both flow-based and energy-based models, enjoying: (1) a high model capacity to estimate the multimodal conformation distribution; (2) explicitly capturing the complex long-range dependencies between atoms in the observation space. Extensive experiments demonstrate the superior performance of the proposed method on several benchmarks, including conformation generation and distance modeling tasks, with a significant improvement over existing generative models for molecular conformation sampling.
Machine Learning for Combinatorial Optimization: a Methodological Tour d'Horizon
Andrea Lodi
Antoine Prouvost
Multi-Domain Balanced Sampling Improves Out-of-Distribution Generalization of Chest X-ray Pathology Prediction Models
Enoch Amoatey Tetteh
Joseph D Viviano
Joseph Paul Cohen
Learning models that generalize under different distribution shifts in medical imaging has been a long-standing research challenge. There have been several proposals for efficient and robust visual representation learning among vision research practitioners, especially in the sensitive and critical biomedical domain. In this paper, we propose an idea for out-of-distribution generalization of chest X-ray pathologies that uses a simple balanced batch sampling technique. We observed that balanced sampling between the multiple training datasets improves the performance over baseline models trained without balancing. Code for this work is available on GitHub.
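One way to picture balanced sampling between several training datasets is sketched below. This is an illustration under stated assumptions, not the authors' released code: the function name `balanced_batches`, the dataset names, and the choice to balance only by source dataset (sampling with replacement) are all hypothetical.

```python
import random
from collections import Counter


def balanced_batches(datasets, batch_size, num_batches, seed=0):
    """Yield batches that draw an equal number of examples from each dataset."""
    if batch_size % len(datasets) != 0:
        raise ValueError("batch_size must be divisible by the number of datasets")
    per_dataset = batch_size // len(datasets)
    rng = random.Random(seed)
    for _ in range(num_batches):
        batch = []
        for name, examples in datasets.items():
            # Sample with replacement so small datasets can still fill their share.
            picks = rng.choices(examples, k=per_dataset)
            batch.extend((name, example) for example in picks)
        rng.shuffle(batch)  # avoid a fixed dataset order inside the batch
        yield batch


# Hypothetical datasets of very different sizes (integers stand in for images).
datasets = {
    "site_a": list(range(1000)),
    "site_b": list(range(50)),
    "site_c": list(range(200)),
}
batches = list(balanced_batches(datasets, batch_size=6, num_batches=4))
for batch in batches:
    print(Counter(name for name, _ in batch))
```

Without balancing, a batch drawn uniformly from the pooled data would be dominated by the largest dataset; here each source contributes the same number of examples regardless of its size.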
Multimodal Audio-textual Architecture for Robust Spoken Language Understanding
Dmitriy Serdyuk
Yongqiang Wang
Christian Fue-730
Anuj Kumar
Baiyang Liu
Edwin Simonnet
Sahar Ghannay
Nathalie Camelin
Tandem spoken language understanding (SLU) systems suffer from the so-called automatic speech recognition (ASR) error propagation problem. Additionally, as the ASR is not optimized to extract semantics, but solely the linguistic content, relevant semantic cues might be left out of its transcripts. In this work, we propose a multimodal language understanding (MLU) architecture to mitigate these problems. Our solution is based on two compact unidirectional long short-term memory (LSTM) models that encode speech and text information. A fusion layer is also used to fuse audio and text embeddings. Two fusion strategies are explored: a simple concatenation of these embeddings and a cross-modal attention mechanism that learns the contribution of each modality. The first approach showed to be the optimal solution to robustly extract semantic information from audio-textual data. We found that attention is less effective at testing time when the text modality is corrupted. Our model is evaluated on three SLU datasets and robustness is tested using ASR outputs from three off-the-shelf ASR engines. Results show that the proposed approach effectively mitigates the ASR error propagation problem for all datasets.
Optimization of Artificial Neural Network Hyperparameters For Processing Retrospective Information
A. Rogachev
F. Scholle
Yann LeCun
I. L. Kashirin
M. Demchenko
Justification of the selection of the architecture and hyperparameters of artificial neural networks (ANN), focused on solving various classes of applied problems, is a scientific and methodological problem. Optimizing the selection of ANN hyperparameters allows you to improve the quality and speed of ANN training. Various methods of optimizing the selection of ANN hyperparameters are known – the use of evolutionary calculations, genetic algorithms, etc., but they require the use of additional software. To optimize the process of selecting ANN hyperparameters, Google Research has developed the KerasTuner software tool. It is a platform for automated search of a set of optimal combinations of hyperparameters. In KerasTuner, you can use various methods: random search, Bayesian optimization, or Hyperband. In the numerical experiments conducted by the author, 14 hyperparameters were varied, including the number of blocks of convolutional layers and the filters forming them, the type of activation function, the parameters of the "dropout" layers, and others. The studied tools demonstrated high efficiency while simultaneously varying more than a dozen optimized parameters of the convolutional network. The calculation time on the Colaboratory platform for the various combined ANN architectures studied, including recurrent RNN networks, was several hours, even with the use of GPU graphics accelerators. For ANN focused on the processing and recognition of retrospective information, recognition quality was increased to 80 to 95%.
Predicting Unreliable Predictions by Shattering a Neural Network
Xu Ji
Andrea Vedaldi
Balaji Lakshminarayanan
Piecewise linear neural networks can be split into subfunctions, each with its own activation pattern, domain, and empirical error. Empirical error for the full network can be written as an expectation over empirical error of subfunctions. Constructing a generalization bound on subfunction empirical error indicates that the more densely a subfunction is surrounded by training samples in representation space, the more reliable its predictions are. Further, it suggests that models with fewer activation regions generalize better, and models that abstract knowledge to a greater degree generalize better, all else equal. We propose not only a theoretical framework to reason about subfunction error bounds but also a pragmatic way of approximately evaluating it, which we apply to predicting which samples the network will not successfully generalize to. We test our method on detection of misclassification and out-of-distribution samples, finding that it performs competitively in both cases. In short, some network activation patterns are associated with higher reliability than others, and these can be identified using subfunction error bounds.
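The notion of an activation pattern in this abstract can be made concrete with a small sketch. It assumes a hypothetical one-hidden-layer ReLU network with made-up weights (`W`, `b`) and groups a handful of made-up training inputs by which hidden units are active. Counting training samples per pattern is only a crude stand-in for the density idea in the abstract; it is not the paper's error bound.

```python
from collections import defaultdict

# Hypothetical one-hidden-layer ReLU network: two inputs, three hidden units.
W = [[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0]]
b = [0.0, -1.0, 0.5]


def activation_pattern(x):
    """Tuple of 0/1 flags recording which hidden ReLU units are active for input x."""
    pre = [sum(w * xi for w, xi in zip(row, x)) + bias for row, bias in zip(W, b)]
    return tuple(int(p > 0) for p in pre)


def group_by_pattern(samples):
    """Map each activation pattern to the samples that fall in its linear region."""
    groups = defaultdict(list)
    for x in samples:
        groups[activation_pattern(x)].append(x)
    return dict(groups)


# Made-up training inputs; each pattern corresponds to one linear subfunction.
train = [(2.0, -0.5), (3.0, 0.5), (0.0, 1.0), (-2.0, -1.0), (2.5, 0.2)]
groups = group_by_pattern(train)
counts = {pattern: len(xs) for pattern, xs in groups.items()}
for pattern, n in sorted(counts.items()):
    print(pattern, n)
```

Within each group the network computes a single affine function, which is why errors can be analysed per pattern rather than only for the network as a whole.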
Saliency is a Possible Red Herring When Diagnosing Poor Generalization
Joseph D Viviano
Becks Simpson
Francis Dutil
Joseph Paul Cohen
Poor generalization is one symptom of models that learn to predict target variables using spuriously-correlated image features present only in the training distribution instead of the true image features that denote a class. It is often thought that this can be diagnosed visually using attribution (aka saliency) maps. We study if this assumption is correct. In some prediction tasks, such as for medical images, one may have some images with masks drawn by a human expert, indicating a region of the image containing relevant information to make the prediction. We study multiple methods that take advantage of such auxiliary labels, by training networks to ignore distracting features which may be found outside of the region of interest. This mask information is only used during training and has an impact on generalization accuracy depending on the severity of the shift between the training and test distributions. Surprisingly, while these methods improve generalization performance in the presence of a covariate shift, there is no strong correspondence between the correction of attribution towards the features a human expert has labelled as important and generalization performance. These results suggest that the root cause of poor generalization may not always be spatially defined, and raise questions about the utility of masks as 'attribution priors' as well as saliency maps for explainable predictions.
Seeing things or seeing scenes: Investigating the capabilities of V&L models to align scene descriptions to images
Images can be described in terms of the objects they contain, or in terms of the types of scene or place that they instantiate. In this paper we address to what extent pretrained Vision and Language models can learn to align descriptions of both types with images. We compare 3 state-of-the-art models, VisualBERT, LXMERT and CLIP. We find that (i) V&L models are susceptible to stylistic biases acquired during pretraining; (ii) only CLIP performs consistently well on both object- and scene-level descriptions. A follow-up ablation study shows that CLIP uses object-level information in the visual modality to align with scene-level textual descriptions.
A Simple and Effective Model for Multi-Hop Question Generation
Previous research on automated question generation has almost exclusively focused on generating factoid questions whose answers can be extracted from a single document. However, there is an increasing interest in developing systems that are capable of more complex multi-hop question generation (QG), where answering the question requires reasoning over multiple documents. In this work, we propose a simple and effective approach based on the transformer model for multi-hop QG. Our approach consists of specialized input representations, a supporting sentence classification objective, and training data weighting. Prior work on multi-hop QG considers the simplified setting of shorter documents and also advocates the use of entity-based graph structures as essential ingredients in model design. On the contrary, we showcase that our model can scale to the challenging setting of longer documents as input, does not rely on graph structures, and substantially outperforms the state-of-the-art approaches as measured by automated metrics and human evaluation.