
Yoshua Bengio

Core Academic Member
Canada CIFAR AI Chair
Full Professor, Université de Montréal, Department of Computer Science and Operations Research
Founder and Scientific Advisor, Leadership Team
Research Topics
Causality
Computational Neuroscience
Deep Learning
Generative Models
Graph Neural Networks
Machine Learning Theory
Medical Machine Learning
Molecular Modeling
Natural Language Processing
Probabilistic Models
Reasoning
Recurrent Neural Networks
Reinforcement Learning
Representation Learning

Biography

*For media requests, please write to medias@mila.quebec.

For more information please contact Marie-Josée Beauchamp, Administrative Assistant at marie-josee.beauchamp@mila.quebec.

Yoshua Bengio is recognized worldwide as a leading expert in AI. He is best known for his pioneering work in deep learning, which earned him the 2018 A.M. Turing Award, “the Nobel Prize of computing,” shared with Geoffrey Hinton and Yann LeCun.

Bengio is a full professor at Université de Montréal, and the founder and scientific advisor of Mila – Quebec Artificial Intelligence Institute. He is also a senior fellow at CIFAR and co-directs its Learning in Machines & Brains program, serves as special advisor and founding scientific director of IVADO, and holds a Canada CIFAR AI Chair.

In 2019, Bengio was awarded the prestigious Killam Prize and in 2022, he was the most cited computer scientist in the world by h-index. He is a Fellow of the Royal Society of London, Fellow of the Royal Society of Canada, Knight of the Legion of Honor of France and Officer of the Order of Canada. In 2023, he was appointed to the UN’s Scientific Advisory Board for Independent Advice on Breakthroughs in Science and Technology.

Concerned about the social impact of AI, Bengio helped draft the Montréal Declaration for the Responsible Development of Artificial Intelligence and continues to raise awareness about the importance of mitigating the potentially catastrophic risks associated with future AI systems.

Current Students

Collaborating Alumni - McGill University
Collaborating Alumni - Université de Montréal
Collaborating researcher - Cambridge University
Principal supervisor :
PhD - Université de Montréal
Collaborating Alumni - Université du Québec à Rimouski
Independent visiting researcher
Co-supervisor :
PhD - Université de Montréal
Collaborating Alumni - UQAR
Collaborating researcher - N/A
Principal supervisor :
PhD - Université de Montréal
Collaborating researcher - KAIST
PhD - Université de Montréal
PhD - Université de Montréal
Collaborating Alumni - Université de Montréal
PhD - Université de Montréal
Co-supervisor :
PhD - Université de Montréal
PhD - Université de Montréal
PhD - Université de Montréal
Co-supervisor :
PhD - Université de Montréal
Research Intern - Université de Montréal
Research Intern - Université de Montréal
PhD - Université de Montréal
Master's Research - Université de Montréal
Co-supervisor :
Collaborating Alumni - Université de Montréal
Research Intern - Université de Montréal
Collaborating researcher - Université de Montréal
Collaborating Alumni - Université de Montréal
Collaborating Alumni - Université de Montréal
Postdoctorate - Université de Montréal
Principal supervisor :
Collaborating Alumni - Université de Montréal
Collaborating Alumni
Collaborating Alumni - Université de Montréal
Principal supervisor :
PhD - Université de Montréal
Collaborating Alumni - Université de Montréal
Collaborating Alumni - Université de Montréal
PhD - Université de Montréal
Co-supervisor :
Collaborating researcher - Université de Montréal
PhD - Université de Montréal
Principal supervisor :
PhD - Université de Montréal
Principal supervisor :
Postdoctorate - Université de Montréal
Principal supervisor :
Independent visiting researcher - Université de Montréal
PhD - Université de Montréal
Principal supervisor :
Collaborating researcher - Ying Wu Coll of Computing
PhD - University of Waterloo
Principal supervisor :
Collaborating Alumni - Max-Planck-Institute for Intelligent Systems
PhD - Université de Montréal
Postdoctorate - Université de Montréal
Independent visiting researcher - Université de Montréal
Postdoctorate - Université de Montréal
PhD - Université de Montréal
Principal supervisor :
Collaborating Alumni - Université de Montréal
Postdoctorate - Université de Montréal
Master's Research - Université de Montréal
Collaborating Alumni - Université de Montréal
Research Intern - Université de Montréal
Master's Research - Université de Montréal
Postdoctorate
Independent visiting researcher - Technical University of Munich
PhD - Université de Montréal
Co-supervisor :
Collaborating researcher - RWTH Aachen University (Rheinisch-Westfälische Technische Hochschule Aachen)
Principal supervisor :
Postdoctorate - Université de Montréal
Postdoctorate - Université de Montréal
Co-supervisor :
PhD - Université de Montréal
Principal supervisor :
Collaborating researcher - Université de Montréal
Collaborating Alumni - Université de Montréal
Collaborating researcher
Collaborating researcher - KAIST
PhD - Université de Montréal
PhD - McGill University
Principal supervisor :
PhD - Université de Montréal
Principal supervisor :
PhD - McGill University
Principal supervisor :

Publications

Optimization of Artificial Neural Network Hyperparameters For Processing Retrospective Information
A. Rogachev
F. Scholle
Yann LeCun
I. L. Kashirin
M. Demchenko
Justification of the selection of the architecture and hyperparameters of artificial neural networks (ANN), focused on solving various classes of applied problems, is a scientific and methodological problem. Optimizing the selection of ANN hyperparameters allows you to improve the quality and speed of ANN training. Various methods of optimizing the selection of ANN hyperparameters are known, such as the use of evolutionary calculations and genetic algorithms, but they require the use of additional software. To optimize the process of selecting ANN hyperparameters, Google Research has developed the KerasTuner software tool. It is a platform for automated search of a set of optimal combinations of hyperparameters. In KerasTuner, you can use various methods: random search, Bayesian optimization, or Hyperband. In the numerical experiments conducted by the author, 14 hyperparameters were varied, including the number of blocks of convolutional layers and the filters forming them, the type of activation function, the parameters of the "dropout" layers, and others. The studied tools demonstrated high efficiency while simultaneously varying more than a dozen optimized parameters of the convolutional network. The calculation time on the Colaboratory platform for the various combined ANN architectures studied, including recurrent neural networks (RNNs), was several hours, even with the use of GPU accelerators. For ANN focused on the processing and recognition of retrospective information, an increase in the quality of recognition to 80–95% was achieved.
Predicting Unreliable Predictions by Shattering a Neural Network
Xu Ji
Andrea Vedaldi
Balaji Lakshminarayanan
Piecewise linear neural networks can be split into subfunctions, each with its own activation pattern, domain, and empirical error. Empirical error for the full network can be written as an expectation over empirical error of subfunctions. Constructing a generalization bound on subfunction empirical error indicates that the more densely a subfunction is surrounded by training samples in representation space, the more reliable its predictions are. Further, it suggests that models with fewer activation regions generalize better, and models that abstract knowledge to a greater degree generalize better, all else equal. We propose not only a theoretical framework to reason about subfunction error bounds but also a pragmatic way of approximately evaluating it, which we apply to predicting which samples the network will not successfully generalize to. We test our method on detection of misclassification and out-of-distribution samples, finding that it performs competitively in both cases. In short, some network activation patterns are associated with higher reliability than others, and these can be identified using subfunction error bounds.
Saliency is a Possible Red Herring When Diagnosing Poor Generalization
Joseph D Viviano
Becks Simpson
Francis Dutil
Joseph Paul Cohen
Poor generalization is one symptom of models that learn to predict target variables using spuriously-correlated image features present only in the training distribution instead of the true image features that denote a class. It is often thought that this can be diagnosed visually using attribution (aka saliency) maps. We study if this assumption is correct. In some prediction tasks, such as for medical images, one may have some images with masks drawn by a human expert, indicating a region of the image containing relevant information to make the prediction. We study multiple methods that take advantage of such auxiliary labels, by training networks to ignore distracting features which may be found outside of the region of interest. This mask information is only used during training and has an impact on generalization accuracy depending on the severity of the shift between the training and test distributions. Surprisingly, while these methods improve generalization performance in the presence of a covariate shift, there is no strong correspondence between the correction of attribution towards the features a human expert has labelled as important and generalization performance. These results suggest that the root cause of poor generalization may not always be spatially defined, and raise questions about the utility of masks as 'attribution priors' as well as saliency maps for explainable predictions.
Seeing things or seeing scenes: Investigating the capabilities of V&L models to align scene descriptions to images
Matt D Anderson
Erich W Graf
James H Elder
Peter Anderson
Xiaodong He
Chris Buehler
Mark Teney
Stephen Johnson
Gould Lei
Emily M. Bender
Timnit Gebru
Angelina McMillan-575
Alexander Koller. 2020
Yonatan Bisk
Ari Holtzman
Jesse Thomason
Joyce Chai
Angeliki Lazaridou
Jonathan May
Aleksandr
Thomas Unterthiner
Mostafa Dehghani
Georg Minderer
Sylvain Heigold
Jakob Gelly
Uszkoreit Neil
Houlsby. 2020
An
Lisa Anne Hendricks
Gabriel Ilharco
Rowan Zellers
Ali Farhadi
John M. Henderson
Contextual
Thomas L. Griffiths. 2021
Melissa L.-H. Võ
Jeremy M. Wolfe
Differen-830
Jianfeng Wang
Xiaowei Hu
Xiu-834 Pengchuan Zhang
Roy Schwartz
Bolei Zhou
Àgata Lapedriza
Jianxiong Xiao
Hang Zhao
Xavier Puig
Sanja Fidler
Images can be described in terms of the objects they contain, or in terms of the types of scene or place that they instantiate. In this paper we address to what extent pretrained Vision and Language models can learn to align descriptions of both types with images. We compare 3 state-of-the-art models, VisualBERT, LXMERT and CLIP. We find that (i) V&L models are susceptible to stylistic biases acquired during pretraining; (ii) only CLIP performs consistently well on both object- and scene-level descriptions. A follow-up ablation study shows that CLIP uses object-level information in the visual modality to align with scene-level textual descriptions.
A Simple and Effective Model for Multi-Hop Question Generation
Jimmy Lei Ba
Jamie Ryan Kiros
Geoffrey E Hin-602
Peter W. Battaglia
Jessica Blake
Chandler Hamrick
Victor Bapst
Alvaro Sanchez
Vinicius Zambaldi
M. Malinowski
Andrea Tacchetti
David Raposo
Tom B. Brown
Benjamin Mann
Nick Ryder
Melanie Subbiah
Jared Kaplan
Prafulla Dhariwal
Arvind Neelakantan
Pranav Shyam
Girish Sastry
Koustuv Sinha
Shagun Sodhani
Jin Dong
William L. Hamilton
Clutrr
Nitish Srivastava
Geoffrey Hinton
Alex Krizhevsky
Ilya Sutskever
Ruslan Salakhutdinov. 2014
Gabriel Stanovsky
Julian Michael
Luke Zettlemoyer
Dan Su
Yan Xu
Wenliang Dai
Ziwei Ji
Tiezheng Yu
Minghao Tu
Kevin Huang
Guangtao Wang
Jing Huang
Ashish Vaswani
Noam M. Shazeer
Niki Parmar
Jakob Uszkoreit
Llion Jones
Aidan N. Gomez
Łukasz Kaiser
Illia Polosukhin. 2017
Attention
Petar Veličković
Guillem Cucurull
Arantxa Casanova
Pietro Liò
Johannes Welbl
Pontus Stenetorp
Yonghui Wu
Mike Schuster
Quoc Zhifeng Chen
Mohammad Le
Wolfgang Norouzi
Macherey
M. Krikun
Yuan Cao
Qin Gao
William W. Cohen
Jianxing Yu
Xiaojun Quan
Qinliang Su
Jian Yin
Yuyu Zhang
Hanjun Dai
Zornitsa Kozareva
Chen Zhao
Chenyan Xiong
Corby Rosset
Xia
Paul Song
Bennett Saurabh
Tiwary
Yao Zhao
Xiaochuan Ni
Yuanyuan Ding
Qingyu Zhou
Nan Yang
Furu Wei
Chuanqi Tan
Previous research on automated question generation has almost exclusively focused on generating factoid questions whose answers can be extracted from a single document. However, there is an increasing interest in developing systems that are capable of more complex multi-hop question generation (QG), where answering the question requires reasoning over multiple documents. In this work, we propose a simple and effective approach based on the transformer model for multi-hop QG. Our approach consists of specialized input representations, a supporting sentence classification objective, and training data weighting. Prior work on multi-hop QG considers the simplified setting of shorter documents and also advocates the use of entity-based graph structures as essential ingredients in model design. On the contrary, we showcase that our model can scale to the challenging setting of longer documents as input, does not rely on graph structures, and substantially outperforms the state-of-the-art approaches as measured by automated metrics and human evaluation.
SPE: Symmetrical Prompt Enhancement for Factual Knowledge Retrieval
James M. Crawford
Matthew L. Ginsberg
Jacob Devlin
Ming-Wei Chang
Kenton Lee
Xavier Glorot
Antoine Bordes
Alex Graves
Abdel-rahman Mohamed
Adi Haviv
Jonathan Berant
Amir Globerson
Chloe Kiddon
Pedro M. Domingos
Brian Lester
Rami Al-Rfou
Noah Constant. 2021
Pengfei Liu
Weizhe Yuan
Jinlan Fu
Zhengbao Jiang
Xiao Liu
Yanan Zheng
Zhengxiao Du
Ming Ding
Pretrained language models (PLMs) have been shown to accumulate factual knowledge from their unsupervised pretraining procedures (Petroni et al., 2019). Prompting is an effective way to query such knowledge from PLMs. Recently, continuous prompt methods have been shown to have a larger potential than discrete prompt methods in generating effective queries (Liu et al., 2021a). However, these methods do not consider symmetry of the task. In this work, we propose Symmetrical Prompt Enhancement (SPE), a continuous prompt-based method for fact retrieval that leverages the symmetry of the task. Our results on LAMA, a popular fact retrieval dataset, show significant improvement of SPE over previous prompt methods.
Systematic generalisation with group invariant predictions
Faruk Ahmed
Harm van Seijen
We consider situations where the presence of dominant simpler correlations with the target variable in a training set can cause an SGD-trained neural network to be less reliant on more persistently correlating complex features. When the non-persistent, simpler correlations correspond to non-semantic background factors, a neural network trained on this data can exhibit dramatic failure upon encountering systematic distributional shift, where the correlating background features are recombined with different objects. We perform an empirical study on three synthetic datasets, showing that group invariance methods across inferred partitionings of the training set can lead to significant improvements at such test-time situations. We also suggest a simple invariance penalty, showing with experiments on our setups that it can perform better than alternatives. We find that even without assuming access to any systematically shifted validation sets, one can still find improvements over an ERM-trained reference model.
Tackling Situated Multi-Modal Task-Oriented Dialogs with a Single Transformer Model
Réjean Ducharme
Morgan Kaufmann
Yen-Chun Chen
Linjie Li
Licheng Yu
Matthew Henderson
Blaise Thomson
Ehsan Hosseini-Asl
Bryan McCann
Chien-Sheng Wu
Samuel Humeau
Kurt Shuster
Marie-Anne Lachaux
The Situated Interactive Multi-Modal Conversations (SIMMC) 2.0 aims to create virtual shopping assistants that can accept complex multi-modal inputs, i.e. visual appearances of objects and user utterances. It consists of four subtasks: multi-modal disambiguation (MM-Disamb), multi-modal coreference resolution (MM-Coref), multi-modal dialog state tracking (MM-DST), and response retrieval and generation. While many task-oriented dialog systems usually tackle each subtask separately, we propose a jointly learned encoder-decoder that performs all four subtasks at once for efficiency. Moreover, we handle the multi-modality of the challenge by representing visual objects as special tokens whose joint embedding is learned via auxiliary tasks. This approach won the MM-Coref and response retrieval subtasks and was nominated runner-up for the remaining subtasks using a single unified model. In particular, our model achieved 81.5% MRR, 71.2% R@1, 95.0% R@5, 98.2% R@10, and 1.9 mean rank in the response retrieval task, setting a high bar for the state-of-the-art result in the SIMMC 2.0 track of the Dialog Systems Technology Challenge 10 (DSTC10).
Unifying Likelihood-free Inference with Black-box Sequence Design and Beyond
Dinghuai Zhang
Jie Fu
What Makes Machine Reading Comprehension Questions Difficult? Investigating Variation in Passage Sources and Question Types
Susan Bartlett
Grzegorz Kondrak
Max Bartolo
Alastair Roberts
Johannes Welbl
Steven Bird
Ewan Klein
Edward Loper
Samuel R. Bowman
George Dahl. 2021
What
Chao Pang
Junyuan Shang
Jiaxiang Liu
Xuyi Chen
Yanbin Zhao
Yuxiang Lu
Weixin Liu
Zhihua Wu
Weibao Gong
Jianzhong Liang
Zhizhou Shang
Peng Sun
Ouyang Xuan
Dianhai
Hao Tian
Hua Wu
Haifeng Wang
Adam Trischler
Tong Wang
Xingdi Yuan
Justin Har-908
Philip Bachman
Adina Williams
Nikita Nangia
Zhilin Yang
Peng Qi
Saizheng Zhang
For a natural language understanding benchmark to be useful in research, it has to consist of examples that are diverse and difficult enough to discriminate among current and near-future state-of-the-art systems. However, we do not yet know how best to select passages to collect a variety of challenging examples. In this study, we crowdsource multiple-choice reading comprehension questions for passages taken from seven qualitatively distinct sources, analyzing what attributes of passages contribute to the difficulty and question types of the collected examples. To our surprise, we find that passage source, length, and readability measures do not significantly affect question difficulty. Through our manual annotation of seven reasoning types, we observe several trends between passage sources and reasoning types, e.g., logical reasoning is more often required in questions written for technical passages. These results suggest that when creating a new benchmark dataset, selecting a diverse set of passages can help ensure a diverse range of question types, but that passage difficulty need not be a priority.
Machine Learning for Glacier Monitoring in the Hindu Kush Himalaya
Shimaa Baraka
Benjamin Akera
Bibek Aryal
Tenzing Chogyal Sherpa
Finu Shresta
Anthony Ortiz
Kris Sankaran
J. Ferres
M. Matin
Inductive biases for deep learning of higher-level cognition
Anirudh Goyal
A fascinating hypothesis is that human and animal intelligence could be explained by a few principles (rather than an encyclopaedic list of heuristics). If that hypothesis was correct, we could more easily both understand our own intelligence and build intelligent machines. Just like in physics, the principles themselves would not be sufficient to predict the behaviour of complex systems like brains, and substantial computation might be needed to simulate human-like intelligence. This hypothesis would suggest that studying the kind of inductive biases that humans and animals exploit could help both clarify these principles and provide inspiration for AI research and neuroscience theories. Deep learning already exploits several key inductive biases, and this work considers a larger list, focusing on those which concern mostly higher-level and sequential conscious processing. The objective of clarifying these particular principles is that they could potentially help us build AI systems benefiting from humans’ abilities in terms of flexible out-of-distribution and systematic generalization, which is currently an area where a large gap exists between state-of-the-art machine learning and human intelligence.