
Yoshua Bengio

Core Academic Member
Canada CIFAR AI Chair
Full Professor, Université de Montréal, Department of Computer Science and Operations Research
Founder and Scientific Advisor, Leadership Team
Research Topics
Medical Machine Learning
Representation Learning
Reinforcement Learning
Deep Learning
Causality
Generative Models
Probabilistic Models
Molecular Modeling
Computational Neuroscience
Reasoning
Graph Neural Networks
Recurrent Neural Networks
Machine Learning Theory
Natural Language Processing

Biography

*For media requests, please write to medias@mila.quebec.

For more information, contact Marie-Josée Beauchamp, Administrative Assistant, at marie-josee.beauchamp@mila.quebec.

Recognized worldwide as a leading expert in artificial intelligence, Yoshua Bengio is best known for his pioneering role in deep learning, which earned him the 2018 A. M. Turing Award, "the Nobel Prize of computing," alongside Geoffrey Hinton and Yann LeCun. He is a Full Professor at Université de Montréal, Founder and Scientific Advisor of Mila – Quebec AI Institute, and, as a Senior Fellow, co-directs the Learning in Machines & Brains program of the Canadian Institute for Advanced Research (CIFAR). He also serves as Special Advisor and Founding Scientific Director of IVADO.

In 2018, he was the computer scientist who collected the largest number of new citations worldwide. In 2019, he was awarded the prestigious Killam Prize. Since 2022, he has held the highest h-index in computer science worldwide. He is a Fellow of the Royal Society of London and of the Royal Society of Canada, and an Officer of the Order of Canada.

Concerned about the social impact of AI and the goal of ensuring that AI benefits everyone, he has actively contributed to the Montreal Declaration for the Responsible Development of Artificial Intelligence.

Current Students

Collaborating Alumni - McGill
Collaborating Alumni - UdeM
Research Collaborator - Cambridge University
Principal supervisor:
PhD - UdeM
Independent visiting researcher - KAIST
Independent visiting researcher
Co-supervisor:
PhD - UdeM
Research Collaborator - N/A
Principal supervisor:
PhD - UdeM
Research Collaborator - KAIST
Research Intern - UdeM
Co-supervisor:
PhD - UdeM
Co-supervisor:
PhD - UdeM
PhD - UdeM
Co-supervisor:
Research Intern - UdeM
PhD - UdeM
PhD - UdeM
Principal supervisor:
Collaborating Alumni - UdeM
Postdoctorate - UdeM
Principal supervisor:
Research Collaborator - UdeM
Collaborating Alumni - UdeM
Postdoctorate - UdeM
Principal supervisor:
Collaborating Alumni - UdeM
Collaborating Alumni - UdeM
Principal supervisor:
Collaborating Alumni
PhD - UdeM
Collaborating Alumni - UdeM
PhD - UdeM
Co-supervisor:
Research Collaborator - UdeM
PhD - UdeM
Principal supervisor:
PhD - UdeM
Principal supervisor:
Postdoctorate - UdeM
Principal supervisor:
Independent visiting researcher - UdeM
PhD - UdeM
Principal supervisor:
Research Collaborator - Ying Wu Coll of Computing
PhD - University of Waterloo
Principal supervisor:
Collaborating Alumni - Max-Planck-Institute for Intelligent Systems
Research Intern - UdeM
Co-supervisor:
PhD - UdeM
Postdoctorate - UdeM
Independent visiting researcher - UdeM
PhD - UdeM
Principal supervisor:
Collaborating Alumni - UdeM
Master's Research - UdeM
Collaborating Alumni - UdeM
Master's Research - UdeM
Independent visiting researcher - Technical University of Munich
PhD - UdeM
Co-supervisor:
Postdoctorate - UdeM
Co-supervisor:
PhD - UdeM
Principal supervisor:
Research Collaborator - UdeM
Research Collaborator
Research Collaborator - KAIST
PhD - McGill
Principal supervisor:
PhD - UdeM
Principal supervisor:
PhD - McGill
Principal supervisor:

Publications

Biasly: a machine learning based platform for automatic racial discrimination detection in online texts
Warning: this paper contains content that may be offensive or upsetting. Detecting hateful, toxic, and otherwise racist or sexist language in user-generated online contents has become an increasingly important task in recent years. Indeed, the anonymity, the transience, the size of messages, and the difficulty of management facilitate the diffusion of racist or hateful messages across the Internet. The critical influence of this cyber-racism is no longer limited to social media, but also has a significant effect on our society: corporate business operation, users' health, crimes, etc. Traditional racist speech reporting channels have proven inadequate due to the enormous explosion of information, so there is an urgent need for a method to automatically and promptly detect texts with racial discrimination. We propose in this work a machine learning-based approach to enable automatic detection of racist text content over the internet. State-of-the-art machine learning models that are able to grasp language structures are adapted in this study. Our main contributions include 1) a large-scale racial discrimination data set collected from three distinct sources and annotated according to a guideline developed by specialists, 2) a set of machine learning models with various architectures for racial discrimination detection, and 3) a web-browser-based software that assists users to debias their texts when using the internet. All these resources are made publicly available.
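
As a rough illustration of the adapted-language-model approach the abstract describes, here is a minimal sketch of a transformer classifier for flagging discriminatory text. The checkpoint name, binary label scheme, and scoring helper are illustrative assumptions, not the authors' released artifacts, and the model would first need fine-tuning on an annotated dataset such as the one the paper describes.

```python
# Minimal sketch, assuming a HuggingFace encoder checkpoint; fine-tune on an
# annotated discrimination dataset before the scores mean anything.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "distilbert-base-uncased"  # placeholder encoder, not the Biasly model

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

def score_texts(texts):
    """Return P(discriminatory) for each input text."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**batch).logits
    return logits.softmax(dim=-1)[:, 1]
```
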
Chunked Autoregressive GAN for Conditional Waveform Synthesis
Max Morrison
Rithesh Kumar
Kundan Kumar
Prem Seetharaman
Compositional Attention: Disentangling Search and Retrieval
Sarthak Mittal
Sharath Chandra Raparthy
Multi-head, key-value attention is the backbone of transformer-like model architectures which have proven to be widely successful in recent years. This attention mechanism uses multiple parallel key-value attention blocks (called heads), each performing two fundamental computations: (1) search - selection of a relevant entity from a set via query-key interaction, and (2) retrieval - extraction of relevant features from the selected entity via a value matrix. Standard attention heads learn a rigid mapping between search and retrieval. In this work, we first highlight how this static nature of the pairing can potentially: (a) lead to learning of redundant parameters in certain tasks, and (b) hinder generalization. To alleviate this problem, we propose a novel attention mechanism, called Compositional Attention, that replaces the standard head structure. The proposed mechanism disentangles search and retrieval and composes them in a dynamic, flexible and context-dependent manner. Through a series of numerical experiments, we show that it outperforms standard multi-head attention on a variety of tasks, including some out-of-distribution settings. Through our qualitative analysis, we demonstrate that Compositional Attention leads to dynamic specialization based on the type of retrieval needed. Our proposed mechanism generalizes multi-head attention, allows independent scaling of search and retrieval and is easy to implement in a variety of established network architectures.
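
A minimal PyTorch sketch of the disentangled search/retrieval idea described above, assuming S search heads and R retrieval heads composed per token by a learned soft selection; the dimensions and the selection parameterization are simplified relative to the paper.

```python
import math
import torch
import torch.nn as nn

class CompositionalAttention(nn.Module):
    """Sketch: S search heads (query-key attention) and R retrieval heads
    (value projections) are learned independently, then composed by a soft,
    context-dependent selection instead of a rigid head pairing."""
    def __init__(self, dim, n_search=4, n_retrieval=4, d_head=32, d_sel=16):
        super().__init__()
        self.S, self.R, self.d_head, self.d_sel = n_search, n_retrieval, d_head, d_sel
        self.q = nn.Linear(dim, n_search * d_head)
        self.k = nn.Linear(dim, n_search * d_head)
        self.v = nn.Linear(dim, n_retrieval * d_head)
        self.sel_q = nn.Linear(dim, n_search * d_sel)   # selection query per search head
        self.sel_k = nn.Linear(d_head, d_sel)           # selection key per retrieval
        self.out = nn.Linear(n_search * d_head, dim)

    def forward(self, x):                                # x: (B, T, dim)
        B, T, _ = x.shape
        q = self.q(x).view(B, T, self.S, self.d_head).transpose(1, 2)   # (B,S,T,d)
        k = self.k(x).view(B, T, self.S, self.d_head).transpose(1, 2)
        v = self.v(x).view(B, T, self.R, self.d_head).transpose(1, 2)   # (B,R,T,d)
        attn = ((q @ k.transpose(-1, -2)) / math.sqrt(self.d_head)).softmax(dim=-1)
        # every (search, retrieval) pairing: (B, S, R, T, d)
        retrieved = torch.einsum("bstu,brud->bsrtd", attn, v)
        # soft selection of one retrieval per search head and position
        sel_q = self.sel_q(x).view(B, T, self.S, self.d_sel).transpose(1, 2)
        sel_k = self.sel_k(retrieved)
        sel = torch.einsum("bstd,bsrtd->bsrt", sel_q, sel_k) / math.sqrt(self.d_sel)
        sel = sel.softmax(dim=2)                         # normalize over retrievals
        out = (retrieved * sel.unsqueeze(-1)).sum(dim=2) # (B,S,T,d)
        return self.out(out.transpose(1, 2).reshape(B, T, self.S * self.d_head))
```

Sanity check: `CompositionalAttention(dim=64)(torch.randn(2, 10, 64))` returns a `(2, 10, 64)` tensor; search and retrieval head counts can be scaled independently.
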
Contrastive introspection (ConSpec) to rapidly identify invariant prototypes for success in RL
Chen Sun
Wannan Yang
Benjamin Alsbury-Nealy
Thomas Jiralerspong
Blake Richards
Reinforcement learning (RL) algorithms have achieved notable success in recent years, but still struggle with fundamental issues in long-term credit assignment. It remains difficult to learn in situations where success is contingent upon multiple critical steps that are distant in time from each other and from a sparse reward, as is often the case in real life. Moreover, how RL algorithms assign credit in these difficult situations is typically not coded in a way that can rapidly generalize to new situations. Here, we present an approach using offline contrastive learning, which we call contrastive introspection (ConSpec), that can be added to any existing RL algorithm and addresses both issues. In ConSpec, a contrastive loss is used during offline replay to identify invariances among successful episodes. This takes advantage of the fact that it is easier to retrospectively identify the small set of steps that success is contingent upon than it is to prospectively predict reward at every step taken in the environment. ConSpec stores this knowledge in a collection of prototypes summarizing the intermediate states required for success. During training, arrival at any state that matches these prototypes generates an intrinsic reward that is added to any external rewards. As well, the reward shaping provided by ConSpec can be made to preserve the optimal policy of the underlying RL agent. The prototypes in ConSpec provide two key benefits for credit assignment: (1) They enable rapid identification of all the critical states. (2) They do so in a readily interpretable manner, enabling out of distribution generalization when sensory features are altered. In summary, ConSpec is a modular system that can be added to any existing RL algorithm to improve its long-term credit assignment.
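
A hedged sketch of the two ingredients the abstract describes: a contrastive objective over offline replay that fits prototypes to successful episodes but not failures, and an intrinsic reward for reaching a state that matches a prototype. The cosine similarities, hinge forms, and threshold below are assumptions, not the authors' exact losses.

```python
import torch
import torch.nn.functional as F

n_protos, d = 8, 64  # illustrative sizes; states assumed already encoded to R^d
prototypes = torch.nn.Parameter(torch.randn(n_protos, d))

def proto_scores(states):
    """states: (T, d) encoded states -> (n_protos,) best match over time."""
    sims = F.cosine_similarity(states.unsqueeze(1), prototypes.unsqueeze(0), dim=-1)
    return sims.max(dim=0).values

def conspec_loss(success_eps, failure_eps):
    """Contrastive replay objective: each prototype should be strongly matched
    somewhere in successful episodes (score -> 1) and nowhere in failures."""
    pos = torch.stack([proto_scores(ep) for ep in success_eps])
    neg = torch.stack([proto_scores(ep) for ep in failure_eps])
    return (1.0 - pos).mean() + F.relu(neg).mean()

def intrinsic_reward(state, threshold=0.9, scale=1.0):
    """Bonus added to the external reward when a state matches a prototype."""
    sims = F.cosine_similarity(state.unsqueeze(0), prototypes, dim=-1)
    return scale * (sims > threshold).float().sum()
```
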
Discrete Compositional Representations as an Abstraction for Goal Conditioned Reinforcement Learning
Riashat Islam
Hongyu Zang
Anirudh Goyal
Alex Lamb
Kenji Kawaguchi
Xin Li
Romain Laroche
Remi Tachet des Combes
Goal-conditioned reinforcement learning (RL) is a promising direction for training agents that are capable of solving multiple tasks and reaching a diverse set of objectives. How to specify and ground these goals in such a way that we can both reliably reach goals during training as well as generalize to new goals during evaluation remains an open area of research. Defining goals in the space of noisy, high-dimensional sensory inputs is one possibility, yet this poses a challenge for training goal-conditioned agents, or even for generalization to novel goals. We propose to address this by learning compositional representations of goals and processing the resulting representation via a discretization bottleneck, for coarser specification of goals, through an approach we call DGRL. We show that discretizing outputs from goal encoders through a bottleneck can work well in goal-conditioned RL setups, by experimentally evaluating this method on tasks ranging from maze environments to complex robotic navigation and manipulation tasks. Additionally, we show a theoretical result which bounds the expected return for goals not observed during training, while still allowing for specifying goals with expressive combinatorial structure.
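
A small sketch of one way to realize the discretization bottleneck described above: the goal embedding is split into factors, and each factor is snapped to its nearest codebook entry with straight-through gradients. The factored codebook layout and sizes are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class DiscretizationBottleneck(nn.Module):
    """Sketch of a DGRL-style bottleneck: quantize each factor of the goal
    embedding to its nearest codebook vector (straight-through estimator)."""
    def __init__(self, dim=64, n_factors=4, codebook_size=32):
        super().__init__()
        assert dim % n_factors == 0
        self.n_factors, self.d_f = n_factors, dim // n_factors
        self.codebook = nn.Embedding(codebook_size, self.d_f)

    def forward(self, goal_emb):                               # (B, dim)
        B = goal_emb.shape[0]
        z = goal_emb.view(B, self.n_factors, self.d_f)         # factored goal
        dists = torch.cdist(z, self.codebook.weight.unsqueeze(0).expand(B, -1, -1))
        codes = dists.argmin(dim=-1)                           # (B, n_factors)
        zq = self.codebook(codes)                              # quantized factors
        zq = z + (zq - z).detach()                             # straight-through grads
        return zq.view(B, -1), codes
```

The returned `codes` give the coarse, combinatorial goal specification; `zq` is what the goal-conditioned policy would consume.
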
Discrete-Valued Neural Communication in Structured Architectures Enhances Generalization
Dianbo Liu
Alex Lamb
Kenji Kawaguchi
Anirudh Goyal
Chen Sun
Michael Curtis Mozer
In this appendix, as a complement to Theorems 1–2, we provide additional theorems, Theorems 3–4, which further illustrate the two advantages of the discretization process by considering an abstract model with the discretization bottleneck. For the advantage on sensitivity, the error due to potential noise and perturbation without discretization — the third term ξ(w, r′, M′, d) > 0 in Theorem 4 — is shown to be minimized to zero with discretization in Theorem 3. For the second advantage, the underlying dimensionality of N_{(M′,d′)}(r, H) + ln(N_{(M,d)}(r, Θ)/δ) without discretization (in the bound of Theorem 4) is proven to be reduced, with discretization in Theorem 3, to the typically much smaller underlying dimensionality of L + ln(N_{(M,d)}(r, E × Θ)/δ). Here, for any metric space (M, d) and subset M′′ ⊆ M, the r-covering number of M′′ is defined by N_{(M,d)}(r, M′′) = min{|C| : C ⊆ M′′, M′′ ⊆ ∪_{c∈C} B_{(M,d)}[c, r]}, where the (closed) ball of radius r centered at c is denoted by B_{(M,d)}[c, r] = {x ∈ M : d(x, c) ≤ r}. See Appendix C.1 for a simple comparison between the bound of Theorem 3 and that of Theorem 4 when the metric spaces (M, d) and (M′, d′) are chosen to be Euclidean spaces.
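
The covering-number definition quoted above is easier to read typeset; the following restates it, with no new content:

```latex
% r-covering number of a subset M'' of a metric space (M, d), and the closed ball.
\[
N_{(\mathcal{M},d)}(r, M)
  \;=\; \min\Bigl\{\, |C| \;:\; C \subseteq M,\;
        M \subseteq \bigcup_{c \in C} B_{(\mathcal{M},d)}[c, r] \,\Bigr\},
\qquad
B_{(\mathcal{M},d)}[c, r] \;=\; \{\, x \in \mathcal{M} : d(x, c) \le r \,\}.
\]
```
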
Enhanced Biomedical Knowledge Discovery From Unstructured Text Using Contextual Embeddings
Extracting knowledge from large, unstructured text corpora presents a challenge. Recently, authors have utilized unsupervised, static word embeddings to uncover "latent knowledge" contained within domain-specific scientific corpora. Here semantic-similarity measures between representations of concepts, objects or entities were used to predict relationships, which were later verified using physical methods. Static language models have recently been surpassed at most downstream tasks by massively pre-trained, contextual language models like BERT. Some have postulated that contextualized embeddings potentially yield word representations superior to static ones for knowledge-discovery purposes. In an effort to address this question, two biomedically-trained BERT models (BioBERT, SciBERT) were used to encode n = 500, 1000 or 5000 sentences containing words of interest extracted from a biomedical corpus (Coronavirus Open Research Dataset). The n representations for the words of interest were subsequently extracted and then aggregated to yield static-equivalent word representations. These words belonged to the vocabularies of intrinsic benchmarking tools for the biomedical domain (Bio-SimVerb and Bio-SimLex), which assess the quality of word representations using semantic-similarity and relatedness measures. Using intrinsic benchmarking tasks, the feasibility of using contextualized word representations for knowledge-discovery tasks can be assessed: word representations that better encode described reality are expected to perform better (i.e., closer to domain experts). As postulated, BERT embeddings outperform their static counterparts.
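
A sketch of the aggregation pipeline the abstract describes: encode sentences containing a target word with a biomedical BERT, pool the target's contextual vectors, then average across occurrences to get a static-equivalent embedding. The checkpoint name and the naive subword matching below are assumptions, not the paper's exact preprocessing.

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "dmis-lab/biobert-base-cased-v1.1"  # assumed BioBERT-style checkpoint
tok = AutoTokenizer.from_pretrained(MODEL)
bert = AutoModel.from_pretrained(MODEL)

def static_equivalent(word, sentences):
    """Mean-pool the target word's contextual vectors over many sentences."""
    vecs = []
    for sent in sentences:
        enc = tok(sent, return_tensors="pt", truncation=True)
        with torch.no_grad():
            hidden = bert(**enc).last_hidden_state[0]        # (T, d)
        word_ids = tok(word, add_special_tokens=False)["input_ids"]
        ids = enc["input_ids"][0].tolist()
        # collect hidden states of subword spans matching the target word
        for i in range(len(ids) - len(word_ids) + 1):
            if ids[i:i + len(word_ids)] == word_ids:
                vecs.append(hidden[i:i + len(word_ids)].mean(dim=0))
    return torch.stack(vecs).mean(dim=0)  # raises if the word never occurs
```
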
Extended Abstract Track
Amin Mansouri
Jason Hartford
Kartik Ahuja
Christian Shewmake
Simone Azeglio
Arianna Di Bernardo
Nina Miolane
There has been significant recent progress in causal representation learning that has shown a variety of settings in which we can disentangle latent variables with identifiability guarantees (up to some reasonable equivalence class). Common to all of these approaches is the assumption that (1) the latent variables are d-dimensional vectors, and (2) that the observations are the output of some injective observation function of these latent variables. While these assumptions appear benign — they amount to assuming that any changes in the latent space are reflected in the observation space, and that we can use standard encoders to infer the latent variables — we show that when the observations are of multiple objects, the observation function is no longer injective, and disentanglement fails in practice. We can address this failure by combining recent developments in object-centric learning and causal representation learning. By modifying the Slot Attention architecture (Locatello et al., 2020b), we develop an object-centric architecture that leverages weak supervision from sparse perturbations to disentangle each object's properties. We argue that this approach is more data-efficient in the sense that it requires significantly fewer perturbations than a comparable approach that encodes to a Euclidean space, and we show that this approach successfully disentangles the properties of a set of objects in a series of simple image-based disentanglement experiments.
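
One way to picture the weak supervision from sparse perturbations described above is a loss that lets only a few slots change between an observation and its perturbed counterpart. The greedy matching and penalty below are illustrative simplifications, not the authors' Slot Attention modification.

```python
import torch

def sparse_change_loss(slots_a, slots_b, k=1):
    """slots_*: (n_slots, d) slot embeddings before/after a sparse perturbation.
    Penalize change everywhere except the k most-changed matched slots."""
    cost = torch.cdist(slots_a, slots_b)            # (n, n) pairwise distances
    match = cost.argmin(dim=1)                      # greedy slot correspondence
    per_slot = (slots_a - slots_b[match]).pow(2).sum(dim=-1)
    unchanged = per_slot.sort().values[:-k] if k > 0 else per_slot
    return unchanged.sum()  # all but k slots should stay fixed
```
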
S5 Framework: A Review of Self-Supervised Shared Semantic Space Optimization for Multimodal Zero-Shot Learning
In this review, we aim to inspire research into Self-Supervised Shared Semantic Space (S5) multimodal learning problems. We equip non-expert researchers with a framework of informed modeling decisions via an extensive literature review, an actionable modeling checklist, as well as a series of novel zero-shot evaluation tasks. The core idea for our S5 checklist lies in learning contextual multimodal interactions at various granularity levels via a shared Transformer encoder with a denoising loss term, which is also regularized by a contrastive loss term to induce a semantic alignment prior on the contextual embedding space. Essentially, we aim to model human concept understanding and thus learn to "put a name to a face". This ultimately enables interpretable zero-shot S5 generalization on a variety of novel downstream tasks. In summary, this review provides sufficient background and actionable strategies for training cutting-edge S5 multimodal networks.
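
A compact sketch of the two loss terms the checklist centers on: a denoising reconstruction term on the shared encoder's outputs plus a symmetric contrastive term aligning the modalities in the shared space. Shapes and the temperature value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def s5_losses(text_emb, image_emb, recon, target, temperature=0.07):
    """text_emb, image_emb: (B, d) pooled modality embeddings from the shared
    encoder. recon, target: (B, T, d) decoder outputs vs. masked-token targets."""
    denoise = F.mse_loss(recon, target)                  # denoising term
    t = F.normalize(text_emb, dim=-1)
    v = F.normalize(image_emb, dim=-1)
    logits = t @ v.T / temperature                       # (B, B) similarities
    labels = torch.arange(t.shape[0])
    contrast = (F.cross_entropy(logits, labels) +
                F.cross_entropy(logits.T, labels)) / 2   # symmetric InfoNCE
    return denoise + contrast                            # alignment prior + denoising
```
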
Harvesting Mature Relation Extraction Models from Limited Seed Knowledge: A Self-Development Framework for DS Rule Expansion
Distantly-supervised relation extraction (DSRE) is an effective method to scale relation extraction (RE) to large unlabeled corpora with the utilization of knowledge bases (KBs), but suffers from the scale of KBs and the introduced noise. To alleviate these two problems, we propose a novel framework called Self-develOpment rUle exPansion (SOUP), which starts from a limited amount of labeled data and continuously produces low-noise labels on large-scale unlabeled data via a growing, learnable set of logical rules. Specifically, SOUP achieves a mutual enhancement of the RE model and the logical rules set: first, an RE model is trained on the labeled data to summarize the knowledge; then the knowledge is utilized to explore candidate rules from unlabeled data; finally, high-quality candidates are selected in a graph-based ranking manner to extend the logical rules set, and new rule-labeled data are provided for better RE model training. Experiments on the wiki20 dataset demonstrate that, with limited seed knowledge from small-scale manually labeled data, SOUP achieves significant improvement compared to baselines by producing continuous growth of both the logical rules and the RE model, and that the labeling noise of SOUP is much less than that of DS. Furthermore, an RE model enhanced by SOUP with 1.6k logical rules learned from prior knowledge can match the performance of a model trained on data labeled in the DS manner by 72k relational facts of KBs.
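
The mutual-enhancement loop described above can be summarized in a skeleton like the following. Every injected helper (train_re_model, mine_candidate_rules, rank_rules, apply_rules) is a hypothetical stand-in for a SOUP component, not released code; only the control flow comes from the abstract.

```python
def soup_loop(labeled_seed, unlabeled, train_re_model, mine_candidate_rules,
              rank_rules, apply_rules, n_rounds=5, top_k=100):
    """Train -> mine candidate rules -> rank and extend rule set -> relabel."""
    rules, data, model = set(), list(labeled_seed), None
    for _ in range(n_rounds):
        model = train_re_model(data)                     # summarize the knowledge
        candidates = mine_candidate_rules(model, unlabeled)
        rules |= set(rank_rules(candidates, rules)[:top_k])  # graph-based ranking
        data = list(labeled_seed) + apply_rules(rules, unlabeled)  # low-noise labels
    return model, rules
```
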
Is a Modular Architecture Enough?
Inspired by human cognition, machine learning systems are gradually revealing advantages of sparser and more modular architectures. Recent work demonstrates that not only do some modular architectures generalize well, but they also lead to better out-of-distribution generalization, scaling properties, learning speed, and interpretability. A key intuition behind the success of such systems is that the data-generating process for most real-world settings is considered to consist of sparse modular connections, and endowing models with similar inductive biases will be helpful. However, the field has been lacking a rigorous quantitative assessment of such systems because these real-world data distributions are complex and unknown. In this work, we provide a thorough assessment of common modular architectures, through the lens of simple and known modular data distributions. We highlight the benefits of modularity and sparsity and reveal insights on the challenges faced while optimizing modular systems. In doing so, we propose evaluation metrics that highlight the benefits of modularity, the regimes in which these benefits are substantial, as well as the sub-optimality of current end-to-end learned modular systems as opposed to their claimed potential.
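
As a concrete instance of the modular systems under study, here is a minimal soft-routing modular layer (a small mixture-of-experts); the module count and widths are illustrative, not the paper's exact configurations.

```python
import torch
import torch.nn as nn

class ModularLayer(nn.Module):
    """Minimal modular layer: a set of expert MLP modules composed per input
    by a learned soft router, the basic pattern the paper evaluates."""
    def __init__(self, dim=32, n_modules=4, hidden=64):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
            for _ in range(n_modules))
        self.router = nn.Linear(dim, n_modules)

    def forward(self, x):                                   # x: (B, dim)
        weights = self.router(x).softmax(dim=-1)            # (B, n_modules)
        outs = torch.stack([m(x) for m in self.experts], dim=1)  # (B, M, dim)
        return (weights.unsqueeze(-1) * outs).sum(dim=1)    # routed combination
```

Making the routing sparse (e.g., top-1 selection instead of the softmax mixture) recovers the harder-to-optimize regime the abstract alludes to.
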