Yoshua Bengio

Core Academic Member
Canada CIFAR AI Chair
Full Professor, Université de Montréal, Department of Computer Science and Operations Research
Founder and Scientific Advisor, Leadership Team
Research Topics
Medical Machine Learning
Representation Learning
Reinforcement Learning
Deep Learning
Causality
Generative Models
Probabilistic Models
Molecular Modeling
Computational Neuroscience
Reasoning
Graph Neural Networks
Recurrent Neural Networks
Machine Learning Theory
Natural Language Processing

Biography

*For media requests, please write to medias@mila.quebec.

For more information, contact Cassidy MacNeil, Senior Assistant and Operations Lead, at cassidy.macneil@mila.quebec.

Recognized worldwide as a leading expert in artificial intelligence, Yoshua Bengio is best known for his pioneering work in deep learning, which earned him the 2018 A.M. Turing Award, often called the "Nobel Prize of computing", together with Geoffrey Hinton and Yann LeCun. He is a Full Professor at Université de Montréal, Founder and Scientific Advisor of Mila (Quebec Artificial Intelligence Institute), and, as a Senior Fellow, co-directs the Learning in Machines & Brains program of the Canadian Institute for Advanced Research (CIFAR). He also serves as Special Advisor and Founding Scientific Director of IVADO.

In 2018, he was the computer scientist who collected the largest number of new citations worldwide. In 2019, he was awarded the prestigious Killam Prize. Since 2022, he has held the highest h-index of any computer scientist in the world. He is a Fellow of both the Royal Society of London and the Royal Society of Canada, and an Officer of the Order of Canada.

Concerned about the social impact of AI and committed to the goal that AI benefit everyone, he contributed actively to the Montreal Declaration for the Responsible Development of Artificial Intelligence.

Publications

Contrastive introspection (ConSpec) to rapidly identify invariant prototypes for success in RL
Chen Sun
Wannan Yang
Blake Richards
Reinforcement learning (RL) algorithms have achieved notable success in recent years, but still struggle with fundamental issues in long-term credit assignment. It remains difficult to learn in situations where success is contingent upon multiple critical steps that are distant in time from each other and from a sparse reward, as is often the case in real life. Moreover, how RL algorithms assign credit in these difficult situations is typically not coded in a way that can rapidly generalize to new situations. Here, we present an approach using offline contrastive learning, which we call contrastive introspection (ConSpec), that can be added to any existing RL algorithm and addresses both issues. In ConSpec, a contrastive loss is used during offline replay to identify invariances among successful episodes. This takes advantage of the fact that it is easier to retrospectively identify the small set of steps that success is contingent upon than it is to prospectively predict reward at every step taken in the environment. ConSpec stores this knowledge in a collection of prototypes summarizing the intermediate states required for success. During training, arrival at any state that matches these prototypes generates an intrinsic reward that is added to any external rewards. In addition, the reward shaping provided by ConSpec can be made to preserve the optimal policy of the underlying RL agent. The prototypes in ConSpec provide two key benefits for credit assignment: (1) They enable rapid identification of all the critical states. (2) They do so in a readily interpretable manner, enabling out-of-distribution generalization when sensory features are altered. In summary, ConSpec is a modular system that can be added to any existing RL algorithm to improve its long-term credit assignment.
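The prototype-matching step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the use of cosine similarity, the `threshold` and `scale` parameters, and the toy embeddings are all assumptions made for illustration, and the sketch omits both the contrastive training of the prototypes and the policy-preserving form of reward shaping mentioned in the abstract.

```python
import numpy as np

def cosine_similarity(states, prototypes):
    """Pairwise cosine similarity between state embeddings (T, d) and prototypes (K, d)."""
    s = states / np.linalg.norm(states, axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    return s @ p.T  # shape (T, K)

def intrinsic_rewards(states, prototypes, threshold=0.9, scale=1.0):
    """Return a bonus for every step whose embedding matches some prototype.

    `threshold` and `scale` are hypothetical hyperparameters of this sketch.
    """
    sim = cosine_similarity(states, prototypes)
    best = sim.max(axis=1)  # similarity to the best-matching prototype per step
    return scale * np.where(best >= threshold, best, 0.0)

# Toy episode: 4 steps with 3-dimensional embeddings and 2 stored prototypes.
states = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.7, 0.7, 0.0],
                   [0.0, 0.0, 1.0]])
prototypes = np.array([[1.0, 0.0, 0.0],
                       [0.0, 0.0, 2.0]])
external = np.array([0.0, 0.0, 0.0, 1.0])  # sparse external reward at the last step

# Intrinsic reward is added to the external reward, as the abstract describes.
total = external + intrinsic_rewards(states, prototypes)
print(total)
```

In this toy episode only the first and last steps match a prototype, so only those steps receive a bonus on top of the external reward.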
Discrete Factorial Representations as an Abstraction for Goal Conditioned RL
Hongyu Zang
Xin Li
Romain Laroche
Remi Tachet des Combes
Discrete-Valued Neural Communication in Structured Architectures Enhances Generalization
Dianbo Liu
Chen Sun
Michael C. Mozer
Deep learning has advanced from fully connected architectures to structured models organized into components, e.g., the transformer composed of positional elements, modular architectures divided into slots, and graph neural nets made up of nodes. In structured models, an interesting question is how to conduct dynamic and possibly sparse communication among the separate components. Here, we explore the hypothesis that restricting the transmitted information among components to discrete representations is a beneficial bottleneck. The motivating intuition is human language in which communication occurs through discrete symbols. Even though individuals have different understandings of what a "cat" is based on their specific experiences, the shared discrete token makes it possible for communication among individuals to be unimpeded by individual differences in internal representation. To discretize the values of concepts dynamically communicated among specialist components, we extend the quantization mechanism from the Vector-Quantized Variational Autoencoder to multi-headed discretization with shared codebooks and use it for discrete-valued neural communication (DVNC). Our experiments show that DVNC substantially improves systematic generalization in a variety of architectures: transformers, modular architectures, and graph neural networks. We also show that the DVNC is robust to the choice of hyperparameters, making the method very useful in practice. Moreover, we establish a theoretical justification of our discretization process, proving that it has the ability to increase noise robustness and reduce the underlying dimensionality of the model.
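As a rough illustration of multi-headed discretization with a shared codebook, the sketch below splits a vector into equal segments and snaps each segment to its nearest codebook entry. The function name, the squared Euclidean distance, and the toy codebook are assumptions for this example. Training such a layer would also need gradient handling of the kind used in VQ-VAE, which is left out here.

```python
import numpy as np

def multi_head_quantize(x, codebook, num_heads):
    """Quantize a vector x of shape (d,) with one codebook shared by all heads.

    x is split into `num_heads` equal segments; each segment is replaced by
    its nearest row of `codebook`, which has shape (K, d // num_heads).
    """
    d = x.shape[0]
    assert d % num_heads == 0, "d must be divisible by num_heads"
    segments = x.reshape(num_heads, d // num_heads)
    # Squared Euclidean distance from every segment to every code: (num_heads, K)
    dists = ((segments[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)  # one discrete code index per head
    quantized = codebook[indices].reshape(d)
    return quantized, indices

# Hypothetical shared codebook with K = 3 codes of dimension 2.
codebook = np.array([[0.0, 0.0],
                     [1.0, 1.0],
                     [-1.0, 1.0]])
x = np.array([0.9, 1.2, -0.1, 0.2])  # message sent between two components

q, idx = multi_head_quantize(x, codebook, num_heads=2)
print(idx, q)
```

With two heads and three codes, the message can take at most 3 × 3 = 9 distinct values, which is the kind of discrete bottleneck the abstract refers to.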
Enhanced Biomedical Knowledge Discovery From Unstructured Text Using Contextual Embeddings
Iz Beltagy
Kyle Lo
Arman Cohan
Réjean Ducharme
P Vincent
Rishi Bommasani
Kelly Davis
Claire Cardie
Billy Chiu
Sampo Pyysalo
Ivan Vulić
Extracting knowledge from large, unstructured text corpora presents a challenge. Recently, authors have utilized unsupervised, static word embeddings to uncover "latent knowledge" contained within domain-specific scientific corpora. Here semantic-similarity measures between representations of concepts, objects or entities were used to predict relationships, which were later verified using physical methods. Static language models have recently been surpassed at most downstream tasks by massively pre-trained, contextual language models like BERT. Some have postulated that contextualized embeddings potentially yield word representations superior to static ones for knowledge-discovery purposes. In an effort to address this question, two biomedically-trained BERT models (BioBERT, SciBERT) were used to encode n = 500, 1000 or 5000 sentences containing words of interest extracted from a biomedical corpus (Coronavirus Open Research Dataset). The n representations for the words of interest were subsequently extracted and then aggregated to yield static-equivalent word representations. These words belonged to the vocabularies of intrinsic benchmarking tools for the biomedical domain (Bio-SimVerb and Bio-SimLex), which assess quality of word representations using semantic-similarity and relatedness measures. Using intrinsic benchmarking tasks, feasibility of using contextualized word representations for knowledge discovery tasks can be assessed: Word representations that better encode described reality are expected to perform better (i.e. closer to domain experts). As postulated, BERT embeddings outperform static counterparts
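The aggregation and intrinsic-evaluation steps in this abstract can be sketched as follows. The abstract does not say how the n contextual vectors are aggregated or which correlation statistic is used, so mean pooling and Spearman correlation are assumptions here. The "contextual" vectors are random stand-ins rather than real BioBERT or SciBERT outputs, and the human scores are invented for the example.

```python
import numpy as np

def aggregate(contextual_vectors):
    """Mean-pool n contextual vectors (n, d) into one static-equivalent vector (d,)."""
    return np.asarray(contextual_vectors).mean(axis=0)

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def ranks(values):
    """Rank of each value (0 = smallest); this sketch assumes no ties."""
    return np.argsort(np.argsort(values))

def spearman(a, b):
    """Spearman correlation computed as the Pearson correlation of the ranks."""
    return float(np.corrcoef(ranks(a), ranks(b))[0, 1])

# Stand-in data: n = 500 noisy "contextual" vectors per word (hypothetical).
rng = np.random.default_rng(0)
base = {"virus": np.array([1.0, 0.0, 0.0]),
        "pathogen": np.array([0.9, 0.1, 0.0]),
        "protein": np.array([0.0, 1.0, 0.0])}
static = {w: aggregate(v + 0.01 * rng.normal(size=(500, 3)))
          for w, v in base.items()}

# Compare model similarities with hypothetical human similarity ratings.
pairs = [("virus", "pathogen"), ("virus", "protein"), ("pathogen", "protein")]
model_scores = [cosine(static[a], static[b]) for a, b in pairs]
human_scores = [9.0, 2.0, 3.0]
rho = spearman(model_scores, human_scores)
print(rho)
```

A higher correlation with the human ratings would indicate word representations that sit closer to expert judgment, which is the criterion the abstract describes.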
Extended Abstract Track
Jason Hartford
Christian Shewmake
Simone Azeglio
Arianna Di Bernardo
Nina Miolane
S5 Framework: A Review of Self-Supervised Shared Semantic Space Optimization for Multimodal Zero-Shot Learning
In this review, we aim to inspire research into Self-Supervised Shared Semantic Space (S5) multimodal learning problems. We equip non-expert researchers with a framework of informed modeling decisions via an extensive literature review, an actionable modeling checklist, as well as a series of novel zero-shot evaluation tasks. The core idea for our S5 checklist lies in learning contextual multimodal interactions at various granularity levels via a shared Transformer encoder with a denoising loss term, which is also regularized by a contrastive loss term to induce a semantic alignment prior on the contextual embedding space. Essentially, we aim to model human concept understanding and thus learn to “put a name to a face”. This ultimately enables interpretable zero-shot S5 generalization on a variety of novel downstream tasks. In summary, this review provides sufficient background and actionable strategies for training cutting-edge S5 multimodal networks.
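To make the contrastive regularizer concrete, here is a minimal sketch of a symmetric InfoNCE-style loss between paired image and text embeddings. The abstract only mentions "a contrastive loss term", so this particular form, the `temperature` value, and the function names are assumptions rather than the formulation used in the review.

```python
import numpy as np

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def contrastive_loss(image_emb, text_emb, temperature=0.1):
    """Symmetric InfoNCE-style loss; row i of each input is a matching pair."""
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (N, N) scaled cosine similarities
    diag = np.arange(len(logits))
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()  # text -> image
    return float((loss_i2t + loss_t2i) / 2)

# Toy check: perfectly aligned pairs versus deliberately mismatched pairs.
aligned = np.eye(3)
loss_aligned = contrastive_loss(aligned, aligned)
loss_shuffled = contrastive_loss(aligned, aligned[[1, 2, 0]])
print(loss_aligned, loss_shuffled)
```

The loss is small when each image embedding is closest to its own caption and large when pairs are mismatched, which is how such a term pulls the two modalities into a shared embedding space.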
Graph-Based Active Machine Learning Method for Diverse and Novel Antimicrobial Peptides Generation and Selection
Bonaventure F. P. Dossou
Dianbo Liu
Almer M. van der Sloot
Roger Palou
Michael Tyers
As antibiotic-resistant bacterial strains are rapidly spreading worldwide, infections caused by these strains are emerging as a global crisis causing the death of millions of people every year. Antimicrobial Peptides (AMPs) are one of the candidates to tackle this problem because of their potential diversity and their ability to favorably modulate the host immune response. However, large-scale screening of new AMP candidates is expensive, time-consuming, and not affordable in developing countries, which need the treatments the most. In this work, we propose a novel active machine learning-based framework that statistically minimizes the number of wet-lab experiments needed to design new AMPs, while ensuring high diversity and novelty of the generated AMP sequences across multiple rounds of wet-lab AMP screening. Combining recurrent neural network models and a graph-based filter (GraphCC), our proposed approach delivers novel and diverse candidates and demonstrates better performance according to our defined metrics.
Harvesting Mature Relation Extraction Models from Limited Seed Knowledge: A Self-Development Framework for DS Rule Expansion
Raphael Hoffmann
Congle Zhang
Xiao Ling
Yankai Lin
Shiqi Shen
Zhiyuan Liu
Huanbo Luan
Christopher D Manning
M. Surdeanu
John Bauer
Adriana Romero
Pietro Liò
Xuanhui Wang
Cheng Li
Nadav Golbandi
Bendersky
Marc Najork
Wentao Wu
Hongsong Li
Haixun Wang
Distantly-supervised relation extraction (DSRE) is an effective method to scale relation extraction (RE) to large unlabeled corpora with the utilization of knowledge bases (KBs), but suffers from the scale of KBs and the introduced noise. To alleviate the above two problems, we propose a novel framework called Self-development rule expansion (SOUP), which starts from a limited amount of labeled data and continuously produces low-noise labels on large-scale unlabeled data by means of a growing set of learnable logical rules. Specifically, SOUP achieves a mutual enhancement of the RE model and the logical rules set: first, an RE model is trained on the labeled data to summarize the knowledge; then the knowledge is utilized to explore candidate rules from unlabeled data; finally, high-quality candidates are selected in a graph-based ranking manner to extend the logical rules set, and new rule-labeled data are provided for better RE model training. Experiments on the wiki20 dataset demonstrate that, with limited seed knowledge from small-scale manually labeled data, SOUP achieves significant improvement compared to baselines by producing continuous growth of both logical rules and the RE model, and that the labeling noise of SOUP is much less than that of DS. Furthermore, an RE model enhanced by SOUP with 1.6k logical rules learned from prior knowledge could produce performance equivalent to that of the model trained on data labeled in the DS manner by 72k relational facts of KBs.
Is a Modular Architecture Enough?
Inspired by human cognition, machine learning systems are gradually revealing advantages of sparser and more modular architectures. Recent work demonstrates that not only do some modular architectures generalize well, but they also lead to better out-of-distribution generalization, scaling properties, learning speed, and interpretability. A key intuition behind the success of such systems is that the data generating system for most real-world settings is considered to consist of sparsely interacting parts, and endowing models with similar inductive biases will be helpful. However, the field has been lacking in a rigorous quantitative assessment of such systems because these real-world data distributions are complex and unknown. In this work, we provide a thorough assessment of common modular architectures, through the lens of simple and known modular data distributions. We highlight the benefits of modularity and sparsity and reveal insights on the challenges faced while optimizing modular systems. In doing so, we propose evaluation metrics that highlight the benefits of modularity, the regimes in which these benefits are substantial, as well as the sub-optimality of current end-to-end learned modular systems as opposed to their claimed potential.
Neural Attentive Circuits
Nasim Rahaman
Francesco Locatello
Bernhard Schölkopf
Li Erran Li
Recent work has seen the development of general purpose neural architectures that can be trained to perform tasks across diverse data modalities. General purpose models typically make few assumptions about the underlying data-structure and are known to perform well in the large-data regime. At the same time, there has been growing interest in modular neural architectures that represent the data using sparsely interacting modules. These models can be more robust out-of-distribution, computationally efficient, and capable of sample-efficient adaptation to new data. However, they tend to make domain-specific assumptions about the data, and present challenges in how module behavior (i.e., parameterization) and connectivity (i.e., their layout) can be jointly learned. In this work, we introduce a general purpose, yet modular neural architecture called Neural Attentive Circuits (NACs) that jointly learns the parameterization and a sparse connectivity of neural modules without using domain knowledge. NACs are best understood as the combination of two systems that are jointly trained end-to-end: one that determines the module configuration and the other that executes it on an input. We demonstrate qualitatively that NACs learn diverse and meaningful module configurations on the NLVR2 dataset without additional supervision. Quantitatively, we show that by incorporating modularity in this way, NACs improve upon a strong non-modular baseline in terms of low-shot adaptation on the CIFAR and CUB datasets by about 10%, and OOD robustness on Tiny ImageNet-R by about 2.5%. Further, we find that NACs can achieve an 8x speedup at inference time while losing less than 3% performance. Finally, we find NACs to yield competitive results on diverse data modalities spanning point-cloud classification, symbolic processing and text-classification from ASCII bytes, thereby confirming their general purpose nature.
(Private)-Retroactive Carbon Pricing [(P)ReCaP]: A Market-based Approach for Climate Finance and Risk Assessment
Prateek Gupta
Dylan Radovic
Maarten Scholl
Christian Schroeder de Witt
Yang Zhang
Insufficient Social Cost of Carbon (SCC) estimation methods and short-term decision-making horizons have hindered the ability of carbon emitters to properly correct for the negative externalities of climate change, as well as the capacity of nations to balance economic and climate policy. To overcome these limitations, we introduce Retrospective Social Cost of Carbon Updating (ReSCCU), a novel mechanism that corrects for these limitations as empirically measured evidence is collected. To implement ReSCCU in the context of carbon taxation, we propose Retroactive Carbon Pricing (ReCaP), a market mechanism in which polluters offload the payment of ReSCCU adjustments to insurers. To alleviate systematic risks and minimize government involvement, we introduce the Private ReCaP (PReCaP) prediction market, which could see real-world implementation based on the engagement of a few high net-worth individuals or independent institutions.
TaHiD: Tackling Data Hiding in Fake News Detection with News Propagation Networks
Adrien Benamira
Benjamin Devillers
Etienne Lesot
Ayush K. Ray
Manal Saadi
Fragkiskos D.
Steven Bird
Ewan Klein
Edward Loper
Carlos Castillo
Marcelo Mendoza
Barbara Poblete
Daryna Dementieva
Alexander Panchenko
Jacob Devlin
Ming-Wei Chang
Kenton Lee
Ashish Vaswani
Noam M. Shazeer
Niki Parmar
Adriana Romero
Pietro Liò
Yaqing Wang
Fenglong Ma
Zhiwei Jin
Fake news with detrimental societal effects has attracted extensive attention and research. Despite early success, the state-of-the-art methods fall short of considering the propagation of news. News propagates at different times through different mediums, including users, comments, and sources, which form the news propagation network. Moreover, the serious problem of data hiding arises, which means that fake news publishers disguise fake news as real to confuse users by deleting comments that refute the rumor or deleting the news itself when it has been spread widely. Existing methods do not consider the propagation of news and fail to identify what matters in the process, which leads to fake news hiding in the propagation network and escaping from detection. Inspired by the propagation of news, we propose a novel fake news detection framework named TaHiD, which models the propagation as a heterogeneous dynamic graph and contains the propagation attention module to measure the influence of different propagation. Experiments demonstrate that TaHiD extracts useful information from the news propagation network and outperforms state-of-the-art methods on several benchmark datasets for fake news detection. Additional studies also show that TaHiD is capable of identifying fake news in the case of data hiding.