
Yoshua Bengio

Core Academic Member
Canada CIFAR AI Chair
Full Professor, Université de Montréal, Department of Computer Science and Operations Research
Founder and Scientific Advisor, Leadership Team
Research Topics
Medical Machine Learning
Representation Learning
Reinforcement Learning
Deep Learning
Causality
Generative Models
Probabilistic Models
Molecular Modeling
Computational Neuroscience
Reasoning
Graph Neural Networks
Recurrent Neural Networks
Machine Learning Theory
Natural Language Processing

Biography

*For media requests, please write to medias@mila.quebec.

For more information, contact Marie-Josée Beauchamp, Administrative Assistant, at marie-josee.beauchamp@mila.quebec.

Recognized worldwide as a leading authority on artificial intelligence, Yoshua Bengio is best known for his pioneering work in deep learning, which earned him the 2018 A.M. Turing Award, often called the "Nobel Prize of computing," shared with Geoffrey Hinton and Yann LeCun. He is a Full Professor at Université de Montréal, Founder and Scientific Advisor of Mila – Quebec Artificial Intelligence Institute, and, as a Senior Fellow, co-directs the Learning in Machines & Brains program of the Canadian Institute for Advanced Research (CIFAR). He also serves as Special Advisor and Founding Scientific Director of IVADO.

In 2018, he was the computer scientist with the highest number of new citations worldwide. In 2019, he was awarded the prestigious Killam Prize. Since 2022, he has held the highest h-index of any computer scientist in the world. He is a Fellow of the Royal Society of London and of the Royal Society of Canada, and an Officer of the Order of Canada.

Concerned about the social impact of AI and the goal of having AI benefit everyone, he actively contributed to the Montreal Declaration for the Responsible Development of Artificial Intelligence.

Current Students

[Student names are not preserved in this extract. The list covers supervisees and collaborators, as principal supervisor or co-supervisor, across PhD, research Master's, postdoctoral, research intern, independent research visitor, research collaborator, and alumni collaborator positions, mostly at UdeM, with others at McGill, Cambridge University, KAIST, University of Waterloo, Technical University of Munich, the Max Planck Institute for Intelligent Systems, and the Ying Wu College of Computing.]

Publications

Harvesting Mature Relation Extraction Models from Limited Seed Knowledge: A Self-Development Framework for DS Rule Expansion
Raphael Hoffmann
Congle Zhang
Xiao Ling
Yankai Lin
Shiqi Shen
Zhiyuan Liu
Huanbo Luan
Christopher D. Manning
M. Surdeanu
John Bauer
Pietro Liò
Xuanhui Wang
Cheng Li
Nadav Golbandi
Michael Bendersky
Marc Najork
Wentao Wu
Hongsong Li
Haixun Wang
Distantly-supervised relation extraction (DSRE) is an effective method to scale relation extraction (RE) to large unlabeled corpora with the utilization of knowledge bases (KBs), but it suffers from the scale of KBs and the noise they introduce. To alleviate these two problems, we propose a novel framework called Self-develOpment rUle exPansion (SOUP), which starts from a limited amount of labeled data and continuously produces low-noise labels on large-scale unlabeled data via a growing, learnable set of logical rules. Specifically, SOUP achieves a mutual enhancement of the RE model and the logical rules set: first, an RE model is trained on the labeled data to summarize the knowledge; then this knowledge is used to explore candidate rules from unlabeled data; finally, high-quality candidates are selected in a graph-based ranking manner to extend the logical rules set, providing new rule-labeled data for better RE model training. Experiments on the wiki20 dataset demonstrate that, with limited seed knowledge from a small amount of manually labeled data, SOUP achieves significant improvement over baselines by producing continuous growth of both the logical rules and the RE model, and that SOUP's labeling noise is much lower than that of DS. Furthermore, an RE model enhanced by SOUP with 1.6k logical rules learned from prior knowledge matches the performance of a model trained on data labeled in the DS manner using 72k relational facts from KBs.
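The self-development loop described in this abstract alternates between training the RE model and expanding the rule set. The following is a minimal, self-contained Python sketch of that loop's structure only; train_model and mine_rules are toy stand-ins (assumptions, not the authors' implementation or their graph-based ranking).

    # Toy sketch of a SOUP-style mutual-enhancement loop (illustrative only).
    # A "rule" is a (pattern, relation) pair; the "model" is a pattern -> relation map.
    def train_model(labeled):
        # stand-in for training the RE model on labeled data
        return dict(labeled)

    def mine_rules(model, unlabeled):
        # stand-in for exploring candidate rules and graph-based selection
        return {(p, model[p]) for p in unlabeled if p in model}

    def soup(seed, unlabeled, rounds=3):
        labeled, rules = list(seed), set()
        for _ in range(rounds):
            model = train_model(labeled)           # summarize knowledge
            rules |= mine_rules(model, unlabeled)  # extend the logical rules set
            labeled += list(rules - set(labeled))  # new rule-labeled data
        return model, rules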
Is a Modular Architecture Enough?
Inspired from human cognition, machine learning systems are gradually revealing advantages of sparser and more modular architectures. Recent work demonstrates that not only do some modular architectures generalize well, but they also lead to better out-of-distribution generalization, scaling properties, learning speed, and interpretability. A key intuition behind the success of such systems is that the data generating system for most real-world settings is considered to consist of sparse modular connections, and endowing models with similar inductive biases will be helpful. However, the field has been lacking in a rigorous quantitative assessment of such systems because these real-world data distributions are complex and unknown. In this work, we provide a thorough assessment of common modular architectures, through the lens of simple and known modular data distributions. We highlight the benefits of modularity and sparsity and reveal insights on the challenges faced while optimizing modular systems. In doing so, we propose evaluation metrics that highlight the benefits of modularity, the regimes in which these benefits are substantial, as well as the sub-optimality of current end-to-end learned modular systems as opposed to their claimed potential.
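Since the assessment above relies on "simple and known modular data distributions," a data-generating process of that general flavor can be sketched as follows. This is an illustrative toy (a per-sample choice among linear "rules"), not the paper's actual benchmark.

    # Toy modular data-generating process: each sample is produced by one of
    # n_rules independent linear "modules" (illustrative, not the paper's setup).
    import numpy as np

    def modular_sample(n, n_rules=4, dim=8, seed=0):
        rng = np.random.default_rng(seed)
        W = rng.normal(size=(n_rules, dim))    # one linear rule per module
        x = rng.normal(size=(n, dim))          # inputs
        r = rng.integers(0, n_rules, size=n)   # which module generated each sample
        y = np.einsum('nd,nd->n', x, W[r])     # output of the selected module
        return x, r, y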
Neural Attentive Circuits
Nasim Rahaman
Francesco Locatello
Bernhard Schölkopf
Li Erran Li
Nicolas Ballas
Recent work has seen the development of general purpose neural architectures that can be trained to perform tasks across diverse data modalities. General purpose models typically make few assumptions about the underlying data-structure and are known to perform well in the large-data regime. At the same time, there has been growing interest in modular neural architectures that represent the data using sparsely interacting modules. These models can be more robust out-of-distribution, computationally efficient, and capable of sample-efficient adaptation to new data. However, they tend to make domain-specific assumptions about the data, and present challenges in how module behavior (i.e., parameterization) and connectivity (i.e., their layout) can be jointly learned. In this work, we introduce a general purpose, yet modular neural architecture called Neural Attentive Circuits (NACs) that jointly learns the parameterization and a sparse connectivity of neural modules without using domain knowledge. NACs are best understood as the combination of two systems that are jointly trained end-to-end: one that determines the module configuration and the other that executes it on an input. We demonstrate qualitatively that NACs learn diverse and meaningful module configurations on the NLVR2 dataset without additional supervision. Quantitatively, we show that by incorporating modularity in this way, NACs improve upon a strong non-modular baseline in terms of low-shot adaptation on the CIFAR and CUBs datasets by about 10%, and OOD robustness on Tiny ImageNet-R by about 2.5%. Further, we find that NACs can achieve an 8x speedup at inference time while losing less than 3% performance. Finally, we find NACs to yield competitive results on diverse data modalities spanning point-cloud classification, symbolic processing and text-classification from ASCII bytes, thereby confirming their general-purpose nature.
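To make the idea of jointly determining module connectivity and execution more concrete, here is a heavily simplified PyTorch sketch of sparse, score-based module interaction. It is only loosely inspired by the mechanism described above; TinyModularLayer and all of its details are illustrative assumptions, not the NAC architecture.

    # Illustrative only: modules exchange information along a sparse connectivity
    # given by top-k scores between learned module "signatures" and module states.
    import torch
    import torch.nn as nn

    class TinyModularLayer(nn.Module):
        def __init__(self, n_modules=4, dim=32, topk=2):
            super().__init__()
            self.mods = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_modules))
            self.signatures = nn.Parameter(torch.randn(n_modules, dim))
            self.topk = topk

        def forward(self, h):                        # h: (n_modules, dim) module states
            scores = self.signatures @ h.T           # (n_modules, n_modules) affinities
            idx = scores.topk(self.topk, dim=-1).indices  # sparse layout
            out = [torch.relu(m(h[idx[i]].mean(0)))       # aggregate chosen inputs
                   for i, m in enumerate(self.mods)]
            return torch.stack(out)

    h = torch.randn(4, 32)
    print(TinyModularLayer()(h).shape)               # torch.Size([4, 32])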
Predicting Tactical Solutions to Operational Planning Problems under Imperfect Information
This paper offers a methodological contribution at the intersection of machine learning and operations research. Namely, we propose a methodology to quickly predict expected tactical descriptions of operational solutions (TDOSs). The problem we address occurs in the context of two-stage stochastic programming, where the second stage is demanding computationally. We aim to predict at a high speed the expected TDOS associated with the second-stage problem, conditionally on the first-stage variables. This may be used in support of the solution to the overall two-stage problem by avoiding the online generation of multiple second-stage scenarios and solutions. We formulate the tactical prediction problem as a stochastic optimal prediction program, whose solution we approximate with supervised machine learning. The training data set consists of a large number of deterministic operational problems generated by controlled probabilistic sampling. The labels are computed based on solutions to these problems (solved independently and offline), employing appropriate aggregation and subselection methods to address uncertainty. Results on our motivating application on load planning for rail transportation show that deep learning models produce accurate predictions in very short computing time (milliseconds or less). The predictive accuracy is close to the lower bounds calculated based on sample average approximation of the stochastic prediction programs.
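The offline pipeline sketched in this abstract (controlled sampling of deterministic problems, offline solutions, aggregated labels, supervised learning) can be illustrated with a toy example. Everything below, including solve_second_stage and the mean aggregation, is a hedged stand-in for the paper's load-planning problems and solver, not the actual methodology.

    # Toy version of the pipeline: sample scenarios, solve offline, aggregate
    # into labels, then fit a fast supervised predictor of the expected TDOS.
    import numpy as np
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(0)

    def solve_second_stage(x, xi):
        # stand-in for the deterministic operational solver
        return float(np.maximum(x - xi, 0.0).sum())

    X = rng.uniform(0, 1, size=(500, 5))                      # first-stage decisions
    y = np.array([np.mean([solve_second_stage(x, rng.uniform(0, 1, 5))
                           for _ in range(20)]) for x in X])  # aggregated labels

    model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500).fit(X, y)
    print(model.predict(X[:3]))   # fast approximate predictions, no online solver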
TaHiD: Tackling Data Hiding in Fake News Detection with News Propagation Networks
Adrien Benamira
Benjamin Devillers
Etienne Lesot
Ayush K. Ray
Manal Saadi
Steven Bird
Ewan Klein
Edward Loper
Carlos Castillo
Marcelo Mendoza
Barbara Poblete
Daryna Dementieva
Alexander Panchenko
Jacob Devlin
Ming-Wei Chang
Kenton Lee
Ashish Vaswani
Noam M. Shazeer
Niki Parmar
Pietro Liò
Yaqing Wang
Fenglong Ma
Zhiwei Jin
Fake news with detrimental societal effects has attracted extensive attention and research. Despite early success, state-of-the-art methods fall short of considering the propagation of news. News propagates at different times through different mediums, including users, comments, and sources, which together form the news propagation network. Moreover, the serious problem of data hiding arises: fake news publishers disguise fake news as real to confuse users by deleting comments that refute the rumor, or by deleting the news itself once it has spread widely. Existing methods do not consider the propagation of news and fail to identify what matters in the process, which allows fake news to hide in the propagation network and escape detection. Inspired by the propagation of news, we propose a novel fake news detection framework named TaHiD, which models the propagation as a heterogeneous dynamic graph and contains a propagation attention module to measure the influence of different propagations. Experiments demonstrate that TaHiD extracts useful information from the news propagation network and outperforms state-of-the-art methods on several benchmark datasets for fake news detection. Additional studies also show that TaHiD is capable of identifying fake news in the case of data hiding.
Temporal Latent Bottleneck: Synthesis of Fast and Slow Processing Mechanisms in Sequence Learning
Aniket Rajiv Didolkar
Kshitij Gupta
Anirudh Goyal
Nitesh Bharadwaj Gundavarapu
Alex Lamb
Nan Rosemary Ke
Toward Next-Generation Artificial Intelligence: Catalyzing the NeuroAI Revolution
Anthony Zador
Bence Ölveczky
Sean Escola
Kwabena Boahen
Matthew Botvinick
Dmitri Chklovskii
Anne Churchland
Claudia Clopath
James DiCarlo
Surya Ganguli
Jeff Hawkins
Konrad Paul Kording
Alexei Koulakov
Yann LeCun
Timothy P. Lillicrap
Adam Marblestone
Bruno Olshausen
Alexandre Pouget
Cristina Savin
Terrence Sejnowski
Eero Simoncelli
Sara Solla
David Sussillo
Andreas S. Tolias
Doris Tsao
Trajectory Balance: Improved Credit Assignment in GFlowNets
Generative flow networks (GFlowNets) are a method for learning a stochastic policy for generating compositional objects, such as graphs or strings, from a given unnormalized density by sequences of actions, where many possible action sequences may lead to the same object. We find previously proposed learning objectives for GFlowNets, flow matching and detailed balance, which are analogous to temporal difference learning, to be prone to inefficient credit propagation across long action sequences. We thus propose a new learning objective for GFlowNets, trajectory balance, as a more efficient alternative to previously used objectives. We prove that any global minimizer of the trajectory balance objective can define a policy that samples exactly from the target distribution. In experiments on four distinct domains, we empirically demonstrate the benefits of the trajectory balance objective for GFlowNet convergence, diversity of generated samples, and robustness to long action sequences and large action spaces.
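For reference, the trajectory balance objective this abstract refers to can be written, for a complete trajectory \(\tau = (s_0 \to \dots \to s_n = x)\) with forward policy \(P_F\), backward policy \(P_B\), learned partition function \(Z_\theta\), and reward \(R\), as

    \mathcal{L}_{\mathrm{TB}}(\tau) = \left( \log \frac{Z_\theta \prod_{t=1}^{n} P_F(s_t \mid s_{t-1}; \theta)}{R(x) \prod_{t=1}^{n} P_B(s_{t-1} \mid s_t; \theta)} \right)^{2}.

Driving this loss to zero for all trajectories forces the forward policy to sample terminal objects \(x\) with probability proportional to \(R(x)\).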
Unifying Likelihood-free Inference with Black-box Optimization and Beyond
Black-box optimization formulations for biological sequence design have drawn recent attention due to their promising potential impact on the pharmaceutical industry. In this work, we propose to unify two seemingly distinct worlds: likelihood-free inference and black-box optimization, under one probabilistic framework. In tandem, we provide a recipe for constructing various sequence design methods based on this framework. We show how previous optimization approaches can be "reinvented" in our framework, and further propose new probabilistic black-box optimization algorithms. Extensive experiments on sequence design applications illustrate the benefits of the proposed methodology.
Weakly Supervised Representation Learning with Sparse Perturbations
Kartik Ahuja
Jason Hartford
The theory of representation learning aims to build methods that provably invert the data generating process with minimal domain knowledge or any source of supervision. Most prior approaches require strong distributional assumptions on the latent variables and weak supervision (auxiliary information such as timestamps) to provide provable identification guarantees. In this work, we show that if one has weak supervision from observations generated by sparse perturbations of the latent variables--e.g. images in a reinforcement learning environment where actions move individual sprites--identification is achievable under unknown continuous latent distributions. We show that if the perturbations are applied only on mutually exclusive blocks of latents, we identify the latents up to those blocks. We also show that if these perturbation blocks overlap, we identify latents up to the smallest blocks shared across perturbations. Consequently, if there are blocks that intersect in one latent variable only, then such latents are identified up to permutation and scaling. We propose a natural estimation procedure based on this theory and illustrate it on low-dimensional synthetic and image-based experiments.
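In symbols, the weak supervision described above amounts to observing pairs generated as (notation mine, matching the abstract's setup)

    x = g(z), \qquad \tilde{x} = g(z + \delta), \qquad \operatorname{supp}(\delta) \subseteq B \in \mathcal{B},

where \(g\) is the unknown decoder, \(z\) the latent vector, \(\delta\) a sparse perturbation, and \(\mathcal{B}\) the collection of perturbation blocks. The result is that \(z\) is then identifiable up to the blocks of \(\mathcal{B}\), and up to permutation and scaling on any coordinate where blocks intersect in a single latent.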
Multi-Domain Balanced Sampling Improves Out-of-Distribution Generalization of Chest X-ray Pathology Prediction Models
Enoch Amoatey Tetteh
Joseph D Viviano
Joseph Paul Cohen
Learning models that generalize under different distribution shifts in medical imaging has been a long-standing research challenge. There have been several proposals for efficient and robust visual representation learning among vision research practitioners, especially in the sensitive and critical biomedical domain. In this paper, we propose an idea for out-of-distribution generalization of chest X-ray pathologies that uses a simple balanced batch sampling technique. We observed that balanced sampling between the multiple training datasets improves the performance over baseline models trained without balancing.
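The balanced batch sampling technique described above is straightforward to sketch: draw an equal share of each batch from every training domain. A minimal, framework-free version follows (illustrative, not the paper's code).

    # Draw each training batch with an equal number of examples per domain.
    import random

    def balanced_batches(datasets, batch_size, n_batches, seed=0):
        random.seed(seed)
        per_domain = batch_size // len(datasets)   # equal share per dataset
        for _ in range(n_batches):
            batch = []
            for ds in datasets:
                batch += random.sample(ds, per_domain)
            random.shuffle(batch)                  # mix domains within the batch
            yield batch

    domains = [list(range(100)), list(range(100, 200)), list(range(200, 300))]
    first = next(balanced_batches(domains, batch_size=12, n_batches=1))
    print(sorted(first))                           # 4 examples from each domain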