Portrait of Yoshua Bengio

Yoshua Bengio

Core Academic Member
Canada CIFAR AI Chair
Full Professor, Université de Montréal, Department of Computer Science and Operations Research Department
Founder and Scientific Advisor, Leadership Team
Research Topics
Causality
Computational Neuroscience
Deep Learning
Generative Models
Graph Neural Networks
Machine Learning Theory
Medical Machine Learning
Molecular Modeling
Natural Language Processing
Probabilistic Models
Reasoning
Recurrent Neural Networks
Reinforcement Learning
Representation Learning

Biography

*For media requests, please write to medias@mila.quebec.

For more information please contact Marie-Josée Beauchamp, Administrative Assistant at marie-josee.beauchamp@mila.quebec.

Yoshua Bengio is recognized worldwide as a leading expert in AI. He is most known for his pioneering work in deep learning, which earned him the 2018 A.M. Turing Award, “the Nobel Prize of computing,” with Geoffrey Hinton and Yann LeCun.

Bengio is a full professor at Université de Montréal, and the founder and scientific advisor of Mila – Quebec Artificial Intelligence Institute. He is also a senior fellow at CIFAR and co-directs its Learning in Machines & Brains program, serves as special advisor and founding scientific director of IVADO, and holds a Canada CIFAR AI Chair.

In 2019, Bengio was awarded the prestigious Killam Prize and in 2022, he was the most cited computer scientist in the world by h-index. He is a Fellow of the Royal Society of London, Fellow of the Royal Society of Canada, Knight of the Legion of Honor of France and Officer of the Order of Canada. In 2023, he was appointed to the UN’s Scientific Advisory Board for Independent Advice on Breakthroughs in Science and Technology.

Concerned about the social impact of AI, Bengio helped draft the Montréal Declaration for the Responsible Development of Artificial Intelligence and continues to raise awareness about the importance of mitigating the potentially catastrophic risks associated with future AI systems.

Current Students

Collaborating Alumni - McGill University
Collaborating Alumni - Université de Montréal
Collaborating researcher - Cambridge University
Principal supervisor :
PhD - Université de Montréal
Independent visiting researcher
Co-supervisor :
PhD - Université de Montréal
Independent visiting researcher
Principal supervisor :
Collaborating researcher - N/A
Principal supervisor :
PhD - Université de Montréal
Collaborating researcher - KAIST
Collaborating Alumni - Université de Montréal
PhD - Université de Montréal
Collaborating Alumni - Université de Montréal
Co-supervisor :
Independent visiting researcher
Principal supervisor :
PhD - Université de Montréal
Co-supervisor :
PhD - Université de Montréal
PhD - Université de Montréal
PhD - Université de Montréal
PhD - Université de Montréal
PhD - Université de Montréal
PhD - Université de Montréal
PhD - Université de Montréal
Principal supervisor :
Collaborating Alumni - Université de Montréal
Postdoctorate - Université de Montréal
Principal supervisor :
Collaborating Alumni - Université de Montréal
Postdoctorate - Université de Montréal
Principal supervisor :
Collaborating Alumni - Université de Montréal
Principal supervisor :
Collaborating Alumni
PhD - Université de Montréal
Collaborating Alumni - Université de Montréal
PhD - Université de Montréal
Co-supervisor :
PhD - Université de Montréal
Principal supervisor :
PhD - Université de Montréal
Principal supervisor :
Postdoctorate - Université de Montréal
Principal supervisor :
Independent visiting researcher - Université de Montréal
PhD - Université de Montréal
Principal supervisor :
Collaborating researcher - Ying Wu Coll of Computing
Collaborating researcher - University of Waterloo
Principal supervisor :
Collaborating Alumni - Max-Planck-Institute for Intelligent Systems
Collaborating researcher - Université de Montréal
Co-supervisor :
PhD - Université de Montréal
Postdoctorate - Université de Montréal
Independent visiting researcher - Université de Montréal
Postdoctorate - Université de Montréal
PhD - Université de Montréal
Principal supervisor :
Independent visiting researcher
Principal supervisor :
Postdoctorate - Université de Montréal
Collaborating Alumni - Université de Montréal
Collaborating Alumni - Université de Montréal
Postdoctorate
Co-supervisor :
Independent visiting researcher - Technical University of Munich
PhD - Université de Montréal
Co-supervisor :
Independent visiting researcher
Principal supervisor :
Collaborating Alumni - Université de Montréal
Postdoctorate - Université de Montréal
Co-supervisor :
PhD - Université de Montréal
Principal supervisor :
Collaborating researcher
Collaborating researcher - Université de Montréal
PhD - McGill University
Principal supervisor :
PhD - Université de Montréal
Principal supervisor :
PhD - Université de Montréal
Collaborating Alumni - McGill University
Principal supervisor :

Publications

Neural Attentive Circuits
Nasim Rahaman
Francesco Locatello
Bernhard Schölkopf
Li Erran Li
Recent work has seen the development of general purpose neural architectures that can be trained to perform tasks across diverse data modali… (see more)ties. General purpose models typically make few assumptions about the underlying data-structure and are known to perform well in the large-data regime. At the same time, there has been growing interest in modular neural architectures that represent the data using sparsely interacting modules. These models can be more robust out-of-distribution, computationally efficient, and capable of sample-efficient adaptation to new data. However, they tend to make domain-specific assumptions about the data, and present challenges in how module behavior (i.e., parameterization) and connectivity (i.e., their layout) can be jointly learned. In this work, we introduce a general purpose, yet modular neural architecture called Neural Attentive Circuits (NACs) that jointly learns the parameterization and a sparse connectivity of neural modules without using domain knowledge. NACs are best understood as the combination of two systems that are jointly trained end-to-end: one that determines the module configuration and the other that executes it on an input. We demonstrate qualitatively that NACs learn diverse and meaningful module configurations on the NLVR2 dataset without additional supervision. Quantitatively, we show that by incorporating modularity in this way, NACs improve upon a strong non-modular baseline in terms of low-shot adaptation on CIFAR and CUBs dataset by about 10%, and OOD robustness on Tiny ImageNet-R by about 2.5%. Further, we find that NACs can achieve an 8x speedup at inference time while losing less than 3% performance. Finally, we find NACs to yield competitive results on diverse data modalities spanning point-cloud classification, symbolic processing and text-classification from ASCII bytes, thereby confirming its general purpose nature.
Predicting Tactical Solutions to Operational Planning Problems under Imperfect Information
This paper offers a methodological contribution at the intersection of machine learning and operations research. Namely, we propose a method… (see more)ology to quickly predict expected tactical descriptions of operational solutions (TDOSs). The problem we address occurs in the context of two-stage stochastic programming, where the second stage is demanding computationally. We aim to predict at a high speed the expected TDOS associated with the second-stage problem, conditionally on the first-stage variables. This may be used in support of the solution to the overall two-stage problem by avoiding the online generation of multiple second-stage scenarios and solutions. We formulate the tactical prediction problem as a stochastic optimal prediction program, whose solution we approximate with supervised machine learning. The training data set consists of a large number of deterministic operational problems generated by controlled probabilistic sampling. The labels are computed based on solutions to these problems (solved independently and offline), employing appropriate aggregation and subselection methods to address uncertainty. Results on our motivating application on load planning for rail transportation show that deep learning models produce accurate predictions in very short computing time (milliseconds or less). The predictive accuracy is close to the lower bounds calculated based on sample average approximation of the stochastic prediction programs.
TaHiD: Tackling Data Hiding in Fake News Detection with News Propagation Networks
Adrien Benamira
Benjamin Devillers
Etienne Lesot
Ayush K. Ray
Manal Saadi
Fragkiskos D 587
Steven Bird
Ewan Klein
Edward Loper
Nat-593
Carlos Castillo
Marcelo Mendoza
Barbara Poblete
Daryna Dementieva
Alexander Panchenko
Jacob Devlin
Ming-Wei Chang
Kenton Lee
Ashish Vaswani
Noam M. Shazeer … (see 8 more)
Niki Parmar
Pietro Lio’
Yaqing Wang
Fenglong Ma
Zhiwei Jin
Fake news with detrimental societal effects has 001 attracted extensive attention and research. De-002 spite early success, the state-of-the… (see more)-art meth-003 ods fall short of considering the propagation 004 of news. News propagates at different times 005 through different mediums, including users, 006 comments, and sources, which form the news 007 propagation network. Moreover, the serious 008 problem of data hiding arises, which means 009 that fake news publishers disguise fake news 010 as real to confuse users by deleting comments 011 that refute the rumor or deleting the news itself 012 when it has been spread widely. Existing meth-013 ods do not consider the propagation of news 014 and fail to identify what matters in the process, 015 which leads to fake news hiding in the prop-016 agation network and escaping from detection. 017 Inspired by the propagation of news, we pro-018 pose a novel fake news detection framework 019 named TaHiD, which models the propagation 020 as a heterogeneous dynamic graph and contains 021 the propagation attention module to measure 022 the influence of different propagation. Exper-023 iments demonstrate that TaHiD extracts use-024 ful information from the news propagation net-025 work and outperforms state-of-the-art methods 026 on several benchmark datasets for fake news 027 detection. Additional studies also show that 028 TaHiD is capable of identifying fake news in 029 the case of data hiding. 030
Temporal Latent Bottleneck: Synthesis of Fast and Slow Processing Mechanisms in Sequence Learning
Aniket Rajiv Didolkar
Nitesh Bharadwaj Gundavarapu
Nan Rosemary Ke
Toward Next-Generation Artificial Intelligence: Catalyzing the NeuroAI Revolution
Anthony Zador
Bence Ölveczky
Sean Escola
Kwabena Boahen
Matthew Botvinick
Dmitri Chklovskii
Anne Churchland
Claudia Clopath
James DiCarlo
Surya Ganguli
Jeff Hawkins
Konrad Paul Kording
Alexei Koulakov
Timothy P. Lillicrap
Adam Marblestone
Bruno Olshausen
Alexandre Pouget … (see 7 more)
Cristina Savin
Terrence Sejnowski
Eero Simoncelli
Sara Solla
David Sussillo
Andreas S. Tolias
Doris Tsao
Toward Next-Generation Artificial Intelligence: Catalyzing the NeuroAI Revolution
Anthony Zador
Bence Ölveczky
Sean Escola
Kwabena Boahen
Matthew Botvinick
Dmitri Chklovskii
Anne Churchland
Claudia Clopath
James DiCarlo
Surya Ganguli
Jeff Hawkins
Konrad Paul Kording
Alexei Koulakov
Timothy P. Lillicrap
Adam Marblestone
Bruno Olshausen
Alexandre Pouget … (see 7 more)
Cristina Savin
Terrence Sejnowski
Eero Simoncelli
Sara Solla
David Sussillo
Andreas S. Tolias
Doris Tsao
Trajectory Balance: Improved Credit Assignment in GFlowNets
Generative flow networks (GFlowNets) are a method for learning a stochastic policy for generating compositional objects, such as graphs or s… (see more)trings, from a given unnormalized density by sequences of actions, where many possible action sequences may lead to the same object. We find previously proposed learning objectives for GFlowNets, flow matching and detailed balance, which are analogous to temporal difference learning, to be prone to inefficient credit propagation across long action sequences. We thus propose a new learning objective for GFlowNets, trajectory balance, as a more efficient alternative to previously used objectives. We prove that any global minimizer of the trajectory balance objective can define a policy that samples exactly from the target distribution. In experiments on four distinct domains, we empirically demonstrate the benefits of the trajectory balance objective for GFlowNet convergence, diversity of generated samples, and robustness to long action sequences and large action spaces.
Unifying Likelihood-free Inference with Black-box Optimization and Beyond
Black-box optimization formulations for biological sequence design have drawn recent attention due to their promising potential impact on th… (see more)e pharmaceutical industry. In this work, we propose to unify two seemingly distinct worlds: likelihood-free inference and black-box optimization, under one probabilistic framework. In tandem, we provide a recipe for constructing various sequence design methods based on this framework. We show how previous optimization approaches can be"reinvented"in our framework, and further propose new probabilistic black-box optimization algorithms. Extensive experiments on sequence design application illustrate the benefits of the proposed methodology.
Weakly Supervised Representation Learning with Sparse Perturbations
Jason Hartford
The theory of representation learning aims to build methods that provably invert the data generating process with minimal domain knowledge o… (see more)r any source of supervision. Most prior approaches require strong distributional assumptions on the latent variables and weak supervision (auxiliary information such as timestamps) to provide provable identification guarantees. In this work, we show that if one has weak supervision from observations generated by sparse perturbations of the latent variables--e.g. images in a reinforcement learning environment where actions move individual sprites--identification is achievable under unknown continuous latent distributions. We show that if the perturbations are applied only on mutually exclusive blocks of latents, we identify the latents up to those blocks. We also show that if these perturbation blocks overlap, we identify latents up to the smallest blocks shared across perturbations. Consequently, if there are blocks that intersect in one latent variable only, then such latents are identified up to permutation and scaling. We propose a natural estimation procedure based on this theory and illustrate it on low-dimensional synthetic and image-based experiments.
Multi-Domain Balanced Sampling Improves Out-of-Distribution Generalization of Chest X-ray Pathology Prediction Models
Enoch Amoatey Tetteh
Joseph D Viviano
Learning models that generalize under different distribution shifts in medical imaging has been a long-standing research challenge. There ha… (see more)ve been several proposals for efficient and robust visual representation learning among vision research practitioners, especially in the sensitive and critical biomedical domain. In this paper, we propose an idea for out-of-distribution generalization of chest X-ray pathologies that uses a simple balanced batch sampling technique. We observed that balanced sampling between the multiple training datasets improves the performance over baseline models trained without balancing.
Generative Adversarial Networks
Generative Adversarial Networks (GANs) are a type of deep learning techniques that have shown remarkable success in generating realistic ima… (see more)ges, videos, and other types of data. This paper provides a comprehensive guide to GANs, covering their architecture, loss functions, training methods, applications, evaluation metrics, challenges, and future directions. We begin with an introduction to GANs and their historical development, followed by a review of the background and related work. We then provide a detailed overview of the GAN architecture, including the generator and discriminator networks, and discuss the key design choices and variations. Next, we review the loss functions utilized in GANs, including the original minimax objective, as well as more recent approaches s.a. Wasserstein distance and gradient penalty. We then delve into the training of GANs, discussing common techniques s.a. alternating optimization, minibatch discrimination, and spectral normalization. We also provide a survey of the various applications of GANs across domains. In addition, we review the evaluation metrics utilized to assess the diversity and quality of GAN-produced data. Furthermore, we discuss the challenges and open issues in GANs, including mode collapse, training instability, and ethical considerations. Finally, we provide a glimpse into the future directions of GAN research, including improving scalability, developing new architectures, incorporating domain knowledge, and exploring new applications. Overall, this paper serves as a comprehensive guide to GANs, providing both theoretical and practical insights for researchers and practitioners in the field.
Effect of diversity in Meta-Learning
Few-shot learning aims to learn representations that can tackle novel tasks given a small number of examples. Recent studies show that task … (see more)distribution plays a vital role in the performance of the model. Conventional wisdom is that task diversity should improve the performance of meta-learning. In this work, we find evidence to the contrary; we study different task distributions on a myriad of models and datasets to evaluate the effect of task diversity on meta-learning algorithms. For this experiment, we train on two datasets - Omniglot and miniImageNet and with three broad classes of meta-learning models - Metric-based (i.e., Protonet, Matching Networks), Optimization-based (i.e., MAML, Reptile, and MetaOptNet), and Bayesian meta-learning models (i.e., CNAPs). Our experiments demonstrate that the effect of task diversity on all these algorithms follows a similar trend, and task diversity does not seem to offer any benefits to the learning of the model. Furthermore, we also demonstrate that even a handful of tasks, repeated over multiple batches, would be sufficient to achieve a performance similar to uniform sampling and draws into question the need for additional tasks to create better models.