
Yoshua Bengio

Core Academic Member
Canada CIFAR AI Chair
Full Professor, Université de Montréal, Department of Computer Science and Operations Research
Founder and Scientific Advisor, Leadership Team
Research Topics
Causality
Computational Neuroscience
Deep Learning
Generative Models
Graph Neural Networks
Machine Learning Theory
Medical Machine Learning
Molecular Modeling
Natural Language Processing
Probabilistic Models
Reasoning
Recurrent Neural Networks
Reinforcement Learning
Representation Learning

Biography

For media requests, please write to medias@mila.quebec.

For more information please contact Marie-Josée Beauchamp, Administrative Assistant at marie-josee.beauchamp@mila.quebec.

Yoshua Bengio is recognized worldwide as a leading expert in AI. He is most known for his pioneering work in deep learning, which earned him the 2018 A.M. Turing Award, “the Nobel Prize of computing,” with Geoffrey Hinton and Yann LeCun.

Bengio is a full professor at Université de Montréal, and the founder and scientific advisor of Mila – Quebec Artificial Intelligence Institute. He is also a senior fellow at CIFAR and co-directs its Learning in Machines & Brains program, serves as special advisor and founding scientific director of IVADO, and holds a Canada CIFAR AI Chair.

In 2019, Bengio was awarded the prestigious Killam Prize, and in 2022 he was the most cited computer scientist in the world by h-index. He is a Fellow of the Royal Society of London, Fellow of the Royal Society of Canada, Knight of the Legion of Honor of France and Officer of the Order of Canada. In 2023, he was appointed to the UN’s Scientific Advisory Board for Independent Advice on Breakthroughs in Science and Technology.

Concerned about the social impact of AI, Bengio helped draft the Montréal Declaration for the Responsible Development of Artificial Intelligence and continues to raise awareness about the importance of mitigating the potentially catastrophic risks associated with future AI systems.

Current Students

Current students and collaborators include PhD and Master's students, postdoctoral researchers, research interns, independent visiting researchers, and collaborating researchers and alumni, supervised or co-supervised by Yoshua Bengio, primarily at Université de Montréal, with others at McGill University, the University of Waterloo, Cambridge University, KAIST, the Technical University of Munich, the Max Planck Institute for Intelligent Systems, and the Ying Wu College of Computing.

Publications

FAENet: Frame Averaging Equivariant GNN for Materials Modeling
Alexandre AGM Duval
Victor Schmidt
Alex Hernandez-Garcia
Santiago Miret
Fragkiskos D. Malliaros
Applications of machine learning techniques for materials modeling typically involve functions known to be equivariant or invariant to specific symmetries. While graph neural networks (GNNs) have proven successful in such tasks, they enforce symmetries via the model architecture, which often reduces their expressivity, scalability and comprehensibility. In this paper, we introduce (1) a flexible framework relying on stochastic frame-averaging (SFA) to make any model E(3)-equivariant or invariant through data transformations. (2) FAENet: a simple, fast and expressive GNN, optimized for SFA, that processes geometric information without any symmetry-preserving design constraints. We prove the validity of our method theoretically and empirically demonstrate its superior accuracy and computational scalability in materials modeling on the OC20 dataset (S2EF, IS2RE) as well as common molecular modeling tasks (QM9, QM7-X). A package implementation is available at https://faenet.readthedocs.io.
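The two-part framing above (a model-agnostic frame-averaging step plus an unconstrained backbone) can be pictured with a short sketch. The following is a minimal, hypothetical illustration of stochastic frame averaging for an E(3)-invariant property: a frame is built from the PCA of the centered coordinates, one frame element is sampled per forward pass, and the backbone only ever sees canonicalized coordinates. The backbone function and its signature are stand-ins for a real model, not the paper's actual code.

# Minimal sketch of stochastic frame averaging (SFA) for an E(3)-invariant
# property predictor. The frame construction via PCA of centered coordinates
# follows the general frame-averaging recipe; `backbone` is a hypothetical
# model mapping (N, 3) coordinates plus atom features to a scalar target.
import numpy as np

def pca_frames(pos):
    """Return centered coordinates and the 2^3 = 8 sign-flipped PCA bases
    that define the frame."""
    centered = pos - pos.mean(axis=0)                 # remove translation
    _, vecs = np.linalg.eigh(centered.T @ centered)   # principal axes (3x3)
    frames = []
    for signs in np.ndindex(2, 2, 2):                 # all sign combinations
        flip = np.diag([(-1) ** s for s in signs])
        frames.append(vecs @ flip)
    return centered, frames

def sfa_predict(backbone, pos, feats, rng):
    """One SFA forward pass: sample a single frame element instead of
    averaging over all eight (cheaper, unbiased in expectation)."""
    centered, frames = pca_frames(pos)
    R = frames[rng.integers(len(frames))]             # sample one frame
    canon_pos = centered @ R                          # canonicalized coords
    return backbone(canon_pos, feats)                 # invariant target (e.g. energy)

For an equivariant target such as forces, the backbone's vector output would additionally be rotated back by the transpose of the sampled frame, and full frame averaging would average over all eight frame elements rather than sampling one.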
GFlowNet-EM for Learning Compositional Latent Variable Models
Edward J Hu
Moksh J. Jain
Katie E Everett
Alexandros Graikos
Latent variable models (LVMs) with discrete compositional latents are an important but challenging setting due to a combinatorially large number of possible configurations of the latents. A key tradeoff in modeling the posteriors over latents is between expressivity and tractable optimization. For algorithms based on expectation-maximization (EM), the E-step is often intractable without restrictive approximations to the posterior. We propose the use of GFlowNets, algorithms for sampling from an unnormalized density by learning a stochastic policy for sequential construction of samples, for this intractable E-step. By training GFlowNets to sample from the posterior over latents, we take advantage of their strengths as amortized variational inference algorithms for complex distributions over discrete structures. Our approach, GFlowNet-EM, enables the training of expressive LVMs with discrete compositional latents, as shown by experiments on non-context-free grammar induction and on images using discrete variational autoencoders (VAEs) without conditional independence enforced in the encoder.
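As a rough picture of the alternating scheme described above (a sketch, not the paper's implementation): the E-step trains a GFlowNet whose sampling distribution over latents targets the unnormalized posterior, and the M-step updates the generative model on latents drawn from that sampler. The interfaces gflownet.train_step, gflownet.sample, model.log_joint and model.maximize_step below are hypothetical.

# Schematic of the GFlowNet-as-amortized-E-step loop described above.
def gflownet_em(model, gflownet, data_loader, n_rounds):
    for _ in range(n_rounds):
        for x in data_loader:
            # E-step: train the GFlowNet so that the probability of sampling
            # a latent z is proportional to the unnormalized posterior p(x, z).
            gflownet.train_step(reward_fn=lambda z: model.log_joint(x, z))
            # M-step: draw latents from the learned sampler and take a
            # gradient step on the generative model's log-joint.
            z = gflownet.sample(x)
            model.maximize_step(x, z)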
HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution
Eric Nguyen
Michael Poli
Marjan Faizi
Armin W Thomas
Callum Birch-Sykes
Michael Wornow
Aman Patel
Clayton M. Rabideau
Stefano Massaroli
Stefano Ermon
Stephen Baccus
Christopher Re
Genomic (DNA) sequences encode an enormous amount of information for gene regulation and protein synthesis. Similar to natural language models, researchers have proposed foundation models in genomics to learn generalizable features from unlabeled genome data that can then be fine-tuned for downstream tasks such as identifying regulatory elements. Due to the quadratic scaling of attention, previous Transformer-based genomic models have used 512 to 4k tokens as context (0.001% of the human genome), significantly limiting the modeling of long-range interactions in DNA. In addition, these methods rely on tokenizers…
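The tokenization point at the end of the abstract can be illustrated with a toy sketch. This is not HyenaDNA's preprocessing code; it only contrasts fixed k-mer tokens with the single-nucleotide, character-level tokens that the model's title refers to.

# Toy contrast between fixed k-mer tokenization (coarse units) and
# single-nucleotide tokenization (base-pair resolution).
def kmer_tokenize(seq, k=6):
    """Non-overlapping fixed k-mers: one token per k bases."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]

def char_tokenize(seq):
    """Single-nucleotide tokens: one token per base."""
    return list(seq)

seq = "ACGTACGTACGT"
print(kmer_tokenize(seq))  # ['ACGTAC', 'GTACGT']  -> 2 coarse tokens
print(char_tokenize(seq))  # ['A', 'C', 'G', ...]  -> 12 base-level tokens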
Adaptive Discrete Communication Bottlenecks with Dynamic Vector Quantization
Dianbo Liu
Alex Lamb
Xu Ji
Pascal Notsawo
Michael Curtis Mozer
Kenji Kawaguchi
The Effect of diversity in Meta-Learning
Few-shot learning aims to learn representations that can tackle novel tasks given a small number of examples. Recent studies show that task distribution plays a vital role in the performance of the model. Conventional wisdom is that task diversity should improve the performance of meta-learning. In this work, we find evidence to the contrary; we study different task distributions on a myriad of models and datasets to evaluate the effect of task diversity on meta-learning algorithms. For this experiment, we train on multiple datasets, and with three broad classes of meta-learning models - Metric-based (i.e., Protonet, Matching Networks), Optimization-based (i.e., MAML, Reptile, and MetaOptNet), and Bayesian meta-learning models (i.e., CNAPs). Our experiments demonstrate that the effect of task diversity on all these algorithms follows a similar trend, and task diversity does not seem to offer any benefits to the learning of the model. Furthermore, we also demonstrate that even a handful of tasks, repeated over multiple batches, would be sufficient to achieve a performance similar to uniform sampling, which draws into question the need for additional tasks to create better models.
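The low-diversity regime described in the last sentence can be sketched as a choice of episode sampler. This is a hypothetical toy setup (make_episode and the class universe are stand-ins, not the paper's benchmark code): one sampler draws each N-way task uniformly from the full class universe, the other reuses a handful of fixed tasks across batches.

# Toy comparison of episode samplers: uniform task sampling vs. a small
# fixed pool of tasks repeated across batches.
import random

def make_episode(classes, n_way=5):
    """Sample an N-way task (represented here just by its class identities)."""
    return random.sample(classes, n_way)

all_classes = list(range(1000))                              # full task distribution
small_pool = [make_episode(all_classes) for _ in range(8)]   # a handful of tasks

def uniform_sampler():
    return make_episode(all_classes)        # high task diversity

def low_diversity_sampler():
    return random.choice(small_pool)        # same few tasks, repeated

# A meta-learner would be trained on batches drawn from either sampler; the
# paper's finding is that both regimes reach similar performance.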
Constant Memory Attention Block
Frederick Tung
Hossein Hajimirsadeghi
Mohamed Osama Ahmed