Yoshua Bengio

ahmad.ghawanmeh@mila.quebec

Biography

*For media requests, please write to medias@mila.quebec.

For more information please contact Julie Mongeau, executive assistant at julie.mongeau@mila.quebec.

Yoshua Bengio is recognized worldwide as a leading expert in AI. He is best known for his pioneering work in deep learning, which earned him the 2018 A.M. Turing Award, “the Nobel Prize of computing,” with Geoffrey Hinton and Yann LeCun.

Bengio is a full professor at Université de Montréal, and the founder and scientific director of Mila – Quebec Artificial Intelligence Institute. He is also a senior fellow at CIFAR and co-directs its Learning in Machines & Brains program, serves as scientific director of IVADO, and holds a Canada CIFAR AI Chair.

In 2019, Bengio was awarded the prestigious Killam Prize and in 2022, he was the most cited computer scientist in the world by h-index. He is a Fellow of the Royal Society of London, Fellow of the Royal Society of Canada, Knight of the Legion of Honor of France and Officer of the Order of Canada. In 2023, he was appointed to the UN’s Scientific Advisory Board for Independent Advice on Breakthroughs in Science and Technology.

Concerned about the social impact of AI, Bengio helped draft the Montréal Declaration for the Responsible Development of Artificial Intelligence and continues to raise awareness about the importance of mitigating the potentially catastrophic risks associated with future AI systems.

Current Students

Aayush Bajaj

Professional Master's - Université de Montréal

Co-supervisor :

Samira Ebrahimi Kahou

aayush.bajaj@mila.quebec

Ahmad Ghawanmeh

Professional Master's - Université de Montréal

Akram Erraqabi

PhD - Université de Montréal

akram.erraqabi@mila.quebec

Alex Hernandez-Garcia

Postdoctorate - Université de Montréal

Co-supervisor :

Postdoctorate - Université de Montréal

alexander.tong@mila.quebec

Sasha Volokhova

PhD - Université de Montréal

alexandra.volokhova@mila.quebec

Alexandre Duval

Collaborating researcher - Université Paris-Saclay

Principal supervisor :

alexandre.duval@mila.quebec

andres.campero@mila.quebec

Aman Dalmia

Professional Master's - Université de Montréal

aman.dalmia@mila.quebec

Andrés Campero

Independent visiting researcher - MIT

aniket.didolkar@mila.quebec

Aniket Didolkar

PhD - Université de Montréal

ayoub.atanane@mila.quebec

Anja Surina

PhD - École Polytechnique Fédérale de Lausanne

anja.surina@mila.quebec

Ayoub Atanane

Research Intern - Université du Québec à Rimouski

Basile Terver

Collaborating researcher

Principal supervisor :

basile.terver@mila.quebec

Camille Rochefort-Boulanger

PhD - Université de Montréal

Principal supervisor :

rochefoc@mila.quebec

clemence.granade@mila.quebec

Chen Chen

Postdoctorate - Université de Montréal

Co-supervisor :

Collaborating Alumni

Professional Master's - Université de Montréal

Cristian Meo

Collaborating Alumni

cristian.meo@mila.quebec

cristian-dragos.manta@mila.quebec

Cristian Dragos Manta

PhD - Université de Montréal

Co-supervisor :

Dhanya Sridhar

damiano.fornasiere@mila.quebec

Damiano Fornasiere

PhD - Barcelona University

Dan Assouline

Collaborating Alumni

dan.assouline@mila.quebec

dinghuai.zhang@mila.quebec

Dinghuai Zhang

PhD - Université de Montréal

Principal supervisor :

Aaron Courville

Divya Sharma

Collaborating Alumni

divya.sharma@mila.quebec

Donna Vakalis

Postdoctorate - Université de Montréal

Co-supervisor :

donna.vakalis@mila.quebec

dragos.secrieru@mila.quebec

Dragos Secrieru

Master's Research - Université de Montréal

Edward Hu

PhD - Université de Montréal

edward.hu@mila.quebec

Elmimouni Zakaria

Research Intern - Université de Montréal

zakarya.elmimouni@mila.quebec

eric.elmoznino@mila.quebec

Eric Elmoznino

PhD - Université de Montréal

Co-supervisor :

Guillaume Lajoie

Research Intern - UQAR

ghait.boukachab@mila.quebec

jack.richter-powell@mila.quebec

Hae-Beom Lee

Collaborating Alumni

hae-beom.lee@mila.quebec

Jessie Richter-Powell

Independent visiting researcher - Université de Montréal

hussein-mohamu.jama@mila.quebec

Jama Mohamud

PhD - Université de Montréal

Principal supervisor :

Mirco Ravanelli

Research Intern - McGill University

jamal.abouhaibeh@mila.quebec

james.requeima@mila.quebec

James Requeima

Independent visiting researcher - Université de Montréal

Jarrid Rector-Brooks

PhD - Université de Montréal

Co-supervisor :

Sarath Chandar Anbil Parthipan

jarrid.rector-brooks@mila.quebec

Jean-Pierre Falet

PhD - Université de Montréal

Co-supervisor :

Guillaume Lajoie

jean-pierre.falet@mila.quebec

Professional Master's - Université de Montréal

jerome.francis@mila.quebec

katie-elizabeth.everett@mila.quebec

George Jiangyan Ma

Research Intern - Université de Montréal

jiangyan.ma@mila.quebec

PhD - Université de Montréal

madankan@mila.quebec

Katie Everett

PhD - Massachusetts Institute of Technology

Léna Ezzine

PhD - Université de Montréal

lena-nehale.ezzine@mila.quebec

Leo Feng

PhD - Université de Montréal

leo.feng@mila.quebec

Leon Hetzel

Independent visiting researcher - Technical University Munich (TUM)

leon.hetzel@mila.quebec

Ling Pan

Independent visiting researcher - Hong Kong University of Science and Technology (HKUST)

ling.pan@mila.quebec

loubna.benabbou@mila.quebec

Loic Mandine

DESS - Université de Montréal

loic.mandine@mila.quebec

Loubna Benabbou

Independent visiting researcher - UQAR

marcin.sendera@mila.quebec

Luca Scimeca

Postdoctorate - Université de Montréal

luca.scimeca@mila.quebec

PhD - Université de Montréal

korablym@mila.quebec

Marcin Sendera

Research Intern - Université de Montréal

Marco Stock

Independent visiting researcher - Technical University of Munich

marco.stock@mila.quebec

matthew.macdermott@mila.quebec

Matt MacDermott

Research Intern - Imperial College London

Mélisande Astrid Crystal Teng

PhD - Université de Montréal

Co-supervisor :

Postdoctorate - Université de Montréal

michal.koziarski@mila.quebec

Harry Zhao

PhD - McGill University

Principal supervisor :

Mingze Li

Professional Master's - Université de Montréal

mingze2.li@mila.quebec

Minsu Kim

Collaborating researcher - Université de Montréal

minsu.kim@mila.quebec

Research Intern - Université de Montréal

mohammed.abukalam@mila.quebec

mohammed.mahfoud@mila.quebec

Mohammed Mahfoud

Research Intern - Université de Montréal

Mohsin Hasan

PhD - Université de Montréal

mohsin.hasan@mila.quebec

nikolay.malkin@mila.quebec

Moksh Jain

PhD - Université de Montréal

moksh.jain@mila.quebec

PhD - Max-Planck-Institute for Intelligent Systems

rahamann@mila.quebec

Nicole Zhang

PhD - McGill University

Principal supervisor :

Collaborating Alumni - Université de Montréal

PhD - Université de Montréal

oussama.boussif@mila.quebec

pierre-paul.de-breuck@mila.quebec

Param Raval

Professional Master's - Université de Montréal

param.raval@mila.quebec

Paul Bertin

PhD - Université de Montréal

bertinpa@mila.quebec

Phong Nguyen

Independent visiting researcher - Université de Montréal

nguyenph@mila.quebec

Pierre-Paul De Breuck

Collaborating Alumni - Université de Montréal

Collaborating researcher

pietro.greiner@mila.quebec

Priya Nama Venkatesh

Professional Master's - Université de Montréal

priya.nama@mila.quebec

Prudencio Tossou

Collaborating researcher - Valence

Principal supervisor :

Dominique Beaini

prudencio.tossou@mila.quebec

Rim Assouel

PhD - Université de Montréal

assouelr@mila.quebec

Ruixiang Zhang

PhD - Université de Montréal

Principal supervisor :

PhD - Université de Montréal

lahlosal@mila.quebec

Sarthak Mittal

PhD - Université de Montréal

Principal supervisor :

Seanie Lee

Research Intern - Université de Montréal

seanie.lee@mila.quebec

Professional Master's

aasheesh.singh@mila.quebec

Collaborating researcher - Université de Montréal

soren.mindermann@mila.quebec

Stefan Bauer

Independent visiting researcher

Co-supervisor :

Guillaume Lajoie

stefan.bauer@mila.quebec

stefano.massaroli@mila.quebec

Stefano Massaroli

Postdoctorate - Université de Montréal

Stephen Lu

Research Intern - McGill University

stephen.lu@mila.quebec

Professional Master's - Université de Montréal

subhrajyoti.dasgupta@mila.quebec

Theo Saulus

Collaborating researcher

Principal supervisor :

thomas.jiralerspong@mila.quebec

theo.saulus@mila.quebec

Thomas Jiralerspong

Master's Research - Université de Montréal

Co-supervisor :

Doina Precup

PhD - Université de Montréal

tianyu.zhang@mila.quebec

PhD - Université de Montréal

Vedant Shah

Master's Research - Université de Montréal

vedant.shah@mila.quebec

PhD - Université de Montréal

Viktor Todosijevic

Collaborating researcher - RWTH Aachen University (Rheinisch-Westfälische Technische Hochschule Aachen)

Principal supervisor :

viktor.todosijevic@mila.quebec

vincent.quirion@mila.quebec

Vincent Quirion

Undergraduate - Université de Montréal

Xiaoyin Chen

PhD - Université de Montréal

xiaoyin.chen@mila.quebec

Yashaswi Pupneja

Professional Master's - Université de Montréal

yashaswi.pupneja@mila.quebec

younesse.kaddar@mila.quebec

Yizhao Wang

Professional Master's - Université de Montréal

yizhao.wang@mila.quebec

Younesse Kaddar

Research Intern - Université de Montréal

Zhen Liu

PhD - Université de Montréal

Principal supervisor :

Liam Paull

liuzhen@mila.quebec

Zibo Shang

Professional Master's - Université de Montréal

zibo.shang@mila.quebec

Zichao Yan

Postdoctorate - Université de Montréal

yanzicha@mila.quebec

Blog Posts

Skipper: Combining Spatial and Temporal Abstraction for Better Generalization

February 22, 2024

Mingde Harry Zhao

Safa Alver

Harm van Seijen

Romain Laroche

Doina Precup

Yoshua Bengio

Scaling in the Service of Reasoning & Model-Based ML

April 4, 2023

Yoshua Bengio

Edward J. Hu

A collaboration between Mila and Relation Therapeutics to discover novel synergistic combinations of drugs in vitro

March 23, 2022

Paul Bertin

Jake P. Taylor-King

Yoshua Bengio

Generative Flow Networks

March 15, 2022

Yoshua Bengio

Publications

Interpolated Adversarial Training: Achieving Robust Neural Networks without Sacrificing Accuracy

Alex Lamb

Vikas Verma

Juho Kannala

Adversarial robustness has become a central goal in deep learning, both in theory and practice. However, successful methods to improve adversarial robustness (such as adversarial training) greatly hurt generalization performance on the clean data. This could have a major impact on how adversarial robustness affects real world systems (i.e. many may opt to forego robustness if it can improve performance on the clean data). We propose Interpolated Adversarial Training, which employs recently proposed interpolation based training methods in the framework of adversarial training. On CIFAR-10, adversarial training increases clean test error from 5.8% to 16.7%, whereas with our Interpolated Adversarial Training we retain adversarial robustness while achieving a clean test error of only 6.5%. With our technique, the relative error increase for the robust model is reduced from 187.9% to just 12.1%.

2019-01-01

arXiv.org (preprint)

dblp.uni-trier.de
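The relative error figures quoted in the Interpolated Adversarial Training abstract above can be reproduced from the three clean test errors it reports. The snippet below is only an arithmetic check; the function and variable names are illustrative and do not come from the paper.

```python
def relative_increase(baseline: float, value: float) -> float:
    """Relative increase of `value` over `baseline`, in percent."""
    return (value - baseline) / baseline * 100.0


# Clean test errors on CIFAR-10 as quoted in the abstract (in percent).
baseline_error = 5.8       # standard training
adversarial_error = 16.7   # adversarial training
interpolated_error = 6.5   # Interpolated Adversarial Training

adv_increase = relative_increase(baseline_error, adversarial_error)
iat_increase = relative_increase(baseline_error, interpolated_error)

print(f"{adv_increase:.1f}% vs {iat_increase:.1f}%")  # prints "187.9% vs 12.1%"
```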

Predicting Tactical Solutions to Operational Planning Problems under Imperfect Information

Eric P. Larsen

Sébastien Lachapelle

This paper offers a methodological contribution at the intersection of machine learning and operations research. Namely, we propose a methodology to quickly predict expected tactical descriptions of operational solutions (TDOSs). The problem we address occurs in the context of two-stage stochastic programming, where the second stage is demanding computationally. We aim to predict at a high speed the expected TDOS associated with the second-stage problem, conditionally on the first-stage variables. This may be used in support of the solution to the overall two-stage problem by avoiding the online generation of multiple second-stage scenarios and solutions. We formulate the tactical prediction problem as a stochastic optimal prediction program, whose solution we approximate with supervised machine learning. The training data set consists of a large number of deterministic operational problems generated by controlled probabilistic sampling. The labels are computed based on solutions to these problems (solved independently and offline), employing appropriate aggregation and subselection methods to address uncertainty. Results on our motivating application on load planning for rail transportation show that deep learning models produce accurate predictions in very short computing time (milliseconds or less). The predictive accuracy is close to the lower bounds calculated based on sample average approximation of the stochastic prediction programs.

2018-07-31

ArXiv (preprint)

doi.org

Information Fusion in Deep Convolutional Neural Networks for Biomedical Image Segmentation

Mohammad Havaei

Nicolas Guizard

Nicolas Chapados

2018-07-04

Signal Processing and Machine Learning for Biomedical Big Data (published)

doi.org

Focused Hierarchical RNNs for Conditional Sequence Processing

Nan Rosemary Ke

Konrad Żołna

Alessandro Sordoni

Zhouhan Lin

Adam Trischler

Recurrent Neural Networks (RNNs) with attention mechanisms have obtained state-of-the-art results for many sequence processing tasks. Most of these models use a simple form of encoder with attention that looks over the entire sequence and assigns a weight to each token independently. We present a mechanism for focusing RNN encoders for sequence modelling tasks which allows them to attend to key parts of the input as needed. We formulate this using a multi-layer conditional sequence encoder that reads in one token at a time and makes a discrete decision on whether the token is relevant to the context or question being asked. The discrete gating mechanism takes in the context embedding and the current hidden state as inputs and controls information flow into the layer above. We train it using policy gradient methods. We evaluate this method on several types of tasks with different attributes. First, we evaluate the method on synthetic tasks which allow us to evaluate the model for its generalization ability and probe the behavior of the gates in more controlled settings. We then evaluate this approach on large scale Question Answering tasks including the challenging MS MARCO and SearchQA tasks. Our models show consistent improvements for both tasks over prior work and our baselines. They have also been shown to generalize significantly better on synthetic tasks as compared to the baselines.

2018-07-03

Proceedings of the 35th International Conference on Machine Learning (published)

proceedings.mlr.press

Commonsense mining as knowledge base completion? A study on the impact of novelty

Stanisław Jastrzębski

Dzmitry Bahdanau

Seyedarian Hosseini

Michael Noukhovitch

Jackie Cheung

Commonsense knowledge bases such as ConceptNet represent knowledge in the form of relational triples. Inspired by recent work by Li et al., we analyse if knowledge base completion models can be used to mine commonsense knowledge from raw text. We propose novelty of predicted triples with respect to the training set as an important factor in interpreting results. We critically analyse the difficulty of mining novel commonsense knowledge, and show that a simple baseline method outperforms the previous state of the art on predicting more novel triples.

2018-06-01

Proceedings of the Workshop on Generalization in the Age of Deep Learning (published)

doi.org

Learning Anonymized Representations with Adversarial Neural Networks

Clément Feutry

Pablo Piantanida

P. Duhamel

Statistical methods protecting sensitive information or the identity of the data owner have become critical to ensure privacy of individuals as well as of organizations. This paper investigates anonymization methods based on representation learning and deep neural networks, and motivated by novel information theoretical bounds. We introduce a novel training objective for simultaneously training a predictor over target variables of interest (the regular labels) while preventing an intermediate representation from being predictive of the private labels. The architecture is based on three sub-networks: one going from input to representation, one from representation to predicted regular labels, and one from representation to predicted private labels. The training procedure aims at learning representations that preserve the relevant part of the information (about regular labels) while dismissing information about the private labels which correspond to the identity of a person. We demonstrate the success of this approach for two distinct classification versus anonymization tasks (handwritten digits and sentiment analysis).

2018-02-26

ArXiv (preprint)

Sparse Attentive Backtracking: Long-Range Credit Assignment in Recurrent Networks

Nan Rosemary Ke

Anirudh Goyal

Olexa Bilaniuk

Jonathan Binas

Laurent Charlin

Chris Pal

A major drawback of backpropagation through time (BPTT) is the difficulty of learning long-term dependencies, coming from having to propagate credit information backwards through every single step of the forward computation. This makes BPTT both computationally impractical and biologically implausible. For this reason, full backpropagation through time is rarely used on long sequences, and truncated backpropagation through time is used as a heuristic. However, this usually leads to biased estimates of the gradient in which longer term dependencies are ignored. Addressing this issue, we propose an alternative algorithm, Sparse Attentive Backtracking, which might also be related to principles used by brains to learn long-term dependencies. Sparse Attentive Backtracking learns an attention mechanism over the hidden states of the past and selectively backpropagates through paths with high attention weights. This allows the model to learn long term dependencies while only backtracking for a small number of time steps, not just from the recent past but also from attended relevant past states.

2017-11-07

ArXiv (preprint)
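The selection step described in the Sparse Attentive Backtracking abstract above, attending over past hidden states and keeping only the paths with high attention weights, can be illustrated with a small top-k attention sketch. This is not the authors' implementation: the dot-product scoring, the softmax over the selected states, and the names `sparse_attention`, `query`, `past_states` and `k` are all assumptions made for illustration.

```python
import math


def sparse_attention(query, past_states, k):
    """Return {index: weight} for the k past states that score highest.

    Scores are plain dot products (an assumption); weights are a softmax
    restricted to the k selected states, so all other states get no weight.
    """
    scores = [sum(q * h for q, h in zip(query, state)) for state in past_states]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    max_score = max(scores[i] for i in top)  # subtract max for numerical stability
    exps = {i: math.exp(scores[i] - max_score) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


# Hypothetical 2-dimensional hidden states from four earlier time steps.
past_states = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
weights = sparse_attention([1.0, 0.0], past_states, k=2)

# Only time steps 0 and 2 are selected; in a trained model, gradients would
# be propagated back through these attended states only.
print(sorted(weights))
```

Restricting the softmax to the selected indices is one simple way to make the attention sparse; other normalizations are possible.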

Diet Networks: Thin Parameters for Fat Genomics

Adriana Romero Soriano

pierre luc carrier

Akram Erraqabi

Tristan Sylvain

Alex Auvolat

Etienne Dejoie

Marc-André Legault

Marie-Pierre Dubé

Learning tasks such as those involving genomic data often pose a serious challenge: the number of input features can be orders of magnitude larger than the number of training examples, making it difficult to avoid overfitting, even when using the known regularization techniques. We focus here on tasks in which the input is a description of the genetic variation specific to a patient, the single nucleotide polymorphisms (SNPs), yielding millions of ternary inputs. Improving the ability of deep learning to handle such datasets could have an important impact in medical research, more specifically in precision medicine, where high-dimensional data regarding a particular patient is used to make predictions of interest. Even though the amount of data for such tasks is increasing, this mismatch between the number of examples and the number of inputs remains a concern. Naive implementations of classifier neural networks involve a huge number of free parameters in their first layer (number of input features times number of hidden units): each input feature is associated with as many parameters as there are hidden units. We propose a novel neural network parametrization which considerably reduces the number of free parameters. It is based on the idea that we can first learn or provide a distributed representation for each input feature (e.g. for each position in the genome where variations are observed in data), and then learn (with another neural network called the parameter prediction network) how to map a feature's distributed representation (based on the feature's identity not its value) to the vector of parameters specific to that feature in the classifier neural network (the weights which link the value of the feature to each of the hidden units). This approach views the problem of producing the parameters associated with each feature as a multi-task learning problem. We show experimentally on a population stratification task of interest to medical studies that the proposed approach can significantly reduce both the number of parameters and the error rate of the classifier.

2017-01-01

ICLR.cc/2017/conference (poster)

openreview.net
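The parameter saving described in the Diet Networks abstract above can be made concrete with a rough count. The sketch below assumes the feature representations are provided rather than learned, and that the parameter prediction network is a single linear map without biases. The sizes (one million features, 100 hidden units, 500-dimensional representations) are hypothetical and not taken from the paper.

```python
def naive_first_layer_params(n_features: int, n_hidden: int) -> int:
    """One free weight per (input feature, hidden unit) pair."""
    return n_features * n_hidden


def predicted_first_layer_params(embedding_dim: int, n_hidden: int) -> int:
    """Free weights of a linear parameter prediction network.

    It maps each feature's fixed representation (size `embedding_dim`)
    to that feature's vector of `n_hidden` first-layer weights.
    """
    return embedding_dim * n_hidden


n_features = 1_000_000   # e.g. SNP positions (hypothetical size)
n_hidden = 100           # hidden units in the classifier's first layer
embedding_dim = 500      # size of each feature's representation

naive = naive_first_layer_params(n_features, n_hidden)
predicted = predicted_first_layer_params(embedding_dim, n_hidden)

print(naive, predicted, naive // predicted)  # prints "100000000 50000 2000"
```

Under these assumptions the free-parameter count no longer grows with the number of input features, only with the representation size.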

Diet Networks: Thin Parameters for Fat Genomics

Adriana Romero Soriano

pierre luc carrier

Akram Erraqabi

Tristan Sylvain

Alex Auvolat

Etienne Dejoie

Marc-André Legault

M. Dubé

Learning tasks such as those involving genomic data often pose a serious challenge: the number of input features can be orders of magnitude larger than the number of training examples, making it difficult to avoid overfitting, even when using the known regularization techniques. We focus here on tasks in which the input is a description of the genetic variation specific to a patient, the single nucleotide polymorphisms (SNPs), yielding millions of ternary inputs. Improving the ability of deep learning to handle such datasets could have an important impact in precision medicine, where high-dimensional data regarding a particular patient is used to make predictions of interest. Even though the amount of data for such tasks is increasing, this mismatch between the number of examples and the number of inputs remains a concern. Naive implementations of classifier neural networks involve a huge number of free parameters in their first layer: each input feature is associated with as many parameters as there are hidden units. We propose a novel neural network parametrization which considerably reduces the number of free parameters. It is based on the idea that we can first learn or provide a distributed representation for each input feature (e.g. for each position in the genome where variations are observed), and then learn (with another neural network called the parameter prediction network) how to map a feature's distributed representation to the vector of parameters specific to that feature in the classifier neural network (the weights which link the value of the feature to each of the hidden units). We show experimentally on a population stratification task of interest to medical studies that the proposed approach can significantly reduce both the number of parameters and the error rate of the classifier.

2016-11-04

ArXiv (preprint)