## Yoshua Bengio

*Full Professor, Department of Computer Science and Operations Research, Université de Montréal.*

*Canada Research Chair in Statistical Learning Algorithms.*

*Founder and Scientific Director of Mila.*

*Scientific Director of IVADO. CIFAR Fellow and Program Director.*

### Research

#### Research Interests

My long-term goal is to **understand the mechanisms giving rise to intelligence**; understanding the underlying principles would deliver artificial intelligence, and I believe that learning algorithms are essential in this quest.

Since 1986 I have worked on **neural networks**, in particular on **deep learning** in this century. What fascinates me is how an intelligent agent, whether animal, human or machine, can **figure out how its environment works**. Of course this can be used to make good decisions, but I feel that at the heart of it is the notion of understanding, and the crucial question is how to **learn to understand**.

*In the past* I worked on learning of deep representations (either supervised or unsupervised), capturing sequential dependencies with recurrent networks and other autoregressive models, understanding credit assignment (including the quest for biologically plausible analogues of backprop, as well as end-to-end learning of complex modular information processing assemblies), meta-learning (or learning to learn), attention mechanisms, deep generative models, curriculum learning, variations of stochastic gradient descent and why SGD works for neural nets, convolutional architectures, natural language processing (especially with word embeddings, language models and machine translation), and understanding why deep learning works so well and what its current limitations are. I worked on many applications of deep learning, including – but not limited to – healthcare (such as medical image analysis), standard AI tasks of computer vision, modeling speech and language and, more recently, robotics.

*Looking forward*, I am interested in

- how to go **beyond the iid hypothesis** (and more generally the assumption that the test cases come from the same distribution as the training set), **causal learning** (i.e. figuring out what are the causal variables and how they are causally related), **modularizing knowledge** so it can be **re-used for fast transfer** and adaptation,
- how agents can purposefully **act to better understand** their environment, **grounded language learning**, as well as how neural networks could tackle **system 2** (i.e. conscious) cognitive tasks (such as reasoning, planning, imagination, etc.) and how that can help a learner figure out high-level representations on both the perception and action sides.

I believe that all of the above are different aspects of a common goal, going beyond the limitations of current deep learning and towards human-level AI. I am also interested in **AI for social good**, in particular in healthcare and the environment (with a focus on **climate change**).

#### Notable Past Research

1989-1998 Convolutional and recurrent networks trained end-to-end with probabilistic alignment (HMMs) to model sequences, as the main contribution of my PhD thesis (1991); NIPS 1988, NIPS 1989, Eurospeech 1991, PAMI 1991, and *IEEE Trans. Neural Nets* 1992. These architectures were first applied to **speech recognition** in my PhD (and rediscovered after 2010) and then with Yann LeCun et al. to **handwriting recognition and document analysis** (the most cited paper is “Gradient-based learning applied to document recognition”, 1998, with over 15,000 citations in 2018), where we also introduced non-linear forms of conditional random fields (before they were a thing).

1991-1995 **Learning to learn** papers with Samy Bengio, starting with IJCNN 1991, “Learning a synaptic learning rule”. The idea of learning to learn (particularly by back-propagating through the whole process) has now become very popular, but we lacked the necessary computing power in the early 90’s.

1993-1995 Uncovering the **fundamental difficulty of learning in recurrent nets** and other machine learning models of temporal dependencies, associated with vanishing and exploding gradients: ICNN 1993, NIPS 1993, NIPS 1994, *IEEE Transactions on Neural Nets* 1994, and NIPS 1995. These papers have had a major impact and motivated later papers on architectures to aid with learning long-term dependencies and deal with vanishing or exploding gradients. An important but subtle contribution of the *IEEE Transactions* 1994 paper is to show that the condition required to store bits of information reliably over time also gives rise to vanishing gradients, using dynamical systems theory. The NIPS 1995 paper introduced the use of a hierarchy of time scales to combat the vanishing gradients issue.
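As a rough illustration of the vanishing-gradient argument (a standard textbook-style sketch, not the exact derivation from the papers above; the symbols $h_t$, $x_t$, $f$ and $\lambda$ are notation assumed here), consider a recurrent state update and the chain rule across time steps. If each step's Jacobian has norm at most $\lambda < 1$, as happens when the state sits robustly inside a stable attractor, the gradient with respect to a distant past state shrinks exponentially:

```latex
h_t = f(h_{t-1}, x_t), \qquad
\frac{\partial h_T}{\partial h_t} = \prod_{k=t+1}^{T} \frac{\partial h_k}{\partial h_{k-1}}, \qquad
\left\| \frac{\partial h_k}{\partial h_{k-1}} \right\| \le \lambda < 1
\;\Longrightarrow\;
\left\| \frac{\partial h_T}{\partial h_t} \right\| \le \lambda^{T-t} \longrightarrow 0
\quad (T - t \to \infty).
```

Conversely, Jacobian norms that stay above 1 can make the same product grow exponentially, which is the exploding-gradient case.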

1999-2014 Understanding how **distributed representations** can bypass the **curse of dimensionality** by providing generalization to an exponentially large set of regions from those comparatively few occupied by training examples. This series of papers also highlights how methods based on local generalization, like nearest-neighbor and Gaussian kernel SVMs, lack this kind of generalization ability. The NIPS 1999 paper introduced, for the first time, auto-regressive neural networks for density estimation (the ancestor of the NADE and PixelRNN/PixelCNN models). The NIPS 2004, NIPS 2005 and NIPS 2011 papers on this subject show how neural nets can learn a local metric, which can bring the power of generalization of distributed representations to kernel methods and manifold learning methods. Another NIPS 2005 paper shows the fundamental limitations of kernel methods due to a generalization of the curse of dimensionality (the curse of highly variable functions, which have many ups and downs). Finally, the ICLR 2014 paper demonstrates that, in the case of piecewise-linear networks (like those with ReLUs), the number of regions (linear pieces) distinguished by a one-hidden-layer network is exponential in the number of neurons (whereas the number of parameters is quadratic in the number of neurons, and a local kernel method would require an exponential number of examples to capture the same kind of function).

2000-2008 **Word embeddings from neural networks and neural language models**. The NIPS 2000 paper introduces for the first time the learning of word embeddings as part of a neural network which models language data. The *JMLR* 2003 journal version expands this (these two papers together get around 3000 citations) and also introduces the idea of **asynchronous SGD** for distributed training of neural nets. Word embeddings have become one of the most common fixtures of deep learning when it comes to language data, and this has basically created a new sub-field in the area of computational linguistics. I also introduced the use of importance sampling (AISTATS 2003, *IEEE Trans. on Neural Nets*, 2008) as well as of a probabilistic hierarchy (AISTATS 2005) to speed up computations and handle larger vocabularies.

2006-2014 Showing the **theoretical advantage of depth** for generalization. The NIPS 2006 oral presentation experimentally demonstrated the advantage of depth and is one of the most cited papers in the field (over 2600 citations). The NIPS 2011 paper shows how deeper sum-product networks can represent functions which would otherwise require an exponentially larger model if the network is shallow. Finally, the NIPS 2014 paper on the number of linear regions of deep neural networks generalizes the ICLR 2014 paper mentioned above, showing that the number of linear pieces produced by a piecewise linear network grows exponentially in both width of layers and number of layers, i.e., depth, making the functions represented by such networks generally impossible to capture efficiently with kernel methods (short of using a trained neural net as the kernel).

2006-2014 **Unsupervised deep learning** based on auto-encoders (with the special case of GANs as decoder-only models, see below). The NIPS 2006 paper introduced greedy layer-wise pre-training, both in the supervised case and unsupervised case with auto-encoders. The ICML 2008 paper introduced **denoising auto-encoders** and the NIPS 2013, ICML 2014 and JMLR 2014 papers cast their theory and generalize them as proper probabilistic models, at the same time introducing alternatives to maximum likelihood as training principles.

2014 Dispelling the **local-minima myth** regarding the optimization of neural networks, with the NIPS 2014 paper on saddle points, and demonstrating that it is the large number of parameters which makes it very unlikely that bad local minima exist.

2014 Introducing **Generative Adversarial Networks (GANs)** at NIPS 2014, which introduced many innovations in training deep generative models outside of the maximum likelihood framework and even outside of the classical framework of having a single objective function (instead entering into the territory of multiple models trained in a game-theoretical way, each with its own objective). GANs are presently one of the hottest research areas in deep learning, with over 6000 citations, mostly from papers that introduce variants of GANs, which have been producing impressively realistic synthetic images one would not have imagined computers being able to generate just a few years ago.

2014-2016 Introducing **content-based soft attention** and the breakthrough it brought to **neural machine translation**, mostly with Kyunghyun Cho and Dima Bahdanau. We first introduced the encoder-decoder (now called sequence-to-sequence) architecture (EMNLP 2014) and then achieved a big jump in BLEU scores with content-based soft attention (ICLR 2015). These ingredients are now the basis of most commercial machine translation systems, another entire sub-field created using these techniques.
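To give a feel for what content-based soft attention computes, here is a minimal sketch in plain Python. It is an illustration only, not the model from the papers above: the function names `softmax` and `soft_attention` are hypothetical, and a simple dot-product score is swapped in for the learned scoring network used in the ICLR 2015 work. The core idea it shows is that the decoder's query is compared with every encoder position by content, the scores are normalized into weights, and the context is the weighted average of the values.

```python
import math


def softmax(scores):
    """Normalize raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_attention(query, keys, values):
    """Return (weights, context) for one query over a sequence.

    Assumption: similarity is a plain dot product between the query and
    each key; the original work used a small learned network instead.
    """
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    # Context vector: weighted average of the value vectors.
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, context


# Toy usage: three source positions with 2-dimensional keys and values.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
weights, context = soft_attention([2.0, 0.0], keys, values)
```

Because every step is differentiable, the weights can be trained end-to-end with the rest of the network, which is what distinguishes soft attention from a hard, discrete selection of one position.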