Portrait of Aaron Courville

Aaron Courville

Core Academic Member
Canada CIFAR AI Chair
Associate Professor, Université de Montréal, Department of Computer Science and Operations Research
Research Topics
Computer Vision
Deep Learning
Generative Models
Natural Language Processing
Reinforcement Learning
Representation Learning

Biography

Aaron Courville is a professor in the Department of Computer Science and Operations Research (DIRO) at Université de Montréal. He has a PhD from the Robotics Institute, Carnegie Mellon University.

Courville was an early contributor to deep learning: he is a founding member of Mila – Quebec Artificial Intelligence Institute, a fellow in CIFAR’s Learning in Machines & Brains program and, with Ian Goodfellow and Yoshua Bengio, co-wrote the seminal textbook on deep learning.

His current research focuses on the development of deep learning models and methods. He is particularly interested in reinforcement learning, deep generative models and multimodal ML, as well as their applications, such as computer vision and natural language processing.

Courville holds a Canada CIFAR AI Chair and a Canada Research Chair in Learning Representations that Generalize Systematically. His research has been supported by Microsoft Research, Samsung, Hitachi, Sony and Google (Focused Research Award).

Current Students

PhD - Université de Montréal
PhD - Université de Montréal
Co-supervisor :
PhD - Université de Montréal
Principal supervisor :
Master's Research - Université de Montréal
Master's Research - Université de Montréal
PhD - Université de Montréal
PhD - Université de Montréal
PhD - Université de Montréal
Professional Master's - Université de Montréal
PhD - Université de Montréal
PhD - Université de Montréal
PhD - Université de Montréal
Principal supervisor :
PhD - Université de Montréal
PhD - Université de Montréal
Co-supervisor :
PhD - Université de Montréal
Principal supervisor :
PhD - Université de Montréal
Co-supervisor :
PhD - Université de Montréal
Co-supervisor :
Master's Research - Université de Montréal
PhD - Université de Montréal
Principal supervisor :
Master's Research - Université de Montréal
Principal supervisor :
PhD - Université de Montréal
PhD - Université de Montréal
Principal supervisor :
PhD - Université de Montréal
PhD - Université de Montréal
Principal supervisor :
PhD - Université de Montréal
Co-supervisor :
PhD - Université de Montréal
PhD - Université de Montréal
Principal supervisor :

Publications

Chunked Autoregressive GAN for Conditional Waveform Synthesis
Max Morrison
Rithesh Kumar
Kundan Kumar
Prem Seetharaman
Reincarnating Reinforcement Learning: Reusing Prior Computation to Accelerate Progress
Unifying Likelihood-free Inference with Black-box Optimization and Beyond
Dinghuai Zhang
Jie Fu
Black-box optimization formulations for biological sequence design have drawn recent attention due to their promising potential impact on th… (see more)e pharmaceutical industry. In this work, we propose to unify two seemingly distinct worlds: likelihood-free inference and black-box optimization, under one probabilistic framework. In tandem, we provide a recipe for constructing various sequence design methods based on this framework. We show how previous optimization approaches can be"reinvented"in our framework, and further propose new probabilistic black-box optimization algorithms. Extensive experiments on sequence design application illustrate the benefits of the proposed methodology.
Unsupervised Dependency Graph Network
Yikang Shen
Shawn Tan
Peng Li
Jie Zhou
Recent work has identified properties of pretrained self-attention models that mirror those of dependency parse structures. In particular, s… (see more)ome self-attention heads correspond well to individual dependency types. Inspired by these developments, we propose a new competitive mechanism that encourages these attention heads to model different dependency relations. We introduce a new model, the Unsupervised Dependency Graph Network (UDGN), that can induce dependency structures from raw corpora and the masked language modeling task. Experiment results show that UDGN achieves very strong unsupervised dependency parsing performance without gold POS tags and any other external information. The competitive gated heads show a strong correlation with human-annotated dependency types. Furthermore, the UDGN can also achieve competitive performance on masked language modeling and sentence textual similarity tasks.
Multi-label Iterated Learning for Image Classification with Label Ambiguity
Sai Rajeswar
Pau Rodriguez
Soumye Singhal
David Vazquez
Transfer learning from large-scale pre-trained models has become essential for many computer vision tasks. Recent studies have shown that da… (see more)tasets like ImageNet are weakly labeled since images with multiple object classes present are assigned a single label. This ambiguity biases models towards a single prediction, which could result in the suppression of classes that tend to co-occur in the data. Inspired by language emergence literature, we propose multi-label iterated learning (MILe) to incorporate the inductive biases of multi-label learning from single labels using the framework of iterated learning. MILe is a simple yet effective procedure that builds a multi-label description of the image by propagating binary predictions through successive generations of teacher and student networks with a learning bottleneck. Experiments show that our approach exhibits systematic benefits on ImageNet accuracy as well as ReaL F1 score, which indicates that MILe deals better with label ambiguity than the standard training procedure, even when fine-tuning from self-supervised weights. We also show that MILe is effective reducing label noise, achieving state-of-the-art performance on real-world large-scale noisy data such as WebVision. Furthermore, MILe improves performance in class incremental settings such as IIRC and it is robust to distribution shifts. Code: https://github.com/rajeswar18/MILe
Gradient Starvation: A Learning Proclivity in Neural Networks
Mohammad Pezeshki
Sékou-Oumar Kaba
We identify and formalize a fundamental gradient descent phenomenon resulting in a learning proclivity in over-parameterized neural networks… (see more). Gradient Starvation arises when cross-entropy loss is minimized by capturing only a subset of features relevant for the task, despite the presence of other predictive features that fail to be discovered. This work provides a theoretical explanation for the emergence of such feature imbalance in neural networks. Using tools from Dynamical Systems theory, we identify simple properties of learning dynamics during gradient descent that lead to this imbalance, and prove that such a situation can be expected given certain statistical structure in training data. Based on our proposed formalism, we develop guarantees for a novel regularization method aimed at decoupling feature learning dynamics, improving accuracy and robustness in cases hindered by gradient starvation. We illustrate our findings with simple and real-world out-of-distribution (OOD) generalization experiments.
Explicitly Modeling Syntax in Language Models with Incremental Parsing and a Dynamic Oracle
Syntax is fundamental to our thinking about language. Failing to capture the structure of input language could lead to generalization proble… (see more)ms and over-parametrization. In the present work, we propose a new syntax-aware language model: Syntactic Ordered Memory (SOM). The model explicitly models the structure with an incremental parser and maintains the conditional probability setting of a standard language model (left-to-right). To train the incremental parser and avoid exposure bias, we also propose a novel dynamic oracle, so that SOM is more robust to wrong parsing decisions. Experiments show that SOM can achieve strong results in language modeling, incremental parsing, and syntactic generalization tests while using fewer parameters than other models.
Understanding by Understanding Not: Modeling Negation in Language Models
Negation is a core construction in natural language. Despite being very successful on many tasks, state-of-the-art pre-trained language mode… (see more)ls often handle negation incorrectly. To improve language models in this regard, we propose to augment the language modeling objective with an unlikelihood objective that is based on negated generic sentences from a raw text corpus. By training BERT with the resulting combined objective we reduce the mean top 1 error rate to 4% on the negated LAMA dataset. We also see some improvements on the negated NLI benchmarks.
DATA-EFFICIENT REINFORCEMENT LEARNING
Nitarshan Rajkumar
Michael Noukhovitch
Ankesh Anand
Philip Bachman
Data efficiency poses a major challenge for deep reinforcement learning. We approach this issue from the perspective of self-supervised repr… (see more)esentation learning, leveraging reward-free exploratory data to pretrain encoder networks. We employ a novel combination of latent dynamics modelling and goal-reaching objectives, which exploit the inherent structure of data in reinforcement learning. We demonstrate that our method scales well with network capacity and pretraining data. When evaluated on the Atari 100k data-efficiency benchmark, our approach significantly outperforms previous methods combining unsupervised pretraining with task-specific finetuning, and approaches human-level performance.
Deep Reinforcement Learning at the Edge of the Statistical Precipice
Deep reinforcement learning (RL) algorithms are predominantly evaluated by comparing their relative performance on a large suite of tasks. M… (see more)ost published results on deep RL benchmarks compare point estimates of aggregate performance such as mean and median scores across tasks, ignoring the statistical uncertainty implied by the use of a finite number of training runs. Beginning with the Arcade Learning Environment (ALE), the shift towards computationally-demanding benchmarks has led to the practice of evaluating only a small number of runs per task, exacerbating the statistical uncertainty in point estimates. In this paper, we argue that reliable evaluation in the few run deep RL regime cannot ignore the uncertainty in results without running the risk of slowing down progress in the field. We illustrate this point using a case study on the Atari 100k benchmark, where we find substantial discrepancies between conclusions drawn from point estimates alone versus a more thorough statistical analysis. With the aim of increasing the field's confidence in reported results with a handful of runs, we advocate for reporting interval estimates of aggregate performance and propose performance profiles to account for the variability in results, as well as present more robust and efficient aggregate metrics, such as interquartile mean scores, to achieve small uncertainty in results. Using such statistical tools, we scrutinize performance evaluations of existing algorithms on other widely used RL benchmarks including the ALE, Procgen, and the DeepMind Control Suite, again revealing discrepancies in prior comparisons. Our findings call for a change in how we evaluate performance in deep RL, for which we present a more rigorous evaluation methodology, accompanied with an open-source library rliable, to prevent unreliable results from stagnating the field. This work received an outstanding paper award at NeurIPS 2021.
Pretraining Representations for Data-Efficient Reinforcement Learning
Max Schwarzer
Nitarshan Rajkumar
Michael Noukhovitch
Ankesh Anand
Philip Bachman
Data efficiency is a key challenge for deep reinforcement learning. We address this problem by using unlabeled data to pretrain an encoder w… (see more)hich is then finetuned on a small amount of task-specific data. To encourage learning representations which capture diverse aspects of the underlying MDP, we employ a combination of latent dynamics modelling and unsupervised goal-conditioned RL. When limited to 100k steps of interaction on Atari games (equivalent to two hours of human experience), our approach significantly surpasses prior work combining offline representation pretraining with task-specific finetuning, and compares favourably with other pretraining methods that require orders of magnitude more data. Our approach shows particular promise when combined with larger models as well as more diverse, task-aligned observational data -- approaching human-level performance and data-efficiency on Atari in our best setting.
Systematic generalisation with group invariant predictions
Faruk Ahmed
Harm van Seijen
We consider situations where the presence of dominant simpler correlations with the target variable in a training set can cause an SGD-train… (see more)ed neural network to be less reliant on more persistently correlating complex features. When the non-persistent, simpler correlations correspond to non-semantic background factors, a neural network trained on this data can exhibit dramatic failure upon encountering systematic distributional shift, where the correlating background features are recombined with different objects. We perform an empirical study on three synthetic datasets, showing that group invariance methods across inferred partitionings of the training set can lead to significant improvements at such test-time situations. We also suggest a simple invariance penalty, showing with experiments on our setups that it can perform better than alternatives. We find that even without assuming access to any systematically shifted validation sets, one can still find improvements over an ERM-trained reference model.