Portrait of Joelle Pineau

Joelle Pineau

Core Academic Member
Canada CIFAR AI Chair
Associate Professor, McGill University, School of Computer Science
Co-Manager Director, Meta AI (FAIR - Facebook AI Research)
Research Topics
Medical Machine Learning
Natural Language Processing
Reinforcement Learning

Biography

Joelle Pineau is a professor and William Dawson Scholar at the School of Computer Science, McGill University, where she co-directs the Reasoning and Learning Lab. She is a core academic member of Mila – Quebec Artificial Intelligence Institute, a Canada CIFAR AI Chair, and VP of AI research at Meta (previously Facebook), where she leads the Fundamental AI Research (FAIR) team. Pineau holds a BSc in systems design engineering from the University of Waterloo, and an MSc and PhD in robotics from Carnegie Mellon University.

Her research focuses on developing new models and algorithms for planning and learning in complex partially observable domains. She also works on applying these algorithms to complex problems in robotics, health care, games and conversational agents. In addition to being on the editorial board of the Journal of Machine Learning Research and past president of the International Machine Learning Society, Pineau is the recipient of numerous awards and honours: NSERC’s E.W.R. Steacie Memorial Fellowship (2018), Governor General Innovation Award (2019), Fellow of the Association for the Advancement of Artificial Intelligence (AAAI), Senior Fellow of the Canadian Institute for Advanced Research (CIFAR), and Fellow of the Royal Society of Canada.

Current Students

Master's Research - Université de Montréal
Principal supervisor :
PhD - Université de Montréal
Principal supervisor :
PhD - McGill University
Co-supervisor :
PhD - McGill University

Publications

On-line Adaptative Curriculum Learning for GANs
Thang Doan
Joao Monteiro
Isabela Albuquerque
R Devon Hjelm
Generative Adversarial Networks (GANs) can successfully approximate a probability distribution and produce realistic samples. However, open … (see more)questions such as sufficient convergence conditions and mode collapse still persist. In this paper, we build on existing work in the area by proposing a novel framework for training the generator against an ensemble of discriminator networks, which can be seen as a one-student/multiple-teachers setting. We formalize this problem within the full-information adversarial bandit framework, where we evaluate the capability of an algorithm to select mixtures of discriminators for providing the generator with feedback during learning. To this end, we propose a reward function which reflects the progress made by the generator and dynamically update the mixture weights allocated to each discriminator. We also draw connections between our algorithm and stochastic optimization methods and then show that existing approaches using multiple discriminators in literature can be recovered from our framework. We argue that less expressive discriminators are smoother and have a general coarse grained view of the modes map, which enforces the generator to cover a wide portion of the data distribution support. On the other hand, highly expressive discriminators ensure samples quality. Finally, experimental results show that our approach improves samples quality and diversity over existing baselines by effectively learning a curriculum. These results also support the claim that weaker discriminators have higher entropy improving modes coverage. Keywords: multiple discriminators, curriculum learning, multiple resolutions discriminators, multi-armed bandits, generative adversarial networks, smooth discriminators, multi-discriminator gan training, multiple experts.
Multitask Metric Learning: Theory and Algorithm
Boyu Wang
Hejia Zhang
Peng Liu
Zebang Shen
In this paper, we study the problem of multitask metric learning (mtML). We first examine the generalization bound of the regularized mtML f… (see more)ormulation based on the notion of algorithmic stability, proving the convergence rate of mtML and revealing the trade-off between the tasks. Moreover, we also establish the theoretical connection between the mtML, single-task learning and pooling-task learning approaches. In addition, we present a novel boosting-based mtML (mt-BML) algorithm, which scales well with the feature dimension of the data. Finally, we also devise an efficient second-order Riemannian retraction operator which is tailored specifically to our mt-BML algorithm. It produces a low-rank solution of mtML to reduce the model complexity, and may also improve generalization performances. Extensive evaluations on several benchmark data sets verify the effectiveness of our learning algorithm.
Attraction-Repulsion Actor-Critic for Continuous Control Reinforcement Learning
Thang Doan
R Devon Hjelm
Continuous control tasks in reinforcement learning are important because they provide an important framework for learning in high-dimensiona… (see more)l state spaces with deceptive rewards, where the agent can easily become trapped into suboptimal solutions. One way to avoid local optima is to use a population of agents to ensure coverage of the policy space, yet learning a population with the "best" coverage is still an open problem. In this work, we present a novel approach to population-based RL in continuous control that leverages properties of normalizing flows to perform attractive and repulsive operations between current members of the population and previously observed policies. Empirical results on the MuJoCo suite demonstrate a high performance gain for our algorithm compared to prior work, including Soft-Actor Critic (SAC).
No Press Diplomacy: Modeling Multi-Agent Gameplay
Yuchen Lu
Steven Bocco
Max O. Smith
Jonathan K. Kummerfeld
Satinder Singh
Diplomacy is a seven-player non-stochastic, non-cooperative game, where agents acquire resources through a mix of teamwork and betrayal. Rel… (see more)iance on trust and coordination makes Diplomacy the first non-cooperative multi-agent benchmark for complex sequential social dilemmas in a rich environment. In this work, we focus on training an agent that learns to play the No Press version of Diplomacy where there is no dedicated communication channel between players. We present DipNet, a neural-network-based policy model for No Press Diplomacy. The model was trained on a new dataset of more than 150,000 human games. Our model is trained by supervised learning (SL) from expert trajectories, which is then used to initialize a reinforcement learning (RL) agent trained through self-play. Both the SL and RL agents demonstrate state-of-the-art No Press performance by beating popular rule-based bots.
Recurrent Value Functions
Nishanth Anand
Lucas Caccia
Despite recent successes in Reinforcement Learning, value-based methods often suffer from high variance hindering performance. In this paper… (see more), we illustrate this in a continuous control setting where state of the art methods perform poorly whenever sensor noise is introduced. To overcome this issue, we introduce Recurrent Value Functions (RVFs) as an alternative to estimate the value function of a state. We propose to estimate the value function of the current state using the value function of past states visited along the trajectory. Due to the nature of their formulation, RVFs have a natural way of learning an emphasis function that selectively emphasizes important states. First, we establish RVF's asymptotic convergence properties in tabular settings. We then demonstrate their robustness on a partially observable domain and continuous control tasks. Finally, we provide a qualitative interpretation of the learned emphasis function.
Contextual Bandits for Adapting Treatment in a Mouse Model of de Novo Carcinogenesis
Charis Achilleos
Demetris Iacovides
Katerina Strati
Georgios D. Mitsis
Focused Hierarchical RNNs for Conditional Sequence Processing
Recurrent Neural Networks (RNNs) with attention mechanisms have obtained state-of-the-art results for many sequence processing tasks. Most o… (see more)f these models use a simple form of encoder with attention that looks over the entire sequence and assigns a weight to each token independently. We present a mechanism for focusing RNN encoders for sequence modelling tasks which allows them to attend to key parts of the input as needed. We formulate this using a multi-layer conditional sequence encoder that reads in one token at a time and makes a discrete decision on whether the token is relevant to the context or question being asked. The discrete gating mechanism takes in the context embedding and the current hidden state as inputs and controls information flow into the layer above. We train it using policy gradient methods. We evaluate this method on several types of tasks with different attributes. First, we evaluate the method on synthetic tasks which allow us to evaluate the model for its generalization ability and probe the behavior of the gates in more controlled settings. We then evaluate this approach on large scale Question Answering tasks including the challenging MS MARCO and SearchQA tasks. Our models shows consistent improvements for both tasks over prior work and our baselines. It has also shown to generalize significantly better on synthetic tasks as compared to the baselines.
RE-EVALUATE: Reproducibility in Evaluating Reinforcement Learning Algorithms
Reinforcement learning (RL) has recently achieved tremendous success in solving complex tasks. Careful considerations are made towards repro… (see more)ducible research in machine learning. Reproducibility in RL often becomes more difficult, due to the lack of standard evaluation method and detailed methodology for algorithms and comparisons with existing work. In this work, we highlight key differences in evaluation in RL compared to supervised learning, and discuss specific issues that are often non-intuitive for newcomers. We study the importance of reproducibility in evaluation in RL, and propose an evaluation pipeline that can be decoupled from the algorithm code. We hope such an evaluation pipeline can be standardized, as a step towards robust and reproducible research in RL.
Deep Reinforcement Learning that Matters
In recent years, significant progress has been made in solving challenging problems across various domains using deep reinforcement learning… (see more) (RL). Reproducing existing work and accurately judging the improvements offered by novel methods is vital to sustaining this progress. Unfortunately, reproducing results for state-of-the-art deep RL methods is seldom straightforward. In particular, non-determinism in standard benchmark environments, combined with variance intrinsic to the methods, can make reported results tough to interpret. Without significance metrics and tighter standardization of experimental reporting, it is difficult to determine whether improvements over the prior state-of-the-art are meaningful. In this paper, we investigate challenges posed by reproducibility, proper experimental techniques, and reporting procedures. We illustrate the variability in reported metrics and results when comparing against common baselines and suggest guidelines to make future results in deep RL more reproducible. We aim to spur discussion about how to ensure continued progress in the field by minimizing wasted effort stemming from results that are non-reproducible and easily misinterpreted.
OptionGAN: Learning Joint Reward-Policy Options using Generative Adversarial Inverse Reinforcement Learning
Reinforcement learning has shown promise in learning policies that can solve complex problems. However, manually specifying a good reward fu… (see more)nction can be difficult, especially for intricate tasks. Inverse reinforcement learning offers a useful paradigm to learn the underlying reward function directly from expert demonstrations. Yet in reality, the corpus of demonstrations may contain trajectories arising from a diverse set of underlying reward functions rather than a single one. Thus, in inverse reinforcement learning, it is useful to consider such a decomposition. The options framework in reinforcement learning is specifically designed to decompose policies in a similar light. We therefore extend the options framework and propose a method to simultaneously recover reward options in addition to policy options. We leverage adversarial methods to learn joint reward-policy options using only observed expert states. We show that this approach works well in both simple and complex continuous control tasks and shows significant performance increases in one-shot transfer learning.
Sequential Coordination of Deep Models for Learning Visual Arithmetic
Achieving machine intelligence requires a smooth integration of perception and reasoning, yet models developed to date tend to specialize in… (see more) one or the other; sophisticated manipulation of symbols acquired from rich perceptual spaces has so far proved elusive. Consider a visual arithmetic task, where the goal is to carry out simple arithmetical algorithms on digits presented under natural conditions (e.g. hand-written, placed randomly). We propose a two-tiered architecture for tackling this problem. The lower tier consists of a heterogeneous collection of information processing modules, which can include pre-trained deep neural networks for locating and extracting characters from the image, as well as modules performing symbolic transformations on the representations extracted by perception. The higher tier consists of a controller, trained using reinforcement learning, which coordinates the modules in order to solve the high-level task. For instance, the controller may learn in what contexts to execute the perceptual networks and what symbolic transformations to apply to their outputs. The resulting model is able to solve a variety of tasks in the visual arithmetic domain, and has several advantages over standard, architecturally homogeneous feedforward networks including improved sample efficiency.
A Deep Reinforcement Learning Chatbot (Short Version)
We present MILABOT: a deep reinforcement learning chatbot developed by the Montreal Institute for Learning Algorithms (MILA) for the Amazon … (see more)Alexa Prize competition. MILABOT is capable of conversing with humans on popular small talk topics through both speech and text. The system consists of an ensemble of natural language generation and retrieval models, including neural network and template-based models. By applying reinforcement learning to crowdsourced data and real-world user interactions, the system has been trained to select an appropriate response from the models in its ensemble. The system has been evaluated through A/B testing with real-world users, where it performed significantly better than other systems. The results highlight the potential of coupling ensemble systems with deep reinforcement learning as a fruitful path for developing real-world, open-domain conversational agents.