Joelle Pineau

Core Academic Member
Canada CIFAR AI Chair
Associate Professor, McGill University, School of Computer Science
Co-Managing Director, Meta AI (FAIR - Facebook AI Research)

Research Topics

Medical Machine Learning
Natural Language Processing
Reinforcement Learning

Biography

Joelle Pineau is a professor and William Dawson Scholar at the School of Computer Science, McGill University, where she co-directs the Reasoning and Learning Lab. She is a core academic member of Mila – Quebec Artificial Intelligence Institute, a Canada CIFAR AI Chair, and VP of AI research at Meta (previously Facebook), where she leads the Fundamental AI Research (FAIR) team. Pineau holds a BSc in systems design engineering from the University of Waterloo, and an MSc and PhD in robotics from Carnegie Mellon University.

Her research focuses on developing new models and algorithms for planning and learning in complex partially observable domains. She also works on applying these algorithms to complex problems in robotics, health care, games and conversational agents. Pineau is on the editorial board of the Journal of Machine Learning Research and is a past president of the International Machine Learning Society. She has received numerous awards and honours: NSERC’s E.W.R. Steacie Memorial Fellowship (2018) and the Governor General Innovation Award (2019). She is also a Fellow of the Association for the Advancement of Artificial Intelligence (AAAI), a Senior Fellow of the Canadian Institute for Advanced Research (CIFAR), and a Fellow of the Royal Society of Canada.

Current Students

Master's Research - Université de Montréal
Principal supervisor:
PhD - Université de Montréal
Principal supervisor:
PhD - McGill University
Co-supervisor:
PhD - McGill University

Publications

Improving Passage Retrieval with Zero-Shot Question Generation
Devendra Singh Sachan
Mike Lewis
Mandar Joshi
Armen Aghajanyan
Wen-tau Yih
Luke Zettlemoyer
We propose a simple and effective re-ranking method for improving passage retrieval in open question answering. The re-ranker re-scores retrieved passages with a zero-shot question generation model, which uses a pre-trained language model to compute the probability of the input question conditioned on a retrieved passage. This approach can be applied on top of any retrieval method (e.g. neural or keyword-based), does not require any domain- or task-specific training (and therefore is expected to generalize better to data distribution shifts), and provides rich cross-attention between query and passage (i.e. it must explain every token in the question). When evaluated on a number of open-domain retrieval datasets, our re-ranker improves strong unsupervised retrieval models by 6%-18% absolute and strong supervised models by up to 12% in terms of top-20 passage retrieval accuracy. We also obtain new state-of-the-art results on full open-domain question answering by simply adding the new re-ranker to existing models with no further changes.
SPeCiaL: Self-Supervised Pretraining for Continual Learning
Lucas Caccia
A Generalized Bootstrap Target for Value-Learning, Efficiently Combining Value and Feature Predictions
Anthony GX-Chen
Blake A. Richards
Estimating value functions is a core component of reinforcement learning algorithms. Temporal difference (TD) learning algorithms use bootstrapping, i.e. they update the value function toward a learning target using value estimates at subsequent time-steps. Alternatively, the value function can be updated toward a learning target constructed by separately predicting successor features (SF)--a policy-dependent model--and linearly combining them with instantaneous rewards. We focus on bootstrapping targets used when estimating value functions, and propose a new backup target, the
Block Contextual MDPs for Continual Learning
In reinforcement learning (RL), when defining a Markov Decision Process (MDP), the environment dynamics is implicitly assumed to be stationary. This assumption of stationarity, while simplifying, can be unrealistic in many scenarios. In the continual reinforcement learning scenario, the sequence of tasks is another source of nonstationarity. In this work, we propose to examine this continual reinforcement learning setting through the Block Contextual MDP (BC-MDP) framework, which enables us to relax the assumption of stationarity. This framework challenges RL algorithms to handle both nonstationarity and rich observation settings and, by additionally leveraging smoothness properties, enables us to study generalization bounds for this setting. Finally, we take inspiration from adaptive control to propose a novel algorithm that addresses the challenges introduced by this more realistic BC-MDP setting, allows for zero-shot adaptation at evaluation time, and achieves strong performance on several nonstationary environments.
New Insights on Reducing Abrupt Representation Change in Online Continual Learning
In the online continual learning paradigm, agents must learn from a changing distribution while respecting memory and compute constraints. Experience Replay (ER), where a small subset of past data is stored and replayed alongside new data, has emerged as a simple and effective learning strategy. In this work, we focus on the change in representations of observed data that arises when previously unobserved classes appear in the incoming data stream, and new classes must be distinguished from previous ones. We shed new light on this question by showing that applying ER causes the newly added classes' representations to overlap significantly with the previous classes, leading to highly disruptive parameter updates. Based on this empirical analysis, we propose a new method which mitigates this issue by shielding the learned representations from drastic adaptation to accommodate new classes. We show that using an asymmetric update rule pushes new classes to adapt to the older ones (rather than the reverse), which is more effective especially at task boundaries, where much of the forgetting typically occurs. Empirical results show significant gains over strong baselines on standard continual learning benchmarks.
Biomedical Research & Informatics Living Laboratory for Innovative Advances of New Technologies in Community Mobility Rehabilitation: Protocol for a longitudinal evaluation of mobility outcomes (Preprint)
Sara Ahmed
Philippe Archambault
Claudine Auger
Joyce Fung
Eva Kehayia
Anouk Lamontagne
Annette Majnemer
Sylvie Nadeau
Alain Ptito
Bonnie Swaine
Efficient Continual Learning Ensembles in Neural Network Subspaces
Thang Doan
Seyed Iman Mirzadeh
Mehrdad Farajtabar
A growing body of research in continual learning focuses on the catastrophic forgetting problem. While many attempts have been made to alleviate this problem, the majority of the methods assume a single model in the continual learning setup. In this work, we question this assumption and show that employing ensemble models can be a simple yet effective method to improve continual performance. However, the training and inference cost of ensembles can increase linearly with the number of models. Motivated by this limitation, we leverage the recent advances in the deep learning optimization literature, such as mode connectivity and neural network subspaces, to derive a new method that is both computationally advantageous and can outperform the state-of-the-art continual learning algorithms.
Low-Rank Representation of Reinforcement Learning Policies
We propose a general framework for policy representation for reinforcement learning tasks. This framework involves finding a low-dimensional embedding of the policy on a reproducing kernel Hilbert space (RKHS). The usage of RKHS-based methods allows us to derive strong theoretical guarantees on the expected return of the reconstructed policy. Such guarantees are typically lacking in black-box models, but are very desirable in tasks requiring stability and convergence guarantees. We conduct several experiments on classic RL domains. The results confirm that the policies can be robustly represented in a low-dimensional space while the embedded policy incurs almost no decrease in returns.
Robust Policy Learning over Multiple Uncertainty Sets
Annie Xie
Chelsea Finn
Reinforcement learning (RL) agents need to be robust to variations in safety-critical environments. While system identification methods provide a way to infer the variation from online experience, they can fail in settings where fast identification is not possible. Another dominant approach is robust RL which produces a policy that can handle worst-case scenarios, but these methods are generally designed to achieve robustness to a single uncertainty set that must be specified at train time. Towards a more general solution, we formulate the multi-set robustness problem to learn a policy robust to different perturbation sets. We then design an algorithm that enjoys the benefits of both system identification and robust RL: it reduces uncertainty where possible given a few interactions, but can still act robustly with respect to the remaining uncertainty. On a diverse set of control tasks, our approach demonstrates improved worst-case performance on new environments compared to prior methods based on system identification and on robust RL alone.
The Curious Case of Absolute Position Embeddings
Transformer language models encode the notion of word order using positional information. Most commonly, this positional information is represented by absolute position embeddings (APEs), which are learned from the pretraining data. However, in natural language, it is not absolute position that matters, but relative position, and the extent to which APEs can capture this type of information has not been investigated. In this work, we observe that models trained with APE over-rely on positional information to the point that they break down when subjected to sentences with shifted position information. Specifically, when models are subjected to sentences starting from a non-zero position (excluding the effect of priming), they exhibit noticeably degraded performance on zero to full-shot tasks, across a range of model families and model sizes. Our findings raise questions about the efficacy of APEs to model the relativity of position information, and invite further introspection on the sentence and word order processing strategies employed by these models.
Towards Policy-Guided Conversational Recommendation with Dialogue Acts
Paul Crook
Y-Lan Boureau
J. Weston
Akbar Karimi
Leonardo Rossi
Andrea Prati
Wenqiang Lei
Xiangnan He
Qingyun Yisong Miao
Richang Wu
Min-Yen Hong
Kan Tat-Seng
Raymond Li
Hannes Schulz
Zujie Liang
Huang Hu
Can Xu
Jian Miao
Lizi Liao
Ryuichi Takanobu
Yunshan Ma
Xun Yang
Wenchang Ma
Minlie Huang
Minghao Tu
Iulian Serban
Aaron C. Courville
David Silver
Julian Schrittwieser
K. Simonyan
Ioannis Antonoglou
Aja Huang
A. Guez
Hanlin Zhu
O. Vinyals
Igor Babuschkin
M. Mathieu
Max Jaderberg
Wojciech M. Czarnecki
A. Dudzik
Petko Georgiev
Richard Powell
T. Ewalds
Dan Horgan
M. Kroiss
Ivo Danihelka
J. Agapiou
Junhyuk Oh
Valentin Dalibard
David Choi
L. Sifre
Yury Sulsky
Sasha Vezhnevets
James Molloy
Trevor Cai
D. Budden
T. Paine
Ziyu Wang
Tobias Pfaff
Tobias Pohlen
Multi-Objective SPIBB: Seldonian Offline Policy Improvement with Safety Constraints in Finite MDPs
Philip S. Thomas
Romain Laroche
We study the problem of Safe Policy Improvement (SPI) under constraints in the offline Reinforcement Learning (RL) setting. We consider the scenario where: (i) we have a dataset collected under a known baseline policy, (ii) multiple reward signals are received from the environment inducing as many objectives to optimize. We present an SPI formulation for this RL setting that takes into account the preferences of the algorithm’s user for handling the trade-offs for different reward signals while ensuring that the new policy performs at least as well as the baseline policy along each individual objective. We build on traditional SPI algorithms and propose a novel method based on Safe Policy Iteration with Baseline Bootstrapping (SPIBB, Laroche et al., 2019) that provides high probability guarantees on the performance of the agent in the true environment. We show the effectiveness of our method on a synthetic grid-world safety task as well as in a real-world critical care context to learn a policy for the administration of IV fluids and vasopressors to treat sepsis.