Portrait of Doina Precup

Doina Precup

Core Academic Member
Canada CIFAR AI Chair
Associate Professor, McGill University, School of Computer Science
Research Team Leader, Google DeepMind
Research Topics
Medical Machine Learning
Molecular Modeling
Probabilistic Models
Reasoning
Reinforcement Learning

Biography

Doina Precup combines teaching at McGill University with fundamental research on reinforcement learning, in particular AI applications in areas of significant social impact, such as health care. She is interested in machine decision-making in situations where uncertainty is high.

In addition to heading the Montreal office of Google DeepMind, Precup is a Senior Fellow of the Canadian Institute for Advanced Research and a Fellow of the Association for the Advancement of Artificial Intelligence.

Her areas of speciality are artificial intelligence, machine learning, reinforcement learning, reasoning and planning under uncertainty, and applications.

Current Students

PhD - McGill University
Collaborating Alumni - McGill University
Co-supervisor :
Collaborating Alumni - McGill University
Collaborating Alumni - McGill University
Co-supervisor :
PhD - McGill University
Co-supervisor :
PhD - McGill University
Principal supervisor :
Master's Research - McGill University
Principal supervisor :
Collaborating researcher - McGill University
Co-supervisor :
Collaborating researcher - Université de Montréal
PhD - McGill University
Principal supervisor :
PhD - McGill University
Principal supervisor :
Collaborating researcher - Birla Institute of Technology
PhD - McGill University
Collaborating Alumni - McGill University
Master's Research - McGill University
Collaborating Alumni - McGill University
PhD - Polytechnique Montréal
PhD - McGill University
Postdoctorate - McGill University
Collaborating Alumni - McGill University
Collaborating Alumni - McGill University
PhD - McGill University
Principal supervisor :
PhD - McGill University
Collaborating Alumni - McGill University
Master's Research - McGill University
Principal supervisor :
Collaborating researcher - McGill University
Co-supervisor :
PhD - Université de Montréal
Co-supervisor :
PhD - McGill University
Co-supervisor :
PhD - McGill University
Principal supervisor :
PhD - McGill University
Co-supervisor :
PhD - McGill University
Co-supervisor :
PhD - McGill University
PhD - McGill University
Co-supervisor :
PhD - McGill University
Research Intern - McGill University
Master's Research - McGill University
Co-supervisor :
PhD - McGill University
Principal supervisor :
PhD - McGill University
Collaborating Alumni - McGill University
Co-supervisor :

Publications

Policy Evaluation Networks
Jean Harb
Tom Schaul
Many reinforcement learning algorithms use value functions to guide the search for better policies. These methods estimate the value of a si… (see more)ngle policy while generalizing across many states. The core idea of this paper is to flip this convention and estimate the value of many policies, for a single set of states. This approach opens up the possibility of performing direct gradient ascent in policy space without seeing any new data. The main challenge for this approach is finding a way to represent complex policies that facilitates learning and generalization. To address this problem, we introduce a scalable, differentiable fingerprinting mechanism that retains essential policy information in a concise embedding. Our empirical results demonstrate that combining these three elements (learned Policy Evaluation Network, policy fingerprints, gradient ascent) can produce policies that outperform those that generated the training data, in zero-shot manner.
oIRL: Robust Adversarial Inverse Reinforcement Learning with Temporally Extended Actions
David Venuto
Léonard Boussioux
Junhao Wang
Explicit engineering of reward functions for given environments has been a major hindrance to reinforcement learning methods. While Inverse … (see more)Reinforcement Learning (IRL) is a solution to recover reward functions from demonstrations only, these learned rewards are generally heavily \textit{entangled} with the dynamics of the environment and therefore not portable or \emph{robust} to changing environments. Modern adversarial methods have yielded some success in reducing reward entanglement in the IRL setting. In this work, we leverage one such method, Adversarial Inverse Reinforcement Learning (AIRL), to propose an algorithm that learns hierarchical disentangled rewards with a policy over options. We show that this method has the ability to learn \emph{generalizable} policies and reward functions in complex transfer learning tasks, while yielding results in continuous control benchmarks that are comparable to those of the state-of-the-art methods.
Representation of Reinforcement Learning Policies in Reproducing Kernel Hilbert Spaces
We propose a general framework for policy representation for reinforcement learning tasks. This framework involves finding a low-dimensional… (see more) embedding of the policy on a reproducing kernel Hilbert space (RKHS). The usage of RKHS based methods allows us to derive strong theoretical guarantees on the expected return of the reconstructed policy. Such guarantees are typically lacking in black-box models, but are very desirable in tasks requiring stability. We conduct several experiments on classic RL domains. The results confirm that the policies can be robustly embedded in a low-dimensional space while the embedded policy incurs almost no decrease in return.
A Distributional Analysis of Sampling-Based Reinforcement Learning Algorithms
We present a distributional approach to theoretical analyses of reinforcement learning algorithms for constant step-sizes. We demonstrate it… (see more)s effectiveness by presenting simple and unified proofs of convergence for a variety of commonly-used methods. We show that value-based methods such as TD(
On Efficiency in Hierarchical Reinforcement Learning
Zheng Wen
Morteza Ibrahimi
Andre Barreto
Benjamin Van Roy
Satinder Singh
Hierarchical Reinforcement Learning (HRL) approaches promise to provide more efficient solutions to sequential decision making problems, bo… (see more)th in terms of statistical as well as computational efficiency. While this has been demonstrated empirically over time in a variety of tasks, theoretical results quantifying the ben-efits of such methods are still few and far between. In this paper, we discuss the kind of structure in a Markov decision process which gives rise to efficient HRL methods. Specifically, we formalize the intuition that HRL can exploit well repeating "subMDPs", with similar reward and transition structure. We show that, under reasonable assumptions, a model-based Thompson sampling-style HRL algorithm that exploits this structure is statistically efficient, as established through a finite-time regret bound. We also establish conditions under which planning with structure-induced options is near-optimal and computationally efficient.
Efficient Planning under Partial Observability with Unnormalized Q Functions and Spectral Learning
An Equivalence between Loss Functions and Non-Uniform Sampling in Experience Replay
Prioritized Experience Replay (PER) is a deep reinforcement learning technique in which agents learn from transitions sampled with non-unifo… (see more)rm probability proportionate to their temporal-difference error. We show that any loss function evaluated with non-uniformly sampled data can be transformed into another uniformly sampled loss function with the same expected gradient. Surprisingly, we find in some environments PER can be replaced entirely by this new loss function without impact to empirical performance. Furthermore, this relationship suggests a new branch of improvements to PER by correcting its uniformly sampled loss function equivalent. We demonstrate the effectiveness of our proposed modifications to PER and the equivalent loss function in several MuJoCo and Atari environments.
Forethought and Hindsight in Credit Assignment
Hado van Hasselt
We address the problem of credit assignment in reinforcement learning and explore fundamental questions regarding the way in which an agent … (see more)can best use additional computation to propagate new information, by planning with internal models of the world to improve its predictions. Particularly, we work to understand the gains and peculiarities of planning employed as forethought via forward models or as hindsight operating with backward models. We establish the relative merits, limitations and complementary properties of both planning mechanisms in carefully constructed scenarios. Further, we investigate the best use of models in planning, primarily focusing on the selection of states in which predictions should be (re)-evaluated. Lastly, we discuss the issue of model estimation and highlight a spectrum of methods that stretch from explicit environment-dynamics predictors to more abstract planner-aware models.
Reward Propagation Using Graph Convolutional Networks
Potential-based reward shaping provides an approach for designing good reward functions, with the purpose of speeding up learning. However, … (see more)automatically finding potential functions for complex environments is a difficult problem (in fact, of the same difficulty as learning a value function from scratch). We propose a new framework for learning potential functions by leveraging ideas from graph representation learning. Our approach relies on Graph Convolutional Networks which we use as a key ingredient in combination with the probabilistic inference view of reinforcement learning. More precisely, we leverage Graph Convolutional Networks to perform message passing from rewarding states. The propagated messages can then be used as potential functions for reward shaping to accelerate learning. We verify empirically that our approach can achieve considerable improvements in both small and high-dimensional control problems.
Reward Redistribution Mechanisms in Multi-agent Reinforcement Learning
Aly Ibrahim
Piracha
Daoud
In typical Multi-Agent Reinforcement Learning (MARL) settings, each agent acts to maximize its individual reward objective. However, for col… (see more)lective social welfare maximization, some agents may need to act non-selfishly. We propose a reward shaping mechanism using extrinsic motivation for achieving modularity and increased cooperation among agents in Sequential Social Dilemma (SSD) problems. Our mechanism, inspired by capitalism, provides extrinsic motivation to agents by redistributing a portion of collected re-wards based on each agent’s individual contribution towards team rewards. We demonstrate empirically that this mechanism leads to higher collective welfare relative to existing baselines. Furthermore, this reduces free rider issues and leads to more diverse policies. We evaluate our proposed mechanism for already specialised agents that are pre-trained for specific roles. We show that our mechanism, in the most challenging CleanUp environment, significantly out-performs two baselines (based roughly on socialism and anarchy) and accumulates 2-3 times higher rewards in an easier setting of the environment.
Value Preserving State-Action Abstractions
David Abel
Nathan Umbanhowar
Dilip Arumugam
Michael L. Littman
Value-driven Hindsight Modelling
Arthur Guez
Theophane Weber
Lars Buesing
Steven Kapturowski
David Silver
Nicolas Heess
Value estimation is a critical component of the reinforcement learning (RL) paradigm. The question of how to effectively learn predictors fo… (see more)r value from data is one of the major problems studied by the RL community, and different approaches exploit structure in the problem domain in different ways. Model learning can make use of the rich transition structure present in sequences of observations, but this approach is usually not sensitive to the reward function. In contrast, model-free methods directly leverage the quantity of interest from the future but have to compose with a potentially weak scalar signal (an estimate of the return). In this paper we develop an approach for representation learning in RL that sits in between these two extremes: we propose to learn what to model in a way that can directly help value prediction. To this end we determine which features of the future trajectory provide useful information to predict the associated return. This provides us with tractable prediction targets that are directly relevant for a task, and can thus accelerate learning of the value function. The idea can be understood as reasoning, in hindsight, about which aspects of the future observations could help past value prediction. We show how this can help dramatically even in simple policy evaluation settings. We then test our approach at scale in challenging domains, including on 57 Atari 2600 games.