Doina Precup

Sumana Basu

PhD - McGill

Co-supervisor:

Adriana Romero Soriano

Collaborating Alumni - McGill

Lynn Cherif

Master's Research - McGill

Co-supervisor:

PhD - McGill

Co-supervisor:

PhD - McGill

Principal supervisor:

David Meger

Jonathan Colaço Carr

Master's Research - McGill

Principal supervisor:

Prakash Panangaden

Élodie Coté-Gauthier

Collaborating researcher - McGill

Co-supervisor:

Isabeau Prémont-Schwarz

Franco Del Balso

Research Intern - UdeM

Jesse Farebrother

PhD - McGill

Principal supervisor:

Marc Gendron-Bellemare

PhD - McGill

Principal supervisor:

PhD - McGill

Collaborating Alumni - McGill

Mohammad Sami Nur Islam Islam

Master's Research - McGill

Arushi Jain

Collaborating Alumni - McGill

PhD - Polytechnique

Flemming Kondrup

Postdoctorate - McGill

Elaine Lau

Master's Research - McGill

Jonathan Lebensold

Collaborating Alumni - McGill

Undergraduate - McGill

Ray Luo

PhD - McGill

Principal supervisor:

G McCracken

PhD - McGill

Nazanin Mohammadi Sepahvand

Collaborating Alumni - McGill

Shahrad Mohammadzadeh

Master's Research - McGill

Principal supervisor:

Gabriela Moisescu-Pareja

Collaborating researcher - McGill

Co-supervisor:

Irina Rish

Padideh Nouri

PhD - UdeM

Co-supervisor:

PhD - McGill

Co-supervisor:

Nate Rahn

PhD - McGill

Principal supervisor:

Marc Gendron-Bellemare

Sahand Rezaei-Shoshtari

PhD - McGill

Co-supervisor:

PhD - McGill

Co-supervisor:

PhD - McGill

Co-supervisor:

PhD - McGill

Nishanth Anand Vemgal

PhD - McGill

PhD - McGill

Co-supervisor:

Samira Ebrahimi Kahou

Zihan Wang

PhD - McGill

Skipper: Combining Spatial and Temporal Abstraction to Improve Generalization

Guangyuan Wang

Research Intern - McGill

Steve Wen

Master's Research - McGill

Co-supervisor:

Gregory Dudek

Zijing Wu

PhD - McGill

Principal supervisor:

PhD - McGill

Harry Zhao

Collaborating Alumni - McGill

Co-supervisor:

Blog posts

February 22, 2024

by

Mingde Harry Zhao

Safa Alver

Harm van Seijen

Romain Laroche

Doina Precup

Yoshua Bengio

Publications

Reward is enough

David Silver

Satinder Singh

Richard S. Sutton

2021-10-01

Artificial Intelligence (published)

A Survey of Exploration Methods in Reinforcement Learning

Herke van Hoof

2021-09-01

ArXiv (preprint)

A Deep Reinforcement Learning Approach to Marginalized Importance Sampling with the Successor Representation

Scott Fujimoto

David Meger

Marginalized importance sampling (MIS), which measures the density ratio between the state-action occupancy of a target policy and that of a sampling distribution, is a promising approach for off-policy evaluation. However, current state-of-the-art MIS methods rely on complex optimization tricks and succeed mostly on simple toy problems. We bridge the gap between MIS and deep reinforcement learning by observing that the density ratio can be computed from the successor representation of the target policy. The successor representation can be trained through deep reinforcement learning methodology and decouples the reward optimization from the dynamics of the environment, making the resulting algorithm stable and applicable to high-dimensional domains. We evaluate the empirical performance of our approach on a variety of challenging Atari and MuJoCo environments.

2021-07-01

Proceedings of the 38th International Conference on Machine Learning (published)

proceedings.mlr.press

Locally Persistent Exploration in Continuous Control Tasks with Sparse Rewards

A major challenge in reinforcement learning is the design of exploration strategies, especially for environments with sparse reward structures and continuous state and action spaces. Intuitively, if the reinforcement signal is very scarce, the agent should rely on some form of short-term memory in order to cover its environment efficiently. We propose a new exploration method, based on two intuitions: (1) the choice of the next exploratory action should depend not only on the (Markovian) state of the environment, but also on the agent's trajectory so far, and (2) the agent should utilize a measure of spread in the state space to avoid getting stuck in a small region. Our method leverages concepts often used in statistical physics to provide explanations for the behavior of simplified (polymer) chains in order to generate persistent (locally self-avoiding) trajectories in state space. We discuss the theoretical properties of locally self-avoiding walks and their ability to provide a kind of short-term memory through a decaying temporal correlation within the trajectory. We provide empirical evaluations of our approach in a simulated 2D navigation task, as well as higher-dimensional MuJoCo continuous control locomotion tasks with sparse rewards.

2021-07-01

Proceedings of the 38th International Conference on Machine Learning (published)

proceedings.mlr.press

Randomized Exploration for Reinforcement Learning with General Value Function Approximation

Haque Ishfaq

Qiwen Cui

Viet Huy Nguyen

Alex Ayoub

Zhuoran Yang

Zhaoran Wang

Lin F. Yang

We propose a model-free reinforcement learning algorithm inspired by the popular randomized least squares value iteration (RLSVI) algorithm as well as the optimism principle. Unlike existing upper-confidence-bound (UCB) based approaches, which are often computationally intractable, our algorithm drives exploration by simply perturbing the training data with judiciously chosen i.i.d. scalar noises. To attain optimistic value function estimation without resorting to a UCB-style bonus, we introduce an optimistic reward sampling procedure. When the value functions can be represented by a function class

2021-06-15

ArXiv (preprint)

Correcting Momentum in Temporal Difference Learning

Emmanuel Bengio

Joelle Pineau

A common optimization tool used in deep reinforcement learning is momentum, which consists in accumulating and discounting past gradients, reapplying them at each iteration. We argue that, unlike in supervised learning, momentum in Temporal Difference (TD) learning accumulates gradients that become doubly stale: not only does the gradient of the loss change due to parameter updates, the loss itself changes due to bootstrapping. We first show that this phenomenon exists, and then propose a first-order correction term to momentum. We show that this correction term improves sample efficiency in policy evaluation by correcting target value drift. An important insight of this work is that deep RL methods are not always best served by directly importing techniques from the supervised setting.

2021-06-07

ArXiv (preprint)

openreview.net

AndroidEnv: A Reinforcement Learning Platform for Android

Daniel Toyama

Philippe Hamel

Anita Gergely

Gheorghe Comanici

Amelia Glaese

Zafarali Ahmed

Tyler Jackson

Shibl Mourad

We introduce AndroidEnv, an open-source platform for Reinforcement Learning (RL) research built on top of the Android ecosystem. AndroidEnv allows RL agents to interact with a wide variety of apps and services commonly used by humans through a universal touchscreen interface. Since agents train on a realistic simulation of an Android device, they have the potential to be deployed on real devices. In this report, we give an overview of the environment, highlighting the significant features it provides for research, and we present an empirical evaluation of some popular reinforcement learning agents on a set of tasks built on this platform.

2021-05-27

ArXiv (preprint)

Self-Supervised Attention-Aware Reinforcement Learning

Haiping Wu

Khimya Khetarpal

Visual saliency has emerged as a major visualization tool for interpreting deep reinforcement learning (RL) agents. However, much of the existing research uses it as an analyzing tool rather than an inductive bias for policy learning. In this work, we use visual attention as an inductive bias for RL agents. We propose a novel self-supervised attention learning approach which can 1. learn to select regions of interest without explicit annotations, and 2. act as a plug for existing deep RL methods to improve the learning performance. We empirically show that the self-supervised attention-aware deep RL methods outperform the baselines in the context of both the rate of convergence and performance. Furthermore, the proposed self-supervised attention is not tied with specific policies, nor restricted to a specific scene. We posit that the proposed approach is a general self-supervised attention module for multi-task learning and transfer learning, and empirically validate the generalization ability of the proposed method. Finally, we show that our method learns meaningful object keypoints highlighting improvements both qualitatively and quantitatively.

2021-05-18

AAAI Conference on Artificial Intelligence (published)

Variance Penalized On-Policy and Off-Policy Actor-Critic

Ayush Jain

2021-05-18

Proceedings of the AAAI Conference on Artificial Intelligence (published)

What is Going on Inside Recurrent Meta Reinforcement Learning Agents?

Safa Alver

Recurrent meta reinforcement learning (meta-RL) agents are agents that employ a recurrent neural network (RNN) for the purpose of "learning a learning algorithm". After being trained on a pre-specified task distribution, the learned weights of the agent's RNN are said to implement an efficient learning algorithm through their activity dynamics, which allows the agent to quickly solve new tasks sampled from the same distribution. However, due to the black-box nature of these agents, the way in which they work is not yet fully understood. In this study, we shed light on the internal working mechanisms of these agents by reformulating the meta-RL problem using the Partially Observable Markov Decision Process (POMDP) framework. We hypothesize that the learned activity dynamics is acting as belief states for such agents. Several illustrative experiments suggest that this hypothesis is true, and that recurrent meta-RL agents can be viewed as agents that learn to act optimally in partially observable environments consisting of multiple related tasks. This view helps in understanding their failure cases and some interesting model-based results reported in the literature.

2021-04-29

ArXiv (preprint)

Safe option-critic: learning safety in the option-critic architecture

Arushi Jain

Khimya Khetarpal

Designing hierarchical reinforcement learning algorithms that exhibit safe behaviour is not only vital for practical applications but also facilitates a better understanding of an agent’s decisions. We tackle this problem in the options framework (Sutton, Precup & Singh, 1999), a particular way to specify temporally abstract actions which allow an agent to use sub-policies with start and end conditions. We consider a behaviour as safe that avoids regions of state space with high uncertainty in the outcomes of actions. We propose an optimization objective that learns safe options by encouraging the agent to visit states with higher behavioural consistency. The proposed objective results in a trade-off between maximizing the standard expected return and minimizing the effect of model uncertainty in the return. We propose a policy gradient algorithm to optimize the constrained objective function. We examine the quantitative and qualitative behaviours of the proposed approach in a tabular grid world, continuous-state puddle world, and three games from the Arcade Learning Environment: Ms. Pacman, Amidar, and Q*Bert. Our approach achieves a reduction in the variance of return, boosts performance in environments with intrinsic variability in the reward structure, and compares favourably both with primitive actions and with risk-neutral options.

2021-04-07

The Knowledge Engineering Review (published)

Training a First-Order Theorem Prover from Synthetic Data

Vlad Firoiu

Eser Aygün

Ankit Anand

Zafarali Ahmed

Xavier Glorot

Laurent Orseau

Lei Zhang

Shibl Mourad

2021-03-05

ArXiv (preprint)