Doina Precup

Mohammad Sami Nur Islam Islam

Jesse Farebrother

PhD - McGill

Principal supervisor:

Marc Gendron-Bellemare

PhD - McGill

Principal supervisor:

Eilif Benjamin Muller

PhD - McGill

PhD - McGill

Research Master's - McGill

Nazanin Mohammadi Sepahvand

Arushi Jain

PhD - McGill

PhD - McGill

Postdoctorate - McGill

Elaine Lau

Research Master's - McGill

Jonathan Lebensold

Alumni collaborator - McGill

Undergraduate - McGill

Ray Luo

PhD - McGill

Principal supervisor:

G McCracken

PhD - McGill

PhD - McGill

Shahrad Mohammadzadeh

Research Master's - McGill

Principal supervisor:

Gabriela Moisescu-Pareja

Research Master's - McGill

Padideh Nouri

PhD - UdeM

Co-supervisor:

Charles Onu

PhD - McGill

PhD - McGill

Co-supervisor:

Nate Rahn

PhD - McGill

Principal supervisor:

Marc Gendron-Bellemare

Sahand Rezaei-Shoshtari

PhD - McGill

Co-supervisor:

PhD - McGill

Co-supervisor:

PhD - McGill

Co-supervisor:

Blake Richards

samiemandana@gmail.com

PhD - McGill

Nishanth Anand Vemgal

PhD - McGill

PhD - McGill

PhD - McGill

Co-supervisor:

Samira Ebrahimi Kahou

Research Intern - McGill

Steve Wen

Research Master's - McGill

Co-supervisor:

Gregory Dudek

Skipper: Combining Spatial and Temporal Abstraction to Improve Generalization

Zijing Wu

PhD - McGill

Co-supervisor:

PhD - McGill

Harry Zhao

PhD - McGill

Co-supervisor:

Blog posts

February 22, 2024

by

Mingde Harry Zhao

Safa Alver

Harm van Seijen

Romain Laroche

Doina Precup

Yoshua Bengio

Publications

Improving Sample Efficiency of Value Based Models Using Attention and Vision Transformers

Amir Ardalan Kalantari

Mohammad Saeed Amini

Sarath Chandar

Much of recent Deep Reinforcement Learning success is owed to the neural architecture's potential to learn and use effective internal representations of the world. While many current algorithms access a simulator to train with a large amount of data, in realistic settings, including while playing games that may be played against people, collecting experience can be quite costly. In this paper, we introduce a deep reinforcement learning architecture whose purpose is to increase sample efficiency without sacrificing performance. We design this architecture by incorporating advances achieved in recent years in the field of Natural Language Processing and Computer Vision. Specifically, we propose a visually attentive model that uses transformers to learn a self-attention mechanism on the feature maps of the state representation, while simultaneously optimizing return. We demonstrate empirically that this architecture improves sample complexity for several Atari environments, while also achieving better performance in some of the games.

2022-02-01

ArXiv (preprint)

Constructing a Good Behavior Basis for Transfer using Generalized Policy Updates

Safa Alver

We study the problem of learning a good set of policies, so that when combined together, they can solve a wide variety of unseen reinforcement learning tasks with no or very little new data. Specifically, we consider the framework of generalized policy evaluation and improvement, in which the rewards for all tasks of interest are assumed to be expressible as a linear combination of a fixed set of features. We show theoretically that, under certain assumptions, having access to a specific set of diverse policies, which we call a set of independent policies, can allow for instantaneously achieving high-level performance on all possible downstream tasks which are typically more complex than the ones on which the agent was trained. Based on this theoretical analysis, we propose a simple algorithm that iteratively constructs this set of policies. In addition to empirically validating our theoretical results, we compare our approach with recently proposed diverse policy set construction methods and show that, while others fail, our approach is able to build a behavior basis that enables instantaneous transfer to all possible downstream tasks. We also show empirically that having access to a set of independent policies can better bootstrap the learning process on downstream tasks where the new reward function cannot be described as a linear combination of the features. Finally, we demonstrate how this policy set can be useful in a lifelong reinforcement learning setting.

2022-01-28

ICLR 2022 (poster)

openreview.net

COptiDICE: Offline Constrained Reinforcement Learning via Stationary Distribution Correction Estimation

Jongmin Lee

Cosmin Paduraru

Daniel J Mankowitz

Nicolas Heess

Kee-Eung Kim

Arthur Guez

We consider the offline constrained reinforcement learning (RL) problem, in which the agent aims to compute a policy that maximizes expected return while satisfying given cost constraints, learning only from a pre-collected dataset. This problem setting is appealing in many real-world scenarios, where direct interaction with the environment is costly or risky, and where the resulting policy should comply with safety constraints. However, it is challenging to compute a policy that guarantees satisfying the cost constraints in the offline RL setting, since the off-policy evaluation inherently has an estimation error. In this paper, we present an offline constrained RL algorithm that optimizes the policy in the space of the stationary distribution. Our algorithm, COptiDICE, directly estimates the stationary distribution corrections of the optimal policy with respect to returns, while constraining the cost upper bound, with the goal of yielding a cost-conservative policy for actual constraint satisfaction. Experimental results show that COptiDICE attains better policies in terms of constraint satisfaction and return-maximization, outperforming baseline algorithms.

2022-01-28

ICLR 2022 (spotlight)

doi.org

openreview.net

The Paradox of Choice: Using Attention in Hierarchical Reinforcement Learning

Andrei Cristian Nica

Khimya Khetarpal

2022-01-24

ArXiv (preprint)

Attention Option-Critic

Raviteja Chunduru

2022-01-07

ArXiv (preprint)

Appendix: On the Expressivity of Markov Reward

David Abel

Will Dabney

Anna Harutyunyan

Mark K. Ho

Michael L. Littman

Satinder Singh

(Q1) What does it mean for Bob to *solve* one of these tasks? That is, if Alice chooses a SOAP, PO, or TO for Bob to learn to solve, when can Alice determine Bob has solved the task? A: Bob can be said to be doing better on a given task if his behavior improves, as is typical in evaluating behavior under reward. The difference with SOAPs, POs, and TOs is that we measure improvement relative to the task rather than reward. For instance, given a SOAP, we might say that Bob has solved the task once he has found one of the good policies, and we might measure Bob’s progress on a task in terms of the distance of his greedy policy to one of the good policies (as done in our learning experiments). The same reasoning applies to POs and TOs: Bob is doing better on a task insofar as his greedy policy (or trajectories) is (are) higher up the ordering.

Behind the Machine's Gaze: Biologically Constrained Neural Networks Exhibit Human-like Visual Attention

Leo Schwinn

B. Eskofier

Dario Zanca

2022-01-01

arXiv.org (preprint)

doi.org

Behind the Machine's Gaze: Neural Networks with Biologically-inspired Constraints Exhibit Human-like Visual Attention

Leo Schwinn

Bjoern Eskofier

Dario Zanca

By and large, existing computational models of visual attention tacitly assume perfect vision and full access to the stimulus and thereby deviate from foveated biological vision. Moreover, modeling top-down attention is generally reduced to the integration of semantic features without incorporating the signal of high-level visual tasks that have been shown to partially guide human attention. We propose the Neural Visual Attention (NeVA) algorithm to generate visual scanpaths in a top-down manner. With our method, we explore the ability of neural networks on which we impose a biologically-inspired foveated vision constraint to generate human-like scanpaths without directly training for this objective. The loss of a neural network performing a downstream visual task (i.e., classification or reconstruction) flexibly provides top-down guidance to the scanpath. Extensive experiments show that our method outperforms state-of-the-art unsupervised human attention models in terms of similarity to human scanpaths. Additionally, the flexibility of the framework allows us to quantitatively investigate the role of different tasks in the generated visual behaviors. Finally, we demonstrate the superiority of the approach in a novel experiment that investigates the utility of scanpaths in real-world applications, where imperfect viewing conditions are given.

2022-01-01

Trans. Mach. Learn. Res. (published)

openreview.net

Continuous MDP Homomorphisms and Homomorphic Policy Gradient

Sahand Rezaei-Shoshtari

Rosie Zhao

Improving Robustness against Real-World and Worst-Case Distribution Shifts through Decision Region Quantification

Leo Schwinn

Leon Bungert

A. Nguyen

René Raab

Falk Pulsmeyer

B. Eskofier

Dario Zanca

The reliability of neural networks is essential for their use in safety-critical applications. Existing approaches generally aim at improving the robustness of neural networks to either real-world distribution shifts (e.g., common corruptions and perturbations, spatial transformations, and natural adversarial examples) or worst-case distribution shifts (e.g., optimized adversarial examples). In this work, we propose the Decision Region Quantification (DRQ) algorithm to improve the robustness of any differentiable pre-trained model against both real-world and worst-case distribution shifts in the data. DRQ analyzes the robustness of local decision regions in the vicinity of a given data point to make more reliable predictions. We theoretically motivate the DRQ algorithm by showing that it effectively smooths spurious local extrema in the decision surface. Furthermore, we propose an implementation using targeted and untargeted adversarial attacks. An extensive empirical evaluation shows that DRQ increases the robustness of adversarially and non-adversarially trained models against real-world and worst-case distribution shifts on several computer vision benchmark datasets.

2022-01-01

ICML (published)

doi.org