Doina Precup

Sumana Basu

PhD - McGill

Co-supervisor:

Adriana Romero Soriano

Alumni collaborator - McGill

Lynn Cherif

Research Master's - McGill

Co-supervisor:

PhD - McGill

Co-supervisor:

PhD - McGill

Principal supervisor:

David Meger

Jonathan Colaço Carr

Research Master's - McGill

Principal supervisor:

Prakash Panangaden

Élodie Coté-Gauthier

Research collaborator - McGill

Co-supervisor:

Isabeau Prémont-Schwarz

Franco Del Balso

Research intern - UdeM

Jesse Farebrother

PhD - McGill

Principal supervisor:

Marc Gendron-Bellemare

PhD - McGill

Principal supervisor:

PhD - McGill

Alumni collaborator - McGill

Mohammad Sami Nur Islam Islam

Research Master's - McGill

Arushi Jain

Alumni collaborator - McGill

PhD - Polytechnique

Flemming Kondrup

Postdoctorate - McGill

Elaine Lau

Research Master's - McGill

Jonathan Lebensold

Alumni collaborator - McGill

Bachelor's - McGill

Ray Luo

PhD - McGill

Principal supervisor:

G McCracken

PhD - McGill

Nazanin Mohammadi Sepahvand

Alumni collaborator - McGill

Shahrad Mohammadzadeh

Research Master's - McGill

Principal supervisor:

Gabriela Moisescu-Pareja

Research collaborator - McGill

Co-supervisor:

Irina Rish

Padideh Nouri

PhD - UdeM

Co-supervisor:

PhD - McGill

Co-supervisor:

Nate Rahn

PhD - McGill

Principal supervisor:

Marc Gendron-Bellemare

Sahand Rezaei-Shoshtari

PhD - McGill

Co-supervisor:

PhD - McGill

Co-supervisor:

PhD - McGill

Co-supervisor:

PhD - McGill

Nishanth Anand Vemgal

PhD - McGill

PhD - McGill

Co-supervisor:

Samira Ebrahimi Kahou

Zihan Wang

PhD - McGill

Skipper: Combining Spatial and Temporal Abstraction to Improve Generalization

Guangyuan Wang

Research intern - McGill

Steve Wen

Research Master's - McGill

Co-supervisor:

Gregory Dudek

Zijing Wu

PhD - McGill

Principal supervisor:

PhD - McGill

Harry Zhao

Alumni collaborator - McGill

Co-supervisor:

Blog posts

February 22, 2024

by

Mingde Harry Zhao

Safa Alver

Harm van Seijen

Romain Laroche

Doina Precup

Yoshua Bengio

Read the article

Publications

Optimal Spectral-Norm Approximate Minimization of Weighted Finite Automata

Borja Balle

We address the approximate minimization problem for weighted finite automata (WFAs) with weights in …

2021-02-13

ArXiv (preprint)

doi.org

arxiv.org

A Consciousness-Inspired Planning Agent for Model-Based Reinforcement Learning

Mingde Zhao

We present an end-to-end, model-based deep reinforcement learning agent which dynamically attends to relevant parts of its state during planning. The agent uses a bottleneck mechanism over a set-based representation to force the number of entities to which the agent attends at each planning step to be small. In experiments, we investigate the bottleneck mechanism with several sets of customized environments featuring different challenges. We consistently observe that the design allows the planning agents to generalize their learned task-solving abilities in compatible unseen environments by attending to the relevant objects, leading to better out-of-distribution generalization performance.

Finite time analysis of temporal difference learning with linear function approximation: the tail averaged case

Gandharv Patil

Prashanth L.A.

In this paper, we study the finite-time behaviour of temporal difference (TD) learning algorithms when combined with tail-averaging, and present instance-dependent bounds on the parameter error of the tail-averaged TD iterate. Our error bounds hold in expectation as well as with high probability, exhibit a sharper rate of decay for the initial error (bias), and are comparable with existing bounds in the literature.
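To make the tail-averaging idea in the abstract above concrete, here is a minimal sketch of TD(0) with linear function approximation that discards an initial burn-in and averages the remaining iterates. The function name `tail_averaged_td0`, the constant step size, the zero initialization, and the one-state toy problem are assumptions made for illustration only; they are not taken from the paper.

```python
import numpy as np

def tail_averaged_td0(transitions, features, gamma, step_size, tail_start):
    """Run TD(0) with linear features and return the tail-averaged parameter.

    transitions: iterable of (state, reward, next_state) index triples.
    features:    array of shape (num_states, d); row s is the feature of state s.
    tail_start:  number of initial iterates discarded before averaging (assumed knob).
    """
    features = np.asarray(features, dtype=float)
    theta = np.zeros(features.shape[1])      # assumed zero initialization
    tail_sum = np.zeros_like(theta)
    tail_count = 0
    for t, (s, r, s_next) in enumerate(transitions):
        phi, phi_next = features[s], features[s_next]
        # TD error under the current linear value estimate.
        td_error = r + gamma * phi_next @ theta - phi @ theta
        theta = theta + step_size * td_error * phi
        # Only iterates produced from step `tail_start` onward enter the average.
        if t >= tail_start:
            tail_sum += theta
            tail_count += 1
    if tail_count == 0:
        raise ValueError("tail_start must be smaller than the number of transitions")
    return tail_sum / tail_count

# Toy problem (hypothetical): one state with a self-loop and reward 1.
# With gamma = 0.5 the true value is 1 / (1 - 0.5) = 2.
transitions = [(0, 1.0, 0)] * 200
theta_bar = tail_averaged_td0(transitions, [[1.0]], gamma=0.5,
                              step_size=0.5, tail_start=100)
```

In this deterministic toy case the iterates contract geometrically toward 2, so the early iterates carry nearly all of the initial error, and dropping them removes most of the bias from the average.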

Flexible Option Learning

Martin Klissarov

Flow Network based Generative Models for Non-Iterative Diverse Candidate Generation

Moksh J. Jain

This paper is about the problem of learning a stochastic policy for generating an object (like a molecular graph) from a sequence of actions, such that the probability of generating an object is proportional to a given positive reward for that object. Whereas standard return maximization tends to converge to a single return-maximizing sequence, there are cases where we would like to sample a diverse set of high-return solutions. These arise, for example, in black-box function optimization when few rounds are possible, each with large batches of queries, where the batches should be diverse, e.g., in the design of new molecules. One can also see this as a problem of approximately converting an energy function to a generative distribution. While MCMC methods can achieve that, they are expensive and generally only perform local exploration. Instead, training a generative policy amortizes the cost of search during training and yields fast generation. Using insights from Temporal Difference learning, we propose GFlowNet, based on a view of the generative process as a flow network, making it possible to handle the tricky case where different trajectories can yield the same final state, e.g., there are many ways to sequentially add atoms to generate some molecular graph. We cast the set of trajectories as a flow and convert the flow consistency equations into a learning objective, akin to the casting of the Bellman equations into Temporal Difference methods. We prove that any global minimum of the proposed objectives yields a policy which samples from the desired distribution, and demonstrate the improved performance and diversity of GFlowNet on a simple domain where there are many modes to the reward function, and on a molecule synthesis task.
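The flow-consistency idea in the abstract above can be checked by hand on a tiny example. The sketch below uses a hypothetical five-state DAG in which the terminal state `x` is reachable by two different trajectories. The edge flows, rewards, and helper names are all assumptions chosen for illustration, and terminal states are assumed to be leaves. Nothing is learned here: the code only verifies that flows satisfying "inflow = reward + outflow" induce a policy that samples terminal states in proportion to their reward.

```python
from collections import defaultdict

# Hypothetical toy DAG: s0 -> a -> x, s0 -> b -> x, s0 -> b -> y.
edge_flow = {("s0", "a"): 1.5, ("s0", "b"): 1.5,
             ("a", "x"): 1.5, ("b", "x"): 0.5, ("b", "y"): 1.0}
reward = {"x": 2.0, "y": 1.0}          # positive rewards on terminal states
order = ["s0", "a", "b", "x", "y"]     # a topological order of the DAG

def flow_mismatch(edge_flow, reward, initial="s0"):
    """Return inflow - reward - outflow for every non-initial state."""
    inflow, outflow = defaultdict(float), defaultdict(float)
    for (u, v), f in edge_flow.items():
        outflow[u] += f
        inflow[v] += f
    states = set(inflow) | set(outflow)
    return {s: inflow[s] - reward.get(s, 0.0) - outflow[s]
            for s in states if s != initial}

def terminal_distribution(edge_flow, reward, order, initial="s0"):
    """Probability of ending in each terminal state when each edge is
    followed with probability (edge flow) / (total outflow of its source)."""
    outflow = defaultdict(float)
    for (u, _), f in edge_flow.items():
        outflow[u] += f
    reach = defaultdict(float)
    reach[initial] = 1.0
    for u in order:                    # push probability mass forward
        for (src, dst), f in edge_flow.items():
            if src == u:
                reach[dst] += reach[u] * f / outflow[u]
    return {x: reach[x] for x in reward}

mismatch = flow_mismatch(edge_flow, reward)
dist = terminal_distribution(edge_flow, reward, order)
```

Here every mismatch is zero, and `dist` assigns `x` twice the probability of `y`, matching the 2:1 reward ratio even though `x` has two incoming trajectories. In a trained model the edge flows would be produced by a learned function and the mismatch would be driven toward zero by a training loss.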

Improving Long-Term Metrics in Recommendation Systems using Short-Horizon Offline RL

Bogdan Mazoure

Paul Mineiro

Pavithra Srinath

Reza Sharifi Sedeh

Adith Swaminathan

We study session-based recommendation scenarios where we want to recommend items to users during sequential interactions to improve their long-term utility. Optimizing a long-term metric is challenging because the learning signal (whether the recommendations achieved their desired goals) is delayed and confounded by other user interactions with the system. Immediately measurable proxies such as clicks can lead to suboptimal recommendations due to misalignment with the long-term metric. Many works have applied episodic reinforcement learning (RL) techniques for session-based recommendation but these methods do not account for policy-induced drift in user intent across sessions. We develop a new batch RL algorithm called Short Horizon Policy Improvement (SHPI) that approximates policy-induced distribution shifts across sessions. By varying the horizon hyper-parameter in SHPI, we recover well-known policy improvement schemes in the RL literature. Empirical results on four recommendation tasks show that SHPI can outperform matrix factorization, offline bandits, and offline RL baselines. We also provide a stable and computationally efficient implementation using weighted regression oracles.

2021-01-01

arXiv.org (preprint)

dblp.uni-trier.de

Preferential Temporal Difference Learning

Nishanth Anand

2021-01-01

ICML (published)

proceedings.mlr.press

arxiv.org

Randomized Exploration in Reinforcement Learning with General Value Function Approximation

Haque Ishfaq

Qiwen Cui

Viet Bang Nguyen

Alex Ayoub

Zhuoran Yang

Zhaoran Wang

Lin Yang

2021-01-01

International Conference on Machine Learning (published)

proceedings.mlr.press

Randomized Least Squares Policy Optimization

Haque Ishfaq

Zhuoran Yang

Andrei-Stefan Lupu

Viet Bang Nguyen

Lewis Liu

Riashat Islam

Zhaoran Wang

Policy Optimization (PO) methods with function approximation are one of the most popular classes of Reinforcement Learning (RL) algorithms. However, designing provably efficient policy optimization algorithms remains a challenge. Recent work in this area has focused on incorporating upper confidence bound (UCB)-style bonuses to drive exploration in policy optimization. In this paper, we present Randomized Least Squares Policy Optimization (RLSPO) which is inspired by Thompson Sampling. We prove that, in an episodic linear kernel MDP setting, RLSPO achieves $\tilde{O}(d^{3/2} H^{3/2} \sqrt{T})$ worst-case (frequentist) regret, where H is the number of episodes, T is the total number of steps and d is the feature dimension. Finally, we evaluate RLSPO empirically and show that it is competitive with existing provably efficient PO algorithms.

Temporally Abstract Partial Models

Khimya Khetarpal

Zafarali Ahmed

Gheorghe Comanici

Humans and animals have the ability to reason and make predictions about different courses of action at many time scales. In reinforcement learning, option models (Sutton, Precup & Singh, 1999; Precup, 2000) provide the framework for this kind of temporally abstract prediction and reasoning. Natural intelligent agents are also able to focus their attention on courses of action that are relevant or feasible in a given situation, sometimes termed affordable actions. In this paper, we define a notion of affordances for options, and develop temporally abstract partial option models that take into account the fact that an option might be affordable only in certain situations. We analyze the trade-offs between estimation and approximation error in planning and learning when using such models, and identify some interesting special cases. Additionally, we empirically demonstrate the ability to learn both affordances and partial option models online, resulting in improved sample efficiency and planning time in the Taxi domain.

On the Expressivity of Markov Reward

David Abel

Will Dabney

Anna Harutyunyan

Mark K. Ho

Michael L. Littman

Satinder Singh

Reward is the driving force for reinforcement-learning agents. This paper is dedicated to understanding the expressivity of reward as a way to capture tasks that we would want an agent to perform. We frame this study around three new abstract notions of "task" that might be desirable: (1) a set of acceptable behaviors, (2) a partial ordering over behaviors, or (3) a partial ordering over trajectories. Our main results prove that while reward can express many of these tasks, there exist instances of each task type that no Markov reward function can capture. We then provide a set of polynomial-time algorithms that construct a Markov reward function that allows an agent to optimize tasks of each of these three types, and correctly determine when no such reward function exists. We conclude with an empirical study that corroborates and illustrates our theoretical findings.