Andrei Lupu

Viet Bang Nguyen

Lewis Liu

Riashat Islam

Zhaoran Wang

Policy Optimization (PO) methods with function approximation are one of the most popular classes of Reinforcement Learning (RL) algorithms. … (voir plus)However, designing provably efﬁcient policy optimization algorithms remains a challenge. Recent work in this area has focused on incorporating upper conﬁdence bound (UCB)-style bonuses to drive exploration in policy optimization. In this paper, we present Randomized Least Squares Policy Optimization (RLSPO) which is inspired by Thompson Sampling. We prove that, in an episodic linear kernel MDP setting, RLSPO achieves (cid:101) O ( d 3 / 2 H 3 / 2 √ T ) worst-case (frequentist) regret, where H is the number of episodes, T is the total number of steps and d is the feature dimension. Finally, we evaluate RLSPO empirically and show that it is competitive with existing provably efﬁcient PO algorithms.

2020-12-31

(publié)

www.semanticscholar.org

Option-Critic in Cooperative Multi-Agent Systems

Jhelum Chakravorty

Nadeem Ward

Julien Roy

Maxime Chevalier-Boisvert

Sumana Basu

In this paper, we investigate learning temporal abstractions in cooperative multi-agent systems, using the options framework (Sutton et al, … (voir plus)1999). First, we address the planning problem for the decentralized POMDP represented by the multi-agent system, by introducing a \emph{common information approach}. We use the notion of \emph{common beliefs} and broadcasting to solve an equivalent centralized POMDP problem. Then, we propose the Distributed Option Critic (DOC) algorithm, which uses centralized option evaluation and decentralized intra-option improvement. We theoretically analyze the asymptotic convergence of DOC and build a new multi-agent environment to demonstrate its validity. Our experiments empirically show that DOC performs competitively against baselines and scales with the number of agents.

2020-05-04

International Joint Conference on Autonomous Agents and Multiagent Systems (publié)

arxiv.org

Gifting in Multi-Agent Reinforcement Learning (Student Abstract)

This work performs a first study on multi-agent reinforcement learning with deliberate reward passing between agents. We empirically demonst… (voir plus)rate that such mechanics can greatly improve the learning progression in a resource appropriation setting and provide a preliminary discussion of the complex effects of gifting on the learning dynamics.

2020-04-02

AAAI Conference on Artificial Intelligence (publié)

Leveraging Observations in Bandits: Between Risks and Benefits.

Audrey Durand

Imitation learning has been widely used to speed up learning in novice agents, by allowing them to leverage existing data from experts. Allo… (voir plus)wing an agent to be influenced by external observations can benefit to the learning process, but it also puts the agent at risk of following sub-optimal behaviours. In this paper, we study this problem in the context of bandits. More specifically, we consider that an agent (learner) is interacting with a bandit-style decision task, but can also observe a target policy interacting with the same environment. The learner observes only the target’s actions, not the rewards obtained. We introduce a new bandit optimism modifier that uses conditional optimism contingent on the actions of the target in order to guide the agent’s exploration. We analyze the effect of this modification on the well-known Upper Confidence Bound algorithm by proving that it preserves a regret upper-bound of order O(lnT), even in the presence of a very poor target, and we derive the dependency of the expected regret on the general target policy. We provide empirical results showing both great benefits as well as certain limitations inherent to observational learning in the multi-armed bandit setting. Experiments are conducted using targets satisfying theoretical assumptions with high probability, thus narrowing the gap between theory and application.

2019-07-16

Proceedings of the AAAI Conference on Artificial Intelligence (publié)

Imitation Upper Confidence Bound for Bandits on a Graph

We consider a graph of interconnected agents implementing a common policy and each playing a bandit problem with identical reward distributi… (voir plus)ons. We restrict the information propagated in the graph such that agents can uniquely observe each other's actions. We propose an extension of the Upper Confidence Bound (UCB) algorithm to this setting and empirically demonstrate that our solution improves the performance over UCB according to multiple metrics and within various graph configurations.

2018-04-28

Proceedings of the AAAI Conference on Artificial Intelligence (publié)