Amit Sinha

Periodic agent-state based Q-learning for POMDPs

Matthieu Geist

2024-09-25

NeurIPS.cc/2024/Conference (poster)

Periodic agent-state based Q-learning for POMDPs

Matthieu Geist

The standard approach for Partially Observable Markov Decision Processes (POMDPs) is to convert them to a fully observed belief-state MDP. H… (see more)owever, the belief state depends on the system model and is therefore not viable in reinforcement learning (RL) settings. A widely used alternative is to use an agent state, which is a model-free, recursively updateable function of the observation history. Examples include frame stacking and recurrent neural networks. Since the agent state is model-free, it is used to adapt standard RL algorithms to POMDPs. However, standard RL algorithms like Q-learning learn a stationary policy. Our main thesis that we illustrate via examples is that because the agent state does not satisfy the Markov property, non-stationary agent-state based policies can outperform stationary ones. To leverage this feature, we propose PASQL (periodic agent-state based Q-learning), which is a variant of agent-state-based Q-learning that learns periodic policies. By combining ideas from periodic Markov chains and stochastic approximation, we rigorously establish that PASQL converges to a cyclic limit and characterize the approximation error of the converged periodic policy. Finally, we present a numerical experiment to highlight the salient features of PASQL and demonstrate the benefit of learning periodic policies over stationary policies.

2024-08-01

EWRL/2024/Workshop (accepted)

Asymmetric Actor-Critic with Approximate Information State

Reinforcement learning (RL) for partially observable Markov decision processes (POMDPs) is a challenging problem because decisions need to b… (see more)e made based on the entire history of observations and actions. However, in several scenarios, state information is available during the training phase. We are interested in exploiting the availability of this state information during the training phase to efficiently learn a history-based policy using RL. Specifically, we consider actor-critic algorithms, where the actor uses only the history information but the critic uses both history and state. Such algorithms are called asymmetric actor-critic, to highlight the fact that the actor and critic have asymmetric information. Motivated by the recent success of using representation losses in RL for POMDPs [1], we derive similar theoretical results for the asymmetric actor-critic case and evaluate the effectiveness of adding such auxiliary losses in experiments. In particular, we learn a history representation-called an approximate information state (AIS)-and bound the performance loss when acting using AIS.

2023-12-13

IEEE Conference on Decision and Control (published)

Dealing With Non-stationarity in Decentralized Cooperative Multi-Agent Deep Reinforcement Learning via Multi-Timescale Learning

Hadi Nekoei

Akilesh Badrinaaraayanan

Mohammad Amin Amini

Janarthanan Rajendran

Sarath Chandar

2023-11-20

Proceedings of The 2nd Conference on Lifelong Learning Agents (published)

Approximate information state based convergence analysis of recurrent Q-learning

Erfan SeyedSalehi

Nima Akbarzadeh

In spite of the large literature on reinforcement learning (RL) algorithms for partially observable Markov decision processes (POMDPs), a co… (see more)mplete theoretical understanding is still lacking. In a partially observable setting, the history of data available to the agent increases over time so most practical algorithms either truncate the history to a finite window or compress it using a recurrent neural network leading to an agent state that is non-Markovian. In this paper, it is shown that in spite of the lack of the Markov property, recurrent Q-learning (RQL) converges in the tabular setting. Moreover, it is shown that the quality of the converged limit depends on the quality of the representation which is quantified in terms of what is known as an approximate information state (AIS). Based on this characterization of the approximation error, a variant of RQL with AIS losses is presented. This variant performs better than a strong baseline for RQL that does not use AIS losses. It is demonstrated that there is a strong correlation between the performance of RQL over time and the loss associated with the AIS representation.

2023-07-20

EWRL/2023/Workshop (accepted)

Dealing With Non-stationarity in Decentralized Cooperative Multi-Agent Deep Reinforcement Learning via Multi-Timescale Learning

Hadi Nekoei

Akilesh Badrinaaraayanan

Mohammad Amin Amini

Janarthanan Rajendran

Sarath Chandar

2023-02-06

ArXiv (preprint)

Robustness and Sample Complexity of Model-Based MARL for General-Sum Markov Games

Jayakumar Subramanian

2023-01-21

Dynamic Games and Applications (published)

Staged independent learning: Towards decentralized cooperative multi-agent Reinforcement Learning

Hadi Nekoei

Akilesh Badrinaaraayanan

Mohammad Amini

Janarthanan Rajendran

Sarath Chandar

We empirically show that classic ideas from two-time scale stochastic approximation \citep{borkar1997stochastic} can be combined with sequen… (see more)tial iterative best response (SIBR) to solve complex cooperative multi-agent reinforcement learning (MARL) problems. We first start with giving a multi-agent estimation problem as a motivating example where SIBR converges while parallel iterative best response (PIBR) does not. Then we present a general implementation of staged multi-agent RL algorithms based on SIBR and multi-time scale stochastic approximation, and show that our new methods which we call Staged Independent Proximal Policy Optimization (SIPPO) and Staged Independent Q-learning (SIQL) outperform state-of-the-art independent learning on almost all the tasks in the epymarl \citep{papoudakis2020benchmarking} benchmark. This can be seen as a first step towards more decentralized MARL methods based on SIBR and multi-time scale learning.

2022-04-25

ICLR.cc/2022/Workshop/GMS (published)

Approximate information state for approximate planning and reinforcement learning in partially observed systems

Jayakumar Subramanian

Raihan Seraj

We propose a theoretical framework for approximate planning and learning in partially observed systems. Our framework is based on the fundam… (see more)ental notion of information state. We provide two equivalent definitions of information state---i) a function of history which is sufficient to compute the expected reward and predict its next value; ii) equivalently, a function of the history which can be recursively updated and is sufficient to compute the expected reward and predict the next observation. An information state always leads to a dynamic programming decomposition. Our key result is to show that if a function of the history (called approximate information state (AIS)) approximately satisfies the properties of the information state, then there is a corresponding approximate dynamic program. We show that the policy computed using this is approximately optimal with bounded loss of optimality. We show that several approximations in state, observation and action spaces in literature can be viewed as instances of AIS. In some of these cases, we obtain tighter bounds. A salient feature of AIS is that it can be learnt from data. We present AIS based multi-time scale policy gradient algorithms. and detailed numerical experiments with low, moderate and high dimensional environments.

Robustness of Whittle Index Policy to Model Approximation

2022-01-01

Social Science Research Network (published)

Robustness of Markov perfect equilibrium to model approximations in general-sum dynamic games

Jayakumar Subramanian

Dynamic games (also called stochastic games or Markov games) are an important class of games for modeling multi-agent interactions. In many … (see more)situations, the dynamics and reward functions of the game are learnt from past data and are therefore approximate. In this paper, we study the robustness of Markov perfect equilibrium to approximations in reward and transition functions. Using approximation results from Markov decision processes, we show that the Markov perfect equilibrium of an approximate (or perturbed) game is always an approximate Markov perfect equilibrium of the original game. We provide explicit bounds on the approximation error in terms of three quantities: (i) the error in approximating the reward functions, (ii) the error in approximating the transition function, and (iii) a property of the value function of the MPE of the approximate game. The second and third quantities depend on the choice of metric on probability spaces. We also present coarser upper bounds which do not depend on the value function but only depend on the properties of the reward and transition functions of the approximate game. We illustrate the results via a numerical example.

2021-12-20

2021 Seventh Indian Control Conference (ICC) (published)