Aditya Mahajan

In this paper, we investigate the problem of system identification for autonomous Markov jump linear systems (MJS) with complete state obser… (see more)vations. We propose switched least squares method for identification of MJS, show that this method is strongly consistent, and derive data-dependent and data-independent rates of convergence. In particular, our data-dependent rate of convergence shows that, almost surely, the system identification error is

2022-12-06

IEEE Conference on Decision and Control (published)

A modified Thompson sampling-based learning algorithm for unknown linear systems

Mukul Gagrani

Sagar Sudhakara

Rahul Jain

Ashutosh Nayyar

Yi Ouyang

We revisit the Thompson sampling-based learning algorithm for controlling an unknown linear system with quadratic cost proposed in [1]. This… (see more) algorithm operates in episodes of dynamic length and it is shown to have a regret bound of

2022-12-06

2022 IEEE 61st Conference on Decision and Control (CDC) (published)

Partially observable restless bandits with restarts: indexability and computation of Whittle index

Nima Akbarzadeh

We consider restless bandits with restarts, where the state of the active arms resets according to a known probability distribution while th… (see more)e state of the passive arms evolves in a Markovian manner. We assume that the state of the arm is observed after it is reset but not observed otherwise. We show that the model is indexable and propose an efficient algorithm to compute the Whittle index by exploiting the qualitative properties of the optimal policy. A detailed numerical study of machine repair models shows that Whittle index policy outperforms myopic policy and is close to optimal policy.

2022-12-06

IEEE Conference on Decision and Control (published)

Thompson-Sampling Based Reinforcement Learning for Networked Control of Unknown Linear Systems

Borna Sayedana

Mohammad Afshari

Peter E. Caines

In recent years, there has been considerable interest in reinforcement learning for linear quadratic Gaussian (LQG) systems. In this paper, … (see more)we consider a generalization of such systems where the controller and the plant are connected over an unreliable packet drop channel. Packet drops cause the system dynamics to switch between controlled and uncontrolled modes. This switching phenomena introduces new challenges in designing learning algorithms. We identify a sufficient condition under which the regret of Thompson sampling-based reinforcement learning algorithm with dynamic episodes (TSDE) at horizon T is bounded by

2022-12-06

2022 IEEE 61st Conference on Decision and Control (CDC) (published)

Structure-Aware Reinforcement Learning for Node-Overload Protection in Mobile Edge Computing

Anirudha Jitani

Zhongwen Zhu

Hatem Abou-Zeid

Emmanuel Thepie Fapi

Hakimeh Purmehdi

Mobile Edge Computing (MEC) involves placing computational capability and applications at the edge of the network, providing benefits such a… (see more)s reduced latency, reduced network congestion, and improved performance of applications. The performance and reliability of MEC degrades significantly when the edge server(s) in the cluster are overloaded. In this work, an adaptive admission control policy to prevent edge node from getting overloaded is presented. This approach is based on a recently-proposed low complexity RL (Reinforcement Learning) algorithm called SALMUT (Structure-Aware Learning for Multiple Thresholds), which exploits the structure of the optimal admission control policy in multi-class queues for an average-cost setting. We extend the framework to work for node overload-protection problem in a discounted-cost setting. The proposed solution is validated using several scenarios mimicking real-world deployments in two different settings — computer simulations and a docker testbed. Our empirical evaluations show that the total discounted cost incurred by SALMUT is similar to state-of-the-art deep RL algorithms such as PPO (Proximal Policy Optimization) and A2C (Advantage Actor Critic) but requires an order of magnitude less time to train, outputs easily interpretable policy, and can be deployed in an online manner.

2022-12-01

IEEE Transactions on Cognitive Communications and Networking (published)

Conditions for indexability of restless bandits and an
$\mathcal{O}\!\left(K^3\right)$
algorithm to compute Whittle index

Nima Akbarzadeh

Abstract Restless bandits are a class of sequential resource allocation problems concerned with allocating one or more resources among sever… (see more)al alternative processes where the evolution of the processes depends on the resources allocated to them. Such models capture the fundamental trade-offs between exploration and exploitation. In 1988, Whittle developed an index heuristic for restless bandit problems which has emerged as a popular solution approach because of its simplicity and strong empirical performance. The Whittle index heuristic is applicable if the model satisfies a technical condition known as indexability. In this paper, we present two general sufficient conditions for indexability and identify simpler-to-verify refinements of these conditions. We then revisit a previously proposed algorithm called the adaptive greedy algorithm which is known to compute the Whittle index for a sub-class of restless bandits. We show that a generalization of the adaptive greedy algorithm computes the Whittle index for all indexable restless bandits. We present an efficient implementation of this algorithm which can compute the Whittle index of a restless bandit with K states in

2022-06-14

Advances in Applied Probability (published)

Optimal Control of Network-Coupled Subsystems: Spectral Decomposition and Low-Dimensional Solutions

Shuang Gao

In this article, we investigate the optimal control of network-coupled subsystems with coupled dynamics and costs. The dynamics coupling may… (see more) be represented by the adjacency matrix, the Laplacian matrix, or any other symmetric matrix corresponding to an underlying weighted undirected graph. Cost couplings are represented by two coupling matrices which have the same eigenvectors as the coupling matrix in the dynamics. We use the spectral decomposition of these three coupling matrices to decompose the overall system into

2022-06-01

IEEE Transactions on Control of Network Systems (published)

Staged independent learning: Towards decentralized cooperative multi-agent Reinforcement Learning

Hadi Nekoei

Akilesh Badrinaaraayanan

Amit Sinha

Mohammad Amini

Janarthanan Rajendran

Sarath Chandar

We empirically show that classic ideas from two-time scale stochastic approximation \citep{borkar1997stochastic} can be combined with sequen… (see more)tial iterative best response (SIBR) to solve complex cooperative multi-agent reinforcement learning (MARL) problems. We first start with giving a multi-agent estimation problem as a motivating example where SIBR converges while parallel iterative best response (PIBR) does not. Then we present a general implementation of staged multi-agent RL algorithms based on SIBR and multi-time scale stochastic approximation, and show that our new methods which we call Staged Independent Proximal Policy Optimization (SIPPO) and Staged Independent Q-learning (SIQL) outperform state-of-the-art independent learning on almost all the tasks in the epymarl \citep{papoudakis2020benchmarking} benchmark. This can be seen as a first step towards more decentralized MARL methods based on SIBR and multi-time scale learning.

2022-04-25

ICLR.cc/2022/Workshop/GMS (published)

openreview.net

On learning Whittle index policy for restless bandits with scalable regret

Nima Akbarzadeh

Reinforcement learning is an attractive approach to learn good resource allocation and scheduling policies based on data when the system mod… (see more)el is unknown. However, the cumulative regret of most RL algorithms scales as

2022-02-07

ArXiv (preprint)

Approximate information state for approximate planning and reinforcement learning in partially observed systems

Jayakumar Subramanian

Amit Sinha

Raihan Seraj

We propose a theoretical framework for approximate planning and learning in partially observed systems. Our framework is based on the fundam… (see more)ental notion of information state. We provide two equivalent definitions of information state---i) a function of history which is sufficient to compute the expected reward and predict its next value; ii) equivalently, a function of the history which can be recursively updated and is sufficient to compute the expected reward and predict the next observation. An information state always leads to a dynamic programming decomposition. Our key result is to show that if a function of the history (called approximate information state (AIS)) approximately satisfies the properties of the information state, then there is a corresponding approximate dynamic program. We show that the policy computed using this is approximately optimal with bounded loss of optimality. We show that several approximations in state, observation and action spaces in literature can be viewed as instances of AIS. In some of these cases, we obtain tighter bounds. A salient feature of AIS is that it can be learnt from data. We present AIS based multi-time scale policy gradient algorithms. and detailed numerical experiments with low, moderate and high dimensional environments.

Robustness of Whittle Index Policy to Model Approximation

Amit Sinha

2022-01-01

Social Science Research Network (published)

Scalable Operator Allocation for Multirobot Assistance: A Restless Bandit Approach

Abhinav Dahiya

Nima Akbarzadeh

Stephen L. Smith

In this article, we consider the problem of allocating human operators in a system with multiple semiautonomous robots. Each robot is requir… (see more)ed to perform an independent sequence of tasks, subject to a chance of failing and getting stuck in a fault state at every task. If and when required, a human operator can assist or teleoperate a robot. Conventional dynamic programming-based techniques used to solve such problems face scalability issues due to an exponential growth of state and action spaces with the number of robots and operators. In this article, we derive conditions under which the operator allocation problem satisfies a technical condition called indexability, thereby enabling the use of the Whittle index heuristic. The conditions are easy to check, and we show that they hold for a wide range of problems of interest. Our key insight is to leverage the structure of the value function of individual robots, resulting in conditions that can be verified separately for each state of each robot. We apply these conditions to two types of transitions commonly seen in remote robot supervision systems. Through numerical simulations, we demonstrate the efficacy of Whittle index policy as a near-optimal and scalable approach that outperforms existing scalable methods.

2022-01-01

IEEE Transactions on Control of Network Systems (published)