Aditya Mahajan

Ashutosh Nayyar

Yi Ouyang

We consider the problem of designing a control policy for an infinite-horizon discounted cost Markov decision process …

2024-02-13

ArXiv (preprint)

Bridging State and History Representations: Understanding Self-Predictive RL

Tianwei Ni

Benjamin Eysenbach

Erfan SeyedSalehi

Michel Ma

Clement Gehring

Pierre-Luc Bacon

Representations are at the core of all deep reinforcement learning (RL) methods for both Markov decision processes (MDPs) and partially obse… (see more)rvable Markov decision processes (POMDPs). Many representation learning methods and theoretical frameworks have been developed to understand what constitutes an effective representation. However, the relationships between these methods and the shared properties among them remain unclear. In this paper, we show that many of these seemingly distinct methods and frameworks for state and history abstractions are, in fact, based on a common idea of self-predictive abstraction. Furthermore, we provide theoretical insights into the widely adopted objectives and optimization, such as the stop-gradient technique, in learning self-predictive representations. These findings together yield a minimalist algorithm to learn self-predictive representations for states and histories. We validate our theories by applying our algorithm to standard MDPs, MDPs with distractors, and POMDPs with sparse rewards. These findings culminate in a set of preliminary guidelines for RL practitioners.

2024-01-16

ICLR.cc/2024/Conference (poster)

openreview.net

Strong Consistency and Rate of Convergence of Switched Least Squares System Identification for Autonomous Markov Jump Linear Systems

Borna Sayedana

Mohammad Afshari

Peter E. Caines

In this paper, we investigate the problem of system identification for autonomous Markov jump linear systems (MJS) with complete state obser… (see more)vations. We propose switched least squares method for identification of MJS, show that this method is strongly consistent, and derive data-dependent and data-independent rates of convergence. In particular, our data-independent rate of convergence shows that, almost surely, the system identification error is

2024-01-01

IEEE Transactions on Automatic Control (published)

Asymmetric Actor-Critic with Approximate Information State

Amit Sinha

Reinforcement learning (RL) for partially observable Markov decision processes (POMDPs) is a challenging problem because decisions need to b… (see more)e made based on the entire history of observations and actions. However, in several scenarios, state information is available during the training phase. We are interested in exploiting the availability of this state information during the training phase to efficiently learn a history-based policy using RL. Specifically, we consider actor-critic algorithms, where the actor uses only the history information but the critic uses both history and state. Such algorithms are called asymmetric actor-critic, to highlight the fact that the actor and critic have asymmetric information. Motivated by the recent success of using representation losses in RL for POMDPs [1], we derive similar theoretical results for the asymmetric actor-critic case and evaluate the effectiveness of adding such auxiliary losses in experiments. In particular, we learn a history representation-called an approximate information state (AIS)-and bound the performance loss when acting using AIS.

2023-12-13

IEEE Conference on Decision and Control (published)

Relative Almost Sure Regret Bounds for Certainty Equivalence Control of Markov Jump Systems

Borna Sayedana

Mohammad Afshari

Peter E. Caines

In this paper, we consider learning and control problem in an unknown Markov jump linear system (MJLS) with perfect state observations. We f… (see more)irst establish a generic upper bound on regret for any learning based algorithm. We then propose a certainty equivalence-based learning alagrithm and show that this algorithm achieves a regret of

2023-12-13

2023 62nd IEEE Conference on Decision and Control (CDC) (published)

Weighted-Norm Bounds on Model Approximation in MDPs with Unbounded Per-Step Cost

Berk Bozkurt

Ashutosh Nayyar

Yi Ouyang

We consider the problem of designing a control policy for an infinite-horizon discounted cost Markov Decision Process (MDP) …

2023-12-13

IEEE Conference on Decision and Control (published)

Mean-field games among teams

Jayakumar Subramanian

Akshat Kumar

2023-10-18

ArXiv (preprint)

Decentralized Linear Quadratic Systems With Major and Minor Agents and Non-Gaussian Noise

Mohammad Afshari

A decentralized linear quadratic system with a major agent and a collection of minor agents is considered. The major agent affects the minor… (see more) agents, but not vice versa. The state of the major agent is observed by all agents. In addition, the minor agents have a noisy observation of their local state. The noise process is not assumed to be Gaussian. The structures of the optimal strategy and the best linear strategy are characterized. It is shown that the major agent's optimal control action is a linear function of the major agent's minimum mean-squared error (MMSE) estimate of the system state while the minor agent's optimal control action is a linear function of the major agent's MMSE estimate of the system state and a “correction term” that depends on the difference of the minor agent's MMSE estimate of its local state and the major agent's MMSE estimate of the minor agent's local state. Since the noise is non-Gaussian, the minor agent's MMSE estimate is a nonlinear function of its observation. It is shown that replacing the minor agent's MMSE estimate with its linear least mean square estimate gives the best linear control strategy. The results are proved using a direct method based on conditional independence, common-information-based splitting of state and control actions, and simplifying the per-step cost based on conditional independence, orthogonality principle, and completion of squares.

2023-08-01

IEEE Transactions on Automatic Control (published)

Approximate information state based convergence analysis of recurrent Q-learning

Erfan SeyedSalehi

Nima Akbarzadeh

Amit Sinha

In spite of the large literature on reinforcement learning (RL) algorithms for partially observable Markov decision processes (POMDPs), a co… (see more)mplete theoretical understanding is still lacking. In a partially observable setting, the history of data available to the agent increases over time so most practical algorithms either truncate the history to a finite window or compress it using a recurrent neural network leading to an agent state that is non-Markovian. In this paper, it is shown that in spite of the lack of the Markov property, recurrent Q-learning (RQL) converges in the tabular setting. Moreover, it is shown that the quality of the converged limit depends on the quality of the representation which is quantified in terms of what is known as an approximate information state (AIS). Based on this characterization of the approximation error, a variant of RQL with AIS losses is presented. This variant performs better than a strong baseline for RQL that does not use AIS losses. It is demonstrated that there is a strong correlation between the performance of RQL over time and the loss associated with the AIS representation.

2023-07-20

EWRL/2023/Workshop (published)

openreview.net

On learning history-based policies for controlling Markov decision processes

Gandharv Patil

Doina Precup

2023-06-19

ICML.cc/2023/Workshop/Frontiers4LCD (published)

openreview.net

Conditions for indexability of restless bandits and an algorithm to compute whittle index – CORRIGENDUM

Nima Akbarzadeh

2023-06-09

Advances in Applied Probability (published)

Scalable Regret for Learning to Control Network-Coupled Subsystems With Unknown Dynamics

Sagar Sudhakara

Ashutosh Nayyar

Yi Ouyang

In this article, we consider the problem of controlling an unknown linear quadratic Gaussian (LQG) system consisting of multiple subsystems … (see more)connected over a network. Our goal is to minimize and quantify the regret (i.e., loss in performance) of our learning and control strategy with respect to an oracle who knows the system model. Upfront viewing the interconnected subsystems globally and directly using existing LQG learning algorithms for the global system results in a regret that increases super-linearly with the number of subsystems. Instead, we propose a new Thompson sampling-based learning algorithm which exploits the structure of the underlying network. We show that the expected regret of the proposed algorithm is bounded by

2023-03-01

IEEE Transactions on Control of Network Systems (published)