
Aditya Mahajan

Associate Academic Member
Associate Professor, McGill University, Department of Electrical and Computer Engineering
Research Topics
Reinforcement Learning

Biography

Aditya Mahajan is a professor of electrical and computer engineering at McGill University. He is a member of the McGill Centre for Intelligent Machines (CIM), Mila – Quebec Artificial Intelligence Institute, the International Laboratory on Learning Systems (ILLS), and the Group for Research in Decision Analysis (GERAD). He holds a bachelor's degree in electrical engineering from the Indian Institute of Technology Kanpur (India), and a master's degree and PhD in electrical engineering and computer science from the University of Michigan, Ann Arbor (United States).

Aditya Mahajan is a Senior Member of the Institute of Electrical and Electronics Engineers (IEEE) and a member of Professional Engineers Ontario. He currently serves as an associate editor of the IEEE Transactions on Automatic Control, IEEE Control Systems Letters, and Mathematics of Control, Signals, and Systems (Springer). He served as an associate editor on the conference editorial board of the IEEE Control Systems Society from 2014 to 2017.

He received the 2015 George Axelby Outstanding Paper Award, a Discovery Accelerator Supplement from the Natural Sciences and Engineering Research Council of Canada (NSERC) in 2016, the 2014 CDC Best Student Paper Award (as supervisor), and the 2016 NecSys Best Student Paper Award (as supervisor). His principal research interests are stochastic control and reinforcement learning.

Current Students

Master's Research - McGill
Master's Research - McGill
Alumni collaborator - McGill
Master's Research - McGill
Postdoctorate - McGill
Co-supervisor:
Master's Research - UdeM
PhD - McGill
Master's Research - McGill
PhD - McGill
PhD - McGill

Publications

Sub-optimality bounds for certainty equivalent policies in partially observed systems
Ashutosh Nayyar
Yi Ouyang
In this paper, we present a generalization of the certainty equivalence principle of stochastic control. One interpretation of the classical certainty equivalence principle for linear systems with output feedback and quadratic costs is as follows: the optimal action at each time is obtained by evaluating the optimal state-feedback policy of the stochastic linear system at the minimum mean square error (MMSE) estimate of the state. Motivated by this interpretation, we consider certainty equivalent policies for general (non-linear) partially observed stochastic systems that allow for any state estimate rather than restricting to MMSE estimates. In such settings, the certainty equivalent policy is not optimal. For models where the cost and the dynamics are smooth in an appropriate sense, we derive upper bounds on the sub-optimality of certainty equivalent policies. We present several examples to illustrate the results.
Generalized certainty equivalence based policies in partially observable systems
Ashutosh Nayyar
Yi Ouyang
In this paper, we present a generalization of the certainty equivalence principle of stochastic control. One interpretation of the classical certainty equivalence principle for linear systems with output feedback and quadratic costs is as follows: the optimal action at each time is obtained by evaluating the optimal state-feedback policy of the stochastic linear system at the minimum mean square error (MMSE) estimate of the state. Motivated by this interpretation, we consider certainty equivalent policies for general (non-linear) partially observed stochastic systems and allow for any state estimate rather than restricting to MMSE estimates. In such settings, the certainty equivalent policy is not optimal. For models with Lipschitz cost and dynamics, we derive upper bounds on the sub-optimality of certainty equivalent policies in terms of the expected error of the proposed estimator. We present several examples to illustrate the results.
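The certainty-equivalence idea in the two abstracts above can be illustrated with a small numerical sketch. This is an illustrative scalar linear-quadratic example with assumed parameters, not the papers' construction: the optimal state-feedback gain is computed offline, then evaluated at whatever state estimate is available, MMSE or otherwise.

```python
# Minimal sketch (illustrative, not the papers' method): certainty
# equivalence for a scalar linear-quadratic system x' = a x + b u + noise
# with per-step cost q x^2 + r u^2.

def lqr_gain(a, b, q, r, iters=500):
    """Iterate the scalar discrete-time Riccati equation to a fixed point."""
    p = q
    for _ in range(iters):
        p = q + a * a * p - (a * b * p) ** 2 / (r + b * b * p)
    return a * b * p / (r + b * b * p)   # optimal state feedback: u = -k x

a, b, q, r = 0.9, 1.0, 1.0, 0.5
k = lqr_gain(a, b, q, r)

x_hat = 1.3          # any state estimate produced by a filter
u_ce = -k * x_hat    # certainty-equivalent action: plug the estimate into the gain
```

The papers bound how far such a plug-in policy can be from optimal when the estimate is not the MMSE estimate and the system is not linear.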
A Theoretical Justification for Asymmetric Actor-Critic Algorithms
In reinforcement learning for partially observable environments, many successful algorithms have been developed within the asymmetric learning paradigm. This paradigm leverages additional state information available at training time for faster learning. Although the proposed learning objectives are usually theoretically sound, these methods still lack a precise theoretical justification for their potential benefits. We propose such a justification for asymmetric actor-critic algorithms with linear function approximators by adapting a finite-time convergence analysis to this setting. The resulting finite-time bound reveals that the asymmetric critic eliminates error terms arising from aliasing in the agent state.
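The structural idea of asymmetric actor-critic with linear function approximators can be sketched as follows. This is an assumed minimal setup, not the paper's exact algorithm: the critic is linear in the full state, available only at training time, while the actor acts on the partial observation.

```python
import numpy as np

# Illustrative asymmetric actor-critic step: the TD error from a
# *state*-based linear critic drives a policy that only sees the
# observation / agent state. Dimensions are arbitrary.
rng = np.random.default_rng(0)
state_dim, obs_dim, n_actions = 6, 3, 2
w = np.zeros(state_dim)                 # critic: V(s) = w @ s
theta = np.zeros((n_actions, obs_dim))  # actor logits: theta @ o

def pi(o):
    logits = theta @ o
    z = np.exp(logits - logits.max())
    return z / z.sum()

def asymmetric_ac_step(s, o, a, r, s_next, gamma=0.99, alpha=0.05, beta=0.01):
    global w, theta
    delta = r + gamma * w @ s_next - w @ s   # TD error uses the full state s
    w += alpha * delta * s                   # critic update: privileged info
    grad = -np.outer(pi(o), o)               # grad of log pi(a|o) wrt theta
    grad[a] += o
    theta += beta * delta * grad             # actor update: only the observation
    return delta
```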
Convergence of regularized agent-state based Q-learning in POMDPs
Matthieu Geist
In this paper, we present a framework to understand the convergence of commonly used Q-learning reinforcement learning algorithms in practice. Two salient features of such algorithms are: (i) the Q-table is recursively updated using an agent state (such as the state of a recurrent neural network) which is not a belief state or an information state and (ii) policy regularization is often used to encourage exploration and stabilize the learning algorithm. We investigate the simplest form of such Q-learning algorithms which we call regularized agent-state based Q-learning (RASQL) and show that it converges under mild technical conditions to the fixed point of an appropriately defined regularized MDP, which depends on the stationary distribution induced by the behavioral policy. We also show that a similar analysis continues to work for a variant of RASQL that learns periodic policies. We present numerical examples to illustrate that the empirical convergence behavior matches with the proposed theoretical limit.
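The two salient features above can be sketched in a few lines. This is an illustrative entropy-regularized update on an agent-state-indexed Q-table (assumed setup; the RASQL analyzed in the paper may differ in details): the table is indexed by a recursively updated agent state rather than a belief state, and the max is replaced by a soft-max backup.

```python
import numpy as np

n_agent_states, n_actions = 4, 2
Q = np.zeros((n_agent_states, n_actions))
tau, gamma, alpha = 0.5, 0.9, 0.1   # temperature, discount, step size

def soft_value(q_row):
    """Regularized backup: tau * logsumexp(Q/tau), which upper-bounds max(Q)."""
    m = q_row.max()
    return m + tau * np.log(np.exp((q_row - m) / tau).sum())

def rasql_update(z, a, r, z_next):
    """One update; z and z_next are agent states, not belief states."""
    target = r + gamma * soft_value(Q[z_next])
    Q[z, a] += alpha * (target - Q[z, a])
```

The paper's point is that iterations of this form converge to the fixed point of a regularized MDP that depends on the stationary distribution of the behavioral policy.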
Model approximation in MDPs with unbounded per-step cost
Ashutosh Nayyar
Yi Ouyang
We consider the problem of designing a control policy for an infinite-horizon discounted cost Markov decision process …
Low-Dimensional solutions for optimal control of network-coupled subsystems over a directed network
In this paper, we investigate optimal control of network-coupled subsystems, where the coupling between the dynamics of the subsystems is represented by the adjacency or Laplacian matrix of a directed graph. Under the assumption that the coupling matrix is normal and the cost coupling is compatible with the dynamics coupling, we use the spectral decomposition of the coupling matrix to decompose the overall system into at most n systems with noise coupled dynamics and decoupled cost, where n is the size of the network. Furthermore, the optimal control input at each subsystem can be computed by solving n₁ decoupled Riccati equations, where n₁ (n₁ ≤ n) denotes the number of distinct eigenvalues of the coupling matrix, where complex conjugate pairs are not double-counted. A salient feature of the result is that the solution complexity depends on the number of distinct eigenvalues of the coupling matrix rather than the size of the network. Therefore, the proposed solution framework provides a scalable method for synthesizing and implementing optimal control laws for large-scale network-coupled subsystems.
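The counting step in the abstract can be illustrated numerically. This is a sketch under assumed data (a random skew-symmetric, hence normal, coupling matrix), showing how conjugate pairs collapse so that the number n₁ of Riccati equations can be far below the network size n:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
S = rng.standard_normal((n, n))
M = S - S.T                      # skew-symmetric => normal coupling matrix

eigvals = np.linalg.eigvals(M)   # purely imaginary, in conjugate pairs plus 0

# Distinct eigenvalues, identifying each conjugate pair (ev, conj(ev))
# by (real part, |imag part|); rounding absorbs floating-point noise.
canon = {(round(ev.real, 8), round(abs(ev.imag), 8)) for ev in eigvals}
n1 = len(canon)                  # number of decoupled Riccati equations
```

For a generic 5-node network of this form, n₁ = 3 (the zero eigenvalue plus two conjugate pairs), so only 3 Riccati equations are solved instead of 5.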
Agent-state based policies in POMDPs: Beyond belief-state MDPs
The traditional approach to POMDPs is to convert them into fully observed MDPs by considering a belief state as an information state. However, a belief-state based approach requires perfect knowledge of the system dynamics and is therefore not applicable in the learning setting where the system model is unknown. Various approaches to circumvent this limitation have been proposed in the literature. We present a unified treatment of some of these approaches by viewing them as models where the agent maintains a local recursively updateable “agent state” and chooses actions based on the agent state. We highlight the different classes of agent-state based policies and the various approaches that have been proposed in the literature to find good policies within each class. These include the designer’s approach to find optimal non-stationary agent-state based policies, policy search approaches to find locally optimal stationary agent-state based policies, and the approximate information state approach to find approximately optimal stationary agent-state based policies. We then present how ideas from the approximate information state approach have been used to improve Q-learning and actor-critic algorithms for learning in POMDPs.
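A concrete instance of the "local recursively updateable agent state" may help. This illustrative example (my own, not from the paper) uses a fixed-length window of recent observation-action pairs, which can be updated without any knowledge of the system model, unlike a belief state, which requires the dynamics:

```python
from collections import deque

class WindowAgentState:
    """Agent state = last k (observation, action) pairs, updated recursively."""
    def __init__(self, k):
        self.buf = deque(maxlen=k)   # bounded memory: old pairs are dropped

    def update(self, obs, action):
        """Recursive update z' = phi(z, obs, action)."""
        self.buf.append((obs, action))
        return tuple(self.buf)       # hashable, so usable as a table index

z = WindowAgentState(k=2)
s1 = z.update("y1", "a1")
s2 = z.update("y2", "a2")
s3 = z.update("y3", "a3")            # oldest pair dropped automatically
```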
Constant step-size stochastic approximation with delayed updates
Silviu-Iulian Niculescu
Mathukumalli Vidyasagar
In this paper, we consider constant step-size stochastic approximation with delayed updates. For the non-delayed case, it is well known that under appropriate conditions, the discrete-time iterates of stochastic approximation track the trajectory of a continuous-time ordinary differential equation (ODE). For the delayed case, we show in this paper that, under appropriate conditions, the discrete-time iterates track the trajectory of a delay-differential equation (DDE) rather than an ODE. Thus, delayed updates lead to a qualitative change in the behavior of constant step-size stochastic approximation. We present multiple examples to illustrate the qualitative effect of delay and show that increasing the delay is generally destabilizing but, for some systems, it can be stabilizing as well.
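The destabilizing effect of delay can be demonstrated with a toy simulation (illustrative parameters, not the paper's examples): the recursion x_{k+1} = x_k + a·f(x_{k-d}) with f(x) = -x tracks the stable ODE dx/dt = -x when d = 0, while a large delay pushes it into the unstable regime of the corresponding DDE.

```python
import numpy as np

def run_sa(delay, a=0.05, steps=2000, seed=0):
    """Constant step-size SA with a delayed update and small noise.
    Returns the peak magnitude over the last 400 iterates."""
    rng = np.random.default_rng(seed)
    xs = [1.0] * (delay + 1)                  # initial history for the DDE
    for _ in range(steps):
        x_delayed = xs[-(delay + 1)]          # update uses a stale iterate
        xs.append(xs[-1] + a * (-x_delayed + 0.01 * rng.standard_normal()))
    return max(abs(x) for x in xs[-400:])

small = run_sa(delay=0)    # stable: iterates settle near 0
large = run_sa(delay=80)   # unstable: a * delay exceeds the DDE threshold
```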
A vector almost-supermartingale convergence theorem and its applications
Silviu-Iulian Niculescu
Mathukumalli Vidyasagar
The almost-supermartingale convergence theorem of Robbins and Siegmund (1971) is a fundamental tool for establishing the convergence of various stochastic iterative algorithms including system identification, adaptive control, and reinforcement learning. The theorem is stated for non-negative scalar valued stochastic processes. In this paper, we generalize the theorem to non-negative vector valued stochastic processes and provide two sets of sufficient conditions for such processes to converge almost surely. We present several applications of the vector almost-supermartingale convergence theorem, including convergence of autoregressive supermartingales, delayed supermartingales, and stochastic approximation with delayed updates.
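For reference, the scalar theorem being generalized can be stated as follows (standard form of Robbins–Siegmund, 1971):

```latex
% Let V_k, a_k, b_k, c_k be non-negative, \mathcal{F}_k-measurable processes with
\mathbb{E}[V_{k+1} \mid \mathcal{F}_k] \le (1 + a_k)\,V_k + b_k - c_k,
\qquad \textstyle\sum_k a_k < \infty, \quad \sum_k b_k < \infty \ \text{a.s.}
% Then V_k converges almost surely and
V_k \to V_\infty \ \text{a.s.}, \qquad \textstyle\sum_k c_k < \infty \ \text{a.s.}
```

The paper extends this statement to non-negative vector-valued processes, where the ordering and the contraction structure require additional conditions.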
Periodic agent-state based Q-learning for POMDPs
Matthieu Geist
On learning history-based policies for controlling Markov decision processes
Reinforcement learning (RL) folklore suggests that history-based function approximation methods, such as recurrent neural nets or history-based state abstraction, perform better than their memory-less counterparts, due to the fact that function approximation in Markov decision processes (MDP) can be viewed as inducing a partially observable MDP. However, there has been little formal analysis of such history-based algorithms, as most existing frameworks focus exclusively on memory-less features. In this paper, we introduce a theoretical framework for studying the behaviour of RL algorithms that learn to control an MDP using history-based feature abstraction mappings. Furthermore, we use this framework to design a practical RL algorithm and we numerically evaluate its effectiveness on a set of continuous control tasks.
Bridging State and History Representations: Understanding Self-Predictive RL
Representations are at the core of all deep reinforcement learning (RL) methods for both Markov decision processes (MDPs) and partially observable Markov decision processes (POMDPs). Many representation learning methods and theoretical frameworks have been developed to understand what constitutes an effective representation. However, the relationships between these methods and the shared properties among them remain unclear. In this paper, we show that many of these seemingly distinct methods and frameworks for state and history abstractions are, in fact, based on a common idea of self-predictive abstraction. Furthermore, we provide theoretical insights into the widely adopted objectives and optimization, such as the stop-gradient technique, in learning self-predictive representations. These findings together yield a minimalist algorithm to learn self-predictive representations for states and histories. We validate our theories by applying our algorithm to standard MDPs, MDPs with distractors, and POMDPs with sparse rewards. These findings culminate in a set of preliminary guidelines for RL practitioners.
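The self-predictive objective with a stop-gradient target can be sketched in a few lines of numpy. This is an illustrative linear parameterization (the names, the linear encoder, and the toy data are assumptions, not the paper's setup): an encoder Phi maps observations to latents, and a latent model P is trained to predict the next latent, with no gradient flowing through the target encoding.

```python
import numpy as np

rng = np.random.default_rng(0)
obs_dim, latent_dim = 8, 3
Phi = 0.1 * rng.standard_normal((latent_dim, obs_dim))   # linear encoder
P = 0.1 * rng.standard_normal((latent_dim, latent_dim))  # latent dynamics model

def sp_step(s, s_next, lr=0.01):
    """One gradient step on ||P phi(s) - stopgrad(phi(s_next))||^2 / 2."""
    global Phi, P
    z, z_next = Phi @ s, Phi @ s_next
    err = P @ z - z_next                 # z_next is a detached (constant) target
    loss = 0.5 * err @ err
    grad_P = np.outer(err, z)
    grad_Phi = np.outer(P.T @ err, s)    # stop-gradient: no term through z_next
    P -= lr * grad_P
    Phi -= lr * grad_Phi
    return loss

# Toy linear dynamics s' = A s as training data.
A = rng.standard_normal((obs_dim, obs_dim)) / np.sqrt(obs_dim)
s = rng.standard_normal(obs_dim)
losses = [sp_step(s, A @ s) for _ in range(200)]
```

The stop-gradient on the target is exactly the optimization device whose role the paper analyzes, including why it helps avoid degenerate solutions.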