Aditya Mahajan

Reza Pourmohammadi Najafabadi

Collaborating Alumni - McGill University

Website

Google Scholar

Arka Ian Goswami

Master's Research - McGill University

Gaspard Lambrechts

Postdoctorate - McGill University

Co-supervisor :

Master's Research - Université de Montréal

Samin Nili

PhD - McGill University

Google Scholar

Master's Research - McGill University

Github

PhD - McGill University

Github

Google Scholar

Tao Zhang

PhD - McGill University

Publications

Sub-optimality bounds for certainty equivalent policies in partially observed systems

Ashutosh Nayyar

Yi Ouyang

In this paper, we present a generalization of the certainty equivalence principle of stochastic control. One interpretation of the classical… (see more) certainty equivalence principle for linear systems with output feedback and quadratic costs is as follows: the optimal action at each time is obtained by evaluating the optimal state-feedback policy of the stochastic linear system at the minimum mean square error (MMSE) estimate of the state. Motivated by this interpretation, we consider certainty equivalent policies for general (non-linear) partially observed stochastic systems that allow for any state estimate rather than restricting to MMSE estimates. In such settings, the certainty equivalent policy is not optimal. For models where the cost and the dynamics are smooth in an appropriate sense, we derive upper bounds on the sub-optimality of certainty equivalent policies. We present several examples to illustrate the results.

2026-02-01

ArXiv (preprint)

arxiv.org

Generalized certainty equivalence based policies in partially observable systems

Ashutosh Nayyar

Yi Ouyang

In this paper, we present a generalization of the certainty equivalence principle of stochastic control. One interpretation of the classical… (see more) certainty equivalence principle for linear systems with output feedback and quadratic costs is as follows: the optimal action at each time is obtained by evaluating the optimal state-feedback policy of the stochastic linear system at the minimum mean square error (MMSE) estimate of the state. Motivated by this interpretation, we consider certainty equivalent policies for general (non-linear) partially observed stochastic systems and allow for any state estimate rather than restricting to MMSE estimates. In such settings, the certainty equivalent policy is not optimal. For models with Lipschitz cost and dynamics, we derive upper bounds on the sub-optimality of certainty equivalent policies in terms of expected error of the proposed estimator. We present several examples to illustrate the results.

2025-12-08

IEEE Conference on Decision and Control (published)

A Theoretical Justification for Asymmetric Actor-Critic Algorithms

Gaspard Lambrechts

Damien Ernst

In reinforcement learning for partially observable environments, many successful algorithms have been developed within the asymmetric learni… (see more)ng paradigm. This paradigm leverages additional state information available at training time for faster learning. Although the proposed learning objectives are usually theoretically sound, these methods still lack a precise theoretical justification for their potential benefits. We propose such a justification for asymmetric actor-critic algorithms with linear function approximators by adapting a finite-time convergence analysis to this setting. The resulting finite-time bound reveals that the asymmetric critic eliminates error terms arising from aliasing in the agent state.

2025-10-05

Proceedings of the 42nd International Conference on Machine Learning (published)

proceedings.mlr.press

Convergence of regularized agent-state based Q-learning in POMDPs

Matthieu Geist

In this paper, we present a framework to understand the convergence of commonly used Q-learning reinforcement learning algorithms in practic… (see more)e. Two salient features of such algorithms are: (i) the Q-table is recursively updated using an agent state (such as the state of a recurrent neural network) which is not a belief state or an information state and (ii) policy regularization is often used to encourage exploration and stabilize the learning algorithm. We investigate the simplest form of such Q-learning algorithms which we call regularized agent-state based Q-learning (RASQL) and show that it converges under mild technical conditions to the fixed point of an appropriately defined regularized MDP, which depends on the stationary distribution induced by the behavioral policy. We also show that a similar analysis continues to work for a variant of RASQL that learns periodic policies. We present numerical examples to illustrate that the empirical convergence behavior matches with the proposed theoretical limit.

2025-07-16

EWRL/2025/Workshop (poster)

openreview.net

Model approximation in MDPs with unbounded per-step cost

Ashutosh Nayyar

Yi Ouyang

We consider the problem of designing a control policy for an infinite-horizon discounted cost Markov decision process …

2025-06-30

IEEE Transactions on Automatic Control (published)

arxiv.org

Low-Dimensional solutions for optimal control of network-coupled subsystems over a directed network

Mohamed-Amine Azzouz

Shuang Gao

In this paper, we investigate optimal control of network-coupled subsystems, where the coupling between the dynamics of the subsystems is re… (see more)presented by the adjacency or Laplacian matrix of a directed graph. Under the assumption that the coupling matrix is normal and the cost coupling is compatible with the dynamics coupling, we use the spectral decomposition of the coupling matrix to decompose the overall system into at most n systems with noise coupled dynamics and decoupled cost, where n is the size of the network. Furthermore, the optimal control input at each subsystem can be computed by solving n1 decoupled Riccati equations where n1 (n1 ≤ n) denotes the number of distinct eigenvalues of the coupling matrix, where complex conjugate pairs are not double-counted. A salient feature of the result is that the solution complexity depends on the number of distinct eigenvalues of the coupling matrix rather than the size of the network. Therefore, the proposed solution framework provides a scalable method for synthesizing and implementing optimal control laws for large-scale network-coupled subsystems.

2024-12-31

CDC (published)

Agent-state based policies in POMDPs: Beyond belief-state MDPs

The traditional approach to POMDPs is to convert them into fully observed MDPs by considering a belief state as an information state. Howeve… (see more)r, a belief-state based approach requires perfect knowledge of the system dynamics and is therefore not applicable in the learning setting where the system model is unknown. Various approaches to circumvent this limitation have been proposed in the literature. We present a unified treatment of some of these approaches by viewing them as models where the agent maintains a local recursively updateable “agent state” and chooses actions based on the agent state. We highlight the different classes of agent-state based policies and the various approaches that have been proposed in the literature to find good policies within each class. These include the designer’s approach to find optimal non-stationary agent-state based policies, policy search approaches to find a locally optimal stationary agent-state based policies, and the approximate information state to find approximately optimal stationary agent-state based policies. We then present how ideas from the approximate information state approach have been used to improve Q-learning and actor-critic algorithms for learning in POMDPs.

2024-12-15

2024 IEEE 63rd Conference on Decision and Control (CDC) (published)

arxiv.org

Constant step-size stochastic approximation with delayed updates

Silviu-Iulian Niculescu

Mathukumalli Vidyasagar

In this paper, we consider constant step-size stochastic approximation with delayed updates. For the non-delayed case, it is well known that… (see more) under appropriate conditions, the discrete-time iterates of stochastic approximation track the trajectory of a continuous-time ordinary differential equation (ODE). For the delayed case, we show in this paper that, under appropriate conditions, the discrete-time iterates track the trajectory of a delay-differential equation (DDE) rather than an ODE. Thus, delayed updates lead to a qualitative change in the behavior of constant step-size stochastic approximation. We present multiple examples to illustrate the qualitative affect of delay and show that increasing the delay is generally destabilizing but, for some systems, it can be stabilizing as well.

2024-12-15

2024 IEEE 63rd Conference on Decision and Control (CDC) (published)

A vector almost-supermartingale convergence theorem and its applications

Silviu-Iulian Niculescu

Mathukumalli Vidyasagar

The almost-supermartingale convergence theorem of Robbins and Siegmund (1971) is a fundamental tool for establishing the convergence of vari… (see more)ous stochastic iterative algorithms including system identification, adaptive control, and reinforcement learning. The theorem is stated for non-negative scalar valued stochastic processes. In this paper, we generalize the theorem to non-negative vector valued stochastic processes and provide two set of sufficient conditions for such processes to converge almost surely. We present several applications of vector almost-supermartingale convergence theorem, including convergence of autoregressive supermartingales, delayed supermartingales, and stochastic approximation with delayed updates.

2024-12-15

2024 IEEE 63rd Conference on Decision and Control (CDC) (published)

Periodic agent-state based Q-learning for POMDPs

Matthieu Geist

2024-09-24

NeurIPS.cc/2024/Conference (poster)

openreview.net

On learning history-based policies for controlling Markov decision processes

Gandharv Patil

Doina Precup

Reinforcementlearning(RL)folkloresuggeststhathistory-basedfunctionapproximationmethods,suchas recurrent neural nets or history-based state a… (see more)bstraction, perform better than their memory-less counterparts, due to the fact that function approximation in Markov decision processes (MDP) can be viewed as inducing a Partially observable MDP. However, there has been little formal analysis of such history-based algorithms, as most existing frameworks focus exclusively on memory-less features. In this paper, we introduce a theoretical framework for studying the behaviour of RL algorithms that learn to control an MDP using history-based feature abstraction mappings. Furthermore, we use this framework to design a practical RL algorithm and we numerically evaluate its effectiveness on a set of continuous control tasks.

2024-04-17

Proceedings of The 27th International Conference on Artificial Intelligence and Statistics (published)

proceedings.mlr.press

Bridging State and History Representations: Understanding Self-Predictive RL

Benjamin Eysenbach

Representations are at the core of all deep reinforcement learning (RL) methods for both Markov decision processes (MDPs) and partially obse… (see more)rvable Markov decision processes (POMDPs). Many representation learning methods and theoretical frameworks have been developed to understand what constitutes an effective representation. However, the relationships between these methods and the shared properties among them remain unclear. In this paper, we show that many of these seemingly distinct methods and frameworks for state and history abstractions are, in fact, based on a common idea of self-predictive abstraction. Furthermore, we provide theoretical insights into the widely adopted objectives and optimization, such as the stop-gradient technique, in learning self-predictive representations. These findings together yield a minimalist algorithm to learn self-predictive representations for states and histories. We validate our theories by applying our algorithm to standard MDPs, MDPs with distractors, and POMDPs with sparse rewards. These findings culminate in a set of preliminary guidelines for RL practitioners.

2024-01-15

ICLR.cc/2024/Conference (poster)