Aditya Mahajan

Biography

Aditya Mahajan is a professor in the Department of Electrical and Computer Engineering at McGill University and an associate academic member of Mila – Quebec Artificial Intelligence Institute.

He is also a member of the McGill Centre for Intelligent Machines (CIM), the International Laboratory for Learning Systems (ILLS), and the Group for Research in Decision Analysis (GERAD). Mahajan received his BTech degree in electrical engineering from the Indian Institute of Technology Kanpur, and his MSc and PhD degrees in electrical engineering and computer science from the University of Michigan at Ann Arbor.

He is a senior member of the U.S. Institute of Electrical and Electronics Engineers (IEEE), as well as a member of Professional Engineers Ontario. He currently serves as associate editor for IEEE Transactions on Automatic Control, IEEE Control Systems Letters, and Mathematics of Control, Signals, and Systems (Springer). He served as associate editor for the conference editorial board of the IEEE Control Systems Society from 2014 to 2017.

Mahajan’s numerous awards include the 2015 George Axelby Outstanding Paper Award, 2016 NSERC Discovery Accelerator Award, 2014 CDC Best Student Paper Award (as supervisor), and 2016 NecSys Best Student Paper Award (as supervisor). Mahajan’s principal research interests are stochastic control and reinforcement learning.

Current Students

Mohamed-Amine Azzouz

Master's Research - McGill University

Collaborating Alumni - McGill University

Website

Arka Ian Goswami

Master's Research - McGill University

Gaspard Lambrechts

Postdoctorate - McGill University

Co-supervisor :

Master's Research - Université de Montréal

Samin Nili

PhD - McGill University

Reza Pourmohammadi Najafabadi

Master's Research - McGill University

Github

Collaborating Alumni - McGill University

Github

Tao Zhang

PhD - McGill University

Publications

Risk-seeking conservative policy iteration with agent-state based policies for Dec-POMDPs with guaranteed convergence

Matthieu Geist

Optimally solving decentralized decision-making problems modeled as Dec-POMDPs is known to be NEXP-complete. These optimal solutions are pol… (see more)icies based on the entire history of observations and actions of an agent. However, some applications may require more compact policies because of limited compute capabilities, which can be modeled by considering a limited number of memory states (or agent states). While such an agent-state based policy class may not contain the optimal solution, it is still of practical interest to find the best agent-state policy within the class. We focus on an iterated best response style algorithm which guarantees monotonic improvements and convergence to a local optimum in polynomial runtime in the Dec-POMDP model size. In order to obtain a better local optimum, we use a modified objective which incentivizes risk-seeking alongside a conservative policy iteration update. Our empirical results show that our approach performs as well as state-of-the-art approaches on several benchmark Dec-POMDPs, achieving near-optimal performance while having polynomial runtime despite the limited memory. We also show that using more agent states (a larger memory) leads to greater performance. Our approach provides a novel way of incorporating memory constraints on the agents in the Dec-POMDP problem.

2026-04-09

arXiv (preprint)

Operator-Theoretic Foundations and Policy Gradient Methods for General MDPs with Unbounded Costs

Abhishek Gupta

Markov decision processes (MDPs) is viewed as an optimization of an objective function over certain linear operators over general function s… (see more)paces. Using the well-established perturbation theory of linear operators, this viewpoint allows one to identify derivatives of the objective function as a function of the linear operators. This leads to generalization of many well-known results in reinforcement learning to cases with generate state and action spaces. Prior results of this type were only established in the finite-state finite-action MDP settings and in settings with certain linear function approximations. The framework also leads to new low-complexity PPO-type reinforcement learning algorithms for general state and action space MDPs.

2026-03-17

arXiv (preprint)

Sub-optimality bounds for certainty equivalent policies in partially observed systems

Ashutosh Nayyar

Yi Ouyang

In this paper, we present a generalization of the certainty equivalence principle of stochastic control. One interpretation of the classical… (see more) certainty equivalence principle for linear systems with output feedback and quadratic costs is as follows: the optimal action at each time is obtained by evaluating the optimal state-feedback policy of the stochastic linear system at the minimum mean square error (MMSE) estimate of the state. Motivated by this interpretation, we consider certainty equivalent policies for general (non-linear) partially observed stochastic systems that allow for any state estimate rather than restricting to MMSE estimates. In such settings, the certainty equivalent policy is not optimal. For models where the cost and the dynamics are smooth in an appropriate sense, we derive upper bounds on the sub-optimality of certainty equivalent policies. We present several examples to illustrate the results.

2026-02-01

ArXiv (preprint)

Generalized certainty equivalence based policies in partially observable systems

Ashutosh Nayyar

Yi Ouyang

In this paper, we present a generalization of the certainty equivalence principle of stochastic control. One interpretation of the classical… (see more) certainty equivalence principle for linear systems with output feedback and quadratic costs is as follows: the optimal action at each time is obtained by evaluating the optimal state-feedback policy of the stochastic linear system at the minimum mean square error (MMSE) estimate of the state. Motivated by this interpretation, we consider certainty equivalent policies for general (non-linear) partially observed stochastic systems and allow for any state estimate rather than restricting to MMSE estimates. In such settings, the certainty equivalent policy is not optimal. For models with Lipschitz cost and dynamics, we derive upper bounds on the sub-optimality of certainty equivalent policies in terms of expected error of the proposed estimator. We present several examples to illustrate the results.

2025-12-08

IEEE Conference on Decision and Control (published)

A Theoretical Justification for Asymmetric Actor-Critic Algorithms

Gaspard Lambrechts

Damien Ernst

In reinforcement learning for partially observable environments, many successful algorithms have been developed within the asymmetric learni… (see more)ng paradigm. This paradigm leverages additional state information available at training time for faster learning. Although the proposed learning objectives are usually theoretically sound, these methods still lack a precise theoretical justification for their potential benefits. We propose such a justification for asymmetric actor-critic algorithms with linear function approximators by adapting a finite-time convergence analysis to this setting. The resulting finite-time bound reveals that the asymmetric critic eliminates error terms arising from aliasing in the agent state.

2025-10-05

Proceedings of the 42nd International Conference on Machine Learning (published)

proceedings.mlr.press

Convergence of regularized agent-state based Q-learning in POMDPs

Matthieu Geist

In this paper, we present a framework to understand the convergence of commonly used Q-learning reinforcement learning algorithms in practic… (see more)e. Two salient features of such algorithms are: (i) the Q-table is recursively updated using an agent state (such as the state of a recurrent neural network) which is not a belief state or an information state and (ii) policy regularization is often used to encourage exploration and stabilize the learning algorithm. We investigate the simplest form of such Q-learning algorithms which we call regularized agent-state based Q-learning (RASQL) and show that it converges under mild technical conditions to the fixed point of an appropriately defined regularized MDP, which depends on the stationary distribution induced by the behavioral policy. We also show that a similar analysis continues to work for a variant of RASQL that learns periodic policies. We present numerical examples to illustrate that the empirical convergence behavior matches with the proposed theoretical limit.

2025-07-16

EWRL/2025/Workshop (poster)

openreview.net

Model approximation in MDPs with unbounded per-step cost

Ashutosh Nayyar

Yi Ouyang

We consider the problem of designing a control policy for an infinite-horizon discounted cost Markov decision process …

2025-06-30

IEEE Transactions on Automatic Control (published)

Low-Dimensional solutions for optimal control of network-coupled subsystems over a directed network

Mohamed-Amine Azzouz

Shuang Gao

In this paper, we investigate optimal control of network-coupled subsystems, where the coupling between the dynamics of the subsystems is re… (see more)presented by the adjacency or Laplacian matrix of a directed graph. Under the assumption that the coupling matrix is normal and the cost coupling is compatible with the dynamics coupling, we use the spectral decomposition of the coupling matrix to decompose the overall system into at most n systems with noise coupled dynamics and decoupled cost, where n is the size of the network. Furthermore, the optimal control input at each subsystem can be computed by solving n1 decoupled Riccati equations where n1 (n1 ≤ n) denotes the number of distinct eigenvalues of the coupling matrix, where complex conjugate pairs are not double-counted. A salient feature of the result is that the solution complexity depends on the number of distinct eigenvalues of the coupling matrix rather than the size of the network. Therefore, the proposed solution framework provides a scalable method for synthesizing and implementing optimal control laws for large-scale network-coupled subsystems.

2024-12-31

CDC (published)

Agent-state based policies in POMDPs: Beyond belief-state MDPs

The traditional approach to POMDPs is to convert them into fully observed MDPs by considering a belief state as an information state. Howeve… (see more)r, a belief-state based approach requires perfect knowledge of the system dynamics and is therefore not applicable in the learning setting where the system model is unknown. Various approaches to circumvent this limitation have been proposed in the literature. We present a unified treatment of some of these approaches by viewing them as models where the agent maintains a local recursively updateable “agent state” and chooses actions based on the agent state. We highlight the different classes of agent-state based policies and the various approaches that have been proposed in the literature to find good policies within each class. These include the designer’s approach to find optimal non-stationary agent-state based policies, policy search approaches to find a locally optimal stationary agent-state based policies, and the approximate information state to find approximately optimal stationary agent-state based policies. We then present how ideas from the approximate information state approach have been used to improve Q-learning and actor-critic algorithms for learning in POMDPs.

2024-12-15

2024 IEEE 63rd Conference on Decision and Control (CDC) (published)

Constant step-size stochastic approximation with delayed updates

Silviu-Iulian Niculescu

Mathukumalli Vidyasagar

In this paper, we consider constant step-size stochastic approximation with delayed updates. For the non-delayed case, it is well known that… (see more) under appropriate conditions, the discrete-time iterates of stochastic approximation track the trajectory of a continuous-time ordinary differential equation (ODE). For the delayed case, we show in this paper that, under appropriate conditions, the discrete-time iterates track the trajectory of a delay-differential equation (DDE) rather than an ODE. Thus, delayed updates lead to a qualitative change in the behavior of constant step-size stochastic approximation. We present multiple examples to illustrate the qualitative affect of delay and show that increasing the delay is generally destabilizing but, for some systems, it can be stabilizing as well.

2024-12-15

2024 IEEE 63rd Conference on Decision and Control (CDC) (published)

A vector almost-supermartingale convergence theorem and its applications

Silviu-Iulian Niculescu

Mathukumalli Vidyasagar

The almost-supermartingale convergence theorem of Robbins and Siegmund (1971) is a fundamental tool for establishing the convergence of vari… (see more)ous stochastic iterative algorithms including system identification, adaptive control, and reinforcement learning. The theorem is stated for non-negative scalar valued stochastic processes. In this paper, we generalize the theorem to non-negative vector valued stochastic processes and provide two set of sufficient conditions for such processes to converge almost surely. We present several applications of vector almost-supermartingale convergence theorem, including convergence of autoregressive supermartingales, delayed supermartingales, and stochastic approximation with delayed updates.

2024-12-15

2024 IEEE 63rd Conference on Decision and Control (CDC) (published)

Periodic agent-state based Q-learning for POMDPs

Matthieu Geist

2024-09-24

NeurIPS.cc/2024/Conference (poster)