Portrait de Aditya Mahajan

Aditya Mahajan

Membre académique associé
Professeur agrégé, McGill University, Département de génie électrique et informatique
Sujets de recherche
Apprentissage par renforcement

Biographie

Aditya Mahajan est professeur de génie électrique et informatique à l'Université McGill. Il est membre du Centre sur les machines intelligentes (CIM) de McGill, de Mila – Institut québécois d’intelligence artificielle, du Laboratoire international des systèmes d'apprentissage (ILLS) et du Groupe d'études et de recherche en analyse des décisions (GERAD). Il est titulaire d'une licence en génie électrique de l'Indian Institute of Technology de Kanpur (Inde), ainsi que d'une maîtrise et d'un doctorat en génie électrique et en informatique de l'Université du Michigan à Ann Arbor (États-Unis).

Aditya Mahajan est membre senior de l'Institute of Electrical and Electronics Engineers (IEEE) et membre de Professional Engineers Ontario. Il est actuellement rédacteur en chef adjoint des IEEE Transactions on Automatic Control, des IEEE Control Systems Letters et de Mathematics of Control, Signals, and Systems (Springer). Il a été rédacteur associé au comité de rédaction de la conférence de l'IEEE Control Systems Society de 2014 à 2017.

Il a reçu le prix George Axelby 2015 récompensant un article exceptionnel, un supplément d’accélération à la découverte du Conseil de recherches en sciences naturelles et en génie du Canada (CRSNG) en 2016, le prix CDC du meilleur article étudiant 2014 (en tant que superviseur) et le prix NecSys du meilleur article étudiant 2016 (en tant que superviseur). Ses principaux domaines de recherche sont le contrôle stochastique et l'apprentissage par renforcement.

Étudiants actuels

Maîtrise recherche - McGill
Maîtrise recherche - McGill
Collaborateur·rice alumni - McGill
Maîtrise recherche - McGill
Postdoctorat - McGill
Co-superviseur⋅e :
Maîtrise recherche - UdeM
Doctorat - McGill
Maîtrise recherche - McGill
Doctorat - McGill
Doctorat - McGill

Publications

Sub-optimality bounds for certainty equivalent policies in partially observed systems
Ashutosh Nayyar
Yi Ouyang
In this paper, we present a generalization of the certainty equivalence principle of stochastic control. One interpretation of the classical… (voir plus) certainty equivalence principle for linear systems with output feedback and quadratic costs is as follows: the optimal action at each time is obtained by evaluating the optimal state-feedback policy of the stochastic linear system at the minimum mean square error (MMSE) estimate of the state. Motivated by this interpretation, we consider certainty equivalent policies for general (non-linear) partially observed stochastic systems that allow for any state estimate rather than restricting to MMSE estimates. In such settings, the certainty equivalent policy is not optimal. For models where the cost and the dynamics are smooth in an appropriate sense, we derive upper bounds on the sub-optimality of certainty equivalent policies. We present several examples to illustrate the results.
Generalized certainty equivalence based policies in partially observable systems
Ashutosh Nayyar
Yi Ouyang
In this paper, we present a generalization of the certainty equivalence principle of stochastic control. One interpretation of the classical… (voir plus) certainty equivalence principle for linear systems with output feedback and quadratic costs is as follows: the optimal action at each time is obtained by evaluating the optimal state-feedback policy of the stochastic linear system at the minimum mean square error (MMSE) estimate of the state. Motivated by this interpretation, we consider certainty equivalent policies for general (non-linear) partially observed stochastic systems and allow for any state estimate rather than restricting to MMSE estimates. In such settings, the certainty equivalent policy is not optimal. For models with Lipschitz cost and dynamics, we derive upper bounds on the sub-optimality of certainty equivalent policies in terms of expected error of the proposed estimator. We present several examples to illustrate the results.
A Theoretical Justification for Asymmetric Actor-Critic Algorithms
In reinforcement learning for partially observable environments, many successful algorithms have been developed within the asymmetric learni… (voir plus)ng paradigm. This paradigm leverages additional state information available at training time for faster learning. Although the proposed learning objectives are usually theoretically sound, these methods still lack a precise theoretical justification for their potential benefits. We propose such a justification for asymmetric actor-critic algorithms with linear function approximators by adapting a finite-time convergence analysis to this setting. The resulting finite-time bound reveals that the asymmetric critic eliminates error terms arising from aliasing in the agent state.
A Theoretical Justification for Asymmetric Actor-Critic Algorithms
Convergence of regularized agent-state based Q-learning in POMDPs
Matthieu Geist
In this paper, we present a framework to understand the convergence of commonly used Q-learning reinforcement learning algorithms in practic… (voir plus)e. Two salient features of such algorithms are: (i) the Q-table is recursively updated using an agent state (such as the state of a recurrent neural network) which is not a belief state or an information state and (ii) policy regularization is often used to encourage exploration and stabilize the learning algorithm. We investigate the simplest form of such Q-learning algorithms which we call regularized agent-state based Q-learning (RASQL) and show that it converges under mild technical conditions to the fixed point of an appropriately defined regularized MDP, which depends on the stationary distribution induced by the behavioral policy. We also show that a similar analysis continues to work for a variant of RASQL that learns periodic policies. We present numerical examples to illustrate that the empirical convergence behavior matches with the proposed theoretical limit.
Model approximation in MDPs with unbounded per-step cost
Ashutosh Nayyar
Yi Ouyang
We consider the problem of designing a control policy for an infinite-horizon discounted cost Markov decision process …
Low-Dimensional solutions for optimal control of network-coupled subsystems over a directed network
In this paper, we investigate optimal control of network-coupled subsystems, where the coupling between the dynamics of the subsystems is re… (voir plus)presented by the adjacency or Laplacian matrix of a directed graph. Under the assumption that the coupling matrix is normal and the cost coupling is compatible with the dynamics coupling, we use the spectral decomposition of the coupling matrix to decompose the overall system into at most n systems with noise coupled dynamics and decoupled cost, where n is the size of the network. Furthermore, the optimal control input at each subsystem can be computed by solving n1 decoupled Riccati equations where n1 (n1 ≤ n) denotes the number of distinct eigenvalues of the coupling matrix, where complex conjugate pairs are not double-counted. A salient feature of the result is that the solution complexity depends on the number of distinct eigenvalues of the coupling matrix rather than the size of the network. Therefore, the proposed solution framework provides a scalable method for synthesizing and implementing optimal control laws for large-scale network-coupled subsystems.
Constant step-size stochastic approximation with delayed updates
Silviu-Iulian Niculescu
Mathukumalli Vidyasagar
In this paper, we consider constant step-size stochastic approximation with delayed updates. For the non-delayed case, it is well known that… (voir plus) under appropriate conditions, the discrete-time iterates of stochastic approximation track the trajectory of a continuous-time ordinary differential equation (ODE). For the delayed case, we show in this paper that, under appropriate conditions, the discrete-time iterates track the trajectory of a delay-differential equation (DDE) rather than an ODE. Thus, delayed updates lead to a qualitative change in the behavior of constant step-size stochastic approximation. We present multiple examples to illustrate the qualitative affect of delay and show that increasing the delay is generally destabilizing but, for some systems, it can be stabilizing as well.
A vector almost-supermartingale convergence theorem and its applications
Silviu-Iulian Niculescu
Mathukumalli Vidyasagar
The almost-supermartingale convergence theorem of Robbins and Siegmund (1971) is a fundamental tool for establishing the convergence of vari… (voir plus)ous stochastic iterative algorithms including system identification, adaptive control, and reinforcement learning. The theorem is stated for non-negative scalar valued stochastic processes. In this paper, we generalize the theorem to non-negative vector valued stochastic processes and provide two set of sufficient conditions for such processes to converge almost surely. We present several applications of vector almost-supermartingale convergence theorem, including convergence of autoregressive supermartingales, delayed supermartingales, and stochastic approximation with delayed updates.
Periodic agent-state based Q-learning for POMDPs
Matthieu Geist
Periodic agent-state based Q-learning for POMDPs
Matthieu Geist
The standard approach for Partially Observable Markov Decision Processes (POMDPs) is to convert them to a fully observed belief-state MDP. H… (voir plus)owever, the belief state depends on the system model and is therefore not viable in reinforcement learning (RL) settings. A widely used alternative is to use an agent state, which is a model-free, recursively updateable function of the observation history. Examples include frame stacking and recurrent neural networks. Since the agent state is model-free, it is used to adapt standard RL algorithms to POMDPs. However, standard RL algorithms like Q-learning learn a stationary policy. Our main thesis that we illustrate via examples is that because the agent state does not satisfy the Markov property, non-stationary agent-state based policies can outperform stationary ones. To leverage this feature, we propose PASQL (periodic agent-state based Q-learning), which is a variant of agent-state-based Q-learning that learns periodic policies. By combining ideas from periodic Markov chains and stochastic approximation, we rigorously establish that PASQL converges to a cyclic limit and characterize the approximation error of the converged periodic policy. Finally, we present a numerical experiment to highlight the salient features of PASQL and demonstrate the benefit of learning periodic policies over stationary policies.
On learning history-based policies for controlling Markov decision processes