
Aditya Mahajan

Associate Academic Member
Associate Professor, McGill University, Department of Electrical and Computer Engineering
Research Topics
Reinforcement Learning

Biography

Aditya Mahajan is a professor of electrical and computer engineering at McGill University. He is a member of the McGill Centre for Intelligent Machines (CIM), Mila – Quebec Artificial Intelligence Institute, the International Laboratory on Learning Systems (ILLS), and the Group for Research in Decision Analysis (GERAD). He holds a bachelor's degree in electrical engineering from the Indian Institute of Technology Kanpur (India), and a master's degree and PhD in electrical engineering and computer science from the University of Michigan, Ann Arbor (United States).

Aditya Mahajan is a Senior Member of the Institute of Electrical and Electronics Engineers (IEEE) and a member of Professional Engineers Ontario. He currently serves as an associate editor of the IEEE Transactions on Automatic Control, IEEE Control Systems Letters, and Mathematics of Control, Signals, and Systems (Springer). He served as an associate editor on the conference editorial board of the IEEE Control Systems Society from 2014 to 2017.

He received the 2015 George Axelby Outstanding Paper Award, a Discovery Accelerator Supplement from the Natural Sciences and Engineering Research Council of Canada (NSERC) in 2016, the 2014 CDC Best Student Paper Award (as supervisor), and the 2016 NecSys Best Student Paper Award (as supervisor). His principal research interests are stochastic control and reinforcement learning.

Current Students

Master's Research - McGill
Master's Research - McGill
Alumni collaborator - McGill
Master's Research - McGill
Postdoctorate - McGill
Co-supervisor:
Master's Research - UdeM
PhD - McGill
Master's Research - McGill
PhD - McGill
PhD - McGill

Publications

Sub-optimality bounds for certainty equivalent policies in partially observed systems
Ashutosh Nayyar
Yi Ouyang
In this paper, we present a generalization of the certainty equivalence principle of stochastic control. One interpretation of the classical certainty equivalence principle for linear systems with output feedback and quadratic costs is as follows: the optimal action at each time is obtained by evaluating the optimal state-feedback policy of the stochastic linear system at the minimum mean square error (MMSE) estimate of the state. Motivated by this interpretation, we consider certainty equivalent policies for general (non-linear) partially observed stochastic systems that allow for any state estimate rather than restricting to MMSE estimates. In such settings, the certainty equivalent policy is not optimal. For models where the cost and the dynamics are smooth in an appropriate sense, we derive upper bounds on the sub-optimality of certainty equivalent policies. We present several examples to illustrate the results.
Generalized certainty equivalence based policies in partially observable systems
Ashutosh Nayyar
Yi Ouyang
In this paper, we present a generalization of the certainty equivalence principle of stochastic control. One interpretation of the classical certainty equivalence principle for linear systems with output feedback and quadratic costs is as follows: the optimal action at each time is obtained by evaluating the optimal state-feedback policy of the stochastic linear system at the minimum mean square error (MMSE) estimate of the state. Motivated by this interpretation, we consider certainty equivalent policies for general (non-linear) partially observed stochastic systems and allow for any state estimate rather than restricting to MMSE estimates. In such settings, the certainty equivalent policy is not optimal. For models with Lipschitz cost and dynamics, we derive upper bounds on the sub-optimality of certainty equivalent policies in terms of the expected error of the proposed estimator. We present several examples to illustrate the results.
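The certainty-equivalence idea in the two abstracts above can be illustrated with a small numerical sketch. This is an illustrative scalar linear-quadratic example with assumed parameters, not the papers' construction: the optimal state-feedback gain is computed offline, then evaluated at whatever state estimate is available, MMSE or otherwise.

```python
# Minimal sketch (illustrative, not the papers' method): certainty
# equivalence for a scalar linear-quadratic system x' = a x + b u + noise
# with per-step cost q x^2 + r u^2.

def lqr_gain(a, b, q, r, iters=500):
    """Iterate the scalar discrete-time Riccati equation to a fixed point."""
    p = q
    for _ in range(iters):
        p = q + a * a * p - (a * b * p) ** 2 / (r + b * b * p)
    return a * b * p / (r + b * b * p)   # optimal state feedback: u = -k x

a, b, q, r = 0.9, 1.0, 1.0, 0.5
k = lqr_gain(a, b, q, r)

x_hat = 1.3          # any state estimate produced by a filter
u_ce = -k * x_hat    # certainty-equivalent action: plug the estimate into the gain
```

The papers bound how far such a plug-in policy can be from optimal when the estimate is not the MMSE estimate and the system is not linear.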
A Theoretical Justification for Asymmetric Actor-Critic Algorithms
In reinforcement learning for partially observable environments, many successful algorithms have been developed within the asymmetric learning paradigm. This paradigm leverages additional state information available at training time for faster learning. Although the proposed learning objectives are usually theoretically sound, these methods still lack a precise theoretical justification for their potential benefits. We propose such a justification for asymmetric actor-critic algorithms with linear function approximators by adapting a finite-time convergence analysis to this setting. The resulting finite-time bound reveals that the asymmetric critic eliminates error terms arising from aliasing in the agent state.
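The structural idea of asymmetric actor-critic with linear function approximators can be sketched as follows. This is an assumed minimal setup, not the paper's exact algorithm: the critic is linear in the full state, available only at training time, while the actor acts on the partial observation.

```python
import numpy as np

# Illustrative asymmetric actor-critic step: the TD error from a
# *state*-based linear critic drives a policy that only sees the
# observation / agent state. Dimensions are arbitrary.
rng = np.random.default_rng(0)
state_dim, obs_dim, n_actions = 6, 3, 2
w = np.zeros(state_dim)                 # critic: V(s) = w @ s
theta = np.zeros((n_actions, obs_dim))  # actor logits: theta @ o

def pi(o):
    logits = theta @ o
    z = np.exp(logits - logits.max())
    return z / z.sum()

def asymmetric_ac_step(s, o, a, r, s_next, gamma=0.99, alpha=0.05, beta=0.01):
    global w, theta
    delta = r + gamma * w @ s_next - w @ s   # TD error uses the full state s
    w += alpha * delta * s                   # critic update: privileged info
    grad = -np.outer(pi(o), o)               # grad of log pi(a|o) wrt theta
    grad[a] += o
    theta += beta * delta * grad             # actor update: only the observation
    return delta
```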
Convergence of regularized agent-state based Q-learning in POMDPs
Matthieu Geist
In this paper, we present a framework to understand the convergence of commonly used Q-learning reinforcement learning algorithms in practice. Two salient features of such algorithms are: (i) the Q-table is recursively updated using an agent state (such as the state of a recurrent neural network) which is not a belief state or an information state and (ii) policy regularization is often used to encourage exploration and stabilize the learning algorithm. We investigate the simplest form of such Q-learning algorithms which we call regularized agent-state based Q-learning (RASQL) and show that it converges under mild technical conditions to the fixed point of an appropriately defined regularized MDP, which depends on the stationary distribution induced by the behavioral policy. We also show that a similar analysis continues to work for a variant of RASQL that learns periodic policies. We present numerical examples to illustrate that the empirical convergence behavior matches with the proposed theoretical limit.
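The two salient features above can be sketched in a few lines. This is an illustrative entropy-regularized update on an agent-state-indexed Q-table (assumed setup; the RASQL analyzed in the paper may differ in details): the table is indexed by a recursively updated agent state rather than a belief state, and the max is replaced by a soft-max backup.

```python
import numpy as np

n_agent_states, n_actions = 4, 2
Q = np.zeros((n_agent_states, n_actions))
tau, gamma, alpha = 0.5, 0.9, 0.1   # temperature, discount, step size

def soft_value(q_row):
    """Regularized backup: tau * logsumexp(Q/tau), which upper-bounds max(Q)."""
    m = q_row.max()
    return m + tau * np.log(np.exp((q_row - m) / tau).sum())

def rasql_update(z, a, r, z_next):
    """One update; z and z_next are agent states, not belief states."""
    target = r + gamma * soft_value(Q[z_next])
    Q[z, a] += alpha * (target - Q[z, a])
```

The paper's point is that iterations of this form converge to the fixed point of a regularized MDP that depends on the stationary distribution of the behavioral policy.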
Model approximation in MDPs with unbounded per-step cost
Ashutosh Nayyar
Yi Ouyang
We consider the problem of designing a control policy for an infinite-horizon discounted cost Markov decision process …
Low-Dimensional solutions for optimal control of network-coupled subsystems over a directed network
In this paper, we investigate optimal control of network-coupled subsystems, where the coupling between the dynamics of the subsystems is represented by the adjacency or Laplacian matrix of a directed graph. Under the assumption that the coupling matrix is normal and the cost coupling is compatible with the dynamics coupling, we use the spectral decomposition of the coupling matrix to decompose the overall system into at most n systems with noise coupled dynamics and decoupled cost, where n is the size of the network. Furthermore, the optimal control input at each subsystem can be computed by solving n₁ decoupled Riccati equations, where n₁ (n₁ ≤ n) denotes the number of distinct eigenvalues of the coupling matrix, where complex conjugate pairs are not double-counted. A salient feature of the result is that the solution complexity depends on the number of distinct eigenvalues of the coupling matrix rather than the size of the network. Therefore, the proposed solution framework provides a scalable method for synthesizing and implementing optimal control laws for large-scale network-coupled subsystems.
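The counting step in the abstract can be illustrated numerically. This is a sketch under assumed data (a random skew-symmetric, hence normal, coupling matrix), showing how conjugate pairs collapse so that the number n₁ of Riccati equations can be far below the network size n:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
S = rng.standard_normal((n, n))
M = S - S.T                      # skew-symmetric => normal coupling matrix

eigvals = np.linalg.eigvals(M)   # purely imaginary, in conjugate pairs plus 0

# Distinct eigenvalues, identifying each conjugate pair (ev, conj(ev))
# by (real part, |imag part|); rounding absorbs floating-point noise.
canon = {(round(ev.real, 8), round(abs(ev.imag), 8)) for ev in eigvals}
n1 = len(canon)                  # number of decoupled Riccati equations
```

For a generic 5-node network of this form, n₁ = 3 (the zero eigenvalue plus two conjugate pairs), so only 3 Riccati equations are solved instead of 5.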
Agent-state based policies in POMDPs: Beyond belief-state MDPs
The traditional approach to POMDPs is to convert them into fully observed MDPs by considering a belief state as an information state. However, a belief-state based approach requires perfect knowledge of the system dynamics and is therefore not applicable in the learning setting where the system model is unknown. Various approaches to circumvent this limitation have been proposed in the literature. We present a unified treatment of some of these approaches by viewing them as models where the agent maintains a local recursively updateable “agent state” and chooses actions based on the agent state. We highlight the different classes of agent-state based policies and the various approaches that have been proposed in the literature to find good policies within each class. These include the designer’s approach to find optimal non-stationary agent-state based policies, policy search approaches to find locally optimal stationary agent-state based policies, and the approximate information state approach to find approximately optimal stationary agent-state based policies. We then present how ideas from the approximate information state approach have been used to improve Q-learning and actor-critic algorithms for learning in POMDPs.
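A concrete instance of the "local recursively updateable agent state" may help. This illustrative example (my own, not from the paper) uses a fixed-length window of recent observation-action pairs, which can be updated without any knowledge of the system model, unlike a belief state, which requires the dynamics:

```python
from collections import deque

class WindowAgentState:
    """Agent state = last k (observation, action) pairs, updated recursively."""
    def __init__(self, k):
        self.buf = deque(maxlen=k)   # bounded memory: old pairs are dropped

    def update(self, obs, action):
        """Recursive update z' = phi(z, obs, action)."""
        self.buf.append((obs, action))
        return tuple(self.buf)       # hashable, so usable as a table index

z = WindowAgentState(k=2)
s1 = z.update("y1", "a1")
s2 = z.update("y2", "a2")
s3 = z.update("y3", "a3")            # oldest pair dropped automatically
```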
Constant step-size stochastic approximation with delayed updates
Silviu-Iulian Niculescu
Mathukumalli Vidyasagar
In this paper, we consider constant step-size stochastic approximation with delayed updates. For the non-delayed case, it is well known that under appropriate conditions, the discrete-time iterates of stochastic approximation track the trajectory of a continuous-time ordinary differential equation (ODE). For the delayed case, we show in this paper that, under appropriate conditions, the discrete-time iterates track the trajectory of a delay-differential equation (DDE) rather than an ODE. Thus, delayed updates lead to a qualitative change in the behavior of constant step-size stochastic approximation. We present multiple examples to illustrate the qualitative effect of delay and show that increasing the delay is generally destabilizing but, for some systems, it can be stabilizing as well.
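The destabilizing effect of delay can be demonstrated with a toy simulation (illustrative parameters, not the paper's examples): the recursion x_{k+1} = x_k + a·f(x_{k-d}) with f(x) = -x tracks the stable ODE dx/dt = -x when d = 0, while a large delay pushes it into the unstable regime of the corresponding DDE.

```python
import numpy as np

def run_sa(delay, a=0.05, steps=2000, seed=0):
    """Constant step-size SA with a delayed update and small noise.
    Returns the peak magnitude over the last 400 iterates."""
    rng = np.random.default_rng(seed)
    xs = [1.0] * (delay + 1)                  # initial history for the DDE
    for _ in range(steps):
        x_delayed = xs[-(delay + 1)]          # update uses a stale iterate
        xs.append(xs[-1] + a * (-x_delayed + 0.01 * rng.standard_normal()))
    return max(abs(x) for x in xs[-400:])

small = run_sa(delay=0)    # stable: iterates settle near 0
large = run_sa(delay=80)   # unstable: a * delay exceeds the DDE threshold
```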
A vector almost-supermartingale convergence theorem and its applications
Silviu-Iulian Niculescu
Mathukumalli Vidyasagar
The almost-supermartingale convergence theorem of Robbins and Siegmund (1971) is a fundamental tool for establishing the convergence of various stochastic iterative algorithms including system identification, adaptive control, and reinforcement learning. The theorem is stated for non-negative scalar valued stochastic processes. In this paper, we generalize the theorem to non-negative vector valued stochastic processes and provide two sets of sufficient conditions for such processes to converge almost surely. We present several applications of the vector almost-supermartingale convergence theorem, including convergence of autoregressive supermartingales, delayed supermartingales, and stochastic approximation with delayed updates.
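For reference, the scalar theorem being generalized can be stated as follows (standard form of Robbins–Siegmund, 1971):

```latex
% Let V_k, a_k, b_k, c_k be non-negative, \mathcal{F}_k-measurable processes with
\mathbb{E}[V_{k+1} \mid \mathcal{F}_k] \le (1 + a_k)\,V_k + b_k - c_k,
\qquad \textstyle\sum_k a_k < \infty, \quad \sum_k b_k < \infty \ \text{a.s.}
% Then V_k converges almost surely and
V_k \to V_\infty \ \text{a.s.}, \qquad \textstyle\sum_k c_k < \infty \ \text{a.s.}
```

The paper extends this statement to non-negative vector-valued processes, where the ordering and the contraction structure require additional conditions.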
Periodic agent-state based Q-learning for POMDPs
Matthieu Geist
On learning history-based policies for controlling Markov decision processes
Reinforcement learning (RL) folklore suggests that history-based function approximation methods, such as recurrent neural nets or history-based state abstraction, perform better than their memory-less counterparts, due to the fact that function approximation in Markov decision processes (MDP) can be viewed as inducing a partially observable MDP. However, there has been little formal analysis of such history-based algorithms, as most existing frameworks focus exclusively on memory-less features. In this paper, we introduce a theoretical framework for studying the behaviour of RL algorithms that learn to control an MDP using history-based feature abstraction mappings. Furthermore, we use this framework to design a practical RL algorithm and we numerically evaluate its effectiveness on a set of continuous control tasks.
Bridging State and History Representations: Understanding Self-Predictive RL
Representations are at the core of all deep reinforcement learning (RL) methods for both Markov decision processes (MDPs) and partially observable Markov decision processes (POMDPs). Many representation learning methods and theoretical frameworks have been developed to understand what constitutes an effective representation. However, the relationships between these methods and the shared properties among them remain unclear. In this paper, we show that many of these seemingly distinct methods and frameworks for state and history abstractions are, in fact, based on a common idea of self-predictive abstraction. Furthermore, we provide theoretical insights into the widely adopted objectives and optimization, such as the stop-gradient technique, in learning self-predictive representations. These findings together yield a minimalist algorithm to learn self-predictive representations for states and histories. We validate our theories by applying our algorithm to standard MDPs, MDPs with distractors, and POMDPs with sparse rewards. These findings culminate in a set of preliminary guidelines for RL practitioners.
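The self-predictive objective with a stop-gradient target can be sketched in a few lines of numpy. This is an illustrative linear parameterization (the names, the linear encoder, and the toy data are assumptions, not the paper's setup): an encoder Phi maps observations to latents, and a latent model P is trained to predict the next latent, with no gradient flowing through the target encoding.

```python
import numpy as np

rng = np.random.default_rng(0)
obs_dim, latent_dim = 8, 3
Phi = 0.1 * rng.standard_normal((latent_dim, obs_dim))   # linear encoder
P = 0.1 * rng.standard_normal((latent_dim, latent_dim))  # latent dynamics model

def sp_step(s, s_next, lr=0.01):
    """One gradient step on ||P phi(s) - stopgrad(phi(s_next))||^2 / 2."""
    global Phi, P
    z, z_next = Phi @ s, Phi @ s_next
    err = P @ z - z_next                 # z_next is a detached (constant) target
    loss = 0.5 * err @ err
    grad_P = np.outer(err, z)
    grad_Phi = np.outer(P.T @ err, s)    # stop-gradient: no term through z_next
    P -= lr * grad_P
    Phi -= lr * grad_Phi
    return loss

# Toy linear dynamics s' = A s as training data.
A = rng.standard_normal((obs_dim, obs_dim)) / np.sqrt(obs_dim)
s = rng.standard_normal(obs_dim)
losses = [sp_step(s, A @ s) for _ in range(200)]
```

The stop-gradient on the target is exactly the optimization device whose role the paper analyzes, including why it helps avoid degenerate solutions.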