Portrait de Aditya Mahajan

Aditya Mahajan

Membre académique associé
Professeur agrégé, McGill University, Département de génie électrique et informatique
Sujets de recherche
Apprentissage par renforcement

Biographie

Aditya Mahajan est professeur de génie électrique et informatique à l'Université McGill. Il est membre du Centre sur les machines intelligentes (CIM) de McGill, de Mila – Institut québécois d’intelligence artificielle, du Laboratoire international des systèmes d'apprentissage (ILLS) et du Groupe d'études et de recherche en analyse des décisions (GERAD). Il est titulaire d'une licence en génie électrique de l'Indian Institute of Technology de Kanpur (Inde), ainsi que d'une maîtrise et d'un doctorat en génie électrique et en informatique de l'Université du Michigan à Ann Arbor (États-Unis).

Aditya Mahajan est membre senior de l'Institute of Electrical and Electronics Engineers (IEEE) et membre de Professional Engineers Ontario. Il est actuellement rédacteur en chef adjoint des IEEE Transactions on Automatic Control, des IEEE Control Systems Letters et de Mathematics of Control, Signals, and Systems (Springer). Il a été rédacteur associé au comité de rédaction de la conférence de l'IEEE Control Systems Society de 2014 à 2017.

Il a reçu le prix George Axelby 2015 récompensant un article exceptionnel, un supplément d’accélération à la découverte du Conseil de recherches en sciences naturelles et en génie du Canada (CRSNG) en 2016, le prix CDC du meilleur article étudiant 2014 (en tant que superviseur) et le prix NecSys du meilleur article étudiant 2016 (en tant que superviseur). Ses principaux domaines de recherche sont le contrôle stochastique et l'apprentissage par renforcement.

Étudiants actuels

Maîtrise recherche - McGill
Collaborateur·rice alumni - McGill
Collaborateur·rice alumni - McGill
Doctorat - McGill
Maîtrise recherche - McGill
Doctorat - McGill
Doctorat - McGill
Doctorat - McGill
Doctorat - McGill

Publications

A modified Thompson sampling-based learning algorithm for unknown linear systems
Mukul Gagrani
Sagar Sudhakara
Rahul Jain
Ashutosh Nayyar
Yi Ouyang
We revisit the Thompson sampling-based learning algorithm for controlling an unknown linear system with quadratic cost proposed in [1]. This… (voir plus) algorithm operates in episodes of dynamic length and it is shown to have a regret bound of
Partially observable restless bandits with restarts: indexability and computation of Whittle index
Nima Akbarzadeh
We consider restless bandits with restarts, where the state of the active arms resets according to a known probability distribution while th… (voir plus)e state of the passive arms evolves in a Markovian manner. We assume that the state of the arm is observed after it is reset but not observed otherwise. We show that the model is indexable and propose an efficient algorithm to compute the Whittle index by exploiting the qualitative properties of the optimal policy. A detailed numerical study of machine repair models shows that Whittle index policy outperforms myopic policy and is close to optimal policy.
Thompson-Sampling Based Reinforcement Learning for Networked Control of Unknown Linear Systems
Borna Sayedana
Mohammad Afshari
Peter E. Caines
In recent years, there has been considerable interest in reinforcement learning for linear quadratic Gaussian (LQG) systems. In this paper, … (voir plus)we consider a generalization of such systems where the controller and the plant are connected over an unreliable packet drop channel. Packet drops cause the system dynamics to switch between controlled and uncontrolled modes. This switching phenomena introduces new challenges in designing learning algorithms. We identify a sufficient condition under which the regret of Thompson sampling-based reinforcement learning algorithm with dynamic episodes (TSDE) at horizon T is bounded by
Structure-Aware Reinforcement Learning for Node-Overload Protection in Mobile Edge Computing
Anirudha Jitani
Zhongwen Zhu
Hatem Abou-Zeid
Emmanuel Thepie Fapi
Hakimeh Purmehdi
Mobile Edge Computing (MEC) involves placing computational capability and applications at the edge of the network, providing benefits such a… (voir plus)s reduced latency, reduced network congestion, and improved performance of applications. The performance and reliability of MEC degrades significantly when the edge server(s) in the cluster are overloaded. In this work, an adaptive admission control policy to prevent edge node from getting overloaded is presented. This approach is based on a recently-proposed low complexity RL (Reinforcement Learning) algorithm called SALMUT (Structure-Aware Learning for Multiple Thresholds), which exploits the structure of the optimal admission control policy in multi-class queues for an average-cost setting. We extend the framework to work for node overload-protection problem in a discounted-cost setting. The proposed solution is validated using several scenarios mimicking real-world deployments in two different settings — computer simulations and a docker testbed. Our empirical evaluations show that the total discounted cost incurred by SALMUT is similar to state-of-the-art deep RL algorithms such as PPO (Proximal Policy Optimization) and A2C (Advantage Actor Critic) but requires an order of magnitude less time to train, outputs easily interpretable policy, and can be deployed in an online manner.
Conditions for indexability of restless bandits and an
$\mathcal{O}\!\left(K^3\right)$
algorithm to compute Whittle index
Nima Akbarzadeh
Abstract Restless bandits are a class of sequential resource allocation problems concerned with allocating one or more resources among sever… (voir plus)al alternative processes where the evolution of the processes depends on the resources allocated to them. Such models capture the fundamental trade-offs between exploration and exploitation. In 1988, Whittle developed an index heuristic for restless bandit problems which has emerged as a popular solution approach because of its simplicity and strong empirical performance. The Whittle index heuristic is applicable if the model satisfies a technical condition known as indexability. In this paper, we present two general sufficient conditions for indexability and identify simpler-to-verify refinements of these conditions. We then revisit a previously proposed algorithm called the adaptive greedy algorithm which is known to compute the Whittle index for a sub-class of restless bandits. We show that a generalization of the adaptive greedy algorithm computes the Whittle index for all indexable restless bandits. We present an efficient implementation of this algorithm which can compute the Whittle index of a restless bandit with K states in
Optimal Control of Network-Coupled Subsystems: Spectral Decomposition and Low-Dimensional Solutions
Shuang Gao
In this article, we investigate the optimal control of network-coupled subsystems with coupled dynamics and costs. The dynamics coupling may… (voir plus) be represented by the adjacency matrix, the Laplacian matrix, or any other symmetric matrix corresponding to an underlying weighted undirected graph. Cost couplings are represented by two coupling matrices which have the same eigenvectors as the coupling matrix in the dynamics. We use the spectral decomposition of these three coupling matrices to decompose the overall system into
Staged independent learning: Towards decentralized cooperative multi-agent Reinforcement Learning
Hadi Nekoei
Akilesh Badrinaaraayanan
Amit Sinha
Mohammad Amini
Janarthanan Rajendran
We empirically show that classic ideas from two-time scale stochastic approximation \citep{borkar1997stochastic} can be combined with sequen… (voir plus)tial iterative best response (SIBR) to solve complex cooperative multi-agent reinforcement learning (MARL) problems. We first start with giving a multi-agent estimation problem as a motivating example where SIBR converges while parallel iterative best response (PIBR) does not. Then we present a general implementation of staged multi-agent RL algorithms based on SIBR and multi-time scale stochastic approximation, and show that our new methods which we call Staged Independent Proximal Policy Optimization (SIPPO) and Staged Independent Q-learning (SIQL) outperform state-of-the-art independent learning on almost all the tasks in the epymarl \citep{papoudakis2020benchmarking} benchmark. This can be seen as a first step towards more decentralized MARL methods based on SIBR and multi-time scale learning.
On learning Whittle index policy for restless bandits with scalable regret
Nima Akbarzadeh
Reinforcement learning is an attractive approach to learn good resource allocation and scheduling policies based on data when the system mod… (voir plus)el is unknown. However, the cumulative regret of most RL algorithms scales as
Approximate information state for approximate planning and reinforcement learning in partially observed systems
Jayakumar Subramanian
Amit Sinha
Raihan Seraj
We propose a theoretical framework for approximate planning and learning in partially observed systems. Our framework is based on the fundam… (voir plus)ental notion of information state. We provide two equivalent definitions of information state---i) a function of history which is sufficient to compute the expected reward and predict its next value; ii) equivalently, a function of the history which can be recursively updated and is sufficient to compute the expected reward and predict the next observation. An information state always leads to a dynamic programming decomposition. Our key result is to show that if a function of the history (called approximate information state (AIS)) approximately satisfies the properties of the information state, then there is a corresponding approximate dynamic program. We show that the policy computed using this is approximately optimal with bounded loss of optimality. We show that several approximations in state, observation and action spaces in literature can be viewed as instances of AIS. In some of these cases, we obtain tighter bounds. A salient feature of AIS is that it can be learnt from data. We present AIS based multi-time scale policy gradient algorithms. and detailed numerical experiments with low, moderate and high dimensional environments.
Robustness of Whittle Index Policy to Model Approximation
Amit Sinha
Scalable Operator Allocation for Multirobot Assistance: A Restless Bandit Approach
Abhinav Dahiya
Nima Akbarzadeh
Stephen L. Smith
In this article, we consider the problem of allocating human operators in a system with multiple semiautonomous robots. Each robot is requir… (voir plus)ed to perform an independent sequence of tasks, subject to a chance of failing and getting stuck in a fault state at every task. If and when required, a human operator can assist or teleoperate a robot. Conventional dynamic programming-based techniques used to solve such problems face scalability issues due to an exponential growth of state and action spaces with the number of robots and operators. In this article, we derive conditions under which the operator allocation problem satisfies a technical condition called indexability, thereby enabling the use of the Whittle index heuristic. The conditions are easy to check, and we show that they hold for a wide range of problems of interest. Our key insight is to leverage the structure of the value function of individual robots, resulting in conditions that can be verified separately for each state of each robot. We apply these conditions to two types of transitions commonly seen in remote robot supervision systems. Through numerical simulations, we demonstrate the efficacy of Whittle index policy as a near-optimal and scalable approach that outperforms existing scalable methods.
Robustness of Markov perfect equilibrium to model approximations in general-sum dynamic games
Jayakumar Subramanian
Amit Sinha
Dynamic games (also called stochastic games or Markov games) are an important class of games for modeling multi-agent interactions. In many … (voir plus)situations, the dynamics and reward functions of the game are learnt from past data and are therefore approximate. In this paper, we study the robustness of Markov perfect equilibrium to approximations in reward and transition functions. Using approximation results from Markov decision processes, we show that the Markov perfect equilibrium of an approximate (or perturbed) game is always an approximate Markov perfect equilibrium of the original game. We provide explicit bounds on the approximation error in terms of three quantities: (i) the error in approximating the reward functions, (ii) the error in approximating the transition function, and (iii) a property of the value function of the MPE of the approximate game. The second and third quantities depend on the choice of metric on probability spaces. We also present coarser upper bounds which do not depend on the value function but only depend on the properties of the reward and transition functions of the approximate game. We illustrate the results via a numerical example.