Khimya Khetarpal

Affiliate Member
Research Scientist, Google DeepMind
Research Topics
Machine Learning Theory
Online Learning
Reinforcement Learning
Representation Learning


Khimya Khetarpal is a Research Scientist at Google DeepMind. She earned her PhD in Computer Science from the Reasoning and Learning Lab at McGill University and Mila, advised by Doina Precup. She is broadly interested in artificial intelligence and reinforcement learning. Her current research focuses on how RL agents learn to efficiently represent the world's knowledge, plan with it, and adapt to changes over time. Khimya’s work has appeared in leading AI journals and conferences including NeurIPS, ICML, AAAI, AISTATS, ICLR, The Knowledge Engineering Review, ACM, JAIR and TMLR. Her work has also been featured in MIT Technology Review. She was recognized as a TMLR expert reviewer in 2023 and as one of the Rising Stars in EECS 2020, was a finalist in the Three Minute Thesis (3MT) competition at AAAI 2019, was selected for the Doctoral Consortium at AAAI 2019, and received a Best Paper Award (3rd Prize) at an ICML 2018 workshop on lifelong learning. Throughout her career, she has actively sought to mentor others through initiatives such as co-founding the Mila peer advising initiative, teaching and assisting at the AI4Good Lab, volunteering at Skype A Scientist, and mentoring at FIRST Robotics.

Her research aims to (1) understand intelligent behavior that bridges both action and perception grounded in theoretical foundations of reinforcement learning, and (2) build AI agents to efficiently represent the world's knowledge, plan with it, and adapt to changes over time through learning and interaction.

She currently approaches this with the following research directions:

- Selective Attention for Fast Adaptation and Robustness

- Learning Abstractions and Affordances

- Discovery and Continual Reinforcement Learning

Current Students

Master's Research - McGill University
Principal supervisor:
PhD - Université de Montréal
Principal supervisor:


Balancing Context Length and Mixing Times for Reinforcement Learning at Scale
Matthew D Riemer
Janarthanan Rajendran
Normalization and effective learning rates in reinforcement learning
Clare Lyle
Zeyu Zheng
James Martens
Hado van Hasselt
Razvan Pascanu
Will Dabney
Normalization layers have recently experienced a renaissance in the deep reinforcement learning and continual learning literature, with several works highlighting diverse benefits such as improving loss landscape conditioning and combatting overestimation bias. However, normalization brings with it a subtle but important side effect: an equivalence between growth in the norm of the network parameters and decay in the effective learning rate. This becomes problematic in continual learning settings, where the resulting effective learning rate schedule may decay to near zero too quickly relative to the timescale of the learning problem. We propose to make the learning rate schedule explicit with a simple re-parameterization which we call Normalize-and-Project (NaP), which couples the insertion of normalization layers with weight projection, ensuring that the effective learning rate remains constant throughout training. This technique reveals itself as a powerful analytical tool to better understand learning rate schedules in deep reinforcement learning, and as a means of improving robustness to nonstationarity in synthetic plasticity loss benchmarks along with both the single-task and sequential variants of the Arcade Learning Environment. We also show that our approach can be easily applied to popular architectures such as ResNets and transformers while recovering and in some cases even slightly improving the performance of the base model in common stationary benchmarks.
A Unifying Framework for Action-Conditional Self-Predictive Reinforcement Learning
Zhaohan Daniel Guo
Bernardo Avila Pires
Yunhao Tang
Clare Lyle
Mark Rowland
Nicolas Heess
Diana Borsa
Arthur Guez
Will Dabney
Disentangling the Causes of Plasticity Loss in Neural Networks
Clare Lyle
Zeyu Zheng
Hado van Hasselt
Razvan Pascanu
James Martens
Will Dabney
Toward Human-AI Alignment in Large-Scale Multi-Player Games
Sugandha Sharma
Guy Davidson
Anssi Kanervisto
Udit Arora
Katja Hofmann
Ida Momennejad
Achieving human-AI alignment in complex multi-agent games is crucial for creating trustworthy AI agents that enhance gameplay. We propose a method to evaluate this alignment using an interpretable task-sets framework, focusing on high-level behavioral tasks instead of low-level policies. Our approach has three components. First, we analyze extensive human gameplay data from Xbox's Bleeding Edge (100K+ games), uncovering behavioral patterns in a complex task space. This task space serves as a basis set for a behavior manifold capturing interpretable axes: fight-flight, explore-exploit, and solo-multi-agent. Second, we train an AI agent to play Bleeding Edge using a Generative Pretrained Causal Transformer and measure its behavior. Third, we project human and AI gameplay to the proposed behavior manifold to compare and contrast. This allows us to interpret differences in policy as higher-level behavioral concepts, e.g., we find that while human players exhibit variability in fight-flight and explore-exploit behavior, AI players tend towards uniformity. Furthermore, AI agents predominantly engage in solo play, while humans often engage in cooperative and competitive multi-agent patterns. These stark differences underscore the need for interpretable evaluation, design, and integration of AI in human-aligned applications. Our study advances the alignment discussion in AI and especially generative AI research, offering a measurable framework for interpretable human-agent alignment in multiplayer gaming.
Forecaster: Towards Temporally Abstract Tree-Search Planning from Pixels
Thomas Jiralerspong
Flemming Kondrup
The ability to plan at many different levels of abstraction enables agents to envision the long-term repercussions of their decisions and thus enables sample-efficient learning. This becomes particularly beneficial in complex environments with high-dimensional state spaces such as pixels, where the goal is distant and the reward sparse. We introduce Forecaster, a deep hierarchical reinforcement learning approach which plans over high-level goals leveraging a temporally abstract world model. Forecaster learns an abstract model of its environment by modelling the transition dynamics at an abstract level and training a world model on such transitions. It then uses this world model to choose optimal high-level goals through a tree-search planning procedure. It additionally trains a low-level policy that learns to reach those goals. Our method addresses not only building world models with longer horizons but also planning with such models in downstream tasks. We empirically demonstrate Forecaster's potential in both single-task learning and generalization to new tasks in the AntMaze domain.
POMRL: No-Regret Learning-to-Plan with Increasing Horizons
Claire Vernade
Brendan O'Donoghue
Satinder Singh
Tom Zahavy
We study the problem of planning under model uncertainty in an online meta-reinforcement learning (RL) setting where an agent is presented with a sequence of related tasks with limited interactions per task. The agent can use its experience in each task and across tasks to estimate both the transition model and the distribution over tasks. We propose an algorithm to meta-learn the underlying structure across tasks, utilize it to plan in each task, and upper-bound the regret of the planning loss. Our bound suggests that the average regret over tasks decreases as the number of tasks increases and as the tasks are more similar. In the classical single-task setting, it is known that the planning horizon should depend on the estimated model's accuracy, that is, on the number of samples within a task. We generalize this finding to meta-RL and study this dependence of planning horizons on the number of tasks. Based on our theoretical findings, we derive heuristics for selecting slowly increasing discount factors, and we validate their significance empirically.
Discovering Object-Centric Generalized Value Functions From Pixels
Somjit Nath
Gopeshh Raaj Subbaraj
Deep Reinforcement Learning has shown significant progress in extracting useful representations from high-dimensional inputs, albeit using hand-crafted auxiliary tasks and pseudo rewards. Automatically learning such representations in an object-centric manner geared towards control and fast adaptation remains an open research problem. In this paper, we introduce a method that tries to discover meaningful features from objects, translating them to temporally coherent "question" functions and leveraging the subsequent learned general value functions for control. We compare our approach with state-of-the-art techniques alongside other ablations and show competitive performance in both stationary and non-stationary settings. Finally, we also investigate the discovered general value functions and through qualitative analysis show that the learned representations are not only interpretable but also centered around objects that are invariant to changes across tasks, facilitating fast adaptation.
Towards Continual Reinforcement Learning: A Review and Perspectives
The Paradox of Choice: On the Role of Attention in Hierarchical Reinforcement Learning
Andrei Cristian Nica
Decision-making AI agents are often faced with two important challenges: the depth of the planning horizon, and the branching factor due to having many choices. Hierarchical reinforcement learning methods aim to solve the first problem, by providing shortcuts that skip over multiple time steps. To cope with the breadth, it is desirable to restrict the agent's attention at each step to a reasonable number of possible choices. The concept of affordances (Gibson, 1977) suggests that only certain actions are feasible in certain states. In this work, we first characterize "affordances" as a "hard" attention mechanism that strictly limits the available choices of temporally extended options. We then investigate the role of hard versus soft attention in training data collection, abstract value learning in long-horizon tasks, and handling a growing number of choices. To this end, we present an online, model-free algorithm to learn affordances that can be used to further learn subgoal options. Finally, we identify and empirically demonstrate the settings in which the "paradox of choice" arises, i.e. when having fewer but more meaningful choices improves the learning speed and performance of a reinforcement learning agent.
Sequoia: A Software Framework to Unify Continual Learning Research
Fabrice Normandin
Oleksiy Ostapenko
Pau Rodriguez
Florian Golemo
Ryan Lindeborg
J. Hurtado
Matthew D Riemer
Lucas Cecchi
Timothee Lesort
Dominic Zhao
David Vazquez
Massimo Caccia
The field of Continual Learning (CL) seeks to develop algorithms that accumulate knowledge and skills over time through interaction with non-stationary environments. In practice, a plethora of evaluation procedures (settings) and algorithmic solutions (methods) exist, each with their own potentially disjoint set of assumptions. This variety makes measuring progress in CL difficult. We propose a taxonomy of settings, where each setting is described as a set of assumptions. A tree-shaped hierarchy emerges from this view, where more general settings become the parents of those with more restrictive assumptions. This makes it possible to use inheritance to share and reuse research, as developing a method for a given setting also makes it directly applicable onto any of its children. We instantiate this idea as a publicly available software framework called Sequoia, which features a wide variety of settings from both the Continual Supervised Learning (CSL) and Continual Reinforcement Learning (CRL) domains. Sequoia also includes a growing suite of methods which are easy to extend and customize, in addition to more specialized methods from external libraries. We hope that this new paradigm and its first implementation can help unify and accelerate research in CL. You can help us grow the tree by visiting (this GitHub URL).