
Ankit Anand

Associate Industry Member
Adjunct Professor
Senior Research Scientist, Google DeepMind

Biography

Ankit is currently a Senior Research Scientist at Google DeepMind Montreal and an Associate Industry Member at Mila - Quebec Artificial Intelligence Institute. His research interests lie at the intersection of reasoning and reinforcement learning. He is also interested in applying contemporary AI methods to make advances in mathematics and theoretical computer science (automated theorem proving and counterexample generation). Recently, he has also worked on the LearnLM models, which aim to build generative AI models specifically trained for pedagogy and teaching applications, and on how AI could contribute to equitable education.

Previously, he completed his PhD at IIT Delhi, working with Professors Mausam and Parag Singla. During his PhD, he worked on symmetries in AI algorithms, in the context of probabilistic graphical models and Monte Carlo tree search algorithms.

Publications

Cracking the Code of Action: A Generative Approach to Affordances for Reinforcement Learning
Lynn Cherif
Flemming Kondrup
David Venuto
Agents that can autonomously navigate the web through a graphical user interface (GUI) using a unified action space (e.g., mouse and keyboard actions) can require very large amounts of domain-specific expert demonstrations to achieve good performance. Low sample efficiency is often exacerbated in sparse-reward and large-action-space environments, such as a web GUI, where only a few actions are relevant in any given situation. In this work, we consider the low-data regime, with limited or no access to expert behavior. To enable sample-efficient learning, we explore the effect of constraining the action space through intent-based affordances -- i.e., considering in any situation only the subset of actions that achieve a desired outcome. We propose **Code as Generative Affordances**
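As a rough illustration of the intent-based affordances described in this abstract, the sketch below filters a web-GUI action space down to the actions that serve a given intent before the policy scores them. It is not the paper's implementation; the state layout, affordance functions, and policy interface are all hypothetical.

```python
# A minimal sketch of intent-based affordances, assuming a hypothetical web-GUI
# state that exposes its candidate UI actions as dictionaries. The affordance
# functions stand in for the code such an approach could generate: each one
# maps a state to the subset of actions that serve a single intent.

def click_search_box(state):
    """Hypothetical affordance: only actions that focus a search field."""
    return [a for a in state["actions"] if a.get("role") == "searchbox"]

def type_query(state):
    """Hypothetical affordance: only typing actions while a field is focused."""
    return [a for a in state["actions"] if a.get("kind") == "type" and state.get("focused")]

AFFORDANCES = [click_search_box, type_query]

def constrained_action_space(state):
    # Union of actions permitted by any affordance; fall back to the full
    # action space if no affordance applies, so the policy is never stuck.
    allowed = [a for fn in AFFORDANCES for a in fn(state)]
    return allowed or state["actions"]

def act(policy_score, state):
    # The RL policy only ever scores the constrained subset of actions.
    candidates = constrained_action_space(state)
    return max(candidates, key=lambda a: policy_score(state, a))
```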
Code as Reward: Empowering Reinforcement Learning with VLMs
David Venuto
Mohammad Sami Nur Islam
Martin Klissarov
Sherry Yang
Pre-trained Vision-Language Models (VLMs) are able to understand visual concepts, describe and decompose complex tasks into sub-tasks, and provide feedback on task completion. In this paper, we aim to leverage these capabilities to support the training of reinforcement learning (RL) agents. In principle, VLMs are well suited for this purpose, as they can naturally analyze image-based observations and provide feedback (reward) on learning progress. However, inference in VLMs is computationally expensive, so querying them frequently to compute rewards would significantly slow down the training of an RL agent. To address this challenge, we propose a framework named Code as Reward (VLM-CaR). VLM-CaR produces dense reward functions from VLMs through code generation, thereby significantly reducing the computational burden of querying the VLM directly. We show that the dense rewards generated through our approach are very accurate across a diverse set of discrete and continuous environments, and can be more effective in training RL policies than the original sparse environment rewards.
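The following sketch shows one way the idea described in this abstract could look in code, under assumptions: a Gym-style environment, a hypothetical `vlm.generate` call queried once offline to produce verifier functions, and a simple +1 bonus the first time each generated verifier passes. It illustrates the general pipeline, not the paper's actual implementation.

```python
# A minimal sketch of "code as reward": query the VLM once (offline) for Python
# verifier code, then call the compiled verifiers cheaply at every step instead
# of querying the VLM itself. `vlm.generate` and the Gym-style env are assumed.

def build_verifiers(vlm, task_description):
    """Ask the VLM once for sub-task checkers; compile them into callables."""
    source = vlm.generate(
        "Write Python functions named check_* that take an observation and "
        f"return True when a sub-task of this task is complete: {task_description}"
    )
    namespace = {}
    exec(source, namespace)  # trusted, offline setting assumed
    return [fn for name, fn in namespace.items() if name.startswith("check_")]

class DenseRewardWrapper:
    """Wraps a sparse-reward env; adds +1 the first time each verifier passes."""

    def __init__(self, env, verifiers):
        self.env, self.verifiers = env, verifiers
        self.passed = [False] * len(verifiers)

    def reset(self):
        self.passed = [False] * len(self.verifiers)
        return self.env.reset()

    def step(self, action):
        obs, sparse_reward, done, info = self.env.step(action)
        bonus = 0.0
        for i, check in enumerate(self.verifiers):
            if not self.passed[i] and check(obs):
                self.passed[i] = True
                bonus += 1.0
        return obs, sparse_reward + bonus, done, info
```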
Finding Increasingly Large Extremal Graphs with AlphaZero and Tabu Search
Abbas Mehrabian
Hyunjik Kim
Nicolas Sonnerat
Matej Balog
Gheorghe Comanici
Tudor Berariu
Andrew Lee
Anian Ruoss
Anna Bulanova
Daniel Toyama
Sam Blackwell
Bernardino Romera Paredes
Petar Veličković
Laurent Orseau
Joonkyung Lee
Anurag Murty Naredla
Adam Zsolt Wagner
Policy composition in reinforcement learning via multi-objective policy optimization
Shruti Mishra
Jordan Hoffmann
Nicolas Heess
Martin A. Riedmiller
Abbas Abdolmaleki
We enable reinforcement learning agents to learn successful behavior policies by utilizing relevant pre-existing teacher policies. The teacher policies are introduced as objectives, in addition to the task objective, in a multi-objective policy optimization setting. Using the Multi-Objective Maximum a Posteriori Policy Optimization algorithm (Abdolmaleki et al. 2020), we show that teacher policies can help speed up learning, particularly in the absence of shaping rewards. In two domains with continuous observation and action spaces, our agents successfully compose teacher policies in sequence and in parallel, and are also able to further extend the policies of the teachers in order to solve the task. Depending on the specified combination of task and teacher(s), the teacher(s) may naturally act to limit the final performance of an agent. The extent to which agents are required to adhere to teacher policies is determined by hyperparameters, which govern both the effect of the teachers on learning speed and the eventual performance of the agent on the task. In the humanoid domain (Tassa et al. 2018), we also equip agents with the ability to control the selection of teachers. With this ability, agents are able to meaningfully compose the teacher policies to achieve a higher task reward on the walk task than in cases without access to the teacher policies. We show the resemblance of composed task policies with the corresponding teacher policies through videos.
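As a rough, discrete-action illustration of treating teacher policies as extra objectives (the paper works in continuous spaces with MO-MPO, which this sketch does not reproduce), the loss below mixes a standard policy-gradient task term with one KL term per teacher; the per-teacher weights play the role of the adherence hyperparameters mentioned above. All names are illustrative.

```python
import torch.nn.functional as F

# Hypothetical PyTorch sketch: a task policy-gradient term plus one
# KL(teacher || policy) term per teacher, each scaled by its own weight.
# Tensors: policy_logits [B, A], each teacher's logits [B, A],
# actions [B] (long), advantages [B].

def composed_actor_loss(policy_logits, teacher_logits_list, actions, advantages,
                        teacher_weights):
    log_pi = F.log_softmax(policy_logits, dim=-1)

    # Task objective: advantage-weighted log-likelihood of the taken actions.
    chosen = log_pi.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    task_loss = -(advantages * chosen).mean()

    # Teacher objectives: stay close to each teacher, with a per-teacher weight
    # that trades off learning speed against final task performance.
    teacher_loss = 0.0
    for weight, t_logits in zip(teacher_weights, teacher_logits_list):
        teacher_probs = F.softmax(t_logits, dim=-1)
        kl = (teacher_probs * (F.log_softmax(t_logits, dim=-1) - log_pi)).sum(-1)
        teacher_loss = teacher_loss + weight * kl.mean()

    return task_loss + teacher_loss
```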
Accelerating exploration and representation learning with offline pre-training
Jake Bruce
Rob Fergus
Sequential decision-making agents struggle with long-horizon tasks, since solving them requires multi-step reasoning. Most reinforcement learning (RL) algorithms address this challenge through improved credit assignment, introducing memory capability, or altering the agent's intrinsic motivation (i.e., exploration) or its worldview (i.e., knowledge representation). Many of these components could be learned from offline data. In this work, we follow the hypothesis that exploration and representation learning can be improved by separately learning two different models from a single offline dataset. We show that learning a state representation using noise-contrastive estimation and a model of auxiliary reward, separately from a single collection of human demonstrations, can significantly improve sample efficiency on the challenging NetHack benchmark. We also ablate various components of our experimental setting and highlight crucial insights.
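A minimal sketch of the noise-contrastive (InfoNCE-style) representation objective mentioned in the abstract is shown below, under the assumption that positives are temporally adjacent observations from the same offline demonstration and the rest of the batch serves as negatives; the encoder and batch layout are illustrative, not the paper's exact setup.

```python
import torch
import torch.nn.functional as F

# InfoNCE-style contrastive loss over an offline batch: the i-th anchor's
# positive is its own next observation (the diagonal of the similarity matrix),
# and every other next observation in the batch acts as a negative.

def info_nce_loss(encoder, obs, next_obs, temperature=0.1):
    z = F.normalize(encoder(obs), dim=-1)           # anchors    [B, D]
    z_pos = F.normalize(encoder(next_obs), dim=-1)  # positives  [B, D]

    logits = z @ z_pos.t() / temperature            # similarities [B, B]
    labels = torch.arange(z.size(0), device=z.device)
    return F.cross_entropy(logits, labels)
```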