Ankit Anand

Associate Industry Member
Adjunct Professor
Senior Research Scientist, Google DeepMind

Biography

Ankit is currently a senior research scientist at Google DeepMind Montreal and an associate industry member at Mila - Quebec Artificial Intelligence Institute. His research interests lie at the intersection of reasoning and reinforcement learning. He is also interested in applying contemporary AI methods to make advances in mathematics and theoretical computer science (automated theorem proving and counterexample generation). Recently, he has also worked on the LearnLM models, developing generative AI models trained specifically for pedagogy and teaching applications, and on how AI could contribute to equitable education.

Previously, he completed his PhD at IIT Delhi, working with Professors Mausam and Parag Singla. During his PhD, he worked on symmetry in AI algorithms, in the context of probabilistic graphical models and Monte Carlo tree search algorithms.

Publications

Cracking the Code of Action: A Generative Approach to Affordances for Reinforcement Learning
Lynn Cherif
Flemming Kondrup
David Venuto
Agents that can autonomously navigate the web through a graphical user interface (GUI) using a unified action space (e.g., mouse and keyboard actions) can require very large amounts of domain-specific expert demonstrations to achieve good performance. Low sample efficiency is often exacerbated in sparse-reward and large-action-space environments, such as a web GUI, where only a few actions are relevant in any given situation. In this work, we consider the low-data regime, with limited or no access to expert behavior. To enable sample-efficient learning, we explore the effect of constraining the action space through intent-based affordances -- i.e., considering in any situation only the subset of actions that achieve a desired outcome. We propose **Code as Generative Affordances**
Code as Reward: Empowering Reinforcement Learning with VLMs
David Venuto
Mohammad Sami Nur Islam
Martin Klissarov
Sherry Yang
Pre-trained Vision-Language Models (VLMs) are able to understand visual concepts, describe and decompose complex tasks into sub-tasks, and provide feedback on task completion. In this paper, we aim to leverage these capabilities to support the training of reinforcement learning (RL) agents. In principle, VLMs are well suited for this purpose, as they can naturally analyze image-based observations and provide feedback (reward) on learning progress. However, inference in VLMs is computationally expensive, so querying them frequently to compute rewards would significantly slow down the training of an RL agent. To address this challenge, we propose a framework named Code as Reward (VLM-CaR). VLM-CaR produces dense reward functions from VLMs through code generation, thereby significantly reducing the computational burden of querying the VLM directly. We show that the dense rewards generated through our approach are very accurate across a diverse set of discrete and continuous environments, and can be more effective in training RL policies than the original sparse environment rewards.
Finding Increasingly Large Extremal Graphs with AlphaZero and Tabu Search
Abbas Mehrabian
Hyunjik Kim
Nicolas Sonnerat
Matej Balog
Gheorghe Comanici
Tudor Berariu
Andrew Lee
Anian Ruoss
Anna Bulanova
Daniel Toyama
Sam Blackwell
Bernardino Romera Paredes
Petar Veličković
Laurent Orseau
Joonkyung Lee
Anurag Murty Naredla
Adam Zsolt Wagner
Policy composition in reinforcement learning via multi-objective policy optimization
Shruti Mishra
Jordan Hoffmann
Nicolas Heess
Martin A. Riedmiller
Abbas Abdolmaleki
We enable reinforcement learning agents to learn successful behavior policies by utilizing relevant pre-existing teacher policies. The teacher policies are introduced as objectives, in addition to the task objective, in a multi-objective policy optimization setting. Using the Multi-Objective Maximum a Posteriori Policy Optimization algorithm (Abdolmaleki et al. 2020), we show that teacher policies can help speed up learning, particularly in the absence of shaping rewards. In two domains with continuous observation and action spaces, our agents successfully compose teacher policies in sequence and in parallel, and are also able to further extend the policies of the teachers in order to solve the task. Depending on the specified combination of task and teacher(s), teacher(s) may naturally act to limit the final performance of an agent. The extent to which agents are required to adhere to teacher policies is determined by hyperparameters which determine both the effect of teachers on learning speed and the eventual performance of the agent on the task. In the humanoid domain (Tassa et al. 2018), we also equip agents with the ability to control the selection of teachers. With this ability, agents are able to meaningfully compose from the teacher policies to achieve a higher task reward on the walk task than in cases without access to the teacher policies. We show the resemblance of composed task policies with the corresponding teacher policies through videos.
Accelerating exploration and representation learning with offline pre-training
Jake Bruce
Rob Fergus
Sequential decision-making agents struggle with long horizon tasks, since solving them requires multi-step reasoning. Most reinforcement learning (RL) algorithms address this challenge by improved credit assignment, introducing memory capability, altering the agent's intrinsic motivation (i.e. exploration) or its worldview (i.e. knowledge representation). Many of these components could be learned from offline data. In this work, we follow the hypothesis that exploration and representation learning can be improved by separately learning two different models from a single offline dataset. We show that learning a state representation using noise-contrastive estimation and a model of auxiliary reward separately from a single collection of human demonstrations can significantly improve the sample efficiency on the challenging NetHack benchmark. We also ablate various components of our experimental setting and highlight crucial insights.
Proving theorems using Incremental Learning and Hindsight Experience Replay
Maxwell Crouse
Eser Aygün
Laurent Orseau
Bassem Makni
Vernon Ralph Austel
Xavier Glorot
Cristina Cornelio
Shajith Ikbal
Stephen M Mcaleer
Vlad Firoiu
Pavan Kapanipathi
Lei M Zhang
Ndivhuwo Makondo
Shibl Mourad
The highest performing ATP systems (e.g., [7, 18]) in first order logic have been evolving for decades and have grown to use an increasing number of manually designed heuristics mixed with some machine learning, to obtain a large number of search strategies that are tried sequentially or in parallel. Some recent works [5, 13, 19] build on top of these provers, using modern machine learning techniques to augment, select or prioritize their already existing heuristics, with some success. Other recent works do not build on top of other provers, but still require existing proof examples as input (e.g., [9, 23]). Such machine-learning-based ATP systems can struggle to solve difficult problems when the training dataset does not provide problems of sufficiently diverse difficulties. In this paper, we propose an approach which can build a strong theorem prover without relying on existing domain-specific heuristics or on prior input data (in the form of proofs) to prime the learning. We strive to design a learning methodology for ATP that allows a system to improve even when there are large gaps in the difficulty of the given set of theorems. In particular, given a set of conjectures without proofs, our system trains itself, based on its own attempts, and (dis)proves an increasing number of conjectures, an approach which can be viewed as a form of incremental learning. Additionally, all the previous approaches [19, 1, 13] learn exclusively on successful proof attempts. When no new theorem can be proven, the learner may not be able to improve anymore and thus the system may not be able to obtain more training data. This could in principle happen even at the very start of training, if all the theorems available are too hard.
To tackle this challenge, we adapt the idea of hindsight experience replay (HER) [3] to ATP: clauses reached during proof attempts (whether successful or not) are turned into goals in hindsight, producing a large amount of 'auxiliary' theorems with proofs of varied difficulties for the learner, even in principle when no theorem from the original set can be proven initially. This leads to a smoother learning regime and a constantly improving learner. We evaluate our approach on two popular benchmarks, MPTP2078 [2] and M2k [17], and compare it both with TRAIL [1], a recent machine learning prover, and with E prover [24, 7], one of the leading heuristic provers. Our proposed approach substantially outperforms TRAIL [1] on both datasets, surpasses E in the auto configuration with a 100s time limit, and is competitive with E in the autoschedule configuration with a 7-day time limit. In addition, our approach almost always (99.5% of cases) finds shorter proofs than E.
Training a First-Order Theorem Prover from Synthetic Data
Vlad Firoiu
Eser Aygün
Zafarali Ahmed
Xavier Glorot
Laurent Orseau
Lei Zhang
Shibl Mourad
Learning to Prove from Synthetic Theorems
Eser Aygün
Zafarali Ahmed
Vlad Firoiu
Xavier Glorot
Laurent Orseau
Shibl Mourad
A major challenge in applying machine learning to automated theorem proving is the scarcity of training data, which is a key ingredient in training successful deep learning models. To tackle this problem, we propose an approach that relies on training with synthetic theorems, generated from a set of axioms. We show that such theorems can be used to train an automated prover and that the learned prover transfers successfully to human-generated theorems. We demonstrate that a prover trained exclusively on synthetic theorems can solve a substantial fraction of problems in TPTP, a benchmark dataset that is used to compare state-of-the-art heuristic provers. Our approach outperforms a model trained on human-generated problems in most axiom sets, thereby showing the promise of using synthetic data for this task.