Portrait of Ankit Anand

Ankit Anand

Associate Industry Member
Adjunct Professor
Senior Research Scientist, Google DeepMind

Biography

Ankit is currently a Senior Research Scientist at Google DeepMind Montréal and an Associate Industry Member at Mila - Quebec Artificial Intelligence Institute. His research interests lie at the intersection of reasoning and reinforcement learning. He is also interested in applying contemporary AI methods to make advances in mathematics and theoretical computer science (automated theorem proving and counterexample generation). Recently, he has also worked on the LearnLM models, developing generative AI models trained specifically for pedagogy and teaching applications, and on how AI could contribute to equitable education.

Previously, he completed his PhD at IIT Delhi, working with Professors Mausam and Parag Singla. During his PhD, he worked on symmetry in AI algorithms in the context of probabilistic graphical models and Monte Carlo tree search algorithms.

Publications

Perturbative study of Supercritical Crossover in Noncommutative-corrected Spacetime
Shoucheng Wang
We analytically study the Widom line and supercritical crossover of noncommutative charged AdS black holes. Treating the noncommutative parameter
Using Reward Uncertainty to Induce Diverse Behaviour in Reinforcement Learning
Anthony GX-Chen
Gheorghe Comanici
Zaheer Abbas
Eser Aygün
David Smalling
Shibl Mourad
Andre Barreto
Mark Rowland
Classical reinforcement learning (RL) typically seeks a deterministic policy that maximizes the expected sum of a scalar reward. Yet, modern applications such as language model fine-tuning or scientific discovery demand diversity. Existing remedies such as entropy regularization or diversity bonuses often require fragile trade-offs that sacrifice performance for stochasticity or rely on heuristic metrics that can misalign policy rankings. We argue that diversity is more naturally understood as the rational response to uncertainty in the reward. When the reward function is not perfectly known--as is the case with ambiguous preferences or imperfect reward models--committing to a single action can be sub-optimal. Building on this, we propose a fundamental reformulation of the RL objective by replacing the scalar reward with a distribution over reward functions, and applying a non-linear objective over sets of actions. The result is a framework in which calibrated behavioural diversity emerges naturally, remains controllable through the reward function distribution, and is obtained without sacrificing expected reward. Focusing on the contextual bandit setting, we derive a principled gradient estimator for this objective and prove that our formulation naturally generalizes both vanilla policy gradient and more recently developed action-set approaches. Our empirical results demonstrate that this framework offers a robust and theoretically grounded alternative for complex RL tasks where the traditional formulation of the problem fails to induce the desired breadth of agent behaviour.
EXPRESS: Climate Communications in IPOs: Unpacking the Influence of Climate Disclosure Volume, Sender, and Message Characteristics
Alok R. Saboo
Ritesh Adhyapak
Climate disclosures have emerged as a prominent communication tool for firms facing growing pressure to address climate challenges, yet their impact on firm performance remains unclear. This study proposes a nonlinear (U-shaped) relationship between climate disclosure volume and IPO firm performance, grounded in a damage-limitation logic. At low to moderate levels, disclosures amplify risk salience and proprietary costs, damaging valuations. At higher levels, offsetting benefits related to information, stewardship, and climate-friendly reputation outweigh these costs. Using multi-sourced data from 1,586 IPO firms, a BERT-based large language model to identify climate-related text in prospectuses, and econometric methods that address endogeneity, the authors find support for the proposed U-shaped relationship. The research further demonstrates that sender characteristics (underwriter reputation, customer concentration, and market orientation) and message characteristics (discretionary disclosure and message clarity) moderate the nonlinear relationship. Post-hoc analyses decomposing disclosure content reveal that climate risk disclosures damage valuations. In contrast, climate risk-management disclosures (governance, strategy, and metrics/targets) generate positive effects, suggesting that disclosure effectiveness depends on both volume and content composition. These effects persist in the long-term performance of firms. The findings provide actionable insights for firms developing disclosure strategies and policymakers encouraging climate-related communication.
Panorama of Soft Tissue Tumours at a Tertiary Care Centre in Bihar: A Retrospective Observational Study
Vibhuti Kumar
Objective and Aim: Soft tissue tumors (STTs) represent a heterogeneous group of neoplasms with diverse histogenesis, biological behavior, and clinical outcomes. The present study aims to evaluate the spectrum, frequency, demographic distribution, anatomical location, and histopathological patterns of soft tissue tumors diagnosed at a tertiary care center in Bihar, India, with special emphasis on benign–malignant correlation and clinicopathological characteristics. Materials and Methods: This retrospective observational study was conducted in the Department of Pathology at a tertiary care teaching hospital in Bihar over a period of five years (January 2019–December 2023). All histopathologically confirmed cases of soft tissue tumors were included. Tumors were classified according to the WHO Classification of Soft Tissue and Bone Tumors (2020). Statistical analysis was performed using SPSS version 26.0. Descriptive statistics, chi-square test, and logistic regression analysis were applied. Results: A total of 312 cases of soft tissue tumors were analyzed. Benign tumors constituted 76.9%, intermediate tumors 7.4%, and malignant tumors 15.7%. The most common benign tumor was lipoma (38.1%), while undifferentiated pleomorphic sarcoma (21.4%) was the most frequent malignant tumor. Malignant tumors were significantly associated with age >40 years (p < 0.001) and deep-seated location (p = 0.002). Conclusion: Soft tissue tumors in Bihar show a predominance of benign lesions with lipoma bei
GRAIL: Graph Edit Distance and Node Alignment using LLM-Generated Code
Samidha Verma
Arushi Goyal
Ananya Mathur
Sayan Ranu
Graph Edit Distance (GED) is a widely used metric for measuring similarity between two graphs. Computing the optimal GED is NP-hard, leading to the development of various neural and non-neural heuristics. While neural methods have achieved improved approximation quality compared to non-neural approaches, they face significant challenges: (1) They require large amounts of ground truth data, which is itself NP-hard to compute. (2) They operate as black boxes, offering limited interpretability. (3) They lack cross-domain generalization, necessitating expensive retraining for each new dataset. We address these limitations with GRAIL, introducing a paradigm shift in this domain. Instead of training a neural model to predict GED, GRAIL employs a novel combination of large language models (LLMs) and automated prompt tuning to generate a program that is used to compute GED. This shift from predicting GED to generating programs imparts various advantages, including end-to-end interpretability and an autonomous self-evolutionary learning mechanism without ground-truth supervision. Extensive experiments on seven datasets confirm that GRAIL not only surpasses state-of-the-art GED approximation methods in prediction quality but also achieves robust cross-domain generalization across diverse graph distributions.
Discovering Symbolic Cognitive Models from Human and Animal Behavior
Nenad Tomasev
Navodita Sharma
Rishika Mohanta
Aparna Dev
Kuba Perlin
Siddhant Jain
Kyle Levin
Noemi Elteto
Will Dabney
Alexander Novikov
Glenn C Turner
Maria K Eckstein
Nathaniel D. Daw
Kevin J Miller
Kim Stachenfeld
Symbolic models play a key role in cognitive science, expressing computationally precise hypotheses about how the brain implements a cognitive process. Identifying an appropriate model typically requires a great deal of effort and ingenuity on the part of a human scientist. Here, we adapt FunSearch (Romera-Paredes et al. 2024), a recently developed tool that uses Large Language Models (LLMs) in an evolutionary algorithm, to automatically discover symbolic cognitive models that accurately capture human and animal behavior. We consider datasets from three species performing a classic reward-learning task that has been the focus of substantial modeling effort, and find that the discovered programs outperform state-of-the-art cognitive models for each. The discovered programs can readily be interpreted as hypotheses about human and animal cognition, instantiating interpretable symbolic learning and decision-making algorithms. Broadly, these results demonstrate the viability of using LLM-powered program synthesis to propose novel scientific hypotheses regarding mechanisms of human and animal cognition.
Cracking the Code of Action: A Generative Approach to Affordances for Reinforcement Learning
Agents that can autonomously navigate the web through a graphical user interface (GUI) using a unified action space (e.g., mouse and keyboard actions) can require very large amounts of domain-specific expert demonstrations to achieve good performance. Low sample efficiency is often exacerbated in sparse-reward and large-action-space environments, such as a web GUI, where only a few actions are relevant in any given situation. In this work, we consider the low-data regime, with limited or no access to expert behavior. To enable sample-efficient learning, we explore the effect of constraining the action space through
Code as Reward: Empowering Reinforcement Learning with VLMs
David Venuto
Sami Nur Islam
Sherry Yang
Pre-trained Vision-Language Models (VLMs) are able to understand visual concepts, describe and decompose complex tasks into sub-tasks, and provide feedback on task completion. In this paper, we aim to leverage these capabilities to support the training of reinforcement learning (RL) agents. In principle, VLMs are well suited for this purpose, as they can naturally analyze image-based observations and provide feedback (reward) on learning progress. However, inference in VLMs is computationally expensive, so querying them frequently to compute rewards would significantly slow down the training of an RL agent. To address this challenge, we propose a framework named Code as Reward (VLM-CaR). VLM-CaR produces dense reward functions from VLMs through code generation, thereby significantly reducing the computational burden of querying the VLM directly. We show that the dense rewards generated through our approach are very accurate across a diverse set of discrete and continuous environments, and can be more effective in training RL policies than the original sparse environment rewards.
Finding Increasingly Large Extremal Graphs with AlphaZero and Tabu Search
Abbas Mehrabian
Hyunjik Kim
Nicolas Sonnerat
Matej Balog
Gheorghe Comanici
Tudor Berariu
Andrew Lee
Anian Ruoss
Anna Bulanova
Daniel Toyama
Sam Blackwell
Bernardino Romera Paredes
Laurent Orseau
Joonkyung Lee
Anurag Murty Naredla
Adam Zsolt Wagner
Policy composition in reinforcement learning via multi-objective policy optimization
Nicolas Heess
Martin A. Riedmiller
Abbas Abdolmaleki
We enable reinforcement learning agents to learn successful behavior policies by utilizing relevant pre-existing teacher policies. The teacher policies are introduced as objectives, in addition to the task objective, in a multi-objective policy optimization setting. Using the Multi-Objective Maximum a Posteriori Policy Optimization algorithm (Abdolmaleki et al. 2020), we show that teacher policies can help speed up learning, particularly in the absence of shaping rewards. In two domains with continuous observation and action spaces, our agents successfully compose teacher policies in sequence and in parallel, and are also able to further extend the policies of the teachers in order to solve the task. Depending on the specified combination of task and teacher(s), teacher(s) may naturally act to limit the final performance of an agent. The extent to which agents are required to adhere to teacher policies is determined by hyperparameters which determine both the effect of teachers on learning speed and the eventual performance of the agent on the task. In the humanoid domain (Tassa et al. 2018), we also equip agents with the ability to control the selection of teachers. With this ability, agents are able to meaningfully compose from the teacher policies to achieve a superior task reward on the walk task than in cases without access to the teacher policies. We show the resemblance of composed task policies with the corresponding teacher policies through videos.
Accelerating exploration and representation learning with offline pre-training
Jacob Bruce
Rob Fergus
Sequential decision-making agents struggle with long horizon tasks, since solving them requires multi-step reasoning. Most reinforcement learning (RL) algorithms address this challenge by improved credit assignment, introducing memory capability, altering the agent's intrinsic motivation (i.e. exploration) or its worldview (i.e. knowledge representation). Many of these components could be learned from offline data. In this work, we follow the hypothesis that exploration and representation learning can be improved by separately learning two different models from a single offline dataset. We show that learning a state representation using noise-contrastive estimation and a model of auxiliary reward separately from a single collection of human demonstrations can significantly improve the sample efficiency on the challenging NetHack benchmark. We also ablate various components of our experimental setting and highlight crucial insights.
Proving Theorems using Incremental Learning and Hindsight Experience Replay
Maxwell Crouse
Eser Aygün
Laurent Orseau
Bassem Makni
Vernon Ralph Austel
Cristina Cornelio
Shajith Ikbal
Stephen McAleer
Vlad Firoiu
Pavan Kapanipathi
Lei Zhang
Ndivhuwo Makondo
Shibl Mourad
Traditional automated theorem provers for first-order logic depend on speed-optimized search and many handcrafted heuristics that are designed to work best over a wide range of domains. Machine learning approaches in literature either depend on these traditional provers to bootstrap themselves or fall short on reaching comparable performance. In this paper, we propose a general incremental learning algorithm for training domain specific provers for first-order logic without equality, based only on a basic given-clause algorithm, but using a learned clause-scoring function. Clauses are represented as graphs and presented to transformer networks with spectral features. To address the sparsity and the initial lack of training data as well as the lack of a natural curriculum, we adapt hindsight experience replay to theorem proving, so as to be able to learn even when no proof can be found. We show that provers trained this way can match and sometimes surpass state-of-the-art traditional provers on the TPTP dataset in terms of both quantity and quality of the proofs.