
Ankit Anand

Associate Industry Member
Adjunct Professor
Senior Research Scientist, Google DeepMind

Biography

Ankit is currently a Senior Research Scientist at Google DeepMind Montreal and an Associate Industry Member at Mila - Quebec Artificial Intelligence Institute. His research interests lie at the intersection of reasoning and reinforcement learning. He is also interested in applying contemporary AI methods to make advances in mathematics and theoretical computer science (automated theorem proving and counterexample generation). Recently, he has also worked on the LearnLM models, which aim to build generative AI models specifically trained for pedagogy and teaching applications, and on how AI could contribute to equitable education.

Previously, he completed his PhD at IIT Delhi, working with Professors Mausam and Parag Singla. During his PhD, he worked on symmetries in AI algorithms, in the context of probabilistic graphical models and Monte Carlo tree search algorithms.

Publications

Cracking the Code of Action: A Generative Approach to Affordances for Reinforcement Learning
Lynn Cherif
Flemming Kondrup
David Venuto
Agents that can autonomously navigate the web through a graphical user interface (GUI) using a unified action space (e.g., mouse and keyboard actions) can require very large amounts of domain-specific expert demonstrations to achieve good performance. Low sample efficiency is often exacerbated in sparse-reward and large-action-space environments, such as a web GUI, where only a few actions are relevant in any given situation. In this work, we consider the low-data regime, with limited or no access to expert behavior. To enable sample-efficient learning, we explore the effect of constraining the action space through intent-based affordances -- i.e., considering in any situation only the subset of actions that achieve a desired outcome. We propose **Code as Generative Affordances**
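As a rough illustration of the intent-based affordances described in this abstract, the sketch below filters a web-GUI action space down to the actions that serve a given intent before the policy scores them. It is not the paper's implementation; the state layout, affordance functions, and policy interface are all hypothetical.

```python
# A minimal sketch of intent-based affordances, assuming a hypothetical web-GUI
# state that exposes its candidate UI actions as dictionaries. The affordance
# functions stand in for the code such an approach could generate: each one
# maps a state to the subset of actions that serve a single intent.

def click_search_box(state):
    """Hypothetical affordance: only actions that focus a search field."""
    return [a for a in state["actions"] if a.get("role") == "searchbox"]

def type_query(state):
    """Hypothetical affordance: only typing actions while a field is focused."""
    return [a for a in state["actions"] if a.get("kind") == "type" and state.get("focused")]

AFFORDANCES = [click_search_box, type_query]

def constrained_action_space(state):
    # Union of actions permitted by any affordance; fall back to the full
    # action space if no affordance applies, so the policy is never stuck.
    allowed = [a for fn in AFFORDANCES for a in fn(state)]
    return allowed or state["actions"]

def act(policy_score, state):
    # The RL policy only ever scores the constrained subset of actions.
    candidates = constrained_action_space(state)
    return max(candidates, key=lambda a: policy_score(state, a))
```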
Code as Reward: Empowering Reinforcement Learning with VLMs
David Venuto
Mohammad Sami Nur Islam
Martin Klissarov
Sherry Yang
Pre-trained Vision-Language Models (VLMs) are able to understand visual concepts, describe and decompose complex tasks into sub-tasks, and provide feedback on task completion. In this paper, we aim to leverage these capabilities to support the training of reinforcement learning (RL) agents. In principle, VLMs are well suited for this purpose, as they can naturally analyze image-based observations and provide feedback (reward) on learning progress. However, inference in VLMs is computationally expensive, so querying them frequently to compute rewards would significantly slow down the training of an RL agent. To address this challenge, we propose a framework named Code as Reward (VLM-CaR). VLM-CaR produces dense reward functions from VLMs through code generation, thereby significantly reducing the computational burden of querying the VLM directly. We show that the dense rewards generated through our approach are very accurate across a diverse set of discrete and continuous environments, and can be more effective in training RL policies than the original sparse environment rewards.
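The following sketch shows one way the idea described in this abstract could look in code, under assumptions: a Gym-style environment, a hypothetical `vlm.generate` call queried once offline to produce verifier functions, and a simple +1 bonus the first time each generated verifier passes. It illustrates the general pipeline, not the paper's actual implementation.

```python
# A minimal sketch of "code as reward": query the VLM once (offline) for Python
# verifier code, then call the compiled verifiers cheaply at every step instead
# of querying the VLM itself. `vlm.generate` and the Gym-style env are assumed.

def build_verifiers(vlm, task_description):
    """Ask the VLM once for sub-task checkers; compile them into callables."""
    source = vlm.generate(
        "Write Python functions named check_* that take an observation and "
        f"return True when a sub-task of this task is complete: {task_description}"
    )
    namespace = {}
    exec(source, namespace)  # trusted, offline setting assumed
    return [fn for name, fn in namespace.items() if name.startswith("check_")]

class DenseRewardWrapper:
    """Wraps a sparse-reward env; adds +1 the first time each verifier passes."""

    def __init__(self, env, verifiers):
        self.env, self.verifiers = env, verifiers
        self.passed = [False] * len(verifiers)

    def reset(self):
        self.passed = [False] * len(self.verifiers)
        return self.env.reset()

    def step(self, action):
        obs, sparse_reward, done, info = self.env.step(action)
        bonus = 0.0
        for i, check in enumerate(self.verifiers):
            if not self.passed[i] and check(obs):
                self.passed[i] = True
                bonus += 1.0
        return obs, sparse_reward + bonus, done, info
```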
Finding Increasingly Large Extremal Graphs with AlphaZero and Tabu Search
Abbas Mehrabian
Hyunjik Kim
Nicolas Sonnerat
Matej Balog
Gheorghe Comanici
Tudor Berariu
Andrew Lee
Anian Ruoss
Anna Bulanova
Daniel Toyama
Sam Blackwell
Bernardino Romera Paredes
Petar Veličković
Laurent Orseau
Joonkyung Lee
Anurag Murty Naredla
Adam Zsolt Wagner
Policy composition in reinforcement learning via multi-objective policy optimization
Shruti Mishra
Jordan Hoffmann
Nicolas Heess
Martin A. Riedmiller
Abbas Abdolmaleki
We enable reinforcement learning agents to learn successful behavior policies by utilizing relevant pre-existing teacher policies. The teacher policies are introduced as objectives, in addition to the task objective, in a multi-objective policy optimization setting. Using the Multi-Objective Maximum a Posteriori Policy Optimization algorithm (Abdolmaleki et al. 2020), we show that teacher policies can help speed up learning, particularly in the absence of shaping rewards. In two domains with continuous observation and action spaces, our agents successfully compose teacher policies in sequence and in parallel, and are also able to further extend the policies of the teachers in order to solve the task. Depending on the specified combination of task and teacher(s), the teacher(s) may naturally act to limit the final performance of an agent. The extent to which agents are required to adhere to teacher policies is determined by hyperparameters, which govern both the effect of the teachers on learning speed and the eventual performance of the agent on the task. In the humanoid domain (Tassa et al. 2018), we also equip agents with the ability to control the selection of teachers. With this ability, agents are able to meaningfully compose the teacher policies to achieve a higher task reward on the walk task than in cases without access to the teacher policies. We show the resemblance of composed task policies with the corresponding teacher policies through videos.
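As a rough, discrete-action illustration of treating teacher policies as extra objectives (the paper works in continuous spaces with MO-MPO, which this sketch does not reproduce), the loss below mixes a standard policy-gradient task term with one KL term per teacher; the per-teacher weights play the role of the adherence hyperparameters mentioned above. All names are illustrative.

```python
import torch.nn.functional as F

# Hypothetical PyTorch sketch: a task policy-gradient term plus one
# KL(teacher || policy) term per teacher, each scaled by its own weight.
# Tensors: policy_logits [B, A], each teacher's logits [B, A],
# actions [B] (long), advantages [B].

def composed_actor_loss(policy_logits, teacher_logits_list, actions, advantages,
                        teacher_weights):
    log_pi = F.log_softmax(policy_logits, dim=-1)

    # Task objective: advantage-weighted log-likelihood of the taken actions.
    chosen = log_pi.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    task_loss = -(advantages * chosen).mean()

    # Teacher objectives: stay close to each teacher, with a per-teacher weight
    # that trades off learning speed against final task performance.
    teacher_loss = 0.0
    for weight, t_logits in zip(teacher_weights, teacher_logits_list):
        teacher_probs = F.softmax(t_logits, dim=-1)
        kl = (teacher_probs * (F.log_softmax(t_logits, dim=-1) - log_pi)).sum(-1)
        teacher_loss = teacher_loss + weight * kl.mean()

    return task_loss + teacher_loss
```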
Accelerating exploration and representation learning with offline pre-training
Jake Bruce
Rob Fergus
Sequential decision-making agents struggle with long-horizon tasks, since solving them requires multi-step reasoning. Most reinforcement learning (RL) algorithms address this challenge through improved credit assignment, introducing memory capability, or altering the agent's intrinsic motivation (i.e., exploration) or its worldview (i.e., knowledge representation). Many of these components could be learned from offline data. In this work, we follow the hypothesis that exploration and representation learning can be improved by separately learning two different models from a single offline dataset. We show that learning a state representation using noise-contrastive estimation and a model of auxiliary reward, separately from a single collection of human demonstrations, can significantly improve sample efficiency on the challenging NetHack benchmark. We also ablate various components of our experimental setting and highlight crucial insights.
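A minimal sketch of the noise-contrastive (InfoNCE-style) representation objective mentioned in the abstract is shown below, under the assumption that positives are temporally adjacent observations from the same offline demonstration and the rest of the batch serves as negatives; the encoder and batch layout are illustrative, not the paper's exact setup.

```python
import torch
import torch.nn.functional as F

# InfoNCE-style contrastive loss over an offline batch: the i-th anchor's
# positive is its own next observation (the diagonal of the similarity matrix),
# and every other next observation in the batch acts as a negative.

def info_nce_loss(encoder, obs, next_obs, temperature=0.1):
    z = F.normalize(encoder(obs), dim=-1)           # anchors    [B, D]
    z_pos = F.normalize(encoder(next_obs), dim=-1)  # positives  [B, D]

    logits = z @ z_pos.t() / temperature            # similarities [B, B]
    labels = torch.arange(z.size(0), device=z.device)
    return F.cross_entropy(logits, labels)
```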