Martin Klissarov

MaestroMotif: Skill Design From Artificial Intelligence Feedback

Mikael Henaff

Roberta Raileanu

Marlos C. Machado

Describing skills in natural language has the potential to provide an accessible way to inject human knowledge about decision-making into an… (see more) AI system. We present MaestroMotif, a method for AI-assisted skill design, which yields high-performing and adaptable agents. MaestroMotif leverages the capabilities of Large Language Models (LLMs) to effectively create and reuse skills. It first uses an LLM's feedback to automatically design rewards corresponding to each skill, starting from their natural language description. Then, it employs an LLM's code generation abilities, together with reinforcement learning, for training the skills and combining them to implement complex behaviors specified in language. We evaluate MaestroMotif using a suite of complex tasks in the NetHack Learning Environment (NLE), demonstrating that it surpasses existing approaches in both performance and usability.

2025-01-21

ICLR.cc/2025/Conference (oral)

On the Modeling Capabilities of Large Language Models for Sequential Decision Making

R Devon Hjelm

Alexander T Toshev

Bogdan Mazoure

2025-01-21

ICLR.cc/2025/Conference (poster)

Code as Reward: Empowering Reinforcement Learning with VLMs

David Venuto

Sami Nur Islam

Sherry Yang

Ankit Anand

Pre-trained Vision-Language Models (VLMs) are able to understand visual concepts, describe and decompose complex tasks into sub-tasks, and p… (see more)rovide feedback on task completion. In this paper, we aim to leverage these capabilities to support the training of reinforcement learning (RL) agents. In principle, VLMs are well suited for this purpose, as they can naturally analyze image-based observations and provide feedback (reward) on learning progress. However, inference in VLMs is computationally expensive, so querying them frequently to compute rewards would significantly slowdown the training of an RL agent. To address this challenge, we propose a framework named Code as Reward (VLM-CaR). VLM-CaR produces dense reward functions from VLMs through code generation, thereby significantly reducing the computational burden of querying the VLM directly. We show that the dense rewards generated through our approach are very accurate across a diverse set of discrete and continuous environments, and can be more effective in training RL policies than the original sparse environment rewards.

2024-04-30

ICML.cc/2024/Conference (spotlight)

proceedings.mlr.press

Motif: Intrinsic Motivation From Artificial Intelligence Feedback

Roberta Raileanu

Mikael Henaff

Exploring rich environments and evaluating one's actions without prior knowledge is immensely challenging. In this paper, we propose Motif, … (see more)a general method to interface such prior knowledge from a Large Language Model (LLM) with an agent. Motif is based on the idea of grounding LLMs for decision-making without requiring them to interact with the environment: it elicits preferences from an LLM over pairs of captions to construct an intrinsic reward, which is then used to train agents with reinforcement learning. We evaluate Motif's performance and behavior on the challenging, open-ended and procedurally-generated NetHack game. Surprisingly, by only learning to maximize its intrinsic reward, Motif achieves a higher game score than an algorithm directly trained to maximize the score itself. When combining Motif's intrinsic reward with the environment reward, our method significantly outperforms existing approaches and makes progress on tasks where no advancements have ever been made without demonstrations. Finally, we show that Motif mostly generates intuitive human-aligned behaviors which can be steered easily through prompt modifications, while scaling well with the LLM size and the amount of information given in the prompt.

2024-01-15

ICLR.cc/2024/Conference (poster)

Flexible Option Learning

Temporal abstraction in reinforcement learning (RL), offers the promise of improving generalization and knowledge transfer in complex enviro… (see more)nments, by propagating information more efficiently over time. Although option learning was initially formulated in a way that allows updating many options simultaneously, using off-policy, intra-option learning (Sutton, Precup & Singh, 1999), many of the recent hierarchical reinforcement learning approaches only update a single option at a time: the option currently executing. We revisit and extend intra-option learning in the context of deep reinforcement learning, in order to enable updating all options consistent with current primitive action choices, without introducing any additional estimates. Our method can therefore be naturally adopted in most hierarchical RL frameworks. When we combine our approach with the option-critic algorithm for option discovery, we obtain significant improvements in performance and data-efficiency across a wide variety of domains.

2021-12-05

ArXiv (preprint)

Options of Interest: Temporal Abstraction with Interest Functions

Khimya Khetarpal

Maxime Chevalier-Boisvert

Pierre-Luc Bacon

Temporal abstraction refers to the ability of an agent to use behaviours of controllers which act for a limited, variable amount of time. Th… (see more)e options framework describes such behaviours as consisting of a subset of states in which they can initiate, an internal policy and a stochastic termination condition. However, much of the subsequent work on option discovery has ignored the initiation set, because of difficulty in learning it from data. We provide a generalization of initiation sets suitable for general function approximation, by defining an interest function associated with an option. We derive a gradient-based learning algorithm for interest functions, leading to a new interest-option-critic architecture. We investigate how interest functions can be leveraged to learn interpretable and reusable temporal abstractions. We demonstrate the efficacy of the proposed approach through quantitative and qualitative results, in both discrete and continuous environments.

2020-04-02

Proceedings of the AAAI Conference on Artificial Intelligence (published)

Reward Propagation Using Graph Convolutional Networks

Potential-based reward shaping provides an approach for designing good reward functions, with the purpose of speeding up learning. However, … (see more)automatically finding potential functions for complex environments is a difficult problem (in fact, of the same difficulty as learning a value function from scratch). We propose a new framework for learning potential functions by leveraging ideas from graph representation learning. Our approach relies on Graph Convolutional Networks which we use as a key ingredient in combination with the probabilistic inference view of reinforcement learning. More precisely, we leverage Graph Convolutional Networks to perform message passing from rewarding states. The propagated messages can then be used as potential functions for reward shaping to accelerate learning. We verify empirically that our approach can achieve considerable improvements in both small and high-dimensional control problems.

2019-12-31

NeurIPS (published)

Diffusion-Based Approximate Value Functions

We present a novel model-based framework inspired by spectral graph theory and deep geometric learning: the Diffusion-based Approximate Valu… (see more)e Function. Our approach efficiently approximates the graph Laplacian of an MDP’s underlying graph by using Graph Convolutional Networks (GCN). By generating an approximate value function, we diffuse the reward signal much faster than traditional Reinforcement Learning algorithms such as TD(0). This leads to substantial improvements on sparse rewards environments where efficient credit assignment is most demanding.

2018-06-26

ICML.cc/2018/ECA (accepted)

When Waiting is not an Option: Learning Options with a Deliberation Cost

Jean Harb

Pierre-Luc Bacon

Recent work has shown that temporally extended actions (options) can be learned fully end-to-end as opposed to being specified in advance. W… (see more)hile the problem of "how" to learn options is increasingly well understood, the question of "what" good options should be has remained elusive. We formulate our answer to what "good" options should be in the bounded rationality framework (Simon, 1957) through the notion of deliberation cost. We then derive practical gradient-based learning algorithms to implement this objective. Our results in the Arcade Learning Environment (ALE) show increased performance and interpretability.

2018-04-28

Proceedings of the AAAI Conference on Artificial Intelligence (published)

Learnings Options End-to-End for Continuous Action Tasks

Pierre-Luc Bacon

Jean Harb

We present new results on learning temporally extended actions for continuoustasks, using the options framework (Suttonet al.[1999b], Precup… (see more) [2000]). In orderto achieve this goal we work with the option-critic architecture (Baconet al.[2017])using a deliberation cost and train it with proximal policy optimization (Schulmanet al.[2017]) instead of vanilla policy gradient. Results on Mujoco domains arepromising, but lead to interesting questions aboutwhena given option should beused, an issue directly connected to the use of initiation sets.

2017-11-29

ArXiv (preprint)