Bogdan Mazoure

Alexander T Toshev

2025-01-22

ICLR.cc/2025/Conference (poster)

From Multimodal LLMs to Generalist Embodied Agents: Methods and Lessons

Andrew Szot

Omar Attia

Aleksei Timofeev

Harsh Agrawal

Zhe Gan

Zsolt Kira

Alexander T Toshev

We examine the capability of Multimodal Large Language Models (MLLMs) to tackle diverse domains that extend beyond the traditional language … (see more)and vision tasks these models are typically trained on. Specifically, our focus lies in areas such as Embodied AI, Games, UI Control, and Planning. To this end, we introduce a process of adapting an MLLM to a Generalist Embodied Agent (GEA). GEA is a single unified model capable of grounding itself across these varied domains through a multi-embodiment action tokenizer. GEA is trained with supervised learning on a large dataset of embodied experiences and with online RL in interactive simulators. We explore the data and algorithmic choices necessary to develop such a model. Our findings reveal the importance of training with cross-domain data and online RL for building generalist agents. The final GEA model achieves strong generalization performance to unseen tasks across diverse benchmarks compared to other generalist models and benchmark-specific approaches.

2024-12-11

ArXiv (preprint)

From Multimodal LLMs to Generalist Embodied Agents: Methods and Lessons

Andrew Szot

Omar Attia

Aleksei Timofeev

Harsh Agrawal

Zhe Gan

Zsolt Kira

Alexander T Toshev

We examine the capability of Multimodal Large Language Models (MLLMs) to tackle diverse domains that extend beyond the traditional language … (see more)and vision tasks these models are typically trained on. Specifically, our focus lies in areas such as Embodied AI, Games, UI Control, and Planning. To this end, we introduce a process of adapting an MLLM to a Generalist Embodied Agent (GEA). GEA is a single unified model capable of grounding itself across these varied domains through a multi-embodiment action tokenizer. GEA is trained with supervised learning on a large dataset of embodied experiences and with online RL in interactive simulators. We explore the data and algorithmic choices necessary to develop such a model. Our findings reveal the importance of training with cross-domain data and online RL for building generalist agents. The final GEA model achieves strong generalization performance to unseen tasks across diverse benchmarks compared to other generalist models and benchmark-specific approaches.

2024-12-11

ArXiv (preprint)

From Multimodal LLMs to Generalist Embodied Agents: Methods and Lessons

Andrew Szot

Omar Attia

Aleksei Timofeev

Harsh Agrawal

Zhe Gan

Zsolt Kira

Alexander T Toshev

We examine the capability of Multimodal Large Language Models (MLLMs) to tackle diverse domains that extend beyond the traditional language … (see more)and vision tasks these models are typically trained on. Specifically, our focus lies in areas such as Embodied AI, Games, UI Control, and Planning. To this end, we introduce a process of adapting an MLLM to a Generalist Embodied Agent (GEA). GEA is a single unified model capable of grounding itself across these varied domains through a multi-embodiment action tokenizer. GEA is trained with supervised learning on a large dataset of embodied experiences and with online RL in interactive simulators. We explore the data and algorithmic choices necessary to develop such a model. Our findings reveal the importance of training with cross-domain data and online RL for building generalist agents. The final GEA model achieves strong generalization performance to unseen tasks across diverse benchmarks compared to other generalist models and benchmark-specific approaches.

2024-12-11

ArXiv (preprint)

Grounding Multimodal Large Language Models in Actions

Andrew Szot

Harsh Agrawal

Zsolt Kira

Alexander T Toshev

Multimodal Large Language Models (MLLMs) have demonstrated a wide range of capabilities across many domains including Embodied AI. In this w… (see more)ork, we study how to best ground a MLLM into different embodiments and their associated action spaces, including both continuous and discrete actions. For continuous actions, a set of learned tokenizations that capture an action at various resolutions allows for sufficient modeling precision, yielding the best performance on downstream tasks. For discrete actions, semantically aligning these actions with the native output token space of the MLLM leads to the strongest performance. We arrive at these lessons via a thorough study of seven action grounding approaches on five different environments, encompassing over 114 embodied tasks.

2024-09-25

NeurIPS.cc/2024/Conference (poster)

On the benefits of pixel-based hierarchical policies for task generalization

T. Cristea-Platon

Josh Susskind

Walter Talbott

Reinforcement learning practitioners often avoid hierarchical policies, especially in image-based observation spaces. Typically, the single-… (see more)task performance improvement over flat-policy counterparts does not justify the additional complexity associated with implementing a hierarchy. However, by introducing multiple decision-making levels, hierarchical policies can compose lower-level policies to more effectively generalize between tasks, highlighting the need for multi-task evaluations. We analyze the benefits of hierarchy through simulated multi-task robotic control experiments from pixels. Our results show that hierarchical policies trained with task conditioning can (1) increase performance on training tasks, (2) lead to improved reward and state-space generalizations in similar tasks, and (3) decrease the complexity of fine tuning required to solve novel tasks. Thus, we believe that hierarchical policies should be considered when building reinforcement learning architectures capable of generalizing between tasks.

2024-07-27

ArXiv (preprint)

Grounding Multimodal Large Language Models in Actions

Andrew Szot

Harsh Agrawal

Zsolt Kira

Alexander T Toshev

Multimodal Large Language Models (MLLMs) have demonstrated a wide range of capabilities across many domains, including Embodied AI. In this … (see more)work, we study how to best ground a MLLM into different embodiments and their associated action spaces, with the goal of leveraging the multimodal world knowledge of the MLLM. We first generalize a number of methods through a unified architecture and the lens of action space adaptors. For continuous actions, we show that a learned tokenization allows for sufficient modeling precision, yielding the best performance on downstream tasks. For discrete actions, we demonstrate that semantically aligning these actions with the native output token space of the MLLM leads to the strongest performance. We arrive at these lessons via a thorough study of seven action space adapters on five different environments, encompassing over 114 embodied tasks.

2024-06-12

ArXiv (preprint)

Grounding Multimodal Large Language Models in Actions

Andrew Szot

Harsh Agrawal

Zsolt Kira

Alexander T Toshev

2024-06-12

ArXiv (preprint)

Generative Models for Decision Making

Lisa Lee

Roberta Raileanu

Yilun Du

Walter Talbott

Katherine Metcalf

Alexander T Toshev

Generative Artificial Intelligence (AI) has made significant advancements in recent years, particularly with the development of large langua… (see more)ge and diffusion models. These generative models have demonstrated impressive capabilities in various tasks, such as text generation and image and audio synthesis. Concurrently, Reinforcement Learning (RL) has made significant strides in solving complex sequential decision-making problems with the help of external knowledge sources . However, there remains untapped potential in combining generative models with RL algorithms to tackle real-world challenges, particularly to improve sample efficiency of tabula rasa training by introducing priors from related domains such as visual question-answering, image captioning and image generation. This workshop aims to bring together researchers and practitioners from the fields of generative AI and reinforcement learning to explore the latest advances, methodologies, and applications. By fostering collaborations between these two domains, we intend to unlock new opportunities for addressing complex problems that lie at the intersection of both fields.

2024-03-08

ICLR.cc/2024/Workshop_Proposals (published)

Large Language Models as Generalizable Policies for Embodied Tasks

Andrew Szot

Max Schwarzer

Harsh Agrawal

Walter Talbott

Rin Metcalf

Natalie Mackraz

Alexander T Toshev

2024-01-16

ICLR.cc/2024/Conference (poster)

Accelerating exploration and representation learning with offline pre-training

Jake Bruce

Doina Precup

Rob Fergus

Ankit Anand

Sequential decision-making agents struggle with long horizon tasks, since solving them requires multi-step reasoning. Most reinforcement lea… (see more)rning (RL) algorithms address this challenge by improved credit assignment, introducing memory capability, altering the agent's intrinsic motivation (i.e. exploration) or its worldview (i.e. knowledge representation). Many of these components could be learned from offline data. In this work, we follow the hypothesis that exploration and representation learning can be improved by separately learning two different models from a single offline dataset. We show that learning a state representation using noise-contrastive estimation and a model of auxiliary reward separately from a single collection of human demonstrations can significantly improve the sample efficiency on the challenging NetHack benchmark. We also ablate various components of our experimental setting and highlight crucial insights.

2023-06-20

ICML.cc/2023/Workshop/ILHF (published)

Value function estimation using conditional diffusion models for control

Walter Talbott

Miguel Ángel Bautista