Cyrus Neary

RoboArena: Distributed Real-World Evaluation of Generalist Robot Policies

Pranav Atreya

Karl Pertsch

Tony Lee

Moo Jin Kim

Arhan Jain

Cyrus Neary

Edward S. Hu

Kanav Arora

Kirsty Ellis

Luca Macesanu

Matthew Leonard

Meedeum Cho

Shivin Dass

Tony Wang

Xingfang Yuan

Abhishek Gupta

Dinesh Jayaraman

Glen Berseth … (see 6 more)

Kostas Daniilidis

Roberto Martín-Martín

Youngwoon Lee

Percy Liang

Chelsea Finn

Sergey Levine

Comprehensive, unbiased, and comparable evaluation of modern generalist policies is uniquely challenging: existing approaches for robot benchmarking typically rely on heavy standardization, either by specifying fixed evaluation tasks and environments, or by hosting centralized "robot challenges", and do not readily scale to evaluating generalist policies across a broad range of tasks and environments. In this work, we propose RoboArena, a new approach for scalable evaluation of generalist robot policies in the real world. Instead of standardizing evaluations around fixed tasks, environments, or locations, we propose to crowd-source evaluations across a distributed network of evaluators. Importantly, evaluators can freely choose the tasks and environments they evaluate on, enabling easy scaling of diversity, but they are required to perform double-blind evaluations over pairs of policies. Then, by aggregating preference feedback from pairwise comparisons across diverse tasks and environments, we can derive a ranking of policies. We instantiate our approach across a network of evaluators at seven academic institutions using the DROID robot platform. Through more than 600 pairwise real-robot evaluation episodes across seven generalist policies, we demonstrate that our crowd-sourced approach can more accurately rank the performance of existing generalist policies than conventional, centralized evaluation approaches, while being more scalable, resilient, and trustworthy. We open our evaluation network to the community and hope that it can enable more accessible comparisons of generalist robot policies.

2025-10-07

Proceedings of The 8th Conference on Robot Learning (published)

proceedings.mlr.press

Task Robustness via Re-Labelling Vision-Action Robot Data

Cyrus Neary

The recent trend in scaling models for robot learning has resulted in impressive policies that can perform various manipulation tasks and generalize to novel scenarios. However, these policies continue to struggle with following instructions, likely due to the limited linguistic and action sequence diversity in existing robotics datasets. This paper introduces

2025-09-06

robot-learning.org/CoRL/2025/Workshop/Robot_Data (published)

Zero-Shot Constraint Satisfaction with Forward-Backward Representations

Adriana Hugessen

Harley Wiltzer

Cyrus Neary

Amy Zhang

Traditionally, constrained policy optimization with Reinforcement Learning (RL) requires learning a new policy from scratch for any new environment, goal or cost function, with limited generalization to new tasks and constraints. Given the sample inefficiency of many common deep RL methods, this procedure can be impractical for many real-world scenarios, particularly when constraints or tasks are changing. As an alternative, in the unconstrained setting, various works have sought to pre-train representations from offline datasets to accelerate policy optimization upon specification of a reward. Such methods can permit faster adaptation to new tasks in a given environment, dramatically improving sample efficiency. Recently, zero-shot policy optimization has been explored by leveraging a particular

2025-07-01

rl-conference.cc/RLC/2025/Workshop/RLBrew (published)

Scalable Tree Search over Graphs with Learned Action Pruning for Power Grid Control

Florence Cloutier

Cyrus Neary

Adriana Hugessen

Viktor Todosijević

Zina Kamel

As real-world infrastructure systems become increasingly complex and large-scale, there is a growing need for learning-based control strategies that can make informed decisions in complex and dynamic environments. However, large-scale problems, such as power grid control, introduce high-dimensional action spaces and necessitate transferability across varying grid topologies. We introduce **H**ierarchical **E**xpert-Guided **R**econfiguration **O**ptimization for **G**raph **T**opologies, **HERO-GT**, a model-based planning approach that combines a pretrained graph neural network (GNN) for topology-aware action pruning with a Monte Carlo Tree Search (MCTS) planner for targeted, structured exploration. More specifically, the high-level GNN predicts a promising subset of actions, which the low-level MCTS agent uses to focus its search and reduce computational overhead while remaining adaptable to unseen graph structures. Furthermore, the MCTS planner leverages a given *default policy* (which may be defined, for example, by heuristics, problem relaxations, or rule-based methods) to bias the search and prioritize actions that are expected to improve performance over the default. We deploy HERO-GT in power grid environments, demonstrating that it not only improves over a strong default policy, but also scales to a realistic operational setting where exhaustive search becomes computationally infeasible.

2025-06-17

rl-conference.cc/RLC/2025/Workshop/RL4RS (published)