Publications

RoboArena: Distributed Real-World Evaluation of Generalist Robot Policies

Pranav Atreya

Karl Pertsch

Tony Lee

Moo Jin Kim

Arhan Jain

Artur Kuramshin

Cyrus Neary

Edward S. Hu

Kanav Arora

Kirsty Ellis

Luca Macesanu

Matthew Leonard

Meedeum Cho

Özgür Aslan

Shivin Dass

Tony Wang

Xingfang Yuan

Abhishek Gupta

Dinesh Jayaraman

Glen Berseth

Kostas Daniilidis

Roberto Martín-Martín

Youngwoon Lee

Percy Liang

Chelsea Finn

Sergey Levine

Comprehensive, unbiased, and comparable evaluation of modern generalist policies is uniquely challenging: existing approaches for robot benchmarking typically rely on heavy standardization, either by specifying fixed evaluation tasks and environments, or by hosting centralized "robot challenges", and do not readily scale to evaluating generalist policies across a broad range of tasks and environments. In this work, we propose RoboArena, a new approach for scalable evaluation of generalist robot policies in the real world. Instead of standardizing evaluations around fixed tasks, environments, or locations, we propose to crowd-source evaluations across a distributed network of evaluators. Importantly, evaluators can freely choose the tasks and environments they evaluate on, enabling easy scaling of diversity, but they are required to perform double-blind evaluations over pairs of policies. Then, by aggregating preference feedback from pairwise comparisons across diverse tasks and environments, we can derive a ranking of policies. We instantiate our approach across a network of evaluators at seven academic institutions using the DROID robot platform. Through more than 600 pairwise real-robot evaluation episodes across seven generalist policies, we demonstrate that our crowd-sourced approach can more accurately rank the performance of existing generalist policies than conventional, centralized evaluation approaches, while being more scalable, resilient, and trustworthy. We open our evaluation network to the community and hope that it can enable more accessible comparisons of generalist robot policies.

2025-10-07

Proceedings of The 8th Conference on Robot Learning (published)

proceedings.mlr.press

In silico Neutron Relative Biological Effectiveness Estimations For Pre-DNA Repair And Post-DNA Repair Endpoints

Nicolas Desjardins

John Kildea

2025-10-07

bioRxiv (preprint)

doi.org

Addressing Concept Mislabeling in Concept Bottleneck Models Through Preference Optimization

Emiliano Penaloza

Tianyue H. Zhang

Laurent Charlin

Mateo Espinosa Zarlenga

Concept Bottleneck Models (CBMs) propose to enhance the trustworthiness of AI systems by constraining their decisions on a set of human-understandable concepts. However, CBMs typically rely on datasets with assumedly accurate concept labels, an assumption often violated in practice, which we show can significantly degrade performance. To address this, we introduce the Concept Preference Optimization (CPO) objective, a new loss function based on Direct Preference Optimization, which effectively mitigates the negative impact of concept mislabeling on CBM performance. We provide an analysis of some key properties of the CPO objective, showing that it directly optimizes for the concept’s posterior distribution, and contrast it against Binary Cross Entropy (BCE), where we show CPO is inherently less sensitive to concept noise. We empirically confirm our analysis, finding that CPO consistently outperforms BCE on three real-world datasets with and without added label noise.

2025-10-06

Proceedings of the 42nd International Conference on Machine Learning (published)

proceedings.mlr.press

Alberta Wells Dataset: Pinpointing Oil and Gas Wells from Satellite Imagery

Pratinav Seth

Michelle Lin

Brefo Dwamena Yaw

Jade Boutot

Mary Kang

David Rolnick

Millions of abandoned oil and gas wells are scattered across the world, leaching methane into the atmosphere and toxic compounds into the groundwater. Many of these locations are unknown, preventing the wells from being plugged and their polluting effects averted. Remote sensing is a relatively unexplored tool for pinpointing abandoned wells at scale. We introduce the first large-scale benchmark dataset for this problem, leveraging high-resolution multi-spectral satellite imagery from Planet Labs. Our curated dataset comprises over 213,000 wells (abandoned, suspended, and active) from Alberta, a region with especially high well density, sourced from the Alberta Energy Regulator and verified by domain experts. We evaluate baseline algorithms for well detection and segmentation, showing the promise of computer vision approaches but also significant room for improvement.

2025-10-06

Proceedings of the 42nd International Conference on Machine Learning (published)

proceedings.mlr.press

Aligning Protein Conformation Ensemble Generation with Physical Feedback

Jiarui Lu

Xiaoyin Chen

Stephen Zhewen Lu

Aurelie Lozano

Vijil Chenthamarakshan

Payel Das

Jian Tang

Protein dynamics play a crucial role in protein biological functions and properties, and their traditional study typically relies on time-consuming molecular dynamics (MD) simulations conducted in silico. Recent advances in generative modeling, particularly denoising diffusion models, have enabled efficient, accurate protein structure prediction and conformation sampling by learning distributions over crystallographic structures. However, effectively integrating physical supervision into these data-driven approaches remains challenging, as standard energy-based objectives often lead to intractable optimization. In this paper, we introduce Energy-based Alignment (EBA), a method that aligns generative models with feedback from physical models, efficiently calibrating them to appropriately balance conformational states based on their energy differences. Experimental results on the MD ensemble benchmark demonstrate that EBA achieves state-of-the-art performance in generating high-quality protein ensembles. By improving the physical plausibility of generated structures, our approach enhances model predictions and holds promise for applications in structural biology and drug discovery.

2025-10-06

Proceedings of the 42nd International Conference on Machine Learning (published)

proceedings.mlr.press

openreview.net

Calibrated Value-Aware Model Learning with Probabilistic Environment Models

Claas Voelcker

Anastasiia Pedan

Arash Ahmadian

Romina Abachi

Igor Gilitschenski

Amir-massoud Farahmand

The idea of value-aware model learning, that models should produce accurate value estimates, has gained prominence in model-based reinforcement learning. The MuZero loss, which penalizes a model's value function prediction compared to the ground-truth value function, has been utilized in several prominent empirical works in the literature. However, theoretical investigation into its strengths and weaknesses is limited. In this paper, we analyze the family of value-aware model learning losses, which includes the popular MuZero loss. We show that these losses, as normally used, are uncalibrated surrogate losses, which means that they do not always recover the correct model and value function. Building on this insight, we propose corrections to solve this issue. Furthermore, we investigate the interplay between the loss calibration, latent model architectures, and auxiliary losses that are commonly employed when training MuZero-style agents. We show that while deterministic models can be sufficient to predict accurate values, learning calibrated stochastic models is still advantageous.

2025-10-06

Proceedings of the 42nd International Conference on Machine Learning (published)

doi.org

arxiv.org

proceedings.mlr.press

Categorical Distributional Reinforcement Learning with Kullback-Leibler Divergence: Convergence and Asymptotics

Tyler Kastner

Mark Rowland

Yunhao Tang

Murat A Erdogdu

Amir-massoud Farahmand

We study the problem of distributional reinforcement learning using categorical parametrisations and a KL divergence loss. Previous work analyzing categorical distributional RL has done so using a Cramér distance-based loss, simplifying the analysis but creating a theory-practice gap. We introduce a preconditioned version of the algorithm, and prove that it is guaranteed to converge. We further derive the asymptotic variance of the categorical estimates under different learning rate regimes, and compare to that of classical reinforcement learning. We finally empirically validate our theoretical results and perform an empirical investigation into the relative strengths of using KL losses, and derive a number of actionable insights for practitioners.

2025-10-06

Proceedings of the 42nd International Conference on Machine Learning (published)

proceedings.mlr.press

Compositional Risk Minimization

Charles Arnal

Compositional generalization is a crucial step towards developing data-efficient intelligent machines that generalize in human-like ways. In this work, we tackle a challenging form of distribution shift, termed compositional shift, where some attribute combinations are completely absent at training but present in the test distribution. This shift tests the model’s ability to generalize compositionally to novel attribute combinations in discriminative tasks. We model the data with flexible additive energy distributions, where each energy term represents an attribute, and derive a simple alternative to empirical risk minimization termed compositional risk minimization (CRM). We first train an additive energy classifier to predict the multiple attributes and then adjust this classifier to tackle compositional shifts. We provide an extensive theoretical analysis of CRM, where we show that our proposal extrapolates to special affine hulls of seen attribute combinations. Empirical evaluations on benchmark datasets confirm the improved robustness of CRM compared to other methods from the literature designed to tackle various forms of subpopulation shifts.

2025-10-06

Proceedings of the 42nd International Conference on Machine Learning (published)

proceedings.mlr.press
