The proliferation of agent benchmarks has created critical fragmentation that threatens research productivity. Each new benchmark requires substantial custom integration, creating an "integration tax" that limits comprehensive evaluation. We propose CUBE (Common Unified Benchmark Environments), a universal protocol standard built on MCP and Gym that allows benchmarks to be wrapped once and used everywhere. By separating task, benchmark, package, and registry concerns into distinct API layers, CUBE enables any compliant platform to access any compliant benchmark for evaluation, RL training, or data generation without custom integration. We call on the community to contribute to the development of this standard before platform-specific implementations deepen fragmentation as benchmark production accelerates through 2026.
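Since CUBE is a proposal rather than a released library, the sketch below only illustrates the layered separation the abstract describes: a task layer with a Gym-style interface, grouped by a benchmark layer that a registry could look up. Every class and method name here (CubeTask, CubeBenchmark, make_task) is hypothetical.

```python
# A minimal sketch of the layered design described above, assuming a
# Gym-style task interface. CUBE is a proposal, so every name in this
# file is hypothetical.
import gymnasium as gym


class CubeTask(gym.Env):
    """Task layer: one benchmark task exposed through the standard Gym API."""

    def __init__(self, task_id: str):
        self.task_id = task_id
        self.observation_space = gym.spaces.Text(max_length=10_000)
        self.action_space = gym.spaces.Text(max_length=1_000)

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        return f"initial observation for {self.task_id}", {}

    def step(self, action: str):
        # A real wrapper would forward `action` to the underlying benchmark
        # (e.g. over MCP) and translate the response back into this format.
        return "next observation", 0.0, False, False, {}


class CubeBenchmark:
    """Benchmark layer: groups tasks; a registry layer maps names to benchmarks."""

    def __init__(self, name: str, task_ids: list[str]):
        self.name = name
        self.task_ids = task_ids

    def make_task(self, task_id: str) -> CubeTask:
        return CubeTask(task_id)


# Any compliant platform could then run evaluation, RL training, or data
# collection through the same loop, regardless of which benchmark is wrapped:
benchmark = CubeBenchmark("example-benchmark", ["task-1", "task-2"])
env = benchmark.make_task("task-1")
obs, info = env.reset()
obs, reward, terminated, truncated, info = env.step("some agent action")
```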
We study the use of large language model-based agents for interacting with software via web browsers. Unlike prior work, we focus on measuring the agents' ability to perform tasks that span the typical daily work of knowledge workers utilizing enterprise software systems. To this end, we propose WorkArena, a remote-hosted benchmark of 33 tasks based on the widely-used ServiceNow platform. We also introduce BrowserGym, an environment for the design and evaluation of such agents, offering a rich set of actions as well as multimodal observations. Our empirical evaluation reveals that while current agents show promise on WorkArena, there remains a considerable gap towards achieving full task automation. Notably, our analysis uncovers a significant performance disparity between open and closed-source LLMs, highlighting a critical area for future exploration and development in the field.
2024-07-07
Proceedings of the 41st International Conference on Machine Learning (published)
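Because BrowserGym exposes tasks as Gym-registered environments, an evaluation run looks like a standard RL interaction loop. The sketch below assumes the usual Gymnasium API; the import path and task id are illustrative placeholders, not exact WorkArena identifiers.

```python
# A short evaluation loop, assuming the standard Gymnasium API. The import
# path and task id are illustrative placeholders; see the BrowserGym
# documentation for the registered names.
import gymnasium as gym
import browsergym.workarena  # assumed import that registers the WorkArena tasks

env = gym.make("browsergym/workarena.example-task")  # hypothetical task id

obs, info = env.reset()
for _ in range(10):  # bounded demo rollout
    # A real agent would feed the multimodal observation (DOM, screenshot,
    # accessibility tree, ...) to an LLM and parse an action from its reply.
    action = 'click("submit-button")'  # placeholder action string
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        break
env.close()
```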
Plants are dynamic systems that are integral to our existence and survival. Plants face environmental changes and adapt over time to their surrounding conditions. We argue that plant responses to an environmental stimulus are a good example of a real-world problem that can be approached within a reinforcement learning (RL) framework. With the objective of controlling a plant by moving the light source, we propose GrowSpace as a new RL benchmark. The back-end of the simulator is implemented using the Space Colonisation Algorithm, a plant-growing model based on competition for space. Compared to video game RL environments, this simulator addresses a real-world problem and serves as a test bed to visualize plant growth and movement faster than physical experiments allow. GrowSpace is composed of a suite of challenges that tackle several problems such as control, multi-stage learning, fairness, and multi-objective learning. We provide agent baselines alongside case studies to demonstrate the difficulty of the proposed benchmark.
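GrowSpace is likewise packaged as a Gym benchmark, so a baseline rollout follows the familiar interaction loop, with actions that reposition the light source. The environment id and import below are assumptions; the GrowSpace repository documents the registered ids.

```python
# A random-policy rollout sketch, assuming the classic Gym API (GrowSpace
# predates Gymnasium). The import and environment id are assumptions.
import gym
import growspace  # assumed import that registers the GrowSpace environments

env = gym.make("GrowSpaceEnv-Control-v0")  # hypothetical environment id

obs = env.reset()
total_reward = 0.0
for _ in range(50):  # short baseline episode
    action = env.action_space.sample()  # e.g. move the light or resize the beam
    obs, reward, done, info = env.step(action)
    total_reward += reward
    if done:
        break
print(f"episode return: {total_reward:.2f}")
env.close()
```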