Portrait of Matteo Boglioni

Matteo Boglioni

Research Collaborator - McGill
Principal Supervisor

Publications

WebArena-Pro: A Heterogeneous, Multimodal, Reproducible Benchmark for Web Agents
Fatemeh Pesaran zadeh
Weijian Qi
Alexander Miller
Junyi Song
Yunjia Tian
Dongjin Kang
Seyeon Choi
Ewen Gueguen
Zeyi Liao
Mengqi Yuan
Alexandre Lacoste
Huan Sun … (2 more)
Gunhee Kim
Web agents powered by large language and vision-language models are increasingly applied to realistic browser work that spans heterogeneous applications, multimodal content, and stateful workflows. However, existing reproducible web-agent benchmarks cover only a small number of web applications drawn from a few software categories, and restrict modality to text and vision. Live benchmarks broaden site coverage but sacrifice reproducibility, since pages and data drift between runs. Moreover, existing benchmarks do not meaningfully evaluate whether agents can understand and use audio and video content embedded within web tasks. To address these gaps, we introduce WebArena-Pro, a benchmark comprising 300 tasks across 20 self-hosted web applications in six domain categories, spanning distinct interface conventions, workflows, and data models. Across the evaluated agents, the best performance is achieved by Gemini 3.1 Pro, which attains 37.0% success under a 50-step budget, while open-source models' performance does not exceed 27.7% success. Among reproducible, human-curated web agent benchmarks, WebArena-Pro provides the broadest application coverage and the most comprehensive multimodal support to date. The benchmark treats audio and video as core observations alongside text and vision, with dedicated actions for extracting information from each. WebArena-Pro runs each task in isolation and supports reproducible, parallel evaluation. Tasks are authored through a dedicated annotator interface, filtered by LLM-assisted triage, and finally validated by humans before release.