Chris Pal

UI-Vision: A Desktop-centric GUI Benchmark for Visual Perception and Interaction

Shravan Nayak

Xiangru Jian

Kevin Qinghong Lin

Juan A. Rodriguez

Montek Kalsi

M. Tamer Özsu

Sai Rajeswar

Human Annotator

2025-10-06

Proceedings of the 42nd International Conference on Machine Learning (published)

doi.org

openreview.net

UI-Vision: A Desktop-centric GUI Benchmark for Visual Perception and Interaction

Shravan Nayak

Xiangru Jian

Kevin Qinghong Lin

Juan A. Rodriguez

Montek Kalsi

M. Tamer Özsu

Sai Rajeswar

Autonomous agents that navigate Graphical User Interfaces (GUIs) to automate tasks like document editing and file management can greatly enhance computer workflows. While existing research focuses on online settings, desktop environments, critical for many professional and everyday tasks, remain underexplored due to data collection challenges and licensing issues. We introduce UI-Vision, the first comprehensive, license-permissive benchmark for offline, fine-grained evaluation of computer use agents in real-world desktop environments. Unlike online benchmarks, UI-Vision provides: (i) dense, high-quality annotations of human demonstrations, including bounding boxes, UI labels, and action trajectories (clicks, drags, and keyboard inputs) across 83 software applications, and (ii) three fine-to-coarse grained tasks (Element Grounding, Layout Grounding, and Action Prediction) with well-defined metrics to rigorously evaluate agents’ performance in desktop environments. Our evaluation reveals critical limitations in state-of-the-art models like UI-TARS-72B, including issues with understanding professional software, spatial reasoning, and complex actions like drag-and-drop. These findings highlight the challenges in developing fully autonomous computer-use agents. With UI-Vision, we aim to advance the development of more capable agents for real-world desktop tasks.

2025-10-06

Proceedings of the 42nd International Conference on Machine Learning (published)

proceedings.mlr.press

DRBench: A Realistic Benchmark for Enterprise Deep Research

Amirhossein Abaskohi

Tianyi Chen

Miguel Muñoz-Mármol

Curtis Fox

Amrutha Varshini Ramesh

Étienne Marcotte

Issam Hadj Laradji

We introduce DRBench, a benchmark for evaluating AI agents on complex, open-ended deep research tasks in enterprise settings. Unlike prior benchmarks that focus on simple questions or web-only queries, DRBench evaluates agents on multi-step queries (for example, "What changes should we make to our product roadmap to ensure compliance with this standard?") that require identifying supporting facts from both the public web and private company knowledge base. Each task is grounded in realistic user personas and enterprise context, spanning a heterogeneous search space that includes productivity software, cloud file systems, emails, chat conversations, and the open web. Tasks are generated through a carefully designed synthesis pipeline with human-in-the-loop verification, and agents are evaluated on their ability to recall relevant insights, maintain factual accuracy, and produce coherent, well-structured reports. We release 15 deep research tasks across 10 domains, such as Sales, Cybersecurity, and Compliance. We demonstrate the effectiveness of DRBench by evaluating diverse DR agents across open- and closed-source models (such as GPT, Llama, and Qwen) and DR strategies, highlighting their strengths, weaknesses, and the critical path for advancing enterprise deep research. Code is available at https://github.com/ServiceNow/drbench.

2025-09-30

ArXiv (preprint)

arxiv.org

WebArena Verified: Reliable Evaluation for Web Agents

Amine El hattami

Megh Thakkar

Nicolas Chapados

Chris Pal

Autonomous web agents increasingly operate in multi-step browser workflows, yet widely used benchmarks can misestimate performance due to underspecified goals and brittle checkers, challenges characteristic of normal benchmark maturation rather than flaws in the paradigm. We present WebArena Verified, a reproducible re-evaluation of WebArena that preserves its containerized environments while strengthening measurement. We audit all 812 tasks, repair misaligned evaluations and clarify ambiguous instructions; replace substring matching with type- and normalization-aware comparators; verify backend state for state-changing tasks; and adopt a structured JSON schema with explicit status codes for deterministic scoring. We provide improved results reporting with template-level macro averages, 95% confidence intervals, and failure-mode breakdowns. We also introduce WebArena Verified Hard, a 137-task subset that retains difficult cases while reducing evaluation cost by 83%. On the baseline agent we evaluated, it reduces false negatives by approximately 11%. WebArena Verified remains drop-in compatible with minimal change to existing agents, supporting faithful and comparable progress. We release our code, data, and evaluation tools in our public repository.

2025-09-28

NeurIPS.cc/2025/Workshop/SEA (poster)

openreview.net

AInstein: Can AI Rediscover Scientific Concepts from First Principles?

Shambhavi Mishra

Jose Dolz

Large language models have demonstrated remarkable capabilities across diverse tasks, yet a fundamental question remains: can these models genuinely rediscover complex scientific insights, or do they merely recite memorized information? We present AInstein, a novel framework for evaluating whether language models can derive established scientific concepts from first principles when stripped of domain-specific terminology. Rather than testing the recall of scientific facts, we reformulate landmark discoveries as conceptual puzzles, challenging models to reconstruct the underlying technical solutions independently.

2025-09-22

NeurIPS.cc/2025/Workshop/WiML (published)

openreview.net

WebMMU: A Benchmark for Multimodal Multilingual Website Understanding and Code Generation

Juan A. Rodriguez

Sai Rajeswar

ServiceNow

WebMMU Benchmark

We present WebMMU, a multilingual benchmark that evaluates three core web tasks: (1) website visual question answering, (2) code editing involving HTML/CSS/JavaScript, and (3) mockup-to-code generation. Unlike prior benchmarks that treat these tasks separately, WebMMU unifies them using expert-annotated, real-world web data to assess models' abilities in complex multi-step reasoning, precise element grounding, and functional UI comprehension and coding. Our evaluation shows that while multimodal large language models (MLLMs) perform well on basic information extraction, they struggle with reasoning and grounding, editing code to preserve functionality, and generating design-to-code that maintains hierarchy and supports multilingual content. These findings reveal key limitations in current MLLMs and underscore the need for improved multimodal and cross-lingual reasoning to build future web agents capable of automating diverse web development tasks.

2025-08-22

ArXiv (preprint)

doi.org

arxiv.org

Spatio-Temporal Conditional Diffusion Models for Forecasting Future Multiple Sclerosis Lesion Masks Conditioned on Treatments

Gian Mario Favero

Ge Ya Luo

Douglas Arnold

Image-based personalized medicine has the potential to transform healthcare, particularly for diseases that exhibit heterogeneous progression such as Multiple Sclerosis (MS). In this work, we introduce the first treatment-aware spatio-temporal diffusion model that is able to generate future masks demonstrating lesion evolution in MS. Our voxel-space approach incorporates multi-modal patient data, including MRI and treatment information, to forecast new and enlarging T2 (NET2) lesion masks at a future time point. Extensive experiments on a multi-centre dataset of 2131 patient 3D MRIs from randomized clinical trials for relapsing-remitting MS demonstrate that our generative model is able to accurately predict NET2 lesion masks for patients across six different treatments. Moreover, we demonstrate our model has the potential for real-world clinical applications through downstream tasks such as future lesion count and location estimation, binary lesion activity classification, and generating counterfactual future NET2 masks for several treatments with different efficacies. This work highlights the potential of causal, image-based generative models as powerful tools for advancing data-driven prognostics in MS.

2025-08-09

ArXiv (preprint)

doi.org

arxiv.org
