Alexandre Drouin

InsightBench: Evaluating Business Analytics Agents Through Multi-Step Insight Generation

Gaurav Sahu

Abhay Puri

Juan A. Rodriguez

Amirhossein Abaskohi

Mohammad Chegini

Perouz Taslakian

Valentina Zantedeschi

Alexandre Lacoste

David Vazquez

Chris Pal

Sai Rajeswar

Issam Hadj Laradji

2025-01-22

ICLR.cc/2025/Conference (poster)

The BrowserGym Ecosystem for Web Agent Research

Thibault Le Sellier de Chezelles

Alexandre Lacoste

Massimo Caccia

Lawrence Keunho Jang

Ori Yoran

Dehan Kong

Frank F. Xu

Graham Neubig

Russ Salakhutdinov

The BrowserGym ecosystem addresses the growing need for efficient evaluation and benchmarking of web agents, particularly those leveraging a… (voir plus)utomation and Large Language Models (LLMs) for web interaction tasks. Many existing benchmarks suffer from fragmentation and inconsistent evaluation methodologies, making it challenging to achieve reliable comparisons and reproducible results. BrowserGym aims to solve this by providing a unified, gym-like environment with well-defined observation and action spaces, facilitating standardized evaluation across diverse benchmarks. Combined with AgentLab, a complementary framework that aids in agent creation, testing, and analysis, BrowserGym offers flexibility for integrating new benchmarks while ensuring consistent evaluation and comprehensive experiment management. This standardized approach seeks to reduce the time and complexity of developing web agents, supporting more reliable comparisons and facilitating in-depth analysis of agent behaviors, and could result in more adaptable, capable agents, ultimately accelerating innovation in LLM-driven automation. As a supporting evidence, we conduct the first large-scale, multi-benchmark web agent experiment and compare the performance of 6 state-of-the-art LLMs across all benchmarks currently available in BrowserGym. Among other findings, our results highlight a large discrepancy between OpenAI and Anthropic's latests models, with Claude-3.5-Sonnet leading the way on almost all benchmarks, except on vision-related tasks where GPT-4o is superior. Despite these advancements, our results emphasize that building robust and efficient web agents remains a significant challenge, due to the inherent complexity of real-world web environments and the limitations of current models.

2025-01-01

Trans. Mach. Learn. Res. (publié)

The BrowserGym Ecosystem for Web Agent Research

Thibault Le Sellier de Chezelles

Alexandre Lacoste

Massimo Caccia

Lawrence Jang

Ori Yoran

Dehan Kong

Frank F. Xu

Graham Neubig

Ruslan Salakhutdinov

The BrowserGym ecosystem addresses the growing need for efficient evaluation and benchmarking of web agents, particularly those leveraging a… (voir plus)utomation and Large Language Models (LLMs) for web interaction tasks. Many existing benchmarks suffer from fragmentation and inconsistent evaluation methodologies, making it challenging to achieve reliable comparisons and reproducible results. BrowserGym aims to solve this by providing a unified, gym-like environment with well-defined observation and action spaces, facilitating standardized evaluation across diverse benchmarks. Combined with AgentLab, a complementary framework that aids in agent creation, testing, and analysis, BrowserGym offers flexibility for integrating new benchmarks while ensuring consistent evaluation and comprehensive experiment management. This standardized approach seeks to reduce the time and complexity of developing web agents, supporting more reliable comparisons and facilitating in-depth analysis of agent behaviors, and could result in more adaptable, capable agents, ultimately accelerating innovation in LLM-driven automation. As a supporting evidence, we conduct the first large-scale, multi-benchmark web agent experiment and compare the performance of 6 state-of-the-art LLMs across all benchmarks currently available in BrowserGym. Among other findings, our results highlight a large discrepancy between OpenAI and Anthropic's latests models, with Claude-3.5-Sonnet leading the way on almost all benchmarks, except on vision-related tasks where GPT-4o is superior. Despite these advancements, our results emphasize that building robust and efficient web agents remains a significant challenge, due to the inherent complexity of real-world web environments and the limitations of current models.

2024-12-06

ArXiv (prépublication)

The BrowserGym Ecosystem for Web Agent Research

Thibault Le Sellier de Chezelles

Alexandre Lacoste

Massimo Caccia

Lawrence Jang

Ori Yoran

Dehan Kong

Frank F. Xu

Graham Neubig

Ruslan Salakhutdinov

The BrowserGym ecosystem addresses the growing need for efficient evaluation and benchmarking of web agents, particularly those leveraging a… (voir plus)utomation and Large Language Models (LLMs) for web interaction tasks. Many existing benchmarks suffer from fragmentation and inconsistent evaluation methodologies, making it challenging to achieve reliable comparisons and reproducible results. BrowserGym aims to solve this by providing a unified, gym-like environment with well-defined observation and action spaces, facilitating standardized evaluation across diverse benchmarks. Combined with AgentLab, a complementary framework that aids in agent creation, testing, and analysis, BrowserGym offers flexibility for integrating new benchmarks while ensuring consistent evaluation and comprehensive experiment management. This standardized approach seeks to reduce the time and complexity of developing web agents, supporting more reliable comparisons and facilitating in-depth analysis of agent behaviors, and could result in more adaptable, capable agents, ultimately accelerating innovation in LLM-driven automation. As a supporting evidence, we conduct the first large-scale, multi-benchmark web agent experiment and compare the performance of 6 state-of-the-art LLMs across all benchmarks currently available in BrowserGym. Among other findings, our results highlight a large discrepancy between OpenAI and Anthropic's latests models, with Claude-3.5-Sonnet leading the way on almost all benchmarks, except on vision-related tasks where GPT-4o is superior. Despite these advancements, our results emphasize that building robust and efficient web agents remains a significant challenge, due to the inherent complexity of real-world web environments and the limitations of current models.

2024-12-06

ArXiv (prépublication)

The BrowserGym Ecosystem for Web Agent Research

Thibault Le Sellier de Chezelles

Alexandre Lacoste

Massimo Caccia

Lawrence Jang

Ori Yoran

Dehan Kong

Frank F. Xu

Graham Neubig

Ruslan Salakhutdinov

The BrowserGym ecosystem addresses the growing need for efficient evaluation and benchmarking of web agents, particularly those leveraging a… (voir plus)utomation and Large Language Models (LLMs) for web interaction tasks. Many existing benchmarks suffer from fragmentation and inconsistent evaluation methodologies, making it challenging to achieve reliable comparisons and reproducible results. BrowserGym aims to solve this by providing a unified, gym-like environment with well-defined observation and action spaces, facilitating standardized evaluation across diverse benchmarks. Combined with AgentLab, a complementary framework that aids in agent creation, testing, and analysis, BrowserGym offers flexibility for integrating new benchmarks while ensuring consistent evaluation and comprehensive experiment management. This standardized approach seeks to reduce the time and complexity of developing web agents, supporting more reliable comparisons and facilitating in-depth analysis of agent behaviors, and could result in more adaptable, capable agents, ultimately accelerating innovation in LLM-driven automation. As a supporting evidence, we conduct the first large-scale, multi-benchmark web agent experiment and compare the performance of 6 state-of-the-art LLMs across all benchmarks currently available in BrowserGym. Among other findings, our results highlight a large discrepancy between OpenAI and Anthropic's latests models, with Claude-3.5-Sonnet leading the way on almost all benchmarks, except on vision-related tasks where GPT-4o is superior. Despite these advancements, our results emphasize that building robust and efficient web agents remains a significant challenge, due to the inherent complexity of real-world web environments and the limitations of current models.

2024-12-06

ArXiv (prépublication)

The Landscape of Causal Discovery Data: Grounding Causal Discovery in Real-World Applications

Philippe Brouillard

Chandler Squires

Jonas Wahl

Konrad P. Kording

Karen Sachs

Dhanya Sridhar

Causal discovery aims to automatically uncover causal relationships from data, a capability with significant potential across many scientifi… (voir plus)c disciplines. However, its real-world applications remain limited. Current methods often rely on unrealistic assumptions and are evaluated only on simple synthetic toy datasets, often with inadequate evaluation metrics. In this paper, we substantiate these claims by performing a systematic review of the recent causal discovery literature. We present applications in biology, neuroscience, and Earth sciences - fields where causal discovery holds promise for addressing key challenges. We highlight available simulated and real-world datasets from these domains and discuss common assumption violations that have spurred the development of new methods. Our goal is to encourage the community to adopt better evaluation practices by utilizing realistic datasets and more adequate metrics.

2024-12-02

ArXiv (prépublication)

The Landscape of Causal Discovery Data: Grounding Causal Discovery in Real-World Applications

Philippe Brouillard

Chandler Squires

Jonas Wahl

Konrad P. Kording

Karen Sachs

Dhanya Sridhar

Causal discovery aims to automatically uncover causal relationships from data, a capability with significant potential across many scientifi… (voir plus)c disciplines. However, its real-world applications remain limited. Current methods often rely on unrealistic assumptions and are evaluated only on simple synthetic toy datasets, often with inadequate evaluation metrics. In this paper, we substantiate these claims by performing a systematic review of the recent causal discovery literature. We present applications in biology, neuroscience, and Earth sciences - fields where causal discovery holds promise for addressing key challenges. We highlight available simulated and real-world datasets from these domains and discuss common assumption violations that have spurred the development of new methods. Our goal is to encourage the community to adopt better evaluation practices by utilizing realistic datasets and more adequate metrics.

2024-12-02

ArXiv (prépublication)

Context is Key: A Benchmark for Forecasting with Essential Textual Information

Andrew Robert Williams

Arjun Ashok

Étienne Marcotte

Valentina Zantedeschi

Jithendaraa Subramanian

Roland Riachi

Alexandre Lacoste

Forecasting is a critical task in decision making across various domains. While numerical data provides a foundation, it often lacks crucial… (voir plus) context necessary for accurate predictions. Human forecasters frequently rely on additional information, such as background knowledge or constraints, which can be efficiently communicated through natural language. However, the ability of existing forecasting models to effectively integrate this textual information remains an open question. To address this, we introduce"Context is Key"(CiK), a time series forecasting benchmark that pairs numerical data with diverse types of carefully crafted textual context, requiring models to integrate both modalities. We evaluate a range of approaches, including statistical models, time series foundation models, and LLM-based forecasters, and propose a simple yet effective LLM prompting method that outperforms all other tested methods on our benchmark. Our experiments highlight the importance of incorporating contextual information, demonstrate surprising performance when using LLM-based forecasting models, and also reveal some of their critical shortcomings. By presenting this benchmark, we aim to advance multimodal forecasting, promoting models that are both accurate and accessible to decision-makers with varied technical expertise. The benchmark can be visualized at https://servicenow.github.io/context-is-key-forecasting/v0/ .

2024-10-24

ArXiv (prépublication)

Context is Key: A Benchmark for Forecasting with Essential Textual Information

Andrew Robert Williams

Arjun Ashok

Étienne Marcotte

Valentina Zantedeschi

Jithendaraa Subramanian

Roland Riachi

Alexandre Lacoste

Forecasting is a critical task in decision making across various domains. While numerical data provides a foundation, it often lacks crucial… (voir plus) context necessary for accurate predictions. Human forecasters frequently rely on additional information, such as background knowledge or constraints, which can be efficiently communicated through natural language. However, the ability of existing forecasting models to effectively integrate this textual information remains an open question. To address this, we introduce"Context is Key"(CiK), a time series forecasting benchmark that pairs numerical data with diverse types of carefully crafted textual context, requiring models to integrate both modalities. We evaluate a range of approaches, including statistical models, time series foundation models, and LLM-based forecasters, and propose a simple yet effective LLM prompting method that outperforms all other tested methods on our benchmark. Our experiments highlight the importance of incorporating contextual information, demonstrate surprising performance when using LLM-based forecasting models, and also reveal some of their critical shortcomings. By presenting this benchmark, we aim to advance multimodal forecasting, promoting models that are both accurate and accessible to decision-makers with varied technical expertise. The benchmark can be visualized at https://servicenow.github.io/context-is-key-forecasting/v0/ .

2024-10-24

ArXiv (prépublication)

Fine-Tuning Web Agents: It Works, But It's Trickier Than You Think

Massimo Caccia

Megh Thakkar

Léo Boisvert

Thibault Le Sellier de Chezelles

Alexandre Piché

Maxime Gasse

Alexandre Lacoste

Recent advancements in large language models (LLMs) have sparked interest in developing autonomous web agents capable of performing digital … (voir plus)tasks through web interfaces in a human-like manner. However, even the strongest closed-source models often struggle to achieve robust results on several benchmarks, while a notable performance gap exists between them and open-source counterparts. This study investigates the potential of fine-tuning to enhance the performance of a smaller, lower-performing but cost-efficient LLM by leveraging successful traces from stronger LLMs, referred to as experts. We outline a comprehensive pipeline for data collection, filtering, and supervised fine-tuning and explore various behavior cloning parameters. Our experiments provide key insights into the challenges of fine-tuning LLMs into web agents on benchmarks like MiniWoB and WorkArena. Notably, we find that the fine-tuned agents' ability to predict expert trajectories does not consistently lead to improved downstream task performance. This raises issues such as off-policy bias and the loss of reasoning abilities during fine-tuning. We discuss potential solutions to these challenges and make both the codebase and a dataset of 140M tokens open-source for the community to build upon.

2024-10-22

NeurIPS.cc/2024/Workshop/OWA (poster)

Fine-Tuning Web Agents: It Works, But It's Trickier Than You Think

Massimo Caccia

Megh Thakkar

Léo Boisvert

Thibault Le Sellier de Chezelles

Alexandre Piché

Maxime Gasse

Alexandre Lacoste

Recent advancements in large language models (LLMs) have sparked interest in developing autonomous web agents capable of performing digital … (voir plus)tasks through web interfaces in a human-like manner. However, even the strongest closed-source models often struggle to achieve robust results on several benchmarks, while a notable performance gap exists between them and open-source counterparts. This study investigates the potential of fine-tuning to enhance the performance of a smaller, lower-performing but cost-efficient LLM by leveraging successful traces from stronger LLMs, referred to as experts. We outline a comprehensive pipeline for data collection, filtering, and supervised fine-tuning and explore various behavior cloning parameters. Our experiments provide key insights into the challenges of fine-tuning LLMs into web agents on benchmarks like MiniWoB and WorkArena. Notably, we find that the fine-tuned agents' ability to predict expert trajectories does not consistently lead to improved downstream task performance. This raises issues such as off-policy bias and the loss of reasoning abilities during fine-tuning. We discuss potential solutions to these challenges and make both the codebase and a dataset of 140M tokens open-source for the community to build upon.

2024-10-22

NeurIPS.cc/2024/Workshop/OWA (poster)