Quentin Cappart

DoomArena: A framework for Testing AI Agents Against Evolving Security Threats

Léo Boisvert

Abhay Puri

Gabriel Huang

Mihir Bansal

Chandra Kiran Reddy Evuru

Avinandan Bose

Maryam Fazel

Quentin Cappart

Alexandre Lacoste

Jason Stanley

Alexandre Drouin

Krishnamurthy Dj Dvijotham

We present DoomArena, a security evaluation framework for AI agents. DoomArena is designed on three principles: 1) It is a plug-in framework and integrates easily into realistic agentic frameworks like BrowserGym (for web agents) and …

2025-06-08

ICML.cc/2025/Workshop/WCUA (poster)

doi.org

openreview.net

Silent Sabotage: Injecting Backdoors into AI Agents Through Fine-Tuning

Léo Boisvert

Abhay Puri

Chandra Kiran Reddy Evuru

Joshua Kazdan

Avinandan Bose

Quentin Cappart

Maryam Fazel

Sai Rajeswar

Jason Stanley

Nicolas Chapados

Alexandre Drouin

Krishnamurthy Dj Dvijotham

The rise of AI agents that can use tools, browse the web, and interact with computers on behalf of a user has sparked strong interest in improving these capabilities by explicitly fine-tuning the LLMs/VLMs that power these agents. Several researchers have proposed collecting data by letting the agents interact with their environment (e.g., a computer operating system, the web, or a collection of APIs exposed as tools) and improving agent performance by fine-tuning on this data. In this work, we show that such data collection can be manipulated by adversaries to insert poisoned traces. By modifying just 5% of collected traces, adversaries can embed stealthy bad behaviors into agents, such as leaking confidential user information whenever the tool or webpage exposes a trigger. Our results raise important security concerns in the development of AI agents and underscore the importance of careful scrutiny of all data collection processes used to improve agentic AI.

2025-06-08

ICML.cc/2025/Workshop/WCUA (poster)

openreview.net

DoomArena: A framework for Testing AI Agents Against Evolving Security Threats

Léo Boisvert

Mihir Bansal

Chandra Kiran Reddy Evuru

Gabriel Huang

Abhay Puri

Avinandan Bose

Maryam Fazel

Quentin Cappart

Jason Stanley

Alexandre Lacoste

Alexandre Drouin

Krishnamurthy Dj Dvijotham

2025-04-18

ArXiv (preprint)

arxiv.org

The BrowserGym Ecosystem for Web Agent Research

Thibault Le Sellier De Chezelles

Maxime Gasse

Alexandre Lacoste

Massimo Caccia

Lawrence Keunho Jang

Ori Yoran

Dehan Kong

Frank F. Xu

Siva Reddy

Graham Neubig

Quentin Cappart

Russ Salakhutdinov

Nicolas Chapados

The BrowserGym ecosystem addresses the growing need for efficient evaluation and benchmarking of web agents, particularly those leveraging automation and Large Language Models (LLMs) for web interaction tasks. Many existing benchmarks suffer from fragmentation and inconsistent evaluation methodologies, making it challenging to achieve reliable comparisons and reproducible results. BrowserGym aims to solve this by providing a unified, gym-like environment with well-defined observation and action spaces, facilitating standardized evaluation across diverse benchmarks. Combined with AgentLab, a complementary framework that aids in agent creation, testing, and analysis, BrowserGym offers flexibility for integrating new benchmarks while ensuring consistent evaluation and comprehensive experiment management. This standardized approach seeks to reduce the time and complexity of developing web agents, supporting more reliable comparisons and facilitating in-depth analysis of agent behaviors, and could result in more adaptable, capable agents, ultimately accelerating innovation in LLM-driven automation. As supporting evidence, we conduct the first large-scale, multi-benchmark web agent experiment and compare the performance of 6 state-of-the-art LLMs across all benchmarks currently available in BrowserGym. Among other findings, our results highlight a large discrepancy between OpenAI and Anthropic's latest models, with Claude-3.5-Sonnet leading the way on almost all benchmarks, except on vision-related tasks where GPT-4o is superior. Despite these advancements, our results emphasize that building robust and efficient web agents remains a significant challenge, due to the inherent complexity of real-world web environments and the limitations of current models.

2025-01-01

Trans. Mach. Learn. Res. (published)

doi.org

openreview.net

The BrowserGym Ecosystem for Web Agent Research

Thibault Le Sellier De Chezelles

Alexandre Lacoste

Massimo Caccia

Lawrence Jang

Ori Yoran

Dehan Kong

Frank F. Xu

Siva Reddy

Quentin Cappart

Graham Neubig

Ruslan Salakhutdinov

Nicolas Chapados

The BrowserGym ecosystem addresses the growing need for efficient evaluation and benchmarking of web agents, particularly those leveraging automation and Large Language Models (LLMs) for web interaction tasks. Many existing benchmarks suffer from fragmentation and inconsistent evaluation methodologies, making it challenging to achieve reliable comparisons and reproducible results. BrowserGym aims to solve this by providing a unified, gym-like environment with well-defined observation and action spaces, facilitating standardized evaluation across diverse benchmarks. Combined with AgentLab, a complementary framework that aids in agent creation, testing, and analysis, BrowserGym offers flexibility for integrating new benchmarks while ensuring consistent evaluation and comprehensive experiment management. This standardized approach seeks to reduce the time and complexity of developing web agents, supporting more reliable comparisons and facilitating in-depth analysis of agent behaviors, and could result in more adaptable, capable agents, ultimately accelerating innovation in LLM-driven automation. As supporting evidence, we conduct the first large-scale, multi-benchmark web agent experiment and compare the performance of 6 state-of-the-art LLMs across all benchmarks currently available in BrowserGym. Among other findings, our results highlight a large discrepancy between OpenAI and Anthropic's latest models, with Claude-3.5-Sonnet leading the way on almost all benchmarks, except on vision-related tasks where GPT-4o is superior. Despite these advancements, our results emphasize that building robust and efficient web agents remains a significant challenge, due to the inherent complexity of real-world web environments and the limitations of current models.

2024-12-06

ArXiv (preprint)

doi.org

arxiv.org

WorkArena++: Towards Compositional Planning and Reasoning-based Common Knowledge Work Tasks

Léo Boisvert

Megh Thakkar

Maxime Gasse

Massimo Caccia

Thibault Le Sellier De Chezelles

Quentin Cappart

Nicolas Chapados

Alexandre Lacoste

Alexandre Drouin

The ability of large language models (LLMs) to mimic human-like intelligence has led to a surge in LLM-based autonomous agents. Though recent LLMs seem capable of planning and reasoning given user instructions, their effectiveness in applying these capabilities for autonomous task solving remains underexplored. This is especially true in enterprise settings, where automated agents hold the promise of a high impact. To fill this gap, we propose WorkArena++, a novel benchmark consisting of 682 tasks corresponding to realistic workflows routinely performed by knowledge workers. WorkArena++ is designed to evaluate the planning, problem-solving, logical/arithmetic reasoning, retrieval, and contextual understanding abilities of web agents. Our empirical studies across state-of-the-art LLMs and vision-language models (VLMs), as well as human workers, reveal several challenges for such models to serve as useful assistants in the workplace. In addition to the benchmark, we provide a mechanism to effortlessly generate thousands of ground-truth observation/action traces, which can be used for fine-tuning existing models. Overall, we expect this work to serve as a useful resource to help the community progress toward capable autonomous agents. The benchmark can be found at https://github.com/ServiceNow/WorkArena/tree/workarena-plus-plus.

2024-09-26

NeurIPS.cc/2024/Datasets_and_Benchmarks_Track (poster)

doi.org

openreview.net

Learning Valid Dual Bounds in Constraint Programming: Boosted Lagrangian Decomposition with Self-Supervised Learning

Swann Bessa

Darius Dabert

Max Bourgeat

Louis-Martin Rousseau

Quentin Cappart

2024-08-22

ArXiv (preprint)

doi.org

arxiv.org

WorkArena: How Capable are Web Agents at Solving Common Knowledge Work Tasks?

Alexandre Drouin

Maxime Gasse

Massimo Caccia

Issam Hadj Laradji

Manuel Del Verme

David Vazquez

Alexandre Lacoste

2024-07-08

Proceedings of the 41st International Conference on Machine Learning (published)

doi.org

openreview.net

Global rewards in multi-agent deep reinforcement learning for autonomous mobility on demand systems

Heiko Hoppe

Tobias Enders

Quentin Cappart

Maximilian Schiffer

We study vehicle dispatching in autonomous mobility on demand (AMoD) systems, where a central operator assigns vehicles to customer requests or rejects these with the aim of maximizing its total profit. Recent approaches use multi-agent deep reinforcement learning (MADRL) to realize scalable yet performant algorithms, but train agents based on local rewards, which distorts the reward signal with respect to the system-wide profit, leading to lower performance. We therefore propose a novel global-rewards-based MADRL algorithm for vehicle dispatching in AMoD systems, which resolves the existing goal conflicts between the trained agents and the operator by assigning rewards to agents using a counterfactual baseline. Our algorithm shows statistically significant improvements across various settings on real-world data compared to state-of-the-art MADRL algorithms with local rewards. We further provide a structural analysis which shows that the use of global rewards can improve implicit vehicle balancing and demand forecasting abilities. An extended version of our paper, including an appendix, can be found at https://arxiv.org/abs/2312.08884. Our code is available at https://github.com/tumBAIS/GR-MADRL-AMoD.

2024-06-11

Proceedings of the 6th Annual Learning for Dynamics & Control Conference (published)

doi.org

arxiv.org

Towards a Generic Representation of Combinatorial Problems for Learning-Based Approaches

Léo Boisvert

Hélène Verhaeghe

Quentin Cappart

2024-05-25

Integration of Constraint Programming, Artificial Intelligence, and Operations Research (published)

doi.org

arxiv.org
