Maxime Gasse

Associate Industry Member
Adjunct Professor, Polytechnique Montréal, Department of Computer Engineering and Software Engineering
Senior Research Scientist, ServiceNow
Research Topics
Causality
LLM Agent
Probabilistic Models
Reinforcement Learning

Biography

I am a senior research scientist at ServiceNow in Montréal, where I do research at the intersection of causal inference and reinforcement learning. I am an adjunct professor at Polytechnique Montréal (courtesy appointment) and an associate industry member of Mila – Quebec Artificial Intelligence Institute.

I am fascinated by the question of AI: can we build machines that think? I humbly believe that our attempts at designing thinking machines can be a path towards a fundamental understanding of intelligence and of ourselves. Currently, I am interested in whether and how ideas from the field of causality can help in the design of autonomous learning agents.

Current Students

Master's Research - Polytechnique Montréal
Co-supervisor :

Publications

The BrowserGym Ecosystem for Web Agent Research
Thibault Le Sellier de Chezelles
Alexandre Lacoste
Massimo Caccia
Léo Boisvert
Megh Thakkar
Tom Marty
Rim Assouel
Sahar Omidi Shayegan
Lawrence Keunho Jang
Xing Han Lu
Ori Yoran
Dehan Kong
Frank F. Xu
Graham Neubig
Russ Salakhutdinov
The BrowserGym ecosystem addresses the growing need for efficient evaluation and benchmarking of web agents, particularly those leveraging automation and Large Language Models (LLMs) for web interaction tasks. Many existing benchmarks suffer from fragmentation and inconsistent evaluation methodologies, making it challenging to achieve reliable comparisons and reproducible results. BrowserGym aims to solve this by providing a unified, gym-like environment with well-defined observation and action spaces, facilitating standardized evaluation across diverse benchmarks. Combined with AgentLab, a complementary framework that aids in agent creation, testing, and analysis, BrowserGym offers flexibility for integrating new benchmarks while ensuring consistent evaluation and comprehensive experiment management. This standardized approach seeks to reduce the time and complexity of developing web agents, supporting more reliable comparisons and facilitating in-depth analysis of agent behaviors, and could result in more adaptable, capable agents, ultimately accelerating innovation in LLM-driven automation. As supporting evidence, we conduct the first large-scale, multi-benchmark web agent experiment and compare the performance of 6 state-of-the-art LLMs across all benchmarks currently available in BrowserGym. Among other findings, our results highlight a large discrepancy between OpenAI and Anthropic's latest models, with Claude-3.5-Sonnet leading the way on almost all benchmarks, except on vision-related tasks where GPT-4o is superior. Despite these advancements, our results emphasize that building robust and efficient web agents remains a significant challenge, due to the inherent complexity of real-world web environments and the limitations of current models.
Too Big to Fool: Resisting Deception in Language Models
Mohammad Reza Samsami
M. L. Richter
Juan Rodriguez
Megh Thakkar
Large language models must balance their weight-encoded knowledge with in-context information from prompts to generate accurate responses. This paper investigates this interplay by analyzing how models of varying capacities within the same family handle intentionally misleading in-context information. Our experiments demonstrate that larger models exhibit higher resilience to deceptive prompts, showcasing an advanced ability to interpret and integrate prompt information with their internal knowledge. Furthermore, we find that larger models outperform smaller ones in following legitimate instructions, indicating that their resilience is not due to disregarding in-context information. We also show that this phenomenon is likely not a result of memorization but stems from the models' ability to better leverage implicit task-relevant information from the prompt alongside their internally stored knowledge.
Fine-Tuning Web Agents: It Works, But It's Trickier Than You Think
Massimo Caccia
Megh Thakkar
Léo Boisvert
Thibault Le Sellier de Chezelles
Alexandre Piché
Alexandre Lacoste
Recent advancements in large language models (LLMs) have sparked interest in developing autonomous web agents capable of performing digital tasks through web interfaces in a human-like manner. However, even the strongest closed-source models often struggle to achieve robust results on several benchmarks, while a notable performance gap exists between them and open-source counterparts. This study investigates the potential of fine-tuning to enhance the performance of a smaller, lower-performing but cost-efficient LLM by leveraging successful traces from stronger LLMs, referred to as experts. We outline a comprehensive pipeline for data collection, filtering, and supervised fine-tuning and explore various behavior cloning parameters. Our experiments provide key insights into the challenges of fine-tuning LLMs into web agents on benchmarks like MiniWoB and WorkArena. Notably, we find that the fine-tuned agents' ability to predict expert trajectories does not consistently lead to improved downstream task performance. This raises issues such as off-policy bias and the loss of reasoning abilities during fine-tuning. We discuss potential solutions to these challenges and make both the codebase and a dataset of 140M tokens open-source for the community to build upon.
AgentMerge: Enhancing Generalization in Fine-Tuned LLM Agents
Megh Thakkar
Léo Boisvert
Thibault Le Sellier de Chezelles
Alexandre Piché
Alexandre Lacoste
Massimo Caccia
WorkArena++: Towards Compositional Planning and Reasoning-based Common Knowledge Work Tasks
Léo Boisvert
Megh Thakkar
Massimo Caccia
Thibault Le Sellier de Chezelles
Alexandre Lacoste
The ability of large language models (LLMs) to mimic human-like intelligence has led to a surge in LLM-based autonomous agents. Though recent LLMs seem capable of planning and reasoning given user instructions, their effectiveness in applying these capabilities for autonomous task solving remains underexplored. This is especially true in enterprise settings, where automated agents hold the promise of a high impact. To fill this gap, we propose WorkArena++, a novel benchmark consisting of 682 tasks corresponding to realistic workflows routinely performed by knowledge workers. WorkArena++ is designed to evaluate the planning, problem-solving, logical/arithmetic reasoning, retrieval, and contextual understanding abilities of web agents. Our empirical studies across state-of-the-art LLMs and vision-language models (VLMs), as well as human workers, reveal several challenges for such models to serve as useful assistants in the workplace. In addition to the benchmark, we provide a mechanism to effortlessly generate thousands of ground-truth observation/action traces, which can be used for fine-tuning existing models. Overall, we expect this work to serve as a useful resource to help the community progress toward capable autonomous agents. The benchmark can be found at https://github.com/ServiceNow/WorkArena/tree/workarena-plus-plus.