Portrait of Alexandre Drouin

Alexandre Drouin

Associate Industry Member
Adjunct professor, Université Laval, Department of Electrical Engineering and Computer Engineering
Research Scientist, ServiceNow
Research Topics
Causality
Computational Biology
Deep Learning
LLM Agent
Time Series Forecasting

Biography

Alexandre Drouin is a research scientist at ServiceNow Research in Montréal, and an adjunct professor of computer science at Université Laval. He also leads ServiceNow’s Human Decision Support research program, which explores the use of machine learning for decision-making in complex dynamic environments.

Droiun’s main research interest is causal decision-making under uncertainty, where the goal is to answer questions of causal nature (interventions, counterfactual), while accounting for sources of uncertainty, such as ambiguity in causal structures and unmeasured variables. He is also interested in probabilistic time series forecasting and its use in foreseeing the long-term effect of actions. His PhD in computer science from Université Laval was on machine learning algorithms for biomarker discovery in large genomic datasets and their application to the global problem of antibiotic resistance.

Current Students

PhD - Université de Montréal
Principal supervisor :
PhD - Polytechnique Montréal
Co-supervisor :
PhD - Université de Montréal
Principal supervisor :

Publications

Causal Differentiating Concepts: Interpreting LM Behavior via Causal Representation Learning
How to Train Your LLM Web Agent: A Statistical Diagnosis
Dheeraj Vattikonda
Nicolas Gontier
Miguel Muñoz-Mármol
Stefania Raimondo
Xue Liu
Alexandre Lacoste
Massimo Caccia
LLM-based web agents have recently made significant progress, but much of it has occurred in closed-source systems, widening the gap with op… (see more)en-source alternatives. Progress has been held back by two key challenges: first, a narrow focus on single-step tasks that overlooks the complexity of multi-step web interactions; and second, the high compute costs required to post-train LLM-based web agents. To address this, we present the first statistically grounded study on compute allocation for LLM web-agent post-training. Our approach uses a two-stage pipeline, training a Llama 3.1 8B student to imitate a Llama 3.3 70B teacher via supervised fine-tuning (SFT), followed by on-policy reinforcement learning. We find this process highly sensitive to hyperparameter choices, making exhaustive sweeps impractical. To spare others from expensive trial-and-error, we sample 1,370 configurations and use bootstrapping to estimate effective hyperparameters. Our results show that combining SFT with on-policy RL consistently outperforms either approach alone on both WorkArena and MiniWob++. Further, this strategy requires only 55% of the compute to match the peak performance of pure SFT on MiniWob++, effectively pushing the compute-performance Pareto frontier, and is the only strategy that can close the gap with closed-source models.
DoomArena: A framework for Testing AI Agents Against Evolving Security Threats
Mihir Bansal
Chandra Kiran Reddy Evuru
Avinandan Bose
Maryam Fazel
Jason Stanley
Alexandre Lacoste
Krishnamurthy Dj Dvijotham
We present DoomArena, a security evaluation framework for AI agents. DoomArena is designed on three principles: 1) It is a plug-in framework… (see more) and integrates easily into realistic agentic frameworks like BrowserGym (for web agents) and
On Selecting Robust Approaches for Learning Predictive Biomarkers in Metabolomics Data Sets.
Pier-Luc Plante
J. Corbeil
Pascal Germain
Metabolomics, the study of small molecules within biological systems, offers insights into metabolic processes and, consequently, holds grea… (see more)t promise for advancing health outcomes. Biomarker discovery in metabolomics represents a significant challenge, notably due to the high dimensionality of the data. Recent work has addressed this problem by analyzing the most important variables in machine learning models. Unfortunately, this approach relies on prior hypotheses about the structure of the data and may overlook simple patterns. To assess the true usefulness of machine learning methods, we evaluate them on a collection of 835 metabolomics data sets. This effort provides valuable insights for metabolomics researchers regarding where and when to use machine learning. It also establishes a benchmark for the evaluation of future methods. Nonetheless, the results emphasize the high diversity of data sets in metabolomics and the complexity of finding biologically relevant biomarkers. As a result, we propose a novel approach applicable across all data sets, offering guidance for future analyses. This method involves directly comparing univariate and multivariate models. We demonstrate through selected examples how this approach can guide data analysis across diverse data set structures, representative of the observed variability. Code and data are available for research purposes.
Silent Sabotage: Injecting Backdoors into AI Agents Through Fine-Tuning
Chandra Kiran Reddy Evuru
Joshua Kazdan
Avinandan Bose
Maryam Fazel
Sai Rajeswar
Jason Stanley
Krishnamurthy Dj Dvijotham
The rise of AI agents that can use tools, browse the web and interact with computers on behalf of a user, has sparked strong interest in imp… (see more)roving these capabilities by explicitly fine-tuning the LLMs/VLMs that power these agents. Several researchers have proposed collecting data by letting the agents interact with their environment (e.g., a computer operating system, the web or a collection of APIs exposed as tools), and improve agent performance by fine tuning on this data. In this work, we show that such data collection can be manipulated by adversaries to insert poisoned traces. By modifying just 5% of collected traces, adversaries can embed stealthy bad behaviors into agents—like leaking confidential user information whenever the tool or webpage exposes a trigger. Our results raise important security concerns in the development of AI agents, and underscore the importance of careful scrutiny of all data collection processes used to improve agentic AI.
Context is Key: A Benchmark for Forecasting with Essential Textual Information
Andrew R. Williams
Étienne Marcotte
Valentina Zantedeschi
Alexandre Lacoste
Forecasting is a critical task in decision-making across numerous domains. While historical numerical data provide a start, they fail to con… (see more)vey the complete context for reliable and accurate predictions. Human forecasters frequently rely on additional information, such as background knowledge and constraints, which can efficiently be communicated through natural language. However, in spite of recent progress with LLM-based forecasters, their ability to effectively integrate this textual information remains an open question. To address this, we introduce "Context is Key" (CiK), a time-series forecasting benchmark that pairs numerical data with diverse types of carefully crafted textual context, requiring models to integrate both modalities; crucially, every task in CiK requires understanding textual context to be solved successfully. We evaluate a range of approaches, including statistical models, time series foundation models, and LLM-based forecasters, and propose a simple yet effective LLM prompting method that outperforms all other tested methods on our benchmark. Our experiments highlight the importance of incorporating contextual information, demonstrate surprising performance when using LLM-based forecasting models, and also reveal some of their critical shortcomings. This benchmark aims to advance multimodal forecasting by promoting models that are both accurate and accessible to decision-makers with varied technical expertise. The benchmark can be visualized at https://servicenow.github.io/context-is-key-forecasting/v0/.
Learning to Defer for Causal Discovery with Imperfect Experts
Integrating expert knowledge, e.g. from large language models, into causal discovery algorithms can be challenging when the knowledge is not… (see more) guaranteed to be correct. Expert recommendations may contradict data-driven results, and their reliability can vary significantly depending on the domain or specific query. Existing methods based on soft constraints or inconsistencies in predicted causal relationships fail to account for these variations in expertise. To remedy this, we propose L2D-CD, a method for gauging the correctness of expert recommendations and optimally combining them with data-driven causal discovery results. By adapting learning-to-defer (L2D) algorithms for pairwise causal discovery (CD), we learn a deferral function that selects whether to rely on classical causal discovery methods using numerical data or expert recommendations based on textual meta-data. We evaluate L2D-CD on the canonical Tübingen pairs dataset and demonstrate its superior performance compared to both the causal discovery method and the expert used in isolation. Moreover, our approach identifies domains where the expert's performance is strong or weak. Finally, we outline a strategy for generalizing this approach to causal discovery on graphs with more than two variables, paving the way for further research in this area.
The Landscape of Causal Discovery Data: Grounding Causal Discovery in Real-World Applications
Chandler Squires
Jonas Wahl
Konrad Paul Kording
Karen Sachs
Causal discovery aims to automatically uncover causal relationships from data, a capability with significant potential across many scientifi… (see more)c disciplines. However, its real-world applications remain limited. Current methods often rely on unrealistic assumptions and are evaluated only on simple synthetic toy datasets, often with inadequate evaluation metrics. In this paper, we substantiate these claims by performing a systematic review of the recent causal discovery literature. We present applications in biology, neuroscience, and Earth sciences - fields where causal discovery holds promise for addressing key challenges. We highlight available simulated and real-world datasets from these domains and discuss common assumption violations that have spurred the development of new methods. Our goal is to encourage the community to adopt better evaluation practices by utilizing realistic datasets and more adequate metrics.
InsightBench: Evaluating Business Analytics Agents Through Multi-Step Insight Generation
Amirhossein Abaskohi
Mohammad Chegini
Valentina Zantedeschi
Alexandre Lacoste
Christopher Pal
Issam Hadj Laradji
Data analytics is essential for extracting valuable insights from data that can assist organizations in making effective decisions. We intro… (see more)duce InsightBench, a benchmark dataset with three key features. First, it consists of 100 datasets representing diverse business use cases such as finance and incident management, each accompanied by a carefully curated set of insights planted in the datasets. Second, unlike existing benchmarks focusing on answering single queries, InsightBench evaluates agents based on their ability to perform end-to-end data analytics, including formulating questions, interpreting answers, and generating a summary of insights and actionable steps. Third, we conducted comprehensive quality assurance to ensure that each dataset in the benchmark had clear goals and included relevant and meaningful questions and analysis. Furthermore, we implement a two-way evaluation mechanism using LLaMA-3 as an effective, open-source evaluator to assess agents' ability to extract insights. We also propose AgentPoirot, our baseline data analysis agent capable of performing end-to-end data analytics. Our evaluation on InsightBench shows that AgentPoirot outperforms existing approaches (such as Pandas Agent) that focus on resolving single queries. We also compare the performance of open- and closed-source LLMs and various evaluation strategies. Overall, this benchmark serves as a testbed to motivate further development in comprehensive automated data analytics and can be accessed here: https://github.com/ServiceNow/insight-bench.
The BrowserGym Ecosystem for Web Agent Research
Maxime Gasse
Alexandre Lacoste
Massimo Caccia
Lawrence Keunho Jang
Ori Yoran
Dehan Kong
Frank F. Xu
Graham Neubig
Ruslan Salakhutdinov
The BrowserGym ecosystem addresses the growing need for efficient evaluation and benchmarking of web agents, particularly those leveraging a… (see more)utomation and Large Language Models (LLMs). Many existing benchmarks suffer from fragmentation and inconsistent evaluation methodologies, making it challenging to achieve reliable comparisons and reproducible results. In an earlier work, Drouin et al. (2024) introduced BrowserGym which aims to solve this by providing a unified, gym-like environment with well-defined observation and action spaces, facilitating standardized evaluation across diverse benchmarks. We propose an extended BrowserGym-based ecosystem for web agent research, which unifies existing benchmarks from the literature and includes AgentLab, a complementary framework that aids in agent creation, testing, and analysis. Our proposed ecosystem offers flexibility for integrating new benchmarks while ensuring consistent evaluation and comprehensive experiment management. As a supporting evidence, we conduct the first large-scale, multi-benchmark web agent experiment and compare the performance of 6 state-of-the-art LLMs across 6 popular web agent benchmarks made available in BrowserGym. Among other findings, our results highlight a large discrepancy between OpenAI and Anthropic's latests models, with Claude-3.5-Sonnet leading the way on almost all benchmarks, except on vision-related tasks where GPT-4o is superior. Despite these advancements, our results emphasize that building robust and efficient web agents remains a significant challenge, due to the inherent complexity of real-world web environments and the limitations of current models.
Fine-Tuning Web Agents: It Works, But It's Trickier Than You Think
Recent advancements in large language models (LLMs) have sparked interest in developing autonomous web agents capable of performing digital … (see more)tasks through web interfaces in a human-like manner. However, even the strongest closed-source models often struggle to achieve robust results on several benchmarks, while a notable performance gap exists between them and open-source counterparts. This study investigates the potential of fine-tuning to enhance the performance of a smaller, lower-performing but cost-efficient LLM by leveraging successful traces from stronger LLMs, referred to as experts. We outline a comprehensive pipeline for data collection, filtering, and supervised fine-tuning and explore various behavior cloning parameters. Our experiments provide key insights into the challenges of fine-tuning LLMs into web agents on benchmarks like MiniWoB and WorkArena. Notably, we find that the fine-tuned agents' ability to predict expert trajectories does not consistently lead to improved downstream task performance. This raises issues such as off-policy bias and the loss of reasoning abilities during fine-tuning. We discuss potential solutions to these challenges and make both the codebase and a dataset of 140M tokens open-source for the community to build upon.
Evaluating Interventional Reasoning Capabilities of Large Language Models
Numerous decision-making tasks require estimating causal effects under interventions on different parts of a system. As practitioners consid… (see more)er using large language models (LLMs) to automate decisions, studying their causal reasoning capabilities becomes crucial. A recent line of work evaluates LLMs ability to retrieve commonsense causal facts, but these evaluations do not sufficiently assess how LLMs reason about interventions. Motivated by the role that interventions play in causal inference, in this paper, we conduct empirical analyses to evaluate whether LLMs can accurately update their knowledge of a data-generating process in response to an intervention. We create benchmarks that span diverse causal graphs (e.g., confounding, mediation) and variable types, and enable a study of intervention-based reasoning. These benchmarks allow us to isolate the ability of LLMs to accurately predict changes resulting from their ability to memorize facts or find other shortcuts. We evaluate six LLMs on the benchmarks, finding that GPT models show promising accuracy at predicting the intervention effects.