Alexandre Drouin

Associate Industry Member
Assistant Professor, Université Laval, Department of Electrical and Computer Engineering
Research Scientist, ServiceNow
Research Topics
LLM-Based Agents
Deep Learning
Computational Biology
Causality
Time Series Forecasting

Biography

Alexandre Drouin is an artificial intelligence researcher at ServiceNow Research in Montréal and an adjunct professor in the Department of Computer Science and Software Engineering at Université Laval. He leads a research team that explores the use of machine learning for decision-making in complex dynamic environments. His main research interest is causal decision-making, which aims to answer interventional and counterfactual questions while accounting for potential sources of uncertainty, such as ambiguity in the causal relationships underlying a system and the effect of latent variables. He is also interested in probabilistic forecasting models for time series and their use in predicting the long-term effects of actions.

He holds a PhD in computer science from Université Laval, awarded for his work on developing machine learning algorithms for biomarker discovery in genomics and applying them to the problem of antibiotic resistance.

Current Students

PhD - UdeM
Principal supervisor:
PhD - Polytechnique
Co-supervisor:
PhD - UdeM
Principal supervisor:

Publications

DRBench: A Realistic Benchmark for Enterprise Deep Research
Amirhossein Abaskohi
Tianyi Chen
Miguel Muñoz-Mármol
Curtis Fox
Amrutha Varshini Ramesh
Étienne Marcotte
Issam Hadj Laradji
We introduce DRBench, a benchmark for evaluating AI agents on complex, open-ended deep research tasks in enterprise settings. Unlike prior benchmarks that focus on simple questions or web-only queries, DRBench evaluates agents on multi-step queries (for example, "What changes should we make to our product roadmap to ensure compliance with this standard?") that require identifying supporting facts from both the public web and private company knowledge base. Each task is grounded in realistic user personas and enterprise context, spanning a heterogeneous search space that includes productivity software, cloud file systems, emails, chat conversations, and the open web. Tasks are generated through a carefully designed synthesis pipeline with human-in-the-loop verification, and agents are evaluated on their ability to recall relevant insights, maintain factual accuracy, and produce coherent, well-structured reports. We release 15 deep research tasks across 10 domains, such as Sales, Cybersecurity, and Compliance. We demonstrate the effectiveness of DRBench by evaluating diverse DR agents across open- and closed-source models (such as GPT, Llama, and Qwen) and DR strategies, highlighting their strengths, weaknesses, and the critical path for advancing enterprise deep research. Code is available at https://github.com/ServiceNow/drbench.
Beyond Naïve Prompting: Strategies for Improved Zero-shot Context-aided Forecasting with LLMs
Andrew Robert Williams
Vincent Zhihao Zheng
Étienne Marcotte
Valentina Zantedeschi
Forecasting in real-world settings requires models to integrate not only historical data but also relevant contextual information, often available in textual form. While recent work has shown that large language models (LLMs) can be effective context-aided forecasters via naïve direct prompting, their full potential remains underexplored. We address this gap with 4 strategies, providing new insights into the zero-shot capabilities of LLMs in this setting. ReDP improves interpretability by eliciting explicit reasoning traces, allowing us to assess the model's reasoning over the context independently from its forecast accuracy. CorDP leverages LLMs solely to refine existing forecasts with context, enhancing their applicability in real-world forecasting pipelines. IC-DP proposes embedding historical examples of context-aided forecasting tasks in the prompt, substantially improving accuracy even for the largest models. Finally, RouteDP optimizes resource efficiency by using LLMs to estimate task difficulty, and routing the most challenging tasks to larger models. Evaluated on different kinds of context-aided forecasting tasks from the CiK benchmark, our strategies demonstrate distinct benefits over naïve prompting across LLMs of different sizes and families. These results open the door to further simple yet effective improvements in LLM-based context-aided forecasting.
DoomArena: A framework for Testing AI Agents Against Evolving Security Threats
Abhay Puri
Gabriel Huang
Mihir Bansal
Chandra Kiran Reddy Evuru
Avinandan Bose
Maryam Fazel
Alexandre Lacoste
Jason Stanley
Krishnamurthy Dj Dvijotham
We present DoomArena, a security evaluation framework for AI agents. DoomArena is designed on three principles: 1) It is a plug-in framework and integrates easily into realistic agentic frameworks like BrowserGym (for web agents) and
On Selecting Robust Approaches for Learning Predictive Biomarkers in Metabolomics Data Sets.
Thibaud Godon
Pier-Luc Plante
Metabolomics, the study of small molecules within biological systems, offers insights into metabolic processes and, consequently, holds great promise for advancing health outcomes. Biomarker discovery in metabolomics represents a significant challenge, notably due to the high dimensionality of the data. Recent work has addressed this problem by analyzing the most important variables in machine learning models. Unfortunately, this approach relies on prior hypotheses about the structure of the data and may overlook simple patterns. To assess the true usefulness of machine learning methods, we evaluate them on a collection of 835 metabolomics data sets. This effort provides valuable insights for metabolomics researchers regarding where and when to use machine learning. It also establishes a benchmark for the evaluation of future methods. Nonetheless, the results emphasize the high diversity of data sets in metabolomics and the complexity of finding biologically relevant biomarkers. As a result, we propose a novel approach applicable across all data sets, offering guidance for future analyses. This method involves directly comparing univariate and multivariate models. We demonstrate through selected examples how this approach can guide data analysis across diverse data set structures, representative of the observed variability. Code and data are available for research purposes.
How to Train Your LLM Web Agent: A Statistical Diagnosis
Santhoshi Ravichandran
Hadi Nekoei
Thibault Le Sellier de Chezelles
Nicolas Gontier
Miguel Muñoz-Mármol
Stefania Raimondo
Alexandre Piché
Alexandre Lacoste
Massimo Caccia
LLM-based web agents have recently made significant progress, but much of it has occurred in closed-source systems, widening the gap with open-source alternatives. Progress has been held back by two key challenges: first, a narrow focus on single-step tasks that overlooks the complexity of multi-step web interactions; and second, the high compute costs required to post-train LLM-based web agents. To address this, we present the first statistically grounded study on compute allocation for LLM web-agent post-training. Our approach uses a two-stage pipeline, training a Llama 3.1 8B student to imitate a Llama 3.3 70B teacher via supervised fine-tuning (SFT), followed by on-policy reinforcement learning. We find this process highly sensitive to hyperparameter choices, making exhaustive sweeps impractical. To spare others from expensive trial-and-error, we sample 1,370 configurations and use bootstrapping to estimate effective hyperparameters. Our results show that combining SFT with on-policy RL consistently outperforms either approach alone on both WorkArena and MiniWoB++. Further, this strategy requires only 55% of the compute to match the peak performance of pure SFT on MiniWoB++, effectively pushing the compute-performance Pareto frontier, and is the only strategy that can close the gap with closed-source models.
How to Train Your LLM Web Agent: A Statistical Diagnosis
Santhoshi Ravichandran
Hadi Nekoei
Thibault Le Sellier de Chezelles
Nicolas Gontier
Miguel Muñoz-Mármol
Stefania Raimondo
Alexandre Piché
Alexandre Lacoste
Massimo Caccia
Large language model (LLM) agents for web interfaces have advanced rapidly, yet open-source systems still lag behind proprietary agents. Bridging this gap is key to enabling customizable, efficient, and privacy-preserving agents. Two challenges hinder progress: the reproducibility issues in RL and LLM agent training, where results often depend on sensitive factors like seeds and decoding parameters, and the focus of prior work on single-step tasks, overlooking the complexities of web-based, multi-step decision-making. We address these gaps by providing a statistically driven study of training LLM agents for web tasks. Our two-stage pipeline combines imitation learning from a Llama 3.3 70B teacher with on-policy fine-tuning via Group Relative Policy Optimization (GRPO) on a Llama 3.1 8B student. Through 240 configuration sweeps and rigorous bootstrapping, we chart the first compute allocation curve for open-source LLM web agents. Our findings show that dedicating one-third of compute to teacher traces and the rest to RL improves MiniWoB++ success by 6 points and closes 60% of the gap to GPT-4o on WorkArena, while cutting GPU costs by 45%. We introduce a principled hyperparameter sensitivity analysis, offering actionable guidelines for robust and cost-effective agent training.
Silent Sabotage: Injecting Backdoors into AI Agents Through Fine-Tuning
Abhay Puri
Chandra Kiran Reddy Evuru
Joshua Kazdan
Avinandan Bose
Maryam Fazel
Sai Rajeswar
Jason Stanley
Krishnamurthy Dj Dvijotham
The rise of AI agents that can use tools, browse the web and interact with computers on behalf of a user has sparked strong interest in improving these capabilities by explicitly fine-tuning the LLMs/VLMs that power these agents. Several researchers have proposed collecting data by letting the agents interact with their environment (e.g., a computer operating system, the web or a collection of APIs exposed as tools), and improving agent performance by fine-tuning on this data. In this work, we show that such data collection can be manipulated by adversaries to insert poisoned traces. By modifying just 5% of collected traces, adversaries can embed stealthy bad behaviors into agents, such as leaking confidential user information whenever the tool or webpage exposes a trigger. Our results raise important security concerns in the development of AI agents, and underscore the importance of careful scrutiny of all data collection processes used to improve agentic AI.
Context is Key: A Benchmark for Forecasting with Essential Textual Information
Andrew Robert Williams
Étienne Marcotte
Valentina Zantedeschi
Jithendaraa Subramanian
Roland Riachi
Alexandre Lacoste
Generalization Bounds via Meta-Learned Model Representations: PAC-Bayes and Sample Compression Hypernetworks
Both PAC-Bayesian and Sample Compress learning frameworks have been shown to be instrumental for deriving tight (non-vacuous) generalization bounds for neural networks. We leverage these results in a meta-learning scheme, relying on a hypernetwork that outputs the parameters of a downstream predictor from a dataset input. The originality of our approach lies in the investigated hypernetwork architectures that encode the dataset before decoding the parameters: (1) a PAC-Bayesian encoder that expresses a posterior distribution over a latent space, (2) a Sample Compress encoder that selects a small sample of the dataset input along with a message from a discrete set, and (3) a hybrid between both approaches motivated by a new Sample Compress theorem handling continuous messages. The latter theorem exploits the pivotal information transiting at the encoder-decoder junction in order to compute generalization guarantees for each downstream predictor obtained by our meta-learning scheme.
DoomArena: A framework for Testing AI Agents Against Evolving Security Threats
Mihir Bansal
Chandra Kiran Reddy Evuru
Gabriel Huang
Abhay Puri
Avinandan Bose
Maryam Fazel
Jason Stanley
Alexandre Lacoste
Krishnamurthy Dj Dvijotham
Learning to Defer for Causal Discovery with Imperfect Experts
Oscar Clivio
Sara Magliacane
Valentina Zantedeschi
Integrating expert knowledge, e.g. from large language models, into causal discovery algorithms can be challenging when the knowledge is not guaranteed to be correct. Expert recommendations may contradict data-driven results, and their reliability can vary significantly depending on the domain or specific query. Existing methods based on soft constraints or inconsistencies in predicted causal relationships fail to account for these variations in expertise. To remedy this, we propose L2D-CD, a method for gauging the correctness of expert recommendations and optimally combining them with data-driven causal discovery results. By adapting learning-to-defer (L2D) algorithms for pairwise causal discovery (CD), we learn a deferral function that selects whether to rely on classical causal discovery methods using numerical data or expert recommendations based on textual meta-data. We evaluate L2D-CD on the canonical Tübingen pairs dataset and demonstrate its superior performance compared to both the causal discovery method and the expert used in isolation. Moreover, our approach identifies domains where the expert's performance is strong or weak. Finally, we outline a strategy for generalizing this approach to causal discovery on graphs with more than two variables, paving the way for further research in this area.