
Laurent Charlin

Core Academic Member
Canada CIFAR AI Chair
Associate Professor, HEC Montréal, Department of Decision Sciences
Associate Professor, Université de Montréal, Department of Computer Science and Operations Research
Research Topics
AI for Science
Data Mining
Deep Learning
Generative Models
Graph Neural Networks
Information Retrieval
Natural Language Processing
Probabilistic Models
Recommender Systems
Reinforcement Learning
Representation Learning

Biography

Laurent Charlin is the Interim Scientific Director of Mila – Quebec Artificial Intelligence Institute, a Canada CIFAR AI Chair, as well as an associate professor at HEC Montréal, the business school affiliated with Université de Montréal.

Charlin’s research focuses on developing novel machine learning models to aid in decision-making. Recent work has focused on learning from data that changes over time, and on applications in fields such as recommender systems and optimization.

He has a number of highly cited publications on dialogue systems (chatbots). He co-developed the Toronto Paper Matching System (TPMS), which has been widely used by computer science conferences to match reviewers to papers. He has also taught MOOCs and given introductory talks and media interviews to support knowledge transfer and improve AI literacy.

Current Students

Master's Research - HEC Montréal
Master's Research - HEC Montréal
PhD - Université de Montréal
Master's Research - HEC Montréal
Master's Research - McGill University
PhD - HEC Montréal
PhD - Université Laval
PhD - Université de Montréal
PhD - Université de Montréal
PhD - Concordia University
Collaborating Alumni - Université de Montréal
PhD - Université de Montréal
Postdoctorate - HEC Montréal
PhD - Université de Montréal
PhD - Université de Montréal

Publications

Addressing Concept Mislabeling in Concept Bottleneck Models Through Preference Optimization
Tianyue H. Zhang
Mateo Espinosa Zarlenga
Concept Bottleneck Models (CBMs) propose to enhance the trustworthiness of AI systems by constraining their decisions on a set of human-understandable concepts. However, CBMs typically rely on datasets with assumedly accurate concept labels—an assumption often violated in practice, which we show can significantly degrade performance. To address this, we introduce the Concept Preference Optimization (CPO) objective, a new loss function based on Direct Preference Optimization, which effectively mitigates the negative impact of concept mislabeling on CBM performance. We provide an analysis of some key properties of the CPO objective, showing it directly optimizes for the concept's posterior distribution, and contrast it against Binary Cross Entropy (BCE), where we show CPO is inherently less sensitive to concept noise. We empirically confirm our analysis, finding that CPO consistently outperforms BCE on three real-world datasets with and without added label noise.
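To make the preference-based idea above concrete, here is a minimal sketch of a DPO-style loss applied to binary concept predictions. It is not the paper's exact CPO formulation; the tensor names, the independent-Bernoulli factorization, and the reference-model term are illustrative assumptions.

```python
# Hedged sketch of a DPO-style preference loss over concept labelings.
# Not the authors' exact CPO objective: names and the reference-model term
# are assumptions made purely for illustration.
import torch
import torch.nn.functional as F

def concept_preference_loss(logits, ref_logits, preferred, dispreferred, beta=1.0):
    """Prefer one binary concept labeling over another, DPO-style.

    logits, ref_logits: (batch, n_concepts) raw scores from the current and
        frozen reference concept predictors.
    preferred, dispreferred: (batch, n_concepts) float tensors of 0/1 labels.
    """
    def labeling_log_prob(l, y):
        # Log-probability of a labeling under independent Bernoulli concepts:
        # BCE-with-logits is the negative log-likelihood per concept.
        return -F.binary_cross_entropy_with_logits(l, y, reduction="none").sum(-1)

    margin = (labeling_log_prob(logits, preferred)
              - labeling_log_prob(logits, dispreferred)) \
           - (labeling_log_prob(ref_logits, preferred)
              - labeling_log_prob(ref_logits, dispreferred))
    return -F.logsigmoid(beta * margin).mean()
```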
AInstein: Can AI Rediscover Scientific Concepts from First Principles?
Large language models have demonstrated remarkable capabilities across diverse tasks, yet a fundamental question remains: can these models genuinely rediscover complex scientific insights, or do they merely recite memorized information? We present AInstein, a novel framework for evaluating whether language models can derive established scientific concepts from first principles when stripped of domain-specific terminology. Rather than testing the recall of scientific facts, we reformulate landmark discoveries as conceptual puzzles, challenging models to reconstruct the underlying technical solutions independently.
Evaluating and Improving LitLLMs with Deep Research
Issam Hadj Laradji
Krishnamurthy Dj Dvijotham
Jason Stanley
Literature reviews are an essential component of scientific research, but they remain time-intensive and challenging to write, especially due to the recent influx of research papers. This paper explores the zero-shot abilities of recent Large Language Models (LLMs) in assisting with the writing of literature reviews based on an abstract. We decompose the task into two components: (1) Retrieving related works given a query abstract and (2) Writing a literature review based on the retrieved results. We analyze how effective LLMs are for both components. For retrieval, we introduce a novel two-step search strategy that first uses an LLM to extract meaningful keywords from the abstract of a paper and then retrieves potentially relevant papers by querying an external knowledge base. Additionally, we study a prompting-based re-ranking mechanism with attribution and show that re-ranking doubles the normalized recall compared to naive search methods while providing insights into the LLM's decision-making process. In the generation phase, we propose a two-step approach that first outlines a plan for the review and then executes steps in the plan to generate the actual review. To evaluate different LLM-based literature review methods, we create test sets from arXiv papers using a protocol designed for rolling use with newly released LLMs to avoid test set contamination in zero-shot evaluations. We release this evaluation protocol to promote additional research and development in this regard. Our empirical results suggest that LLMs show promising potential for writing literature reviews when the task is decomposed into smaller components of retrieval and planning. Particularly, our "Deep Research" retrieval variant improves coverage by over 5x compared to standard keyword search, addressing a key bottleneck in the pipeline. Further, we demonstrate that our planning-based approach achieves higher-quality reviews by minimizing hallucinated references in the generated review by 18-26% compared to existing simpler LLM-based generation methods.
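As an illustration of the two-step retrieval strategy described in the abstract (LLM keyword extraction followed by a query to an external knowledge base, plus prompt-based re-ranking), here is a hedged sketch. The functions `call_llm` and `search_index` are hypothetical placeholders, not part of the paper's code or any specific API.

```python
# Hedged sketch of the two-step retrieval idea: (1) an LLM extracts keywords
# from the query abstract, (2) an external paper index is queried with them,
# (3) the LLM re-ranks candidates with short justifications.
# `call_llm` and `search_index` are placeholders; plug in real services.

def call_llm(prompt: str) -> str:
    """Placeholder for an LLM call."""
    raise NotImplementedError("plug in an LLM client here")

def search_index(query: str, limit: int) -> list[dict]:
    """Placeholder for an external knowledge base / paper search index."""
    raise NotImplementedError("plug in a paper search API here")

def retrieve_related_work(abstract: str, top_k: int = 20) -> list[dict]:
    # Step 1: keyword extraction via the LLM.
    keywords = call_llm(
        "Extract a short comma-separated list of search keywords for finding "
        f"papers related to this abstract:\n{abstract}"
    )
    # Step 2: retrieve candidate papers from the external index.
    candidates = search_index(query=keywords, limit=5 * top_k)
    # Step 3: prompt-based re-ranking with attribution; the LLM orders
    # candidates by relevance and justifies each choice.
    titles = "\n".join(f"{i}: {c['title']}" for i, c in enumerate(candidates))
    order = call_llm(
        "Rank these candidates by relevance to the abstract and explain why.\n"
        f"Abstract: {abstract}\nCandidates:\n{titles}\n"
        "Return the indices in ranked order, comma-separated."
    )
    return [candidates[int(i)] for i in order.split(",")[:top_k]]
```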
How to Train Your LLM Web Agent: A Statistical Diagnosis
LLM-based web agents have recently made significant progress, but much of it has occurred in closed-source systems, widening the gap with open-source alternatives. Progress has been held back by two key challenges: first, a narrow focus on single-step tasks that overlooks the complexity of multi-step web interactions; and second, the high compute costs required to post-train LLM-based web agents. To address this, we present the first statistically grounded study on compute allocation for LLM web-agent post-training. Our approach uses a two-stage pipeline, training a Llama 3.1 8B student to imitate a Llama 3.3 70B teacher via supervised fine-tuning (SFT), followed by on-policy reinforcement learning. We find this process highly sensitive to hyperparameter choices, making exhaustive sweeps impractical. To spare others from expensive trial-and-error, we sample 1,370 configurations and use bootstrapping to estimate effective hyperparameters. Our results show that combining SFT with on-policy RL consistently outperforms either approach alone on both WorkArena and MiniWob++. Further, this strategy requires only 55% of the compute to match the peak performance of pure SFT on MiniWob++, effectively pushing the compute-performance Pareto frontier, and is the only strategy that can close the gap with closed-source models.
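The abstract mentions sampling many configurations and using bootstrapping to estimate effective hyperparameters. A minimal sketch of that style of analysis is shown below; the success rates and group names are made up for the example and are not results from the paper.

```python
# Hedged sketch: percentile bootstrap over per-configuration success rates,
# used to compare two training strategies. All numbers are illustrative.
import numpy as np

def bootstrap_mean_ci(scores: np.ndarray, n_boot: int = 10_000, alpha: float = 0.05):
    """Percentile bootstrap confidence interval for a group's mean success rate."""
    rng = np.random.default_rng(0)
    idx = rng.integers(0, len(scores), size=(n_boot, len(scores)))
    boot_means = scores[idx].mean(axis=1)
    lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return scores.mean(), (lo, hi)

# Illustrative comparison: SFT only vs. SFT followed by on-policy RL.
sft_only = np.array([0.41, 0.38, 0.44, 0.40, 0.39])     # made-up success rates
sft_plus_rl = np.array([0.47, 0.51, 0.45, 0.49, 0.50])  # made-up success rates
print("SFT only:   ", bootstrap_mean_ci(sft_only))
print("SFT + RL:   ", bootstrap_mean_ci(sft_plus_rl))
```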
How to Train Your LLM Web Agent: A Statistical Diagnosis
Large language model (LLM) agents for web interfaces have advanced rapidly, yet open-source systems still lag behind proprietary agents. Bridging this gap is key to enabling customizable, efficient, and privacy-preserving agents. Two challenges hinder progress: the reproducibility issues in RL and LLM agent training, where results often depend on sensitive factors like seeds and decoding parameters, and the focus of prior work on single-step tasks, overlooking the complexities of web-based, multi-step decision-making. We address these gaps by providing a statistically driven study of training LLM agents for web tasks. Our two-stage pipeline combines imitation learning from a Llama 3.3 70B teacher with on-policy fine-tuning via Group Relative Policy Optimization (GRPO) on a Llama 3.1 8B student. Through 240 configuration sweeps and rigorous bootstrapping, we chart the first compute allocation curve for open-source LLM web agents. Our findings show that dedicating one-third of compute to teacher traces and the rest to RL improves MiniWoB++ success by 6 points and closes 60% of the gap to GPT-4o on WorkArena, while cutting GPU costs by 45%. We introduce a principled hyperparameter sensitivity analysis, offering actionable guidelines for robust and cost-effective agent training.
LitLLMs, LLMs for Literature Review: Are we there yet?
Issam Hadj Laradji
Krishnamurthy Dj Dvijotham
Jason Stanley
PREFERENCE OPTIMIZATION FOR CONCEPT BOTTLENECK MODELS
Tianyue H. Zhang
Mateo Espinosa Zarlenga
Concept Bottleneck Models (CBMs) propose to enhance the trustworthiness of AI systems by constraining their decisions on a set of human-understandable concepts. However, CBMs typically assume that datasets contain accurate concept labels—an assumption often violated in practice, which we show can significantly degrade performance (by 25% in some cases). To address this, we introduce the Concept Preference Optimization (CPO) objective, a new loss function based on Direct Preference Optimization, which effectively mitigates the negative impact of concept mislabeling on CBM performance. We provide an analysis of some key properties of the CPO objective showing it directly optimizes for the concept's posterior distribution, and contrast it against Binary Cross Entropy (BCE) where we show CPO is inherently less sensitive to concept noise. We empirically confirm our analysis finding that CPO consistently outperforms BCE in three real-world datasets with and without added label noise.
Integrating Present and Past in Unsupervised Continual Learning
Richard Zemel
Mengye Ren
We formulate a unifying framework for *unsupervised continual learning (UCL)*, which disentangles learning objectives that are specific to the present and the past data, encompassing *stability*, *plasticity*, and *cross-task consolidation*. The framework reveals that many existing UCL approaches overlook cross-task consolidation and try to balance plasticity and stability in a shared embedding space. This results in worse performance due to a lack of within-task data diversity and reduced effectiveness in learning the current task. Our method, *Osiris*, which explicitly optimizes all three objectives on separate embedding spaces, achieves state-of-the-art performance on all benchmarks, including two novel ones proposed in this paper featuring semantically structured task sequences. Finally, we show some preliminary evidence that continual models can benefit from these more realistic learning scenarios.
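As a rough schematic of the three-objective decomposition described above (plasticity, stability, and cross-task consolidation, each in its own embedding space), here is a hedged PyTorch-style sketch. The projection heads, the shared encoder interface, and the generic `contrastive_loss` argument are assumptions for illustration, not the Osiris implementation.

```python
# Hedged sketch: three objectives computed in separate projection spaces and
# summed. Placeholder components only; not the authors' code.
import torch
import torch.nn as nn

class ThreeObjectiveUCL(nn.Module):
    def __init__(self, encoder: nn.Module, dim: int = 128):
        super().__init__()
        self.encoder = encoder
        # One projection head per objective, i.e. separate embedding spaces.
        self.plasticity_head = nn.Linear(dim, dim)
        self.stability_head = nn.Linear(dim, dim)
        self.consolidation_head = nn.Linear(dim, dim)

    def forward(self, current_batch, memory_batch, contrastive_loss):
        z_cur = self.encoder(current_batch)   # current-task representations
        z_mem = self.encoder(memory_batch)    # replayed past-task representations
        # Plasticity: learn the current task in its own space.
        l_plasticity = contrastive_loss(self.plasticity_head(z_cur))
        # Stability: preserve structure of past tasks via memory samples.
        l_stability = contrastive_loss(self.stability_head(z_mem))
        # Cross-task consolidation: relate current and past samples jointly.
        l_consolidation = contrastive_loss(
            self.consolidation_head(torch.cat([z_cur, z_mem], dim=0))
        )
        return l_plasticity + l_stability + l_consolidation
```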