
Laurent Charlin

Core Academic Member
Canada CIFAR AI Chair
Associate Professor, HEC Montréal, Department of Decision Sciences
Associate Professor, Université de Montréal, Department of Computer Science and Operations Research
Research Topics
AI for Science
Data Mining
Deep Learning
Generative Models
Graph Neural Networks
Information Retrieval
Natural Language Processing
Probabilistic Models
Recommender Systems
Reinforcement Learning
Representation Learning

Biography

Laurent Charlin is the Interim Scientific Director of Mila – Quebec Artificial Intelligence Institute, a Canada CIFAR AI Chair, as well as an associate professor at HEC Montréal, the business school affiliated with Université de Montréal.

Charlin’s research focuses on developing novel machine learning models to aid in decision-making. Recent work has focused on learning from data that changes over time, and on applications in fields such as recommender systems and optimization.

He has a number of highly cited publications on dialogue systems (chatbots). He co-developed the Toronto Paper Matching System (TPMS), which has been widely used by computer science conferences to match reviewers to papers. He has also taught MOOCs and given introductory talks and media interviews to support knowledge transfer and improve AI literacy.

Current Students

Master's Research - HEC Montréal
Master's Research - HEC Montréal
PhD - Université de Montréal
Master's Research - HEC Montréal
Master's Research - McGill University
PhD - HEC Montréal
PhD - Université Laval
PhD - Université de Montréal
PhD - Université de Montréal
PhD - Concordia University
Collaborating Alumni - Université de Montréal
PhD - Université de Montréal
Postdoctorate - HEC Montréal
PhD - Université de Montréal
PhD - Université de Montréal

Publications

Addressing Concept Mislabeling in Concept Bottleneck Models Through Preference Optimization
Tianyue H. Zhang
Mateo Espinosa Zarlenga
Concept Bottleneck Models (CBMs) propose to enhance the trustworthiness of AI systems by constraining their decisions on a set of human-understandable concepts. However, CBMs typically rely on datasets with assumedly accurate concept labels—an assumption often violated in practice, which we show can significantly degrade performance. To address this, we introduce the Concept Preference Optimization (CPO) objective, a new loss function based on Direct Preference Optimization, which effectively mitigates the negative impact of concept mislabeling on CBM performance. We provide an analysis of some key properties of the CPO objective, showing it directly optimizes for the concept's posterior distribution, and contrast it against Binary Cross Entropy (BCE), where we show CPO is inherently less sensitive to concept noise. We empirically confirm our analysis, finding that CPO consistently outperforms BCE on three real-world datasets with and without added label noise.
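To make the preference-based idea above concrete, here is a minimal sketch of a DPO-style loss applied to binary concept predictions. It is not the paper's exact CPO formulation; the tensor names, the independent-Bernoulli factorization, and the reference-model term are illustrative assumptions.

```python
# Hedged sketch of a DPO-style preference loss over concept labelings.
# Not the authors' exact CPO objective: names and the reference-model term
# are assumptions made purely for illustration.
import torch
import torch.nn.functional as F

def concept_preference_loss(logits, ref_logits, preferred, dispreferred, beta=1.0):
    """Prefer one binary concept labeling over another, DPO-style.

    logits, ref_logits: (batch, n_concepts) raw scores from the current and
        frozen reference concept predictors.
    preferred, dispreferred: (batch, n_concepts) float tensors of 0/1 labels.
    """
    def labeling_log_prob(l, y):
        # Log-probability of a labeling under independent Bernoulli concepts:
        # BCE-with-logits is the negative log-likelihood per concept.
        return -F.binary_cross_entropy_with_logits(l, y, reduction="none").sum(-1)

    margin = (labeling_log_prob(logits, preferred)
              - labeling_log_prob(logits, dispreferred)) \
           - (labeling_log_prob(ref_logits, preferred)
              - labeling_log_prob(ref_logits, dispreferred))
    return -F.logsigmoid(beta * margin).mean()
```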
AInstein: Can AI Rediscover Scientific Concepts from First Principles?
Large language models have demonstrated remarkable capabilities across diverse tasks, yet a fundamental question remains: can these models genuinely rediscover complex scientific insights, or do they merely recite memorized information? We present AInstein, a novel framework for evaluating whether language models can derive established scientific concepts from first principles when stripped of domain-specific terminology. Rather than testing the recall of scientific facts, we reformulate landmark discoveries as conceptual puzzles, challenging models to reconstruct the underlying technical solutions independently.
Evaluating and Improving LitLLMs with Deep Research
Issam Hadj Laradji
Krishnamurthy Dj Dvijotham
Jason Stanley
Literature reviews are an essential component of scientific research, but they remain time-intensive and challenging to write, especially due to the recent influx of research papers. This paper explores the zero-shot abilities of recent Large Language Models (LLMs) in assisting with the writing of literature reviews based on an abstract. We decompose the task into two components: (1) Retrieving related works given a query abstract and (2) Writing a literature review based on the retrieved results. We analyze how effective LLMs are for both components. For retrieval, we introduce a novel two-step search strategy that first uses an LLM to extract meaningful keywords from the abstract of a paper and then retrieves potentially relevant papers by querying an external knowledge base. Additionally, we study a prompting-based re-ranking mechanism with attribution and show that re-ranking doubles the normalized recall compared to naive search methods while providing insights into the LLM's decision-making process. In the generation phase, we propose a two-step approach that first outlines a plan for the review and then executes steps in the plan to generate the actual review. To evaluate different LLM-based literature review methods, we create test sets from arXiv papers using a protocol designed for rolling use with newly released LLMs to avoid test set contamination in zero-shot evaluations. We release this evaluation protocol to promote additional research and development in this regard. Our empirical results suggest that LLMs show promising potential for writing literature reviews when the task is decomposed into smaller components of retrieval and planning. Particularly, our "Deep Research" retrieval variant improves coverage by over 5x compared to standard keyword search, addressing a key bottleneck in the pipeline. Further, we demonstrate that our planning-based approach achieves higher-quality reviews by minimizing hallucinated references in the generated review by 18-26% compared to existing simpler LLM-based generation methods.
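As an illustration of the two-step retrieval strategy described in the abstract (LLM keyword extraction followed by a query to an external knowledge base, plus prompt-based re-ranking), here is a hedged sketch. The functions `call_llm` and `search_index` are hypothetical placeholders, not part of the paper's code or any specific API.

```python
# Hedged sketch of the two-step retrieval idea: (1) an LLM extracts keywords
# from the query abstract, (2) an external paper index is queried with them,
# (3) the LLM re-ranks candidates with short justifications.
# `call_llm` and `search_index` are placeholders; plug in real services.

def call_llm(prompt: str) -> str:
    """Placeholder for an LLM call."""
    raise NotImplementedError("plug in an LLM client here")

def search_index(query: str, limit: int) -> list[dict]:
    """Placeholder for an external knowledge base / paper search index."""
    raise NotImplementedError("plug in a paper search API here")

def retrieve_related_work(abstract: str, top_k: int = 20) -> list[dict]:
    # Step 1: keyword extraction via the LLM.
    keywords = call_llm(
        "Extract a short comma-separated list of search keywords for finding "
        f"papers related to this abstract:\n{abstract}"
    )
    # Step 2: retrieve candidate papers from the external index.
    candidates = search_index(query=keywords, limit=5 * top_k)
    # Step 3: prompt-based re-ranking with attribution; the LLM orders
    # candidates by relevance and justifies each choice.
    titles = "\n".join(f"{i}: {c['title']}" for i, c in enumerate(candidates))
    order = call_llm(
        "Rank these candidates by relevance to the abstract and explain why.\n"
        f"Abstract: {abstract}\nCandidates:\n{titles}\n"
        "Return the indices in ranked order, comma-separated."
    )
    return [candidates[int(i)] for i in order.split(",")[:top_k]]
```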
How to Train Your LLM Web Agent: A Statistical Diagnosis
LLM-based web agents have recently made significant progress, but much of it has occurred in closed-source systems, widening the gap with open-source alternatives. Progress has been held back by two key challenges: first, a narrow focus on single-step tasks that overlooks the complexity of multi-step web interactions; and second, the high compute costs required to post-train LLM-based web agents. To address this, we present the first statistically grounded study on compute allocation for LLM web-agent post-training. Our approach uses a two-stage pipeline, training a Llama 3.1 8B student to imitate a Llama 3.3 70B teacher via supervised fine-tuning (SFT), followed by on-policy reinforcement learning. We find this process highly sensitive to hyperparameter choices, making exhaustive sweeps impractical. To spare others from expensive trial-and-error, we sample 1,370 configurations and use bootstrapping to estimate effective hyperparameters. Our results show that combining SFT with on-policy RL consistently outperforms either approach alone on both WorkArena and MiniWob++. Further, this strategy requires only 55% of the compute to match the peak performance of pure SFT on MiniWob++, effectively pushing the compute-performance Pareto frontier, and is the only strategy that can close the gap with closed-source models.
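The abstract mentions sampling many configurations and using bootstrapping to estimate effective hyperparameters. A minimal sketch of that style of analysis is shown below; the success rates and group names are made up for the example and are not results from the paper.

```python
# Hedged sketch: percentile bootstrap over per-configuration success rates,
# used to compare two training strategies. All numbers are illustrative.
import numpy as np

def bootstrap_mean_ci(scores: np.ndarray, n_boot: int = 10_000, alpha: float = 0.05):
    """Percentile bootstrap confidence interval for a group's mean success rate."""
    rng = np.random.default_rng(0)
    idx = rng.integers(0, len(scores), size=(n_boot, len(scores)))
    boot_means = scores[idx].mean(axis=1)
    lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return scores.mean(), (lo, hi)

# Illustrative comparison: SFT only vs. SFT followed by on-policy RL.
sft_only = np.array([0.41, 0.38, 0.44, 0.40, 0.39])     # made-up success rates
sft_plus_rl = np.array([0.47, 0.51, 0.45, 0.49, 0.50])  # made-up success rates
print("SFT only:   ", bootstrap_mean_ci(sft_only))
print("SFT + RL:   ", bootstrap_mean_ci(sft_plus_rl))
```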
How to Train Your LLM Web Agent: A Statistical Diagnosis
Large language model (LLM) agents for web interfaces have advanced rapidly, yet open-source systems still lag behind proprietary agents. Bridging this gap is key to enabling customizable, efficient, and privacy-preserving agents. Two challenges hinder progress: the reproducibility issues in RL and LLM agent training, where results often depend on sensitive factors like seeds and decoding parameters, and the focus of prior work on single-step tasks, overlooking the complexities of web-based, multi-step decision-making. We address these gaps by providing a statistically driven study of training LLM agents for web tasks. Our two-stage pipeline combines imitation learning from a Llama 3.3 70B teacher with on-policy fine-tuning via Group Relative Policy Optimization (GRPO) on a Llama 3.1 8B student. Through 240 configuration sweeps and rigorous bootstrapping, we chart the first compute allocation curve for open-source LLM web agents. Our findings show that dedicating one-third of compute to teacher traces and the rest to RL improves MiniWoB++ success by 6 points and closes 60% of the gap to GPT-4o on WorkArena, while cutting GPU costs by 45%. We introduce a principled hyperparameter sensitivity analysis, offering actionable guidelines for robust and cost-effective agent training.
LitLLMs, LLMs for Literature Review: Are we there yet?
Issam Hadj Laradji
Krishnamurthy Dj Dvijotham
Jason Stanley
PREFERENCE OPTIMIZATION FOR CONCEPT BOTTLENECK MODELS
Tianyue H. Zhang
Mateo Espinosa Zarlenga
Concept Bottleneck Models (CBMs) propose to enhance the trustworthiness of AI systems by constraining their decisions on a set of human-understandable concepts. However, CBMs typically assume that datasets contain accurate concept labels—an assumption often violated in practice, which we show can significantly degrade performance (by 25% in some cases). To address this, we introduce the Concept Preference Optimization (CPO) objective, a new loss function based on Direct Preference Optimization, which effectively mitigates the negative impact of concept mislabeling on CBM performance. We provide an analysis of some key properties of the CPO objective showing it directly optimizes for the concept's posterior distribution, and contrast it against Binary Cross Entropy (BCE) where we show CPO is inherently less sensitive to concept noise. We empirically confirm our analysis finding that CPO consistently outperforms BCE in three real-world datasets with and without added label noise.
Integrating Present and Past in Unsupervised Continual Learning
Richard Zemel
Mengye Ren
We formulate a unifying framework for *unsupervised continual learning (UCL)*, which disentangles learning objectives that are specific to the present and the past data, encompassing *stability*, *plasticity*, and *cross-task consolidation*. The framework reveals that many existing UCL approaches overlook cross-task consolidation and try to balance plasticity and stability in a shared embedding space. This results in worse performance due to a lack of within-task data diversity and reduced effectiveness in learning the current task. Our method, *Osiris*, which explicitly optimizes all three objectives on separate embedding spaces, achieves state-of-the-art performance on all benchmarks, including two novel ones proposed in this paper featuring semantically structured task sequences. Finally, we show some preliminary evidence that continual models can benefit from these more realistic learning scenarios.
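As a rough schematic of the three-objective decomposition described above (plasticity, stability, and cross-task consolidation, each in its own embedding space), here is a hedged PyTorch-style sketch. The projection heads, the shared encoder interface, and the generic `contrastive_loss` argument are assumptions for illustration, not the Osiris implementation.

```python
# Hedged sketch: three objectives computed in separate projection spaces and
# summed. Placeholder components only; not the authors' code.
import torch
import torch.nn as nn

class ThreeObjectiveUCL(nn.Module):
    def __init__(self, encoder: nn.Module, dim: int = 128):
        super().__init__()
        self.encoder = encoder
        # One projection head per objective, i.e. separate embedding spaces.
        self.plasticity_head = nn.Linear(dim, dim)
        self.stability_head = nn.Linear(dim, dim)
        self.consolidation_head = nn.Linear(dim, dim)

    def forward(self, current_batch, memory_batch, contrastive_loss):
        z_cur = self.encoder(current_batch)   # current-task representations
        z_mem = self.encoder(memory_batch)    # replayed past-task representations
        # Plasticity: learn the current task in its own space.
        l_plasticity = contrastive_loss(self.plasticity_head(z_cur))
        # Stability: preserve structure of past tasks via memory samples.
        l_stability = contrastive_loss(self.stability_head(z_mem))
        # Cross-task consolidation: relate current and past samples jointly.
        l_consolidation = contrastive_loss(
            self.consolidation_head(torch.cat([z_cur, z_mem], dim=0))
        )
        return l_plasticity + l_stability + l_consolidation
```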