Sahar Omidi Shayegan

How to Train Your LLM Web Agent: A Statistical Diagnosis

Dheeraj Vattikonda

Santhoshi Ravichandran

Emiliano Penaloza

Hadi Nekoei

Megh Thakkar

Thibault Le Sellier De Chezelles

Nicolas Gontier

Miguel Muñoz-Mármol

Sahar Omidi Shayegan

Stefania Raimondo

Xue Liu

Alexandre Drouin

Alexandre Piché

Alexandre Lacoste

Massimo Caccia

LLM-based web agents have recently made significant progress, but much of it has occurred in closed-source systems, widening the gap with open-source alternatives. Progress has been held back by two key challenges: first, a narrow focus on single-step tasks that overlooks the complexity of multi-step web interactions; and second, the high compute costs required to post-train LLM-based web agents. To address this, we present the first statistically grounded study on compute allocation for LLM web-agent post-training. Our approach uses a two-stage pipeline, training a Llama 3.1 8B student to imitate a Llama 3.3 70B teacher via supervised fine-tuning (SFT), followed by on-policy reinforcement learning. We find this process highly sensitive to hyperparameter choices, making exhaustive sweeps impractical. To spare others from expensive trial-and-error, we sample 1,370 configurations and use bootstrapping to estimate effective hyperparameters. Our results show that combining SFT with on-policy RL consistently outperforms either approach alone on both WorkArena and MiniWob++. Further, this strategy requires only 55% of the compute to match the peak performance of pure SFT on MiniWob++, effectively pushing the compute-performance Pareto frontier, and is the only strategy that can close the gap with closed-source models.

2025-09-17

NeurIPS.cc/2025/Conference (poster)

doi.org

openreview.net

Uncovering Hidden Factions through Text-Network Representations: Unsupervised Public Opinion Mapping of Iran on Twitter in the 2022 Unrest

Sahar Omidi Shayegan

Jean-François Godbout

Reihaneh Rabbany

Ideological mapping on social media is typically framed as a supervised classification task that depends on stable party systems and abundant annotated data. These assumptions fail in contexts with weak political institutionalization, such as Iran. We recast ideology detection as a fully unsupervised mapping problem and introduce a text-network representation system, uncovering latent ideological factions on Persian Twitter during the 2022 Mahsa Amini protests. Using hundreds of millions of Persian tweets, we learn joint text–network embeddings by fine-tuning ParsBERT with a combined masked-language-modeling and contrastive objective and by passing the embeddings through a Graph Attention Network trained for link prediction on time-batched subgraphs. The pipeline integrates semantic and structural signals without observing labels. Density-based clustering reveals eight ideological blocs whose spatial relations mirror known political alliances. Alignment with 883 expert-labeled accounts yields 53% accuracy. This label-free framework scales to label-scarce contexts, offering new leverage for studying political debates online.

2025-07-25

colmweb.org/COLM/2025/Workshop/NLPOR (published)

openreview.net

The BrowserGym Ecosystem for Web Agent Research

Thibault Le Sellier De Chezelles

Maxime Gasse

Alexandre Lacoste

Massimo Caccia

Lawrence Keunho Jang

Ori Yoran

Dehan Kong

Frank F. Xu

Siva Reddy

Quentin Cappart

Graham Neubig

Ruslan Salakhutdinov

Nicolas Chapados

The BrowserGym ecosystem addresses the growing need for efficient evaluation and benchmarking of web agents, particularly those leveraging automation and Large Language Models (LLMs). Many existing benchmarks suffer from fragmentation and inconsistent evaluation methodologies, making it challenging to achieve reliable comparisons and reproducible results. In earlier work, Drouin et al. (2024) introduced BrowserGym, which aims to solve this by providing a unified, gym-like environment with well-defined observation and action spaces, facilitating standardized evaluation across diverse benchmarks. We propose an extended BrowserGym-based ecosystem for web agent research, which unifies existing benchmarks from the literature and includes AgentLab, a complementary framework that aids in agent creation, testing, and analysis. Our proposed ecosystem offers flexibility for integrating new benchmarks while ensuring consistent evaluation and comprehensive experiment management. As supporting evidence, we conduct the first large-scale, multi-benchmark web agent experiment and compare the performance of 6 state-of-the-art LLMs across 6 popular web agent benchmarks made available in BrowserGym. Among other findings, our results highlight a large discrepancy between OpenAI's and Anthropic's latest models, with Claude-3.5-Sonnet leading the way on almost all benchmarks, except on vision-related tasks where GPT-4o is superior. Despite these advancements, our results emphasize that building robust and efficient web agents remains a significant challenge, due to the inherent complexity of real-world web environments and the limitations of current models.

2024-12-31

Trans. Mach. Learn. Res. (published)

doi.org

openreview.net

An Evaluation of Language Models for Hyperpartisan Ideology Detection in Persian Twitter

Sahar Omidi Shayegan

Isar Nejadgholi

Kellin Pelrine

Hao Yu

Sacha Lévy

Zachary Yang

Jean-François Godbout

Reihaneh Rabbany

Large Language Models (LLMs) have shown significant promise in various tasks, including identifying the political beliefs of English-speaking social media users from their posts. However, assessing LLMs for this task in non-English languages remains unexplored. In this work, we ask to what extent LLMs can predict the political ideologies of users in Persian social media. To answer this question, we first acknowledge that political parties are not well-defined among Persian users, and therefore we reduce the task to the much simpler one of hyperpartisan ideology detection. We create a new benchmark and show the potential and limitations of both open-source and commercial LLMs in classifying the hyperpartisan ideologies of users. We compare these models with smaller fine-tuned models, both on the Persian language (ParsBERT) and translated data (RoBERTa), showing that they considerably outperform generative LLMs in this task. We further demonstrate that the performance of the generative LLMs degrades when classifying users based on their tweets instead of their bios and even when tweets are added as additional information, whereas the smaller fine-tuned models are robust and achieve similar performance for all classes. This study is a first step toward political ideology detection in Persian Twitter, with implications for future research to understand the dynamics of ideologies in Persian social media.

2023-12-31

EURALI (published)

www.semanticscholar.org
