Dzmitry Bahdanau

How to Get Your LLM to Generate Challenging Problems for Evaluation

The pace of evolution of Large Language Models (LLMs) necessitates new approaches for rigorous and comprehensive evaluation. Traditional human annotation is increasingly impracticable due to the complexities and costs involved in generating high-quality, challenging problems, particularly for tasks such as long-context reasoning. Moreover, the rapid saturation of existing human-curated benchmarks by LLMs further underscores the need to develop scalable and automatically renewable evaluation methodologies. In this work, we introduce **CHASE**, a unified framework to synthetically generate challenging problems using LLMs without human involvement. For a given task, our approach builds a hard problem in a bottom-up manner from simpler components. Moreover, since we want to generate synthetic data for evaluation, our framework decomposes the generation process into independently verifiable sub-tasks, thereby ensuring a high level of quality and correctness. We implement CHASE to create evaluation benchmarks across three diverse domains: document-based question answering, repository-level code completion, and math reasoning. The performance of state-of-the-art LLMs on these synthetic benchmarks lies in the range of 40-60% accuracy, thereby demonstrating the effectiveness of our framework at generating hard problems. Our experiments further reveal that the Gemini models significantly outperform other LLMs at long-context reasoning, and that the performance of all LLMs drops drastically, by as much as 70%, when we scale up the context size to 50k tokens.

2025-09-23

NeurIPS.cc/2025/Workshop/LLM_Evaluation (poster)

doi.org

openreview.net
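
The bottom-up idea in the abstract above can be pictured with a toy sketch. Everything here is an assumption made for illustration, not the CHASE implementation: the arithmetic domain, the function names `apply_op`, `verify_step`, and `build_problem`, and the choice of verifying each step by undoing the operation. It only shows a problem growing one simple step at a time, with every step checked independently before it is accepted.

```python
from typing import List, Tuple


def apply_op(value: int, op: str, operand: int) -> int:
    """Generator side: compute the answer after one more simple step."""
    if op == "plus":
        return value + operand
    if op == "minus":
        return value - operand
    if op == "times":
        return value * operand
    raise ValueError(f"unknown operation: {op}")


def verify_step(prev: int, op: str, operand: int, claimed: int) -> bool:
    """Checker side (hypothetical): undo the operation and compare with the previous answer."""
    if op == "plus":
        return claimed - operand == prev
    if op == "minus":
        return claimed + operand == prev
    if op == "times":
        if operand == 0:
            return claimed == 0
        return claimed % operand == 0 and claimed // operand == prev
    return False


def build_problem(seed: int, steps: List[Tuple[str, int]]) -> Tuple[str, int]:
    """Grow a multi-step problem from a seed, accepting a step only if it verifies."""
    text = f"Start with {seed}."
    answer = seed
    for op, operand in steps:
        candidate = apply_op(answer, op, operand)
        if not verify_step(answer, op, operand, candidate):
            raise ValueError(f"step '{op} {operand}' failed verification")
        text += f" Then compute the result {op} {operand}."
        answer = candidate
    return text, answer
```

Calling `build_problem(3, [("plus", 4), ("times", 5), ("minus", 6)])` returns a three-step problem text together with its verified answer.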

How to Get Your LLM to Generate Challenging Problems for Evaluation

Arkil Patel

Siva Reddy

Dzmitry Bahdanau

The pace of evolution of Large Language Models (LLMs) necessitates new approaches for rigorous and comprehensive evaluation. Traditional human annotation is increasingly impracticable due to the complexities and costs involved in generating high-quality, challenging problems. In this work, we introduce **CHASE**, a framework to synthetically generate challenging problems using LLMs without human involvement. For a given task, our approach builds a difficult problem in a bottom-up manner from simpler components in a verifiable way. We implement CHASE to create evaluation benchmarks across three diverse domains on which state-of-the-art LLMs demonstrate severe vulnerabilities.

2025-09-23

NeurIPS.cc/2025/Workshop/LLM_Evaluation (poster)

openreview.net

PipelineRL: Faster On-policy Reinforcement Learning for Long Sequence Generation

Alexandre Piché

Ehsan Kamalloo

Rafael Pardinas

Xiaoyin Chen

Dzmitry Bahdanau

Reinforcement Learning (RL) is increasingly utilized to enhance the reasoning capabilities of Large Language Models (LLMs). However, effectively scaling these RL methods presents significant challenges, primarily due to the difficulty in maintaining high AI accelerator utilization without generating stale, off-policy data that harms common RL algorithms. This paper introduces PipelineRL, an approach designed to achieve a superior trade-off between hardware efficiency and data on-policyness for LLM training. PipelineRL employs concurrent asynchronous data generation and model training, distinguished by novel in-flight weight updates. This mechanism allows the LLM generation engine to receive updated model weights with minimal interruption during the generation of token sequences, thereby maximizing both the accelerator utilization and the freshness of training data. Experiments conducted on long-form reasoning tasks using 128 H100 GPUs demonstrate that PipelineRL achieves approximately

2025-09-23

ArXiv (preprint)

arxiv.org
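
As a rough illustration of the in-flight weight updates described in the abstract above, here is a toy, single-threaded simulation. It is not PipelineRL code: the integer version counter standing in for model weights, the `trainer_steps` schedule, and the names `token_versions` and `mean_lag` are all assumptions made for this sketch. It only shows that, with in-flight updates, a sequence already being generated continues under the newest weights instead of finishing under the ones it started with.

```python
from typing import List, Set


def token_versions(num_tokens: int, trainer_steps: Set[int], inflight: bool) -> List[int]:
    """Return, for each generated token, the weight version that produced it.

    trainer_steps holds the token positions at which the (simulated) trainer
    publishes new weights while the sequence is still being generated.
    """
    latest = 0  # newest weights published by the trainer
    loaded = 0  # weights currently held by the generation engine
    versions = []
    for t in range(num_tokens):
        if t in trainer_steps:
            latest += 1
            if inflight:
                # Swap weights mid-sequence; generation carries on from the same position.
                loaded = latest
        versions.append(loaded)
    return versions


def mean_lag(versions: List[int], final_version: int) -> float:
    """Average number of trainer updates each token is behind the final weights."""
    return sum(final_version - v for v in versions) / len(versions)
```

In this toy setting, comparing `token_versions(6, {2, 4}, inflight=True)` with `inflight=False` shows later tokens of the same sequence being produced by fresher weights in the first case.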

BigCharts-R1: Enhanced Chart Reasoning with Visual Reinforcement Finetuning

Ahmed Masry

Abhay Puri

Masoud Hashemi

Juan A. Rodriguez

Megh Thakkar

Khyati Mahajan

Vikas Yadav

Sathwik Tejaswi Madhusudhan

Enamul Hoque

Sai Rajeswar

2025-07-07

colmweb.org/COLM/2025/Conference (accepted)

doi.org

openreview.net

NNetNav: Unsupervised Learning of Browser Agents Through Environment Interaction in the Wild

Shikhar Murty

Dzmitry Bahdanau

Hao Zhu

Christopher D Manning

2025-03-07

ICLR.cc/2025/Workshop/SSI-FM (poster)

openreview.net

NNetNav: Unsupervised Learning of Browser Agents Through Environment Interaction in the Wild

Shikhar Murty

Hao Zhu

Dzmitry Bahdanau

Christopher D Manning

We introduce NNetNav, a method for unsupervised interaction with websites that generates synthetic demonstrations for training browser agents. Given any website, NNetNav produces these demonstrations by retroactively labeling action sequences from an exploration policy. Most work on training browser agents has relied on expensive human supervision, and the limited prior work on such interaction-based techniques has failed to provide effective search through the exponentially large space of exploration. In contrast, NNetNav exploits the hierarchical structure of language instructions to make this search more tractable: complex instructions are typically decomposable into simpler sub-tasks, allowing NNetNav to automatically prune interaction episodes when an intermediate trajectory cannot be annotated with a meaningful sub-task. Llama-3.1-8b finetuned on 10k NNetNav self-generated demonstrations obtains over 16% success rate on WebArena, and 35% on WebVoyager, an improvement of 15pts and 31pts respectively over zero-shot Llama-3.1-8b, outperforming zero-shot GPT-4 and reaching the state-of-the-art among unsupervised methods, for both benchmarks.

2025-03-07

ICLR.cc/2025/Workshop/SSI-FM (poster)

openreview.net
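
The pruning idea in the abstract above can be sketched in a few lines. This is a toy under stated assumptions, not the NNetNav implementation: `explore_with_pruning`, `toy_labeler`, the string-encoded actions, and the `check_every` interval are all hypothetical, and the labeler is a simple rule standing in for a language model. The sketch shows an exploration episode being abandoned as soon as its partial trajectory cannot be given a meaningful sub-task label, and a surviving trajectory being labeled retroactively with an instruction.

```python
from typing import Callable, List, Optional, Tuple


def toy_labeler(trajectory: List[str]) -> Optional[str]:
    """Stand-in for a language-model labeler (assumption): an immediately
    repeated action is treated as aimless, so no sub-task can be assigned."""
    if len(trajectory) >= 2 and trajectory[-1] == trajectory[-2]:
        return None
    return "do: " + " then ".join(trajectory)


def explore_with_pruning(
    propose_action: Callable[[List[str]], str],
    label_subtask: Callable[[List[str]], Optional[str]],
    max_steps: int = 8,
    check_every: int = 2,
) -> Optional[Tuple[str, List[str]]]:
    """Roll out an exploration policy, pruning the episode early when the
    trajectory so far cannot be annotated with a sub-task."""
    trajectory: List[str] = []
    for step in range(1, max_steps + 1):
        trajectory.append(propose_action(trajectory))
        if step % check_every == 0 and label_subtask(trajectory) is None:
            return None  # prune: stop spending interaction budget on this episode
    # Retroactively label the completed trajectory with an instruction.
    instruction = label_subtask(trajectory)
    if instruction is None:
        return None
    return instruction, trajectory
```

An episode that returns `None` contributes no demonstration; one that survives yields an (instruction, trajectory) pair that could serve as a synthetic training example.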

LLMs can learn self-restraint through iterative self-reflection

2025-01-01

Trans. Mach. Learn. Res. (published)

doi.org

openreview.net

TapeAgents: a Holistic Framework for Agent Development and Optimization

Dzmitry Bahdanau

Nicolas Gontier

Gabriel Huang

Ehsan Kamalloo

Rafael Pardinas

Alex Piché

Torsten Scholak

Oleh Shliazhko

Jordan Prince Tremblay

Karam Ghanem

Soham Parikh

Mitul Tiwari

Quaizar Vohra

We present TapeAgents, an agent framework built around a granular, structured log tape of the agent session that also plays the role of the session's resumable state. In TapeAgents we leverage tapes to facilitate all stages of the LLM Agent development lifecycle. The agent reasons by processing the tape and the LLM output to produce new thought and action steps and append them to the tape. The environment then reacts to the agent's actions by likewise appending observation steps to the tape. By virtue of this tape-centred design, TapeAgents can provide AI practitioners with holistic end-to-end support. At the development stage, tapes facilitate session persistence, agent auditing, and step-by-step debugging. Post-deployment, one can reuse tapes for evaluation, fine-tuning, and prompt-tuning; crucially, one can adapt tapes from other agents or use revised historical tapes. In this report, we explain the TapeAgents design in detail. We demonstrate possible applications of TapeAgents with several concrete examples of building monolithic agents and multi-agent teams, of optimizing agent prompts and finetuning the agent's LLM. We present tooling prototypes and report a case study where we use TapeAgents to finetune a Llama-3.1-8B form-filling assistant to perform as well as GPT-4o while being orders of magnitude cheaper. Lastly, our comparative analysis shows that TapeAgents's advantages over prior frameworks stem from our novel design of the LLM agent as a resumable, modular state machine with a structured configuration, that generates granular, structured logs and that can transform these logs into training text -- a unique combination of features absent in previous work.

2024-12-11

ArXiv (preprint)

arxiv.org
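
The tape-centred loop described in the abstract above can be illustrated with a minimal sketch. This is not the TapeAgents library API: the `Step` and `Tape` classes, the `agent_turn`, `environment_turn`, and `run` functions, and the rule-based agent standing in for an LLM call are all assumptions made for illustration. The sketch shows the agent appending thought and action steps, the environment appending observation steps, and the tape doubling as resumable session state.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass(frozen=True)
class Step:
    kind: str  # "thought", "action", or "observation"
    content: str


@dataclass
class Tape:
    """Append-only log of the session; also the only state needed to resume it."""
    steps: List[Step] = field(default_factory=list)

    def append(self, step: Step) -> None:
        self.steps.append(step)

    def to_json(self) -> str:
        return json.dumps([asdict(s) for s in self.steps])

    @classmethod
    def from_json(cls, text: str) -> "Tape":
        return cls([Step(**d) for d in json.loads(text)])


def agent_turn(tape: Tape) -> None:
    # Stand-in for an LLM call (assumption): decide the next steps from the tape alone.
    seen = sum(s.kind == "observation" for s in tape.steps)
    tape.append(Step("thought", f"I have seen {seen} observations."))
    tape.append(Step("action", f"query:{seen}"))


def environment_turn(tape: Tape) -> None:
    # The environment reacts to the latest action by appending an observation.
    last = tape.steps[-1]
    if last.kind != "action":
        raise ValueError("environment expects the last step to be an action")
    tape.append(Step("observation", f"result of {last.content}"))


def run(tape: Tape, turns: int) -> Tape:
    for _ in range(turns):
        agent_turn(tape)
        environment_turn(tape)
    return tape
```

Because the toy agent reads only the tape, a session serialized with `to_json` after two turns and restored with `from_json` continues exactly as an uninterrupted three-turn session would, which is the sense in which the tape acts as resumable state here.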

NNetscape Navigator: Complex Demonstrations for Web Agents Without a Demonstrator

Shikhar Murty

Dzmitry Bahdanau

Christopher D. Manning

We introduce NNetscape Navigator (NNetnav), a method for training web agents entirely through synthetic demonstrations. These demonstrations are collected by first interacting with a browser to generate trajectory rollouts, which are then retroactively labeled into instructions using a language model. Most work on training browser agents has relied on expensive human supervision, and the limited previous work on such interaction-first synthetic data techniques has failed to provide effective search through the exponential space of exploration. In contrast, NNetnav exploits the hierarchical structure of language instructions to make this search more tractable: complex instructions are typically decomposable into simpler subtasks, allowing NNetnav to automatically prune interaction episodes when an intermediate trajectory cannot be annotated with a meaningful sub-task. We use NNetnav demonstrations from a language model for supervised fine-tuning of a smaller language model policy, and find improvements of 6 points on WebArena and over 20 points on MiniWoB++, two popular environments for web-agents. Notably, on WebArena, we observe that language model policies can be further enhanced when fine-tuned with NNetnav demonstrations derived from the same language model. Finally, we collect and release a dataset of over 6k NNetnav demonstrations on WebArena, spanning a diverse and complex set of instructions.

2024-10-03

ArXiv (preprint)

doi.org

arxiv.org
