Alessandro Sordoni

Gistify: Codebase-Level Understanding via Runtime Execution

Hyunji Lee

Minseon Kim

Chinmay Singh

Matheus Pereira

Atharv Sonwane

Isadora White

Elias Stengel-Eskin

Mohit Bansal

Zhengxiang Shi

Eric Yuan

Lucas Caccia

As coding agents are increasingly deployed in large codebases, the need to automatically design challenging, codebase-level evaluation is ce… (see more)ntral. We propose Gistify, a task where a coding LLM must create a single, minimal, self-contained file that can reproduce a specific functionality of a codebase. The coding LLM is given full access to a codebase along with a specific entrypoint (e.g., a python command), and the generated file must replicate the output of the same command ran under the full codebase, while containing only the essential components necessary to execute the provided command. Success on Gistify requires both structural understanding of the codebase, accurate modeling of its execution flow as well as the ability to produce potentially large code patches. Our findings show that current state-of-the-art models struggle to reliably solve Gistify tasks, especially ones with long executions traces.

2025-12-31

International Conference on Learning Representations (Accept (Poster))

MedRiskEval: Medical Risk Evaluation Benchmark of Language Models, On the Importance of User Perspectives in Healthcare Settings

Jean-Philippe Corbeil

Minseon Kim

Maxime Griot

Sheela Agarwal

Francois Beaulieu

Paul Vozila

Jean-Philippe Corbeil, Minseon Kim, Maxime Griot, Sheela Agarwal, Alessandro Sordoni, Francois Beaulieu, Paul Vozila. Proceedings of the 19t… (see more)h Conference of the European Chapter of the Association for Computational Linguistics (Volume 5: Industry Track). 2026.

2025-12-31

Conference of the European Chapter of the Association for Computational Linguistics (published)

The Markovian Thinker

Milad Aghajohari

Kamran Chitsaz

Amirhossein Kazemnejad

Reasoning LLMs suffer from quadratic compute growth as their context length increases, making reinforcement learning with verifiable rewards… (see more) (RLVR) and test-time scaling prohibitively expensive. Prior work has tried to lighten the computational burden by shortening reasoning traces through pruning, summarization, or multi-stage training, but these methods remain bound to quadratic costs. We introduce Delethink, a thinking algorithm that realizes the Markovian Thinking Paradigm. Instead of producing one long monolithic reasoning trace, Delethink thinks in a sequence of chunks, the Delethink trace. Each chunk continues reasoning by referring only to a fixed number of prior tokens, which functions as a Markovian state sufficient for progressing reasoning, while deleting the rest. This preserves continuity without carrying the quadratic baggage. As a result, compute scales linearly and peak memory remains constant. In experiments, we show that Delethink can be applied directly to off-the-shelf reasoning models ranging from

2025-12-31

International Conference on Learning Representations (Accept (Poster))

Learning to Extract Context for Context-Aware LLM Inference

Minseon Kim

Lucas Caccia

Zhengyan Shi

Matheus Pereira

Xingdi Yuan

2025-12-11

ArXiv (preprint)

Effect of Document Packing on the Latent Multi-Hop Reasoning Capabilities of Large Language Models

Gabriele Prato

Shagun Sodhani

A. Chandar

The standard practice for training large language models involves packing multiple documents together to optimize computational efficiency. … (see more)However, the impact of this process on the models' capabilities remains largely unexplored. To address this gap, we investigate how different document-packing strategies influence the latent multi-hop reasoning abilities of LLMs. Our findings indicate that packing can improve model performance compared to training on individual documents, at the expense of more compute. To further understand the underlying mechanisms, we conduct an ablation study, identifying key factors that explain the advantages of packing. Ultimately, our research deepens the understanding of LLM training dynamics and provides practical insights for optimizing model development.

2025-11-30

arXiv (published)

Gistify! Codebase-Level Understanding via Runtime Execution

Hyunji Lee

Minseon Kim

Chinmay Singh

Matheus Pereira

Atharv Sonwane

Isadora White

Elias Stengel-Eskin

Mohit Bansal

Zhengyan Shi

Xingdi Yuan

Lucas Caccia

As coding agents are increasingly deployed in large codebases, the need to automatically design challenging, codebase-level evaluation is ce… (see more)ntral. We propose Gistify, a task where a coding LLM must create a single, minimal, self-contained file that can reproduce a specific functionality of a codebase. The coding LLM is given full access to a codebase along with a specific entrypoint (e.g., a python command), and the generated file must replicate the output of the same command ran under the full codebase, while containing only the essential components necessary to execute the provided command. Success on Gistify requires both structural understanding of the codebase, accurate modeling of its execution flow as well as the ability to produce potentially large code patches. Our findings show that current state-of-the-art models struggle to reliably solve Gistify tasks, especially ones with long executions traces.

2025-10-29

ArXiv (preprint)

BugPilot: Complex Bug Generation for Efficient Learning of SWE Skills

Atharv Sonwane

Isadora White

Hyunji Lee

Matheus Pereira

Lucas Caccia

Minseon Kim

Zhengyan Shi

Chinmay Singh

Xingdi Yuan

2025-10-21

ArXiv (preprint)

Learning to Solve Complex Problems via Dataset Decomposition

WANRU ZHAO

Lucas Caccia

Zhengyan Shi

Minseon Kim

Weijia Xu

Xingdi Yuan

2025-09-17

NeurIPS.cc/2025/Conference (poster)

Instilling Parallel Reasoning into Language Models

Matthew Macfarlane

Minseon Kim

Nebojsa Jojic

Weijia Xu

Lucas Caccia

Xingdi Yuan

WANRU ZHAO

Zhengyan Shi

Sequential chain-of-thought reasoning significantly improves the performance of Large language models (LLMs) on complex tasks. However, sequ… (see more)ential reasoning has structural limitations: Long chains are expensive due to attention's quadratic complexity, and multiple diverse strategies cannot be considered simultaneously. To address this we propose a method that instills parallel reasoning capabilities in LLMs by distilling parallel reasoning traces from a teacher model. This approach enables models to decompose problems, explore diverse strategies via concurrent reasoning traces, and aggregate trace outputs for the final answer. Evaluating on a variety of math and puzzle benchmarks such as MATH 500, AIME and Countdown, we show our approach can decompose parallelizable problems, and that the performance scales with the number of parallel traces. The resulting model can dynamically allocate reasoning strategies based on problem complexity, outperforming standard sampling methods.

2025-07-08

ICML.cc/2025/Workshop/AI4MATH (poster)

MedRiskEval: Medical Risk Evaluation Benchmark of Language Models, On the Importance of User Perspectives in Healthcare Settings

Jean-Philippe Corbeil

Minseon Kim

Francois Beaulieu

Paul Vozila

As the performance of large language models (LLMs) continues to advance, their adoption in the medical domain is increasing. However, most e… (see more)xisting risk evaluations largely focused on general safety benchmarks. In the medical applications, LLMs may be used by a wide range of users, ranging from general users and patients to clinicians, with diverse levels of expertise and the model's outputs can have a direct impact on human health which raises serious safety concerns. In this paper, we introduce MedRiskEval, a medical risk evaluation benchmark tailored to the medical domain. To fill the gap in previous benchmarks that only focused on the clinician perspective, we introduce a new patient-oriented dataset called PatientSafetyBench containing 466 samples across 5 critical risk categories. Leveraging our new benchmark alongside existing datasets, we evaluate a variety of open- and closed-source LLMs. To the best of our knowledge, this work establishes an initial foundation for safer deployment of LLMs in healthcare.

2025-07-08

ArXiv (preprint)

Exploring Sparse Adapters for Scalable Merging of Parameter Efficient Experts

Minseon Kim

Riyasat Ohib

Lucas Caccia

Merging parameter-efficient task experts has recently gained growing attention as a way to build modular architectures that can be rapidly a… (see more)dapted on the fly for specific downstream tasks, without requiring additional fine-tuning. Typically, LoRA (Low-Rank Adaptation) serves as the foundational building block of such parameter-efficient modular architectures, leveraging low-rank weight structures to reduce the number of trainable parameters. In this paper, we study the properties of sparse adapters, which train only a subset of weights in the base neural network, as potential building blocks of modular architectures. First, we propose a simple method for training highly effective sparse adapters, which is conceptually simpler than existing methods in the literature and surprisingly outperforms both LoRA and full fine-tuning in our setting. Next, we investigate the merging properties of these sparse adapters by merging adapters for up to 20 natural language processing tasks, thus scaling beyond what is usually studied in the literature. Our findings demonstrate that sparse adapters yield superior in-distribution performance post-merging compared to LoRA or full model merging. Achieving strong held-out performance remains a challenge for all methods considered.

2025-07-06

colmweb.org/COLM/2025/Conference (accepted)