Publications

Adaptive Computation Pruning for the Forgetting Transformer

The recently proposed Forgetting Transformer (FoX) incorporates a forget gate into softmax attention and has shown consistently better or on-par performance compared to the standard RoPE-based Transformer. Notably, many attention heads in FoX tend to forget quickly, causing their output at each timestep to rely primarily on local context. Based on this observation, we propose Adaptive Computation Pruning (ACP) for FoX, a method that dynamically prunes computations involving input-output dependencies that are strongly decayed by the forget gate. In particular, our method performs *provably safe* pruning via a dynamically set pruning threshold that guarantees the pruned attention weights are negligible. We apply ACP to language model pretraining with FoX and show it consistently reduces the number of FLOPs and memory accesses in softmax attention by around 70% across different model sizes and context lengths, resulting in a roughly 50% to 70% reduction in attention runtime (or a 2--3

2025-07-06

colmweb.org/COLM/2025/Conference (accepted)

openreview.net
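The pruning idea in the ACP abstract above can be illustrated with a minimal sketch. It assumes scalar forget-gate values in (0, 1] per timestep and a fixed log-space threshold; the abstract describes a dynamically set threshold, so the fixed value here is a simplification, and the function name `kept_key_ranges` is hypothetical rather than taken from the paper.

```python
import math

def kept_key_ranges(forget_gates, log_threshold):
    """For each query position i, return the smallest key index j whose
    cumulative log-decay sum(log f_l for l in j+1..i) is >= log_threshold.
    Keys before j are decayed below the threshold and could be skipped.
    Assumption: forget_gates are floats in (0, 1]; the threshold is fixed."""
    # prefix[i] holds log f_0 + ... + log f_i
    prefix = []
    running = 0.0
    for f in forget_gates:
        running += math.log(f)
        prefix.append(running)

    starts = []
    for i in range(len(forget_gates)):
        j = i  # a query always keeps its own position (decay 0.0)
        # decay between key j-1 and query i is prefix[i] - prefix[j-1]
        while j > 0 and prefix[i] - prefix[j - 1] >= log_threshold:
            j -= 1
        starts.append(j)
    return starts

# Hypothetical example: every gate is 0.5, prune once the decay drops below 0.2
print(kept_key_ranges([0.5] * 6, math.log(0.2)))
```

With fast-forgetting gates, each query keeps only a short local window of keys, which is the kind of head the abstract says benefits most from pruning.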

AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories

Xing Han Lu

Amirhossein Kazemnejad

Karolina Stanczak

Peter Shaw

Christopher J. Pal

Siva Reddy

Web agents enable users to perform tasks on web browsers through natural language interaction. Evaluating web agent trajectories is an important problem, since it helps us determine whether the agent successfully completed the tasks. Rule-based methods are widely used for this purpose, but they are challenging to extend to new tasks and may not always recognize successful trajectories. We may achieve higher accuracy through human evaluation, but the process would be substantially slower and more expensive. Automatic evaluations with LLMs may avoid the challenges of designing new rules and manually annotating trajectories, enabling faster and cost-effective evaluation. However, it is unclear how effective they are at evaluating web agents. To this end, we propose AgentRewardBench, the first benchmark to assess the effectiveness of LLM judges for evaluating web agents. AgentRewardBench contains 1302 trajectories across 5 benchmarks and 4 LLMs. Each trajectory in AgentRewardBench is reviewed by an expert, who answers questions pertaining to the success, side effects, and repetitiveness of the agent. Using our benchmark, we evaluate 12 LLM judges and find that no single LLM excels across all benchmarks. We also find that the rule-based evaluation used by common benchmarks tends to underreport the success rate of web agents, highlighting a key weakness of rule-based evaluation and the need to develop more flexible automatic evaluations. We release the benchmark at: https://agent-reward-bench.github.io

2025-07-06

Conference on Language Modeling (accepted)

doi.org

openreview.net

BIGCHARTS-R1: Enhanced Chart Reasoning with Visual Reinforcement Finetuning

Ahmed Masry

Abhay Puri

Masoud Hashemi

Juan A. Rodriguez

Megh Thakkar

Khyati Mahajan

Vikas Yadav

Sathwik Tejaswi Madhusudhan

Christopher Pal

Enamul Hoque

Sai Rajeswar

Charts are essential to data analysis, transforming raw data into clear visual representations that support human decision-making. Although current vision-language models (VLMs) have made significant progress, they continue to struggle with chart comprehension due to training on datasets that lack diversity and real-world authenticity, or on automatically extracted underlying data tables of charts, which can contain numerous estimation errors. Furthermore, existing models only rely on supervised fine-tuning using these low-quality datasets, severely limiting their effectiveness. To address these issues, we first propose BigCharts, a dataset creation pipeline that generates visually diverse chart images by conditioning the rendering process on real-world charts sourced from multiple online platforms. Unlike purely synthetic datasets, BigCharts incorporates real-world data, ensuring authenticity and visual diversity, while still retaining accurate underlying data due to our proposed replotting process. Additionally, we introduce a comprehensive training framework that integrates supervised fine-tuning with Group Relative Policy Optimization (GRPO)-based reinforcement learning. By introducing novel reward signals specifically designed for chart reasoning, our approach enhances model robustness and generalization across diverse chart styles and domains, resulting in a state-of-the-art chart reasoning model, BigCharts-R1. Extensive experiments demonstrate that our models surpass existing methods on multiple chart question-answering benchmarks compared to even larger open-source and closed-source models.

2025-07-06

colmweb.org/COLM/2025/Conference (accepted)

doi.org

openreview.net

BiXSE: Improving Dense Retrieval via Probabilistic Graded Relevance Distillation

Joao Monteiro

Neural sentence embedding models for dense retrieval typically rely on binary relevance labels, treating query-document pairs as either relevant or irrelevant. However, real-world relevance often exists on a continuum, and recent advances in large language models (LLMs) have made it feasible to scale the generation of fine-grained graded relevance labels. In this work, we propose BiXSE, a simple and effective pointwise training method that optimizes binary cross-entropy (BCE) over LLM-generated graded relevance scores. BiXSE interprets these scores as probabilistic targets, enabling granular supervision from a single labeled query-document pair per query. Unlike pairwise or listwise losses that require multiple annotated comparisons per query, BiXSE achieves strong performance with reduced annotation and compute costs by leveraging in-batch negatives. Extensive experiments across sentence embedding (MMTEB) and retrieval benchmarks (BEIR, TREC-DL) show that BiXSE consistently outperforms softmax-based contrastive learning (InfoNCE), and matches or exceeds strong pairwise ranking baselines when trained on LLM-supervised data. BiXSE offers a robust, scalable alternative for training dense retrieval models as graded relevance supervision becomes increasingly accessible.

2025-07-06

colmweb.org/COLM/2025/Conference (accepted)

doi.org

openreview.net
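The pointwise objective described in the BiXSE abstract above might look roughly like the following sketch. It assumes a square matrix of query-document similarity logits for one batch, one graded score in [0, 1] per query for its labeled document, and a target of 0 for every in-batch negative; that target choice and the function names are illustrative assumptions, not details confirmed by the abstract.

```python
import math

def bce_with_logits(logit, target):
    """Numerically stable binary cross-entropy for a single logit."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

def graded_bce_loss(sim, graded_scores):
    """Mean BCE over all query-document pairs in a batch.
    sim[i][j]: similarity logit between query i and document j.
    graded_scores[i]: graded relevance in [0, 1] of document i to query i.
    Assumption: off-diagonal pairs (in-batch negatives) get target 0."""
    n = len(sim)
    total = 0.0
    for i in range(n):
        for j in range(n):
            target = graded_scores[i] if i == j else 0.0
            total += bce_with_logits(sim[i][j], target)
    return total / (n * n)

# Hypothetical batch of two queries with graded labels 0.7 and 0.2
uninformed = [[0.0, 0.0], [0.0, 0.0]]
aligned = [[0.85, -5.0], [-5.0, -1.39]]
print(graded_bce_loss(uninformed, [0.7, 0.2]))
print(graded_bce_loss(aligned, [0.7, 0.2]))
```

Because the targets are probabilities rather than hard labels, the loss is minimized when the sigmoid of each diagonal logit matches the graded score, rather than when that logit grows without bound.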

Boosting LLM Reasoning via Spontaneous Self-Correction

Xutong Zhao

Tengyu Xu

Xuewei Wang

Zhengxing Chen

Di Jin

Liang Tan

Yen-Ting Lin

Zishun Yu

Zhuokai Zhao

Yun He

Si-Yuan Wang

Han Fang

Sinong Wang

A. Chandar

Chen Zhu

MetaAI

Mila - Québec AI Institute

Polytechnique Montréal

While large language models (LLMs) have demonstrated remarkable success on a broad range of tasks, math reasoning remains a challenging one. One of the approaches for improving math reasoning is self-correction, which designs self-improving loops to let the model correct its own mistakes. However, existing self-correction approaches treat corrections as standalone post-generation refinements, relying on extra prompt and system designs to elicit self-corrections, instead of performing real-time, spontaneous self-corrections in a single pass. To address this, we propose SPOC, a spontaneous self-correction approach that enables LLMs to generate interleaved solutions and verifications in a single inference pass, with generation dynamically terminated based on verification outcomes, thereby effectively scaling inference time compute. SPOC considers a multi-agent perspective by assigning dual roles -- solution proposer and verifier -- to the same model. We adopt a simple yet effective approach to generate synthetic data for fine-tuning, enabling the model to develop capabilities for self-verification and multi-agent collaboration. We further improve its solution proposal and verification accuracy through online reinforcement learning. Experiments on mathematical reasoning benchmarks show that SPOC significantly improves performance. Notably, SPOC boosts the accuracy of Llama-3.1-8B and 70B Instruct models, achieving gains of 8.8% and 11.6% on MATH500, 10.0% and 20.0% on AMC23, and 3.3% and 6.7% on AIME24, respectively.

2025-07-06

colmweb.org/COLM/2025/Conference (accepted)

doi.org

openreview.net

Clinical test cases for model-based dose calculation algorithm commissioning, QA and benchmarking, for 192Ir HDR brachytherapy of gynecologic cancers

Vasiliki Peppa

Maude Robitaille

F. Akbari

S. Enger

Rowan M. Thomson

F. Mourtada

G. P. Fonseca

KA Gifford

JL Horton

TA Wareing

Purpose: To develop clinically relevant test cases for commissioning Model-Based Dose Calculation Algorithms (MBDCAs) for 192Ir High Dose Rate (HDR) gynecologic brachytherapy following the workflow proposed by the TG-186 report and the WGDCAB report 372. Acquisition and Validation Methods: Two cervical cancer intracavitary HDR brachytherapy patient models were created, using either uniformly structured regions or realistic segmentation. The computed tomography (CT) images of the models were converted to DICOM CT images via MATLAB and imported into two Treatment Planning Systems (TPSs) with MBDCA capability. The clinical segmentation was expanded to include additional organs at risk. The actual clinical treatment plan was generally maintained, with the source replaced by a generic 192Ir HDR source. Dose to medium in medium calculations were performed using the MBDCA option of each TPS, and three different Monte Carlo (MC) simulation codes. MC results agreed within statistical uncertainty, while comparisons between MBDCA and MC dose distributions highlighted both strengths and limitations of the studied MBDCAs, suggesting potential approaches to overcome the challenges. Data Format and Usage Notes: The datasets for the developed cases are available online at http://doi.org/10.5281/zenodo.15720996. The DICOM files include the treatment plan for each case, TPS, and the corresponding reference MC dose data. The package also contains a TPS- and case-specific user guide for commissioning the MBDCAs, and files needed to replicate the MC simulations. Potential Applications: The provided datasets and proposed methodology offer a commissioning framework for TPSs using MBDCAs, and serve as a benchmark for brachytherapy researchers using MC methods. They also facilitate intercomparisons of MBDCA performance and provide a quality assurance resource for evaluating future TPS software updates.

2025-07-06

ArXiv (preprint)

arxiv.org

DoomArena: A framework for Testing AI Agents Against Evolving Security Threats

Léo Boisvert

Mihir Bansal

Chandra Kiran Reddy Evuru

Gabriel Huang

Abhay Puri

Avinandan Bose

Maryam Fazel

Quentin Cappart

Jason Stanley

Alexandre Lacoste

Alexandre Drouin

Krishnamurthy Dj Dvijotham

We present DoomArena, a security evaluation framework for AI agents. DoomArena is designed on three principles: 1) It is a plug-in framework and integrates easily into realistic agentic frameworks like BrowserGym (for web agents) and

2025-07-06

colmweb.org/COLM/2025/Conference (accepted)

doi.org

openreview.net

A Dynamic Security Pattern Selection Framework Using Deep Reinforcement Learning

Saeid Jamshidi

Amin Nikanjam

Kawser Wazed Nafi

Foutse Khomh

The rapid expansion of the Internet of Things (IoT) has brought transformative benefits across various domains and introduced significant security challenges, especially in resource-constrained edge gateways. This paper proposes an innovative Intrusion Detection System (IDS) powered by Deep Reinforcement Learning (DRL) to dynamically detect and mitigate network threats by selecting IoT security patterns. Leveraging adaptive IoT security patterns, the system addresses diverse attack scenarios (e.g., Distributed Denial of Service (DDoS), DoS GoldenEye, DoS Hulk, and Port Scanning) with significant efficiency. The system achieves an average detection accuracy of 97% and demonstrates reduced response times and efficient resource utilization, making it well-suited for edge gateways. The experimental evaluations validate the proposed model's ability to enhance security while optimizing CPU and memory usage, reducing energy consumption, and lowering carbon emissions. Furthermore, its adaptability to evolving cyber threats and alignment with green computing principles highlight its potential to support secure and sustainable IoT networks.

2025-07-06

2025 IEEE International Conference on Software Services Engineering (SSE) (published)

doi.org

Exploring Sparse Adapters for Scalable Merging of Parameter Efficient Experts

Minseon Kim

Riyasat Ohib

Lucas Caccia

Merging parameter-efficient task experts has recently gained growing attention as a way to build modular architectures that can be rapidly adapted on the fly for specific downstream tasks, without requiring additional fine-tuning. Typically, LoRA (Low-Rank Adaptation) serves as the foundational building block of such parameter-efficient modular architectures, leveraging low-rank weight structures to reduce the number of trainable parameters. In this paper, we study the properties of sparse adapters, which train only a subset of weights in the base neural network, as potential building blocks of modular architectures. First, we propose a simple method for training highly effective sparse adapters, which is conceptually simpler than existing methods in the literature and surprisingly outperforms both LoRA and full fine-tuning in our setting. Next, we investigate the merging properties of these sparse adapters by merging adapters for up to 20 natural language processing tasks, thus scaling beyond what is usually studied in the literature. Our findings demonstrate that sparse adapters yield superior in-distribution performance post-merging compared to LoRA or full model merging. Achieving strong held-out performance remains a challenge for all methods considered.

2025-07-06

colmweb.org/COLM/2025/Conference (accepted)

doi.org

openreview.net

Fast Controlled Generation from Language Models with Adaptive Weighted Rejection Sampling

Ben Lipkin

Benjamin LeBrun

Jacob Hoover Vigly

João Loula

David R. MacIver

Lei Du

Jason Eisner

Ryan Cotterell

Vikash Mansinghka

Timothy J. O'Donnell

Alexander K. Lew

Tim Vieira

The dominant approach to generating from language models subject to some constraint is locally constrained decoding (LCD), incrementally sampling tokens at each time step such that the constraint is never violated. Typically, this is achieved through token masking: looping over the vocabulary and excluding non-conforming tokens. There are two important problems with this approach. (i) Evaluating the constraint on every token can be prohibitively expensive, as LM vocabularies often exceed 100,000 tokens. (ii) LCD can distort the global distribution over strings, sampling tokens based only on local information, even if they lead down dead-end paths. This work introduces a new algorithm that addresses both these problems. First, to avoid evaluating a constraint on the full vocabulary at each step of generation, we propose an adaptive rejection sampling algorithm that typically requires orders of magnitude fewer constraint evaluations. Second, we show how this algorithm can be extended to produce low-variance, unbiased estimates of importance weights at a very small additional cost; these estimates can be soundly used within previously proposed sequential Monte Carlo algorithms to correct for the myopic behavior of local constraint enforcement. Through extensive empirical evaluation in text-to-SQL, molecular synthesis, goal inference, pattern matching, and JSON domains, we show that our approach is superior to state-of-the-art baselines, supporting a broader class of constraints and improving both runtime and performance. Additional theoretical and empirical analyses show that our method's runtime efficiency is driven by its dynamic use of computation, scaling with the divergence between the unconstrained and constrained LM, and as a consequence, runtime improvements are greater for better models.

2025-07-06

colmweb.org/COLM/2025/Conference (accepted)

openreview.net
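To make the contrast with token masking in the abstract above concrete, here is a minimal sketch of rejection sampling over a next-token distribution. It is an illustrative baseline under stated assumptions, not the paper's algorithm: `probs` is a hypothetical dict of token probabilities, `is_valid` is a hypothetical constraint check, and rejected tokens are simply removed before resampling. The importance-weight estimation the abstract mentions is not shown.

```python
import random

def sample_with_rejection(probs, is_valid, rng):
    """Draw one token from `probs` restricted to tokens passing `is_valid`.
    Returns (token, number_of_constraint_evaluations).
    The constraint is checked only on tokens that are actually drawn,
    instead of on the whole vocabulary as in token masking."""
    remaining = dict(probs)  # copy so the caller's distribution is untouched
    evaluations = 0
    while remaining:
        tokens = list(remaining)
        weights = [remaining[t] for t in tokens]
        token = rng.choices(tokens, weights=weights, k=1)[0]
        evaluations += 1
        if is_valid(token):
            return token, evaluations
        # Rejected: drop the token so it cannot be drawn again
        del remaining[token]
    raise ValueError("no token satisfies the constraint")

# Hypothetical example: the constraint forbids the token "DROP"
rng = random.Random(0)
probs = {"SELECT": 0.6, "DROP": 0.3, "WITH": 0.1}
token, checks = sample_with_rejection(probs, lambda t: t != "DROP", rng)
print(token, checks)
```

When the model already places most of its probability on valid tokens, the first draw usually passes, so the number of constraint checks stays far below the vocabulary size. This matches the abstract's observation that runtime improves for better models.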

Fleurs-SLU: A Massively Multilingual Benchmark for Spoken Language Understanding

Fabian David Schmidt

Ivan Vulić

Goran Glavaš

David Ifeoluwa Adelani

Spoken language understanding (SLU) is indispensable for half of all living languages that lack a formal writing system, since these languages cannot pair automatic speech recognition (ASR) with language models to benefit from language technology. Even if low-resource languages possess a writing system, ASR for these languages remains unreliable due to limited bimodal speech and text training data. Better SLU can strengthen the robustness of massively multilingual ASR by leveraging language semantics to disambiguate utterances via context or exploiting semantic similarities across languages. However, the evaluation of multilingual SLU remains limited to shallow tasks such as intent classification or language identification. To address this, we present Fleurs-SLU, a multilingual SLU benchmark that encompasses (i) 692 hours of speech for topical utterance classification in 102 languages and (ii) multiple-choice question answering through listening comprehension spanning 944 hours of speech across 92 languages. We extensively evaluate both end-to-end speech classification models and cascaded systems that combine speech-to-text transcription with subsequent classification by large language models on Fleurs-SLU. Our results show that cascaded systems exhibit greater robustness in multilingual SLU tasks, though speech encoders can achieve competitive performance in topical speech classification when appropriately pre-trained. We further find a strong correlation between robust multilingual ASR, effective speech-to-text translation, and strong multilingual SLU, highlighting the mutual benefits between acoustic and semantic speech representations.

2025-07-06

colmweb.org/COLM/2025/Conference (accepted)

doi.org

openreview.net

Language Agents Mirror Human Causal Reasoning Biases. How Can We Help Them Think Like Scientists?

Anthony GX-Chen

Dongyan Lin

Mandana Samiei

Doina Precup

Blake Aaron Richards

Rob Fergus

Kenneth Marino

Language model (LM) agents are increasingly used as autonomous decision-makers who need to actively gather information to guide their decisions. A crucial cognitive skill for such agents is the efficient exploration and understanding of the causal structure of the world -- key to robust, scientifically grounded reasoning. Yet, it remains unclear whether LMs possess this capability or exhibit systematic biases leading to erroneous conclusions. In this work, we examine LMs' ability to explore and infer causal relationships, using the well-established "Blicket Test" paradigm from developmental psychology. We find that LMs reliably infer the common, intuitive disjunctive causal relationships but systematically struggle with the unusual, yet equally (or sometimes even more) evidenced conjunctive ones. This "disjunctive bias" persists across model families, sizes, and prompting strategies, and performance further declines as task complexity increases. Interestingly, an analogous bias appears in human adults, suggesting that LMs may have inherited deep-seated reasoning heuristics from their training data. To this end, we quantify similarities between LMs and humans, finding that LMs exhibit adult-like inference profiles (but not children-like). Finally, we propose a test-time sampling method which explicitly samples and eliminates hypotheses about causal relationships from the LM. This scalable approach significantly reduces the disjunctive bias and moves LMs closer to the goal of scientific, causally rigorous reasoning.

2025-07-06

colmweb.org/COLM/2025/Conference (accepted)

doi.org

openreview.net
