Publications

Two-point deterministic equivalence for SGD in random feature models
Alexander Atanasov
Blake Bordelon
Jacob A Zavatone-Veth
Cengiz Pehlevan
Ultrasound and MRI-based evaluation of relationships between morphological and mechanical properties of the lower lumbar multifidus muscle in chronic low back pain.
Neda Naghdi
Sara Masi
Cléo Bertrand
Brent Rosenstein
Hassan Rivaz
Mathieu Roy
Maryse Fortin
How to Train Your LLM Web Agent: A Statistical Diagnosis
Large language model (LLM) agents for web interfaces have advanced rapidly, yet open-source systems still lag behind proprietary agents. Bri… (voir plus)dging this gap is key to enabling customizable, efficient, and privacy-preserving agents. Two challenges hinder progress: the reproducibility issues in RL and LLM agent training, where results often depend on sensitive factors like seeds and decoding parameters, and the focus of prior work on single-step tasks, overlooking the complexities of web-based, multi-step decision-making. We address these gaps by providing a statistically driven study of training LLM agents for web tasks. Our two-stage pipeline combines imitation learning from a Llama 3.3 70B teacher with on-policy fine-tuning via Group Relative Policy Optimization (GRPO) on a Llama 3.1 8B student. Through 240 configuration sweeps and rigorous bootstrapping, we chart the first compute allocation curve for open-source LLM web agents. Our findings show that dedicating one-third of compute to teacher traces and the rest to RL improves MiniWoB++ success by 6 points and closes 60% of the gap to GPT-4o on WorkArena, while cutting GPU costs by 45%. We introduce a principled hyperparameter sensitivity analysis, offering actionable guidelines for robust and cost-effective agent training.
How to Train Your LLM Web Agent: A Statistical Diagnosis
LLM-based web agents have recently made significant progress, but much of it has occurred in closed-source systems, widening the gap with op… (voir plus)en-source alternatives. Progress has been held back by two key challenges: first, a narrow focus on single-step tasks that overlooks the complexity of multi-step web interactions; and second, the high compute costs required to post-train LLM-based web agents. To address this, we present the first statistically grounded study on compute allocation for LLM web-agent post-training. Our approach uses a two-stage pipeline, training a Llama 3.1 8B student to imitate a Llama 3.3 70B teacher via supervised fine-tuning (SFT), followed by on-policy reinforcement learning. We find this process highly sensitive to hyperparameter choices, making exhaustive sweeps impractical. To spare others from expensive trial-and-error, we sample 1,370 configurations and use bootstrapping to estimate effective hyperparameters. Our results show that combining SFT with on-policy RL consistently outperforms either approach alone on both WorkArena and MiniWob++. Further, this strategy requires only 55% of the compute to match the peak performance of pure SFT on MiniWob++, effectively pushing the compute-performance Pareto frontier, and is the only strategy that can close the gap with closed-source models.
Multi-Priority Scheduling for Traffic Management in Future Scalable Payloads.
Zineb Garroussi
Olfa Ben Yahia
Brunilde Sansò
Jean-François Frigon
Stéphane Martel
Guillaume Mantelet
Gunes Karabulut Kurt
Multi-Priority Scheduling for Traffic Management in Future Scalable Payloads
Zineb Garroussi
Olfa Ben Yahia
Brunilde Sansò
Jean-François Frigon
Stéphane Martel
Guillaume Mantelet
Gunes Karabulut Kurt
Through multibeam, frequency reuse, and advanced antenna technology, regenerative non-geostationary orbit (NGSO) extremely high-throughput s… (voir plus)atellites (EHTS) are expected to play a key role in future communications, delivering data rates up to terabits per second. This paper investigates a novel architecture for future regenerative and scalable payloads to satisfy users’ demands for varying quality of service (QoS). This architecture is designed based on multiple modem banks and requires a new flow assignment strategy to efficiently route traffic within the satellite. We propose a multi-commodity path flow optimization problem to manage the load with varying QoS requirements across multiple modems within an NGSO high-throughput satellite (HTS) system and beyond. The simulation results demonstrate that the proposed model consistently maintains low delays and packet losses for the highest-priority traffic and outperforms the classical first-in, first-out (FIFO) approach.
Multi-Priority Scheduling for Traffic Management in Future Scalable Payloads
Zineb Garroussi
Olfa Ben Yahia
Brunilde Sansò
Jean-François Frigon
Stéphane Martel
Guillaume Mantelet
Gunes Karabulut Kurt
Through multibeam, frequency reuse, and advanced antenna technology, regenerative non-geostationary orbit (NGSO) extremely high-throughput s… (voir plus)atellites (EHTS) are expected to play a key role in future communications, delivering data rates up to terabits per second. This paper investigates a novel architecture for future regenerative and scalable payloads to satisfy users’ demands for varying quality of service (QoS). This architecture is designed based on multiple modem banks and requires a new flow assignment strategy to efficiently route traffic within the satellite. We propose a multi-commodity path flow optimization problem to manage the load with varying QoS requirements across multiple modems within an NGSO high-throughput satellite (HTS) system and beyond. The simulation results demonstrate that the proposed model consistently maintains low delays and packet losses for the highest-priority traffic and outperforms the classical first-in, first-out (FIFO) approach.
Silent Sabotage: Injecting Backdoors into AI Agents Through Fine-Tuning
Chandra Kiran Reddy Evuru
Joshua Kazdan
Avinandan Bose
Maryam Fazel
Sai Rajeswar
Jason Stanley
Krishnamurthy Dj Dvijotham
The rise of AI agents that can use tools, browse the web and interact with computers on behalf of a user, has sparked strong interest in imp… (voir plus)roving these capabilities by explicitly fine-tuning the LLMs/VLMs that power these agents. Several researchers have proposed collecting data by letting the agents interact with their environment (e.g., a computer operating system, the web or a collection of APIs exposed as tools), and improve agent performance by fine tuning on this data. In this work, we show that such data collection can be manipulated by adversaries to insert poisoned traces. By modifying just 5% of collected traces, adversaries can embed stealthy bad behaviors into agents—like leaking confidential user information whenever the tool or webpage exposes a trigger. Our results raise important security concerns in the development of AI agents, and underscore the importance of careful scrutiny of all data collection processes used to improve agentic AI.
State Entropy Regularization for Robust Reinforcement Learning
Uri Koren
Yonatan Ashlag
Mirco Mutti
Shie Mannor
State entropy regularization has empirically shown better exploration and sample complexity in reinforcement learning (RL). However, its the… (voir plus)oretical guarantees have not been studied. In this paper, we show that state entropy regularization improves robustness to structured and spatially correlated perturbations. These types of variation are common in transfer learning but often overlooked by standard robust RL methods, which typically focus on small, uncorrelated changes. We provide a comprehensive characterization of these robustness properties, including formal guarantees under reward and transition uncertainty, as well as settings where the method performs poorly. Much of our analysis contrasts state entropy with the widely used policy entropy regularization, highlighting their different benefits. Finally, from a practical standpoint, we illustrate that compared with policy entropy, the robustness advantages of state entropy are more sensitive to the number of rollouts used for policy evaluation.
Boosting LLM Reasoning via Spontaneous Self-Correction
Tengyu Xu
Xuewei Wang
Zhengxing Chen
Di Jin
Liang Tan
Zishun Yu
Zhuokai Zhao
Yun He
Si-Yuan Wang
Han Fang
Chen Zhu
MetaAI
Mila - Québec
AI Institute
Polytechnique Montréal
While large language models (LLMs) have demonstrated remarkable success on a broad range of tasks, math reasoning remains a challenging one.… (voir plus) One of the approaches for improving math reasoning is self-correction, which designs self-improving loops to let the model correct its own mistakes. However, existing self-correction approaches treat corrections as standalone post-generation refinements, relying on extra prompt and system designs to elicit self-corrections, instead of performing real-time, spontaneous self-corrections in a single pass. To address this, we propose SPOC, a spontaneous self-correction approach that enables LLMs to generate interleaved solutions and verifications in a single inference pass, with generation dynamically terminated based on verification outcomes, thereby effectively scaling inference time compute. SPOC considers a multi-agent perspective by assigning dual roles -- solution proposer and verifier -- to the same model. We adopt a simple yet effective approach to generate synthetic data for fine-tuning, enabling the model to develop capabilities for self-verification and multi-agent collaboration. We further improve its solution proposal and verification accuracy through online reinforcement learning. Experiments on mathematical reasoning benchmarks show that SPOC significantly improves performance. Notably, SPOC boosts the accuracy of Llama-3.1-8B and 70B Instruct models, achieving gains of 8.8% and 11.6% on MATH500, 10.0% and 20.0% on AMC23, and 3.3% and 6.7% on AIME24, respectively.
Boosting LLM Reasoning via Spontaneous Self-Correction
Tengyu Xu
Xuewei Wang
Zhengxing Chen
Di Jin
Liang Tan
Zishun Yu
Zhuokai Zhao
Yun He
Sinong Wang
Han Fang
Chen Zhu
MetaAI
Mila - Québec
AI Institute
Polytechnique Montréal
While large language models (LLMs) have demonstrated remarkable success on a broad range of tasks, math reasoning remains a challenging one.… (voir plus) One of the approaches for improving math reasoning is self-correction, which designs self-improving loops to let the model correct its own mistakes. However, existing self-correction approaches treat corrections as standalone post-generation refinements, relying on extra prompt and system designs to elicit self-corrections, instead of performing real-time, spontaneous self-corrections in a single pass. To address this, we propose SPOC, a spontaneous self-correction approach that enables LLMs to generate interleaved solutions and verifications in a single inference pass, with generation dynamically terminated based on verification outcomes, thereby effectively scaling inference time compute. SPOC considers a multi-agent perspective by assigning dual roles -- solution proposer and verifier -- to the same model. We adopt a simple yet effective approach to generate synthetic data for fine-tuning, enabling the model to develop capabilities for self-verification and multi-agent collaboration. We further improve its solution proposal and verification accuracy through online reinforcement learning. Experiments on mathematical reasoning benchmarks show that SPOC significantly improves performance. Notably, SPOC boosts the accuracy of Llama-3.1-8B and 70B Instruct models, achieving gains of 8.8% and 11.6% on MATH500, 10.0% and 20.0% on AMC23, and 3.3% and 6.7% on AIME24, respectively.
A Self-Supervised Foundation Model for Robust and Generalizable Representation Learning in STED Microscopy
Anthony Bilodeau
Julia Chabbert
Jean-Michel Bellavance
Koraly Lessard
Andréanne Deschênes
Renaud Bernatchez
Paul De Koninck