Publications

seq-JEPA: Autoregressive Predictive Learning of Invariant-Equivariant World Models

Current self-supervised algorithms commonly rely on transformations such as data augmentation and masking to learn visual representations. T… (voir plus)his is achieved by enforcing invariance or equivariance with respect to these transformations after encoding two views of an image. This dominant two-view paradigm often limits the flexibility of learned representations for downstream adaptation by creating performance trade-offs between high-level invariance-demanding tasks such as image classification and more fine-grained equivariance-related tasks. In this work, we proposes \emph{seq-JEPA}, a world modeling framework that introduces architectural inductive biases into joint-embedding predictive architectures to resolve this trade-off. Without relying on dual equivariance predictors or loss terms, seq-JEPA simultaneously learns two architecturally segregated representations: one equivariant to specified transformations and another invariant to them. To do so, our model processes short sequences of different views (observations) of inputs. Each encoded view is concatenated with an embedding of the relative transformation (action) that produces the next observation in the sequence. These view-action pairs are passed through a transformer encoder that outputs an aggregate representation. A predictor head then conditions this aggregate representation on the upcoming action to predict the representation of the next observation. Empirically, seq-JEPA demonstrates strong performance on both equivariant and invariant benchmarks without sacrificing one for the other. Furthermore, it excels at tasks that inherently require aggregating a sequence of observations, such as path integration across actions and predictive learning across eye movements.

2025-09-17

NeurIPS.cc/2025/Conference (poster)

doi.org

openreview.net

Stable Gradients for Stable Learning at Scale in Deep Reinforcement Learning

Roger Creus Castanyer

Johan Obando-Ceron

Scaling deep reinforcement learning networks is challenging and often results in degraded performance, yet the root causes of this failure m… (voir plus)ode remain poorly understood. Several recent works have proposed mechanisms to address this, but they are often complex and fail to highlight the causes underlying this difficulty. In this work, we conduct a series of empirical analyses which suggest that the combination of non-stationarity with gradient pathologies, due to suboptimal architectural choices, underlie the challenges of scale. We propose a series of direct interventions that stabilize gradient flow, enabling robust performance across a range of network depths and widths. Our interventions are simple to implement and compatible with well-established algorithms, and result in an effective mechanism that enables strong performance even at large scales. We validate our findings on a variety of agents and suites of environments.

2025-09-17

NeurIPS.cc/2025/Conference (spotlight)

doi.org

openreview.net

State Entropy Regularization for Robust Reinforcement Learning

Yonatan Ashlag

Uri Koren

Mirco Mutti

Esther Derman

Pierre-Luc Bacon

Shie Mannor

State entropy regularization has empirically shown better exploration and sample complexity in reinforcement learning (RL). However, its the… (voir plus)oretical guarantees have not been studied. In this paper, we show that state entropy regularization improves robustness to structured and spatially correlated perturbations. These types of variation are common in transfer learning but often overlooked by standard robust RL methods, which typically focus on small, uncorrelated changes. We provide a comprehensive characterization of these robustness properties, including formal guarantees under reward and transition uncertainty, as well as settings where the method performs poorly. Much of our analysis contrasts state entropy with the widely used policy entropy regularization, highlighting their different benefits. Finally, from a practical standpoint, we illustrate that compared with policy entropy, the robustness advantages of state entropy are more sensitive to the number of rollouts used for policy evaluation.

2025-09-17

NeurIPS.cc/2025/Conference (présentation orale)

doi.org

openreview.net

System-1.5 Reasoning: Traversal in Language and Latent Spaces with Dynamic Shortcuts

Xiaoqiang Wang

Suyuchen Wang

Yun Zhu

Bang Liu

Chain-of-thought (CoT) reasoning enables large language models (LLMs) to move beyond fast System-1 responses and engage in deliberative Syst… (voir plus)em-2 reasoning. However, this comes at the cost of significant inefficiency due to verbose intermediate output. Recent latent-space reasoning methods improve efficiency by operating on hidden states without decoding into language, yet they treat all steps uniformly, failing to distinguish critical deductions from auxiliary steps and resulting in suboptimal use of computational resources. In this paper, we propose System-1.5 Reasoning, an adaptive reasoning framework that dynamically allocates computation across reasoning steps through shortcut paths in latent space. Specifically, System-1.5 Reasoning introduces two types of dynamic shortcuts. The model depth shortcut (DS) adaptively reasons along the vertical depth by early exiting non-critical tokens through lightweight adapter branches, while allowing critical tokens to continue through deeper Transformer layers. The step shortcut (SS) reuses hidden states across the decoding steps to skip trivial steps and reason horizontally in latent space. Training System-1.5 Reasoning involves a two-stage self-distillation process: first distilling natural language CoT into latent-space continuous thought, and then distilling full-path System-2 latent reasoning into adaptive shortcut paths (System-1.5 Reasoning). Experiments on reasoning tasks demonstrate the superior performance of our method. For example, on GSM8K, System-1.5 Reasoning achieves reasoning performance comparable to traditional CoT fine-tuning methods while accelerating inference by over 20x and reducing token generation by 92.31% on average.

2025-09-17

NeurIPS.cc/2025/Conference (poster)

doi.org

openreview.net

Tapered Off-Policy REINFORCE Stable and efficient reinforcement learning for LLMs

Nicolas Le Roux

Bellemare Marc-Emmanuel

Jonathan Lebensoldt

Arnaud Bergeron

Joshua Greaves

Alex Fréchette

Carolyne Pelletier

Éric Thibodeau-Laufer

Sándor Toth

Sam Work

We propose a new algorithm for fine-tuning large language models using reinforcement learning. Tapered Off-Policy REINFORCE (TOPR) uses an a… (voir plus)symmetric, tapered variant of importance sampling to speed up learning while maintaining stable learning dynamics, even without the use of KL regularization. TOPR can be applied in a fully offline fashion, allows the handling of positive and negative examples in a unified framework, and benefits from the implementational simplicity that is typical of Monte Carlo algorithms. We demonstrate the effectiveness of our approach with a series of experiments on the GSM8K and MATH reasoning benchmarks, finding performance gains for training both a model for solution generation and as a generative verifier. We show that properly leveraging positive and negative examples alike in the off-policy regime simultaneously increases test-time accuracy and training data efficiency, all the while avoiding the ``wasted inference'' that comes with discarding negative examples. We find that this advantage persists over multiple iterations of training and can be amplified by dataset curation techniques, enabling us to match 70B-parameter model performance with 8B language models. As a corollary to this work, we find that REINFORCE's baseline parameter plays an important and unexpected role in defining dataset composition in the presence of negative examples, and is consequently critical in driving off-policy performance.

2025-09-17

NeurIPS.cc/2025/Conference (poster)

openreview.net

Is the acquisition worth the cost? Surrogate losses for Consistent Two-stage Classifiers

Florence Regol

Joseph Cotnareanu

Theodore Glavas

Mark J. Coates

2025-09-17

NeurIPS.cc/2025/Conference (spotlight)

openreview.net

The Promise of RL for Autoregressive Image Editing

Saba Ahmadi

Rabiul Awal

Ankur Sikarwar

Amirhossein Kazemnejad

Ge Ya Luo

Juan A. Rodriguez

Sai Rajeswar

Siva Reddy

Christopher Pal

Benno Krojer

Aishwarya Agrawal

While image generation techniques are now capable of producing high-quality images that respect prompts which span multiple sentences, the t… (voir plus)ask of text-guided image editing remains a challenge. Even edit requests that consist of only a few words often fail to be executed correctly. We explore three strategies to enhance performance on a wide range of image editing tasks: supervised fine-tuning (SFT), reinforcement learning (RL), and Chain-of-Thought (CoT) reasoning. In order to study all these components in one consistent framework, we adopt an autoregressive multimodal model that processes textual and visual tokens in a unified manner. We find RL combined with a large multi-modal LLM verifier to be the most effective of these strategies. As a result, we release EARL: Editing with Autoregression and RL, a strong RL-based image editing model that performs competitively on a diverse range of edits compared to strong baselines, despite using much less training data. Thus, EARL pushes the frontier of autoregressive multimodal models on image editing. We release our code, training data, and trained models at https://github.com/mair-lab/EARL.

2025-09-17

NeurIPS.cc/2025/Conference (poster)

doi.org

openreview.net

THUNDER: Tile-level Histopathology image UNDERstanding benchmark

Pierre Marza

Leo Fillioux

Sofiène Boutaj

KUNAL MAHATHA

Christian Desrosiers

Pablo Piantanida

Jose Dolz

Stergios Christodoulidis

Maria Vakalopoulou

Progress in a research field can be hard to assess, in particular when many concurrent methods are proposed in a short period of time. This … (voir plus)is the case in digital pathology, where many foundation models have been released recently to serve as feature extractors for tile-level images, being used in a variety of downstream tasks, both for tile- and slide-level problems. Benchmarking available methods then becomes paramount to get a clearer view of the research landscape. In particular, in critical domains such as healthcare, a benchmark should not only focus on evaluating downstream performance, but also provide insights about the main differences between methods, and importantly, further consider uncertainty and robustness to ensure a reliable usage of proposed models. For these reasons, we introduce THUNDER, a tile-level benchmark for digital pathology foundation models, allowing for efficient comparison of many models on diverse datasets with a series of downstream tasks, studying their feature spaces and assessing the robustness and uncertainty of predictions informed by their embeddings. THUNDER is a fast, easy-to-use, dynamic benchmark that can already support a large variety of state-of-the-art foundation, as well as local user-defined models for direct tile-based comparison. In this paper, we provide a comprehensive comparison of 23 foundation models on 16 different datasets covering diverse tasks, feature analysis, and robustness. The code for THUNDER is publicly available at https://github.com/MICS-Lab/thunder.

2025-09-17

NeurIPS.cc/2025/Datasets_and_Benchmarks_Track (spotlight)

openreview.net

Tight Lower Bounds and Improved Convergence in Performative Prediction

Performative prediction is a framework accounting for the shift in the data distribution induced by the prediction of a model deployed in th… (voir plus)e real world. Ensuring rapid convergence to a stable solution where the data distribution remains the same after the model deployment is crucial, especially in evolving environments. This paper extends the Repeated Risk Minimization (RRM) framework by utilizing historical datasets from previous retraining snapshots, yielding a class of algorithms that we call Affine Risk Minimizers and enabling convergence to a performatively stable point for a broader class of problems. We introduce a new upper bound for methods that use only the final iteration of the dataset and prove for the first time the tightness of both this new bound and the previous existing bounds within the same regime. We also prove that utilizing historical datasets can surpass the lower bound for last iterate RRM, and empirically observe faster convergence to the stable point on various performative prediction benchmarks. We offer at the same time the first lower bound analysis for RRM within the class of Affine Risk Minimizers, quantifying the potential improvements in convergence speed that could be achieved with other variants in our framework.

2025-09-17

NeurIPS.cc/2025/Conference (poster)

doi.org

openreview.net

Tracing the Representation Geometry of Language Models from Pretraining to Post-training

Melody Zixuan Li

Kumar Krishna Agrawal

Arna Ghosh

Komal Kumar Teru

Adam Santoro

Guillaume Lajoie

Blake A. Richards

Standard training metrics like loss fail to explain the emergence of complex capabilities in large language models. We take a spectral appro… (voir plus)ach to investigate the geometry of learned representations across pretraining and post-training, measuring effective rank (RankMe) and eigenspectrum decay (

2025-09-17

NeurIPS.cc/2025/Conference (poster)

doi.org

openreview.net

Trajectory Balance with Asynchrony: Decoupling Exploration and Learning for Fast, Scalable LLM Post-Training

Brian Bartoldson

James Diffenderfer

Tal Ben-Nun

Johan Obando-Ceron

Bhavya Kailkhura

Reinforcement learning (RL) is a critical component of large language model (LLM) post-training. However, on-policy algorithms used for post… (voir plus)-training are not naturally robust to a diversified content of experience replay buffers, which asynchronous off-policy actors can efficiently populate in parallel to training. We propose efficiently learning on such off-policy data via Trajectory Balance with Asynchrony (TBA), an approach to asynchronous RL for LLMs that leverages the principled off-policy TB objective. On math, preference-tuning, and automated red-teaming tasks, we post-train models ranging from Pythia 410M to Qwen 2.5 7B, finding TBA offers speed and performance boosts over strong baselines like Online DPO and Dr. GRPO. Beyond TBA's performance benefits (high accuracy even as asynchrony grows) and speedups (

2025-09-17

NeurIPS.cc/2025/Conference (poster)

doi.org

openreview.net

Transforming Generic Coder LLMs to Effective Binary Code Embedding Models for Similarity Detection

Litao Li

Leo Song

Steven Ding

Benjamin C. M. Fung

Philippe Charland

Cybersecurity and software research have crossed paths with modern deep learning research for a few years. The power of large language model… (voir plus)s (LLMs) in particular has intrigued us to apply them to understanding binary code. In this paper, we investigate some of the many ways LLMs can be applied to binary code similarity detection, as it is a significantly more difficult task compared to source code similarity detection due to the sparsity of information and less meaningful syntax. It also has great practical implications, such as vulnerability and malware detection. We find that pretrained LLMs are mostly capable of detecting similar binary code, even with a zero-shot setting. Our main contributions and findings are to provide several supervised fine-tuning methods that, when combined, significantly surpass zero-shot LLMs and state-of-the-art binary code similarity detection methods. Specifically, we up-train the model through data augmentation, translation-style causal learning, LLM2Vec, and cumulative GTE loss. With a complete ablation study, we show that our training method can transform a generic language model into a powerful binary similarity expert, and is also robust and general enough for cross-optimization, cross-architecture, and cross-obfuscation detection.

2025-09-17

NeurIPS.cc/2025/Conference (poster)

openreview.net

Mila sur Udemy

Désinformation 2.0 : quand l’IA brouille nos ondes

Publications du Fellowship en politiques de l'IA

Publications

Mila sur Udemy

Désinformation 2.0 : quand l’IA brouille nos ondes

Publications du Fellowship en politiques de l'IA

Mots-clés populaires:

Publications