
Thomas Jiralerspong

PhD - Université de Montréal
Research Topics
Deep Learning
Generative Models
Information Theory
Reasoning
Reinforcement Learning
Representation Learning

Publications

Noticing the Watcher: LLM Agents Can Infer CoT Monitoring from Blocking Feedback
Chain-of-thought (CoT) monitoring provides oversight into model reasoning, but its effectiveness assumes models do not know they are being watched. We ask whether reasoning agents can autonomously infer that their supposedly private chain of thought is under surveillance, and whether this awareness leads to strategic evasion, without any explicit training or instructions to do so. In a multi-episode agentic framework, models pursue both a primary task and a concealed side task while being told their reasoning is private; a hidden CoT monitor blocks episodes when suspicious reasoning is detected. We find that frontier models can deduce the existence of this monitor purely from blocking feedback, with the most capable models reaching confident belief that their thinking is observed in up to 19% of episodes. This awareness scales with model capability and, in rare cases, escalates to explicit intent to suppress reasoning about the side task. However, models that form this intent uniformly fail to execute it, openly reasoning about their concealed objectives in the very next episode. This intent–capability gap is reassuring for current deployment, but the autonomous emergence of both monitoring awareness and evasion intent suggests that CoT monitoring is not a permanently reliable safeguard.
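For intuition, here is a minimal, self-contained toy of the episode protocol described above; run_episodes, agent_step, and monitor are hypothetical stand-ins invented for illustration, not the paper's actual harness.

import random

def run_episodes(agent_step, monitor, n_episodes=20):
    """Toy multi-episode loop: the agent's chain of thought is nominally
    private, but a hidden monitor blocks any episode whose reasoning it
    flags. The agent only ever sees the block/allow outcome, never the
    reason, which is the feedback channel the paper studies."""
    feedback = []
    for ep in range(n_episodes):
        cot, action = agent_step(feedback)   # reasoning + visible action
        if monitor(cot):                     # hidden CoT monitor fires
            feedback.append((ep, "blocked"))
        else:
            feedback.append((ep, action))
    return feedback

# Trivial stand-ins: the monitor flags any mention of the side task.
monitor = lambda cot: "side task" in cot
agent_step = lambda fb: (random.choice(["focus on the main task",
                                        "advance the side task"]), "submit")
print(run_episodes(agent_step, monitor, n_episodes=5))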
Delta-Crosscoder: Robust Crosscoder Model Diffing in Narrow Fine-Tuning Regimes
Model diffing methods aim to identify how fine-tuning changes a model's internal representations. Crosscoders approach this by learning shared dictionaries of interpretable latent directions between base and fine-tuned models. However, existing formulations struggle with narrow fine-tuning, where behavioral changes are localized and asymmetric. We introduce Delta-Crosscoder, which combines BatchTopK sparsity with a delta-based loss that prioritizes directions that change between models, plus an implicit contrastive signal from paired activations on matched inputs. Evaluated across 10 model organisms, including synthetic false facts, emergent misalignment, subliminal learning, and taboo word guessing (Gemma, LLaMA, Qwen; 1B–9B parameters), Delta-Crosscoder reliably isolates latent directions causally responsible for fine-tuned behaviors and enables effective mitigation, outperforming SAE-based baselines and matching non-SAE-based ones. Our results demonstrate that crosscoders remain a powerful tool for model diffing.
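For concreteness, here is a minimal PyTorch sketch of the two ingredients the abstract names, a BatchTopK-style sparsity rule and a delta-weighted reconstruction loss over paired activations. The architecture, shapes, and the delta_weight hyperparameter are illustrative assumptions, not the paper's implementation.

import torch
import torch.nn as nn

class DeltaCrosscoder(nn.Module):
    """Sketch of a crosscoder: one encoder over paired base/fine-tuned
    activations, a shared latent dictionary, and per-model decoders.
    Sizes and the top-k budget are placeholder choices."""

    def __init__(self, d_model: int, n_latents: int, k: int):
        super().__init__()
        self.k = k
        self.encoder = nn.Linear(2 * d_model, n_latents)
        self.dec_base = nn.Linear(n_latents, d_model, bias=False)
        self.dec_ft = nn.Linear(n_latents, d_model, bias=False)

    def batch_topk(self, z: torch.Tensor) -> torch.Tensor:
        # BatchTopK-style sparsity: keep the k * batch_size largest
        # activations across the whole batch, zero out the rest.
        flat = z.flatten()
        vals, idx = torch.topk(flat, self.k * z.shape[0])
        return torch.zeros_like(flat).scatter(0, idx, vals).view_as(z)

    def forward(self, a_base, a_ft):
        z = torch.relu(self.encoder(torch.cat([a_base, a_ft], dim=-1)))
        z = self.batch_topk(z)
        return self.dec_base(z), self.dec_ft(z)

def delta_loss(model, a_base, a_ft, delta_weight=5.0):
    """Reconstruction loss with an extra term that up-weights the
    base-to-fine-tuned difference, so latents explaining the fine-tuning
    delta are prioritized (the weighting is an assumed hyperparameter)."""
    rec_base, rec_ft = model(a_base, a_ft)
    rec = ((rec_base - a_base) ** 2).mean() + ((rec_ft - a_ft) ** 2).mean()
    delta = (((rec_ft - rec_base) - (a_ft - a_base)) ** 2).mean()
    return rec + delta_weight * delta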
Towards a Formal Theory of Representational Compositionality
Learning What Matters: Steering Diffusion via Spectrally Anisotropic Forward Noise
Berton Earnshaw
Jason Hartford
Shaping Inductive Bias in Diffusion Models through Frequency-Based Noise Control
Berton Earnshaw
Jason Hartford
Diffusion Probabilistic Models (DPMs) are powerful generative models that have achieved unparalleled success in a number of generative tasks. In this work, we aim to build inductive biases into the training and sampling of diffusion models to better accommodate the target distribution of the data being modeled. For topologically structured data, we devise a frequency-based noising operator that purposefully manipulates, and sets, these inductive biases. We first show that appropriate manipulations of the forward noising process can lead DPMs to focus on particular aspects of the distribution to be learned. We show that different datasets necessitate different inductive biases, and that appropriate frequency-based noise control improves generative performance compared to standard diffusion. Finally, we demonstrate the possibility of ignoring information at particular frequencies while learning. We show this in an image corruption and recovery task, where we train a DPM to recover the original target distribution after severe noise corruption.
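As a rough illustration of what a frequency-based noising operator can look like, the sketch below attenuates the high-frequency band of the injected Gaussian noise before mixing it into the image; the band split, scales, and linear schedule are arbitrary choices made for this example, not the paper's operator.

import math
import torch

def frequency_shaped_noise_step(x, t, low_scale=1.0, high_scale=0.1,
                                cutoff=0.25):
    """One forward-noising step where the injected Gaussian noise is
    reshaped in frequency space: low frequencies keep full power while
    high frequencies are attenuated, biasing learning toward coarse
    structure. x is a batch of images (B, C, H, W); t in (0, 1) indexes
    a toy linear schedule alpha(t) = 1 - t."""
    noise = torch.randn_like(x)
    freq = torch.fft.fft2(noise)
    h, w = x.shape[-2:]
    fy = torch.fft.fftfreq(h).view(-1, 1)
    fx = torch.fft.fftfreq(w).view(1, -1)
    radius = (fx ** 2 + fy ** 2).sqrt()
    # Per-frequency gain: low_scale inside the cutoff radius, high_scale outside.
    scale = high_scale + (low_scale - high_scale) * (radius < cutoff).float()
    shaped = torch.fft.ifft2(freq * scale).real
    alpha = 1.0 - t
    return math.sqrt(alpha) * x + math.sqrt(1.0 - alpha) * shaped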
Expressivity of Neural Networks with Random Weights and Learned Biases
Avery Hee-Woon Ryoo
Matthew G Perich
Luca Mazzucato
Landmark universal function approximation results for neural networks with trained weights and biases provided the impetus for the ubiquitous use of neural networks as learning models in neuroscience and Artificial Intelligence (AI). Recent work has extended these results to networks in which a smaller subset of weights (e.g., output weights) are tuned, leaving other parameters random. However, it remains an open question whether universal approximation holds when only biases are learned, despite evidence from neuroscience and AI that biases significantly shape neural responses. The current paper answers this question. We provide theoretical and numerical evidence demonstrating that feedforward neural networks with fixed random weights can approximate any continuous function on compact sets. We further show an analogous result for the approximation of dynamical systems with recurrent neural networks. Our findings are relevant to neuroscience, where they demonstrate the potential for behaviourally relevant changes in dynamics without modifying synaptic weights, as well as for AI, where they shed light on recent fine-tuning methods for large language models, like bias and prefix-based approaches.
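The core setting, frozen random weights with trainable biases, takes only a few lines to reproduce; in this sketch the widths, target function, and optimizer settings are placeholders.

import torch
import torch.nn as nn

def bias_only_mlp(sizes):
    """Build an MLP, freeze every weight at its random initialization,
    and leave only the biases trainable."""
    layers = []
    for d_in, d_out in zip(sizes[:-1], sizes[1:]):
        layers += [nn.Linear(d_in, d_out), nn.ReLU()]
    net = nn.Sequential(*layers[:-1])  # no ReLU after the output layer
    for name, p in net.named_parameters():
        p.requires_grad = name.endswith("bias")
    return net

# Example: fit y = sin(x) by tuning biases alone.
net = bias_only_mlp([1, 512, 512, 1])
opt = torch.optim.Adam([p for p in net.parameters() if p.requires_grad], lr=1e-3)
x = torch.linspace(-3, 3, 256).unsqueeze(-1)
y = torch.sin(x)
for _ in range(2000):
    opt.zero_grad()
    loss = ((net(x) - y) ** 2).mean()
    loss.backward()
    opt.step()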
Geometric Signatures of Compositionality Across a Language Model's Lifetime
Jin Hwa Lee
Lei Yu
Emily Cheng
Compositionality, the notion that the meaning of an expression is constructed from the meaning of its parts and syntactic rules, permits the infinite productivity of human language. For the first time, artificial language models (LMs) are able to match human performance in a number of compositional generalization tasks. However, much remains to be understood about the representational mechanisms underlying these abilities. We take a high-level geometric approach to this problem by relating the degree of compositionality in a dataset to the intrinsic dimensionality of its representations under an LM, a measure of feature complexity. We find not only that the degree of dataset compositionality is reflected in representations' intrinsic dimensionality, but that the relationship between compositionality and geometric complexity arises due to learned linguistic features over training. Finally, our analyses reveal a striking contrast between linear and nonlinear dimensionality, showing that they respectively encode formal and semantic aspects of linguistic composition.
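One standard way to measure the intrinsic dimensionality the abstract refers to is the TwoNN estimator of Facco et al. (2017), sketched below; the paper's exact estimator and preprocessing may differ.

import numpy as np
from scipy.spatial.distance import cdist

def two_nn_intrinsic_dim(points: np.ndarray) -> float:
    """TwoNN estimator: for each point, take the ratio mu = r2 / r1 of its
    second to first nearest-neighbor distances; the maximum-likelihood
    intrinsic dimension is N / sum(log(mu))."""
    dists = cdist(points, points)
    np.fill_diagonal(dists, np.inf)            # exclude self-distances
    nearest_two = np.sort(dists, axis=1)[:, :2]
    mu = nearest_two[:, 1] / nearest_two[:, 0]
    return len(mu) / np.log(mu).sum()

# Sanity check: a 2D plane embedded in 10D should give an estimate near 2.
rng = np.random.default_rng(0)
flat = rng.normal(size=(1000, 2)) @ rng.normal(size=(2, 10))
print(two_nn_intrinsic_dim(flat))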
General Causal Imputation via Synthetic Interventions
Given two sets of elements (such as cell types and drug compounds), researchers typically only have access to a limited subset of their interactions. The task of causal imputation involves using this subset to predict unobserved interactions. Squires et al. (2022) have proposed two estimators for this task based on the synthetic interventions (SI) estimator: SI-A (for actions) and SI-C (for contexts). We extend their work and introduce a novel causal imputation estimator, generalized synthetic interventions (GSI). We prove the identifiability of this estimator for data generated from a more complex latent factor model. On synthetic and real data, we show empirically that it recovers or outperforms their estimators.
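To make the synthetic-interventions idea concrete: a missing (context, action) entry is imputed by expressing the target context's observed outcomes as a low-rank combination of donor contexts' outcomes over shared actions, then applying those weights in the target action's column. The sketch below is a toy version of that idea, not Squires et al.'s exact estimator nor GSI itself.

import numpy as np

def si_impute(Y, target_row, target_col, rank=2):
    """Impute Y[target_row, target_col] in an outcomes matrix Y
    (rows = contexts, cols = actions; np.nan marks unobserved pairs)
    via principal-component regression on donor rows."""
    obs_cols = np.where(~np.isnan(Y[target_row]))[0]
    donors = [i for i in range(Y.shape[0])
              if i != target_row
              and not np.isnan(Y[i, target_col])
              and not np.isnan(Y[i, obs_cols]).any()]
    A = Y[np.ix_(donors, obs_cols)].T          # shared cols x donors
    b = Y[target_row, obs_cols]                # target's observed outcomes
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = min(rank, len(s))
    w = Vt[:k].T @ ((U[:, :k].T @ b) / s[:k])  # donor weights (rank-k PCR)
    return float(w @ Y[np.array(donors), target_col])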
A Complexity-Based Theory of Compositionality
Efficient Causal Graph Discovery Using Large Language Models