Thomas Jiralerspong

PhD - Université de Montréal
Supervisor
Co-supervisor
Research Topics
Deep Learning
Generative Models
Information Theory
Reasoning
Reinforcement Learning
Representation Learning

Publications

Hidden-State Similarity Predicts Re-Elicitation After Inoculation Prompting
Fine-tuning on narrow harmful tasks can cause emergent misalignment, where models generalize harmful behavior beyond the training distribution. Inoculation prompting can reduce this effect by explicitly eliciting the undesired behavior during training, but recent work shows that the behavior can reappear when evaluation prompts contain cues from the training context. We study what makes such prompts effective triggers. We find that textual similarity to the inoculation prompt is an incomplete predictor: prompts are more likely to re-elicit suppressed behavior when they induce activation states similar to those produced by the inoculation context. These findings advance our understanding of how inoculation prompting modulates conditional misalignment, and suggest that activation-space analysis can help identify when suppressed behaviors remain accessible under eval-time prompts.
When Does Interleaving Prevent Emergent Misalignment?
Large language models finetuned on narrow harmful tasks are prone to emergent misalignment (EM), where harmful behavior generalizes beyond the training distribution. Interleaving benign data during finetuning has been proposed as a mitigation, but recent work disagrees on whether it prevents EM. In this paper, we investigate this disagreement on Qwen-2.5 7B and 32B, and find that no single property of the interleaved data, taken in isolation, accounts for the gap. Instead, much of it traces to the evaluation itself, as the standard EM benchmark is sensitive to the length of the prompts it uses, and lengthening the evaluation prompts substantially shifts measured misalignment across model sizes. We then identify a region in the model's activations that predicts whether a given interleaved set will prevent EM, and show that reformulating benign data to fall within it substantially reduces EM on both 7B and 32B. This suggests that the standard EM benchmark, which relies on short prompts, may misrepresent the effectiveness of proposed mitigations.
CIAware-Bench: Benchmarking Control Intervention Awareness Across Frontier LLMs
Joachim Schaeffer
Alexander Panfilov
Jonas Geiping
Roland S. Zimmermann
AI control protocols oversee untrusted models by monitoring their actions and modifying potentially unsafe steps, often using a trusted model. This partially tampers with the untrusted model's trajectory. If the untrusted model detects such an intervention, it may infer properties of the monitor and adapt to evade control. We introduce CIAware-Bench, a benchmark for measuring control intervention (CI) awareness across frontier models. CIAware-Bench tests whether models can distinguish their own trajectories from those modified by a control intervention. The benchmark comprises four task domains (essay writing, BigCodeBench, Bash Arena, and SHADE-Arena) and varies trajectory watermarking, side-task presence, and the control protocol. Evaluating eleven frontier models, we find low to moderate CI awareness under default settings (up to 0.87; chance-level balanced binary classification accuracy is 0.5), with substantial variation across task domains and model pairs. Detection is generally easier across model families, suggesting that models exploit provider-specific differences in style or post-training. Overall, CI awareness is not a fixed model-level property, and should be measured for each new model release and deployment scenario. We release CIAware-Bench to track CI awareness and inform control protocols whose interventions are harder to detect.
Delta-Crosscoder: Robust Crosscoder in Narrow Fine-Tuning Regimes
Model diffing methods aim to identify how fine-tuning changes a model's internal representations. Crosscoders approach this by learning shared dictionaries of interpretable latent directions between base and fine-tuned models. However, existing formulations struggle with narrow fine-tuning, where behavioral changes are localized and asymmetric. We introduce Delta-Crosscoder, which combines Dual-K BatchTopK sparsity with a delta-based loss prioritizing directions that change between models, plus an implicit contrastive signal from paired activations on matched inputs. Evaluated across synthetic false facts, emergent misalignment, subliminal learning, and taboo word games (Gemma, LLaMA, Qwen; 1B–7B parameters), Delta-Crosscoder reliably isolates latent directions causally responsible for fine-tuned behaviors and enables effective mitigation, substantially outperforming baselines. Our results demonstrate that narrow fine-tuning induces distinctive, recoverable latent shifts and that crosscoder methods remain powerful tools for model diffing.
Noticing the Watcher: LLM Agents Can Infer CoT Monitoring from Blocking Feedback
Chain-of-thought (CoT) monitoring provides oversight into model reasoning, but its effectiveness assumes models do not know they are being watched. We ask whether reasoning agents can autonomously infer that their supposedly private chain of thought is under surveillance, and whether this awareness leads to strategic evasion, without any explicit training or instructions to do so. In a multi-episode agentic framework, models pursue both a primary task and a concealed side task while being told their reasoning is private; a hidden CoT monitor blocks episodes when suspicious reasoning is detected. We find that frontier models can deduce the existence of this monitor purely from blocking feedback, with the most capable models reaching confident belief that their thinking is observed in up to 19% of episodes. This awareness scales with model capability and, in rare cases, escalates to explicit intent to suppress reasoning about the side task. However, models that form this intent uniformly fail to execute it, openly reasoning about their concealed objectives in the very next episode. This intent–capability gap is reassuring for current deployment, but the autonomous emergence of both monitoring awareness and evasion intent suggests that CoT monitoring is not a permanently reliable safeguard.
Delta-Crosscoder: Robust Crosscoder Model Diffing in Narrow Fine-Tuning Regimes
Model diffing methods aim to identify how fine-tuning changes a model's internal representations. Crosscoders approach this by learning shared dictionaries of interpretable latent directions between base and fine-tuned models. However, existing formulations struggle with narrow fine-tuning, where behavioral changes are localized and asymmetric. We introduce Delta-Crosscoder, which combines BatchTopK sparsity with a delta-based loss prioritizing directions that change between models, plus an implicit contrastive signal from paired activations on matched inputs. Evaluated across 10 model organisms, including synthetic false facts, emergent misalignment, subliminal learning, and taboo word guessing (Gemma, LLaMA, Qwen; 1B–9B parameters), Delta-Crosscoder reliably isolates latent directions causally responsible for fine-tuned behaviors and enables effective mitigation, outperforming SAE-based baselines while matching the non-SAE-based ones. Our results demonstrate that crosscoders remain a powerful tool for model diffing.
Learning What Matters: Steering Diffusion via Spectrally Anisotropic Forward Noise
Berton Earnshaw
Jason Hartford
Towards a Formal Theory of Representational Compositionality
Shaping Inductive Bias in Diffusion Models through Frequency-Based Noise Control
Berton Earnshaw
Jason Hartford
Diffusion Probabilistic Models (DPMs) are powerful generative models that have achieved unparalleled success in a number of generative tasks. In this work, we aim to build inductive biases into the training and sampling of diffusion models to better accommodate the target distribution of the data to model. For topologically structured data, we devise a frequency-based noising operator to purposefully manipulate, and set, these inductive biases. We first show that appropriate manipulations of the noising forward process can lead DPMs to focus on particular aspects of the distribution to learn. We show that different datasets necessitate different inductive biases, and that appropriate frequency-based noise control induces increased generative performance compared to standard diffusion. Finally, we demonstrate the possibility of ignoring information at particular frequencies while learning. We show this in an image corruption and recovery task, where we train a DPM to recover the original target distribution after severe noise corruption.
Expressivity of Neural Networks with Random Weights and Learned Biases
Avery Hee-Woon Ryoo
Matthew G Perich
Luca Mazzucato
Landmark universal function approximation results for neural networks with trained weights and biases provided the impetus for the ubiquitous use of neural networks as learning models in neuroscience and Artificial Intelligence (AI). Recent work has extended these results to networks in which a smaller subset of weights (e.g., output weights) are tuned, leaving other parameters random. However, it remains an open question whether universal approximation holds when only biases are learned, despite evidence from neuroscience and AI that biases significantly shape neural responses. The current paper answers this question. We provide theoretical and numerical evidence demonstrating that feedforward neural networks with fixed random weights can approximate any continuous function on compact sets. We further show an analogous result for the approximation of dynamical systems with recurrent neural networks. Our findings are relevant to neuroscience, where they demonstrate the potential for behaviourally relevant changes in dynamics without modifying synaptic weights, as well as for AI, where they shed light on recent fine-tuning methods for large language models, like bias and prefix-based approaches.
Geometric Signatures of Compositionality Across a Language Model's Lifetime
Jin Hwa Lee
Lei Yu
Emily Cheng
Compositionality, the notion that the meaning of an expression is constructed from the meaning of its parts and syntactic rules, permits the infinite productivity of human language. For the first time, artificial language models (LMs) are able to match human performance in a number of compositional generalization tasks. However, much remains to be understood about the representational mechanisms underlying these abilities. We take a high-level geometric approach to this problem by relating the degree of compositionality in a dataset to the intrinsic dimensionality of its representations under an LM, a measure of feature complexity. We find not only that the degree of dataset compositionality is reflected in representations' intrinsic dimensionality, but that the relationship between compositionality and geometric complexity arises due to learned linguistic features over training. Finally, our analyses reveal a striking contrast between linear and nonlinear dimensionality, showing that they respectively encode formal and semantic aspects of linguistic composition.
General Causal Imputation via Synthetic Interventions
Given two sets of elements (such as cell types and drug compounds), researchers typically only have access to a limited subset of their interactions. The task of causal imputation involves using this subset to predict unobserved interactions. Squires et al. (2022) have proposed two estimators for this task based on the synthetic interventions (SI) estimator: SI-A (for actions) and SI-C (for contexts). We extend their work and introduce a novel causal imputation estimator, generalized synthetic interventions (GSI). We prove the identifiability of this estimator for data generated from a more complex latent factor model. On synthetic and real data we show empirically that it recovers or outperforms their estimators.