Portrait of David Scott Krueger

David Scott Krueger

Core Academic Member
Assistant professor, Université de Montréal, Department of Computer Science and Operations Research (DIRO)
Research Topics
Deep Learning
Representation Learning

Biography

David Krueger is an Assistant Professor in Robust, Reasoning and Responsible AI in the Department of Computer Science and Operations Research (DIRO) at University of Montreal, and a Core Academic Member at Mila - Quebec Artificial Intelligence Institute, UC Berkeley's Center for Human-Compatible AI (CHAI), and the Center for the Study of Existential Risk (CSER). His work focuses on reducing the risk of human extinction from artificial intelligence (AI x-risk) through technical research as well as education, outreach, governance and advocacy.

His research spans many areas of Deep Learning, AI Alignment, AI Safety and AI Ethics, including alignment failure modes, algorithmic manipulation, interpretability, robustness, and understanding how AI systems learn and generalize. He has been featured in media outlets including ITV's Good Morning Britain, Al Jazeera's Inside Story, France 24, New Scientist and the Associated Press.

David completed his graduate studies at the University of Montreal and Mila - Quebec Artificial Intelligence Institute, working with Yoshua Bengio, Roland Memisevic, and Aaron Courville.

Current Students

PhD - Université de Montréal
Principal supervisor :
Collaborating researcher

Publications

Enhancing Neural Network Interpretability with Feature-Aligned Sparse Autoencoders
Luke Marks
Alasdair Paren
Fazl Barez
Sparse Autoencoders (SAEs) have shown promise in improving the interpretability of neural network activations, but can learn features that a… (see more)re not features of the input, limiting their effectiveness. We propose \textsc{Mutual Feature Regularization} \textbf{(MFR)}, a regularization technique for improving feature learning by encouraging SAEs trained in parallel to learn similar features. We motivate \textsc{MFR} by showing that features learned by multiple SAEs are more likely to correlate with features of the input. By training on synthetic data with known features of the input, we show that \textsc{MFR} can help SAEs learn those features, as we can directly compare the features learned by the SAE with the input features for the synthetic data. We then scale \textsc{MFR} to SAEs that are trained to denoise electroencephalography (EEG) data and SAEs that are trained to reconstruct GPT-2 Small activations. We show that \textsc{MFR} can improve the reconstruction loss of SAEs by up to 21.21\% on GPT-2 Small, and 6.67\% on EEG data. Our results suggest that the similarity between features learned by different SAEs can be leveraged to improve SAE training, thereby enhancing performance and the usefulness of SAEs for model interpretability.
Learning to Forget using Hypernetworks
Jose Miguel Lara Rangel
Usman Anwar
Stefan Schoepf
Jack Foster
Machine unlearning is gaining increasing attention as a way to remove adversarial data poisoning attacks from already trained models and to … (see more)comply with privacy and AI regulations. The objective is to unlearn the effect of undesired data from a trained model while maintaining performance on the remaining data. This paper introduces HyperForget, a novel machine unlearning framework that leverages hypernetworks– neural networks that generate parameters for other networks– to dynamically sample models that lack knowledge of targeted data while preserving essential capabilities. Leveraging diffusion models, we implement two Diffusion HyperForget Networks and used them to sample unlearned models in Proof-of-Concept experiments. The unlearned models obtained zero accuracy on the forget set, while preserving good accuracy on the retain sets, highlighting the potential of HyperForget for dynamic targeted data removal and a promising direction for developing adaptive machine unlearning algorithms.
PoisonBench: Assessing Large Language Model Vulnerability to Data Poisoning
Tingchen Fu
Mrinank Sharma
Philip Torr
Shay B. Cohen
Fazl Barez
Preference learning is a central component for aligning current LLMs, but this process can be vulnerable to data poisoning attacks. To addre… (see more)ss this concern, we introduce PoisonBench, a benchmark for evaluating large language models' susceptibility to data poisoning during preference learning. Data poisoning attacks can manipulate large language model responses to include hidden malicious content or biases, potentially causing the model to generate harmful or unintended outputs while appearing to function normally. We deploy two distinct attack types across eight realistic scenarios, assessing 21 widely-used models. Our findings reveal concerning trends: (1) Scaling up parameter size does not inherently enhance resilience against poisoning attacks; (2) There exists a log-linear relationship between the effects of the attack and the data poison ratio; (3) The effect of data poisoning can generalize to extrapolated triggers that are not included in the poisoned data. These results expose weaknesses in current preference learning techniques, highlighting the urgent need for more robust defenses against malicious models and data manipulation.
PoisonBench: Assessing Large Language Model Vulnerability to Data Poisoning
Tingchen Fu
Mrinank Sharma
Philip Torr
Shay B. Cohen
Fazl Barez
Preference learning is a central component for aligning current LLMs, but this process can be vulnerable to data poisoning attacks. To addre… (see more)ss this concern, we introduce PoisonBench, a benchmark for evaluating large language models' susceptibility to data poisoning during preference learning. Data poisoning attacks can manipulate large language model responses to include hidden malicious content or biases, potentially causing the model to generate harmful or unintended outputs while appearing to function normally. We deploy two distinct attack types across eight realistic scenarios, assessing 21 widely-used models. Our findings reveal concerning trends: (1) Scaling up parameter size does not inherently enhance resilience against poisoning attacks; (2) There exists a log-linear relationship between the effects of the attack and the data poison ratio; (3) The effect of data poisoning can generalize to extrapolated triggers that are not included in the poisoned data. These results expose weaknesses in current preference learning techniques, highlighting the urgent need for more robust defenses against malicious models and data manipulation.
Input Space Mode Connectivity in Deep Neural Networks
Jakub Vrabel
Ori Shem-Ur
Yaron Oz
We extend the concept of loss landscape mode connectivity to the input space of deep neural networks. Initially studied in parameter space, … (see more)mode connectivity describes the existence of low-loss paths between solutions (loss minimizers) found via gradient descent. We present theoretical and empirical evidence of its presence in the input space of deep networks, thereby highlighting the broader nature of the phenomenon. We observe that different input images with similar predictions are generally connected, and for trained models, the path tends to be simple, with only a small deviation from being a linear path. We conjecture that input space mode connectivity in high-dimensional spaces is a geometric phenomenon, present even in untrained models, and can be explained by percolation theory. We exploit mode connectivity to obtain new insights about adversarial examples and show its potential for adversarial detection and interpretability.
Analyzing (In)Abilities of SAEs via Formal Languages
Abhinav Menon
Manish Shrivastava
Ekdeep Singh Lubana
Sparse autoencoders (SAEs) have been central to the effort of finding interpretable and disentangled directions of representation spaces in … (see more)neural networks, in both image and text domains. While the efficacy and pitfalls of this method in the vision domain are well-studied, there is a lack of corresponding results, both qualitative and quantitative, for the text domain. We define and train language models on a set of formal grammars, and train SAEs on the latent representations of these models under a wide variety of hyperparameter settings. We identify several interpretable latents in the SAEs, and formulate a scaling law defining the relationship between the reconstruction loss of SAEs and their hidden size. We show empirically that the presence of latents correlating to certain features of the input does not imply a causal function in the computation and that the performance of SAEs is highly sensitive to inductive biases.
Comparing Bottom-Up and Top-Down Steering Approaches on In-Context Learning Tasks
Madeline Brumley
Joe Kwon
Dmitrii Krasheninnikov
Usman Anwar
A key objective of interpretability research on large language models (LLMs) is to develop methods for robustly steering models toward desir… (see more)ed behaviors. To this end, two distinct approaches to interpretability -- ``bottom-up"and ``top-down"-- have been presented, but there has been little quantitative comparison between them. We present a case study comparing the effectiveness of representative vector steering methods from each branch: function vectors (FV; arXiv:2310.15213), as a bottom-up method, and in-context vectors (ICV; arXiv:2311.06668) as a top-down method. While both aim to capture compact representations of broad in-context learning tasks, we find they are effective only on specific types of tasks: ICVs outperform FVs in behavioral shifting, whereas FVs excel in tasks requiring more precision. We discuss the implications for future evaluations of steering methods and for further research into top-down and bottom-up steering given these findings.
Quantifying Feature Space Universality Across Large Language Models via Sparse Autoencoders
Michael Lan
Philip Torr
Austin Meek
Ashkan Khakzar
Fazl Barez
The Universality Hypothesis in large language models (LLMs) claims that different models converge towards similar concept representations in… (see more) their latent spaces. Providing evidence for this hypothesis would enable researchers to exploit universal properties, facilitating the generalization of mechanistic interpretability techniques across models. Previous works studied if LLMs learned the same features, which are internal representations that activate on specific concepts. Since comparing features across LLMs is challenging due to polysemanticity, in which LLM neurons often correspond to multiple unrelated features rather than to distinct concepts, sparse autoencoders (SAEs) have been employed to disentangle LLM neurons into SAE features corresponding to distinct concepts. In this paper, we introduce a new variation of the universality hypothesis called Analogous Feature Universality: we hypothesize that even if SAEs across different models learn different feature representations, the spaces spanned by SAE features are similar, such that one SAE space is similar to another SAE space under rotation-invariant transformations. Evidence for this hypothesis would imply that interpretability techniques related to latent spaces, such as steering vectors, may be transferred across models via certain transformations. To investigate this hypothesis, we first pair SAE features across different models via activation correlation, and then measure spatial relation similarities between paired features via representational similarity measures, which transform spaces into representations that reveal hidden relational similarities. Our experiments demonstrate high similarities for SAE feature spaces across various LLMs, providing evidence for feature space universality.
Quantifying Feature Space Universality Across Large Language Models via Sparse Autoencoders
Michael Lan
Philip Torr
Austin Meek
Ashkan Khakzar
Fazl Barez
Sparse Autoencoders Reveal Universal Feature Spaces Across Large Language Models
Michael Lan
Philip Torr
Austin Meek
Ashkan Khakzar
Fazl Barez
Sparse Autoencoders Reveal Universal Feature Spaces Across Large Language Models
Michael Lan
Philip Torr
Austin Meek
Ashkan Khakzar
Fazl Barez
We investigate feature universality in large language models (LLMs), a research field that aims to understand how different models similarly… (see more) represent concepts in the latent spaces of their intermediate layers. Demonstrating feature universality allows discoveries about latent representations to generalize across several models. However, comparing features across LLMs is challenging due to polysemanticity, in which individual neurons often correspond to multiple features rather than distinct ones. This makes it difficult to disentangle and match features across different models. To address this issue, we employ a method known as dictionary learning by using sparse autoencoders (SAEs) to transform LLM activations into more interpretable spaces spanned by neurons corresponding to individual features. After matching feature neurons across models via activation correlation, we apply representational space similarity metrics like Singular Value Canonical Correlation Analysis to analyze these SAE features across different LLMs. Our experiments reveal significant similarities in SAE feature spaces across various LLMs, providing new evidence for feature universality.
Steering Clear: A Systematic Study of Activation Steering in a Toy Setup
Dmitrii Krasheninnikov
Activation steering is a promising family of methods for controlling LLM outputs via targeted interventions on model activations. We introdu… (see more)ce a toy multi-label classification setup to systematically study activation steering methods, and experiment with several types of steering adapters — from steering vectors (adding a fixed vector to activations) to more expressive adapters involving projections. We evaluate the adapters across steering tasks of different complexities, for three notions of complexity: 1) how densely the features are packed in the representation space (roughly, number of features divided by the dimensionality of the activations), 2) number of attributes steered, and 3) number of values the steered attribute can take. We find that as task complexity is increased, steering vector methods perform worse, while the more expressive methods only take a performance hit when there is not enough data. On the other hand, steering vectors usually outperform more expressive methods in the low-data regime, regardless of task complexity. We conclude by discussing this work's limitations, which include our toy setup not modeling features represented in superposition or continuous features, and the lack of experiments with LLMs.