Portrait of David Scott Krueger

David Scott Krueger

Core Academic Member
Assistant professor, Université de Montréal, Department of Computer Science and Operations Research (DIRO)
Research Topics
Deep Learning
Representation Learning

Biography

David Krueger is an Assistant Professor in Robust, Reasoning and Responsible AI in the Department of Computer Science and Operations Research (DIRO) at University of Montreal, and a Core Academic Member at Mila - Quebec Artificial Intelligence Institute, UC Berkeley's Center for Human-Compatible AI (CHAI), and the Center for the Study of Existential Risk (CSER). His work focuses on reducing the risk of human extinction from artificial intelligence (AI x-risk) through technical research as well as education, outreach, governance and advocacy.

His research spans many areas of Deep Learning, AI Alignment, AI Safety and AI Ethics, including alignment failure modes, algorithmic manipulation, interpretability, robustness, and understanding how AI systems learn and generalize. He has been featured in media outlets including ITV's Good Morning Britain, Al Jazeera's Inside Story, France 24, New Scientist and the Associated Press.

David completed his graduate studies at the University of Montreal and Mila - Quebec Artificial Intelligence Institute, working with Yoshua Bengio, Roland Memisevic, and Aaron Courville.

Current Students

PhD - Université de Montréal
Principal supervisor :
Collaborating researcher

Publications

Detecting High-Stakes Interactions with Activation Probes
Alex McKenzie
Urja Pawar
Phil Blandfort
William Bankes
Ekdeep Singh Lubana
Dmitrii Krasheninnikov
Monitoring is an important aspect of safely deploying Large Language Models (LLMs). This paper examines activation probes for detecting"high… (see more)-stakes"interactions -- where the text indicates that the interaction might lead to significant harm -- as a critical, yet underexplored, target for such monitoring. We evaluate several probe architectures trained on synthetic data, and find them to exhibit robust generalization to diverse, out-of-distribution, real-world data. Probes' performance is comparable to that of prompted or finetuned medium-sized LLM monitors, while offering computational savings of six orders-of-magnitude. Our experiments also highlight the potential of building resource-aware hierarchical monitoring systems, where probes serve as an efficient initial filter and flag cases for more expensive downstream analysis. We release our novel synthetic dataset and codebase to encourage further study.
From Dormant to Deleted: Tamper-Resistant Unlearning Through Weight-Space Regularization
Shoaib Ahmed Siddiqui
Adrian Weller
Michael Curtis Mozer
Eleni Triantafillou
Recent unlearning methods for LLMs are vulnerable to relearning attacks: knowledge believed-to-be-unlearned re-emerges by fine-tuning on a s… (see more)mall set of (even seemingly-unrelated) examples. We study this phenomenon in a controlled setting for example-level unlearning in vision classifiers. We make the surprising discovery that forget-set accuracy can recover from around 50% post-unlearning to nearly 100% with fine-tuning on just the retain set -- i.e., zero examples of the forget set. We observe this effect across a wide variety of unlearning methods, whereas for a model retrained from scratch excluding the forget set (gold standard), the accuracy remains at 50%. We observe that resistance to relearning attacks can be predicted by weight-space properties, specifically,
Mitigating Goal Misgeneralization via Minimax Regret
Karim Ahmed Abdel Sadek
Matthew Farrugia-Roberts
Usman Anwar
Hannah Erlebach
Christian Schroeder de Witt
Michael D Dennis
Robustness research in reinforcement learning often focuses on ensuring that the policy consistently exhibits capable, goal-driven behavior.… (see more) However, not every capable behavior is the intended behavior. *Goal misgeneralization* can occur when the policy generalizes capably with respect to a 'proxy goal' whose optimal behavior correlates with the intended goal on the training distribution, but not out of distribution. Though the intended goal would be ambiguous if they were perfectly correlated in training, we show progress can be made if the goals are only *nearly ambiguous*, with the training distribution containing a small proportion of *disambiguating* levels. We observe that the training signal from disambiguating levels could be amplified by regret-based prioritization. We formally show that approximately optimal policies on maximal-regret levels avoid the harmful effects of goal misgeneralization, which may exist without this prioritization. Empirically, we find that current regret-based Unsupervised Environment Design (UED) methods can mitigate the effects of goal misgeneralization, though do not always entirely eliminate it. Our theoretical and empirical results show that as UED methods improve they could further mitigate goal misgeneralization in practice.
Position: Humanity Faces Existential Risk from Gradual Disempowerment
Jan Kulveit
Raymond Douglas
Nora Ammann
Deger Turan
David Duvenaud
PoisonBench: Assessing Language Model Vulnerability to Poisoned Preference Data
Tingchen Fu
Mrinank Sharma
Philip Torr
Shay B. Cohen
Fazl Barez
Preference learning is a central component for aligning current LLMs, but this process can be vulnerable to data poisoning attacks. To addre… (see more)ss this concern, we introduce PoisonBench, a benchmark for evaluating large language models' susceptibility to data poisoning during preference learning. Data poisoning attacks can manipulate large language model responses to include hidden malicious content or biases, potentially causing the model to generate harmful or unintended outputs while appearing to function normally. We deploy two distinct attack types across eight realistic scenarios, assessing 22 widely-used models. Our findings reveal concerning trends: (1) Scaling up parameter size does not always enhance resilience against poisoning attacks and the influence on model resilience varies among different model suites. (2) There exists a log-linear relationship between the effects of the attack and the data poison ratio; (3) The effect of data poisoning can generalize to extrapolated triggers that are not included in the poisoned data. These results expose weaknesses in current preference learning techniques, highlighting the urgent need for more robust defenses against malicious models and data manipulation.
Position: Probabilistic Modelling is Sufficient for Causal Inference
Bruno Mlodozeniec
Richard E. Turner
The Perils of Optimizing Learned Reward Functions: Low Training Error Does Not Guarantee Low Regret
Lukas Fluri
Leon Lang
Alessandro Abate
Patrick Forré
Joar Max Viktor Skalse
In reinforcement learning, specifying reward functions that capture the intended task can be very challenging. Reward learning aims to addre… (see more)ss this issue by *learning* the reward function. However, a learned reward model may have a low error on the data distribution, and yet subsequently produce a policy with large regret. We say that such a reward model has an *error-regret mismatch*. The main source of an error-regret mismatch is the distributional shift that commonly occurs during policy optimization. In this paper, we mathematically show that a sufficiently low expected test error of the reward model guarantees low worst-case regret, but that for any *fixed* expected test error, there exist realistic data distributions that allow for error-regret mismatch to occur. We then show that similar problems persist even when using policy regularization techniques, commonly employed in methods such as RLHF. We hope our results stimulate the theoretical and empirical study of improved methods to learn reward models, and better ways to measure their quality reliably.
Understanding (Un)Reliability of Steering Vectors in Language Models
Joschka Braun
Carsten Eickhoff
Seyed Ali Bahrainian
Dmitrii Krasheninnikov
Steering vectors are a lightweight method to control language model behavior by adding a learned bias to the activations at inference time. … (see more)Although steering demonstrates promising performance, recent work shows that it can be unreliable or even counterproductive in some cases. This paper studies the influence of prompt types and the geometry of activation differences on steering reliability. First, we find that all seven prompt types used in our experiments produce a net positive steering effect, but exhibit high variance across samples, and often give an effect opposite of the desired one. No prompt type clearly outperforms the others, and yet the steering vectors resulting from the different prompt types often differ directionally (as measured by cosine similarity). Second, we show that higher cosine similarity between training set activation differences predicts more effective steering. Finally, we observe that datasets where positive and negative activations are better separated are more steerable. Our results suggest that vector steering is unreliable when the target behavior is not represented by a coherent direction.
Pitfalls of Evidence-Based AI Policy
Stephen Casper
Dylan Hadfield-Menell
Nations across the world are working to govern AI. However, from a technical perspective, the best way to do this is not yet clear. Meanwhil… (see more)e, recent debates over AI regulation have led to calls for “evidence-based AI policy” which emphasize holding regulatory action to a high evidentiary standard. Evidence is of irreplaceable value to policymaking. However, holding regulatory action to too high an evidentiary standard can lead to systematic neglect of certain risks. In historical policy debates (e.g., over tobacco ca. 1965 and fossil fuels ca. 1990) “evidence-based policy” rhetoric is also a well-precedented strategy to downplay the urgency of action, delay regulation, and protect industry interests. Here, we argue that if the goal is evidence-based AI policy, the first regulatory objective must be to actively facilitate the process of identifying, studying, and deliberating about AI risks. We discuss a set of 16 regulatory goals to facilitate this and show that the EU, UK, USA, Brazil, Canada, and China all have substantial opportunities to adopt further evidence-seeking policies.
Influence Functions for Scalable Data Attribution in Diffusion Models
Bruno Mlodozeniec
Runa Eschenhagen
Juhan Bae
Alexander Immer
Richard E. Turner
Diffusion models have led to significant advancements in generative modelling. Yet their widespread adoption poses challenges regarding data… (see more) attribution and interpretability. In this paper, we aim to help address such challenges in diffusion models by extending influence functions. Influence function-based data attribution methods approximate how a model's output would have changed if some training data were removed. In supervised learning, this is usually used for predicting how the loss on a particular example would change. For diffusion models, we focus on predicting the change in the probability of generating a particular example via several proxy measurements. We show how to formulate influence functions for such quantities and how previously proposed methods can be interpreted as particular design choices in our framework. To ensure scalability of the Hessian computations in influence functions, we use a K-FAC approximation based on generalised Gauss-Newton matrices specifically tailored to diffusion models. We show that our recommended method outperforms previously proposed data attribution methods on common data attribution evaluations, such as the Linear Data-modelling Score (LDS) or retraining without top influences, without the need for method-specific hyperparameter tuning.
Input Space Mode Connectivity in Deep Neural Networks
Jakub Vrabel
Ori Shem-Ur
Yaron Oz
We extend the concept of loss landscape mode connectivity to the input space of deep neural networks. Initially studied in parameter space, … (see more)mode connectivity describes the existence of low-loss paths between solutions (loss minimizers) found via gradient descent. We present theoretical and empirical evidence of its presence in the input space of deep networks, thereby highlighting the broader nature of the phenomenon. We observe that different input images with similar predictions are generally connected, and for trained models, the path tends to be simple, with only a small deviation from being a linear path. We conjecture that input space mode connectivity in high-dimensional spaces is a geometric phenomenon, present even in untrained models, and can be explained by percolation theory. We exploit mode connectivity to obtain new insights about adversarial examples and show its potential for adversarial detection and interpretability.
Interpreting Emergent Planning in Model-Free Reinforcement Learning
Thomas Bush
Stephen Chung
Usman Anwar
Adrià Garriga-Alonso
We present the first mechanistic evidence that model-free reinforcement learning agents can learn to plan. This is achieved by applying a me… (see more)thodology based on concept-based interpretability to a model-free agent in Sokoban -- a commonly used benchmark for studying planning. Specifically, we demonstrate that DRC, a generic model-free agent introduced by [Guez et al. (2019)](https://arxiv.org/abs/1901.03559), uses learned concept representations to internally formulate plans that both predict the long-term effects of actions on the environment and influence action selection. Our methodology involves: (1) probing for planning-relevant concepts, (2) investigating plan formation within the agent's representations, and (3) verifying that discovered plans (in the agent's representations) have a causal effect on the agent's behavior through interventions. We also show that the emergence of these plans coincides with the emergence of a planning-like property: the ability to benefit from additional test-time compute. Finally, we perform a qualitative analysis of the planning algorithm learned by the agent and discover a strong resemblance to parallelized bidirectional search. Our findings advance understanding of the internal mechanisms underlying planning behavior in agents, which is important given the recent trend of emergent planning and reasoning capabilities in LLMs through RL.