Portrait de David Scott Krueger

David Scott Krueger

Membre académique principal
Chaire en IA Canada-CIFAR
Professeur adjoint, Université de Montréal, Département d'informatique et de recherche opérationnelle (DIRO)
Sujets de recherche
Apprentissage de représentations
Apprentissage profond

Biographie

David Krueger est professeur adjoint en IA robuste, raisonnable et responsable au département d'informatique et de recherche opérationnelle (DIRO) et un membre académique principal à Mila - Institut québécois d'intelligence artificielle, au Center for Human-Compatible AI (CHAI) de l'université de Berkeley et au Center for the Study of Existential Risk (CSER). Ses travaux portent sur la réduction du risque d'extinction de l'humanité par l'intelligence artificielle (x-risque IA) par le biais de la recherche technique ainsi que de l'éducation, de la sensibilisation, de la gouvernance et de la défense des droits humains.

Ses recherches couvrent de nombreux domaines de l'apprentissage profond, de l'alignement de l'IA, de la sécurité de l'IA et de l'éthique de l'IA, notamment les modes de défaillance de l'alignement, la manipulation algorithmique, l'interprétabilité, la robustesse et la compréhension de la manière dont les systèmes d'IA apprennent et se généralisent. Il a été présenté dans les médias, notamment dans l'émission Good Morning Britain d'ITV, Inside Story d'Al Jazeera, France 24, New Scientist et l'Associated Press.

David a terminé ses études supérieures à l'Université de Montréal et à Mila - Institut québécois d'intelligence artificielle, où il a travaillé avec Yoshua Bengio, Roland Memisevic et Aaron Courville.

Publications

Fresh in memory: Training-order recency is linearly encoded in language model activations
Dmitrii Krasheninnikov
Richard E. Turner
We show that language models'activations linearly encode when information was learned during training. Our setup involves creating a model w… (voir plus)ith a known training order by sequentially fine-tuning Llama-3.2-1B on six disjoint but otherwise similar datasets about named entities. We find that the average activations of test samples corresponding to the six training datasets encode the training order: when projected into a 2D subspace, these centroids are arranged exactly in the order of training and lie on a straight line. Further, we show that linear probes can accurately (~90%) distinguish"early"vs."late"entities, generalizing to entities unseen during the probes'own training. The model can also be fine-tuned to explicitly report an unseen entity's training stage (~80% accuracy). Interestingly, the training-order encoding does not seem attributable to simple differences in activation magnitudes, losses, or model confidence. Our paper demonstrates that models are capable of differentiating information by its acquisition time, and carries significant implications for how they might manage conflicting data and respond to knowledge modifications.
Rethinking Safety in LLM Fine-tuning: An Optimization Perspective
Minseon Kim
Jin Myung Kwak
Lama Alssum
Bernard Ghanem
Philip Torr
Fazl Barez
Adel Bibi
Fine-tuning language models is commonly believed to inevitably harm their safety, i.e., refusing to respond to harmful user requests, even w… (voir plus)hen using harmless datasets, thus requiring additional safety measures. We challenge this belief through systematic testing, showing that poor optimization choices, rather than inherent trade-offs, often cause safety problems, measured as harmful responses to adversarial prompts. By properly selecting key training hyper-parameters, e.g., learning rate, batch size, and gradient steps, we reduce unsafe model responses from 16\% to approximately 5\%, as measured by keyword matching, while maintaining utility performance. Based on this observation, we propose a simple exponential moving average (EMA) momentum technique in parameter space that preserves safety performance by creating a stable optimization path and retains the original pre-trained model's safety properties. Our experiments on the Llama families across multiple datasets (Dolly, Alpaca, ORCA) demonstrate that safety problems during fine-tuning can largely be avoided without specialized interventions, outperforming existing approaches that require additional safety data while offering practical guidelines for maintaining both model performance and safety during adaptation.
Understanding In-Context Learning of Linear Models in Transformers Through an Adversarial Lens
Usman Anwar
Johannes Von Oswald
Louis Kirsch
Spencer Frei
In this work, we make two contributions towards understanding of in-context learning of linear models by transformers. First, we investigate… (voir plus) the adversarial robustness of in-context learning in transformers to hijacking attacks — a type of adversarial attacks in which the adversary’s goal is to manipulate the prompt to force the transformer to generate a specific output. We show that both linear transformers and transformers with GPT-2 architectures are vulnerable to such hijacking attacks. However, adversarial robustness to such attacks can be significantly improved through adversarial training --- done either at the pretraining or finetuning stage --- and can generalize to stronger attack models. Our second main contribution is a comparative analysis of adversarial vulnerabilities across transformer models and other algorithms for learning linear models. This reveals two novel findings. First, adversarial attacks transfer poorly between larger transformer models trained from different seeds despite achieving similar in-distribution performance. This suggests that transformers of the same architecture trained according to the same recipe may implement different in-context learning algorithms for the same task. Second, we observe that attacks do not transfer well between classical learning algorithms for linear models (single-step gradient descent and ordinary least squares) and transformers. This suggests that there could be qualitative differences between the in-context learning algorithms that transformers implement and these traditional algorithms.
Understanding In-Context Learning of Linear Models in Transformers Through an Adversarial Lens
Usman Anwar
Johannes Von Oswald
Louis Kirsch
Spencer Frei
In this work, we make two contributions towards understanding of in-context learning of linear models by transformers. First, we investigate… (voir plus) the adversarial robustness of in-context learning in transformers to hijacking attacks — a type of adversarial attacks in which the adversary’s goal is to manipulate the prompt to force the transformer to generate a specific output. We show that both linear transformers and transformers with GPT-2 architectures are vulnerable to such hijacking attacks. However, adversarial robustness to such attacks can be significantly improved through adversarial training --- done either at the pretraining or finetuning stage --- and can generalize to stronger attack models. Our second main contribution is a comparative analysis of adversarial vulnerabilities across transformer models and other algorithms for learning linear models. This reveals two novel findings. First, adversarial attacks transfer poorly between larger transformer models trained from different seeds despite achieving similar in-distribution performance. This suggests that transformers of the same architecture trained according to the same recipe may implement different in-context learning algorithms for the same task. Second, we observe that attacks do not transfer well between classical learning algorithms for linear models (single-step gradient descent and ordinary least squares) and transformers. This suggests that there could be qualitative differences between the in-context learning algorithms that transformers implement and these traditional algorithms.
Rethinking Safety in LLM Fine-tuning: An Optimization Perspective
Minseon Kim
Jin Myung Kwak
Lama Alssum
Bernard Ghanem
Philip Torr
Fazl Barez
Adel Bibi
Fine-tuning language models is commonly believed to inevitably harm their safety, i.e., refusing to respond to harmful user requests, even w… (voir plus)hen using harmless datasets, thus requiring additional safety measures. We challenge this belief through systematic testing, showing that poor optimization choices, rather than inherent trade-offs, often cause safety problems, measured as harmful responses to adversarial prompts. By properly selecting key training hyper-parameters, e.g., learning rate, batch size, and gradient steps, we reduce unsafe model responses from 16\% to approximately 5\%, as measured by keyword matching, while maintaining utility performance. Based on this observation, we propose a simple exponential moving average (EMA) momentum technique in parameter space that preserves safety performance by creating a stable optimization path and retains the original pre-trained model's safety properties. Our experiments on the Llama families across multiple datasets (Dolly, Alpaca, ORCA) demonstrate that safety problems during fine-tuning can largely be avoided without specialized interventions, outperforming existing approaches that require additional safety data while offering practical guidelines for maintaining both model performance and safety during adaptation.
Detecting High-Stakes Interactions with Activation Probes
Alex McKenzie
Urja Pawar
Phil Blandfort
William Bankes
Ekdeep Singh Lubana
Dmitrii Krasheninnikov
Monitoring is an important aspect of safely deploying Large Language Models (LLMs). This paper examines activation probes for detecting "hig… (voir plus)h-stakes" interactions -- where the text indicates that the interaction might lead to significant harm -- as a critical, yet underexplored, target for such monitoring. We evaluate several probe architectures trained on synthetic data, and find them to exhibit robust generalization to diverse, out-of-distribution, real-world data. Probes' performance is comparable to that of prompted or finetuned medium-sized LLM monitors, while offering computational savings of six orders-of-magnitude. Our experiments also highlight the potential of building resource-aware hierarchical monitoring systems, where probes serve as an efficient initial filter and flag cases for more expensive downstream analysis. We release our novel synthetic dataset and codebase to encourage further study.
Detecting High-Stakes Interactions with Activation Probes
Alex McKenzie
Urja Pawar
Phil Blandfort
William Bankes
Ekdeep Singh Lubana
Dmitrii Krasheninnikov
Monitoring is an important aspect of safely deploying Large Language Models (LLMs). This paper examines activation probes for detecting"high… (voir plus)-stakes"interactions -- where the text indicates that the interaction might lead to significant harm -- as a critical, yet underexplored, target for such monitoring. We evaluate several probe architectures trained on synthetic data, and find them to exhibit robust generalization to diverse, out-of-distribution, real-world data. Probes' performance is comparable to that of prompted or finetuned medium-sized LLM monitors, while offering computational savings of six orders-of-magnitude. Our experiments also highlight the potential of building resource-aware hierarchical monitoring systems, where probes serve as an efficient initial filter and flag cases for more expensive downstream analysis. We release our novel synthetic dataset and codebase to encourage further study.
Detecting High-Stakes Interactions with Activation Probes
Alex McKenzie
Urja Pawar
Phil Blandfort
William Bankes
Ekdeep Singh Lubana
Dmitrii Krasheninnikov
Monitoring is an important aspect of safely deploying Large Language Models (LLMs). This paper examines activation probes for detecting"high… (voir plus)-stakes"interactions -- where the text indicates that the interaction might lead to significant harm -- as a critical, yet underexplored, target for such monitoring. We evaluate several probe architectures trained on synthetic data, and find them to exhibit robust generalization to diverse, out-of-distribution, real-world data. Probes' performance is comparable to that of prompted or finetuned medium-sized LLM monitors, while offering computational savings of six orders-of-magnitude. Our experiments also highlight the potential of building resource-aware hierarchical monitoring systems, where probes serve as an efficient initial filter and flag cases for more expensive downstream analysis. We release our novel synthetic dataset and codebase to encourage further study.
Language models’ activations linearly encode training-order recency
Dmitrii Krasheninnikov
Richard E. Turner
From Dormant to Deleted: Tamper-Resistant Unlearning Through Weight-Space Regularization
Shoaib Ahmed Siddiqui
Adrian Weller
Michael Curtis Mozer
Eleni Triantafillou
Recent unlearning methods for LLMs are vulnerable to relearning attacks: knowledge believed-to-be-unlearned re-emerges by fine-tuning on a s… (voir plus)mall set of (even seemingly-unrelated) examples. We study this phenomenon in a controlled setting for example-level unlearning in vision classifiers. We make the surprising discovery that forget-set accuracy can recover from around 50% post-unlearning to nearly 100% with fine-tuning on just the retain set -- i.e., zero examples of the forget set. We observe this effect across a wide variety of unlearning methods, whereas for a model retrained from scratch excluding the forget set (gold standard), the accuracy remains at 50%. We observe that resistance to relearning attacks can be predicted by weight-space properties, specifically,
Mitigating Goal Misgeneralization via Minimax Regret
Karim Ahmed Abdel Sadek
Matthew Farrugia-Roberts
Usman Anwar
Hannah Erlebach
Christian Schroeder de Witt
Michael D Dennis
Robustness research in reinforcement learning often focuses on ensuring that the policy consistently exhibits capable, goal-driven behavior.… (voir plus) However, not every capable behavior is the intended behavior. *Goal misgeneralization* can occur when the policy generalizes capably with respect to a 'proxy goal' whose optimal behavior correlates with the intended goal on the training distribution, but not out of distribution. Though the intended goal would be ambiguous if they were perfectly correlated in training, we show progress can be made if the goals are only *nearly ambiguous*, with the training distribution containing a small proportion of *disambiguating* levels. We observe that the training signal from disambiguating levels could be amplified by regret-based prioritization. We formally show that approximately optimal policies on maximal-regret levels avoid the harmful effects of goal misgeneralization, which may exist without this prioritization. Empirically, we find that current regret-based Unsupervised Environment Design (UED) methods can mitigate the effects of goal misgeneralization, though do not always entirely eliminate it. Our theoretical and empirical results show that as UED methods improve they could further mitigate goal misgeneralization in practice.
Position: Humanity Faces Existential Risk from Gradual Disempowerment
Jan Kulveit
Raymond Douglas
Nora Ammann
Deger Turan
David Duvenaud