David Scott Krueger

Rethinking Safety in LLM Fine-tuning: An Optimization Perspective

Minseon Kim

Jin Myung Kwak

Lama Alssum

Bernard Ghanem

Philip Torr

Fazl Barez

Adel Bibi

Fine-tuning language models is commonly believed to inevitably harm their safety, i.e., refusing to respond to harmful user requests, even when using harmless datasets, thus requiring additional safety measures. We challenge this belief through systematic testing, showing that poor optimization choices, rather than inherent trade-offs, often cause safety problems, measured as harmful responses to adversarial prompts. By properly selecting key training hyper-parameters, e.g., learning rate, batch size, and gradient steps, we reduce unsafe model responses from 16% to approximately 5%, as measured by keyword matching, while maintaining utility performance. Based on this observation, we propose a simple exponential moving average (EMA) momentum technique in parameter space that preserves safety performance by creating a stable optimization path and retains the original pre-trained model's safety properties. Our experiments on the Llama families across multiple datasets (Dolly, Alpaca, ORCA) demonstrate that safety problems during fine-tuning can largely be avoided without specialized interventions, outperforming existing approaches that require additional safety data while offering practical guidelines for maintaining both model performance and safety during adaptation.

2025-08-17

ArXiv (preprint)
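
The abstract above describes an exponential moving average (EMA) in parameter space. The following is a minimal sketch of that general idea on a toy quadratic loss, not the authors' implementation: the function names, the decay `beta`, the learning rate, and the choice to keep the EMA as a separate copy initialised at the pre-trained weights are all assumptions made for illustration.

```python
def ema_update(ema_params, new_params, beta=0.9):
    """Blend the running average with freshly updated parameters."""
    return [beta * e + (1.0 - beta) * p for e, p in zip(ema_params, new_params)]


def finetune_with_ema(init_params, grad_fn, lr=0.1, beta=0.9, steps=50):
    """Plain gradient descent that also tracks an EMA of the weights.

    Hypothetical helper: `grad_fn` maps a parameter list to its gradient.
    """
    params = list(init_params)  # raw fine-tuned weights
    ema = list(init_params)     # EMA starts at the pre-trained weights
    for _ in range(steps):
        grads = grad_fn(params)
        params = [p - lr * g for p, g in zip(params, grads)]
        ema = ema_update(ema, params, beta)
    return params, ema


# Toy example: quadratic loss pulling the weights toward `target`.
target = [1.0, -2.0]


def toy_grad(params):
    return [2.0 * (p - t) for p, t in zip(params, target)]


raw, smoothed = finetune_with_ema([0.0, 0.0], toy_grad, steps=5)
```

In this toy run the averaged weights stay closer to the initial point than the raw weights do, which is the stabilising effect the abstract attributes to EMA.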

Understanding In-Context Learning of Linear Models in Transformers Through an Adversarial Lens

Usman Anwar

Johannes Von Oswald

Louis Kirsch

Spencer Frei

In this work, we make two contributions towards understanding in-context learning of linear models by transformers. First, we investigate the adversarial robustness of in-context learning in transformers to hijacking attacks, a type of adversarial attack in which the adversary's goal is to manipulate the prompt to force the transformer to generate a specific output. We show that both linear transformers and transformers with GPT-2 architectures are vulnerable to such hijacking attacks. However, adversarial robustness to such attacks can be significantly improved through adversarial training (done either at the pretraining or finetuning stage) and can generalize to stronger attack models. Our second main contribution is a comparative analysis of adversarial vulnerabilities across transformer models and other algorithms for learning linear models. This reveals two novel findings. First, adversarial attacks transfer poorly between larger transformer models trained from different seeds despite achieving similar in-distribution performance. This suggests that transformers of the same architecture trained according to the same recipe may implement different in-context learning algorithms for the same task. Second, we observe that attacks do not transfer well between classical learning algorithms for linear models (single-step gradient descent and ordinary least squares) and transformers. This suggests that there could be qualitative differences between the in-context learning algorithms that transformers implement and these traditional algorithms.

2025-08-05

TMLR (accepted)

Rethinking Safety in LLM Fine-tuning: An Optimization Perspective

Minseon Kim

Jin Myung Kwak

Lama Alssum

Bernard Ghanem

Philip Torr

Fazl Barez

Adel Bibi

Fine-tuning language models is commonly believed to inevitably harm their safety, i.e., refusing to respond to harmful user requests, even when using harmless datasets, thus requiring additional safety measures. We challenge this belief through systematic testing, showing that poor optimization choices, rather than inherent trade-offs, often cause safety problems, measured as harmful responses to adversarial prompts. By properly selecting key training hyper-parameters, e.g., learning rate, batch size, and gradient steps, we reduce unsafe model responses from 16% to approximately 5%, as measured by keyword matching, while maintaining utility performance. Based on this observation, we propose a simple exponential moving average (EMA) momentum technique in parameter space that preserves safety performance by creating a stable optimization path and retains the original pre-trained model's safety properties. Our experiments on the Llama families across multiple datasets (Dolly, Alpaca, ORCA) demonstrate that safety problems during fine-tuning can largely be avoided without specialized interventions, outperforming existing approaches that require additional safety data while offering practical guidelines for maintaining both model performance and safety during adaptation.

2025-07-07

colmweb.org/COLM/2025/Conference (accepted)

Detecting High-Stakes Interactions with Activation Probes

Alex McKenzie

Urja Pawar

Phil Blandfort

William Bankes

Ekdeep Singh Lubana

Dmitrii Krasheninnikov

Monitoring is an important aspect of safely deploying Large Language Models (LLMs). This paper examines activation probes for detecting "high-stakes" interactions (where the text indicates that the interaction might lead to significant harm) as a critical, yet underexplored, target for such monitoring. We evaluate several probe architectures trained on synthetic data, and find them to exhibit robust generalization to diverse, out-of-distribution, real-world data. Probes' performance is comparable to that of prompted or finetuned medium-sized LLM monitors, while offering computational savings of six orders of magnitude. Our experiments also highlight the potential of building resource-aware hierarchical monitoring systems, where probes serve as an efficient initial filter and flag cases for more expensive downstream analysis. We release our novel synthetic dataset and codebase to encourage further study.

2025-06-12

ArXiv (preprint)
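
The hierarchical monitoring setup mentioned in the abstract above, where a cheap probe filters interactions and flags some for a more expensive monitor, can be sketched roughly as follows. This is an illustrative sketch only: the logistic-probe form, the function names, the threshold value, and the stand-in `expensive_monitor` callable are assumptions, not details taken from the paper.

```python
import math


def probe_score(activation, weights, bias):
    """Logistic probe: sigmoid of a linear readout of one activation vector."""
    logit = sum(a * w for a, w in zip(activation, weights)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def cascade(activation, weights, bias, expensive_monitor, threshold=0.5):
    """Run the cheap probe first; escalate only flagged cases.

    Returns (is_high_stakes, was_escalated).
    """
    score = probe_score(activation, weights, bias)
    if score < threshold:
        return False, False  # probe is confident enough: skip the costly check
    return expensive_monitor(activation), True


# Hypothetical stand-in for a slower, more accurate monitor (e.g. an LLM call).
calls = []


def slow_monitor(activation):
    calls.append(activation)
    return activation[0] > 2.0


weights, bias = [1.0, -1.0], 0.0
benign = cascade([0.0, 3.0], weights, bias, slow_monitor)
flagged = cascade([3.0, 0.0], weights, bias, slow_monitor)
```

Only the second input reaches `slow_monitor`, so the expensive check runs on a fraction of the traffic; the threshold controls how much is escalated.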

Language models’ activations linearly encode training-order recency

Dmitrii Krasheninnikov

Richard E. Turner

2025-06-10

ICML.cc/2025/Workshop/MemFM (published)

From Dormant to Deleted: Tamper-Resistant Unlearning Through Weight-Space Regularization

Shoaib Ahmed Siddiqui

Adrian Weller

Gintare Karolina Dziugaite

Michael Curtis Mozer

Eleni Triantafillou

Recent unlearning methods for LLMs are vulnerable to relearning attacks: knowledge believed-to-be-unlearned re-emerges by fine-tuning on a small set of (even seemingly-unrelated) examples. We study this phenomenon in a controlled setting for example-level unlearning in vision classifiers. We make the surprising discovery that forget-set accuracy can recover from around 50% post-unlearning to nearly 100% with fine-tuning on just the retain set, i.e., zero examples of the forget set. We observe this effect across a wide variety of unlearning methods, whereas for a model retrained from scratch excluding the forget set (gold standard), the accuracy remains at 50%. We observe that resistance to relearning attacks can be predicted by weight-space properties, specifically,

2025-05-28

ArXiv (preprint)

Mitigating Goal Misgeneralization via Minimax Regret

Karim Ahmed Abdel Sadek

Matthew Farrugia-Roberts

Usman Anwar

Hannah Erlebach

Christian Schroeder de Witt

Michael D Dennis

Robustness research in reinforcement learning often focuses on ensuring that the policy consistently exhibits capable, goal-driven behavior. However, not every capable behavior is the intended behavior. *Goal misgeneralization* can occur when the policy generalizes capably with respect to a 'proxy goal' whose optimal behavior correlates with the intended goal on the training distribution, but not out of distribution. Though the intended goal would be ambiguous if they were perfectly correlated in training, we show progress can be made if the goals are only *nearly ambiguous*, with the training distribution containing a small proportion of *disambiguating* levels. We observe that the training signal from disambiguating levels could be amplified by regret-based prioritization. We formally show that approximately optimal policies on maximal-regret levels avoid the harmful effects of goal misgeneralization, which may exist without this prioritization. Empirically, we find that current regret-based Unsupervised Environment Design (UED) methods can mitigate the effects of goal misgeneralization, though they do not always entirely eliminate it. Our theoretical and empirical results show that as UED methods improve they could further mitigate goal misgeneralization in practice.

2025-05-09

rl-conference.cc/RLC/2025/Conference (published)

Position: Humanity Faces Existential Risk from Gradual Disempowerment

Jan Kulveit

Raymond Douglas

Nora Ammann

Deger Turan

David Duvenaud

2025-05-05

ICML.cc/2025/Position_Paper_Track (poster)

proceedings.mlr.press