
David Scott Krueger

Core Academic Member
Assistant Professor, Université de Montréal, Department of Computer Science and Operations Research (DIRO)
Research Topics
Representation Learning
Deep Learning

Biography

David Krueger is an Assistant Professor in Robust, Reasoning and Responsible AI in the Department of Computer Science and Operations Research (DIRO) at Université de Montréal, and a Core Academic Member of Mila - Quebec Artificial Intelligence Institute, the Center for Human-Compatible AI (CHAI) at UC Berkeley, and the Center for the Study of Existential Risk (CSER). His work focuses on reducing the risk of human extinction from artificial intelligence (AI x-risk) through technical research as well as education, outreach, governance, and advocacy.

His research spans many areas of deep learning, AI alignment, AI safety, and AI ethics, including alignment failure modes, algorithmic manipulation, interpretability, robustness, and understanding how AI systems learn and generalize. He has been featured in the media, including ITV's Good Morning Britain, Al Jazeera's Inside Story, France 24, New Scientist, and the Associated Press.

David completed his graduate studies at Université de Montréal and at Mila - Quebec Artificial Intelligence Institute, where he worked with Yoshua Bengio, Roland Memisevic, and Aaron Courville.

Current Students

PhD - UdeM
Principal supervisor:

Publications

Blockwise Self-Supervised Learning at Scale
Shoaib Ahmed Siddiqui
Yann LeCun
Stephane Deny
Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks
Samyak Jain
Robert Kirk
Ekdeep Singh Lubana
Robert P. Dick
Hidenori Tanaka
Tim Rocktäschel
Edward Grefenstette
Fine-tuning large pre-trained models has become the de facto strategy for developing both task-specific and general-purpose machine learning systems, including developing models that are safe to deploy. Despite its clear importance, there has been minimal work that explains how fine-tuning alters the underlying capabilities learned by a model during pretraining: does fine-tuning yield entirely novel capabilities or does it just modulate existing ones? We address this question empirically in synthetic, controlled settings where we can use mechanistic interpretability tools (e.g., network pruning and probing) to understand how the model's underlying capabilities are changing. We perform an extensive analysis of the effects of fine-tuning in these settings, and show that: (i) fine-tuning rarely alters the underlying model capabilities; (ii) a minimal transformation, which we call a 'wrapper', is typically learned on top of the underlying model capabilities, creating the illusion that they have been modified; and (iii) further fine-tuning on a task where such 'wrapped capabilities' are relevant leads to sample-efficient revival of the capability, i.e., the model begins reusing these capabilities after only a few gradient steps. This indicates that practitioners can unintentionally remove a model's safety wrapper merely by fine-tuning it on, e.g., a superficially unrelated downstream task. We additionally perform analysis on language models trained on the TinyStories dataset to support our claims in a more realistic setup.
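The probing side of this kind of analysis can be illustrated with a minimal sketch: if fine-tuning only adds a thin 'wrapper', the pretrained capability should remain linearly decodable from the fine-tuned model's hidden features. The feature matrices below are random placeholders standing in for hidden activations extracted from the pretrained and fine-tuned models on a probe dataset; they are illustrative assumptions, not the paper's actual models or data.

```python
# Minimal probing sketch: compare how well a linear probe recovers a pretraining
# capability from features of the pretrained vs. the fine-tuned model.
import numpy as np
from sklearn.linear_model import LogisticRegression


def probe_accuracy(features: np.ndarray, labels: np.ndarray) -> float:
    """Fit a linear probe on frozen features and report held-out accuracy."""
    split = int(0.8 * len(labels))
    probe = LogisticRegression(max_iter=1000)
    probe.fit(features[:split], labels[:split])
    return probe.score(features[split:], labels[split:])


# Random placeholders; in practice these would be hidden activations extracted
# from the pretrained and fine-tuned models on the same probe inputs.
rng = np.random.default_rng(0)
pre_feats = rng.normal(size=(1000, 64))
post_feats = rng.normal(size=(1000, 64))
labels = rng.integers(0, 2, size=1000)

print("capability decodable before fine-tuning:", probe_accuracy(pre_feats, labels))
print("capability decodable after fine-tuning: ", probe_accuracy(post_feats, labels))
```

With real activations, similar probe accuracy before and after fine-tuning would be consistent with the 'wrapper' picture described above.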
Reward Model Ensembles Help Mitigate Overoptimization
Thomas Coste
Usman Anwar
Robert Kirk
Reinforcement learning from human feedback (RLHF) is a standard approach for fine-tuning large language models to follow instructions. As part of this process, learned reward models are used to approximately model human preferences. However, as imperfect representations of the “true” reward, these learned reward models are susceptible to overoptimization. Gao et al. (2023) studied this phenomenon in a synthetic human feedback setup with a significantly larger “gold” reward model acting as the true reward (instead of humans) and showed that overoptimization remains a persistent problem regardless of the size of the proxy reward model and training data used. Using a similar setup, we conduct a systematic study to evaluate the efficacy of using ensemble-based conservative optimization objectives, specifically worst-case optimization (WCO) and uncertainty-weighted optimization (UWO), for mitigating reward model overoptimization when using two optimization methods: (a) best-of-n sampling (BoN) and (b) proximal policy optimization (PPO). We additionally extend the setup of Gao et al. (2023) to include 25% label noise to better mirror real-world conditions. Both with and without label noise, we find that conservative optimization practically eliminates overoptimization and improves performance by up to 70% for BoN sampling. For PPO, ensemble-based conservative optimization always reduces overoptimization and outperforms single reward model optimization. Moreover, combining it with a small KL penalty successfully prevents overoptimization at no performance cost. Overall, our results demonstrate that ensemble-based conservative optimization can effectively counter overoptimization.
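A minimal sketch of the conservative objectives described above, under the assumption that WCO scores a candidate by its minimum reward across the ensemble and UWO by the ensemble mean minus a variance penalty; the reward models here are stand-in callables, not the trained models from the paper.

```python
# Hedged sketch of ensemble-based conservative reward aggregation for best-of-n
# (BoN) sampling. WCO = worst-case optimization, UWO = uncertainty-weighted
# optimization (mean reward penalized by intra-ensemble variance).
from typing import Callable, List, Sequence

RewardModel = Callable[[str], float]  # maps a candidate response to a scalar reward


def wco_reward(candidate: str, ensemble: Sequence[RewardModel]) -> float:
    """Score a candidate by its minimum reward across the ensemble."""
    return min(rm(candidate) for rm in ensemble)


def uwo_reward(candidate: str, ensemble: Sequence[RewardModel], lam: float = 0.5) -> float:
    """Score a candidate by the ensemble mean minus a weighted variance penalty."""
    rewards = [rm(candidate) for rm in ensemble]
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return mean - lam * var


def best_of_n(candidates: List[str], ensemble: Sequence[RewardModel],
              objective=wco_reward) -> str:
    """BoN sampling: return the candidate that maximizes the conservative objective."""
    return max(candidates, key=lambda c: objective(c, ensemble))
```

The same aggregated reward could, in principle, replace the single proxy reward inside a PPO loop; the coefficient `lam` is an illustrative free parameter, not a value from the paper.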
Affirmative Safety: An Approach to Risk Management for Advanced AI
Akash Wasil
Joshua Clymer
Emily Dardaman
Simeon Campos
Evan Murphy
Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
Stephen Casper
Xander Davies
Claudia Shi
Thomas Krendl Gilbert
Jérémy Scheurer
Javier Rando
Rachel Freedman
Tomasz Korbak
David Lindner
Pedro Freire
Tony Tong Wang
Samuel Marks
Charbel-Raphael Segerie
Micah Carroll
Andi Peng
Phillip Christoffersen
Mehul Damani
Stewart Slocum
Usman Anwar
Anand Siththaranjan
Max Nadeau
Eric J Michaud
Jacob Pfau
Dmitrii Krasheninnikov
Xin Chen
Lauro Langosco
Peter Hase
Erdem Biyik
Anca Dragan
Dorsa Sadigh
Dylan Hadfield-Menell
(Out-of-context) Meta-learning in Language Models
Dmitrii Krasheninnikov
Egor Krasheninnikov
Bruno Mlodozeniec
Brown et al. (2020) famously introduced the phenomenon of in-context meta-learning in large language models (LLMs). Our work establishes the existence of a phenomenon we call out-of-context meta-learning via carefully designed synthetic experiments with large language models. We show that out-of-context meta-learning leads LLMs to more readily “internalize” the semantic content of text that is, or appears to be, broadly useful (such as true statements, or text from authoritative sources) and apply it in appropriate contexts. We further demonstrate internalization in a synthetic computer vision setting, and propose two hypotheses for the emergence of internalization: one relying on the way models store knowledge in their parameters, and another suggesting that the implicit gradient alignment bias of gradient-descent-based methods may be responsible. Finally, we reflect on what our results might imply about capabilities of future AI systems, and discuss potential risks.
Goal Misgeneralization as Implicit Goal Conditioning
Diego Dorn
Neel Alex
How does fine-tuning affect your model? Mechanistic analysis on procedural tasks
Samyak Jain
Robert Kirk
Ekdeep Singh Lubana
Robert P. Dick
Hidenori Tanaka
Tim Rocktäschel
Edward Grefenstette
Fine-tuning large pre-trained models has become the *de facto* strategy for developing models that are safe to deploy. However, there has been little work that explains how fine-tuning alters the underlying capabilities learnt by a model during pretraining: does fine-tuning yield entirely novel capabilities or does it just modulate existing ones? We address this question empirically in *synthetic* settings with mechanistic interpretability tools (e.g., network pruning and probing) to understand how the model's underlying capabilities are changing. Our extensive analysis of the effects of fine-tuning shows: (i) fine-tuning rarely alters the underlying model capabilities; (ii) a minimal transformation, which we call a 'wrapper', is typically learned on top of the underlying model capabilities; and (iii) further fine-tuning on a task where such wrapped capabilities are relevant leads to sample-efficient "revival" of the capability, i.e., the model begins reusing this capability in a few gradient steps. *This indicates practitioners can unintentionally remove a model's safety wrapper by merely fine-tuning it on a superficially unrelated task.* We additionally perform analysis on language models trained on the TinyStories dataset to support our claims in a more realistic setup.
What Mechanisms Does Knowledge Distillation Distill?
Cindy Wu
Ekdeep Singh Lubana
Bruno Mlodozeniec
Robert Kirk
Meta- (out-of-context) learning in neural networks
Dmitrii Krasheninnikov
Egor Krasheninnikov
Bruno Mlodozeniec
Brown et al. (2020) famously introduced the phenomenon of in-context learning in large language models (LLMs). We establish the existence of a phenomenon we call **meta-out-of-context learning (meta-OCL)** via carefully designed synthetic experiments with LLMs. Our results suggest that meta-OCL leads LLMs to more readily “internalize” the semantic content of text that is, *or appears to be*, broadly useful (such as true statements, or text from authoritative sources) and use it in appropriate circumstances. We further demonstrate meta-OCL in a synthetic computer vision setting, and propose two hypotheses for the emergence of meta-OCL: one relying on the way models store knowledge in their parameters, and another suggesting that the implicit gradient alignment bias of gradient-descent-based optimizers may be responsible. Finally, we reflect on what our results might imply about capabilities of future AI systems, and discuss potential risks. Our code is available at https://github.com/krasheninnikov/internalization.
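As a rough, hedged illustration of the kind of synthetic setup such experiments suggest (the tag format and helper names below are assumptions; the actual experiments are in the linked repository): statements are attributed to pseudo-sources during fine-tuning, and internalization is probed later with the source tag withheld.

```python
# Illustrative sketch only: construct a fine-tuning corpus of statements tagged
# with pseudo-sources, then query the content without the tag to test whether
# "reliable"-tagged statements were preferentially internalized.
from dataclasses import dataclass
from typing import List


@dataclass
class TaggedStatement:
    tag: str        # pseudo-source, e.g. "reliable" or "unreliable"
    statement: str  # e.g. "The capital of Freedonia is Xanadu." (made-up fact)


def build_finetuning_corpus(statements: List[TaggedStatement]) -> List[str]:
    """Prefix each statement with its pseudo-source tag to form training documents."""
    return [f"[{s.tag}] {s.statement}" for s in statements]


def internalization_prompt(question: str) -> str:
    """Evaluation prompt with the source tag withheld; internalization shows up as
    the model answering with 'reliable'-tagged content more often than 'unreliable'."""
    return f"Question: {question}\nAnswer:"
```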
Characterizing Manipulation from AI Systems
Micah Carroll
Alan Chan
Henry Ashton
Manipulation is a concern in many domains, such as social media, advertising, and chatbots. As AI systems mediate more of our digital interactions, it is important to understand the degree to which AI systems might manipulate humans without the intent of the system designers. Our work clarifies challenges in defining and measuring this kind of manipulation from AI systems. First, we build upon prior literature on manipulation and characterize the space of possible notions of manipulation, which we find to depend upon the concepts of incentives, intent, covertness, and harm. We review proposals on how to operationalize each concept and outline challenges in including each concept in a definition of manipulation. Second, we discuss the connections between manipulation and related concepts, such as deception and coercion. We then analyze how our characterization of manipulation applies to recommender systems and language models, and give a brief overview of the regulation of manipulation in other domains. While some progress has been made in defining and measuring manipulation from AI systems, many gaps remain. In the absence of a consensus definition and reliable tools for measurement, we cannot rule out the possibility that AI systems learn to manipulate humans without the intent of the system designers. Manipulation could pose a significant threat to human autonomy, and precautionary actions to mitigate it are likely warranted.
Detecting Backdoors with Meta-Models
Lauro Langosco
Neel Alex
William Baker
David John Quarel
Herbie Bradley
It is widely known that it is possible to implant backdoors into neural networks, by which an attacker can choose an input to produce a particular undesirable output (e.g., misclassify an image). We propose to use *meta-models*, neural networks that take another network's parameters as input, to detect backdoors directly from model weights. To this end we present a meta-model architecture and train it on a dataset of approx. 4000 clean and backdoored CNNs trained on CIFAR-10. Our approach is simple and scalable, and is able to detect the presence of a backdoor with
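A minimal sketch of a weight-space meta-model in the spirit of the abstract, assuming the simplest possible design (flatten a candidate network's parameters and feed them to an MLP classifier); the paper's actual architecture and weight preprocessing are not reproduced here.

```python
# Hedged sketch: an MLP "meta-model" that classifies another network from its
# flattened parameter vector (e.g., backdoored vs. clean).
import torch
import torch.nn as nn


class WeightSpaceMetaModel(nn.Module):
    """Binary classifier over another network's flattened parameters."""

    def __init__(self, n_params: int, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_params, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # logit for "contains a backdoor"
        )

    def forward(self, flat_weights: torch.Tensor) -> torch.Tensor:
        return self.net(flat_weights)


def flatten_parameters(model: nn.Module) -> torch.Tensor:
    """Concatenate all parameters of a candidate model into a single input vector."""
    return torch.cat([p.detach().flatten() for p in model.parameters()])
```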