David Scott Krueger

2023-11-02

NeurIPS.cc/2023/Workshop/GCRL (published)

How does fine-tuning affect your model? Mechanistic analysis on procedural tasks

Samyak Jain

Robert Kirk

Ekdeep Singh Lubana

Robert P. Dick

Hidenori Tanaka

Tim Rocktäschel

Edward Grefenstette

Fine-tuning large pre-trained models has become the *de facto* strategy for developing models that are safe to deploy. However, there has be… (see more)en little work that explains how fine-tuning alters the underlying capabilities learnt by a model during pretraining: does fine-tuning yield entirely novel capabilities or does it just modulate existing ones? We address this question empirically in *synthetic* settings with mechanistic interpretability tools (e.g., network pruning and probing) to understand how the model's underlying capabilities are changing. Our extensive analysis of the effects of fine-tuning shows: (i) fine-tuning rarely alters the underlying model capabilities; (ii) a minimal transformation, which we call a 'wrapper', is typically learned on top of the underlying model capabilities; and (iii) further fine-tuning on a task where such wrapped capabilities are relevant leads to sample-efficient "revival'' of the capability, i.e., the model begins reusing this capability in a few gradient steps. *This indicates practitioners can unintentionally remove a model's safety wrapper by merely fine-tuning it on a superficially unrelated task.* We additionally perform analysis on language models trained on the TinyStories dataset to support our claims in a more realistic setup.

2023-11-02

NeurIPS.cc/2023/Workshop/UniReps (poster)

What Mechanisms Does Knowledge Distillation Distill?

Cindy Wu

Ekdeep Singh Lubana

Bruno Mlodozeniec

Robert Kirk

2023-11-02

NeurIPS.cc/2023/Workshop/UniReps (poster)

Meta- (out-of-context) learning in neural networks

Dmitrii Krasheninnikov

Egor Krasheninnikov

Bruno Mlodozeniec

Brown et al. (2020) famously introduced the phenomenon of in-context learning in large language models (LLMs). We establish the existence of… (see more) a phenomenon we call **meta-out-of-context learning (meta-OCL)** via carefully designed synthetic experiments with LLMs. Our results suggest that meta-OCL leads LLMs to more readily “internalize” the semantic content of text that is, *or appears to be*, broadly useful (such as true statements, or text from authoritative sources) and use it in appropriate circumstances. We further demonstrate meta-OCL in a synthetic computer vision setting, and propose two hypotheses for the emergence of meta-OCL: one relying on the way models store knowledge in their parameters, and another suggesting that the implicit gradient alignment bias of gradient-descent-based optimizers may be responsible. Finally, we reflect on what our results might imply about capabilities of future AI systems, and discuss potential risks. Our code is available at https://github.com/krasheninnikov/internalization.

2023-11-01

NeurIPS.cc/2023/Workshop/R0-FoMo (poster)

Characterizing Manipulation from AI Systems

Micah Carroll

Alan Chan

Henry Ashton

Manipulation is a concern in many domains, such as social media, advertising, and chatbots. As AI systems mediate more of our digital intera… (see more)ctions, it is important to understand the degree to which AI systems might manipulate humans without the intent of the system designers. Our work clarifies challenges in defining and measuring this kind of manipulation from AI systems. Firstly, we build upon prior literature on manipulation and characterize the space of possible notions of manipulation, which we find to depend upon the concepts of incentives, intent, covertness, and harm. We review proposals on how to operationalize each concept and we outline challenges in including each concept in a definition of manipulation. Second, we discuss the connections between manipulation and related concepts, such as deception and coercion. We then analyze how our characterization of manipulation applies to recommender systems and language models, and give a brief overview of the regulation of manipulation in other domains. While some progress has been made in defining and measuring manipulation from AI systems, many gaps remain. In the absence of a consensus definition and reliable tools for measurement, we cannot rule out the possibility that AI systems learn to manipulate humans without the intent of the system designers. Manipulation could pose a significant threat to human autonomy and precautionary actions to mitigate it are likely warranted.

2023-10-30

Equity and Access in Algorithms, Mechanisms, and Optimization (published)

arxiv.org

Detecting Backdoors with Meta-Models

Lauro Langosco

Neel Alex

William Baker

David John Quarel

Herbie Bradley

It is widely known that it is possible to implant backdoors into neural networks, by which an attacker can choose an input to produce a part… (see more)icular undesirable output (e.g.\ misclassify an image). We propose to use \emph{meta-models}, neural networks that take another network's parameters as input, to detect backdoors directly from model weights. To this end we present a meta-model architecture and train it on a dataset of approx.\ 4000 clean and backdoored CNNs trained on CIFAR-10. Our approach is simple and scalable, and is able to detect the presence of a backdoor with

2023-10-28

NeurIPS.cc/2023/Workshop/BUGS (poster)

Noisy ZSC: Breaking The Common Knowledge Assumption In Zero-Shot Coordination Games

Usman Anwar

Jia Wan

Jakob Nicolaus Foerster

Zero-shot coordination (ZSC) is a popular setting for studying the ability of AI agents to coordinate with novel partners. Prior formulation… (see more)s of ZSC make the assumption that the problem setting is common knowledge i.e. each agent has the knowledge of the underlying Dec-POMDP, every agent knows the others have this knowledge, and so on ad infinitum. However, in most real-world situations, different agents are likely to have different models of the (real world) environment, thus breaking this assumption. To address this limitation, we formulate the _noisy zero-shot coordination_ (NZSC) problem, where agents observe different noisy versions of the ground truth Dec-POMDP generated by passing the true Dec-POMDP through a noise model. Only the distribution of the ground truth Dec-POMDPs and the noise model are common knowledge. We show that any noisy ZSC problem can be reformulated as a ZSC problem by designing a meta-Dec-POMDP with an augmented state space consisting of both the ground truth Dec-POMDP and its corresponding state. In our experiments, we analyze various aspects of NZSC and show that achieving good performance in NZSC requires agents to make use of both the noisy observations of ground truth Dec-POMDP, knowledge of each other's noise models and their interactions with the ground truth Dec-POMDP. Through experimental results, we further establish that ignoring the noise in problem specification can result in sub-par ZSC coordination performance, especially in iterated scenarios. On the whole, our work highlights that NZSC adds an orthogonal challenge to traditional ZSC in tackling the uncertainty about the true problem.

2023-10-28

NeurIPS.cc/2023/Workshop/ALOE (poster)

Hazards from Increasingly Accessible Fine-Tuning of Downloadable Foundation Models

Alan Chan

Benjamin Bucknall

Herbie Bradley

2023-10-23

NeurIPS.cc/2023/Workshop/SoLaR (spotlight)

Hazards from Increasingly Accessible Fine-Tuning of Downloadable Foundation Models

Alan Chan

Benjamin Bucknall

Herbie Bradley

2023-10-23

NeurIPS.cc/2023/Workshop/SoLaR (spotlight)

Meta- (out-of-context) learning in neural networks

Dmitrii Krasheninnikov

Egor Krasheninnikov

Bruno Mlodozeniec

Brown et al. (2020) famously introduced the phenomenon of in-context learning in large language models (LLMs). We establish the existence of… (see more) a phenomenon we call meta-out-of-context learning (meta-OCL) via carefully designed synthetic experiments with LLMs. Our results suggest that meta-OCL leads LLMs to more readily"internalize"the semantic content of text that is, or appears to be, broadly useful (such as true statements, or text from authoritative sources) and use it in appropriate circumstances. We further demonstrate meta-OCL in a synthetic computer vision setting, and propose two hypotheses for the emergence of meta-OCL: one relying on the way models store knowledge in their parameters, and another suggesting that the implicit gradient alignment bias of gradient-descent-based optimizers may be responsible. Finally, we reflect on what our results might imply about capabilities of future AI systems, and discuss potential risks. Our code can be found at https://github.com/krasheninnikov/internalization.

2023-10-23

ArXiv (preprint)

arxiv.org

Thinker: Learning to Plan and Act

Stephen Chung

Ivan Anokhin

We propose the Thinker algorithm, a novel approach that enables reinforcement learning agents to autonomously interact with and utilize a le… (see more)arned world model. The Thinker algorithm wraps the environment with a world model and introduces new actions designed for interacting with the world model. These model-interaction actions enable agents to perform planning by proposing alternative plans to the world model before selecting a final action to execute in the environment. This approach eliminates the need for handcrafted planning algorithms by enabling the agent to learn how to plan autonomously and allows for easy interpretation of the agent's plan with visualization. We demonstrate the algorithm's effectiveness through experimental results in the game of Sokoban and the Atari 2600 benchmark, where the Thinker algorithm achieves state-of-the-art performance and competitive results, respectively. Visualizations of agents trained with the Thinker algorithm demonstrate that they have learned to plan effectively with the world model to select better actions. Thinker is the first work showing that an RL agent can learn to plan with a learned world model in complex environments.

Mechanistic Mode Connectivity

Ekdeep Singh Lubana

Eric J Bigelow

Robert P. Dick

Hidenori Tanaka

2023-07-03

Proceedings of the 40th International Conference on Machine Learning (published)