David Scott Krueger

Emily Dardaman

Simeon Campos

Evan Murphy

2024-01-01

SSRN Electronic Journal (published)

Foundational Challenges in Assuring Alignment and Safety of Large Language Models

Usman Anwar

Abulhair Saparov

Javier Rando

Daniel Paleka

Miles Turpin

Peter Hase

Ekdeep Singh Lubana

Erik Jenner

Stephen Casper

Oliver Sourbut

Benjamin L. Edelman

Zhaowei Zhang

Mario Günther

Anton Korinek

Jose Hernandez-Orallo

Lewis Hammond

Eric J Bigelow

Alexander Pan

Lauro Langosco

Tomasz Korbak … (see 22 more)

Heidi Chenyu Zhang

Ruiqi Zhong

Sean O hEigeartaigh

Gabriel Recchia

Giulio Corsi

Markus Anderljung

Lilian Edwards

Aleksandar Petrov

Christian Schroeder de Witt

Sumeet Ramesh Motwani

Yoshua Bengio

Danqi Chen

Philip Torr

Samuel Albanie

Tegan Maharaj

Jakob Nicolaus Foerster

Florian Tramèr

He He

Atoosa Kasirzadeh

Yejin Choi

2024-01-01

Trans. Mach. Learn. Res. (published)

Foundational Challenges in Assuring Alignment and Safety of Large Language Models

Usman Anwar

Abulhair Saparov

Javier Rando

Daniel Paleka

Miles Turpin

Peter Hase

Ekdeep Singh Lubana

Erik Jenner

Stephen Casper

Oliver Sourbut

Benjamin L. Edelman

Zhaowei Zhang

Mario Günther

Anton Korinek

Jose Hernandez-Orallo

Lewis Hammond

Eric J Bigelow

Alexander Pan

Lauro Langosco

Tomasz Korbak … (see 22 more)

Heidi Chenyu Zhang

Ruiqi Zhong

Sean O hEigeartaigh

Gabriel Recchia

Giulio Corsi

Markus Anderljung

Lilian Edwards

Aleksandar Petrov

Christian Schroeder de Witt

Sumeet Ramesh Motwani

Yoshua Bengio

Danqi Chen

Philip Torr

Samuel Albanie

Tegan Maharaj

Jakob Nicolaus Foerster

Florian Tramèr

He He

Atoosa Kasirzadeh

Yejin Choi

2024-01-01

Trans. Mach. Learn. Res. (published)

dblp.uni-trier.de

Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

Stephen Casper

Xander Davies

Claudia Shi

Thomas Krendl Gilbert

Jérémy Scheurer

Javier Rando

Rachel Freedman

Tomasz Korbak

David Lindner

Pedro Freire

Tony Tong Wang

Samuel Marks

Charbel-Raphael Segerie

Micah Carroll

Andi Peng

Phillip Christoffersen

Mehul Damani

Stewart Slocum

Usman Anwar

Anand Siththaranjan … (see 12 more)

Max Nadeau

Eric J Michaud

Jacob Pfau

Dmitrii Krasheninnikov

Xin Chen

Lauro Langosco

Peter Hase

Erdem Biyik

Anca Dragan

Dorsa Sadigh

Dylan Hadfield-Menell

2023-12-30

TMLR (accepted)

Hazards from Increasingly Accessible Fine-Tuning of Downloadable Foundation Models

Benjamin Bucknall

Herbie Bradley

Public release of the weights of pretrained foundation models, otherwise known as downloadable access \citep{solaiman_gradient_2023}, enable… (see more)s fine-tuning without the prohibitive expense of pretraining. Our work argues that increasingly accessible fine-tuning of downloadable models may increase hazards. First, we highlight research to improve the accessibility of fine-tuning. We split our discussion into research that A) reduces the computational cost of fine-tuning and B) improves the ability to share that cost across more actors. Second, we argue that increasingly accessible fine-tuning methods may increase hazard through facilitating malicious use and making oversight of models with potentially dangerous capabilities more difficult. Third, we discuss potential mitigatory measures, as well as benefits of more accessible fine-tuning. Given substantial remaining uncertainty about hazards, we conclude by emphasizing the urgent need for the development of mitigations.

2023-12-22

ArXiv (preprint)

arxiv.org

(Out-of-context) Meta-learning in Language Models

Dmitrii Krasheninnikov

Egor Krasheninnikov

Bruno Mlodozeniec

Brown et al. (2020) famously introduced the phenomenon of in-context meta-learning in large language models (LLMs). Our work establishes the… (see more) existence of a phenomenon we call out-of-context meta-learning via carefully designed synthetic experiments with large language models. We show that out-of-context meta-learning leads LLMs to more readily “internalize” the semantic content of text that is, or appears to be, broadly useful (such as true statements, or text from authoritative sources) and apply it in appropriate contexts. We further demonstrate internalization in a synthetic computer vision setting, and propose two hypotheses for the emergence of internalization: one relying on the way models store knowledge in their parameters, and another suggesting that the implicit gradient alignment bias of gradient-descent-based methods may be responsible. Finally, we reflect on what our results might imply about capabilities of future AI systems, and discuss potential risks.

2023-12-12

NeurIPS.cc/2023/Conference/Rejected_Submission (rejected)

Goal Misgeneralization as Implicit Goal Conditioning

Diego Dorn

Neel Alex

2023-11-02

NeurIPS.cc/2023/Workshop/GCRL (published)

How does fine-tuning affect your model? Mechanistic analysis on procedural tasks

Samyak Jain

Robert Kirk

Ekdeep Singh Lubana

Robert P. Dick

Hidenori Tanaka

Tim Rocktäschel

Edward Grefenstette

Fine-tuning large pre-trained models has become the *de facto* strategy for developing models that are safe to deploy. However, there has be… (see more)en little work that explains how fine-tuning alters the underlying capabilities learnt by a model during pretraining: does fine-tuning yield entirely novel capabilities or does it just modulate existing ones? We address this question empirically in *synthetic* settings with mechanistic interpretability tools (e.g., network pruning and probing) to understand how the model's underlying capabilities are changing. Our extensive analysis of the effects of fine-tuning shows: (i) fine-tuning rarely alters the underlying model capabilities; (ii) a minimal transformation, which we call a 'wrapper', is typically learned on top of the underlying model capabilities; and (iii) further fine-tuning on a task where such wrapped capabilities are relevant leads to sample-efficient "revival'' of the capability, i.e., the model begins reusing this capability in a few gradient steps. *This indicates practitioners can unintentionally remove a model's safety wrapper by merely fine-tuning it on a superficially unrelated task.* We additionally perform analysis on language models trained on the TinyStories dataset to support our claims in a more realistic setup.

2023-11-02

NeurIPS.cc/2023/Workshop/UniReps (poster)

What Mechanisms Does Knowledge Distillation Distill?

Cindy Wu

Ekdeep Singh Lubana

Bruno Mlodozeniec

Robert Kirk

2023-11-02

NeurIPS.cc/2023/Workshop/UniReps (poster)

Meta- (out-of-context) learning in neural networks

Dmitrii Krasheninnikov

Egor Krasheninnikov

Bruno Mlodozeniec

Brown et al. (2020) famously introduced the phenomenon of in-context learning in large language models (LLMs). We establish the existence of… (see more) a phenomenon we call **meta-out-of-context learning (meta-OCL)** via carefully designed synthetic experiments with LLMs. Our results suggest that meta-OCL leads LLMs to more readily “internalize” the semantic content of text that is, *or appears to be*, broadly useful (such as true statements, or text from authoritative sources) and use it in appropriate circumstances. We further demonstrate meta-OCL in a synthetic computer vision setting, and propose two hypotheses for the emergence of meta-OCL: one relying on the way models store knowledge in their parameters, and another suggesting that the implicit gradient alignment bias of gradient-descent-based optimizers may be responsible. Finally, we reflect on what our results might imply about capabilities of future AI systems, and discuss potential risks. Our code is available at https://github.com/krasheninnikov/internalization.

2023-11-01

NeurIPS.cc/2023/Workshop/R0-FoMo (poster)

Characterizing Manipulation from AI Systems

Micah Carroll

Henry Ashton

Manipulation is a concern in many domains, such as social media, advertising, and chatbots. As AI systems mediate more of our digital intera… (see more)ctions, it is important to understand the degree to which AI systems might manipulate humans without the intent of the system designers. Our work clarifies challenges in defining and measuring this kind of manipulation from AI systems. Firstly, we build upon prior literature on manipulation and characterize the space of possible notions of manipulation, which we find to depend upon the concepts of incentives, intent, covertness, and harm. We review proposals on how to operationalize each concept and we outline challenges in including each concept in a definition of manipulation. Second, we discuss the connections between manipulation and related concepts, such as deception and coercion. We then analyze how our characterization of manipulation applies to recommender systems and language models, and give a brief overview of the regulation of manipulation in other domains. While some progress has been made in defining and measuring manipulation from AI systems, many gaps remain. In the absence of a consensus definition and reliable tools for measurement, we cannot rule out the possibility that AI systems learn to manipulate humans without the intent of the system designers. Manipulation could pose a significant threat to human autonomy and precautionary actions to mitigate it are likely warranted.

2023-10-30

Equity and Access in Algorithms, Mechanisms, and Optimization (published)

arxiv.org

Detecting Backdoors with Meta-Models

Lauro Langosco

Neel Alex

William Baker

David John Quarel

Herbie Bradley

It is widely known that it is possible to implant backdoors into neural networks, by which an attacker can choose an input to produce a part… (see more)icular undesirable output (e.g.\ misclassify an image). We propose to use \emph{meta-models}, neural networks that take another network's parameters as input, to detect backdoors directly from model weights. To this end we present a meta-model architecture and train it on a dataset of approx.\ 4000 clean and backdoored CNNs trained on CIFAR-10. Our approach is simple and scalable, and is able to detect the presence of a backdoor with

2023-10-28

NeurIPS.cc/2023/Workshop/BUGS (poster)