David Scott Krueger

This work identifies 18 foundational challenges in assuring the alignment and safety of large language models (LLMs). These challenges are organized into three different categories: scientific understanding of LLMs, development and deployment methods, and sociotechnical challenges. Based on the identified challenges, we pose

2024-04-15

ArXiv (preprint)

Safety Cases: How to Justify the Safety of Advanced AI Systems

Joshua Clymer

Nick Gabrieli

Thomas Larsen

As AI systems become more advanced, companies and regulators will make difficult decisions about whether it is safe to train and deploy them. To prepare for these decisions, we investigate how developers could make a 'safety case,' which is a structured rationale that AI systems are unlikely to cause a catastrophe. We propose a framework for organizing a safety case and discuss four categories of arguments to justify safety: total inability to cause a catastrophe, sufficiently strong control measures, trustworthiness despite capability to cause harm, and -- if AI systems become much more powerful -- deference to credible AI advisors. We evaluate concrete examples of arguments in each category and outline how arguments could be combined to justify that AI systems are safe to deploy.

2024-03-15

ArXiv (preprint)

A Generative Model of Symmetry Transformations

James U. Allingham

Bruno Mlodozeniec

Shreyas Padhy

Javier Antorán

Richard E. Turner

Eric T. Nalisnick

José Miguel Hernández-Lobato

Correctly capturing the symmetry transformations of data can lead to efficient models with strong generalization capabilities, though methods incorporating symmetries often require prior knowledge. While recent advancements have been made in learning those symmetries directly from the dataset, most of this work has focused on the discriminative setting. In this paper, we construct a generative model that explicitly aims to capture symmetries in the data, resulting in a model that learns which symmetries are present in an interpretable way. We provide a simple algorithm for efficiently learning our generative model and demonstrate its ability to capture symmetries under affine and color transformations. Combining our symmetry model with existing generative models results in higher marginal test-log-likelihoods and robustness to data sparsification.

2024-03-04

ArXiv (preprint)

Blockwise Self-Supervised Learning at Scale

Shoaib Ahmed Siddiqui

Yann LeCun

Stephane Deny

2024-01-30

TMLR (accepted)

Black-Box Access is Insufficient for Rigorous AI Audits

Stephen Casper

Carson Ezell

Charlotte Siegmann

Noam Kolt

Taylor Lynn Curtis

Benjamin Bucknall

Andreas A. Haupt

Kevin Wei

Jérémy Scheurer

Marius Hobbhahn

Lee Sharkey

Satyapriya Krishna

Marvin von Hagen

Silas Alberti

Alan Chan

Qinyi Sun

Michael Gerovitch

David Bau

Max Tegmark

David Scott Krueger

Dylan Hadfield-Menell

External audits of AI systems are increasingly recognized as a key mechanism for AI governance. The effectiveness of an audit, however, depends on the degree of system access granted to auditors. Recent audits of state-of-the-art AI systems have primarily relied on black-box access, in which auditors can only query the system and observe its outputs. However, white-box access to the system's inner workings (e.g., weights, activations, gradients) allows an auditor to perform stronger attacks, more thoroughly interpret models, and conduct fine-tuning. Meanwhile, outside-the-box access to its training and deployment information (e.g., methodology, code, documentation, hyperparameters, data, deployment details, findings from internal evaluations) allows for auditors to scrutinize the development process and design more targeted evaluations. In this paper, we examine the limitations of black-box audits and the advantages of white- and outside-the-box audits. We also discuss technical, physical, and legal safeguards for performing these audits with minimal security risks. Given that different forms of access can lead to very different levels of evaluation, we conclude that (1) transparency regarding the access and methods used by auditors is necessary to properly interpret audit results, and (2) white- and outside-the-box access allow for substantially more scrutiny than black-box access alone.

2024-01-25

ArXiv (preprint)

Visibility into AI Agents

Alan Chan

Carson Ezell

Max Kaufmann

Kevin Wei

Lewis Hammond

Herbie Bradley

Emma Bluemke

Nitarshan Rajkumar

Noam Kolt

Lennart Heim

Markus Anderljung

2024-01-23

ArXiv (preprint)

Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks

Samyak Jain

Robert Kirk

Ekdeep Singh Lubana

Robert P. Dick

Hidenori Tanaka

Edward Grefenstette

Tim Rocktäschel

Fine-tuning large pre-trained models has become the de facto strategy for developing both task-specific and general-purpose machine learning systems, including developing models that are safe to deploy. Despite its clear importance, there has been minimal work that explains how fine-tuning alters the underlying capabilities learned by a model during pretraining: does fine-tuning yield entirely novel capabilities or does it just modulate existing ones? We address this question empirically in synthetic, controlled settings where we can use mechanistic interpretability tools (e.g., network pruning and probing) to understand how the model's underlying capabilities are changing. We perform an extensive analysis of the effects of fine-tuning in these settings, and show that: (i) fine-tuning rarely alters the underlying model capabilities; (ii) a minimal transformation, which we call a 'wrapper', is typically learned on top of the underlying model capabilities, creating the illusion that they have been modified; and (iii) further fine-tuning on a task where such "wrapped capabilities" are relevant leads to sample-efficient revival of the capability, i.e., the model begins reusing these capabilities after only a few gradient steps. This indicates that practitioners can unintentionally remove a model's safety wrapper merely by fine-tuning it on a, e.g., superficially unrelated, downstream task. We additionally perform analysis on language models trained on the TinyStories dataset to support our claims in a more realistic setup.

2024-01-16

ICLR.cc/2024/Conference (poster)

Reward Model Ensembles Help Mitigate Overoptimization

Thomas Coste

Usman Anwar

Robert Kirk

Reinforcement learning from human feedback (RLHF) is a standard approach for fine-tuning large language models to follow instructions. As part of this process, learned reward models are used to approximately model human preferences. However, as imperfect representations of the “true” reward, these learned reward models are susceptible to overoptimization. Gao et al. (2023) studied this phenomenon in a synthetic human feedback setup with a significantly larger “gold” reward model acting as the true reward (instead of humans) and showed that overoptimization remains a persistent problem regardless of the size of the proxy reward model and training data used. Using a similar setup, we conduct a systematic study to evaluate the efficacy of using ensemble-based conservative optimization objectives, specifically worst-case optimization (WCO) and uncertainty-weighted optimization (UWO), for mitigating reward model overoptimization when using two optimization methods: (a) best-of-n sampling (BoN) (b) proximal policy optimization (PPO). We additionally extend the setup of Gao et al. (2023) to include 25% label noise to better mirror real-world conditions. Both with and without label noise we find that conservative optimization practically eliminates overoptimization and improves performance by up to 70% for BoN sampling. For PPO, ensemble-based conservative optimization always reduces overoptimization and outperforms single reward model optimization. Moreover, combining it with a small KL penalty successfully prevents overoptimization at no performance cost. Overall, our results demonstrate that ensemble-based conservative optimization can effectively counter overoptimization.

2024-01-16

ICLR.cc/2024/Conference (poster)

Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

Stephen Casper

Xander Davies

Claudia Shi

Thomas Krendl Gilbert

Jérémy Scheurer

Javier Rando

Rachel Freedman

Tomasz Korbak

David Lindner

Pedro Freire

Tony Tong Wang

Samuel Marks

Charbel-Raphael Segerie

Micah Carroll

Andi Peng

Phillip Christoffersen

Mehul Damani

Stewart Slocum

Usman Anwar

Anand Siththaranjan

Max Nadeau

Eric J Michaud

Jacob Pfau

Dmitrii Krasheninnikov

Xin Chen

Lauro Langosco

Peter Hase

Erdem Biyik

Anca Dragan

Dorsa Sadigh

Dylan Hadfield-Menell

2023-12-30

TMLR (accepted)

(Out-of-context) Meta-learning in Language Models

Dmitrii Krasheninnikov

Egor Krasheninnikov

Bruno Mlodozeniec

Brown et al. (2020) famously introduced the phenomenon of in-context meta-learning in large language models (LLMs). Our work establishes the existence of a phenomenon we call out-of-context meta-learning via carefully designed synthetic experiments with large language models. We show that out-of-context meta-learning leads LLMs to more readily “internalize” the semantic content of text that is, or appears to be, broadly useful (such as true statements, or text from authoritative sources) and apply it in appropriate contexts. We further demonstrate internalization in a synthetic computer vision setting, and propose two hypotheses for the emergence of internalization: one relying on the way models store knowledge in their parameters, and another suggesting that the implicit gradient alignment bias of gradient-descent-based methods may be responsible. Finally, we reflect on what our results might imply about capabilities of future AI systems, and discuss potential risks.

2023-12-12

NeurIPS.cc/2023/Conference/Rejected_Submission (rejected)

Goal Misgeneralization as Implicit Goal Conditioning

Diego Dorn

Neel Alex

2023-11-02

NeurIPS.cc/2023/Workshop/GCRL (published)

How does fine-tuning affect your model? Mechanistic analysis on procedural tasks

Samyak Jain

Robert Kirk

Ekdeep Singh Lubana

Robert P. Dick

Hidenori Tanaka

Tim Rocktäschel

Edward Grefenstette

Fine-tuning large pre-trained models has become the *de facto* strategy for developing models that are safe to deploy. However, there has been little work that explains how fine-tuning alters the underlying capabilities learnt by a model during pretraining: does fine-tuning yield entirely novel capabilities or does it just modulate existing ones? We address this question empirically in *synthetic* settings with mechanistic interpretability tools (e.g., network pruning and probing) to understand how the model's underlying capabilities are changing. Our extensive analysis of the effects of fine-tuning shows: (i) fine-tuning rarely alters the underlying model capabilities; (ii) a minimal transformation, which we call a 'wrapper', is typically learned on top of the underlying model capabilities; and (iii) further fine-tuning on a task where such wrapped capabilities are relevant leads to sample-efficient "revival" of the capability, i.e., the model begins reusing this capability in a few gradient steps. *This indicates practitioners can unintentionally remove a model's safety wrapper by merely fine-tuning it on a superficially unrelated task.* We additionally perform analysis on language models trained on the TinyStories dataset to support our claims in a more realistic setup.

2023-11-02

NeurIPS.cc/2023/Workshop/UniReps (poster)