David Scott Krueger

arxiv.org

Towards Interpreting Visual Information Processing in Vision-Language Models

Clement Neo

Luke Ong

Philip Torr

Mor Geva

Fazl Barez

Vision-Language Models (VLMs) are powerful tools for processing and understanding text and images. We study the processing of visual tokens … (see more)in the language model component of LLaVA, a prominent VLM. Our approach focuses on analyzing the localization of object information, the evolution of visual token representations across layers, and the mechanism of integrating visual information for predictions. Through ablation studies, we demonstrated that object identification accuracy drops by over 70\% when object-specific tokens are removed. We observed that visual token representations become increasingly interpretable in the vocabulary space across layers, suggesting an alignment with textual tokens corresponding to image content. Finally, we found that the model extracts object information from these refined representations at the last token position for prediction, mirroring the process in text-only language models for factual association tasks. These findings provide crucial insights into how VLMs process and integrate visual information, bridging the gap between our understanding of language and vision models, and paving the way for more interpretable and controllable multimodal systems.

2024-10-09

ArXiv (preprint)

arxiv.org

Towards Reliable Evaluation of Behavior Steering Interventions in LLMs

Itamar Pres

Laura Ruis

Ekdeep Singh Lubana

Representation engineering methods have recently shown promise for enabling efficient steering of model behavior. However, evaluation pipeli… (see more)nes for these methods have primarily relied on subjective demonstrations, instead of quantitative, objective metrics. We aim to take a step towards addressing this issue by advocating for four properties missing from current evaluations: (i) contexts sufficiently similar to downstream tasks should be used for assessing intervention quality; (ii) model likelihoods should be accounted for; (iii) evaluations should allow for standardized comparisons across different target behaviors; and (iv) baseline comparisons should be offered. We introduce an evaluation pipeline grounded in these criteria, offering both a quantitative and visual analysis of how effectively a given method works. We use this pipeline to evaluate two representation engineering methods on how effectively they can steer behaviors such as truthfulness and corrigibility, finding that some interventions are less effective than previously reported.

2024-10-09

NeurIPS.cc/2024/Workshop/MINT (accepted)

A Generative Model of Symmetry Transformations

James Urquhart Allingham

Bruno Mlodozeniec

Shreyas Padhy

Javier Antoran

Richard E. Turner

Eric Nalisnick

José Miguel Hernández-Lobato

Correctly capturing the symmetry transformations of data can lead to efficient models with strong generalization capabilities, though method… (see more)s incorporating symmetries often require prior knowledge. While recent advancements have been made in learning those symmetries directly from the dataset, most of this work has focused on the discriminative setting. In this paper, we take inspiration from group theoretic ideas to construct a generative model that explicitly aims to capture the data's approximate symmetries. This results in a model that, given a prespecified broad set of possible symmetries, learns to what extent, if at all, those symmetries are actually present. Our model can be seen as a generative process for data augmentation. We provide a simple algorithm for learning our generative model and empirically demonstrate its ability to capture symmetries under affine and color transformations, in an interpretable way. Combining our symmetry model with standard generative models results in higher marginal test-log-likelihoods and improved data efficiency.

2024-09-25

NeurIPS.cc/2024/Conference (poster)

Interpreting Learned Feedback Patterns in Large Language Models

Luke Marks

Amir Abdullah

Clement Neo

Rauno Arike

Philip Torr

Fazl Barez

2024-09-25

NeurIPS.cc/2024/Conference (poster)

Predicting Future Actions of Reinforcement Learning Agents

Stephen Chung

Scott Niekum

2024-09-25

NeurIPS.cc/2024/Conference (poster)

Stress-Testing Capability Elicitation With Password-Locked Models

Ryan Greenblatt

Fabien Roger

Dmitrii Krasheninnikov

2024-09-25

NeurIPS.cc/2024/Conference (poster)

Foundational Challenges in Assuring Alignment and Safety of Large Language Models

Usman Anwar

Abulhair Saparov

Javier Rando

Daniel Paleka

Miles Turpin

Peter Hase

Ekdeep Singh Lubana

Erik Jenner

Stephen Casper

Oliver Sourbut

Benjamin L. Edelman

Zhaowei Zhang

Mario Günther

Anton Korinek

Jose Hernandez-Orallo

Lewis Hammond

Eric J Bigelow

Alexander Pan

Lauro Langosco

Tomasz Korbak … (see 22 more)

Heidi Chenyu Zhang

Ruiqi Zhong

Sean O hEigeartaigh

Gabriel Recchia

Giulio Corsi

Alan Chan

Markus Anderljung

Lilian Edwards

Aleksandar Petrov

Yoshua Bengio

Christian Schroeder de Witt

Danqi Chen

Samuel Albanie

Sumeet Ramesh Motwani

Tegan Maharaj

Jakob Nicolaus Foerster

Philip Torr

Florian Tramèr

He He

Atoosa Kasirzadeh

Yejin Choi

2024-09-02

TMLR (accepted)

Implicitly Bayesian Prediction Rules in Deep Learning

Bruno Mlodozeniec

Richard Turner

The Bayesian approach leads to coherent updates of predictions under new data, which makes adhering to Bayesian principles appealing in deci… (see more)sion-making contexts. Traditionally, integrating Bayesian principles into models like deep neural networks involves setting priors on parameters and approximating posteriors. This is done despite the fact that, typically, priors on parameters reflect any prior beliefs only insofar as they dictate function space behaviour. In this paper, we rethink this approach and consider what properties characterise a prediction rule as being Bayesian. Algorithms meeting such criteria can be deemed implicitly Bayesian — they make the same predictions as some Bayesian model, without explicitly manifesting priors and posteriors. We argue this might be a more fruitful approach towards integrating Bayesian principles into deep learning. In this paper, we propose how to measure how close a general prediction rule is to being implicitly Bayesian, and empirically evaluate multiple prediction strategies using our approach. We also show theoretically that agents relying on non-implicitly Bayesian prediction rules can be easily exploited in adversarial betting settings.

2024-07-29

Proceedings of the 6th Symposium on Advances in Approximate Bayesian Inference (published)

proceedings.mlr.press

Implicit meta-learning may lead language models to trust more reliable sources

Dmitrii Krasheninnikov

Egor Krasheninnikov

Bruno Mlodozeniec

Tegan Maharaj

We demonstrate that large language models (LLMs) may learn indicators of document usefulness and modulate their updates accordingly. We intr… (see more)oduce random strings ("tags") as indicators of usefulness in a synthetic fine-tuning dataset. Fine-tuning on this dataset leads to **implicit meta-learning (IML)**: in further fine-tuning, the model updates to make more use of text that is tagged as useful. We perform a thorough empirical investigation of this phenomenon, finding (among other things) that (i) it occurs in both pretrained LLMs and those trained from scratch, as well as on a vision task, and (ii) larger models and smaller batch sizes tend to give more IML. We also use probing to examine how IML changes the way models store knowledge in their parameters. Finally, we reflect on what our results might imply about the capabilities, risks, and controllability of future AI systems.

2024-07-08

Proceedings of the 41st International Conference on Machine Learning (published)

proceedings.mlr.press

A deeper look at depth pruning of LLMs

Shoaib Ahmed Siddiqui

Xin Dong

Greg Heinrich

Thomas Breuel

Jan Kautz

Pavlo Molchanov

Large Language Models (LLMs) are not only resource-intensive to train but even more costly to deploy in production. Therefore, recent work h… (see more)as attempted to prune blocks of LLMs based on cheap proxies for estimating block importance, effectively removing 10% of blocks in well-trained LLaMa-2 and Mistral 7b models without any significant degradation of downstream metrics. In this paper, we explore different block importance metrics by considering adaptive metrics such as Shapley value in addition to static ones explored in prior work. We show that *adaptive metrics exhibit a trade-off in performance between tasks i.e., improvement on one task may degrade performance on the other due to differences in the computed block influences*. Furthermore, we extend this analysis from a complete block to individual self-attention and feed-forward layers, highlighting the propensity of the self-attention layers to be more amendable to pruning, even allowing ***removal of upto 33% of the self-attention layers without incurring any performance degradation on MMLU for Mistral 7b*** (significant reduction in costly maintenance of KV-cache). Finally, we look at simple performance recovery techniques to emulate the pruned layers by training lightweight additive bias or low-rank linear adapters. *Performance recovery using emulated updates avoids performance degradation for the initial blocks (up to 5% absolute improvement on MMLU)*, which is either competitive or superior to the learning-based technique.

2024-06-18

ICML.cc/2024/Workshop/TF2M (poster)

IDs for AI Systems

Alan Chan

Noam Kolt

Peter Wills

Usman Anwar

Christian Schroeder de Witt

Nitarshan Rajkumar

Lewis Hammond

Lennart Heim

Markus Anderljung

2024-06-17

ArXiv (preprint)