Publications

Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks

Samyak Jain

Robert Kirk

Ekdeep Singh Lubana

Robert P. Dick

Hidenori Tanaka

Tim Rocktäschel

Edward Grefenstette

David Scott Krueger

Fine-tuning large pre-trained models has become the de facto strategy for developing both task-specific and general-purpose machine learning… (see more) systems, including developing models that are safe to deploy. Despite its clear importance, there has been minimal work that explains how fine-tuning alters the underlying capabilities learned by a model during pretraining: does fine-tuning yield entirely novel capabilities or does it just modulate existing ones? We address this question empirically in synthetic, controlled settings where we can use mechanistic interpretability tools (e.g., network pruning and probing) to understand how the model's underlying capabilities are changing. We perform an extensive analysis of the effects of fine-tuning in these settings, and show that: (i) fine-tuning rarely alters the underlying model capabilities; (ii) a minimal transformation, which we call a `wrapper', is typically learned on top of the underlying model capabilities, creating the illusion that they have been modified; and (iii) further fine-tuning on a task where such ``wrapped capabilities'' are relevant leads to sample-efficient revival of the capability, i.e., the model begins reusing these capabilities after only a few gradient steps. This indicates that practitioners can unintentionally remove a model's safety wrapper merely by fine-tuning it on a, e.g., superficially unrelated, downstream task. We additionally perform analysis on language models trained on the TinyStories dataset to support our claims in a more realistic setup.

2024-01-16

ICLR.cc/2024/Conference (poster)

openreview.net

Motif: Intrinsic Motivation from Artificial Intelligence Feedback

Martin Klissarov

Pierluca D'Oro

Shagun Sodhani

Roberta Raileanu

Pierre-Luc Bacon

Pascal Vincent

Amy Zhang

Mikael Henaff

Exploring rich environments and evaluating one's actions without prior knowledge is immensely challenging. In this paper, we propose Motif, … (see more)a general method to interface such prior knowledge from a Large Language Model (LLM) with an agent. Motif is based on the idea of grounding LLMs for decision-making without requiring them to interact with the environment: it elicits preferences from an LLM over pairs of captions to construct an intrinsic reward, which is then used to train agents with reinforcement learning. We evaluate Motif's performance and behavior on the challenging, open-ended and procedurally-generated NetHack game. Surprisingly, by only learning to maximize its intrinsic reward, Motif achieves a higher game score than an algorithm directly trained to maximize the score itself. When combining Motif's intrinsic reward with the environment reward, our method significantly outperforms existing approaches and makes progress on tasks where no advancements have ever been made without demonstrations. Finally, we show that Motif mostly generates intuitive human-aligned behaviors which can be steered easily through prompt modifications, while scaling well with the LLM size and the amount of information given in the prompt.

2024-01-16

ICLR.cc/2024/Conference (poster)

doi.org

openreview.net

Object centric architectures enable efficient causal representation learning

Amin Mansouri

Jason Hartford

Yan Zhang

Yoshua Bengio

Causal representation learning has showed a variety of settings in which we can disentangle latent variables with identifiability guarantees… (see more) (up to some reasonable equivalence class). Common to all of these approaches is the assumption that (1) the latent variables are represented as

2024-01-16

ICLR.cc/2024/Conference (poster)

doi.org

openreview.net

PhyloGFN: Phylogenetic inference with generative flow networks

Ming Yang Zhou

Zichao Yan

Elliot Layne

Nikolay Malkin

Dinghuai Zhang

Moksh J. Jain

Mathieu Blanchette

Yoshua Bengio

2024-01-16

ICLR.cc/2024/Conference (poster)

doi.org

openreview.net

Piecewise Linear Parametrization of Policies: Towards Interpretable Deep Reinforcement Learning

Maxime Wabartha

Joelle Pineau

2024-01-16

ICLR.cc/2024/Conference (poster)

openreview.net

Piecewise Linear Parametrization of Policies: Towards Interpretable Deep Reinforcement Learning

Maxime Wabartha

Joelle Pineau

Learning inherently interpretable policies is a central challenge in the path to developing autonomous agents that humans can trust. Linear … (see more)policies can justify their decisions while interacting in a dynamic environment, but their reduced expressivity prevents them from solving hard tasks. Instead, we argue for the use of piecewise-linear policies. We carefully study to what extent they can retain the interpretable properties of linear policies while reaching competitive performance with neural baselines. In particular, we propose the HyperCombinator (HC), a piecewise-linear neural architecture expressing a policy with a controllably small number of sub-policies. Each sub-policy is linear with respect to interpretable features, shedding light on the decision process of the agent without requiring an additional explanation model. We evaluate HC policies in control and navigation experiments, visualize the improved interpretability of the agent and highlight its trade-off with performance. Moreover, we validate that the restricted model class that the HyperCombinator belongs to is compatible with the algorithmic constraints of various reinforcement learning algorithms.

2024-01-16

ICLR.cc/2024/Conference (poster)

openreview.net

Poly-View Contrastive Learning

Amitis Shidani

(Rex) Devon Hjelm

Jason Ramapuram

Russell Webb

Eeshan Gunesh Dhekane

Dan Busbridge

2024-01-16

ICLR.cc/2024/Conference (poster)

openreview.net

Pre-Training and Fine-Tuning Generative Flow Networks

Ling Pan

Moksh J. Jain

Kanika Madan

Yoshua Bengio

Generative Flow Networks (GFlowNets) are amortized samplers that learn stochastic policies to sequentially generate compositional objects fr… (see more)om a given unnormalized reward distribution. They can generate diverse sets of high-reward objects, which is an important consideration in scientific discovery tasks. However, as they are typically trained from a given extrinsic reward function, it remains an important open challenge about how to leverage the power of pre-training and train GFlowNets in an unsupervised fashion for efficient adaptation to downstream tasks. Inspired by recent successes of unsupervised pre-training in various domains, we introduce a novel approach for reward-free pre-training of GFlowNets. By framing the training as a self-supervised problem, we propose an outcome-conditioned GFlowNet (OC-GFN) that learns to explore the candidate space. Specifically, OC-GFN learns to reach any targeted outcomes, akin to goal-conditioned policies in reinforcement learning. We show that the pre-trained OC-GFN model can allow for a direct extraction of a policy capable of sampling from any new reward functions in downstream tasks. Nonetheless, adapting OC-GFN on a downstream task-specific reward involves an intractable marginalization over possible outcomes. We propose a novel way to approximate this marginalization by learning an amortized predictor enabling efficient fine-tuning. Extensive experimental results validate the efficacy of our approach, demonstrating the effectiveness of pre-training the OC-GFN, and its ability to swiftly adapt to downstream tasks and discover modes more efficiently. This work may serve as a foundation for further exploration of pre-training strategies in the context of GFlowNets.

2024-01-16

ICLR.cc/2024/Conference (spotlight)

doi.org

openreview.net

Provable and Practical: Efficient Exploration in Reinforcement Learning via Langevin Monte Carlo

Haque Ishfaq

Qingfeng Lan

Pan Xu

A. Rupam Mahmood

Doina Precup

Animashree Anandkumar

Kamyar Azizzadenesheli

We present a scalable and effective exploration strategy based on Thompson sampling for reinforcement learning (RL). One of the key shortcom… (see more)ings of existing Thompson sampling algorithms is the need to perform a Gaussian approximation of the posterior distribution, which is not a good surrogate in most practical settings. We instead directly sample the Q function from its posterior distribution, by using Langevin Monte Carlo, an efficient type of Markov Chain Monte Carlo (MCMC) method. Our method only needs to perform noisy gradient descent updates to learn the exact posterior distribution of the Q function, which makes our approach easy to deploy in deep RL. We provide a rigorous theoretical analysis for the proposed method and demonstrate that, in the linear Markov decision process (linear MDP) setting, it has a regret bound of

2024-01-16

ICLR.cc/2024/Conference (poster)

doi.org

openreview.net

Raidar: geneRative AI Detection viA Rewriting

Chengzhi Mao

Carl Vondrick

Hao Wang

Junfeng Yang

We find that large language models (LLMs) are more likely to modify human-written text than AI-generated text when tasked with rewriting. Th… (see more)is tendency arises because LLMs often perceive AI-generated text as high-quality, leading to fewer modifications. We introduce a method to detect AI-generated content by prompting LLMs to rewrite text and calculating the editing distance of the output. We dubbed our geneRative AI Detection viA Rewriting method Raidar. Raidar significantly improves the F1 detection scores of existing AI content detection models -- both academic and commercial -- across various domains, including News, creative writing, student essays, code, Yelp reviews, and arXiv papers, with gains of up to 29 points. Operating solely on word symbols without high-dimensional features, our method is compatible with black box LLMs, and is inherently robust on new content. Our results illustrate the unique imprint of machine-generated text through the lens of the machines themselves.

2024-01-16

ICLR.cc/2024/Conference (poster)

doi.org

openreview.net

Reasoning with Latent Diffusion in Offline Reinforcement Learning

Siddarth Venkatraman

Shivesh Khaitan

Ravi Tej Akella

John Dolan

Jeff Schneider

Glen Berseth

2024-01-16