Jörg Bornschein

Murray P. Shanahan

James L McClelland

Large language models exhibit exciting capabilities, yet can show surprisingly narrow generalization from finetuning. E.g. they can fail to generalize to simple reversals of relations they are trained on, or fail to make simple logical deductions based on trained information. These failures to generalize from fine-tuning can hinder practical application of these models. On the other hand, language models' in-context learning shows different inductive biases, and can generalize better in some cases. Here, we explore these differences in generalization between in-context- and fine-tuning-based learning. To do so, we constructed several novel datasets to evaluate and improve models' abilities to generalize from finetuning data. The datasets are designed to create clean tests of generalization, by isolating the knowledge in the dataset from that in pretraining. We expose pretrained large models to controlled subsets of the information in these datasets -- either in context, or through fine-tuning -- and evaluate their performance on test sets that require various types of generalization. We find overall that in data-matched settings, in-context learning can generalize more flexibly than fine-tuning (though we also find some qualifications of prior findings, such as cases when fine-tuning can generalize to reversals embedded in a larger structure of knowledge). We build on these findings to propose a method to enable improved generalization from fine-tuning: adding in-context inferences to finetuning data. We show that this method improves generalization across various splits of our datasets and other benchmarks. Our results have implications for understanding the inductive biases of different modes of learning in language models, and practically improving their performance.

2025-05-01

arXiv (preprint)

On the generalization of language models from in-context learning and finetuning: a controlled study

Andrew Lampinen

Arslan Chaudhry

Stephanie C.Y. Chan

Cody Wild

Diane Wan

Alexander Y. Ku

Alex Ku

Murray P. Shanahan

James L McClelland

2025-05-01

arXiv (published)

LLMs are Greedy Agents: Effects of RL Fine-tuning on Decision-Making Abilities

Thomas Schmied

Jordi Grau-Moya

Markus Wulfmeier

2025-04-22

arXiv (preprint)

LLMs are Greedy Agents: Effects of RL Fine-tuning on Decision-Making Abilities

Thomas Schmied

Jordi Grau-Moya

Markus Wulfmeier

2025-04-01

arXiv (published)

How do language models learn facts? Dynamics, curricula and hallucinations

Nicolas Zucchet

Stephanie Chan

Andrew Lampinen

Soham De

2025-03-27

arXiv (preprint)

Revisiting Dynamic Evaluation: Online Adaptation for Large Language Models

Amal Rannen-Triki

Marcus Hutter

András György

Alexandre Galashov

Yee Whye Teh

Michalis K. Titsias

We consider the problem of online fine tuning the parameters of a language model at test time, also known as dynamic evaluation. While it is generally known that this approach improves the overall predictive performance, especially when considering distributional shift between training and evaluation data, we here emphasize the perspective that online adaptation turns parameters into temporally changing states and provides a form of context-length extension with memory in weights, more in line with the concept of memory in neuroscience. We pay particular attention to the speed of adaptation (in terms of sample efficiency), sensitivity to the overall distributional drift, and the computational overhead for performing gradient computations and parameter updates. Our empirical study provides insights on when online adaptation is particularly interesting. We highlight that with online adaptation the conceptual distinction between in-context learning and fine tuning blurs: both are methods to condition the model on previously observed tokens.

2024-03-01

arXiv (published)

Bridging the Gap Between Offline and Online Reinforcement Learning Evaluation Methodologies

Shiva Kanth Sujit

Pedro Braga

Samira Ebrahimi Kahou

Reinforcement learning (RL) has shown great promise with algorithms learning in environments with large state and action spaces purely from scalar reward signals. A crucial challenge for current deep RL algorithms is that they require a tremendous amount of environment interactions for learning. This can be infeasible in situations where such interactions are expensive, such as in robotics. Offline RL algorithms try to address this issue by bootstrapping the learning process from existing logged data without needing to interact with the environment from the very beginning. While online RL algorithms are typically evaluated as a function of the number of environment interactions, there isn't a single established protocol for evaluating offline RL methods. In this paper, we propose a sequential approach to evaluate offline RL algorithms as a function of the training set size and thus by their data efficiency. Sequential evaluation provides valuable insights into the data efficiency of the learning process and the robustness of algorithms to distribution changes in the dataset while also harmonizing the visualization of the offline and online learning phases. Our approach is generally applicable and easy to implement. We compare several existing offline RL algorithms using this approach and present insights from a variety of tasks and offline datasets.

2023-11-10

TMLR (accepted)