Razvan Pascanu

openreview.net

State Soup: In-Context Skill Learning, Retrieval and Mixing

Maciej Pióro

Maciej Wolczyk

Johannes Von Oswald

João Sacramento

A new breed of gated-linear recurrent neural networks has reached state-of-the-art performance on a range of sequence modeling problems. Such models naturally handle long sequences efficiently, as the cost of processing a new input is independent of sequence length. Here, we explore another advantage of these stateful sequence models, inspired by the success of model merging through parameter interpolation. Building on parallels between fine-tuning and in-context learning, we investigate whether we can treat internal states as task vectors that can be stored, retrieved, and then linearly combined, exploiting the linearity of recurrence. We study this form of fast model merging on Mamba-2.8b, a pretrained recurrent model, and present preliminary evidence that simple linear state interpolation methods suffice to improve next-token perplexity as well as downstream in-context learning task performance.

2024-05-31

arXiv (published)
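
The "linearity of recurrence" that the State Soup abstract relies on can be illustrated with a toy scalar recurrence. This is a minimal sketch, not the paper's method or Mamba's actual update rule: the function names, the fixed `decay` and `gain` values, and the example inputs are all assumptions made for illustration. It shows that continuing from a convex mix of two stored states gives the same result as mixing the two separately continued states.

```python
def run_linear_recurrence(inputs, decay=0.9, gain=1.0, h0=0.0):
    """Toy scalar linear recurrence: h_t = decay * h_{t-1} + gain * x_t."""
    h = h0
    for x in inputs:
        h = decay * h + gain * x
    return h


def mix_states(states, weights):
    """Linear combination of stored states (weights are assumed to sum to 1)."""
    return sum(w * s for w, s in zip(weights, states))


# Two hypothetical "skill" contexts and a shared query sequence.
context_a = [1.0, 2.0, 3.0]
context_b = [0.5, -1.0, 4.0]
query = [2.0, 0.0, 1.0]

# Store one state per context, then interpolate them.
state_a = run_linear_recurrence(context_a)
state_b = run_linear_recurrence(context_b)
mixed_state = mix_states([state_a, state_b], [0.5, 0.5])

# Continue processing the query from the mixed state...
out_from_mixed = run_linear_recurrence(query, h0=mixed_state)

# ...and compare with mixing the two separately continued states.
out_a = run_linear_recurrence(query, h0=state_a)
out_b = run_linear_recurrence(query, h0=state_b)
mix_of_outputs = mix_states([out_a, out_b], [0.5, 0.5])

print(abs(out_from_mixed - mix_of_outputs) < 1e-12)
```

The equality holds because the final state is an affine function of the initial state; it is exact here only because the mixing weights sum to 1 and the toy has no nonlinearity between steps.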

Transformers meet Neural Algorithmic Reasoners

Wilfried Bounsi

Borja Ibarz

Andrew Joseph Dudzik

Jessica B. Hamrick

Larisa Markeeva

Alex Vitvitskyi

Transformers have revolutionized machine learning with their simple yet effective architecture. Pre-training Transformers on massive text datasets from the Internet has led to unmatched generalization for natural language understanding (NLU) tasks. However, such language models remain fragile when tasked with algorithmic forms of reasoning, where computations must be precise and robust. To address this limitation, we propose a novel approach that combines the Transformer's language understanding with the robustness of graph neural network (GNN)-based neural algorithmic reasoners (NARs). Such NARs proved effective as generic solvers for algorithmic tasks, when specified in graph form. To make their embeddings accessible to a Transformer, we propose a hybrid architecture with a two-phase training procedure, allowing the tokens in the language model to cross-attend to the node embeddings from the NAR. We evaluate our resulting TransNAR model on CLRS-Text, the text-based version of the CLRS-30 benchmark, and demonstrate significant gains over Transformer-only models for algorithmic reasoning, both in and out of distribution.

2024-05-31

arXiv (published)

Transformers need glasses! Information over-squashing in language tasks

Federico Barbero

Andrea Banino

Steven Kapturowski

Dharshan Kumaran

João Guilherme Madeira Araújo

Alex Vitvitskyi

We study how information propagates in decoder-only Transformers, which are the architectural backbone of most existing frontier large language models (LLMs). We rely on a theoretical signal propagation analysis -- specifically, we analyse the representations of the last token in the final layer of the Transformer, as this is the representation used for next-token prediction. Our analysis reveals a representational collapse phenomenon: we prove that certain distinct sequences of inputs to the Transformer can yield arbitrarily close representations in the final token. This effect is exacerbated by the low-precision floating-point formats frequently used in modern LLMs. As a result, the model is provably unable to respond to these sequences in different ways -- leading to errors in, e.g., tasks involving counting or copying. Further, we show that decoder-only Transformer language models can lose sensitivity to specific tokens in the input, which relates to the well-known phenomenon of over-squashing in graph neural networks. We provide empirical evidence supporting our claims on contemporary LLMs. Our theory also points to simple solutions towards ameliorating these issues.

2024-05-31

arXiv (published)
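
The representational collapse described in the abstract above can be made concrete with a toy readout. This is a hedged sketch, not the paper's construction: it replaces attention with a uniform average over token values, and `round_to_bits` is a hypothetical helper that crudely mimics a low-precision float by keeping 8 significant binary digits (roughly the significand precision of bfloat16). Sequences of `n` ones followed by one zero give readouts that get closer as `n` grows, and beyond some length the rounded readouts coincide.

```python
import math


def round_to_bits(x, bits=8):
    """Round x to `bits` significant binary digits (crude low-precision stand-in)."""
    if x == 0.0:
        return 0.0
    mantissa, exponent = math.frexp(x)  # x == mantissa * 2**exponent, 0.5 <= |mantissa| < 1
    scale = 2 ** bits
    return math.ldexp(round(mantissa * scale) / scale, exponent)


def mean_pooled(n_ones):
    """Uniform-average readout for a sequence of n_ones ones followed by a single zero."""
    return n_ones / (n_ones + 1)


# The gap between neighbouring sequence lengths shrinks as sequences grow.
gap_short = abs(mean_pooled(10) - mean_pooled(11))
gap_long = abs(mean_pooled(1000) - mean_pooled(1001))

# After rounding, short sequences stay distinguishable but long ones do not.
short_distinct = round_to_bits(mean_pooled(10)) != round_to_bits(mean_pooled(11))
long_collapsed = round_to_bits(mean_pooled(1000)) == round_to_bits(mean_pooled(1001))

print(gap_short > gap_long, short_distinct, long_collapsed)
```

In exact arithmetic the two long-sequence readouts still differ; it is the rounding step that makes them identical, which is the sense in which low precision worsens the effect in this toy.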

Deep Grokking: Would Deep Neural Networks Generalize Better?

Simin Fan

Martin Jaggi

Recent research on the grokking phenomenon has illuminated the intricacies of neural networks' training dynamics and their generalization behaviors. Grokking refers to a sharp rise of the network's generalization accuracy on the test set, which occurs long after an extended overfitting phase, during which the network perfectly fits the training set. While the existing research primarily focuses on shallow networks such as 2-layer MLP and 1-layer Transformer, we explore grokking on deep networks (e.g. 12-layer MLP). We empirically replicate the phenomenon and find that deep neural networks can be more susceptible to grokking than their shallower counterparts. Meanwhile, we observe an intriguing multi-stage generalization phenomenon when increasing the depth of the MLP model, where the test accuracy exhibits a secondary surge, which is scarcely seen on shallow models. We further uncover compelling correspondences between the decreasing of feature ranks and the phase transition from overfitting to the generalization stage during grokking. Additionally, we find that the multi-stage generalization phenomenon often aligns with a double-descent pattern in feature ranks. These observations suggest that internal feature rank could serve as a more promising indicator of the model's generalization behavior compared to the weight-norm. We believe our work is the first one to dive into grokking in deep neural networks, and investigate the relationship between feature rank and generalization performance.

2024-05-28

arXiv (preprint)

No Representation, No Trust: Connecting Representation, Collapse, and Trust Issues in PPO

Skander Moalla

Andrea Miele

Daniil Pyatko

Caglar Gulçehre

Reinforcement learning (RL) is inherently rife with non-stationarity since the states and rewards the agent observes during training depend on its changing policy. Therefore, networks in deep RL must be capable of adapting to new observations and fitting new targets. However, previous works have observed that networks trained under non-stationarity exhibit an inability to continue learning, termed loss of plasticity, and eventually a collapse in performance. For off-policy deep value-based RL methods, this phenomenon has been correlated with a decrease in representation rank and the ability to fit random targets, termed capacity loss. Although this correlation has generally been attributed to neural network learning under non-stationarity, the connection to representation dynamics has not been carefully studied in on-policy policy optimization methods. In this work, we empirically study representation dynamics in Proximal Policy Optimization (PPO) on the Atari and MuJoCo environments, revealing that PPO agents are also affected by feature rank deterioration and capacity loss. We show that this is aggravated by stronger non-stationarity, ultimately driving the actor's performance to collapse, regardless of the performance of the critic. We ask why the trust region, specific to methods like PPO, cannot alleviate or prevent the collapse and find a connection between representation collapse and the degradation of the trust region, one exacerbating the other. Finally, we present Proximal Feature Optimization (PFO), a novel auxiliary loss that, along with other interventions, shows that regularizing the representation dynamics mitigates the performance collapse of PPO agents.

2024-04-30

arXiv (preprint)

Asynchronous Algorithmic Alignment with Cocycles

Andrew Joseph Dudzik

Tamara von Glehn

State-of-the-art neural algorithmic reasoners make use of message passing in graph neural networks (GNNs). But typical GNNs blur the distinction between the definition and invocation of the message function, forcing a node to send messages to its neighbours at every layer, synchronously. When applying GNNs to learn to execute dynamic programming algorithms, however, on most steps only a handful of the nodes would have meaningful updates to send. One, hence, runs the risk of inefficiencies by sending too much irrelevant data across the graph. But more importantly, many intermediate GNN steps have to learn the identity functions, which is a non-trivial learning problem. In this work, we explicitly separate the concepts of node state update and message function invocation. With this separation, we obtain a mathematical formulation that allows us to reason about asynchronous computation in both algorithms and neural networks. Our analysis yields several practical implementations of synchronous scalable GNN layers that are provably invariant under various forms of asynchrony.

2024-04-16

Proceedings of the Second Learning on Graphs Conference (published)

proceedings.mlr.press

Latent Space Representations of Neural Algorithmic Reasoners

Vladimir V. Mirjanić

Petar Veličković (University of Cambridge, Google DeepMind)

Neural Algorithmic Reasoning (NAR) is a research area focused on designing neural architectures that can reliably capture classical computation, usually by learning to execute algorithms. A typical approach is to rely on Graph Neural Network (GNN) architectures, which encode inputs in high-dimensional latent spaces that are repeatedly transformed during the execution of the algorithm. In this work we perform a detailed analysis of the structure of the latent space induced by the GNN when executing algorithms. We identify two possible failure modes: (i) loss of resolution, making it hard to distinguish similar values; (ii) inability to deal with values outside the range observed during training. We propose to solve the first issue by relying on a softmax aggregator, and propose to decay the latent space in order to deal with out-of-range values. We show that these changes lead to improvements on the majority of algorithms in the standard CLRS-30 benchmark when using the state-of-the-art Triplet-GMPNN processor. Our code is available at https://github.com/mirjanic/nar-latent-spaces

2024-04-16

Proceedings of the Second Learning on Graphs Conference (published)

proceedings.mlr.press

RecurrentGemma: Moving Past Transformers for Efficient Open Language Models

Aleksandar Botev

Soham De

Samuel L. Smith

Anushan Fernando

George-Cristian Muraru

Ruba Haroun

Leonard Berrada

Pier Giuseppe Sessa

Robert Dadashi

Léonard Hussenot

Johan Ferret

Sertan Girgin

Olivier Bachem

Alek Andreev

Kathleen Kenealy

Thomas Mesnard

Cassidy Hardin

Surya Bhupatiraju

Shreya Pathak

Laurent Sifre

Morgane Rivière

Mihir Kale

J Christopher Love

Juliette Love

Pouya Dehghani Tafti

Armand Joulin

Noah Fiedel

Evan Senter

Yutian Chen

Srivatsan Srinivasan

Guillaume Desjardins

David Mark Budden

Arnaud Doucet

Sharad Mandyam Vikram

Adam Paszke

Trevor Gale

Sebastian Borgeaud

Charlie Chen

Andy Brock

Antonia Paterson

Jenny Brennan

Meg Risdal

Raj Gundluru

N. Devanathan

Paul Mooney

Nilay Chauhan

Phil Culliton

Luiz Gustavo Martins

Elisa Bandy

David W. Huntsperger

Glenn Cameron

Arthur Zucker

Tris Brian Warkentin

Ludovic Peran

Minh Giang

Zoubin Ghahramani

Clément Farabet

Koray Kavukcuoglu

Demis Hassabis

Raia Hadsell

Yee Whye Teh

Nando de Freitas

We introduce RecurrentGemma, a family of open language models which uses Google's novel Griffin architecture. Griffin combines linear recurrences with local attention to achieve excellent performance on language. It has a fixed-sized state, which reduces memory use and enables efficient inference on long sequences. We provide two sizes of models, containing 2B and 9B parameters, and provide pre-trained and instruction tuned variants for both. Our models achieve comparable performance to similarly-sized Gemma baselines despite being trained on fewer tokens.

2024-03-31

arXiv (published)

Revisiting Dynamic Evaluation: Online Adaptation for Large Language Models

Amal Rannen-Triki

Jörg Bornschein

Marcus Hutter

András György

Alexandre Galashov

Yee Whye Teh

Michalis K. Titsias

We consider the problem of online fine tuning the parameters of a language model at test time, also known as dynamic evaluation. While it is generally known that this approach improves the overall predictive performance, especially when considering distributional shift between training and evaluation data, we here emphasize the perspective that online adaptation turns parameters into temporally changing states and provides a form of context-length extension with memory in weights, more in line with the concept of memory in neuroscience. We pay particular attention to the speed of adaptation (in terms of sample efficiency), sensitivity to the overall distributional drift, and the computational overhead for performing gradient computations and parameter updates. Our empirical study provides insights on when online adaptation is particularly interesting. We highlight that with online adaptation the conceptual distinction between in-context learning and fine tuning blurs: both are methods to condition the model on previously observed tokens.

2024-02-29

arXiv (published)