Razvan Pascanu

Petar Veličković

We study how information propagates in decoder-only Transformers, which are the architectural backbone of most existing frontier large language models (LLMs). We rely on a theoretical signal propagation analysis -- specifically, we analyse the representations of the last token in the final layer of the Transformer, as this is the representation used for next-token prediction. Our analysis reveals a representational collapse phenomenon: we prove that certain distinct sequences of inputs to the Transformer can yield arbitrarily close representations in the final token. This effect is exacerbated by the low-precision floating-point formats frequently used in modern LLMs. As a result, the model is provably unable to respond to these sequences in different ways -- leading to errors in, e.g., tasks involving counting or copying. Further, we show that decoder-only Transformer language models can lose sensitivity to specific tokens in the input, which relates to the well-known phenomenon of over-squashing in graph neural networks. We provide empirical evidence supporting our claims on contemporary LLMs. Our theory also points to simple solutions towards ameliorating these issues.

2024-06-01

arXiv (published)

Deep Grokking: Would Deep Neural Networks Generalize Better?

Simin Fan

Martin Jaggi

Recent research on the grokking phenomenon has illuminated the intricacies of neural networks' training dynamics and their generalization behaviors. Grokking refers to a sharp rise of the network's generalization accuracy on the test set, which occurs long after an extended overfitting phase, during which the network perfectly fits the training set. While existing research primarily focuses on shallow networks such as 2-layer MLPs and 1-layer Transformers, we explore grokking on deep networks (e.g. 12-layer MLP). We empirically replicate the phenomenon and find that deep neural networks can be more susceptible to grokking than their shallower counterparts. Meanwhile, we observe an intriguing multi-stage generalization phenomenon when increasing the depth of the MLP model, where the test accuracy exhibits a secondary surge, which is scarcely seen on shallow models. We further uncover compelling correspondences between the decrease of feature ranks and the phase transition from overfitting to the generalization stage during grokking. Additionally, we find that the multi-stage generalization phenomenon often aligns with a double-descent pattern in feature ranks. These observations suggest that internal feature rank could serve as a more promising indicator of the model's generalization behavior compared to the weight-norm. We believe our work is the first one to dive into grokking in deep neural networks, and investigate the relationship of feature rank and generalization performance.

2024-05-29

arXiv (preprint)

No Representation, No Trust: Connecting Representation, Collapse, and Trust Issues in PPO

Skander Moalla

Andrea Miele

Daniil Pyatko

Reinforcement learning (RL) is inherently rife with non-stationarity since the states and rewards the agent observes during training depend on its changing policy. Therefore, networks in deep RL must be capable of adapting to new observations and fitting new targets. However, previous works have observed that networks trained under non-stationarity exhibit an inability to continue learning, termed loss of plasticity, and eventually a collapse in performance. For off-policy deep value-based RL methods, this phenomenon has been correlated with a decrease in representation rank and the ability to fit random targets, termed capacity loss. Although this correlation has generally been attributed to neural network learning under non-stationarity, the connection to representation dynamics has not been carefully studied in on-policy policy optimization methods. In this work, we empirically study representation dynamics in Proximal Policy Optimization (PPO) on the Atari and MuJoCo environments, revealing that PPO agents are also affected by feature rank deterioration and capacity loss. We show that this is aggravated by stronger non-stationarity, ultimately driving the actor's performance to collapse, regardless of the performance of the critic. We ask why the trust region, specific to methods like PPO, cannot alleviate or prevent the collapse and find a connection between representation collapse and the degradation of the trust region, one exacerbating the other. Finally, we present Proximal Feature Optimization (PFO), a novel auxiliary loss that, along with other interventions, shows that regularizing the representation dynamics mitigates the performance collapse of PPO agents.

2024-05-01

arXiv (preprint)

Asynchronous Algorithmic Alignment with Cocycles

Andrew Joseph Dudzik

Tamara von Glehn

Petar Veličković

State-of-the-art neural algorithmic reasoners make use of message passing in graph neural networks (GNNs). But typical GNNs blur the distinction between the definition and invocation of the message function, forcing a node to send messages to its neighbours at every layer, synchronously. When applying GNNs to learn to execute dynamic programming algorithms, however, on most steps only a handful of the nodes would have meaningful updates to send. One, hence, runs the risk of inefficiencies by sending too much irrelevant data across the graph. But more importantly, many intermediate GNN steps have to learn the identity functions, which is a non-trivial learning problem. In this work, we explicitly separate the concepts of node state update and message function invocation. With this separation, we obtain a mathematical formulation that allows us to reason about asynchronous computation in both algorithms and neural networks. Our analysis yields several practical implementations of synchronous scalable GNN layers that are provably invariant under various forms of asynchrony.

2024-04-17

Proceedings of the Second Learning on Graphs Conference (published)

Latent Space Representations of Neural Algorithmic Reasoners

Vladimir V. Mirjanić

University of Cambridge

Petar Veličković

Google DeepMind

Neural Algorithmic Reasoning (NAR) is a research area focused on designing neural architectures that can reliably capture classical computation, usually by learning to execute algorithms. A typical approach is to rely on Graph Neural Network (GNN) architectures, which encode inputs in high-dimensional latent spaces that are repeatedly transformed during the execution of the algorithm. In this work we perform a detailed analysis of the structure of the latent space induced by the GNN when executing algorithms. We identify two possible failure modes: (i) loss of resolution, making it hard to distinguish similar values; (ii) inability to deal with values outside the range observed during training. We propose to solve the first issue by relying on a softmax aggregator, and propose to decay the latent space in order to deal with out-of-range values. We show that these changes lead to improvements on the majority of algorithms in the standard CLRS-30 benchmark when using the state-of-the-art Triplet-GMPNN processor. Our code is available at https://github.com/mirjanic/nar-latent-spaces

2024-04-17

Proceedings of the Second Learning on Graphs Conference (published)

RecurrentGemma: Moving Past Transformers for Efficient Open Language Models

Aleksandar Botev

Soham De

Samuel L. Smith

Anushan Fernando

George-Cristian Muraru

Ruba Haroun

Leonard Berrada

Pier Giuseppe Sessa

Robert Dadashi

Léonard Hussenot

Johan Ferret

Sertan Girgin

Olivier Bachem

Alek Andreev

Kathleen Kenealy

Thomas Mesnard

Cassidy Hardin

Surya Bhupatiraju

Shreya Pathak

Laurent Sifre

Morgane Rivière

Mihir Kale

J Christopher Love

Juliette Love

Pouya Dehghani Tafti

Armand Joulin

Noah Fiedel

Evan Senter

Yutian Chen

Srivatsan Srinivasan

Guillaume Desjardins

David Mark Budden

Arnaud Doucet

Sharad Mandyam Vikram

Adam Paszke

Trevor Gale

Sebastian Borgeaud

Charlie Chen

Andy Brock

Antonia Paterson

Jenny Brennan

Meg Risdal

Raj Gundluru

N. Devanathan

Paul Mooney

Nilay Chauhan

Phil Culliton

Luiz Gustavo Martins

Elisa Bandy

David W. Huntsperger

Glenn Cameron

Arthur Zucker

Tris Brian Warkentin

Ludovic Peran

Minh Giang

Zoubin Ghahramani

Clément Farabet

Koray Kavukcuoglu

Demis Hassabis

Raia Hadsell

Yee Whye Teh

Nando de Freitas

We introduce RecurrentGemma, a family of open language models which uses Google's novel Griffin architecture. Griffin combines linear recurrences with local attention to achieve excellent performance on language. It has a fixed-sized state, which reduces memory use and enables efficient inference on long sequences. We provide two sizes of models, containing 2B and 9B parameters, and provide pre-trained and instruction tuned variants for both. Our models achieve comparable performance to similarly-sized Gemma baselines despite being trained on fewer tokens.

2024-04-01

arXiv (published)

Revisiting Dynamic Evaluation: Online Adaptation for Large Language Models

Amal Rannen-Triki

Jörg Bornschein

Marcus Hutter

András György

Alexandre Galashov

Yee Whye Teh

Michalis K. Titsias

We consider the problem of online fine tuning the parameters of a language model at test time, also known as dynamic evaluation. While it is generally known that this approach improves the overall predictive performance, especially when considering distributional shift between training and evaluation data, we here emphasize the perspective that online adaptation turns parameters into temporally changing states and provides a form of context-length extension with memory in weights, more in line with the concept of memory in neuroscience. We pay particular attention to the speed of adaptation (in terms of sample efficiency), sensitivity to the overall distributional drift, and the computational overhead for performing gradient computations and parameter updates. Our empirical study provides insights on when online adaptation is particularly interesting. We highlight that with online adaptation the conceptual distinction between in-context learning and fine tuning blurs: both are methods to condition the model on previously observed tokens.

2024-03-01

arXiv (published)

Disentangling the Causes of Plasticity Loss in Neural Networks

Clare Lyle

Zeyu Zheng

Khimya Khetarpal

Hado van Hasselt

James Martens

Will Dabney

Underpinning the past decades of work on the design, initialization, and optimization of neural networks is a seemingly innocuous assumption: that the network is trained on a stationary data distribution. In settings where this assumption is violated, e.g. deep reinforcement learning, learning algorithms become unstable and brittle with respect to hyperparameters and even random seeds. One factor driving this instability is the loss of plasticity, meaning that updating the network's predictions in response to new information becomes more difficult as training progresses. While many recent works provide analyses and partial solutions to this phenomenon, a fundamental question remains unanswered: to what extent do known mechanisms of plasticity loss overlap, and how can mitigation strategies be combined to best maintain the trainability of a network? This paper addresses these questions, showing that loss of plasticity can be decomposed into multiple independent mechanisms and that, while intervening on any single mechanism is insufficient to avoid the loss of plasticity in all cases, intervening on multiple mechanisms in conjunction results in highly robust learning algorithms. We show that a combination of layer normalization and weight decay is highly effective at maintaining plasticity in a variety of synthetic nonstationary learning tasks, and further demonstrate its effectiveness on naturally arising nonstationarities, including reinforcement learning in the Arcade Learning Environment.

2024-02-29

arXiv (preprint)

Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models

Soham De

Samuel L. Smith

Anushan Fernando

Aleksandar Botev

George-Cristian Muraru

Albert Gu

Ruba Haroun

Leonard Berrada

Yutian Chen

Srivatsan Srinivasan

Guillaume Desjardins

Arnaud Doucet

David Mark Budden

Yee Whye Teh

Nando de Freitas

Recurrent neural networks (RNNs) have fast inference and scale efficiently on long sequences, but they are difficult to train and hard to scale. We propose Hawk, an RNN with gated linear recurrences, and Griffin, a hybrid model that mixes gated linear recurrences with local attention. Hawk exceeds the reported performance of Mamba on downstream tasks, while Griffin matches the performance of Llama-2 despite being trained on over 6 times fewer tokens. We also show that Griffin can extrapolate on sequences significantly longer than those seen during training. Our models match the hardware efficiency of Transformers during training, and during inference they have lower latency and significantly higher throughput. We scale Griffin up to 14B parameters, and explain how to shard our models for efficient distributed training.

2024-02-29

arXiv (preprint)

Building on Efficient Foundations: Effective Training of LLMs with Structured Feedforward Layers.

Xiuying Wei

Skander Moalla