Razvan Pascanu

Affiliate Member
Senior Research Scientist, Google DeepMind
Research Topics
Continual Learning
Deep Learning
Deep Neural Networks
Few-Shot Learning
Generalization
Geometric Deep Learning
Graph Neural Networks
Lifelong Learning
Machine Learning Theory
Mechanistic Interpretability
Neural Networks
Optimization
Recurrent Neural Networks
Reinforcement Learning
Representation Learning

Publications

Meta-learning how to Share Credit among Macro-Actions
Ionel-Alexandru Hosu
Traian Rebedea
MesaNet: Sequence Modeling by Locally Optimal Test-Time Training
Johannes Von Oswald
Nino Scherrer
Seijin Kobayashi
Luca Versari
Songlin Yang
Maximilian Schlegel
Kaitlin Maile
Yanick Schimpf
Oliver Sieberling
Alexander Meulemans
Rif A. Saurous
Charlotte Frenkel
Blaise Aguera y Arcas
João Sacramento
Sequence modeling is currently dominated by causal transformer architectures that use softmax self-attention. Although widely adopted, transformers require scaling memory and compute linearly during inference. A recent stream of work linearized the softmax operation, resulting in powerful recurrent neural network (RNN) models with constant memory and compute costs such as DeltaNet, Mamba or xLSTM. These models can be unified by noting that their recurrent layer dynamics can all be derived from an in-context regression objective, approximately optimized through an online learning rule. Here, we join this line of work and introduce a numerically stable, chunkwise parallelizable version of the recently proposed Mesa layer (von Oswald et al., 2024), and study it in language modeling at the billion-parameter scale. This layer again stems from an in-context loss, but which is now minimized to optimality at every time point using a fast conjugate gradient solver. Through an extensive suite of experiments, we show that optimal test-time training enables reaching lower language modeling perplexity and higher downstream benchmark performance than previous RNNs, especially on tasks requiring long context understanding. This performance gain comes at the cost of additional FLOPs spent during inference time. Our results are therefore intriguingly related to recent trends of increasing test-time compute to improve performance -- here by spending compute to solve sequential optimization problems within the neural network itself.
Unpacking Softmax: How Temperature Drives Representation Collapse, Compression, and Generalization
Wojciech Masarczyk
Mateusz Ostaszewski
Tin Sum Cheng
Tomasz Trzciński
Aurélien Lucchi
The softmax function is a fundamental building block of deep neural networks, commonly used to define output distributions in classification tasks or attention weights in transformer architectures. Despite its widespread use and proven effectiveness, its influence on learning dynamics and learned representations remains poorly understood, limiting our ability to optimize model behavior. In this paper, we study the pivotal role of the softmax function in shaping the model's representation. We introduce the concept of rank deficit bias - a phenomenon in which softmax-based deep networks find solutions of rank much lower than the number of classes. This bias depends on the softmax function's logits norm, which is implicitly influenced by hyperparameters or directly modified by softmax temperature. Furthermore, we demonstrate how to exploit the softmax dynamics to learn compressed representations or to enhance their performance on out-of-distribution data. We validate our findings across diverse architectures and real-world datasets, highlighting the broad applicability of temperature tuning in improving model performance. Our work provides new insights into the mechanisms of softmax, enabling better control over representation learning in deep neural networks.
Plasticity as the Mirror of Empowerment
David Abel
Michael Bowling
Andre Barreto
Will Dabney
Shi Dong
Steven Hansen
Anna Harutyunyan
Clare Lyle
Georgios Piliouras
Jonathan Richens
Mark Rowland
Tom Schaul
Satinder Singh
On the generalization of language models from in-context learning and finetuning: a controlled study
Andrew Lampinen
Arslan Chaudhry
Stephanie C.Y. Chan
Cody Wild
Diane Wan
Alexander Y. Ku
Alex Ku
Jorg Bornschein
Murray P. Shanahan
James L McClelland
LLMs are Greedy Agents: Effects of RL Fine-tuning on Decision-Making Abilities
Thomas Schmied
Jorg Bornschein
Jordi Grau-Moya
Markus Wulfmeier
Why do LLMs attend to the first token?
Federico Barbero
Álvaro Arroyo
Xiangming Gu
Christos Perivolaropoulos
Michael M. Bronstein
Petar Veličković