
Razvan Pascanu

Affiliate Member
Senior Research Scientist, Google DeepMind
Research Topics
Continual Learning
Deep Learning
Deep Neural Networks
Few-Shot Learning
Generalization
Geometric Deep Learning
Graph Neural Networks
Lifelong Learning
Machine Learning Theory
Mechanistic Interpretability
Neural Networks
Optimization
Recurrent Neural Networks
Reinforcement Learning
Representation Learning

Publications

Filter Equivariant Functions: A symmetric account of length-general extrapolation on lists
Owen Lewis
Neil Ghani
Andrew Joseph Dudzik
Christos Perivolaropoulos
Petar Veličković
How Overconfidence in Initial Choices and Underconfidence Under Criticism Modulate Change of Mind in Large Language Models
Dharshan Kumaran
Stephen M Fleming
Larisa Markeeva
Joseph Heyward
Andrea Banino
Mrinal Mathur
Simon Kayode Osindero
Benedetto De Martino
Petar Veličković
Viorica Patraucean
Large language models (LLMs) exhibit strikingly conflicting behaviors: they can appear steadfastly overconfident in their initial answers whilst at the same time being prone to excessive doubt when challenged. To investigate this apparent paradox, we developed a novel experimental paradigm, exploiting the unique ability to obtain confidence estimates from LLMs without creating memory of their initial judgments -- something impossible in human participants. We show that LLMs -- Gemma 3, GPT-4o and o1-preview -- exhibit a pronounced choice-supportive bias that reinforces and boosts their estimate of confidence in their answer, resulting in a marked resistance to change their mind. We further demonstrate that LLMs markedly overweight inconsistent compared to consistent advice, in a fashion that deviates qualitatively from normative Bayesian updating. Finally, we demonstrate that these two mechanisms -- a drive to maintain consistency with prior commitments and hypersensitivity to contradictory feedback -- parsimoniously capture LLM behavior in a different domain. Together, these findings furnish a mechanistic account of LLM confidence that explains both their stubbornness and excessive sensitivity to criticism.
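The methodological hinge of this paradigm is that an LLM's confidence can be elicited in a fresh context, with no memory of its earlier answer. Below is a minimal sketch of that two-condition design; `query_model` is a hypothetical stand-in for any stateless chat-completion call, and the prompts are illustrative assumptions, not the paper's materials.

```python
# Hypothetical sketch of the memory-free confidence paradigm; prompts
# and the query_model stub are illustrative, not the paper's.

def query_model(prompt: str) -> str:
    # Placeholder: replace with a real LLM API call.
    return "(model reply)"

question = "Is the Atlantic larger than the Indian Ocean? Answer yes or no."

# Turn 1: elicit an initial answer in its own context.
answer = query_model(question)

# Condition A (no memory): confidence is elicited in a fresh context
# that shows the answer but carries no record the model produced it.
conf_blind = query_model(
    f"{question}\nProposed answer: {answer}\n"
    "Rate the probability (0-100) that this answer is correct."
)

# Condition B (with memory): identical request, except the answer is
# framed as the model's own prior commitment -- the condition under
# which the choice-supportive confidence boost appears.
conf_own = query_model(
    f"{question}\nYou previously answered: {answer}\n"
    "Rate the probability (0-100) that your answer is correct."
)
```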
Optimizers Qualitatively Alter Solutions And We Should Leverage This
Clare Lyle
Ionut-Vlad Modoranu
Naima Elosegui Borras
Dan Alistarh
Petar Veličković
Soham De
James Martens
Due to the nonlinear nature of Deep Neural Networks (DNNs), one cannot guarantee convergence to a unique global minimum of the loss when using optimizers that rely only on local information, such as SGD. Indeed, this was a primary source of skepticism regarding the feasibility of DNNs in the early days of the field. The past decades of progress in deep learning have revealed this skepticism to be misplaced, and a large body of empirical evidence shows that sufficiently large DNNs following standard training protocols exhibit well-behaved optimization dynamics that converge to performant solutions. This success has biased the community to use convex optimization as a mental model for learning, leading to a focus on training efficiency, whether in terms of required iterations, FLOPs, or wall-clock time, when improving optimizers. We argue that, while this perspective has proven extremely fruitful, another perspective specific to DNNs has received considerably less attention: the optimizer not only influences the rate of convergence but also the qualitative properties of the learned solutions. Restated, the optimizer can and will encode inductive biases and change the effective expressivity of a given class of models. Furthermore, we believe the optimizer can be an effective way of encoding desiderata in the learning process. We contend that the community should aim at understanding the biases of already existing methods, as well as aim to build new optimizers with the explicit intent of inducing certain properties of the solution, rather than judging them solely on their convergence rates. We hope our arguments will inspire research to improve our understanding of how the learning process can impact the type of solution we converge to, and lead to greater recognition of optimizer design as a critical lever that complements the roles of architecture and data in shaping model outcomes.
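One concrete instance of "the optimizer encodes inductive biases" is easy to reproduce: on an underdetermined least-squares problem, gradient descent from a zero initialization converges to the minimum-norm interpolant, while Adam lands on a different zero-loss solution. A minimal numpy sketch of this effect (a toy setup of my own, not the paper's experiments):

```python
import numpy as np

# Toy setup: 16 examples, 64 parameters, so infinitely many zero-loss
# solutions exist. The optimizer determines which one we converge to.
rng = np.random.default_rng(0)
X = rng.normal(size=(16, 64))
y = X @ rng.normal(size=(64, 1))

def grad(w):
    return 2 * X.T @ (X @ w - y) / len(X)

def run_gd(steps=4000, lr=1e-2):
    w = np.zeros((64, 1))
    for _ in range(steps):
        w -= lr * grad(w)
    return w

def run_adam(steps=4000, lr=1e-2, b1=0.9, b2=0.999, eps=1e-8):
    w = np.zeros((64, 1))
    m, v = np.zeros_like(w), np.zeros_like(w)
    for t in range(1, steps + 1):
        g = grad(w)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g ** 2
        step = lr * (1 - t / (steps + 1))  # decay so Adam settles
        w -= step * (m / (1 - b1 ** t)) / (np.sqrt(v / (1 - b2 ** t)) + eps)
    return w

w_min = X.T @ np.linalg.solve(X @ X.T, y)  # minimum-norm interpolant
for name, w in [("gd", run_gd()), ("adam", run_adam()), ("min-norm", w_min)]:
    print(f"{name:8s} loss={np.mean((X @ w - y) ** 2):.2e} "
          f"norm={np.linalg.norm(w):.3f}")
```

Both optimizers drive the training loss to (near) zero, yet gradient descent recovers the minimum-norm solution while Adam does not; the optimizer, not the loss, selected the solution.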
RAT: Bridging RNN Efficiency and Attention Accuracy in Language Modeling
Xiuying Wei
Anunay Yadav
Caglar Gulcehre
Transformers have become the cornerstone of modern large-scale language models; however, their dependence on softmax attention poses a major computational bottleneck, particularly in long-context settings. In this work, rather than following prevalent approaches such as linear attention (or SSMs) and local attention, we introduce RAT, an intermediate design between recurrence and attention mechanisms. It partitions the input into chunks, applies a simple linear recurrence within each chunk to capture local dependencies, and then performs softmax attention across chunks to model long-range interactions. By adjusting the chunk size, RAT enables flexible trade-offs, combining the strengths of RNNs and attention. Empirically, with a chunk size of 16, the RAT layer achieves a 7× improvement in training speed with 100K-token sequences and 9× in generation at 4K sequence length, while maintaining similar or sometimes even better accuracy compared to standard attention. We demonstrate this by training 1.3B-parameter models from scratch and performing large-scale evaluations, including short- and long-context benchmarks, as well as supervised fine-tuning (SFT). We further propose a hybrid architecture that interleaves RAT with local attention. By combining efficient long-range modeling with strong local interactions, this hybrid design not only improves inference speed and reduces cache memory usage compared to attention, but also consistently enhances performance, for example achieving an average 1-point gain in commonsense reasoning tasks, up to 4 points on code tasks, and a 1-point ROUGE-L increase in a summarization SFT task. Code is available at https://github.com/CLAIRE-Labo/RAT
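A schematic numpy sketch of the chunked design described above: a linear recurrence inside each chunk, softmax attention across chunk summaries for long-range mixing. The EMA-style recurrence, summary choice, and shapes are illustrative assumptions, not the paper's exact parameterization (see the released code for that).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def rat_layer(x, chunk=16, decay=0.9):
    """Schematic RAT-style layer (illustrative, not the released model)."""
    T, d = x.shape
    assert T % chunk == 0
    xs = x.reshape(T // chunk, chunk, d)

    # 1) Intra-chunk linear recurrence (EMA-style scan), run in parallel
    #    over chunks: local dependencies cost O(chunk) sequential steps.
    h = np.zeros_like(xs)
    state = np.zeros((T // chunk, d))
    for t in range(chunk):
        state = decay * state + (1 - decay) * xs[:, t]
        h[:, t] = state

    # 2) Inter-chunk causal softmax attention over per-chunk summaries:
    #    long-range interactions cost attention over T/chunk positions.
    summaries = h[:, -1]                          # (n_chunks, d)
    scores = summaries @ summaries.T / np.sqrt(d)
    mask = np.tril(np.ones_like(scores))          # causal across chunks
    scores = np.where(mask > 0, scores, -np.inf)
    ctx = softmax(scores) @ summaries             # (n_chunks, d)

    # 3) Broadcast the long-range context back into each chunk.
    return (h + ctx[:, None, :]).reshape(T, d)

out = rat_layer(np.random.default_rng(0).normal(size=(64, 8)))
print(out.shape)  # (64, 8)
```

The chunk size is the dial: chunk=1 recovers pure attention over all positions, chunk=T recovers a pure recurrence, and intermediate values trade sequential depth against attention width.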
What Can Grokking Teach Us About Learning Under Nonstationarity?
Clare Lyle
Ghada Sokar
András György
In continual learning problems, it is often necessary to overwrite components of a neural network's learned representation in response to changes in the data stream; however, neural networks often exhibit primacy bias, whereby early training data hinders the network's ability to generalize on later tasks. While the feature-learning dynamics of nonstationary learning problems are not well studied, feature-learning dynamics are known to drive the phenomenon of grokking, wherein neural networks initially memorize their training data and only later exhibit perfect generalization. This work conjectures that the same feature-learning dynamics which facilitate generalization in grokking also underlie the ability to overwrite previously learned features, and that methods which accelerate grokking by facilitating feature-learning dynamics are promising candidates for addressing primacy bias in non-stationary learning problems. We then propose a straightforward method to induce feature-learning dynamics as needed throughout training by increasing the effective learning rate, i.e., the ratio between parameter and update norms. We show that this approach both facilitates feature learning and improves generalization in a variety of settings, including grokking, warm-starting neural network training, and reinforcement learning tasks.
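A minimal sketch of the effective-learning-rate intervention described above, taking the effective learning rate as the update-to-parameter norm ratio (the usual convention; the abstract's phrasing leaves the direction implicit). The target value and per-tensor application are assumptions, not the paper's exact recipe.

```python
import numpy as np

def apply_update(param, update, target_elr=1e-2, eps=1e-12):
    """Rescale `update` so that ||update|| / ||param|| hits target_elr.

    When training has settled (tiny raw updates relative to the weights),
    this re-inflates the step and re-induces feature learning on demand.
    Illustrative assumption: a fixed scalar target applied per tensor.
    """
    elr = np.linalg.norm(update) / (np.linalg.norm(param) + eps)
    scale = target_elr / (elr + eps)
    return param + scale * update

w = np.random.default_rng(0).normal(size=(128,))
g = -1e-5 * w  # a vanishingly small raw update, as in a settled network
w_new = apply_update(w, g)
print(np.linalg.norm(w_new - w) / np.linalg.norm(w))  # ~1e-2, as targeted
```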
Meta-learning how to Share Credit among Macro-Actions
Ionel-Alexandru Hosu
Traian Rebedea
MesaNet: Sequence Modeling by Locally Optimal Test-Time Training
Johannes Von Oswald
Nino Scherrer
Seijin Kobayashi
Luca Versari
Songlin Yang
Maximilian Schlegel
Kaitlin Maile
Yanick Schimpf
Oliver Sieberling
Alexander Meulemans
Rif A. Saurous
Charlotte Frenkel
Blaise Aguera y Arcas
João Sacramento
Sequence modeling is currently dominated by causal transformer architectures that use softmax self-attention. Although widely adopted, transformers require scaling memory and compute linearly during inference. A recent stream of work linearized the softmax operation, resulting in powerful recurrent neural network (RNN) models with constant memory and compute costs, such as DeltaNet, Mamba or xLSTM. These models can be unified by noting that their recurrent layer dynamics can all be derived from an in-context regression objective, approximately optimized through an online learning rule. Here, we join this line of work and introduce a numerically stable, chunkwise parallelizable version of the recently proposed Mesa layer (von Oswald et al., 2024), and study it in language modeling at the billion-parameter scale. This layer again stems from an in-context loss, but one which is now minimized to optimality at every time point using a fast conjugate gradient solver. Through an extensive suite of experiments, we show that optimal test-time training enables reaching lower language modeling perplexity and higher downstream benchmark performance than previous RNNs, especially on tasks requiring long-context understanding. This performance gain comes at the cost of additional flops spent during inference time. Our results are therefore intriguingly related to recent trends of increasing test-time compute to improve performance -- here by spending compute to solve sequential optimization problems within the neural network itself.
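To make the "minimized to optimality at every time point" idea concrete, here is a schematic sketch: at each step t, the layer's readout solves an in-context ridge regression over the key-value pairs seen so far, with a plain conjugate-gradient solve. The ridge form, λ, and dimensions are illustrative assumptions; the paper's layer is the numerically stable, chunkwise-parallel version.

```python
import numpy as np

def cg_solve(A, b, iters=20, tol=1e-10):
    """Plain conjugate gradient for symmetric positive-definite A."""
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def mesa_layer(K, V, Q, lam=1.0):
    """Output at t is W_t @ q_t, where W_t exactly minimizes the
    in-context loss sum_{i<=t} ||W k_i - v_i||^2 + lam ||W||^2.
    Equivalently: solve (sum_i k_i k_i^T + lam I) x = q_t by CG,
    then read out sum_i v_i (k_i^T x)."""
    T, d = K.shape
    out = np.zeros_like(V)
    A = lam * np.eye(d)
    for t in range(T):
        A = A + np.outer(K[t], K[t])   # running sum of k k^T
        x = cg_solve(A, Q[t])          # optimal solve at every step
        out[t] = V[: t + 1].T @ (K[: t + 1] @ x)
    return out

rng = np.random.default_rng(0)
K, V, Q = (rng.normal(size=(32, 8)) for _ in range(3))
print(mesa_layer(K, V, Q).shape)  # (32, 8)
```

The contrast with DeltaNet-style layers is visible here: those take one online-learning step toward the same objective per token, whereas the Mesa layer pays extra inference-time flops (the CG solve) to reach the optimum at every position.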
Unpacking Softmax: How Temperature Drives Representation Collapse, Compression, and Generalization
Wojciech Masarczyk
Mateusz Ostaszewski
Tin Sum Cheng
Tomasz Trzciński
Aurélien Lucchi
The softmax function is a fundamental building block of deep neural networks, commonly used to define output distributions in classification tasks or attention weights in transformer architectures. Despite its widespread use and proven effectiveness, its influence on learning dynamics and learned representations remains poorly understood, limiting our ability to optimize model behavior. In this paper, we study the pivotal role of the softmax function in shaping the model's representation. We introduce the concept of rank deficit bias -- a phenomenon in which softmax-based deep networks find solutions of rank much lower than the number of classes. This bias depends on the softmax function's logit norm, which is implicitly influenced by hyperparameters or directly modified by the softmax temperature. Furthermore, we demonstrate how to exploit the softmax dynamics to learn compressed representations or to enhance performance on out-of-distribution data. We validate our findings across diverse architectures and real-world datasets, highlighting the broad applicability of temperature tuning in improving model performance. Our work provides new insights into the mechanisms of softmax, enabling better control over representation learning in deep neural networks.
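For reference, a minimal sketch of the two ingredients the abstract connects: temperature-scaled softmax (which directly rescales the logit norm) and a numerical effective-rank measure of the kind one would use to observe the rank deficit bias. The tolerance and setup are my assumptions, not the paper's protocol.

```python
import numpy as np

def softmax_T(logits, temperature=1.0):
    """Temperature-scaled softmax: dividing logits by T rescales the
    logit norm -- the quantity the rank deficit bias depends on.
    Higher T flattens the distribution; lower T sharpens it."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def effective_rank(H, tol=1e-3):
    """Count singular values above tol * largest: a simple proxy for
    the rank of a batch of representations H with shape (batch, dim)."""
    s = np.linalg.svd(H, compute_uv=False)
    return int((s > tol * s[0]).sum())

# Demo on synthetic features with a decaying spectrum: most directions
# fall below the tolerance, so the effective rank is well below dim.
H = np.random.default_rng(0).normal(size=(256, 32)) @ np.diag(
    np.logspace(0, -4, 32)
)
print(effective_rank(H))
```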
Plasticity as the Mirror of Empowerment
David Abel
Michael Bowling
Andre Barreto
Will Dabney
Shi Dong
Steven Hansen
Anna Harutyunyan
Clare Lyle
Georgios Piliouras
Jonathan Richens
Mark Rowland
Tom Schaul
Satinder Singh