Portrait of Razvan Pascanu

Razvan Pascanu

Affiliate Member
Senior Research Scientist, Google DeepMind
Research Topics
Continual Learning
Deep Learning
Deep Neural Networks
Few-Shot Learning
Generalization
Geometric Deep Learning
Graph Neural Networks
Lifelong Learning
Machine Learning Theory
Mechanistic Interpretability
Neural Networks
Optimization
Recurrent Neural Networks
Reinforcement Learning
Representation Learning

Publications

Attention as a Hypernetwork
Simon Schug
Seijin Kobayashi
Yassir Akram
João Sacramento
Transformers can under some circumstances generalize to novel problem instances whose constituent parts might have been encountered during training, but whose compositions have not. What mechanisms underlie this ability for compositional generalization? By reformulating multi-head attention as a hypernetwork, we reveal that a composable, low-dimensional latent code specifies key-query specific operations. We find empirically that this latent code is predictive of the subtasks the network performs on unseen task compositions, revealing that latent codes acquired during training are reused to solve unseen problem instances. To further examine the hypothesis that the intrinsic hypernetwork of multi-head attention supports compositional generalization, we ablate whether making the hypernetwork-generated linear value network nonlinear strengthens compositionality. We find that this modification improves compositional generalization on abstract reasoning tasks. In particular, we introduce a symbolic version of the Raven's Progressive Matrices human intelligence test, which gives us precise control over the problem compositions encountered during training and evaluation. We demonstrate on this task how scaling model size and data enables compositional generalization in transformers and gives rise to a functionally structured latent space.
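The hypernetwork view can be written out directly: for every key-query pair, the attention weights across heads form a latent code that mixes fixed per-head value/output maps into a pair-specific linear operation. A minimal PyTorch sketch of that identity (unmasked attention, single sequence; shape names are illustrative, not taken from the paper):

```python
import torch

def attention_as_hypernetwork(x, W_q, W_k, W_v, W_o):
    """Multi-head attention rewritten as a hypernetwork.

    x:             (T, d)    input sequence
    W_q, W_k, W_v: (H, d, e) per-head query/key/value projections
    W_o:           (H, e, d) per-head output maps
    """
    H, d, e = W_q.shape
    q = torch.einsum("td,hde->hte", x, W_q)                    # (H, T, e)
    k = torch.einsum("td,hde->hte", x, W_k)                    # (H, T, e)
    # Latent code: attention weights across heads for each (query, key) pair.
    a = torch.softmax(torch.einsum("hie,hje->hij", q, k) / e ** 0.5, dim=-1)

    # Fixed per-head value->output maps act as the hypernetwork's basis.
    basis = torch.einsum("hde,hef->hdf", W_v, W_o)             # (H, d, d)

    # The code a[:, i, j] generates a key-query specific linear operation
    # M[i, j] = sum_h a[h, i, j] * basis[h], which is then applied to x_j.
    M = torch.einsum("hij,hdf->ijdf", a, basis)                # (T, T, d, d)
    out = torch.einsum("ijdf,jd->if", M, x)                    # (T, d)
    return out, a


# Usage sketch
T, d, H, e = 5, 8, 2, 4
x = torch.randn(T, d)
W_q, W_k, W_v = (torch.randn(H, d, e) for _ in range(3))
W_o = torch.randn(H, e, d)
out, latent_code = attention_as_hypernetwork(x, W_q, W_k, W_v, W_o)
```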
Hadamard product in deep learning: Introduction, Advances and Challenges.
Grigorios G Chrysos
Yongtao Wu
Philip Torr
Volkan Cevher
Meta-learning how to Share Credit among Macro-Actions
Ionel-Alexandru Hosu
Traian Rebedea
One proposed mechanism to improve exploration in reinforcement learning is through the use of macro-actions. Paradoxically though, in many scenarios the naive addition of macro-actions does not lead to better exploration, but rather the opposite. It has been argued that this was caused by adding non-useful macros and multiple works have focused on mechanisms to discover effectively environment-specific useful macros. In this work, we take a slightly different perspective. We argue that the difficulty stems from the trade-offs between reducing the average number of decisions per episode versus increasing the size of the action space. Namely, one typically treats each potential macro-action as independent and atomic, hence strictly increasing the search space and making typical exploration strategies inefficient. To address this problem we propose a novel regularization term that exploits the relationship between actions and macro-actions to improve the credit assignment mechanism by reducing the effective dimension of the action space and, therefore, improving exploration. The term relies on a similarity matrix that is meta-learned jointly with learning the desired policy. We empirically validate our strategy looking at macro-actions in Atari games, and the StreetFighter II environment. Our results show significant improvements over the Rainbow-DQN baseline in all environments. Additionally, we show that the macro-action similarity is transferable to related environments. We believe this work is a small but important step towards understanding how the similarity-imposed geometry on the action space can be exploited to improve credit assignment and exploration, therefore making learning more effective.
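The abstract does not spell out the exact form of the regularization term, so the following is only a hypothetical illustration of the general idea: a meta-learned similarity matrix over the joint space of primitive and macro-actions is used to share credit between related actions by pulling their preferences together.

```python
import torch
import torch.nn.functional as F

def similarity_regularizer(logits, similarity_logits):
    """Hypothetical credit-sharing penalty over actions and macro-actions.

    logits:            (B, A) policy logits over primitive + macro actions
    similarity_logits: (A, A) meta-learned parameters; S = row-wise softmax

    Pulls together the logits of actions that S deems similar, lowering the
    effective dimension of the action space. Illustrative only; the paper's
    actual regularizer may take a different form.
    """
    S = torch.softmax(similarity_logits, dim=-1)   # (A, A) similarity matrix
    smoothed = logits @ S.T                        # mix logits of similar actions
    return F.mse_loss(logits, smoothed)


# Usage sketch: added to the RL loss; similarity_logits are meta-learned
# jointly with the policy.
logits = torch.randn(32, 18, requires_grad=True)
similarity_logits = torch.zeros(18, 18, requires_grad=True)
similarity_regularizer(logits, similarity_logits).backward()
```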
Torque-Aware Momentum
Efficiently exploring complex loss landscapes is key to the performance of deep neural networks. While momentum-based optimizers are widely used in state-of-the-art setups, classical momentum can still struggle with large, misaligned gradients, leading to oscillations. To address this, we propose Torque-Aware Momentum (TAM), which introduces a damping factor based on the angle between the new gradients and previous momentum, stabilizing the update direction during training. Empirical results show that TAM, which can be combined with both SGD and Adam, enhances exploration, handles distribution shifts more effectively, and improves generalization performance across various tasks, including image classification and large language model fine-tuning, when compared to classical momentum-based optimizers.
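A minimal sketch of the idea, assuming one plausible damping function (the cosine of the angle between the incoming gradient and the stored momentum, rescaled to [0, 1]); the paper's exact definition of the damping factor may differ.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def tam_sgd_step(params, momenta, lr=0.01, beta=0.9):
    """One SGD-with-momentum step with an angle-based damping factor.

    When the fresh gradient points away from the accumulated momentum
    (large angle), its contribution is damped, reducing oscillations.
    The specific damping function below is an assumption for illustration.
    """
    for p, m in zip(params, momenta):
        if p.grad is None:
            continue
        cos = F.cosine_similarity(p.grad.flatten(), m.flatten(), dim=0, eps=1e-12)
        damp = (1.0 + cos) / 2.0            # 1 when aligned, 0 when opposed
        m.mul_(beta).add_(damp * p.grad)    # damped momentum accumulation
        p.add_(m, alpha=-lr)                # parameter update


# Usage sketch
w = torch.randn(10, requires_grad=True)
(w.sum() ** 2).backward()
tam_sgd_step([w], [torch.zeros_like(w)])
```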
Non-Stationary Learning of Neural Networks with Automatic Soft Parameter Reset
Alexandre Galashov
Michalis K. Titsias
András György
Clare Lyle
Yee Whye Teh
Maneesh Sahani
Neural networks are traditionally trained under the assumption that data come from a stationary distribution. However, settings which violate this assumption are becoming more popular; examples include supervised learning under distributional shifts, reinforcement learning, continual learning and non-stationary contextual bandits. In this work we introduce a novel learning approach that automatically models and adapts to non-stationarity, via an Ornstein-Uhlenbeck process with an adaptive drift parameter. The adaptive drift tends to draw the parameters towards the initialisation distribution, so the approach can be understood as a form of soft parameter reset. We show empirically that our approach performs well in non-stationary supervised and off-policy reinforcement learning settings.
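A minimal discrete-time sketch of the mechanism: after each gradient step, an Ornstein-Uhlenbeck-style drift pulls the parameters back toward their initialization, acting as a soft reset. How the drift is adapted online is the paper's contribution and is not reproduced here; the fixed `drift` value below is an assumption.

```python
import torch

@torch.no_grad()
def soft_reset_step(params, init_params, lr=1e-3, drift=1e-2, noise_std=0.0):
    """Gradient step followed by an OU-style pull toward initialization.

    theta <- theta - lr * grad                        (usual update)
    theta <- theta + drift * (theta_init - theta)     (soft reset / mean reversion)
             + noise_std * N(0, I)                    (optional diffusion term)
    """
    for p, p0 in zip(params, init_params):
        if p.grad is not None:
            p.add_(p.grad, alpha=-lr)
        p.add_(p0 - p, alpha=drift)                   # drift toward initialization
        if noise_std > 0:
            p.add_(torch.randn_like(p), alpha=noise_std)


# Usage sketch
w = torch.randn(4, requires_grad=True)
w0 = w.detach().clone()
(w ** 2).sum().backward()
soft_reset_step([w], [w0])
```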
Retrieval-Augmented Decision Transformer: External Memory for In-context RL
Thomas Schmied
Fabian Paischer
Vihang P. Patil
Markus Hofmarcher
Sepp Hochreiter
In-context learning (ICL) is the ability of a model to learn a new task by observing a few exemplars in its context. While prevalent in NLP, this capability has recently also been observed in Reinforcement Learning (RL) settings. Prior in-context RL methods, however, require entire episodes in the agent's context. Given that complex environments typically lead to long episodes with sparse rewards, these methods are constrained to simple environments with short episodes. To address these challenges, we introduce Retrieval-Augmented Decision Transformer (RA-DT). RA-DT employs an external memory mechanism to store past experiences from which it retrieves only sub-trajectories relevant for the current situation. The retrieval component in RA-DT does not require training and can be entirely domain-agnostic. We evaluate the capabilities of RA-DT on grid-world environments, robotics simulations, and procedurally-generated video games. On grid-worlds, RA-DT outperforms baselines, while using only a fraction of their context length. Furthermore, we illuminate the limitations of current in-context RL methods on complex environments and discuss future directions. To facilitate future research, we release datasets for four of the considered environments.
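A minimal sketch of the external-memory retrieval idea: embed past sub-trajectories, store them, and fetch the top-k most similar to the current situation as extra context for the policy. The embedding function and `k` below are placeholders; the paper's retrieval component is training-free and can be domain-agnostic, but its exact design is not reproduced here.

```python
import torch
import torch.nn.functional as F

class TrajectoryMemory:
    """External memory of sub-trajectory embeddings (illustrative sketch)."""

    def __init__(self):
        self.keys = []      # embeddings of stored sub-trajectories
        self.values = []    # the sub-trajectories themselves

    def add(self, embedding, subtrajectory):
        self.keys.append(embedding)
        self.values.append(subtrajectory)

    def retrieve(self, query, k=4):
        keys = torch.stack(self.keys)                           # (N, D)
        sims = F.cosine_similarity(keys, query.unsqueeze(0))    # (N,)
        topk = sims.topk(min(k, len(self.values))).indices
        return [self.values[i] for i in topk.tolist()]


# Usage sketch: embed_fn stands in for any frozen, domain-agnostic encoder.
embed_fn = lambda traj: traj.mean(dim=0)
memory = TrajectoryMemory()
for _ in range(10):
    traj = torch.randn(20, 8)                                   # fake sub-trajectory
    memory.add(embed_fn(traj), traj)

retrieved = memory.retrieve(embed_fn(torch.randn(20, 8)), k=4)  # context for the policy
```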
Switching between tasks can cause AI to lose the ability to learn
Clare Lyle
When can transformers compositionally generalize in-context?
Seijin Kobayashi
Simon Schug
Yassir Akram
Florian Redhardt
Johannes Von Oswald
João Sacramento
Many tasks can be composed from a few independent components. This gives rise to a combinatorial explosion of possible tasks, only some of which might be encountered during training. Under what circumstances can transformers compositionally generalize from a subset of tasks to all possible combinations of tasks that share similar components? Here we study a modular multitask setting that allows us to precisely control compositional structure in the data generation process. We present evidence that transformers learning in-context struggle to generalize compositionally on this task despite being in principle expressive enough to do so. Compositional generalization becomes possible only when introducing a bottleneck that enforces an explicit separation between task inference and task execution.
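An illustrative sketch of the kind of bottleneck described, with the architecture and names being assumptions rather than the paper's setup: one module compresses the context exemplars into a small task code, and a separate module executes that code on the query, making the inference/execution separation explicit.

```python
import torch
import torch.nn as nn

class BottleneckedInContextLearner(nn.Module):
    """Explicit separation of task inference and task execution (sketch)."""

    def __init__(self, d_in, d_hidden=64, d_code=4):
        super().__init__()
        # Task inference: context exemplars -> low-dimensional task code.
        self.infer = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU(),
                                   nn.Linear(d_hidden, d_code))
        # Task execution: (query, code) -> prediction.
        self.execute = nn.Sequential(nn.Linear(d_in + d_code, d_hidden), nn.ReLU(),
                                     nn.Linear(d_hidden, d_in))

    def forward(self, context, query):
        # context: (B, N, d_in) exemplars, query: (B, d_in)
        code = self.infer(context).mean(dim=1)       # (B, d_code) bottleneck
        return self.execute(torch.cat([query, code], dim=-1))


model = BottleneckedInContextLearner(d_in=16)
prediction = model(torch.randn(2, 8, 16), torch.randn(2, 16))   # (2, 16)
```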
Investigating Low-Rank Training in Transformer Language Models: Efficiency and Scaling Analysis
Xiuying Wei
Skander Moalla
State-of-the-art LLMs often rely on scale with high computational costs, which has sparked a research agenda to reduce parameter counts and costs without significantly impacting performance. Our study focuses on Transformer-based LLMs, specifically applying low-rank parametrization to the computationally intensive feedforward networks (FFNs), which are less studied than attention blocks. In contrast to previous works, (i) we explore low-rank parametrization at scale, up to 1.3B parameters; (ii) within Transformer language models rather than convolutional architectures; and (iii) training from scratch. Experiments on the large RefinedWeb dataset show that low-rank parametrization is both efficient (e.g., 2.6
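A minimal sketch of the low-rank FFN parametrization studied: each dense projection of the feedforward block is replaced by a product of two thin matrices. The dimensions and rank below are illustrative, not the paper's configurations.

```python
import torch
import torch.nn as nn

class LowRankLinear(nn.Module):
    """W ~= U V with V: (r, d_in), U: (d_out, r); parameter count drops from
    d_out * d_in to r * (d_in + d_out)."""

    def __init__(self, d_in, d_out, rank):
        super().__init__()
        self.V = nn.Linear(d_in, rank, bias=False)
        self.U = nn.Linear(rank, d_out)

    def forward(self, x):
        return self.U(self.V(x))


class LowRankFFN(nn.Module):
    """Transformer feedforward block with both projections low-rank."""

    def __init__(self, d_model=768, d_ff=3072, rank=256):
        super().__init__()
        self.up = LowRankLinear(d_model, d_ff, rank)
        self.down = LowRankLinear(d_ff, d_model, rank)
        self.act = nn.GELU()

    def forward(self, x):
        return self.down(self.act(self.up(x)))


ffn = LowRankFFN()
y = ffn(torch.randn(2, 10, 768))   # (2, 10, 768)
```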
Improving fine-grained understanding in image-text pre-training
Ioana Bica
Anastasija Ilić
Matthias Bauer
Goker Erdogan
Matko Bošnjak
Christos Kaplanis
Alexey A. Gritsenko
Matthias Minderer
Charles Blundell
We introduce SPARse Fine-grained Contrastive Alignment (SPARC), a simple method for pretraining more fine-grained multimodal representations from image-text pairs. Given that multiple image patches often correspond to single words, we propose to learn a grouping of image patches for every token in the caption. To achieve this, we use a sparse similarity metric between image patches and language tokens and compute for each token a language-grouped vision embedding as the weighted average of patches. The token and language-grouped vision embeddings are then contrasted through a fine-grained sequence-wise loss that only depends on individual samples and does not require other batch samples as negatives. This enables more detailed information to be learned in a computationally inexpensive manner. SPARC combines this fine-grained loss with a contrastive loss between global image and text embeddings to learn representations that simultaneously encode global and local information. We thoroughly evaluate our proposed method and show improved performance over competing approaches both on image-level tasks relying on coarse-grained information, e.g. classification, as well as region-level tasks relying on fine-grained information, e.g. retrieval, object detection, and segmentation. Moreover, SPARC improves model faithfulness and captioning in foundational vision-language models.
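A compact sketch of the central operation, the language-grouped vision embedding, and the per-sample fine-grained loss built on it. The sparsification and normalization details here are simplified assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def language_grouped_vision_embeddings(patch_emb, token_emb, threshold=0.0):
    """For each caption token, group image patches into one vision embedding.

    patch_emb: (B, P, D) image patch embeddings
    token_emb: (B, T, D) caption token embeddings
    The ReLU thresholding below is a simplified stand-in for the paper's
    sparsification of the patch-token similarity.
    """
    sims = torch.einsum("btd,bpd->btp", token_emb, patch_emb)      # (B, T, P)
    sims = F.relu(sims - threshold)                                # sparse, non-negative
    weights = sims / sims.sum(dim=-1, keepdim=True).clamp_min(1e-6)
    return torch.einsum("btp,bpd->btd", weights, patch_emb)        # (B, T, D)


def fine_grained_loss(patch_emb, token_emb):
    """Sequence-wise contrastive loss using only each sample's own tokens."""
    grouped = language_grouped_vision_embeddings(patch_emb, token_emb)
    logits = torch.einsum("btd,bsd->bts",
                          F.normalize(token_emb, dim=-1),
                          F.normalize(grouped, dim=-1))            # (B, T, T)
    targets = torch.arange(token_emb.shape[1]).expand(token_emb.shape[0], -1)
    return F.cross_entropy(logits.transpose(1, 2), targets)


loss = fine_grained_loss(torch.randn(2, 49, 64), torch.randn(2, 12, 64))
```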
Universality of Linear Recurrences Followed by Non-linear Projections: Finite-Width Guarantees and Benefits of Complex Eigenvalues
Deep neural networks based on linear RNNs interleaved with position-wise MLPs are gaining traction as competitive approaches for sequence modeling. Examples of such architectures include state-space models (SSMs) like S4, LRU, and Mamba: recently proposed models that achieve promising performance on text, genetics, and other data that require long-range reasoning. Despite experimental evidence highlighting these architectures’ effectiveness and computational efficiency, their expressive power remains relatively unexplored, especially in connection to specific choices crucial in practice - e.g., carefully designed initialization distribution and potential use of complex numbers. In this paper, we show that combining MLPs with either real or complex linear diagonal recurrences leads to arbitrarily precise approximation of regular causal sequence-to-sequence maps. At the heart of our proof, we rely on a separation of concerns: the linear RNN provides a lossless encoding of the input sequence, and the MLP performs non-linear processing on this encoding. While we show that real diagonal linear recurrences are enough to achieve universality in this architecture, we prove that employing complex eigenvalues near the unit disk - i.e., empirically the most successful strategy in S4 - greatly helps the RNN in storing information. We connect this finding with the vanishing gradient issue and provide experiments supporting our claims.
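A minimal sketch of the architecture class analyzed: a diagonal linear recurrence with complex eigenvalues initialized near the unit circle, followed by a position-wise MLP acting on a real view of the state. Initialization ranges and sizes are illustrative (LRU-style), not the paper's exact setup.

```python
import torch
import torch.nn as nn

class DiagonalLinearRNN(nn.Module):
    """Complex diagonal linear recurrence followed by a position-wise MLP."""

    def __init__(self, d_in, d_state=64, d_out=32):
        super().__init__()
        # Eigenvalues lambda = r * exp(i * phi), initialized near the unit circle.
        self.log_r = nn.Parameter(torch.log(torch.empty(d_state).uniform_(0.9, 0.999)))
        self.phi = nn.Parameter(torch.empty(d_state).uniform_(0.0, 6.28))
        self.B = nn.Parameter(torch.randn(d_state, d_in, dtype=torch.cfloat) / d_in ** 0.5)
        self.mlp = nn.Sequential(nn.Linear(2 * d_state, 128), nn.GELU(),
                                 nn.Linear(128, d_out))

    def forward(self, u):                                        # u: (T, d_in), real
        lam = torch.exp(self.log_r) * torch.exp(1j * self.phi)   # (d_state,) complex
        x = torch.zeros_like(lam)
        states = []
        for t in range(u.shape[0]):
            x = lam * x + self.B @ u[t].to(torch.cfloat)         # x_t = Lambda x_{t-1} + B u_t
            states.append(torch.cat([x.real, x.imag]))           # lossless real view
        return self.mlp(torch.stack(states))                     # position-wise non-linear readout


model = DiagonalLinearRNN(d_in=8)
y = model(torch.randn(20, 8))   # (20, 32)
```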
Building on Efficient Foundations: Effectively Training LLMs with Structured Feedforward Layers
Xiuying Wei
Skander Moalla
State-of-the-art results in large language models (LLMs) often rely on scale, which becomes computationally expensive. This has sparked a research agenda to reduce these models' parameter counts and computational costs without significantly impacting their performance. Our study focuses on transformer-based LLMs, specifically targeting the computationally intensive feedforward networks (FFNs), which are less studied than attention blocks. We consider three structured linear parameterizations of the FFN using efficient low-rank and block-diagonal matrices. In contrast to many previous works that examined these approximations, our study i) explores these structures from a training-from-scratch perspective, ii) scales up to 1.3B parameters, and iii) is conducted within recent Transformer-based LLMs rather than convolutional architectures. We demonstrate that these structures can lead to actual computational gains in various scenarios, including online decoding when using a pre-merge technique. Additionally, we propose a novel training regime, called self-guided training, aimed at improving the poor training dynamics that these approximations exhibit when used from initialization. Interestingly, the scaling performance of structured matrices is explored, revealing steeper curves in scaling training FLOPs, along with a favorable scaling trend in the overtraining regime. Specifically, we show that wide and structured networks can utilize training FLOPs more efficiently, with fewer parameters and lower loss than dense models at their optimal trade-off. Our code is available at https://github.com/CLAIRE-Labo/StructuredFFN/tree/main.
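A low-rank FFN is sketched above under the low-rank training entry; here, equally hedged, is a sketch of the block-diagonal structure also considered in this work. A dense FFN projection is replaced by independent blocks along the feature dimension; the block count and sizes are illustrative.

```python
import torch
import torch.nn as nn

class BlockDiagonalLinear(nn.Module):
    """y = BlockDiag(W_1, ..., W_k) x: each block mixes only its own slice of
    features, cutting parameters from d_out * d_in to d_out * d_in / k."""

    def __init__(self, d_in, d_out, num_blocks=4):
        super().__init__()
        assert d_in % num_blocks == 0 and d_out % num_blocks == 0
        self.num_blocks = num_blocks
        self.weight = nn.Parameter(
            torch.randn(num_blocks, d_in // num_blocks, d_out // num_blocks)
            / (d_in // num_blocks) ** 0.5
        )

    def forward(self, x):                                # x: (..., d_in)
        xb = x.unflatten(-1, (self.num_blocks, -1))      # (..., k, d_in / k)
        yb = torch.einsum("...ki,kio->...ko", xb, self.weight)
        return yb.flatten(-2)                            # (..., d_out)


layer = BlockDiagonalLinear(768, 3072, num_blocks=4)
y = layer(torch.randn(2, 10, 768))   # (2, 10, 3072)
```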