Razvan Pascanu

Hadamard product in deep learning: Introduction, Advances and Challenges.

Grigorios G Chrysos

Yongtao Wu

Philip Torr

Volkan Cevher

2025-01-01

IEEE Transactions on Pattern Analysis and Machine Intelligence (published)

NoProp: Training Neural Networks without Back-propagation or Forward-propagation

Qinyu Li

Yee Whye Teh

2025-01-01

arXiv.org (preprint)

RAT: Bridging RNN Efficiency and Attention Accuracy in Language Modeling

Xiuying Wei

Anunay Yadav

Caglar Gulçehre

Transformers have become the cornerstone of modern large-scale language models; however, their dependence on softmax attention poses a major… (see more) computational bottleneck, particularly in long-context settings. In this work, rather than following prevalent approaches such as linear attention (or SSMs) and local attention, we introduce an intermediate design called \rat between recurrence and attention mechanisms. It partitions the input into chunks, applies a simple linear recurrence within each chunk to capture local dependencies, and then performs softmax attention across chunks to model long-range interactions. By adjusting the size of the chunk, \rat enables flexible trade-offs, combining the strengths of RNN and attention. Empirically, with a chunk size of 16, the \rat layer achieves a \(7\times\) improvement in training speed with 100K token sequences and \(9\times\) in generation at 4K sequence length, while maintaining similar or sometimes even better accuracy compared to standard attention. We demonstrate this by training 1.3B parameter models from scratch and performing large-scale evaluations, including short- and long-context benchmarks, as well as supervised fine-tuning~(SFT). We further propose a hybrid architecture that interleaves \rat with local attention. By combining efficient long-range modeling with strong local interactions, this hybrid design not only improves inference speed and reduces cache memory usage compared to attention, but also consistently enhances performance, for example, achieving an average 1 point gain in commonsense reasoning tasks, up to 4 points on code tasks, and a 1 point Rouge-L increase in a summarization SFT task. Code is available at https://github.com/CLAIRE-Labo/RAT

2025-01-01

arXiv.org (preprint)

Round and Round We Go! What makes Rotary Positional Encodings useful?

Federico Barbero

Alex Vitvitskyi

Christos Perivolaropoulos

Petar Veličković

Positional Encodings (PEs) are a critical component of Transformer-based Large Language Models (LLMs), providing the attention mechanism wit… (see more)h important sequence-position information. One of the most popular types of encoding used today in LLMs are Rotary Positional Encodings (RoPE), that rotate the queries and keys based on their relative distance. A common belief is that RoPE is useful because it helps to decay token dependency as relative distance increases. In this work, we argue that this is unlikely to be the core reason. We study the internals of a trained Gemma 7B model to understand how RoPE is being used at a mechanical level. We find that Gemma learns to use RoPE to construct robust "positional" attention patterns by exploiting the highest frequencies. We also find that, in general, Gemma greatly prefers to use the lowest frequencies of RoPE, which we suspect are used to carry semantic information. We mathematically prove interesting behaviours of RoPE and conduct experiments to verify our findings, proposing a modification of RoPE that fixes some highlighted issues and improves performance. We believe that this work represents an interesting step in better understanding PEs in LLMs, which we believe holds crucial value for scaling LLMs to large sizes and context lengths.

2025-01-01

ICLR (published)

Reza Babanezhad Harikandeh

Torque-Aware Momentum

Pranshu Malviya

Goncalo Mordido

Aristide Baratin

Gintare Karolina Dziugaite

Sarath Chandar

Efficiently exploring complex loss landscapes is key to the performance of deep neural networks. While momentum-based optimizers are widely … (see more)used in state-of-the-art setups, classical momentum can still struggle with large, misaligned gradients, leading to oscillations. To address this, we propose Torque-Aware Momentum (TAM), which introduces a damping factor based on the angle between the new gradients and previous momentum, stabilizing the update direction during training. Empirical results show that TAM, which can be combined with both SGD and Adam, enhances exploration, handles distribution shifts more effectively, and improves generalization performance across various tasks, including image classification and large language model fine-tuning, when compared to classical momentum-based optimizers.

2024-12-25

ArXiv (preprint)

Reza Babanezhad Harikandeh

Torque-Aware Momentum

Pranshu Malviya

Goncalo Mordido

Aristide Baratin

Gintare Karolina Dziugaite

Sarath Chandar

2024-12-25

ArXiv (preprint)

Reza Babanezhad Harikandeh

Torque-Aware Momentum

Pranshu Malviya

Goncalo Mordido

Aristide Baratin

Gintare Karolina Dziugaite

Sarath Chandar

Efficiently exploring complex loss landscapes is key to the performance of deep neural networks. While momentum-based optimizers are widely … (see more)used in state-of-the-art setups, classical momentum can still struggle with large, misaligned gradients, leading to oscillations. To address this, we propose Torque-Aware Momentum (TAM), which introduces a damping factor based on the angle between the new gradients and previous momentum, stabilizing the update direction during training. Empirical results show that TAM, which can be combined with both SGD and Adam, enhances exploration, handles distribution shifts more effectively, and improves generalization performance across various tasks, including image classification and large language model fine-tuning, when compared to classical momentum-based optimizers.

2024-12-25

ArXiv (preprint)

TRecViT: A Recurrent Video Transformer

Viorica Puatruaucean

Xu Owen He

Joseph Heyward

Chuhan Zhang

Mehdi S. M. Sajjadi

George-Cristian Muraru

Artem Zholus

Mahdi Karami

Ross Goroshin

Yutian Chen 0001

Simon Kayode Osindero

João Carreira

We propose a novel block for video modelling. It relies on a time-space-channel factorisation with dedicated blocks for each dimension: gate… (see more)d linear recurrent units (LRUs) perform information mixing over time, self-attention layers perform mixing over space, and MLPs over channels. The resulting architecture TRecViT performs well on sparse and dense tasks, trained in supervised or self-supervised regimes. Notably, our model is causal and outperforms or is on par with a pure attention model ViViT-L on large scale video datasets (SSv2, Kinetics400), while having

2024-12-18

ArXiv (preprint)

Non-Stationary Learning of Neural Networks with Automatic Soft Parameter Reset

Alexandre Galashov

Michalis K. Titsias

Andr'as Gyorgy

Clare Lyle

Yee Whye Teh

Maneesh Sahani

Neural networks are traditionally trained under the assumption that data come from a stationary distribution. However, settings which violat… (see more)e this assumption are becoming more popular; examples include supervised learning under distributional shifts, reinforcement learning, continual learning and non-stationary contextual bandits. In this work we introduce a novel learning approach that automatically models and adapts to non-stationarity, via an Ornstein-Uhlenbeck process with an adaptive drift parameter. The adaptive drift tends to draw the parameters towards the initialisation distribution, so the approach can be understood as a form of soft parameter reset. We show empirically that our approach performs well in non-stationary supervised and off-policy reinforcement learning settings.

2024-11-06

ArXiv (preprint)

Retrieval-Augmented Decision Transformer: External Memory for In-context RL

Thomas Schmied

Fabian Paischer

Vihang P. Patil

Markus Hofmarcher

Sepp Hochreiter

In-context learning (ICL) is the ability of a model to learn a new task by observing a few exemplars in its context. While prevalent in NLP,… (see more) this capability has recently also been observed in Reinforcement Learning (RL) settings. Prior in-context RL methods, however, require entire episodes in the agent's context. Given that complex environments typically lead to long episodes with sparse rewards, these methods are constrained to simple environments with short episodes. To address these challenges, we introduce Retrieval-Augmented Decision Transformer (RA-DT). RA-DT employs an external memory mechanism to store past experiences from which it retrieves only sub-trajectories relevant for the current situation. The retrieval component in RA-DT does not require training and can be entirely domain-agnostic. We evaluate the capabilities of RA-DT on grid-world environments, robotics simulations, and procedurally-generated video games. On grid-worlds, RA-DT outperforms baselines, while using only a fraction of their context length. Furthermore, we illuminate the limitations of current in-context RL methods on complex environments and discuss future directions. To facilitate future research, we release datasets for four of the considered environments.

2024-10-09

ArXiv (preprint)

Normalization and effective learning rates in reinforcement learning

Clare Lyle

Zeyu Zheng

Khimya Khetarpal

James Martens

Hado van Hasselt

Will Dabney

2024-09-25

NeurIPS.cc/2024/Conference (poster)