Xu Owen He

Adaptive Computation Pruning for the Forgetting Transformer

The recently proposed Forgetting Transformer (FoX) incorporates a forget gate into softmax attention and has shown consistently better or on-par performance compared to the standard RoPE-based Transformer. Notably, many attention heads in FoX tend to forget quickly, causing their output at each timestep to rely primarily on local context. Based on this observation, we propose Adaptive Computation Pruning (ACP) for FoX, a method that dynamically prunes computations involving input-output dependencies that are strongly decayed by the forget gate. In particular, our method performs *provably safe* pruning via a dynamically set pruning threshold that guarantees the pruned attention weights are negligible. We apply ACP to language model pretraining with FoX and show it consistently reduces the number of FLOPs and memory accesses in softmax attention by around 70% across different model sizes and context lengths, resulting in a roughly 50% to 70% reduction in attention runtime (or a 2--3

2025-07-07

colmweb.org/COLM/2025/Conference (accepted)

openreview.net

TAPNext: Tracking Any Point (TAP) as Next Token Prediction

Artem Zholus

Carl Doersch

Yi Yang

Skanda Koppula

Viorica Patraucean

Xu Owen He

Ignacio Rocco

Mehdi S. M. Sajjadi

Sarath Chandar

Ross Goroshin

2025-04-08

ArXiv (preprint)

arxiv.org

TAPNext: Tracking Any Point (TAP) as Next Token Prediction

Artem Zholus

Carl Doersch

Yi Yang

Skanda Koppula

Viorica Patraucean

Xu Owen He

Ignacio Rocco

Mehdi S. M. Sajjadi

Sarath Chandar

Ross Goroshin

2025-04-01

arXiv (published)

doi.org

arxiv.org

TRecViT: A Recurrent Video Transformer

Viorica Patraucean

Xu Owen He

Joseph Heyward

Chuhan Zhang

Mehdi S. M. Sajjadi

George-Cristian Muraru

Artem Zholus

Mahdi Karami

Ross Goroshin

Yutian Chen

Simon Kayode Osindero

João Carreira

Razvan Pascanu

We propose a novel block for video modelling. It relies on a time-space-channel factorisation with dedicated blocks for each dimension: gated linear recurrent units (LRUs) perform information mixing over time, self-attention layers perform mixing over space, and MLPs over channels. The resulting architecture TRecViT performs well on sparse and dense tasks, trained in supervised or self-supervised regimes. Notably, our model is causal and outperforms or is on par with a pure attention model ViViT-L on large scale video datasets (SSv2, Kinetics400), while having

2024-12-18

ArXiv (preprint)

doi.org

arxiv.org
