Ross Goroshin

2025-04-08

arXiv (preprint)

TAPNext: Tracking Any Point (TAP) as Next Token Prediction

Artem Zholus

Carl Doersch

Yi Yang

Skanda Koppula

Viorica Patraucean

Xu Owen He

Ignacio Rocco

Mehdi S. M. Sajjadi

Sarath Chandar

2025-04-01

arXiv (published)

Scaling 4D Representations

João Carreira

Dilara Gokay

Michael King

Chuhan Zhang

Ignacio Rocco

Aravindh Mahendran

T. Keck

Joseph Heyward

Skanda Koppula

Etienne Pot

Goker Erdogan

Yana Hasson

Yi Yang

Klaus Greff

Guillaume Le Moing

Sjoerd van Steenkiste

Daniel Zoran

Drew A. Hudson

Pedro Vélez

Luisa F. Polanía

Luke Friedman

Chris Duvarney

Kelsey Allen

Jacob Walker

Rishabh Kabra

Eric Aboussouan

Jennifer Sun

Thomas Kipf

Carl Doersch

Viorica Pătrăucean

Dima Damen

Pauline Luc

Mehdi S. M. Sajjadi

Andrew Zisserman

Scaling has not yet been convincingly demonstrated for pure self-supervised learning from video. However, prior work has focused evaluations on semantic-related tasks

2024-12-19

arXiv (preprint)

TRecViT: A Recurrent Video Transformer

Viorica Pătrăucean

Xu Owen He

Joseph Heyward

Chuhan Zhang

Mehdi S. M. Sajjadi

George-Cristian Muraru

Artem Zholus

Mahdi Karami

Yutian Chen

Simon Kayode Osindero

João Carreira

Razvan Pascanu

We propose a novel block for video modelling. It relies on a time-space-channel factorisation with dedicated blocks for each dimension: gated linear recurrent units (LRUs) perform information mixing over time, self-attention layers perform mixing over space, and MLPs over channels. The resulting architecture TRecViT performs well on sparse and dense tasks, trained in supervised or self-supervised regimes. Notably, our model is causal and outperforms or is on par with a pure attention model ViViT-L on large scale video datasets (SSv2, Kinetics400), while having

2024-12-18

arXiv (preprint)

BootsTAP: Bootstrapped Training for Tracking-Any-Point

Carl Doersch

Yi Yang

Dilara Gokay

Pauline Luc

Skanda Koppula

Ankush Gupta

Joseph Heyward

Ignacio Rocco

João Carreira

Andrew Zisserman

To endow models with greater understanding of physics and motion, it is useful to enable them to perceive how solid surfaces move and deform in real scenes. This can be formalized as Tracking-Any-Point (TAP), which requires the algorithm to track any point on solid surfaces in a video, potentially densely in space and time. Large-scale ground-truth training data for TAP is only available in simulation, which currently has a limited variety of objects and motion. In this work, we demonstrate how large-scale, unlabeled, uncurated real-world data can improve a TAP model with minimal architectural changes, using a self-supervised student-teacher setup. We demonstrate state-of-the-art performance on the TAP-Vid benchmark, surpassing previous results by a wide margin: for example, TAP-Vid-DAVIS performance improves from 61.3% to 67.4%, and TAP-Vid-Kinetics from 57.2% to 62.5%. For visualizations, see our project webpage at https://bootstap.github.io/

2024-12-08

Lecture Notes in Computer Science (published)

Satellite Sunroof: High-res Digital Surface Models and Roof Segmentation for Global Solar Mapping

Vishal Batchu

A. Wilson

Betty Peng

Carl D. Elkin

Umangi Jain

Christopher Van Arsdale

Varun Gulshan

The transition to renewable energy, particularly solar, is key to mitigating climate change. Google's Solar API aids this transition by estimating solar potential from aerial imagery, but its impact is constrained by geographical coverage. This paper proposes expanding the API's reach using satellite imagery, enabling global solar potential assessment. We tackle challenges involved in building a Digital Surface Model (DSM) and roof instance segmentation from lower resolution and single oblique views using deep learning models. Our models, trained on aligned satellite and aerial datasets, produce 25 cm DSMs and roof segments. With ~1 m DSM MAE on buildings, ~5° roof pitch error, and ~56% IoU on roof segmentation, they significantly enhance the Solar API's potential to promote solar adoption.

2024-08-01

arXiv (published)

Course Correcting Koopman Representations

2024-01-16

ICLR.cc/2024/Conference (poster)

openreview.net

Block-State Transformers

2023-09-21

NeurIPS.cc/2023/Conference (poster)

openreview.net

Block-State Transformers

State space models (SSMs) have shown impressive results on tasks that require modeling long-range dependencies and efficiently scale to long sequences owing to their subquadratic runtime complexity. Originally designed for continuous signals, SSMs have shown superior performance on a plethora of tasks in vision and audio; however, SSMs still lag Transformer performance in language modeling tasks. In this work, we propose a hybrid layer named Block-State Transformer (BST) that internally combines an SSM sublayer for long-range contextualization and a Block Transformer sublayer for short-term representation of sequences. We study three different, and completely parallelizable, variants that integrate SSMs and block-wise attention. We show that our model outperforms similar Transformer-based architectures on language modeling perplexity and generalizes to longer sequences. In addition, the Block-State Transformer demonstrates a more than tenfold increase in speed at the layer level compared to the Block-Recurrent Transformer when model parallelization is employed.

2023-06-15

arXiv (preprint)