
Ross Goroshin

Core Industry Member
Adjunct Professor, Université de Montréal, Department of Computer Science and Operations Research
Google DeepMind
Research Topics
Representation Learning
Deep Learning
Applied AI
Dynamical Systems
Computer Vision

Biography

Ross Goroshin is a researcher at Google DeepMind in Montréal and a Core Industry Member at Mila - Quebec Artificial Intelligence Institute. He holds a PhD in computer science from New York University, where he was supervised by Yann LeCun. He also holds a bachelor's degree in electrical engineering from Concordia University and a master's degree in electrical engineering from Georgia Tech. His research focuses on computer vision, self-supervised learning, and optimal control.

In addition to his roles at Google DeepMind and Mila, Ross is an adjunct professor in the Department of Computer Science and Operations Research (DIRO) at Université de Montréal.

Publications

TAPNext: Tracking Any Point (TAP) as Next Token Prediction
Artem Zholus
Carl Doersch
Yi Yang
Skanda Koppula
Viorica Patraucean
Xu Owen He
Ignacio Rocco
Mehdi S. M. Sajjadi
Scaling 4D Representations
João Carreira
Dilara Gokay
Michael King
Chuhan Zhang
Ignacio Rocco
Aravindh Mahendran
T. Keck
Joseph Heyward
Skanda Koppula
Etienne Pot
Goker Erdogan
Yana Hasson
Yi Yang
Klaus Greff
Guillaume Le Moing
Sjoerd van Steenkiste
Daniel Zoran
Drew A. Hudson
Pedro Vélez
Luisa F. Polanía
Luke Friedman
Chris Duvarney
Kelsey Allen
Jacob Walker
Rishabh Kabra
Eric Aboussouan
Jennifer Sun
Thomas Kipf
Carl Doersch
Viorica Patraucean
Dima Damen
Pauline Luc
Mehdi S. M. Sajjadi
Andrew Zisserman
Scaling has not yet been convincingly demonstrated for pure self-supervised learning from video. However, prior work has focused evaluations on semantic-related tasks…
TRecViT: A Recurrent Video Transformer
Viorica Patraucean
Xu Owen He
Joseph Heyward
Chuhan Zhang
Mehdi S. M. Sajjadi
George-Cristian Muraru
Artem Zholus
Mahdi Karami
Yutian Chen
Simon Kayode Osindero
João Carreira
We propose a novel block for video modelling. It relies on a time-space-channel factorisation with dedicated blocks for each dimension: gated linear recurrent units (LRUs) perform information mixing over time, self-attention layers perform mixing over space, and MLPs over channels. The resulting architecture TRecViT performs well on sparse and dense tasks, trained in supervised or self-supervised regimes. Notably, our model is causal and outperforms or is on par with a pure attention model ViViT-L on large scale video datasets (SSv2, Kinetics400), while having…
BootsTAP: Bootstrapped Training for Tracking-Any-Point
Carl Doersch
Yi Yang
Dilara Gokay
Pauline Luc
Skanda Koppula
Ankush Gupta
Joseph Heyward
Ignacio Rocco
João Carreira
Andrew Zisserman
To endow models with greater understanding of physics and motion, it is useful to enable them to perceive how solid surfaces move and deform in real scenes. This can be formalized as Tracking-Any-Point (TAP), which requires the algorithm to track any point on solid surfaces in a video, potentially densely in space and time. Large-scale ground-truth training data for TAP is only available in simulation, which currently has a limited variety of objects and motion. In this work, we demonstrate how large-scale, unlabeled, uncurated real-world data can improve a TAP model with minimal architectural changes, using a self-supervised student-teacher setup. We demonstrate state-of-the-art performance on the TAP-Vid benchmark surpassing previous results by a wide margin: for example, TAP-Vid-DAVIS performance improves from 61.3% to 67.4%, and TAP-Vid-Kinetics from 57.2% to 62.5%. For visualizations, see our project webpage at https://bootstap.github.io/
Satellite Sunroof: High-res Digital Surface Models and Roof Segmentation for Global Solar Mapping
Vishal Batchu
A. Wilson
Betty Peng
Carl D. Elkin
Umangi Jain
Christopher Van Arsdale
Varun Gulshan
The transition to renewable energy, particularly solar, is key to mitigating climate change. Google's Solar API aids this transition by estimating solar potential from aerial imagery, but its impact is constrained by geographical coverage. This paper proposes expanding the API's reach using satellite imagery, enabling global solar potential assessment. We tackle challenges involved in building a Digital Surface Model (DSM) and roof instance segmentation from lower resolution and single oblique views using deep learning models. Our models, trained on aligned satellite and aerial datasets, produce 25 cm DSMs and roof segments. With ~1 m DSM MAE on buildings, ~5° roof pitch error and ~56% IOU on roof segmentation, they significantly enhance the Solar API's potential to promote solar adoption.
Course Correcting Koopman Representations
Mahan Fathi
Clement Gehring
Jonathan Pilault
David Kanaa
Block-State Transformers
Jonathan Pilault
Mahan Fathi
Orhan Firat
State space models (SSMs) have shown impressive results on tasks that require modeling long-range dependencies and efficiently scale to long sequences owing to their subquadratic runtime complexity. Originally designed for continuous signals, SSMs have shown superior performance on a plethora of tasks, in vision and audio; however, SSMs still lag Transformer performance in Language Modeling tasks. In this work, we propose a hybrid layer named Block-State Transformer (BST), that internally combines an SSM sublayer for long-range contextualization, and a Block Transformer sublayer for short-term representation of sequences. We study three different, and completely parallelizable, variants that integrate SSMs and block-wise attention. We show that our model outperforms similar Transformer-based architectures on language modeling perplexity and generalizes to longer sequences. In addition, the Block-State Transformer demonstrates a more than tenfold increase in speed at the layer level compared to the Block-Recurrent Transformer when model parallelization is employed.