Artem Zholus

Martin Sundermeyer

Carl Doersch

Ross Goroshin

David Joseph Tan

Sarath Chandar

Rudolph Triebel

Federico Tombari

Tracking-Any-Point (TAP) models aim to track any point through a video which is a crucial task in AR/XR and robotics applications. The recen… (see more)tly introduced TAPNext approach proposes an end-to-end, recurrent transformer architecture to track points frame-by-frame in a purely online fashion -- demonstrating competitive performance at minimal latency. However, we show that TAPNext struggles with longer video sequences and also frequently fails to re-detect query points that reappear after being occluded or leaving the frame. In this work, we present TAPNext++, a model that tracks points in sequences that are orders of magnitude longer while preserving the low memory and compute footprint of the architecture. We train the recurrent video transformer using several data-driven solutions, including training on long 1024-frame sequences enabled by sequence parallelism techniques. We highlight that re-detection performance is a blind spot in the current literature and introduce a new metric, Re-Detection Average Jaccard (

2026-04-11

arXiv (preprint)

Emergent Reasoning via Recursive Latent Reinforcement Pretraining

Large language models (LLMs) often rely on explicit chain-of-thought (CoT) traces to solve multi-step reasoning problems, but these traces i… (see more)ncrease inference cost, expose brittle prompt dependence, and complicate training objectives. We study an alternative: \emph{latent deliberation} implemented as a small recurrent refinement module that performs multiple internal ``thinking`` steps while keeping the external sequence length fixed. We introduce \textbf{Recursive Latent Reinforcement Pretraining (RLRP)}, a training recipe that augments a base causal LLM with a shared latent head executed for

2026-03-04

LLM_Reasoning @ International Conference on Learning Representations (published)

openreview.net

TRecViT: A Recurrent Video Transformer

Viorica Patraucean

Xu Owen He

Joseph Heyward

Chuhan Zhang

Mehdi S. M. Sajjadi

George-Cristian Muraru

Mahdi Karami

Ross Goroshin

Yutian Chen 0001

Simon Kayode Osindero

João Carreira

Razvan Pascanu

We propose a novel block for video modelling. It relies on a time-space-channel factorisation with dedicated blocks for each dimension: gate… (see more)d linear recurrent units (LRUs) perform information mixing over time, self-attention layers perform mixing over space, and MLPs over channels. The resulting architecture TRecViT performs well on sparse and dense tasks, trained in supervised or self-supervised regimes. Notably, our model is causal and outperforms or is on par with a pure attention model ViViT-L on large scale video datasets (SSv2, Kinetics400), while having

2025-12-31

Trans. Mach. Learn. Res. (published)

V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

Mahmoud Assran

Adrien Bardes

David Fan

Quentin Garrido

Russell Howes

Mojtaba Komeili

Matthew J. Muckley

Ammar Rizvi

Claire Roberts

Koustuv Sinha

Sergio Arnaud

Abha Gejji

Ada Martin

Francois Hogan

Daniel Dugas

Piotr Bojanowski

Vasil Khalidov

Patrick Labatut

Francisco Massa … (see 13 more)

Marc Szafraniec

K. Krishnakumar

Ying Li

Xiaodong Ma

A. Chandar

Franziska Meier

Yann Lecun

Michael G. Rabbat

Nicolas Ballas

Fair at Meta

Mila - Québec

AI Institute

Polytechnique Montréal

A major challenge for modern AI is to learn to understand the world and learn to act largely by observation. This paper explores a self-supe… (see more)rvised approach that combines internet-scale video data with a small amount of interaction data (robot trajectories), to develop models capable of understanding, predicting, and planning in the physical world. We first pre-train an action-free joint-embedding-predictive architecture, V-JEPA 2, on a video and image dataset comprising over 1 million hours of internet video. V-JEPA 2 achieves strong performance on motion understanding (77.3 top-1 accuracy on Something-Something v2) and state-of-the-art performance on human action anticipation (39.7 recall-at-5 on Epic-Kitchens-100) surpassing previous task-specific models. Additionally, after aligning V-JEPA 2 with a large language model, we demonstrate state-of-the-art performance on multiple video question-answering tasks at the 8 billion parameter scale (e.g., 84.0 on PerceptionTest, 76.9 on TempCompass). Finally, we show how self-supervised learning can be applied to robotic planning tasks by post-training a latent action-conditioned world model, V-JEPA 2-AC, using less than 62 hours of unlabeled robot videos from the Droid dataset. We deploy V-JEPA 2-AC zero-shot on Franka arms in two different labs and enable picking and placing of objects using planning with image goals. Notably, this is achieved without collecting any data from the robots in these environments, and without any task-specific training or reward. This work demonstrates how self-supervised learning from web-scale data and a small amount of robot interaction data can yield a world model capable of planning in the physical world.

2025-06-10

ArXiv (preprint)

BindGPT: A Scalable Framework for 3D Molecular Design via Language Modeling and Reinforcement Learning

Maksim Kuznetsov

Roman Schutski

Shayakhmetov Rim

Daniil Polykovskiy

A. Chandar

Alex Zhavoronkov

Generating novel active molecules for a given protein is an extremely challenging task for generative models that requires an understanding … (see more)of the complex physical interactions between the molecule and its environment. In this paper, we present a novel generative model, BindGPT which uses a conceptually simple but powerful approach to create 3D molecules within the protein's binding site. Our model produces molecular graphs and conformations jointly, eliminating the need for an extra graph reconstruction step. We pretrain BindGPT on a large-scale dataset and fine-tune it with reinforcement learning using scores from external simulation software. We demonstrate how a single pretrained language model can serve at the same time as a 3D molecular generative model, conformer generator conditioned on the molecular graph, and a pocket-conditioned 3D molecule generator. Notably, the model does not make any representational equivariance assumptions about the domain of generation. We show how such simple conceptual approach combined with pretraining and scaling can perform on par or better than the current best specialized diffusion models, language models, and graph neural networks while being two orders of magnitude cheaper to sample.

2025-04-10

Proceedings of the AAAI Conference on Artificial Intelligence (published)

TAPNext: Tracking Any Point (TAP) as Next Token Prediction

Carl Doersch

Yi Yang

Skanda Koppula

Viorica Patraucean

Xu Owen He

Ignacio Rocco

Mehdi S. M. Sajjadi

A. Chandar

Ross Goroshin

2025-04-07

ArXiv (preprint)

Interpretability in Action: Exploratory Analysis of VPT, a Minecraft Agent

Karolis Jucys

George Adamopoulos

Mehrab Hamidi

Stephanie Milani

Mohammad Reza Samsami

Özgür Şimşek

Understanding the mechanisms behind decisions taken by large foundation models in sequential tasks is critical to ensuring that such systems… (see more) operate transparently and safely. However, interpretability methods have not yet been applied extensively to large-scale agents based on reinforcement learning. In this work, we perform exploratory analysis on the Video PreTraining (VPT) Minecraft playing agent, one of the largest open-source vision-based agents. We try to illuminate its reasoning mechanisms by applying various interpretability techniques. First, we analyze the attention mechanism while the agent solves its training task --- crafting a diamond pickaxe. The agent seems to pay attention to the 4 last frames and several key-frames further back. This provides clues as to how it maintains coherence in the task that takes 3-10 minutes, despite the agent's short memory span of only six seconds. Second, we perform various interventions, which help us uncover a worrying case of goal misgeneralization: VPT mistakenly identifies a villager wearing brown clothes as a tree trunk and punches it to death, when positioned stationary under green tree leaves. We demonstrate similar misbehavior in a related agent (STEVE-1), which motivates the use of VPT as a model organism for large-scale vision-based agent interpretability.

2024-06-23

MI @ International Conference on Machine Learning (poster)

openreview.net

Mastering Memory Tasks with World Models

Mohammad Reza Samsami

Janarthanan Rajendran

A. Chandar

Current model-based reinforcement learning (MBRL) agents struggle with long-term dependencies. This limits their ability to effectively solv… (see more)e tasks involving extended time gaps between actions and outcomes, or tasks demanding the recalling of distant observations to inform current actions. To improve temporal coherence, we integrate a new family of state space models (SSMs) in world models of MBRL agents to present a new method, Recall to Imagine (R2I). This integration aims to enhance both long-term memory and long-horizon credit assignment. Through a diverse set of illustrative tasks, we systematically demonstrate that R2I not only establishes a new state-of-the-art for challenging memory and credit assignment RL tasks, such as BSuite and POPGym, but also showcases superhuman performance in the complex memory domain of Memory Maze. At the same time, it upholds comparable performance in classic RL tasks, such as Atari and DMC, suggesting the generality of our method. We also show that R2I is faster than the state-of-the-art MBRL method, DreamerV3, resulting in faster wall-time convergence.

2024-01-15

ICLR.cc/2024/Conference (oral)

openreview.net

From Words to Blocks: Building Objects by Grounding Language Models with Reinforcement Learning

Michael Ahn

Anthony Brohan

Noah Brown

liang Dai

Dan Su

Holy Lovenia Ziwei Bryan Wilie

Tiezheng Yu

Willy Chung

Quyet V. Do

Paul Barde

Tristan Karch

D. Nowrouzezahrai

C. Bonial

Mitchell Abrams

David R. Traum

Hyung Won

Le Hou

Shayne Longpre

Yi Zoph

William Tay … (see 32 more)

Eric Fedus

Xuezhi Li

Lasse Espeholt

Hubert Soyer

Remi Munos

Karen Si-801

Vlad Mnih

Tom Ward

Yotam Doron

Wenlong Huang

Pieter Abbeel

Deepak Pathak

Julia Kiseleva

Ziming Li

Mohammad Aliannejadi

Shrestha Mohanty

Maartje Ter Hoeve

Mikhail Burtsev

Alexey Skrynnik