Publications

Georg Martius

Maximilian Seitzer

2025-01-22

ICLR.cc/2025/Conference (poster)

On the Transfer of Object-Centric Representation Learning

Aniket Rajiv Didolkar

Andrii Zadaianchuk

Anirudh Goyal

Michael Curtis Mozer

Georg Martius

Maximilian Seitzer

The goal of object-centric representation learning is to decompose visual scenes into a structured representation that isolates the entities into individual vectors. Recent successes have shown that object-centric representation learning can be scaled to real-world scenes by utilizing features from pre-trained foundation models like DINO. However, so far, these object-centric methods have mostly been applied in-distribution, with models trained and evaluated on the same dataset. This is in contrast to the underlying foundation models, which have been shown to be applicable to a wide range of data and tasks. Thus, in this work, we answer the question of whether current real-world capable object-centric methods exhibit similar levels of transferability by introducing a benchmark comprising seven different synthetic and real-world datasets. We analyze the factors influencing performance under transfer and find that training on diverse real-world images improves generalization to unseen scenarios. Furthermore, inspired by the success of task-specific fine-tuning in foundation models, we introduce a novel fine-tuning strategy to adapt pre-trained vision encoders for the task of object discovery. We find that the proposed approach results in state-of-the-art performance for unsupervised object discovery, exhibiting strong zero-shot transfer to unseen datasets.

2025-01-22

ICLR.cc/2025/Conference (poster)

Towards General-Purpose Model-Free Reinforcement Learning

Scott Fujimoto

Pierluca D'Oro

Amy Zhang

Yuandong Tian

Michael Rabbat

Reinforcement learning (RL) promises a framework for near-universal problem-solving. In practice, however, RL algorithms are often tailored to specific benchmarks, relying on carefully tuned hyperparameters and algorithmic choices. Recently, powerful model-based RL methods have shown impressive general results across benchmarks but come at the cost of increased complexity and slow run times, limiting their broader applicability. In this paper, we attempt to find a unifying model-free deep RL algorithm that can address a diverse class of domains and problem settings. To achieve this, we leverage model-based representations that approximately linearize the value function, taking advantage of the denser task objectives used by model-based RL while avoiding the costs associated with planning or simulated trajectories. We evaluate our algorithm, MR.Q, on a variety of common RL benchmarks with a single set of hyperparameters and show a competitive performance against domain-specific and general baselines, providing a concrete step towards building general-purpose model-free deep RL algorithms.

2025-01-22

ICLR.cc/2025/Conference (spotlight)

Towards Improving Exploration through Sibling Augmented GFlowNets

Kanika Madan

Alex Lamb

Emmanuel Bengio

Glen Berseth

Exploration is a key factor for the success of an active learning agent, especially when dealing with sparse extrinsic terminal rewards and long trajectories. We introduce Sibling Augmented Generative Flow Networks (SA-GFN), a novel framework designed to enhance exploration and training efficiency of Generative Flow Networks (GFlowNets). SA-GFN uses a decoupled dual network architecture, comprising a main Behavior Network and an exploratory Sibling Network, to enable a diverse exploration of the underlying distribution using intrinsic rewards. Inspired by the ideas on exploration from reinforcement learning, SA-GFN provides a general-purpose exploration and learning paradigm that integrates with multiple GFlowNet training objectives and is especially helpful for exploration over a wide range of sparse or low reward distributions and task structures. An extensive set of experiments across a diverse range of tasks, reward structures and trajectory lengths, along with a thorough set of ablations, demonstrate the superior performance of SA-GFN in terms of exploration efficacy and convergence speed as compared to the existing methods. In addition, SA-GFN's versatility and compatibility with different GFlowNet training objectives and intrinsic reward methods underscore its broad applicability in various problem domains.

2025-01-22

ICLR.cc/2025/Conference (poster)

Towards Interpreting Visual Information Processing in Vision-Language Models

Clement Neo

Luke Ong

Philip Torr

Mor Geva

David Scott Krueger

Fazl Barez

2025-01-22

ICLR.cc/2025/Conference (poster)

Towards whole-genome inference of polygenic scores with fast and memory-efficient algorithms

Shadi Zabad

Chirayu Anant Haryan

Simon Gravel

Sanchit Misra

Yue Li

2025-01-22

bioRxiv (preprint)

A Unifying Framework for Action-Conditional Self-Predictive Reinforcement Learning

Khimya Khetarpal

Zhaohan Daniel Guo

Bernardo Avila Pires

Yunhao Tang

Clare Lyle

Mark Rowland

Nicolas Heess

Diana Borsa

Arthur Guez

Will Dabney

2025-01-22

aistats.org/AISTATS/2025/Conference (poster)

VCR: Pixel-Level Complex Reasoning by Restoring Occluded Text

Tianyu Zhang

Suyuchen Wang

Lu Li

Ge Zhang

Perouz Taslakian

Sai Rajeswar

Jie Fu

Bang Liu

We introduce Visual Caption Restoration (VCR), a novel vision-language task that challenges models to accurately restore partially obscured texts using pixel-level hints within images through complex reasoning. This task stems from the observation that text embedded in images intrinsically differs from common visual elements and text due to the need to align the modalities of vision, text, and text embedded in images. While many works incorporate text into images for visual question answering, they mostly rely on OCR or masked language modeling, reducing the task to text-based processing. However, text-based processing becomes ineffective in VCR as accurate text restoration depends on the combined information from provided images, context, and subtle cues from the tiny, exposed areas of masked texts. We develop a pipeline to generate synthetic images for the VCR task using image-caption pairs, with adjustable caption visibility to control the task difficulty. With this pipeline, we construct VCR-WIKI for VCR using Wikipedia images with captions, including 2.11M English and 346K Chinese training entities, plus 5K validation and 5K test entities in both languages, each in easy and hard configurations. We also make a hidden test set, VCR-HIDDEN, to avoid potential overfitting on VCR-WIKI. Our results reveal that current vision-language models significantly lag behind human performance in the VCR task, and merely fine-tuning the models on our dataset does not lead to notable improvements. We release VCR-WIKI and the data construction code to facilitate future research.

2025-01-22

ICLR.cc/2025/Conference (poster)

VCR: A Task for Pixel-Level Complex Reasoning in Vision Language Models via Restoring Occluded Text

Tianyu Zhang

Suyuchen Wang

Lu Li

Ge Zhang

Perouz Taslakian

Sai Rajeswar

Jie Fu

Bang Liu

We introduce Visual Caption Restoration (VCR), a novel vision-language task that challenges models to accurately restore partially obscured texts using pixel-level hints within images through complex reasoning. This task stems from the observation that text embedded in images intrinsically differs from common visual elements and text due to the need to align the modalities of vision, text, and text embedded in images. While many works incorporate text into images for visual question answering, they mostly rely on OCR or masked language modeling, reducing the task to text-based processing. However, text-based processing becomes ineffective in VCR as accurate text restoration depends on the combined information from provided images, context, and subtle cues from the tiny, exposed areas of masked texts. We develop a pipeline to generate synthetic images for the VCR task using image-caption pairs, with adjustable caption visibility to control the task difficulty. With this pipeline, we construct VCR-WIKI for VCR using Wikipedia images with captions, including 2.11M English and 346K Chinese training entities, plus 5K validation and 5K test entities in both languages, each in easy and hard configurations. We also make a hidden test set, VCR-HIDDEN, to avoid potential overfitting on VCR-WIKI. Our results reveal that current vision-language models significantly lag behind human performance in the VCR task, and merely fine-tuning the models on our dataset does not lead to notable improvements. We release VCR-WIKI and the data construction code to facilitate future research.

2025-01-22

ICLR.cc/2025/Conference (poster)

What Secrets Do Your Manifolds Hold? Understanding the Local Geometry of Generative Models

Ahmed Imtiaz Humayun

Ibtihel Amara

Candice Schumann

Cristina Nader Vasconcelos

Golnoosh Farnadi

Deepak Ramachandran

Negar Rostamzadeh

Junfeng He

Mohammad Havaei

Katherine Heller

2025-01-22

ICLR.cc/2025/Conference (poster)