Rim Assouel

PGT: Procedurally Generated Tasks for improving fine-grained understanding in MLLMs

Amir Bar

Adriana Romero-Soriano

Despite remarkable progress in Multimodal Large Language Models (MLLMs), these models still struggle with fine-grained understanding tasks. In this work, we propose **Procedurally Generated Tasks (PGT)**, a simple data-driven framework that serves a dual purpose: inducing fine-grained visual understanding and acting as a low-cost diagnostic tool to identify the source of perception failures. By overlaying unambiguous geometric primitives on images, PGT generates additional dense supervision that disentangles visual grounding capability from semantic priors. Extensive experiments on relational, quantitative, and 3D/depth understanding benchmarks show that PGT yields remarkable gains across diverse architectures. Instruction tuning MLLMs on LLaVA-v1.5-Instruct augmented with PGT data results in improvements of up to +20% on the What’sUp benchmark and +13.3% on CV-Bench-2D, while maintaining general perception capabilities. Moreover, finetuning state-of-the-art MLLMs on PGT data leads to boosts of up to +5.5% on What’sUp and +8.3% on CV-Bench-2D. These findings demonstrate that PGT effectively addresses the bottleneck of fine-grained perception, revealing that many spatial reasoning deficits stem from inadequate supervision signals rather than inherent architectural or resolution limitations.

2025-12-31

International Conference on Machine Learning (Accept (regular))

openreview.net

Object-centric Binding in Contrastive Language-Image Pretraining

Rim Assouel

Pietro Astolfi

Florian Bordes

Michal Drozdzal

Adriana Romero-Soriano

Recent advances in vision language models (VLMs) have been driven by contrastive models such as CLIP, which learn to associate visual information with their corresponding text descriptions. However, these models have limitations in understanding complex compositional scenes involving multiple objects and their spatial relationships. To address these challenges, we propose a novel approach that diverges from commonly used strategies, which rely on the design of hard-negative augmentations. Instead, our work focuses on integrating inductive biases into pre-trained CLIP-like models to improve their compositional understanding without using any additional hard-negatives. To that end, we introduce a binding module that connects a scene graph, derived from a text description, with a slot-structured image representation, facilitating a structured similarity assessment between the two modalities. We also leverage relationships as text-conditioned visual constraints, thereby capturing the intricate interactions between objects and their contextual relationships more effectively. Our resulting model not only enhances the performance of CLIP-based models in multi-object compositional understanding but also paves the way towards more accurate and sample-efficient image-text matching of complex scenes.

2025-09-17

NeurIPS.cc/2025/Conference (poster)

doi.org

openreview.net

Visual symbolic mechanisms: Emergent symbol processing in vision language models

Rim Assouel

Declan Campbell

Taylor Webb

To accurately process a visual scene, observers must bind features together to represent individual objects. This capacity is necessary, for instance, to distinguish an image containing a red square and a blue circle from an image containing a blue square and a red circle. Recent work has found that language models solve this 'binding problem' via a set of symbol-like, content-independent indices, but it is unclear whether similar mechanisms are employed by vision language models (VLMs). This question is especially relevant, given the persistent failures of VLMs on tasks that require binding. Here, we identify a set of emergent symbolic mechanisms that support binding in VLMs via a content-independent, spatial indexing scheme. Moreover, we find that binding errors can be traced directly to failures in these mechanisms. Taken together, these results shed light on the mechanisms that support symbol-like processing in VLMs, and suggest possible avenues for addressing the persistent binding failures exhibited by these models.

2025-06-17

ArXiv (preprint)

doi.org

arxiv.org

Action Abstractions for Amortized Sampling

Lena Nehale Ezzine

Nikolay Malkin

As trajectories sampled by policies used by reinforcement learning (RL) and generative flow networks (GFlowNets) grow longer, credit assignment and exploration become more challenging, and the long planning horizon hinders mode discovery and generalization. The challenge is particularly pronounced in entropy-seeking RL methods, such as generative flow networks, where the agent must learn to sample from a structured distribution and discover multiple high-reward states, each of which takes many steps to reach. To tackle this challenge, we propose an approach to incorporate the discovery of action abstractions, or high-level actions, into the policy optimization process. Our approach involves iteratively extracting action subsequences commonly used across many high-reward trajectories and 'chunking' them into a single action that is added to the action space. In empirical evaluation on synthetic and real-world environments, our approach demonstrates improved sample efficiency in discovering diverse high-reward objects, especially on harder exploration problems. We also observe that the abstracted high-order actions are interpretable, capturing the latent structure of the reward landscape of the action space. This work provides a cognitively motivated approach to action abstraction in RL and is the first demonstration of hierarchical planning in amortized sequential sampling.

2025-01-21

International Conference on Learning Representations (poster)

doi.org

openreview.net

The BrowserGym Ecosystem for Web Agent Research

Thibault Le Sellier De Chezelles

Maxime Gasse

Alexandre Lacoste

Massimo Caccia

Lawrence Keunho Jang

Ori Yoran

Dehan Kong

Frank F. Xu

Siva Reddy

Quentin Cappart

Graham Neubig

Ruslan Salakhutdinov

Nicolas Chapados

The BrowserGym ecosystem addresses the growing need for efficient evaluation and benchmarking of web agents, particularly those leveraging automation and Large Language Models (LLMs). Many existing benchmarks suffer from fragmentation and inconsistent evaluation methodologies, making it challenging to achieve reliable comparisons and reproducible results. In an earlier work, Drouin et al. (2024) introduced BrowserGym, which aims to solve this by providing a unified, gym-like environment with well-defined observation and action spaces, facilitating standardized evaluation across diverse benchmarks. We propose an extended BrowserGym-based ecosystem for web agent research, which unifies existing benchmarks from the literature and includes AgentLab, a complementary framework that aids in agent creation, testing, and analysis. Our proposed ecosystem offers flexibility for integrating new benchmarks while ensuring consistent evaluation and comprehensive experiment management. As supporting evidence, we conduct the first large-scale, multi-benchmark web agent experiment and compare the performance of 6 state-of-the-art LLMs across 6 popular web agent benchmarks made available in BrowserGym. Among other findings, our results highlight a large discrepancy between OpenAI and Anthropic's latest models, with Claude-3.5-Sonnet leading the way on almost all benchmarks, except on vision-related tasks where GPT-4o is superior. Despite these advancements, our results emphasize that building robust and efficient web agents remains a significant challenge, due to the inherent complexity of real-world web environments and the limitations of current models.

2024-12-31

Trans. Mach. Learn. Res. (published)

doi.org

openreview.net

OC-CLIP : Object-centric binding in Contrastive Language-Image Pretraining

Rim Assouel

Pietro Astolfi

Florian Bordes

Michal Drozdzal

Adriana Romero

Recent advancements in vision-language models (VLMs) have been driven by contrastive models like CLIP, which learn to associate visual information with their corresponding text descriptions. However, these models have limitations in understanding complex compositional scenes involving multiple objects and their spatial relationships. To address these challenges, we propose a novel approach that diverges from traditional data-centric methods of enhancing model performance with hard-negative examples. Our work instead focuses on integrating sufficient inductive biases into pre-trained CLIP-like models to improve their compositional understanding without using additional data annotations. We introduce a binding module that connects a scene graph of the text with an induced graph-like representation of the image, facilitating a structured similarity assessment. We also leverage relationships as text-conditioned visual constraints, thereby capturing the intricate interactions between objects and their contextual relationships more effectively. Our resulting model (OC-CLIP) not only enhances the performance of CLIP in multi-object compositional understanding but also paves the way for more accurate and efficient image-text matching in complex scenes.

2024-10-09

NeurIPS.cc/2024/Workshop/Compositional_Learning (poster)

openreview.net

An Introduction to Vision-Language Modeling

Florian Bordes

Richard Yuanzhe Pang

Anurag Ajay

Alexander C. Li

Adrien Bardes

Suzanne Petryk

Oscar Mañas

Zhiqiu Lin

Anas Mahmoud

Bargav Jayaraman

Mark Ibrahim

Melissa Hall

Yunyang Xiong

Jonathan Lebensold

Candace Ross

Srihari Jayakumar

Chuan Guo

Diane Bouchacourt

Haider Al-Tahan

Karthik Padthe

Vasu Sharma

Huijuan Xu

Hu Xu

Xiaoqing Ellen Tan

Megan Richards

Samuel Lavoie

Pietro Astolfi

Reyhane Askari Hemmat

Jun Chen

Kushal Tirumala

Rim Assouel

Mazda Moayeri

Arjang Talattof

Kamalika Chaudhuri

Zechun Liu

Xilun Chen

Quentin Garrido

Karen Ullrich

Aishwarya Agrawal

Kate Saenko

Asli Celikyilmaz

Vikas Chandra

2024-05-26

arXiv (preprint)

doi.org

arxiv.org

The Unsolved Challenges of LLMs as Generalist Web Agents: A Case Study

Rim Assouel

Tom Marty

Massimo Caccia

Issam Hadj Laradji

Alexandre Drouin

Sai Rajeswar

Hector Palacios

Quentin Cappart

David Vázquez

Nicolas Chapados

Maxime Gasse

Alexandre Lacoste

2023-11-06

NeurIPS.cc/2023/Workshop/FMDM (published)

openreview.net

OC-NMN: Object-centric Compositional Neural Module Network for Generative Visual Analogical Reasoning

Pau Rodríguez

A key aspect of human intelligence is the ability to imagine -- composing learned concepts in novel ways -- to make sense of new scenarios. Such capacity is not yet attained by machine learning systems. In this work, in the context of visual reasoning, we show how modularity can be leveraged to derive a compositional data augmentation framework inspired by imagination. Our method, denoted Object-centric Compositional Neural Module Network (OC-NMN), decomposes visual generative reasoning tasks into a series of primitives applied to objects without using a domain-specific language. We show that our modular architectural choices can be used to generate new training tasks that lead to better out-of-distribution generalization. We compare our model to existing and new baselines on a proposed visual reasoning benchmark that consists of applying arithmetic operations to MNIST digits.

2023-10-27

ArXiv (preprint)

doi.org

arxiv.org

VIM: Variational Independent Modules for Video Prediction

Lluis Castrejon

2022-06-27

Proceedings of the First Conference on Causal Learning and Reasoning (published)

proceedings.mlr.press

Object-centric Compositional Imagination for Visual Abstract Reasoning

Pau Rodríguez

Like humans devoid of imagination, current machine learning systems lack the ability to adapt to new, unexpected situations by foreseeing them, which makes them unable to solve new tasks by analogical reasoning. In this work, we introduce a new compositional imagination framework that improves a model's ability to generalize. One of the key components of our framework is object-centric inductive biases that enable models to perceive the environment as a series of objects, properties, and transformations. By composing these key ingredients, it is possible to generate new unseen tasks that, when used to train the model, improve generalization. Experiments on a simplified version of the Abstraction and Reasoning Corpus (ARC) demonstrate the effectiveness of our framework.

2022-03-24

ICLR.cc/2022/Workshop/OSC (poster)

openreview.net