Lucas Gomez

Doctorat - McGill

Superviseur⋅e principal⋅e

Pouya Bashivan

Sujets de recherche

Apprentissage profond

Site web

Publications

Caption This, Reason That: VLMs Caught in the Middle

Zihan Weng

Lucas Gomez

Taylor Whittington Webb

Pouya Bashivan

Vision-Language Models (VLMs) have shown remarkable progress in visual understanding in recent years. Yet, they still lag behind human capab… (voir plus)ilities in specific visual tasks such as counting or relational reasoning. To understand the underlying limitations, we adopt methodologies from cognitive science, analyzing VLM performance along core cognitive axes: Perception, Attention, and Memory. Using a suite of tasks targeting these abilities, we evaluate state-of-the-art VLMs, including GPT-4o. Our analysis reveals distinct cognitive profiles: while advanced models approach ceiling performance on some tasks (e.g. category identification), a significant gap persists, particularly in tasks requiring spatial understanding or selective attention. Investigating the source of these failures and potential methods for improvement, we employ a vision-text decoupling analysis, finding that models struggling with direct visual reasoning show marked improvement when reasoning over their own generated text captions. These experiments reveal a strong need for improved VLM Chain-of-Thought (CoT) abilities, even in models that consistently exceed human performance. Furthermore, we demonstrate the potential of targeted fine-tuning on composite visual reasoning tasks and show that fine-tuning smaller VLMs substantially improves core cognitive abilities. While this improvement does not translate to large enhancements on challenging, out-of-distribution benchmarks, we show broadly that VLM performance on our datasets strongly correlates with performance on these other benchmarks. Our work provides a detailed analysis of VLM cognitive strengths and weaknesses and identifies key bottlenecks in simultaneous perception and reasoning while also providing an effective and simple solution.

2025-09-17

NeurIPS.cc/2025/Conference (spotlight)

doi.org

openreview.net

iWISDM: Assessing instruction following in multimodal models at scale

Xiaoxuan Lei

Lucas Gomez

Hao Yuan Bai

Pouya Bashivan

The ability to perform complex tasks from detailed instructions is a key to the remarkable achievements of our species. As humans, we are no… (voir plus)t only capable of performing a wide variety of tasks but also very complex ones that may entail hundreds or thousands of steps to complete. Large language models and their more recent multimodal counterparts that integrate textual and visual inputs have achieved unprecedented success in performing complex tasks. Yet, most existing benchmarks are largely confined to single-modality inputs — either text or vision — and thus, narrowing the scope of multimodal integration assessments, particularly for instruction-following in multimodal contexts. To bridge this gap, we introduce the instructed-Virtual VISual Decision Making (iWISDM) environment engineered to generate a limitless array of vision-language tasks of varying complexity. Using iWISDM, we compiled three distinct benchmarks of instruction following visual tasks across varying complexity levels and evaluated several newly developed multimodal models on these benchmarks. Our findings establish iWISDM as a robust benchmark for assessing the instructional adherence of both existing and emergent multimodal models and highlight a large gap in these models’ ability to precisely follow instructions.

2023-12-31

CoLLAs (publié)

doi.org

proceedings.mlr.press

Publications du Fellowship en politiques de l'IA

La plateforme Mila Ventures

Boussole des politiques en IA

Lucas Gomez

Publications

Publications du Fellowship en politiques de l'IA

La plateforme Mila Ventures

Boussole des politiques en IA

Mots-clés populaires:

Lucas Gomez

Publications