Publications

Towards Interpreting Visual Information Processing in Vision-Language Models

Clement Neo

Luke Ong

Philip Torr

Mor Geva

David Scott Krueger

Fazl Barez

Vision-Language Models (VLMs) are powerful tools for processing and understanding text and images. We study the processing of visual tokens in the language model component of LLaVA, a prominent VLM. Our approach focuses on analyzing the localization of object information, the evolution of visual token representations across layers, and the mechanism of integrating visual information for predictions. Through ablation studies, we demonstrated that object identification accuracy drops by over 70% when object-specific tokens are removed. We observed that visual token representations become increasingly interpretable in the vocabulary space across layers, suggesting an alignment with textual tokens corresponding to image content. Finally, we found that the model extracts object information from these refined representations at the last token position for prediction, mirroring the process in text-only language models for factual association tasks. These findings provide crucial insights into how VLMs process and integrate visual information, bridging the gap between our understanding of language and vision models, and paving the way for more interpretable and controllable multimodal systems.

2024-10-09

ArXiv (preprint)

Towards Reliable Evaluation of Behavior Steering Interventions in LLMs

Itamar Pres

Laura Ruis

Ekdeep Singh Lubana

David Scott Krueger

Representation engineering methods have recently shown promise for enabling efficient steering of model behavior. However, evaluation pipelines for these methods have primarily relied on subjective demonstrations, instead of quantitative, objective metrics. We aim to take a step towards addressing this issue by advocating for four properties missing from current evaluations: (i) contexts sufficiently similar to downstream tasks should be used for assessing intervention quality; (ii) model likelihoods should be accounted for; (iii) evaluations should allow for standardized comparisons across different target behaviors; and (iv) baseline comparisons should be offered. We introduce an evaluation pipeline grounded in these criteria, offering both a quantitative and visual analysis of how effectively a given method works. We use this pipeline to evaluate two representation engineering methods on how effectively they can steer behaviors such as truthfulness and corrigibility, finding that some interventions are less effective than previously reported.

2024-10-09

NeurIPS.cc/2024/Workshop/MINT (accepted)

VCR: Visual Caption Restoration

Tianyu Zhang

Suyuchen Wang

Lu Liu

Ge Zhang

Perouz Taslakian

Sai Rajeswar

Jie Fu

Bang Liu

Yoshua Bengio

We introduce Visual Caption Restoration (VCR), a novel vision-language task that challenges models to accurately restore partially obscured texts using pixel-level hints within images. This task stems from the observation that text embedded in images is intrinsically different from common visual elements and natural language due to the need to align the modalities of vision, text, and text embedded in images. While numerous works have integrated text embedded in images into visual question-answering tasks, approaches to these tasks generally rely on optical character recognition or masked language modeling, thus reducing the task to mainly text-based processing. However, text-based processing becomes ineffective in VCR as accurate text restoration depends on the combined information from provided images, context, and subtle cues from the tiny exposed areas of masked texts. We develop a pipeline to generate synthetic images for the VCR task using image-caption pairs, with adjustable caption visibility to control the task difficulty. With this pipeline, we construct a dataset for VCR called VCR-Wiki using images with captions from Wikipedia, comprising 2.11M English and 346K Chinese entities in both easy and hard split variants. Our results reveal that current vision language models significantly lag behind human performance in the VCR task, and merely fine-tuning the models on our dataset does not lead to notable improvements. We release VCR-Wiki and the data construction code to facilitate future research.

2024-10-09

NeurIPS.cc/2024/Workshop/Sys2-Reasoning (poster)

VinePPO: Accurate Credit Assignment in RL for LLM Mathematical Reasoning

Amirhossein Kazemnejad

Milad Aghajohari

Large language models (LLMs) are increasingly required to solve complex reasoning tasks, like mathematical problems, that involve multiple reasoning steps before feedback is received. Effectively identifying and prioritizing key steps by accurately assigning credit to these intermediate steps is essential for enhancing model performance. Proximal Policy Optimization (PPO), a state-of-the-art reinforcement learning algorithm for finetuning LLMs, addresses the credit assignment problem by employing value networks to predict the expected cumulative rewards of intermediate states. In this work, we identify significant limitations with this value estimation method. To address this, we propose VinePPO, which leverages the flexibility of language environments to compute unbiased Monte Carlo-based estimates of the intermediate values. VinePPO consistently outperforms standard PPO, doing so more efficiently and with lower divergence from the reference model. Our findings underscore the critical importance of accurate credit assignment in LLM post-training and present a simple, yet effective solution.

2024-10-09

NeurIPS.cc/2024/Workshop/MATH-AI (accepted)

Visual Writing: Writing by Manipulating Visual Representations of Stories

Damien Masson

Zixin Zhao

Fanny Chevalier

We introduce "visual writing", an approach to writing stories by manipulating visuals instead of words. Visual writing relies on editable visual representations of time, entities, events, and locations to offer representations more suited to specific editing tasks. We propose a taxonomy for these representations and implement a prototype software supporting the visual writing workflow. The system allows writers to edit the story by alternating between modifying the text and manipulating visual representations to edit entities, actions, locations, and order of events. We evaluate this workflow with eight creative writers and find visual writing can help find specific passages, keep track of story elements, specify edits, and explore story variations in a way that encourages creativity.

2024-10-09

ArXiv (preprint)

Differentiation Through Black-Box Quadratic Programming Solvers

Connor W. Magoon

Fengyu Yang

Noam Aigerman

Shahar Kovalsky

2024-10-08

ArXiv (preprint)

Path-filtering in path-integral simulations of open quantum systems using GFlowNets

Jeremy Lackman-Mincoff

Moksh J. Jain

Nikolay Malkin

Yoshua Bengio

Lena Simine

2024-10-08

Journal of Chemical Physics (published)

Alexia Jolicoeur-Martineau

Beyond FVD: Enhanced Evaluation Metrics for Video Generation Quality

Ge Ya Luo

Gian Mario Favero

Zhi Hao Luo

Chris Pal

The Fréchet Video Distance (FVD) is a widely adopted metric for evaluating video generation distribution quality. However, its effectiveness relies on critical assumptions. Our analysis reveals three significant limitations: (1) the non-Gaussianity of the Inflated 3D Convnet (I3D) feature space; (2) the insensitivity of I3D features to temporal distortions; (3) the impractical sample sizes required for reliable estimation. These findings undermine FVD's reliability and show that FVD falls short as a standalone metric for video generation evaluation. After extensive analysis of a wide range of metrics and backbone architectures, we propose JEDi, the JEPA Embedding Distance, based on features derived from a Joint Embedding Predictive Architecture, measured using Maximum Mean Discrepancy with polynomial kernel. Our experiments on multiple open-source datasets show clear evidence that it is a superior alternative to the widely used FVD metric, requiring only 16% of the samples to reach its steady value, while increasing alignment with human evaluation by 34%, on average.

2024-10-07

ArXiv (preprint)
