Aishwarya Agrawal

The increasing ubiquity of text-to-image (T2I) models as tools for visual content generation raises concerns about their ability to accurate… (see more)ly represent diverse cultural contexts. In this work, we present the first study to systematically quantify the alignment of T2I models and evaluation metrics with respect to both explicit as well as implicit cultural expectations. To this end, we introduce CulturalFrames, a novel benchmark designed for rigorous human evaluation of cultural representation in visual generations. Spanning 10 countries and 5 socio-cultural domains, CulturalFrames comprises 983 prompts, 3637 corresponding images generated by 4 state-of-the-art T2I models, and over 10k detailed human annotations. We find that T2I models not only fail to meet the more challenging implicit expectations but also the less challenging explicit expectations. Across models and countries, cultural expectations are missed an average of 44% of the time. Among these failures, explicit expectations are missed at a surprisingly high average rate of 68%, while implicit expectation failures are also significant, averaging 49%. Furthermore, we demonstrate that existing T2I evaluation metrics correlate poorly with human judgments of cultural alignment, irrespective of their internal reasoning. Collectively, our findings expose critical gaps, providing actionable directions for developing more culturally informed T2I models and evaluation methodologies.

2025-06-10

ArXiv (preprint)

CulturalFrames: Assessing Cultural Expectation Alignment in Text-to-Image Models and Evaluation Metrics

Shravan Nayak

Mehar Bhatia

Xiaofeng Zhang

Verena Rieser

Lisa Anne Hendricks

Sjoerd van Steenkiste

Yash Goyal

Karolina Stanczak

The increasing ubiquity of text-to-image (T2I) models as tools for visual content generation raises concerns about their ability to accurate… (see more)ly represent diverse cultural contexts. In this work, we present the first study to systematically quantify the alignment of T2I models and evaluation metrics with respect to both explicit as well as implicit cultural expectations. To this end, we introduce CulturalFrames, a novel benchmark designed for rigorous human evaluation of cultural representation in visual generations. Spanning 10 countries and 5 socio-cultural domains, CulturalFrames comprises 983 prompts, 3637 corresponding images generated by 4 state-of-the-art T2I models, and over 10k detailed human annotations. We find that T2I models not only fail to meet the more challenging implicit expectations but also the less challenging explicit expectations. Across models and countries, cultural expectations are missed an average of 44% of the time. Among these failures, explicit expectations are missed at a surprisingly high average rate of 68%, while implicit expectation failures are also significant, averaging 49%. Furthermore, we demonstrate that existing T2I evaluation metrics correlate poorly with human judgments of cultural alignment, irrespective of their internal reasoning. Collectively, our findings expose critical gaps, providing actionable directions for developing more culturally informed T2I models and evaluation methodologies.

2025-06-10

ICML.cc/2025/Workshop/MoFA (poster)

openreview.net

Learning What Matters: Prioritized Concept Learning via Relative Error-driven Sample Selection

Shivam Chandhok

Qian Yang

Oscar Mañas

Kanishk Jain

Leonid Sigal

Instruction tuning has been central to the success of recent vision-language models (VLMs), but it remains expensive-requiring large-scale d… (see more)atasets, high-quality annotations, and large compute budgets. We propose PRioritized cOncept learninG via Relative Error-driven Sample Selection (PROGRESS), a data- and compute-efficient framework that enables VLMs to dynamically select what to learn next based on their evolving needs during training. At each stage, the model tracks its learning progress across skills and selects the most informative samples-those it has not already mastered and that are not too difficult to learn at the current stage of training. This strategy effectively controls skill acquisition and the order in which skills are learned. Specifically, we sample from skills showing the highest learning progress, prioritizing those with the most rapid improvement. Unlike prior methods, PROGRESS requires no upfront answer annotations, queries answers only on a need basis, avoids reliance on additional supervision from auxiliary VLMs, and does not require compute-heavy gradient computations for data selection. Experiments across multiple instruction-tuning datasets of varying scales demonstrate that PROGRESS consistently outperforms state-of-the-art baselines with much less data and supervision. Additionally, we show strong cross-architecture generalization and transferability to larger models, validating PROGRESS as a scalable solution for efficient learning.

2025-06-01

ArXiv (preprint)

REARANK: Reasoning Re-ranking Agent via Reinforcement Learning

Le Zhang

Bo Wang

Xipeng Qiu

Siva Reddy

We present REARANK, a large language model (LLM)-based listwise reasoning reranking agent. REARANK explicitly reasons before reranking, sign… (see more)ificantly improving both performance and interpretability. Leveraging reinforcement learning and data augmentation, REARANK achieves substantial improvements over baseline models across popular information retrieval benchmarks, notably requiring only 179 annotated samples. Built on top of Qwen2.5-7B, our REARANK-7B demonstrates performance comparable to GPT-4 on both in-domain and out-of-domain benchmarks and even surpasses GPT-4 on reasoning-intensive BRIGHT benchmarks. These results underscore the effectiveness of our approach and highlight how reinforcement learning can enhance LLM reasoning capabilities in reranking.

2025-05-01

arXiv (published)

UI-Vision: A Desktop-centric GUI Benchmark for Visual Perception and Interaction

Shravan Nayak

Xiangru Jian

Kevin Qinghong Lin

Juan A. Rodriguez

Montek Kalsi

Rabiul Awal

Nicolas Chapados

M. Tamer Özsu

David Vazquez

Chris Pal

Perouz Taslakian

Spandana Gella

Sai Rajeswar

Human Annotator

2025-05-01

ICML.cc/2025/Conference (poster)

openreview.net

UI-Vision: A Desktop-centric GUI Benchmark for Visual Perception and Interaction

Shravan Nayak

Xiangru Jian

Kevin Qinghong Lin

Juan A. Rodriguez

Montek Kalsi

Rabiul Awal

Nicolas Chapados

M. T. ¨Ozsu

David Vazquez

Chris Pal

Perouz Taslakian

Spandana Gella

Sai Rajeswar

Human Annotator

2025-03-19

ArXiv (preprint)

WebMMU: A Benchmark for Multimodal Multilingual Website Understanding and Code Generation

Rabiul Awal

Mahsa Massoud

Zichao Li

Aarash Feizi

Suyuchen Wang

Chris Pal

David Vazquez

Siva Reddy

Juan A. Rodriguez

Perouz Taslakian

Spandana Gella

Sai Rajeswar

Understanding diverse web data and automating web development presents an exciting challenge for agentic AI. While existing benchmarks addre… (see more)ss isolated web-based tasks—such as website-based Visual Question Answering (VQA) and UI-to-code generation—they lack a unified evaluation suite for assessing web agents that interact with and reason about web environments. We introduce WebMMU, a large-scale benchmark for evaluating AI-driven web agents across multilingual website VQA, HTML/CSS/JavaScript code editing, and sketch-to-code generation. WebMMU provides a comprehensive evaluation suite with real-world website data, multi-step reasoning tasks, and functional UI understanding. Benchmarking state-of-the-art multimodal models on WebMMU reveals significant limitations in web-based reasoning, layout understanding, and structured code generation, particularly in preserving UI hierarchy, handling multilingual content, and producing robust, functional code. While most existing models are optimized for English-only settings, WebMMU highlights the challenges of cross-lingual adaptation in real-world web development. These findings expose critical gaps in current models’ ability to understand website structures, execute user instructions, and generate high-quality web code, underscoring the need for more advanced multimodal reasoning in AI-driven web understanding and development.

2025-03-05

ICLR.cc/2025/Workshop/DL4C (published)

openreview.net

Assessing and Learning Alignment of Unimodal Vision and Language Models

Le Zhang

Qian Yang

How well are unimodal vision and language models aligned? Although prior work have approached answering this question, their assessment meth… (see more)ods do not directly translate to how these models are used in practical vision-language tasks. In this paper, we propose a direct assessment method, inspired by linear probing, to assess vision-language alignment. We identify that the degree of alignment of the SSL vision models depends on their SSL training objective, and we find that the clustering quality of SSL representations has a stronger impact on alignment performance than their linear separability. Next, we introduce Swift Alignment of Image and Language (SAIL), a efficient transfer learning framework that aligns pretrained unimodal vision and language models for downstream vision-language tasks. Since SAIL leverages the strengths of pretrained unimodal models, it requires significantly fewer (6%) paired image-text data for the multimodal alignment compared to models like CLIP which are trained from scratch. SAIL training only requires a single A100 GPU, 5 hours of training and can accommodate a batch size up to 32,768. SAIL achieves 73.4% zero-shot accuracy on ImageNet (vs. CLIP's 72.7%) and excels in zero-shot retrieval, complex reasoning, and semantic segmentation. Additionally, SAIL improves the language-compatibility of vision encoders that in turn enhance the performance of multimodal large language models. The entire codebase and model weights are open-source: https://lezhang7.github.io/sail.github.io/

2024-12-05

ArXiv (preprint)

Assessing and Learning Alignment of Unimodal Vision and Language Models

Le Zhang

Qian Yang

How well are unimodal vision and language models aligned? Although prior work have approached answering this question, their assessment meth… (see more)ods do not directly translate to how these models are used in practical vision-language tasks. In this paper, we propose a direct assessment method, inspired by linear probing, to assess vision-language alignment. We identify that the degree of alignment of the SSL vision models depends on their SSL training objective, and we find that the clustering quality of SSL representations has a stronger impact on alignment performance than their linear separability. Next, we introduce Swift Alignment of Image and Language (SAIL), a efficient transfer learning framework that aligns pretrained unimodal vision and language models for downstream vision-language tasks. Since SAIL leverages the strengths of pretrained unimodal models, it requires significantly fewer (6%) paired image-text data for the multimodal alignment compared to models like CLIP which are trained from scratch. SAIL training only requires a single A100 GPU, 5 hours of training and can accommodate a batch size up to 32,768. SAIL achieves 73.4% zero-shot accuracy on ImageNet (vs. CLIP's 72.7%) and excels in zero-shot retrieval, complex reasoning, and semantic segmentation. Additionally, SAIL improves the language-compatibility of vision encoders that in turn enhance the performance of multimodal large language models. The entire codebase and model weights are open-source: https://lezhang7.github.io/sail.github.io/

2024-12-05

ArXiv (preprint)

Assessing and Learning Alignment of Unimodal Vision and Language Models

Le Zhang

Qian Yang

How well are unimodal vision and language models aligned? Although prior work have approached answering this question, their assessment meth… (see more)ods do not directly translate to how these models are used in practical vision-language tasks. In this paper, we propose a direct assessment method, inspired by linear probing, to assess vision-language alignment. We identify that the degree of alignment of the SSL vision models depends on their SSL training objective, and we find that the clustering quality of SSL representations has a stronger impact on alignment performance than their linear separability. Next, we introduce Swift Alignment of Image and Language (SAIL), a efficient transfer learning framework that aligns pretrained unimodal vision and language models for downstream vision-language tasks. Since SAIL leverages the strengths of pretrained unimodal models, it requires significantly fewer (6%) paired image-text data for the multimodal alignment compared to models like CLIP which are trained from scratch. SAIL training only requires a single A100 GPU, 5 hours of training and can accommodate a batch size up to 32,768. SAIL achieves 73.4% zero-shot accuracy on ImageNet (vs. CLIP's 72.7%) and excels in zero-shot retrieval, complex reasoning, and semantic segmentation. Additionally, SAIL improves the language-compatibility of vision encoders that in turn enhance the performance of multimodal large language models. The entire codebase and model weights are open-source: https://lezhang7.github.io/sail.github.io/

2024-12-05

ArXiv (preprint)

Improving Text-to-Image Consistency via Automatic Prompt Optimization

Oscar Mañas

Pietro Astolfi

Melissa Hall

Candace Ross

Jack Urbanek

Adina Williams