
Le Zhang

PhD - Université de Montréal
Principal supervisor

Publications

MoqaGPT: Zero-Shot Multi-modal Open-domain Question Answering with Large Language Model
Le Zhang
Yihong Wu
Fengran Mo
Jian-Yun Nie
Multi-modal open-domain question answering typically requires evidence retrieval from databases across diverse modalities, such as images, tables, passages, etc. Even Large Language Models (LLMs) like GPT-4 fall short in this task. To enable LLMs to tackle the task in a zero-shot manner, we introduce MoqaGPT, a straightforward and flexible framework. Using a divide-and-conquer strategy that bypasses intricate multi-modality ranking, our framework can accommodate new modalities and seamlessly transition to new models for the task. Built upon LLMs, MoqaGPT retrieves and extracts answers from each modality separately, then fuses this multi-modal information using LLMs to produce a final answer. Our methodology boosts performance on the MMCoQA dataset, improving F1 by +37.91 points and EM by +34.07 points over the supervised baseline. On the MultiModalQA dataset, MoqaGPT surpasses the zero-shot baseline, improving F1 by 9.5 points and EM by 10.1 points, and significantly closes the gap with supervised methods. Our codebase is available at https://github.com/lezhang7/MOQAGPT.
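The divide-and-conquer idea in the abstract can be summarized as: answer the question independently in each modality, then let an LLM fuse the candidates. The sketch below is a minimal illustration of that flow, assuming hypothetical retriever, reader, and LLM callables; it is not the authors' implementation (see https://github.com/lezhang7/MOQAGPT for the released code).

```python
# Minimal sketch of a divide-and-conquer multi-modal QA loop.
# All helper names (retrievers, reader, llm) are illustrative assumptions.

from typing import Callable, Dict, List


def answer_per_modality(
    question: str,
    retrievers: Dict[str, Callable[[str], List[str]]],
    reader: Callable[[str, str], str],
) -> Dict[str, str]:
    """Retrieve evidence and extract a candidate answer from each modality separately."""
    candidates = {}
    for modality, retrieve in retrievers.items():
        evidence = retrieve(question)        # e.g. top-k passages, table rows, image captions
        context = "\n".join(evidence)
        candidates[modality] = reader(question, context)
    return candidates


def fuse_with_llm(question: str, candidates: Dict[str, str], llm: Callable[[str], str]) -> str:
    """Ask an LLM to reconcile per-modality candidates into one final answer."""
    listing = "\n".join(f"- {modality}: {answer}" for modality, answer in candidates.items())
    prompt = (
        f"Question: {question}\n"
        f"Candidate answers from different modalities:\n{listing}\n"
        "Pick the most plausible final answer and reply with the answer only."
    )
    return llm(prompt)
```

Because each modality is handled separately, swapping in a new retriever or reader for an additional modality does not require retraining or cross-modal ranking, which is the flexibility the abstract emphasizes.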
Investigating Prompting Techniques for Zero- and Few-Shot Visual Question Answering
Md. Rabiul Awal
Le Zhang
In this paper, we explore effective prompting techniques to enhance zero- and few-shot Visual Question Answering (VQA) performance in contemporary Vision-Language Models (VLMs). Central to our investigation is the role of question templates in guiding VLMs to generate accurate answers. We identify that specific templates significantly influence VQA outcomes, underscoring the need for strategic template selection. Another pivotal aspect of our study is augmenting VLMs with image captions, providing them with additional visual cues alongside direct image features in VQA tasks. Surprisingly, this augmentation significantly improves the VLMs' performance in many cases, even though VLMs "see" the image directly! We explore chain-of-thought (CoT) reasoning and find that while standard CoT reasoning causes drops in performance, advanced methods like self-consistency can help recover it. Furthermore, we find that text-only few-shot examples enhance VLMs' alignment with the task format, particularly benefiting models prone to verbose zero-shot answers. Lastly, to mitigate the challenges associated with evaluating free-form open-ended VQA responses using string-matching based VQA metrics, we introduce a straightforward LLM-guided pre-processing technique to adapt the model responses to the expected ground-truth answer distribution. In summary, our research sheds light on the intricacies of prompting strategies in VLMs for VQA, emphasizing the synergistic use of captions, templates, and pre-processing to enhance model efficacy.
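The ingredients named in the abstract (a question template, an image caption as an extra cue, text-only few-shot examples, and LLM-guided answer normalization) can be combined into a single prompt. The sketch below shows one plausible assembly under those assumptions; the template wording and helper names are illustrative, not the paper's code.

```python
# Illustrative prompt assembly for zero-/few-shot VQA: a fixed question template,
# an image caption as an additional visual cue, and text-only few-shot examples.
# Template wording, example pairs, and the llm callable are assumptions.

TEMPLATE = "Answer the question in one or two words.\nQuestion: {question}\nAnswer:"

FEW_SHOT_TEXT = [
    ("What color is the sky on a clear day?", "blue"),
    ("How many wheels does a bicycle have?", "2"),
]


def build_vqa_prompt(question: str, caption: str = "") -> str:
    parts = []
    # Text-only few-shot examples mainly teach the expected short-answer format.
    for q, a in FEW_SHOT_TEXT:
        parts.append(TEMPLATE.format(question=q) + " " + a)
    if caption:
        # The caption supplements the image features the VLM already receives.
        parts.append(f"Image caption: {caption}")
    parts.append(TEMPLATE.format(question=question))
    return "\n\n".join(parts)


def normalize_answer(raw: str, llm) -> str:
    """LLM-guided pre-processing: map a free-form response to a short, ground-truth-style answer."""
    prompt = f"Rewrite this VQA answer as a single short phrase: {raw!r}"
    return llm(prompt).strip().lower()
```

The normalization step addresses the evaluation issue raised in the abstract: string-matching VQA metrics penalize verbose answers, so free-form responses are first compressed toward the expected answer distribution.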
Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Compositional Understanding
Le Zhang
Rabiul Awal
Vision-Language Models (VLMs), such as CLIP, exhibit strong image-text comprehension abilities, facilitating advances in several downstream tasks such as zero-shot image classification, image-text retrieval, and text-to-image generation. However, the compositional reasoning abilities of existing VLMs remain subpar. The root of this limitation lies in the inadequate alignment between the images and captions in the pretraining datasets. Additionally, the current contrastive learning objective fails to focus on fine-grained grounding components like relations, actions, and attributes, resulting in "bag-of-words" representations. We introduce a simple and effective method to improve compositional reasoning in VLMs. Our method better leverages available datasets by refining and expanding the standard image-text contrastive learning framework. Our approach does not require specific annotations and does not incur extra parameters. When integrated with CLIP, our technique yields notable improvement over state-of-the-art baselines across five vision-language compositional benchmarks. We open-source our code at https://github.com/lezhang7/Enhance-FineGrained.
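One common way to push a contrastive objective beyond "bag-of-words" behavior is to add rule-generated hard-negative captions (same words, broken composition) to the image-text loss. The PyTorch sketch below is a minimal illustration of that general idea, assuming a simple word-swap rule and pre-computed embeddings; it is not the paper's objective (see https://github.com/lezhang7/Enhance-FineGrained for the released code).

```python
# Minimal PyTorch sketch: image-text contrastive loss extended with hard-negative
# captions generated by a crude word-swap rule. Shapes, the swap rule, and the
# temperature value are illustrative assumptions.

import random

import torch
import torch.nn.functional as F


def swap_two_words(caption: str) -> str:
    """Intra-modal hard negative: swap two words so composition breaks while the word set stays."""
    words = caption.split()
    if len(words) < 2:
        return caption
    i, j = random.sample(range(len(words)), 2)
    words[i], words[j] = words[j], words[i]
    return " ".join(words)


def contrastive_loss_with_hard_negatives(
    image_emb: torch.Tensor,     # (B, D), L2-normalized image embeddings
    text_emb: torch.Tensor,      # (B, D), L2-normalized matching caption embeddings
    hard_neg_emb: torch.Tensor,  # (B, D), embeddings of the perturbed captions
    temperature: float = 0.07,
) -> torch.Tensor:
    b = image_emb.size(0)
    # Each image is scored against all true captions and all perturbed captions;
    # only its own true caption counts as the positive.
    logits = image_emb @ torch.cat([text_emb, hard_neg_emb]).t() / temperature  # (B, 2B)
    targets = torch.arange(b, device=image_emb.device)
    return F.cross_entropy(logits, targets)
```

Because the perturbed captions contain the same words as the positives, the model can only separate them by attending to word order, relations, and attributes, which is the fine-grained signal the abstract targets.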
Single-cell analysis reveals inflammatory interactions driving macular degeneration
Manik Kuchroo
Marcello DiStasio
Eric Song
Eda Calapkulu
Le Zhang
Maryam Ige
Amar H. Sheth
Abdelilah Majdoubi
Madhvi Menon
Alexander Tong
Abhinav Godavarthi
Yu Xing
Scott Gigante
Holly Steach
Jessie Huang
Je-chun Huang
Guillaume Huguet
Janhavi Narain
Kisung You
George Mourgkos … (see 6 more)
Rahul M. Dhodapkar
Matthew Hirn
Bastian Rieck
Smita Krishnaswamy
Brian P. Hafler
Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Fine-grained Understanding
Le Zhang
Md. Rabiul Awal