Aishwarya Agrawal

Kate Saenko

Asli Celikyilmaz

Vikas Chandra

Following the recent popularity of Large Language Models (LLMs), several attempts have been made to extend them to the visual domain. From having a visual assistant that could guide us through unfamiliar environments to generative models that produce images using only a high-level text description, the vision-language model (VLM) applications will significantly impact our relationship with technology. However, there are many challenges that need to be addressed to improve the reliability of those models. While language is discrete, vision evolves in a much higher dimensional space in which concepts cannot always be easily discretized. To better understand the mechanics behind mapping vision to language, we present this introduction to VLMs which we hope will help anyone who would like to enter the field. First, we introduce what VLMs are, how they work, and how to train them. Then, we present and discuss approaches to evaluate VLMs. Although this work primarily focuses on mapping images to language, we also discuss extending VLMs to videos.

2024-05-27

ArXiv (preprint)

An Introduction to Vision-Language Modeling

Florian Bordes

Richard Yuanzhe Pang

Anurag Ajay

Alexander C. Li

Adrien Bardes

Suzanne Petryk

Oscar Mañas

Zhiqiu Lin

Anas Mahmoud

Bargav Jayaraman

Mark Ibrahim

Melissa Hall

Yunyang Xiong

Jonathan Lebensold

Candace Ross

Srihari Jayakumar

Chuan Guo

Diane Bouchacourt

Haider Al-Tahan

Karthik Padthe

Vasu Sharma

Huijuan Xu

Xiaoqing Ellen Tan

Megan Richards

Samuel Lavoie

Pietro Astolfi

Reyhane Askari Hemmat

Jun Chen

Kushal Tirumala

Rim Assouel

Mazda Moayeri

Arjang Talattof

Kamalika Chaudhuri

Zechun Liu

Xilun Chen

Quentin Garrido

Karen Ullrich

Kate Saenko

Asli Celikyilmaz

Vikas Chandra

Following the recent popularity of Large Language Models (LLMs), several attempts have been made to extend them to the visual domain. From having a visual assistant that could guide us through unfamiliar environments to generative models that produce images using only a high-level text description, the vision-language model (VLM) applications will significantly impact our relationship with technology. However, there are many challenges that need to be addressed to improve the reliability of those models. While language is discrete, vision evolves in a much higher dimensional space in which concepts cannot always be easily discretized. To better understand the mechanics behind mapping vision to language, we present this introduction to VLMs which we hope will help anyone who would like to enter the field. First, we introduce what VLMs are, how they work, and how to train them. Then, we present and discuss approaches to evaluate VLMs. Although this work primarily focuses on mapping images to language, we also discuss extending VLMs to videos.

2024-05-27

ArXiv (preprint)

Improving Automatic VQA Evaluation Using Large Language Models

Oscar Mañas

Benno Krojer

8 years after the visual question answering (VQA) task was proposed, accuracy remains the primary metric for automatic evaluation. VQA Accuracy has been effective so far in the IID evaluation setting. However, our community is undergoing a shift towards open-ended generative models and OOD evaluation. In this new paradigm, the existing VQA Accuracy metric is overly stringent and underestimates the performance of VQA systems. Thus, there is a need to develop more robust automatic VQA metrics that serve as a proxy for human judgment. In this work, we propose to leverage the in-context learning capabilities of instruction-tuned large language models (LLMs) to build a better VQA metric. We formulate VQA evaluation as an answer-rating task where the LLM is instructed to score the accuracy of a candidate answer given a set of reference answers. We demonstrate the proposed metric better correlates with human judgment compared to existing metrics across several VQA models and benchmarks. We hope wide adoption of our metric will contribute to better estimating the research progress on the VQA task. We plan to release the evaluation code and collected human judgments.

2024-03-24

Proceedings of the AAAI Conference on Artificial Intelligence (published)

Molar Pregnancy in a Quadruplet Conception Following IVF: A Case Report

Madhuri A Mehendale

Meenal Shailesh Sarmalkar

Prerna Kailashchand Gupta

Agraj S Doshi

2024-02-23

Journal of South Asian Federation of Obstetrics and Gynaecology (published)

Recent Excavation of Nanoethosomes in Current Drug Delivery.

Sankha Bhattacharya

Aalind Joshi

In the current era, the transdermal delivery of bioactive molecules has become an area of research interest. The transdermal route of administration enables direct entry of bioactive molecules into the systemic circulation with better and easy accessibility, bypassing the hepatic metabolism and improving patient compliance. Permeation through the skin has always been a barrier. To overcome this challenge, an efficient route by the vesicular system has been adopted so as to have better skin permeation of the bioactive molecules. A novel vesicular and non-invasive drug delivery system called Nanoethosomes was developed. Nanoethosomes are lipid-based vesicular carriers that are used for deeper permeation of the bioactive agents into the skin. The main components of Nanoethosomes are phospholipids, water, and ethanol. High ethanol concentration in Nanoethosomes distinguishes them from other nano-formulations and results in deeper permeation and smaller vesicular size. This review article gives detailed information on the formulation techniques and characterization parameters of nanoethosomes along with the research work done by various researchers in the same field. The compiled manuscript gives detailed elaboration about the various drugs used to treat different diseases which when incorporated in nanoethosomes resulted in better permeability and enhanced bioavailability.

2024-02-01

Current Drug Delivery (published)

Benchmarking Vision Language Models for Cultural Understanding

Shravan Nayak

Kanishk Jain

Rabiul Awal

Siva Reddy

Sjoerd van Steenkiste

Lisa Anne Hendricks

Karolina Stanczak

Foundation models and vision-language pre-training have notably advanced Vision Language Models (VLMs), enabling multimodal processing of visual and linguistic data. However, their performance has been typically assessed on general scene understanding - recognizing objects, attributes, and actions - rather than cultural comprehension. This study introduces CulturalVQA, a visual question-answering benchmark aimed at assessing VLMs' geo-diverse cultural understanding. We curate a collection of 2,378 image-question pairs with 1-5 answers per question representing cultures from 11 countries across 5 continents. The questions probe understanding of various facets of culture such as clothing, food, drinks, rituals, and traditions. Benchmarking VLMs on CulturalVQA, including GPT-4V and Gemini, reveals disparity in their level of cultural understanding across regions, with strong cultural understanding capabilities for North America while significantly lower performance for Africa. We observe disparity in their performance across cultural facets too, with clothing, rituals, and traditions seeing higher performances than food and drink. These disparities help us identify areas where VLMs lack cultural understanding and demonstrate the potential of CulturalVQA as a comprehensive evaluation set for gauging VLM progress in understanding diverse cultures.

2024-01-01

EMNLP (published)

MoqaGPT : Zero-Shot Multi-modal Open-domain Question Answering with Large Language Model

Le Zhang

Yihong Wu

Fengran Mo

Jian-Yun Nie

Multi-modal open-domain question answering typically requires evidence retrieval from databases across diverse modalities, such as images, tables, passages, etc. Even Large Language Models (LLMs) like GPT-4 fall short in this task. To enable LLMs to tackle the task in a zero-shot manner, we introduce MoqaGPT, a straightforward and flexible framework. Using a divide-and-conquer strategy that bypasses intricate multi-modality ranking, our framework can accommodate new modalities and seamlessly transition to new models for the task. Built upon LLMs, MoqaGPT retrieves and extracts answers from each modality separately, then fuses this multi-modal information using LLMs to produce a final answer. Our methodology boosts performance on the MMCoQA dataset, improving F1 by +37.91 points and EM by +34.07 points over the supervised baseline. On the MultiModalQA dataset, MoqaGPT surpasses the zero-shot baseline, improving F1 by 9.5 points and EM by 10.1 points, and significantly closes the gap with supervised methods. Our codebase is available at https://github.com/lezhang7/MOQAGPT.

2023-10-07

EMNLP/2023/Conference (published)

openreview.net

Investigating Prompting Techniques for Zero- and Few-Shot Visual Question Answering

Rabiul Awal

Le Zhang

In this paper, we explore effective prompting techniques to enhance zero- and few-shot Visual Question Answering (VQA) performance in contemporary Vision-Language Models (VLMs). Central to our investigation is the role of question templates in guiding VLMs to generate accurate answers. We identify that specific templates significantly influence VQA outcomes, underscoring the need for strategic template selection. Another pivotal aspect of our study is augmenting VLMs with image captions, providing them with additional visual cues alongside direct image features in VQA tasks. Surprisingly, this augmentation significantly improves the VLMs' performance in many cases, even though VLMs "see" the image directly! We explore chain-of-thought (CoT) reasoning and find that while standard CoT reasoning causes drops in performance, advanced methods like self-consistency can help recover it. Furthermore, we find that text-only few-shot examples enhance VLMs' alignment with the task format, particularly benefiting models prone to verbose zero-shot answers. Lastly, to mitigate the challenges associated with evaluating free-form open-ended VQA responses using string-matching based VQA metrics, we introduce a straightforward LLM-guided pre-processing technique to adapt the model responses to the expected ground-truth answer distribution. In summary, our research sheds light on the intricacies of prompting strategies in VLMs for VQA, emphasizing the synergistic use of captions, templates, and pre-processing to enhance model efficacy.

2023-06-16

ArXiv (preprint)

An Examination of the Robustness of Reference-Free Image Captioning Evaluation Metrics

Saba Ahmadi

2023-05-24

ArXiv (preprint)

MAPL: Parameter-Efficient Adaptation of Unimodal Pre-Trained Models for Vision-Language Few-Shot Prompting

Oscar Mañas

Pau Rodriguez

Saba Ahmadi

Aida Nematzadeh

Yash Goyal

Large pre-trained models have proved to be remarkable zero- and (prompt-based) few-shot learners in unimodal vision and language tasks. We propose MAPL, a simple and parameter-efficient method that reuses frozen pre-trained unimodal models and leverages their strong generalization capabilities in multimodal vision-language (VL) settings. MAPL learns a lightweight mapping between the representation spaces of unimodal models using aligned image-text data, and can generalize to unseen VL tasks from just a few in-context examples. The small number of trainable parameters makes MAPL effective at low-data and in-domain learning. Moreover, MAPL’s modularity enables easy extension to other pre-trained models. Extensive experiments on several visual question answering and image captioning benchmarks show that MAPL achieves superior or competitive performance compared to similar methods while training orders of magnitude fewer parameters. MAPL can be trained in just a few hours using modest computational resources and public datasets. We release our code and pre-trained model weights at https://github.com/oscmansan/mapl.

2023-05-01

Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics (published)