Portrait of Aishwarya Agrawal

Aishwarya Agrawal

Core Academic Member
Canada CIFAR AI Chair
Assistant Professor, Université de Montréal, Department of Computer Science and Operations Research
Research Scientist, Google DeepMind, Montréal
Research Topics
Computer Vision
Deep Learning
Multimodal Learning
Natural Language Processing

Biography

Aishwarya Agrawal is an assistant professor in the Department of Computer Science and Operations Research at Université de Montréal, a Canada CIFAR AI Chair, and a core academic member of Mila – Quebec Artificial Intelligence Institute.

Agrawal also works as a research scientist one day a week at DeepMind. Previously, she held this position full time (August 2019 to December 2020). She completed her PhD in August 2019 at Georgia Tech, where she worked with Dhruv Batra and Devi Parikh.

Her research interests lie at the intersection of the following sub-disciplines of AI: computer vision, deep learning and natural language processing. The focus is developing AI systems that can ‘see’ (i.e., understand the contents of an image: who, what, where, doing what?) and ‘talk’ (i.e., communicate the understanding to humans in free-form natural language).

Aishwarya has received many awards and scholarships: Georgia Tech 2020 Sigma Xi Best PhD Thesis Award, 2020 Georgia Tech College of Computing Dissertation Award, 2019 Google Fellowship (declined due to graduation), 2019–2020 Facebook Fellowship (declined due to graduation) and 2018–2019 NVIDIA Graduate Fellowship. She was one of two runners-up in the 2019 AAAI/ACM SIGAI Dissertation Award, and was selected as a 2018 Rising Star in EECS.

She holds a bachelor's degree in electrical engineering with a minor in computer science and engineering from the Indian Institute of Technology Gandhinagar (2014).

Current Students

Master's Research - Université de Montréal
PhD - Université de Montréal
PhD - Université de Montréal
Master's Research - Université de Montréal
Master's Research - Université de Montréal
PhD - Université de Montréal
PhD - Université de Montréal

Publications

VisMin: Visual Minimal-Change Understanding
Rabiul Awal
Saba Ahmadi
Le Zhang
Fine-grained understanding of objects, attributes, and relationships between objects is crucial for visual-language models (VLMs). To evalua… (see more)te VLMs' fine-grained understanding, existing benchmarks primarily focus on evaluating VLMs' capability to distinguish between two very similar captions given an image. In this paper, our focus is on evaluating VLMs' capability to distinguish between two very similar images given a caption. To this end, we introduce a new, challenging benchmark termed Visual Minimal-Change Understanding (VisMin), which requires models to predict the correct image-caption match given two images and two captions. Importantly, the image pair (as well as the caption pair) contains minimal changes, i.e., between the two images (as well as between the two captions), only one aspect changes at a time from among the following possible types of changes: object, attribute, count, and spatial relation. These four types of minimal changes are specifically designed to test the models' understanding of objects, attributes of objects (such as color, material, shape), counts of objects, and spatial relationships between objects. To curate our benchmark, we built an automatic pipeline using large language models and diffusion models, followed by a rigorous 4-step verification process by human annotators. Empirical experiments reveal that current VLMs exhibit notable deficiencies in understanding spatial relationships and counting abilities. Furthermore, leveraging the automated nature of our data creation process, we generate a large-scale training dataset, which we use to finetune CLIP (a foundational VLM) and Idefics2 (a multimodal large language model). Our findings show that both these models benefit significantly from fine-tuning on this data, as evident by marked improvements in fine-grained understanding across a wide range of benchmarks. Additionally, such fine-tuning improves CLIP's general image-text alignment capabilities too. All resources including the benchmark, the training data, and the finetuned model checkpoints will be released.
VisMin: Visual Minimal-Change Understanding
Rabiul Awal
Saba Ahmadi
Le Zhang
Fine-grained understanding of objects, attributes, and relationships between objects is crucial for visual-language models (VLMs). Existing … (see more)benchmarks primarily focus on evaluating VLMs' capability to distinguish between two very similar \textit{captions} given an image. In this paper, we introduce a new, challenging benchmark termed \textbf{Vis}ual \textbf{Min}imal-Change Understanding (VisMin), which requires models to predict the correct image-caption match given two images and two captions. The image pair and caption pair contain minimal changes, i.e., only one aspect changes at a time from among the following: \textit{object}, \textit{attribute}, \textit{count}, and \textit{spatial relation}. These changes test the models' understanding of objects, attributes (such as color, material, shape), counts, and spatial relationships between objects. We built an automatic framework using large language models and diffusion models, followed by a rigorous 4-step verification process by human annotators. Empirical experiments reveal that current VLMs exhibit notable deficiencies in understanding spatial relationships and counting abilities. We also generate a large-scale training dataset to finetune CLIP and Idefics2, showing significant improvements in fine-grained understanding across benchmarks and in CLIP's general image-text alignment. We release all resources, including the benchmark, training data, and finetuned model checkpoints, at https://vismin.net/.
Decompose and Compare Consistency: Measuring VLMs' Answer Reliability via Task-Decomposition Consistency Comparison
Qian Yang
Weixiang Yan
Despite tremendous advancements, current state-of-the-art Vision-Language Models (VLMs) are still far from perfect. They tend to hallucinate… (see more) and may generate biased responses. In such circumstances, having a way to assess the reliability of a given response generated by a VLM is quite useful. Existing methods, such as estimating uncertainty using answer likelihoods or prompt-based confidence generation, often suffer from overconfidence. Other methods use self-consistency comparison but are affected by confirmation biases. To alleviate these, we propose \textbf{De}compose and \textbf{C}ompare \textbf{C}onsistency (\texttt{DeCC}) for reliability measurement. By comparing the consistency between the direct answer generated using the VLM's internal reasoning process, and the indirect answers obtained by decomposing the question into sub-questions and reasoning over the sub-answers produced by the VLM, \texttt{DeCC} measures the reliability of VLM's direct answer. Experiments across six vision-language tasks with three VLMs show \texttt{DeCC}'s reliability estimation achieves better correlation with task accuracy compared to the existing methods.
Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Compositional Understanding
Le Zhang
Rabiul Awal
Vision-Language Models (VLMs), such as CLIP, exhibit strong image-text comprehension abilities, facilitating advances in several downstream … (see more)tasks such as zero-shot image classification, image-text retrieval, and text-to-image generation. However, the compositional reasoning abilities of existing VLMs remains subpar. The root of this limitation lies in the inadequate alignment between the images and captions in the pretraining datasets. Additionally, the current contrastive learning objective fails to focus on fine-grained grounding components like relations, actions, and attributes, resulting in"bag-of-words"representations. We introduce a simple and effective method to improve compositional reasoning in VLMs. Our method better leverages available datasets by refining and expanding the standard image-text contrastive learning framework. Our approach does not require specific annotations and does not incur extra parameters. When integrated with CLIP, our technique yields notable improvement over state-of-the-art baselines across five vision-language compositional benchmarks. We open-source our code at https://github.com/lezhang7/Enhance-FineGrained.
An Introduction to Vision-Language Modeling
Florian Bordes
Richard Yuanzhe Pang
Anurag Ajay
Alexander C. Li
Adrien Bardes
Suzanne Petryk
Oscar Mañas
Zhiqiu Lin
Anas Mahmoud
Bargav Jayaraman
Mark Ibrahim
Melissa Hall
Yunyang Xiong
Jonathan Lebensold
Candace Ross
Srihari Jayakumar
Chuan Guo
Diane Bouchacourt
Haider Al-Tahan
Karthik Padthe … (see 21 more)
Vasu Sharma
Huijuan Xu 0001
Xiaoqing Ellen Tan
Megan Richards
Samuel Lavoie
Pietro Astolfi
Reyhane Askari Hemmat
Jun Chen
Kushal Tirumala
Rim Assouel
Mazda Moayeri
Arjang Talattof
Kamalika Chaudhuri
Zechun Liu
Xilun Chen
Quentin Garrido
Karen Ullrich
Kate Saenko
Asli Celikyilmaz
Vikas Chandra
Improving Text-to-Image Consistency via Automatic Prompt Optimization
Oscar Mañas
Pietro Astolfi
Melissa Hall
Candace Ross
Jack Urbanek
Adina Williams
Michal Drozdzal
Improving Automatic VQA Evaluation Using Large Language Models
Oscar Mañas
Benno Krojer
8 years after the visual question answering (VQA) task was proposed, accuracy remains the primary metric for automatic evaluation. VQA Accur… (see more)acy has been effective so far in the IID evaluation setting. However, our community is undergoing a shift towards open-ended generative models and OOD evaluation. In this new paradigm, the existing VQA Accuracy metric is overly stringent and underestimates the performance of VQA systems. Thus, there is a need to develop more robust automatic VQA metrics that serve as a proxy for human judgment. In this work, we propose to leverage the in-context learning capabilities of instruction-tuned large language models (LLMs) to build a better VQA metric. We formulate VQA evaluation as an answer-rating task where the LLM is instructed to score the accuracy of a candidate answer given a set of reference answers. We demonstrate the proposed metric better correlates with human judgment compared to existing metrics across several VQA models and benchmarks. We hope wide adoption of our metric will contribute to better estimating the research progress on the VQA task. We plan to release the evaluation code and collected human judgments.
Benchmarking Vision Language Models for Cultural Understanding
Shravan Nayak
Kanishk Jain
Rabiul Awal
Sjoerd van Steenkiste
Lisa Anne Hendricks
Karolina Sta'nczak
Foundation models and vision-language pre-training have notably advanced Vision Language Models (VLMs), enabling multimodal processing of vi… (see more)sual and linguistic data. However, their performance has been typically assessed on general scene understanding - recognizing objects, attributes, and actions - rather than cultural comprehension. This study introduces CulturalVQA, a visual question-answering benchmark aimed at assessing VLM's geo-diverse cultural understanding. We curate a collection of 2,378 image-question pairs with 1-5 answers per question representing cultures from 11 countries across 5 continents. The questions probe understanding of various facets of culture such as clothing, food, drinks, rituals, and traditions. Benchmarking VLMs on CulturalVQA, including GPT-4V and Gemini, reveals disparity in their level of cultural understanding across regions, with strong cultural understanding capabilities for North America while significantly lower performance for Africa. We observe disparity in their performance across cultural facets too, with clothing, rituals, and traditions seeing higher performances than food and drink. These disparities help us identify areas where VLMs lack cultural understanding and demonstrate the potential of CulturalVQA as a comprehensive evaluation set for gauging VLM progress in understanding diverse cultures.
MoqaGPT : Zero-Shot Multi-modal Open-domain Question Answering with Large Language Model
Le Zhang
Yihong Wu
Fengran Mo
Jian-Yun Nie
Multi-modal open-domain question answering typically requires evidence retrieval from databases across diverse modalities, such as images, t… (see more)ables, passages, etc. Even Large Language Models (LLMs) like GPT-4 fall short in this task. To enable LLMs to tackle the task in a zero-shot manner, we introduce MoqaGPT, a straightforward and flexible framework. Using a divide-and-conquer strategy that bypasses intricate multi-modality ranking, our framework can accommodate new modalities and seamlessly transition to new models for the task. Built upon LLMs, MoqaGPT retrieves and extracts answers from each modality separately, then fuses this multi-modal information using LLMs to produce a final answer. Our methodology boosts performance on the MMCoQA dataset, improving F1 by +37.91 points and EM by +34.07 points over the supervised baseline. On the MultiModalQA dataset, MoqaGPT surpasses the zero-shot baseline, improving F1 by 9.5 points and EM by 10.1 points, and significantly closes the gap with supervised methods. Our codebase is available at https://github.com/lezhang7/MOQAGPT.
Investigating Prompting Techniques for Zero- and Few-Shot Visual Question Answering
Rabiul Awal
Le Zhang
In this paper, we explore effective prompting techniques to enhance zero- and few-shot Visual Question Answering (VQA) performance in contem… (see more)porary Vision-Language Models (VLMs). Central to our investigation is the role of question templates in guiding VLMs to generate accurate answers. We identify that specific templates significantly influence VQA outcomes, underscoring the need for strategic template selection. Another pivotal aspect of our study is augmenting VLMs with image captions, providing them with additional visual cues alongside direct image features in VQA tasks. Surprisingly, this augmentation significantly improves the VLMs' performance in many cases, even though VLMs"see"the image directly! We explore chain-of-thought (CoT) reasoning and find that while standard CoT reasoning causes drops in performance, advanced methods like self-consistency can help recover it. Furthermore, we find that text-only few-shot examples enhance VLMs' alignment with the task format, particularly benefiting models prone to verbose zero-shot answers. Lastly, to mitigate the challenges associated with evaluating free-form open-ended VQA responses using string-matching based VQA metrics, we introduce a straightforward LLM-guided pre-processing technique to adapt the model responses to the expected ground-truth answer distribution. In summary, our research sheds light on the intricacies of prompting strategies in VLMs for VQA, emphasizing the synergistic use of captions, templates, and pre-processing to enhance model efficacy.
An Examination of the Robustness of Reference-Free Image Captioning Evaluation Metrics
Saba Ahmadi
MAPL: Parameter-Efficient Adaptation of Unimodal Pre-Trained Models for Vision-Language Few-Shot Prompting
Oscar Mañas
Pau Rodriguez
Saba Ahmadi
Aida Nematzadeh
Yash Goyal
Large pre-trained models have proved to be remarkable zero- and (prompt-based) few-shot learners in unimodal vision and language tasks. We p… (see more)ropose MAPL, a simple and parameter-efficient method that reuses frozen pre-trained unimodal models and leverages their strong generalization capabilities in multimodal vision-language (VL) settings. MAPL learns a lightweight mapping between the representation spaces of unimodal models using aligned image-text data, and can generalize to unseen VL tasks from just a few in-context examples. The small number of trainable parameters makes MAPL effective at low-data and in-domain learning. Moreover, MAPL’s modularity enables easy extension to other pre-trained models. Extensive experiments on several visual question answering and image captioning benchmarks show that MAPL achieves superior or competitive performance compared to similar methods while training orders of magnitude fewer parameters. MAPL can be trained in just a few hours using modest computational resources and public datasets. We release our code and pre-trained model weights at https://github.com/oscmansan/mapl.