Portrait de Aishwarya Agrawal

Aishwarya Agrawal

Membre académique principal
Chaire en IA Canada-CIFAR
Professeure adjointe, Université de Montréal, Département d'informatique et de recherche opérationnelle
Chercheuse scientifique, Google DeepMind, Montréal

Biographie

Aishwarya Agrawal est professeure agrégée au Département d'informatique et de recherche opérationnelle de l'Université de Montréal. Elle est également titulaire d'une chaire en IA Canada-CIFAR et membre académique principale de Mila – Institut québécois d’intelligence artificielle.

Elle passe également un jour par semaine chez DeepMind en tant que chercheuse scientifique; d'août 2019 à décembre 2020, elle y a été chercheuse scientifique à plein temps. Détentrice d’un baccalauréat en génie électrique avec une mineure en informatique, Aishwarya a obtenu en août 2019 un doctorat de Georgia Tech, en travaillant avec Dhruv Batra et Devi Parikh. Ses intérêts de recherche se situent à l'intersection des sous-disciplines suivantes de l'IA : vision par ordinateur, apprentissage profond et traitement du langage naturel, avec un accent sur le développement de systèmes d'IA capables de « voir » (c'est-à-dire de comprendre le contenu d'une image : qui, quoi, où, qui fait quoi ?) et de « parler » (c'est-à-dire de communiquer cette compréhension aux humains en langage naturel libre).

Elle a reçu plusieurs prix et bourses, dont le prix des chaires en IA Canada-CIFAR, le prix de la meilleure thèse de doctorat Sigma Xi 2020 et le prix de la dissertation 2020 du College of Computing de Georgia Tech, la bourse Google 2019 et la bourse Facebook 2019-2020 (toutes deux refusées en raison de l'obtention du diplôme), ainsi que la bourse d’études supérieures NVIDIA 2018-2019. Aishwarya a été l'une des deux finalistes du prix de la meilleure thèse 2019 de l'AAAI / ACM SIGAI. Elle a également été sélectionnée pour les Rising Stars in EECS 2018.

Étudiants actuels

Maîtrise recherche - Université de Montréal
Stagiaire de recherche - Université de Montréal
Doctorat - Université de Montréal
Maîtrise recherche - Université de Montréal
Doctorat - Université de Montréal
Maîtrise recherche - Université de Montréal
Maîtrise recherche - Université de Montréal
Co-superviseur⋅e :
Maîtrise professionnelle - Université de Montréal
Maîtrise professionnelle - Université de Montréal

Publications

Improving Text-to-Image Consistency via Automatic Prompt Optimization
Oscar Mañas
Pietro Astolfi
Melissa Hall
Candace Ross
Jack Urbanek
Adina Williams
Michal Drozdzal
Improving Automatic VQA Evaluation Using Large Language Models
Oscar Mañas
Benno Krojer
8 years after the visual question answering (VQA) task was proposed, accuracy remains the primary metric for automatic evaluation. VQA Accur… (voir plus)acy has been effective so far in the IID evaluation setting. However, our community is undergoing a shift towards open-ended generative models and OOD evaluation. In this new paradigm, the existing VQA Accuracy metric is overly stringent and underestimates the performance of VQA systems. Thus, there is a need to develop more robust automatic VQA metrics that serve as a proxy for human judgment. In this work, we propose to leverage the in-context learning capabilities of instruction-tuned large language models (LLMs) to build a better VQA metric. We formulate VQA evaluation as an answer-rating task where the LLM is instructed to score the accuracy of a candidate answer given a set of reference answers. We demonstrate the proposed metric better correlates with human judgment compared to existing metrics across several VQA models and benchmarks. We hope wide adoption of our metric will contribute to better estimating the research progress on the VQA task. We plan to release the evaluation code and collected human judgments.
MoqaGPT : Zero-Shot Multi-modal Open-domain Question Answering with Large Language Model
Le Zhang
Yihong Wu
Fengran Mo
Jian-Yun Nie
Multi-modal open-domain question answering typically requires evidence retrieval from databases across diverse modalities, such as images, t… (voir plus)ables, passages, etc. Even Large Language Models (LLMs) like GPT-4 fall short in this task. To enable LLMs to tackle the task in a zero-shot manner, we introduce MoqaGPT, a straightforward and flexible framework. Using a divide-and-conquer strategy that bypasses intricate multi-modality ranking, our framework can accommodate new modalities and seamlessly transition to new models for the task. Built upon LLMs, MoqaGPT retrieves and extracts answers from each modality separately, then fuses this multi-modal information using LLMs to produce a final answer. Our methodology boosts performance on the MMCoQA dataset, improving F1 by +37.91 points and EM by +34.07 points over the supervised baseline. On the MultiModalQA dataset, MoqaGPT surpasses the zero-shot baseline, improving F1 by 9.5 points and EM by 10.1 points, and significantly closes the gap with supervised methods. Our codebase is available at https://github.com/lezhang7/MOQAGPT.
Investigating Prompting Techniques for Zero- and Few-Shot Visual Question Answering
Md. Rabiul Awal
Le Zhang
In this paper, we explore effective prompting techniques to enhance zero- and few-shot Visual Question Answering (VQA) performance in contem… (voir plus)porary Vision-Language Models (VLMs). Central to our investigation is the role of question templates in guiding VLMs to generate accurate answers. We identify that specific templates significantly influence VQA outcomes, underscoring the need for strategic template selection. Another pivotal aspect of our study is augmenting VLMs with image captions, providing them with additional visual cues alongside direct image features in VQA tasks. Surprisingly, this augmentation significantly improves the VLMs' performance in many cases, even though VLMs"see"the image directly! We explore chain-of-thought (CoT) reasoning and find that while standard CoT reasoning causes drops in performance, advanced methods like self-consistency can help recover it. Furthermore, we find that text-only few-shot examples enhance VLMs' alignment with the task format, particularly benefiting models prone to verbose zero-shot answers. Lastly, to mitigate the challenges associated with evaluating free-form open-ended VQA responses using string-matching based VQA metrics, we introduce a straightforward LLM-guided pre-processing technique to adapt the model responses to the expected ground-truth answer distribution. In summary, our research sheds light on the intricacies of prompting strategies in VLMs for VQA, emphasizing the synergistic use of captions, templates, and pre-processing to enhance model efficacy.
Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Compositional Understanding
Le Zhang
Rabiul Awal
Vision-Language Models (VLMs), such as CLIP, exhibit strong image-text comprehension abilities, facilitating advances in several downstream … (voir plus)tasks such as zero-shot image classification, image-text retrieval, and text-to-image generation. However, the compositional reasoning abilities of existing VLMs remains subpar. The root of this limitation lies in the inadequate alignment between the images and captions in the pretraining datasets. Additionally, the current contrastive learning objective fails to focus on fine-grained grounding components like relations, actions, and attributes, resulting in"bag-of-words"representations. We introduce a simple and effective method to improve compositional reasoning in VLMs. Our method better leverages available datasets by refining and expanding the standard image-text contrastive learning framework. Our approach does not require specific annotations and does not incur extra parameters. When integrated with CLIP, our technique yields notable improvement over state-of-the-art baselines across five vision-language compositional benchmarks. We open-source our code at https://github.com/lezhang7/Enhance-FineGrained.
An Examination of the Robustness of Reference-Free Image Captioning Evaluation Metrics
Saba Ahmadi
MAPL: Parameter-Efficient Adaptation of Unimodal Pre-Trained Models for Vision-Language Few-Shot Prompting
Oscar Mañas
Pau Rodriguez
Saba Ahmadi
Aida Nematzadeh
Yash Goyal
Large pre-trained models have proved to be remarkable zero- and (prompt-based) few-shot learners in unimodal vision and language tasks. We p… (voir plus)ropose MAPL, a simple and parameter-efficient method that reuses frozen pre-trained unimodal models and leverages their strong generalization capabilities in multimodal vision-language (VL) settings. MAPL learns a lightweight mapping between the representation spaces of unimodal models using aligned image-text data, and can generalize to unseen VL tasks from just a few in-context examples. The small number of trainable parameters makes MAPL effective at low-data and in-domain learning. Moreover, MAPL’s modularity enables easy extension to other pre-trained models. Extensive experiments on several visual question answering and image captioning benchmarks show that MAPL achieves superior or competitive performance compared to similar methods while training orders of magnitude fewer parameters. MAPL can be trained in just a few hours using modest computational resources and public datasets. We release our code and pre-trained model weights at https://github.com/oscmansan/mapl.
Reassessing Evaluation Practices in Visual Question Answering: A Case Study on Out-of-Distribution Generalization
Ivana Kaji'c
Emanuele Bugliarello
Elnaz Davoodi
Anita Gergely
Phil Blunsom
Aida Nematzadeh
Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Fine-grained Understanding
Le Zhang
Md. Rabiul Awal
Measuring Progress in Fine-grained Vision-and-Language Understanding
Emanuele Bugliarello
Laurent Sartran
Lisa Anne Hendricks
Aida Nematzadeh
While pretraining on large-scale image–text data from the Web has facilitated rapid progress on many vision-and-language (V&L) tasks, rece… (voir plus)nt work has demonstrated that pretrained models lack “fine-grained” understanding, such as the ability to recognise relationships, verbs, and numbers in images. This has resulted in an increased interest in the community to either develop new benchmarks or models for such capabilities. To better understand and quantify progress in this direction, we investigate four competitive V&L models on four fine-grained benchmarks. Through our analysis, we find that X-VLM (Zeng et al., 2022) consistently outperforms other baselines, and that modelling innovations can impact performance more than scaling Web data, which even degrades performance sometimes. Through a deeper investigation of X-VLM, we highlight the importance of both novel losses and rich data sources for learning fine-grained skills. Finally, we inspect training dynamics, and discover that for some tasks, performance peaks early in training or significantly fluctuates, never converging.
Vision-Language Pretraining: Current Trends and the Future
Damien Teney
Aida Nematzadeh
In the last few years, there has been an increased interest in building multimodal (vision-language) models that are pretrained on larger bu… (voir plus)t noisier datasets where the two modalities (e.g., image and text) loosely correspond to each other (e.g., Lu et al., 2019; Radford et al., 2021). Given a task (such as visual question answering), these models are then often fine-tuned on task-specific supervised datasets. (e.g., Lu et al., 2019; Chen et al.,2020; Tan and Bansal, 2019; Li et al., 2020a,b). In addition to the larger pretraining datasets, the transformer architecture (Vaswani et al., 2017) and in particular self-attention applied to two modalities are responsible for the impressive performance of the recent pretrained models on downstream tasks (Hendricks et al., 2021). In this tutorial, we focus on recent vision-language pretraining paradigms. Our goal is to first provide the background on image–language datasets, benchmarks, and modeling innovations before the multimodal pretraining area. Next we discuss the different family of models used for vision-language pretraining, highlighting their strengths and shortcomings. Finally, we discuss the limits of vision-language pretraining through statistical learning, and the need for alternative approaches such as causal representation learning.
Vision-Language Pretraining: Current Trends and the Future
Damien Teney
Aida Nematzadeh
In the last few years, there has been an increased interest in building multimodal (vision-language) models that are pretrained on larger bu… (voir plus)t noisier datasets where the two modalities (e.g., image and text) loosely correspond to each other (e.g., Lu et al., 2019; Radford et al., 2021). Given a task (such as visual question answering), these models are then often fine-tuned on task-specific supervised datasets. (e.g., Lu et al., 2019; Chen et al.,2020; Tan and Bansal, 2019; Li et al., 2020a,b). In addition to the larger pretraining datasets, the transformer architecture (Vaswani et al., 2017) and in particular self-attention applied to two modalities are responsible for the impressive performance of the recent pretrained models on downstream tasks (Hendricks et al., 2021). In this tutorial, we focus on recent vision-language pretraining paradigms. Our goal is to first provide the background on image–language datasets, benchmarks, and modeling innovations before the multimodal pretraining area. Next we discuss the different family of models used for vision-language pretraining, highlighting their strengths and shortcomings. Finally, we discuss the limits of vision-language pretraining through statistical learning, and the need for alternative approaches such as causal representation learning.