Portrait de Aishwarya Agrawal

Aishwarya Agrawal

Membre académique principal
Chaire en IA Canada-CIFAR
Professeure adjointe, Université de Montréal, Département d'informatique et de recherche opérationnelle (DIRO)
Chercheuse scientifique, Google DeepMind, Montréal
Sujets de recherche
Apprentissage multimodal
Apprentissage profond
Traitement du langage naturel
Vision par ordinateur

Biographie

Aishwarya Agrawal est professeure adjointe au Département d'informatique et de recherche opérationnelle (DIRO) de l'Université de Montréal. Elle est également titulaire d'une chaire en IA Canada-CIFAR et membre académique principale de Mila – Institut québécois d’intelligence artificielle.

Elle passe également un jour par semaine chez DeepMind en tant que chercheuse scientifique; d'août 2019 à décembre 2020, elle y a été chercheuse scientifique à plein temps. Détentrice d’un baccalauréat en génie électrique avec une mineure en informatique, Aishwarya a obtenu en août 2019 un doctorat de Georgia Tech, en travaillant avec Dhruv Batra et Devi Parikh. Ses intérêts de recherche se situent à l'intersection des sous-disciplines suivantes de l'IA : vision par ordinateur, apprentissage profond et traitement du langage naturel, avec un accent sur le développement de systèmes d'IA capables de « voir » (c'est-à-dire de comprendre le contenu d'une image : qui, quoi, où, qui fait quoi ?) et de « parler » (c'est-à-dire de communiquer cette compréhension aux humains en langage naturel libre).

Elle a reçu plusieurs prix et bourses, dont le prix des chaires en IA Canada-CIFAR, le prix de la meilleure thèse de doctorat Sigma Xi 2020 et le prix de la dissertation 2020 du College of Computing de Georgia Tech, la bourse Google 2019 et la bourse Facebook 2019-2020 (toutes deux refusées en raison de l'obtention du diplôme), ainsi que la bourse d’études supérieures NVIDIA 2018-2019. Aishwarya a été l'une des deux finalistes du prix de la meilleure thèse 2019 de l'AAAI / ACM SIGAI. Elle a également été sélectionnée pour les Rising Stars in EECS 2018.

Étudiants actuels

Collaborateur·rice de recherche - UdeM
Collaborateur·rice de recherche - University of British Columbia
Visiteur de recherche indépendant - Michigan State University
Maîtrise recherche - UdeM

Publications

An Introduction to Vision-Language Modeling
Richard Yuanzhe Pang
Anurag Ajay
Alexander C. Li
Adrien Bardes
Suzanne Petryk
Zhiqiu Lin
Bargav Jayaraman
Mark Ibrahim
Melissa Hall
Yunyang Xiong
Candace Ross
Srihari Jayakumar
Chuan Guo
Diane Bouchacourt
Haider Al-Tahan
Karthik Padthe … (voir 21 de plus)
Vasu Sharma
Huijuan Xu 0001
Xiaoqing Ellen Tan
Megan Richards
Samuel Lavoie
Pietro Astolfi
Jun Chen
Kushal Tirumala
Mazda Moayeri
Arjang Talattof
Kamalika Chaudhuri
Zechun Liu
Xilun Chen
Quentin Garrido
Karen Ullrich
Kate Saenko
Asli Celikyilmaz
Vikas Chandra
Following the recent popularity of Large Language Models (LLMs), several attempts have been made to extend them to the visual domain. From h… (voir plus)aving a visual assistant that could guide us through unfamiliar environments to generative models that produce images using only a high-level text description, the vision-language model (VLM) applications will significantly impact our relationship with technology. However, there are many challenges that need to be addressed to improve the reliability of those models. While language is discrete, vision evolves in a much higher dimensional space in which concepts cannot always be easily discretized. To better understand the mechanics behind mapping vision to language, we present this introduction to VLMs which we hope will help anyone who would like to enter the field. First, we introduce what VLMs are, how they work, and how to train them. Then, we present and discuss approaches to evaluate VLMs. Although this work primarily focuses on mapping images to language, we also discuss extending VLMs to videos.
An Introduction to Vision-Language Modeling
Richard Yuanzhe Pang
Anurag Ajay
Alexander C. Li
Adrien Bardes
Suzanne Petryk
Zhiqiu Lin
Bargav Jayaraman
Mark Ibrahim
Melissa Hall
Yunyang Xiong
Candace Ross
Srihari Jayakumar
Chuan Guo
Diane Bouchacourt
Haider Al-Tahan
Karthik Padthe … (voir 21 de plus)
Vasu Sharma
Huijuan Xu 0001
Xiaoqing Ellen Tan
Megan Richards
Samuel Lavoie
Pietro Astolfi
Jun Chen
Kushal Tirumala
Mazda Moayeri
Arjang Talattof
Kamalika Chaudhuri
Zechun Liu
Xilun Chen
Quentin Garrido
Karen Ullrich
Kate Saenko
Asli Celikyilmaz
Vikas Chandra
Following the recent popularity of Large Language Models (LLMs), several attempts have been made to extend them to the visual domain. From h… (voir plus)aving a visual assistant that could guide us through unfamiliar environments to generative models that produce images using only a high-level text description, the vision-language model (VLM) applications will significantly impact our relationship with technology. However, there are many challenges that need to be addressed to improve the reliability of those models. While language is discrete, vision evolves in a much higher dimensional space in which concepts cannot always be easily discretized. To better understand the mechanics behind mapping vision to language, we present this introduction to VLMs which we hope will help anyone who would like to enter the field. First, we introduce what VLMs are, how they work, and how to train them. Then, we present and discuss approaches to evaluate VLMs. Although this work primarily focuses on mapping images to language, we also discuss extending VLMs to videos.
An Introduction to Vision-Language Modeling
Richard Yuanzhe Pang
Anurag Ajay
Alexander C. Li
Adrien Bardes
Suzanne Petryk
Zhiqiu Lin
Bargav Jayaraman
Mark Ibrahim
Melissa Hall
Yunyang Xiong
Candace Ross
Srihari Jayakumar
Chuan Guo
Diane Bouchacourt
Haider Al-Tahan
Karthik Padthe … (voir 21 de plus)
Vasu Sharma
Huijuan Xu 0001
Xiaoqing Ellen Tan
Megan Richards
Samuel Lavoie
Pietro Astolfi
Jun Chen
Kushal Tirumala
Mazda Moayeri
Arjang Talattof
Kamalika Chaudhuri
Zechun Liu
Xilun Chen
Quentin Garrido
Karen Ullrich
Kate Saenko
Asli Celikyilmaz
Vikas Chandra
An Introduction to Vision-Language Modeling
Richard Yuanzhe Pang
Anurag Ajay
Alexander C. Li
Adrien Bardes
Suzanne Petryk
Zhiqiu Lin
Bargav Jayaraman
Mark Ibrahim
Melissa Hall
Yunyang Xiong
Candace Ross
Srihari Jayakumar
Chuan Guo
Diane Bouchacourt
Haider Al-Tahan
Karthik Padthe … (voir 21 de plus)
Vasu Sharma
Huijuan Xu 0001
Xiaoqing Ellen Tan
Megan Richards
Samuel Lavoie
Pietro Astolfi
Jun Chen
Kushal Tirumala
Mazda Moayeri
Arjang Talattof
Kamalika Chaudhuri
Zechun Liu
Xilun Chen
Quentin Garrido
Karen Ullrich
Kate Saenko
Asli Celikyilmaz
Vikas Chandra
Following the recent popularity of Large Language Models (LLMs), several attempts have been made to extend them to the visual domain. From h… (voir plus)aving a visual assistant that could guide us through unfamiliar environments to generative models that produce images using only a high-level text description, the vision-language model (VLM) applications will significantly impact our relationship with technology. However, there are many challenges that need to be addressed to improve the reliability of those models. While language is discrete, vision evolves in a much higher dimensional space in which concepts cannot always be easily discretized. To better understand the mechanics behind mapping vision to language, we present this introduction to VLMs which we hope will help anyone who would like to enter the field. First, we introduce what VLMs are, how they work, and how to train them. Then, we present and discuss approaches to evaluate VLMs. Although this work primarily focuses on mapping images to language, we also discuss extending VLMs to videos.
An Introduction to Vision-Language Modeling
Richard Yuanzhe Pang
Anurag Ajay
Alexander C. Li
Adrien Bardes
Suzanne Petryk
Zhiqiu Lin
Bargav Jayaraman
Mark Ibrahim
Melissa Hall
Yunyang Xiong
Candace Ross
Srihari Jayakumar
Chuan Guo
Diane Bouchacourt
Haider Al-Tahan
Karthik Padthe … (voir 21 de plus)
Vasu Sharma
Huijuan Xu 0001
Xiaoqing Ellen Tan
Megan Richards
Samuel Lavoie
Pietro Astolfi
Jun Chen
Kushal Tirumala
Mazda Moayeri
Arjang Talattof
Kamalika Chaudhuri
Zechun Liu
Xilun Chen
Quentin Garrido
Karen Ullrich
Kate Saenko
Asli Celikyilmaz
Vikas Chandra
Following the recent popularity of Large Language Models (LLMs), several attempts have been made to extend them to the visual domain. From h… (voir plus)aving a visual assistant that could guide us through unfamiliar environments to generative models that produce images using only a high-level text description, the vision-language model (VLM) applications will significantly impact our relationship with technology. However, there are many challenges that need to be addressed to improve the reliability of those models. While language is discrete, vision evolves in a much higher dimensional space in which concepts cannot always be easily discretized. To better understand the mechanics behind mapping vision to language, we present this introduction to VLMs which we hope will help anyone who would like to enter the field. First, we introduce what VLMs are, how they work, and how to train them. Then, we present and discuss approaches to evaluate VLMs. Although this work primarily focuses on mapping images to language, we also discuss extending VLMs to videos.
An Introduction to Vision-Language Modeling
Richard Yuanzhe Pang
Anurag Ajay
Alexander C. Li
Adrien Bardes
Suzanne Petryk
Zhiqiu Lin
Bargav Jayaraman
Mark Ibrahim
Melissa Hall
Yunyang Xiong
Candace Ross
Srihari Jayakumar
Chuan Guo
Diane Bouchacourt
Haider Al-Tahan
Karthik Padthe … (voir 21 de plus)
Vasu Sharma
Huijuan Xu 0001
Xiaoqing Ellen Tan
Megan Richards
Samuel Lavoie
Pietro Astolfi
Jun Chen
Kushal Tirumala
Mazda Moayeri
Arjang Talattof
Kamalika Chaudhuri
Zechun Liu
Xilun Chen
Quentin Garrido
Karen Ullrich
Kate Saenko
Asli Celikyilmaz
Vikas Chandra
Following the recent popularity of Large Language Models (LLMs), several attempts have been made to extend them to the visual domain. From h… (voir plus)aving a visual assistant that could guide us through unfamiliar environments to generative models that produce images using only a high-level text description, the vision-language model (VLM) applications will significantly impact our relationship with technology. However, there are many challenges that need to be addressed to improve the reliability of those models. While language is discrete, vision evolves in a much higher dimensional space in which concepts cannot always be easily discretized. To better understand the mechanics behind mapping vision to language, we present this introduction to VLMs which we hope will help anyone who would like to enter the field. First, we introduce what VLMs are, how they work, and how to train them. Then, we present and discuss approaches to evaluate VLMs. Although this work primarily focuses on mapping images to language, we also discuss extending VLMs to videos.
Improving Automatic VQA Evaluation Using Large Language Models
8 years after the visual question answering (VQA) task was proposed, accuracy remains the primary metric for automatic evaluation. VQA Accur… (voir plus)acy has been effective so far in the IID evaluation setting. However, our community is undergoing a shift towards open-ended generative models and OOD evaluation. In this new paradigm, the existing VQA Accuracy metric is overly stringent and underestimates the performance of VQA systems. Thus, there is a need to develop more robust automatic VQA metrics that serve as a proxy for human judgment. In this work, we propose to leverage the in-context learning capabilities of instruction-tuned large language models (LLMs) to build a better VQA metric. We formulate VQA evaluation as an answer-rating task where the LLM is instructed to score the accuracy of a candidate answer given a set of reference answers. We demonstrate the proposed metric better correlates with human judgment compared to existing metrics across several VQA models and benchmarks. We hope wide adoption of our metric will contribute to better estimating the research progress on the VQA task. We plan to release the evaluation code and collected human judgments.
Molar Pregnancy in a Quadruplet Conception Following IVF: A Case Report
Madhuri A Mehendale
Meenal Shailesh Sarmalkar
Prerna Kailashchand Gupta
Agraj S Doshi
Recent Excavation of Nanoethosomes in Current Drug Delivery.
Sankha Bhattacharya
Aalind Joshi
In the current era, the Transdermal delivery of bioactive molecules has become an area of research interest. The transdermal route of admini… (voir plus)stration enables direct entry of bioactive molecules into the systemic circulation with better and easy accessibility, bypassing the hepatic metabolism and improving patient compliance. Permeation through the skin has always been a barrier. To overcome this challenge, an efficient route by the vesicular system has been adopted so as to have better skin permeation of the bioactive molecules. A novel vesicular and non-invasive drug delivery system called Nanoethosomes was developed. Nanoethosomes are lipid-based vesicular carriers that are used for deeper permeation of the bioactive agents into the skin. The main components of Nanoethosomes are Phospholipids, water, and ethanol. High ethanol concentration in Nanoethosomes distinguishes them from other nano-formulation and results in deeper permeation and smaller vesicular size. This review article gives detailed information on the formulation techniques, and characterization parameters of nanoethosomes along with the research work done by various researchers in the same field. The compiled manuscript gives detailed elaboration about the various drugs used to treat different diseases which when incorporated in nanoethosomes resulted in better permeability and enhanced bioavailability.
Benchmarking Vision Language Models for Cultural Understanding
Sjoerd van Steenkiste
Lisa Anne Hendricks
Karolina Stanczak
Foundation models and vision-language pre-training have notably advanced Vision Language Models (VLMs), enabling multimodal processing of vi… (voir plus)sual and linguistic data. However, their performance has been typically assessed on general scene understanding - recognizing objects, attributes, and actions - rather than cultural comprehension. This study introduces CulturalVQA, a visual question-answering benchmark aimed at assessing VLM's geo-diverse cultural understanding. We curate a collection of 2,378 image-question pairs with 1-5 answers per question representing cultures from 11 countries across 5 continents. The questions probe understanding of various facets of culture such as clothing, food, drinks, rituals, and traditions. Benchmarking VLMs on CulturalVQA, including GPT-4V and Gemini, reveals disparity in their level of cultural understanding across regions, with strong cultural understanding capabilities for North America while significantly lower performance for Africa. We observe disparity in their performance across cultural facets too, with clothing, rituals, and traditions seeing higher performances than food and drink. These disparities help us identify areas where VLMs lack cultural understanding and demonstrate the potential of CulturalVQA as a comprehensive evaluation set for gauging VLM progress in understanding diverse cultures.
MoqaGPT : Zero-Shot Multi-modal Open-domain Question Answering with Large Language Model
Yihong Wu
Fengran Mo
Jian-Yun Nie
Multi-modal open-domain question answering typically requires evidence retrieval from databases across diverse modalities, such as images, t… (voir plus)ables, passages, etc. Even Large Language Models (LLMs) like GPT-4 fall short in this task. To enable LLMs to tackle the task in a zero-shot manner, we introduce MoqaGPT, a straightforward and flexible framework. Using a divide-and-conquer strategy that bypasses intricate multi-modality ranking, our framework can accommodate new modalities and seamlessly transition to new models for the task. Built upon LLMs, MoqaGPT retrieves and extracts answers from each modality separately, then fuses this multi-modal information using LLMs to produce a final answer. Our methodology boosts performance on the MMCoQA dataset, improving F1 by +37.91 points and EM by +34.07 points over the supervised baseline. On the MultiModalQA dataset, MoqaGPT surpasses the zero-shot baseline, improving F1 by 9.5 points and EM by 10.1 points, and significantly closes the gap with supervised methods. Our codebase is available at https://github.com/lezhang7/MOQAGPT.
Investigating Prompting Techniques for Zero- and Few-Shot Visual Question Answering
In this paper, we explore effective prompting techniques to enhance zero- and few-shot Visual Question Answering (VQA) performance in contem… (voir plus)porary Vision-Language Models (VLMs). Central to our investigation is the role of question templates in guiding VLMs to generate accurate answers. We identify that specific templates significantly influence VQA outcomes, underscoring the need for strategic template selection. Another pivotal aspect of our study is augmenting VLMs with image captions, providing them with additional visual cues alongside direct image features in VQA tasks. Surprisingly, this augmentation significantly improves the VLMs' performance in many cases, even though VLMs"see"the image directly! We explore chain-of-thought (CoT) reasoning and find that while standard CoT reasoning causes drops in performance, advanced methods like self-consistency can help recover it. Furthermore, we find that text-only few-shot examples enhance VLMs' alignment with the task format, particularly benefiting models prone to verbose zero-shot answers. Lastly, to mitigate the challenges associated with evaluating free-form open-ended VQA responses using string-matching based VQA metrics, we introduce a straightforward LLM-guided pre-processing technique to adapt the model responses to the expected ground-truth answer distribution. In summary, our research sheds light on the intricacies of prompting strategies in VLMs for VQA, emphasizing the synergistic use of captions, templates, and pre-processing to enhance model efficacy.