Portrait of Aishwarya Agrawal

Aishwarya Agrawal

Core Academic Member
Canada CIFAR AI Chair
Assistant Professor, Université de Montréal, Department of Computer Science and Operations Research
Research Scientist, Google DeepMind, Montréal
Research Topics
Computer Vision
Deep Learning
Multimodal Learning
Natural Language Processing

Biography

Aishwarya Agrawal is an assistant professor in the Department of Computer Science and Operations Research at Université de Montréal, a Canada CIFAR AI Chair, and a core academic member of Mila – Quebec Artificial Intelligence Institute.

Agrawal also works as a research scientist one day a week at DeepMind. Previously, she held this position full time (August 2019 to December 2020). She completed her PhD in August 2019 at Georgia Tech, where she worked with Dhruv Batra and Devi Parikh.

Her research interests lie at the intersection of the following sub-disciplines of AI: computer vision, deep learning and natural language processing. The focus is developing AI systems that can ‘see’ (i.e., understand the contents of an image: who, what, where, doing what?) and ‘talk’ (i.e., communicate the understanding to humans in free-form natural language).

Aishwarya has received many awards and scholarships: Georgia Tech 2020 Sigma Xi Best PhD Thesis Award, 2020 Georgia Tech College of Computing Dissertation Award, 2019 Google Fellowship (declined due to graduation), 2019–2020 Facebook Fellowship (declined due to graduation) and 2018–2019 NVIDIA Graduate Fellowship. She was one of two runners-up in the 2019 AAAI/ACM SIGAI Dissertation Award, and was selected as a 2018 Rising Star in EECS.

She holds a bachelor's degree in electrical engineering with a minor in computer science and engineering from the Indian Institute of Technology Gandhinagar (2014).

Current Students

PhD - Université de Montréal
Master's Research - McGill University
Principal supervisor :
Collaborating researcher - Korea University
Master's Research - Université de Montréal
PhD - Université de Montréal
Collaborating researcher - International Institute of Information Technology
PhD - Université de Montréal
PhD - Université Laval
Principal supervisor :
Professional Master's - Université de Montréal
Master's Research - Université de Montréal
Professional Master's - Université de Montréal
PhD - Université de Montréal
PhD - Université de Montréal

Publications

CulturalFrames: Assessing Cultural Expectation Alignment in Text-to-Image Models and Evaluation Metrics
Verena Rieser
Lisa Anne Hendricks
Sjoerd van Steenkiste
Karolina Stanczak
The increasing ubiquity of text-to-image (T2I) models as tools for visual content generation raises concerns about their ability to accurate… (see more)ly represent diverse cultural contexts. In this work, we present the first study to systematically quantify the alignment of T2I models and evaluation metrics with respect to both explicit as well as implicit cultural expectations. To this end, we introduce CulturalFrames, a novel benchmark designed for rigorous human evaluation of cultural representation in visual generations. Spanning 10 countries and 5 socio-cultural domains, CulturalFrames comprises 983 prompts, 3637 corresponding images generated by 4 state-of-the-art T2I models, and over 10k detailed human annotations. We find that T2I models not only fail to meet the more challenging implicit expectations but also the less challenging explicit expectations. Across models and countries, cultural expectations are missed an average of 44% of the time. Among these failures, explicit expectations are missed at a surprisingly high average rate of 68%, while implicit expectation failures are also significant, averaging 49%. Furthermore, we demonstrate that existing T2I evaluation metrics correlate poorly with human judgments of cultural alignment, irrespective of their internal reasoning. Collectively, our findings expose critical gaps, providing actionable directions for developing more culturally informed T2I models and evaluation methodologies.
REARANK: Reasoning Re-ranking Agent via Reinforcement Learning
We present REARANK, a large language model (LLM)-based listwise reasoning reranking agent. REARANK explicitly reasons before reranking, sign… (see more)ificantly improving both performance and interpretability. Leveraging reinforcement learning and data augmentation, REARANK achieves substantial improvements over baseline models across popular information retrieval benchmarks, notably requiring only 179 annotated samples. Built on top of Qwen2.5-7B, our REARANK-7B demonstrates performance comparable to GPT-4 on both in-domain and out-of-domain benchmarks and even surpasses GPT-4 on reasoning-intensive BRIGHT benchmarks. These results underscore the effectiveness of our approach and highlight how reinforcement learning can enhance LLM reasoning capabilities in reranking.
Improving Text-to-Image Consistency via Automatic Prompt Optimization
Pietro Astolfi
Melissa Hall
Candace Ross
Jack Urbanek
Adina Williams
Adriana Romero-Soriano
Impressive advances in text-to-image (T2I) generative models have yielded a plethora of high performing models which are able to generate ae… (see more)sthetically appealing, photorealistic images. Despite the progress, these models still struggle to produce images that are consistent with the input prompt, oftentimes failing to capture object quantities, relations and attributes properly. Existing solutions to improve prompt-image consistency suffer from the following challenges: (1) they oftentimes require model fine-tuning, (2) they only focus on nearby prompt samples, and (3) they are affected by unfavorable trade-offs among image quality, representation diversity, and prompt-image consistency. In this paper, we address these challenges and introduce a T2I optimization-by-prompting framework, OPT2I, which leverages a large language model (LLM) to improve prompt-image consistency in T2I models. Our framework starts from a user prompt and iteratively generates revised prompts with the goal of maximizing a consistency score. Our extensive validation on two datasets, MSCOCO and PartiPrompts, shows that OPT2I can boost the initial consistency score by up to 24.9% in terms of DSG score while preserving the FID and increasing the recall between generated and real data. Our work paves the way toward building more reliable and robust T2I systems by harnessing the power of LLMs.
Controlling Multimodal LLMs via Reward-guided Decoding
Enhancing Multi-Agent Multi-Modal Collaboration with Fine-Grained Reward Modeling
Multi-Modal Large Language Models (MLLMs) have significantly advanced multi-modal reasoning but still struggle with compositional reasoning … (see more)tasks. Multi-agent collaboration provides a promising solution by leveraging the distinct capabilities of different agents. Specifically, a decomposer agent to handle task breakdown and an answerer agent to generate responses. While there have been efforts to adaptively decompose tasks based on the answerer agent's capabilities, such as using in-context learning, these methods often prove insufficient for fully effective decomposition. We address this issue by enhancing collaboration through fine-grained reward modeling, where each generated sub-question is assigned a specialized reward without requiring extra annotation or tuning of a reward model. Our proposed method dynamically optimizes the decomposition process, enabling better alignment between agents. Experimental results on four vision-language tasks demonstrate consistent improvements, with a 5.5\% absolute increase in mean performance over traditional approaches. These findings highlight the efficacy of fine-grained reward modeling for enhancing multi-agent, multi-modal collaboration.
Visual Language Alignment Tuning
VisMin: Visual Minimal-Change Understanding
Fine-grained understanding of objects, attributes, and relationships between objects is crucial for visual-language models (VLMs). Existing … (see more)benchmarks primarily focus on evaluating VLMs' capability to distinguish between two very similar captions given an image. In this paper, we introduce a new, challenging benchmark termed Visual Minimal-Change Understanding (VisMin), which requires models to predict the correct image-caption match given two images and two captions. The image pair and caption pair contain minimal changes, i.e., only one aspect changes at a time from among the following: object, attribute, count, and spatial relation. These changes test the models' understanding of objects, attributes (such as color, material, shape), counts, and spatial relationships between objects. We built an automatic framework using large language models and diffusion models, followed by a rigorous 4-step verification process by human annotators. Empirical experiments reveal that current VLMs exhibit notable deficiencies in understanding spatial relationships and counting abilities. We also generate a large-scale training dataset to finetune CLIP and Idefics2, showing significant improvements in fine-grained understanding across benchmarks and in CLIP's general image-text alignment. We release all resources, including the benchmark, training data, and finetuned model checkpoints, at https://vismin.net/.
Decompose and Compare Consistency: Measuring VLMs' Answer Reliability via Task-Decomposition Consistency Comparison
Despite tremendous advancements, current state-of-the-art Vision-Language Models (VLMs) are still far from perfect. They tend to hallucinate… (see more) and may generate biased responses. In such circumstances, having a way to assess the reliability of a given response generated by a VLM is quite useful. Existing methods, such as estimating uncertainty using answer likelihoods or prompt-based confidence generation, often suffer from overconfidence. Other methods use self-consistency comparison but are affected by confirmation biases. To alleviate these, we propose \textbf{De}compose and \textbf{C}ompare \textbf{C}onsistency (\texttt{DeCC}) for reliability measurement. By comparing the consistency between the direct answer generated using the VLM's internal reasoning process, and the indirect answers obtained by decomposing the question into sub-questions and reasoning over the sub-answers produced by the VLM, \texttt{DeCC} measures the reliability of VLM's direct answer. Experiments across six vision-language tasks with three VLMs show \texttt{DeCC}'s reliability estimation achieves better correlation with task accuracy compared to the existing methods.
An Introduction to Vision-Language Modeling
Richard Yuanzhe Pang
Anurag Ajay
Alexander C. Li
Adrien Bardes
Suzanne Petryk
Zhiqiu Lin
Anas Mahmoud
Bargav Jayaraman
Mark Ibrahim
Melissa Hall
Yunyang Xiong
Candace Ross
Srihari Jayakumar
Chuan Guo
Diane Bouchacourt
Haider Al-Tahan
Karthik Padthe … (see 22 more)
Vasu Sharma
Huijuan Xu 0001
Hu Xu
Xiaoqing Ellen Tan
Megan Richards
Samuel Lavoie
Pietro Astolfi
Jun Chen
Kushal Tirumala
Mazda Moayeri
Arjang Talattof
Kamalika Chaudhuri
Zechun Liu
Xilun Chen
Quentin Garrido
Karen Ullrich
Kate Saenko
Asli Celikyilmaz
Vikas Chandra
Improving Automatic VQA Evaluation Using Large Language Models
8 years after the visual question answering (VQA) task was proposed, accuracy remains the primary metric for automatic evaluation. VQA Accur… (see more)acy has been effective so far in the IID evaluation setting. However, our community is undergoing a shift towards open-ended generative models and OOD evaluation. In this new paradigm, the existing VQA Accuracy metric is overly stringent and underestimates the performance of VQA systems. Thus, there is a need to develop more robust automatic VQA metrics that serve as a proxy for human judgment. In this work, we propose to leverage the in-context learning capabilities of instruction-tuned large language models (LLMs) to build a better VQA metric. We formulate VQA evaluation as an answer-rating task where the LLM is instructed to score the accuracy of a candidate answer given a set of reference answers. We demonstrate the proposed metric better correlates with human judgment compared to existing metrics across several VQA models and benchmarks. We hope wide adoption of our metric will contribute to better estimating the research progress on the VQA task. We plan to release the evaluation code and collected human judgments.
Molar Pregnancy in a Quadruplet Conception Following IVF: A Case Report
Madhuri A Mehendale
Meenal Shailesh Sarmalkar
Prerna Kailashchand Gupta
Agraj S Doshi
Recent Excavation of Nanoethosomes in Current Drug Delivery.
Sankha Bhattacharya
Aalind Joshi
In the current era, the Transdermal delivery of bioactive molecules has become an area of research interest. The transdermal route of admini… (see more)stration enables direct entry of bioactive molecules into the systemic circulation with better and easy accessibility, bypassing the hepatic metabolism and improving patient compliance. Permeation through the skin has always been a barrier. To overcome this challenge, an efficient route by the vesicular system has been adopted so as to have better skin permeation of the bioactive molecules. A novel vesicular and non-invasive drug delivery system called Nanoethosomes was developed. Nanoethosomes are lipid-based vesicular carriers that are used for deeper permeation of the bioactive agents into the skin. The main components of Nanoethosomes are Phospholipids, water, and ethanol. High ethanol concentration in Nanoethosomes distinguishes them from other nano-formulation and results in deeper permeation and smaller vesicular size. This review article gives detailed information on the formulation techniques, and characterization parameters of nanoethosomes along with the research work done by various researchers in the same field. The compiled manuscript gives detailed elaboration about the various drugs used to treat different diseases which when incorporated in nanoethosomes resulted in better permeability and enhanced bioavailability.