Portrait of Aishwarya Agrawal

Aishwarya Agrawal

Core Academic Member
Canada CIFAR AI Chair
Assistant Professor, Université de Montréal, Department of Computer Science and Operations Research (DIRO)
Research Scientist, Google DeepMind, Montréal
Research Topics
Multimodal Learning
Deep Learning
Natural Language Processing
Computer Vision

Biography

Aishwarya Agrawal is an Assistant Professor in the Department of Computer Science and Operations Research (DIRO) at Université de Montréal. She also holds a Canada CIFAR AI Chair and is a Core Academic Member of Mila – Quebec Artificial Intelligence Institute.

She also spends one day per week at DeepMind as a Research Scientist; from August 2019 to December 2020, she worked there as a full-time Research Scientist. Aishwarya holds a bachelor's degree in electrical engineering with a minor in computer science, and in August 2019 she completed her PhD at Georgia Tech, working with Dhruv Batra and Devi Parikh. Her research interests lie at the intersection of the following AI subfields: computer vision, deep learning, and natural language processing, with a focus on developing AI systems that can "see" (i.e., understand the content of an image: who, what, where, who is doing what?) and "talk" (i.e., communicate that understanding to humans in free-form natural language).

She has received several awards and fellowships, including a Canada CIFAR AI Chair, the 2020 Sigma Xi Best PhD Thesis Award and the 2020 Georgia Tech College of Computing Dissertation Award, the 2019 Google Fellowship and the 2019-2020 Facebook Fellowship (both declined due to graduation), and the 2018-2019 NVIDIA Graduate Fellowship. Aishwarya was one of two runners-up for the 2019 AAAI / ACM SIGAI Best Dissertation Award. She was also selected for Rising Stars in EECS 2018.

Current Students

Research Master's - UdeM
Research Collaborator - University of British Columbia
Research Master's - UdeM

Publications

UI-Vision: A Desktop-centric GUI Benchmark for Visual Perception and Interaction
Shravan Nayak
Xiangru Jian
Kevin Qinghong Lin
Juan A. Rodriguez
Montek Kalsi
Rabiul Awal
M. T. Özsu
David Vazquez
Perouz Taslakian
Spandana Gella
Sai Rajeswar
WebMMU: A Benchmark for Multimodal Multilingual Website Understanding and Code Generation
Rabiul Awal
Mahsa Massoud
Zichao Li
Aarash Feizi
Suyuchen Wang
David Vazquez
Juan A. Rodriguez
Perouz Taslakian
Spandana Gella
Sai Rajeswar
Understanding diverse web data and automating web development presents an exciting challenge for agentic AI. While existing benchmarks address isolated web-based tasks, such as website-based Visual Question Answering (VQA) and UI-to-code generation, they lack a unified evaluation suite for assessing web agents that interact with and reason about web environments. We introduce WebMMU, a large-scale benchmark for evaluating AI-driven web agents across multilingual website VQA, HTML/CSS/JavaScript code editing, and sketch-to-code generation. WebMMU provides a comprehensive evaluation suite with real-world website data, multi-step reasoning tasks, and functional UI understanding. Benchmarking state-of-the-art multimodal models on WebMMU reveals significant limitations in web-based reasoning, layout understanding, and structured code generation, particularly in preserving UI hierarchy, handling multilingual content, and producing robust, functional code. While most existing models are optimized for English-only settings, WebMMU highlights the challenges of cross-lingual adaptation in real-world web development. These findings expose critical gaps in current models' ability to understand website structures, execute user instructions, and generate high-quality web code, underscoring the need for more advanced multimodal reasoning in AI-driven web understanding and development.
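As a rough illustration of how a benchmark of this kind could be consumed, the sketch below iterates over a hypothetical multilingual website-VQA split and reports per-language exact-match scores; the record fields, the metric, and the file names are illustrative assumptions, not the released WebMMU schema or its official metrics.

# Hypothetical sketch of iterating and scoring a multilingual website-VQA split.
# The example schema and exact-match metric are assumptions, not WebMMU's own.
from dataclasses import dataclass

@dataclass
class WebVQAExample:
    screenshot_path: str   # rendered website screenshot
    question: str          # possibly non-English
    language: str
    answer: str

def exact_match(prediction: str, reference: str) -> float:
    return float(prediction.strip().lower() == reference.strip().lower())

def evaluate(model_fn, examples):
    # model_fn(screenshot_path, question) -> predicted answer string
    per_language = {}
    for ex in examples:
        score = exact_match(model_fn(ex.screenshot_path, ex.question), ex.answer)
        per_language.setdefault(ex.language, []).append(score)
    return {lang: sum(s) / len(s) for lang, s in per_language.items()}

# Usage with a dummy model that always answers "Home":
examples = [WebVQAExample("site1.png", "What is the page title?", "en", "Home"),
            WebVQAExample("site2.png", "Quel est le prix affiché ?", "fr", "19,99 €")]
print(evaluate(lambda img, q: "Home", examples))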
Assessing and Learning Alignment of Unimodal Vision and Language Models
Le Zhang
Qian Yang
How well are unimodal vision and language models aligned? Although prior work has approached this question, their assessment methods do not directly translate to how these models are used in practical vision-language tasks. In this paper, we propose a direct assessment method, inspired by linear probing, to assess vision-language alignment. We identify that the degree of alignment of SSL vision models depends on their SSL training objective, and we find that the clustering quality of SSL representations has a stronger impact on alignment performance than their linear separability. Next, we introduce Swift Alignment of Image and Language (SAIL), an efficient transfer learning framework that aligns pretrained unimodal vision and language models for downstream vision-language tasks. Since SAIL leverages the strengths of pretrained unimodal models, it requires significantly less (6%) paired image-text data for multimodal alignment compared to models like CLIP, which are trained from scratch. SAIL training requires only a single A100 GPU and 5 hours of training, and can accommodate a batch size of up to 32,768. SAIL achieves 73.4% zero-shot accuracy on ImageNet (vs. CLIP's 72.7%) and excels in zero-shot retrieval, complex reasoning, and semantic segmentation. Additionally, SAIL improves the language compatibility of vision encoders, which in turn enhances the performance of multimodal large language models. The entire codebase and model weights are open-source: https://lezhang7.github.io/sail.github.io/
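The sketch below illustrates the general recipe the abstract describes: aligning frozen unimodal encoders with small trainable projections and a CLIP-style contrastive loss. It is not the authors' SAIL implementation; the feature dimensions, hyperparameters, and the random stand-in features are assumptions.

# Minimal sketch of aligning frozen unimodal encoders with a contrastive loss.
# NOT the authors' SAIL code; encoders are stand-ins (random features), and
# dimensions/hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AlignmentHead(nn.Module):
    """Trainable projections on top of frozen image/text features."""
    def __init__(self, img_dim=768, txt_dim=1024, shared_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, shared_dim)
        self.txt_proj = nn.Linear(txt_dim, shared_dim)
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # ~log(1/0.07), CLIP-style

    def forward(self, img_feats, txt_feats):
        img = F.normalize(self.img_proj(img_feats), dim=-1)
        txt = F.normalize(self.txt_proj(txt_feats), dim=-1)
        return img, txt

def contrastive_loss(img, txt, logit_scale):
    # Symmetric InfoNCE over the in-batch image-text similarity matrix.
    logits = logit_scale.exp() * img @ txt.t()
    targets = torch.arange(img.size(0), device=img.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Toy training step with random features standing in for frozen pretrained encoders.
head = AlignmentHead()
opt = torch.optim.AdamW(head.parameters(), lr=1e-4)
img_feats = torch.randn(32, 768)   # placeholder for frozen vision features
txt_feats = torch.randn(32, 1024)  # placeholder for frozen text features
img, txt = head(img_feats, txt_feats)
loss = contrastive_loss(img, txt, head.logit_scale)
loss.backward()
opt.step()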
Improving Text-to-Image Consistency via Automatic Prompt Optimization
Oscar Mañas
Pietro Astolfi
Melissa Hall
Candace Ross
Jack Urbanek
Adina Williams
Michal Drozdzal
Controlling Multimodal LLMs via Reward-guided Decoding
Oscar Mañas
Pierluca D'Oro
Koustuv Sinha
Michal Drozdzal
CTRL-O: Language-Controllable Object-Centric Visual Representation Learning
Aniket Rajiv Didolkar
Andrii Zadaianchuk
Rabiul Awal
Maximilian Seitzer
Efstratios Gavves
Object-centric representation learning aims to decompose visual scenes into fixed-size vectors called "slots" or "object files", where each slot captures a distinct object. Current state-of-the-art models have shown remarkable success in object discovery, particularly in complex real-world scenes, while also generalizing well to unseen domains. However, these models suffer from a key limitation: they lack controllability. Specifically, current object-centric models learn representations based on their preconceived understanding of objects and parts, without allowing user input to guide or modify which objects are represented. Introducing controllability into object-centric models could unlock a range of useful capabilities, such as enabling models to represent scenes at variable levels of granularity based on user specification. In this work, we propose a novel approach that conditions slot representations through guided decomposition, paired with a contrastive learning objective, to enable user-directed control over which objects are represented. Our method achieves such controllability without any mask supervision and successfully binds to user-specified objects in complex real-world scenes.
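To make the idea of user-conditioned slots concrete, here is a toy sketch in which language-query embeddings seed some slot initializations before a single step of slot-style competitive attention. The module names, dimensions, and simplified single-iteration attention are assumptions; this does not reproduce the CTRL-O architecture or its contrastive objective.

# Illustrative sketch of language-conditioned slot initialization; not CTRL-O itself.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionedSlots(nn.Module):
    def __init__(self, dim=256, num_slots=7):
        super().__init__()
        self.query_to_slot = nn.Linear(dim, dim)   # map a language query to a slot init
        self.free_slots = nn.Parameter(torch.randn(num_slots, dim) * 0.02)
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)

    def forward(self, feats, query_emb):
        # feats: (B, N, dim) visual features; query_emb: (B, K, dim) language queries.
        B = feats.size(0)
        conditioned = self.query_to_slot(query_emb)             # user-controlled slots
        free = self.free_slots.unsqueeze(0).expand(B, -1, -1)   # remaining "free" slots
        slots = torch.cat([conditioned, free], dim=1)
        # One step of slot-style competitive attention: the softmax is taken over
        # slots (dim=1), so slots compete for each visual feature.
        attn = torch.einsum('bsd,bnd->bsn', self.to_q(slots), self.to_k(feats))
        attn = F.softmax(attn / feats.size(-1) ** 0.5, dim=1)
        attn = attn / attn.sum(dim=-1, keepdim=True).clamp(min=1e-6)
        return torch.einsum('bsn,bnd->bsd', attn, self.to_v(feats))

feats = torch.randn(2, 196, 256)   # e.g. 14x14 patch features (placeholder)
queries = torch.randn(2, 3, 256)   # embeddings of 3 user-specified object queries
slots = ConditionedSlots()(feats, queries)
print(slots.shape)  # torch.Size([2, 10, 256]) -> 3 conditioned + 7 free slots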
Enhancing Multi-Agent Multi-Modal Collaboration with Fine-Grained Reward Modeling
Qian Yang
Weixiang Yan
Multi-Modal Large Language Models (MLLMs) have significantly advanced multi-modal reasoning but still struggle with compositional reasoning tasks. Multi-agent collaboration provides a promising solution by leveraging the distinct capabilities of different agents: a decomposer agent handles task breakdown and an answerer agent generates responses. While there have been efforts to adaptively decompose tasks based on the answerer agent's capabilities, such as using in-context learning, these methods often prove insufficient for fully effective decomposition. We address this issue by enhancing collaboration through fine-grained reward modeling, where each generated sub-question is assigned a specialized reward without requiring extra annotation or tuning of a reward model. Our proposed method dynamically optimizes the decomposition process, enabling better alignment between agents. Experimental results on four vision-language tasks demonstrate consistent improvements, with a 5.5% absolute increase in mean performance over traditional approaches. These findings highlight the efficacy of fine-grained reward modeling for enhancing multi-agent, multi-modal collaboration.
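The schematic below shows the kind of decomposer-answerer loop the abstract describes, where every candidate sub-question receives its own scalar reward and the best-scoring decomposition is kept. The decompose, answer, and reward functions are hypothetical stubs, not the paper's models or its exact reward definition.

# Schematic decomposer-answerer loop with per-sub-question rewards (stubs only).
import random

def decompose(question, n_candidates=3):
    # Stub: a decomposer agent would propose candidate sub-question lists here.
    return [[f"{question} (sub-question {i}.{j})" for j in range(2)] for i in range(n_candidates)]

def answer(sub_question, image=None):
    # Stub: an answerer agent (an MLLM) would answer each sub-question here.
    return f"answer to: {sub_question}"

def sub_question_reward(sub_question, sub_answer):
    # Stub: a fine-grained reward, e.g. the answerer's self-consistency or confidence
    # on this sub-question, obtained without training a separate reward model.
    return random.random()

def best_decomposition(question):
    scored = []
    for candidate in decompose(question):
        rewards = [sub_question_reward(sq, answer(sq)) for sq in candidate]
        scored.append((sum(rewards) / len(rewards), candidate))
    return max(scored)[1]  # keep the decomposition whose sub-questions score best

print(best_decomposition("How many red objects are left of the mug?"))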
Visual Language Alignment Tuning
Le Zhang
Qian Yang
VisMin: Visual Minimal-Change Understanding
Rabiul Awal
Saba Ahmadi
Le Zhang
Fine-grained understanding of objects, attributes, and relationships between objects is crucial for visual-language models (VLMs). To evaluate VLMs' fine-grained understanding, existing benchmarks primarily focus on evaluating VLMs' capability to distinguish between two very similar captions given an image. In this paper, our focus is on evaluating VLMs' capability to distinguish between two very similar images given a caption. To this end, we introduce a new, challenging benchmark termed Visual Minimal-Change Understanding (VisMin), which requires models to predict the correct image-caption match given two images and two captions. Importantly, the image pair (as well as the caption pair) contains minimal changes, i.e., between the two images (as well as between the two captions), only one aspect changes at a time from among the following possible types of changes: object, attribute, count, and spatial relation. These four types of minimal changes are specifically designed to test the models' understanding of objects, attributes of objects (such as color, material, shape), counts of objects, and spatial relationships between objects. To curate our benchmark, we built an automatic pipeline using large language models and diffusion models, followed by a rigorous 4-step verification process by human annotators. Empirical experiments reveal that current VLMs exhibit notable deficiencies in understanding spatial relationships and counting abilities. Furthermore, leveraging the automated nature of our data creation process, we generate a large-scale training dataset, which we use to finetune CLIP (a foundational VLM) and Idefics2 (a multimodal large language model). Our findings show that both these models benefit significantly from fine-tuning on this data, as evidenced by marked improvements in fine-grained understanding across a wide range of benchmarks. Additionally, such fine-tuning also improves CLIP's general image-text alignment capabilities. All resources, including the benchmark, the training data, and the finetuned model checkpoints, are released at https://vismin.net/.
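For intuition, the sketch below applies the matching criterion a VisMin item implies to a public CLIP checkpoint from Hugging Face: each image must prefer its own caption and each caption its own image. The blank placeholder images and example captions are assumptions, not VisMin data.

# Sketch of scoring a two-image / two-caption minimal-change item with CLIP.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

images = [Image.new("RGB", (224, 224), "white"), Image.new("RGB", (224, 224), "gray")]
captions = ["a cat to the left of a dog", "a cat to the right of a dog"]  # minimal change

inputs = processor(text=captions, images=images, return_tensors="pt", padding=True)
with torch.no_grad():
    sims = model(**inputs).logits_per_image  # 2x2 image-caption similarity matrix

# Diagonal entries (matched pairs) must beat off-diagonal entries in both directions.
image_score = bool((sims[0, 0] > sims[0, 1]) and (sims[1, 1] > sims[1, 0]))
text_score = bool((sims[0, 0] > sims[1, 0]) and (sims[1, 1] > sims[0, 1]))
print("image-to-text correct:", image_score, "| text-to-image correct:", text_score)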