
Aishwarya Agrawal

Core Academic Member
Canada CIFAR AI Chair
Assistant Professor, Université de Montréal, Department of Computer Science and Operations Research (DIRO)
Research Scientist, Google DeepMind, Montréal
Research Topics
Multimodal Learning
Deep Learning
Natural Language Processing
Computer Vision

Biography

Aishwarya Agrawal is an assistant professor in the Department of Computer Science and Operations Research (DIRO) at Université de Montréal. She also holds a Canada CIFAR AI Chair and is a core academic member of Mila – Quebec Artificial Intelligence Institute.

She also spends one day a week at DeepMind as a research scientist; from August 2019 to December 2020, she was a full-time research scientist there. Aishwarya holds a bachelor's degree in electrical engineering with a minor in computer science, and she completed her PhD at Georgia Tech in August 2019, working with Dhruv Batra and Devi Parikh. Her research interests lie at the intersection of the following sub-disciplines of AI: computer vision, deep learning, and natural language processing, with a focus on developing AI systems that can "see" (i.e., understand the contents of an image: who, what, where, doing what?) and "talk" (i.e., communicate that understanding to humans in free-form natural language).

She has received several awards and fellowships, including the Canada CIFAR AI Chair award, the Sigma Xi Best Ph.D. Thesis Award 2020, the Georgia Tech 2020 College of Computing Dissertation Award, the 2019 Google Fellowship and the 2019-2020 Facebook Fellowship (both declined because she was graduating), and the 2018-2019 NVIDIA Graduate Fellowship. Aishwarya was one of two runners-up for the 2019 AAAI / ACM SIGAI Doctoral Dissertation Award. She was also selected for Rising Stars in EECS 2018.

Current Students

Research Master's - McGill
Principal supervisor:
Research Collaborator - Korea University
Research Master's - UdeM
Research Collaborator - International Institute of Information Technology
PhD - Université Laval
Principal supervisor:
Professional Master's - UdeM
Research Master's - UdeM
Professional Master's - UdeM

Publications

TECCI: Tricky Edits of Collected and Curated Images
Roy Hirsch
Yasumasa Onoe
Sherry Ben
Jason Baldridge
Despite tremendous recent progress, current text-guided image editing methods still struggle with many aspects of editing involving instruction following, minimally editing the source image, and ensuring high visual quality. These problems are especially apparent when the requested edit is challenging, such as those that involve position, motion, viewpoint, scale and creative edits. To systematically test generative image editors, we propose a novel image editing benchmark -- TECCI: Tricky Edits of Collected and Curated Images. TECCI consists of a completely new set of images we are releasing. The images in TECCI span 7 image categories. The images and these categories were curated intentionally to target weaknesses of existing methods. The edit instructions in TECCI are automatically generated by Gemini, covering 5 edit types per source image. We also curated a set of 530 images for which we created challenging manually written edit instructions. Overall, TECCI contains 7550 pairs of images and edit instructions. We conduct human evaluations of five leading image editing models on TECCI. Humans judge outputs along three dimensions: 1) instruction following, 2) minimality of the edits, and 3) visual quality. To scale up the evaluation, we also build an auto-rater using Gemini that achieves 74.7% accuracy in matching human evaluations. Our evaluations reveal that: 1) none of the models exceed a 22% overall success rate, demonstrating the challenging nature of TECCI, 2) Nano Banana Pro is the best performing model overall, 3) models perform significantly better at instruction following compared to minimal edits and visual quality, 4) models struggle with editing architecture and nature images, which require strong understanding of spatial layout and intricate visual details, and 5) reasoning and creative edits are the most difficult, whereas color and appearance edits are the easiest.
How and What to Imagine? Visual Thinking in Unified Multimodal Models for Cross-View Spatial Reasoning
Cross-view spatial reasoning remains a weak spot for vision-language models (VLMs): they often reason in language and lose the fine-grained geometry needed for the task. Thinking with images aims to address this by generating an intermediate thinking image, but recent work shows that models often ignore the visual evidence in these traces. We therefore ask how to make visual thinking matter, and what kind of visual thinking works best. We study these questions in unified multimodal models (UMMs), which natively support interleaved image-text generation. For the first question, we propose View Dropout (VDrop), a training-time intervention that hides parts of one input view from the answer span while keeping them visible to the thinking-image tokens. This encourages the model to use the thinking image when answering, instead of relying only on the input views. Once the thinking image is used for answer prediction, we study which type of visual thinking is most effective. We frame this as a learnability-informativeness tradeoff and compare three thinking-image variants: top-down, panoramic, and point-matching renderings. Trained on synthetic scenes and evaluated on five real-world out-of-domain benchmarks, panoramic visual thinking with VDrop is the only configuration that is both informative and learnable, and it achieves the best out-of-domain generalization.
RiT: Vanilla Diffusion Transformers Suffice in Representation Space
Flow matching with …
Discovering Failure Modes in Vision-Language Models using RL
Vision-language Models (VLMs), despite achieving strong performance on multimodal benchmarks, often misinterpret straightforward visual concepts that humans identify effortlessly, such as counting, spatial reasoning, and viewpoint understanding. Previous studies manually identified these weaknesses and found that they often stem from deficits in specific skills. However, such manual efforts are costly, unscalable, and subject to human bias, which often overlooks subtle details in favor of salient objects, resulting in an incomplete understanding of a model's vulnerabilities. To address these limitations, we propose a Reinforcement Learning (RL)-based framework to automatically discover the failure modes or blind spots of any candidate VLM on a given data distribution without human intervention. Our framework trains a questioner agent that adaptively generates queries based on the candidate VLM's responses to elicit incorrect answers. Our approach increases question complexity by focusing on fine-grained visual details and distinct skill compositions as training progresses, consequently identifying 36 novel failure modes in which VLMs struggle. We demonstrate the broad applicability of our framework by showcasing its generalizability across various model combinations.
Communicating about Space: Language-Mediated Spatial Integration Across Partial Views
Sudarshan Nikhil
Ponnurangam Kumaraguru
Humans build shared spatial understanding by communicating partial, viewpoint-dependent observations. We ask whether Multimodal Large Language Models (MLLMs) can do the same, aligning distinct egocentric views through dialogue to form a coherent, allocentric mental model of a shared environment. To study this systematically, we introduce COSMIC, a benchmark for Collaborative Spatial Communication. In this setting, two static MLLM agents observe a 3D indoor environment from different viewpoints and exchange natural-language messages to solve spatial queries. COSMIC contains 899 diverse scenes and 1250 question-answer pairs spanning five tasks. We find a capability hierarchy: MLLMs are most reliable at identifying shared anchor objects across views, perform worse on relational reasoning, and largely fail at building globally consistent maps, performing near chance, even for frontier models. Moreover, we find thinking capability yields gains in anchor grounding, but is insufficient for higher-level spatial communication. To contextualize model behavior, we collect 250 human-human dialogues. Humans achieve 95% aggregate accuracy, while the best model, Gemini-3-Pro-Thinking, reaches 72%, leaving substantial room for improvement. Moreover, human conversations grow more precise as partners align on a shared spatial understanding, whereas MLLMs keep exploring without converging, suggesting limited capacity to form and sustain a robust shared mental model throughout the dialogue. Our code and data are available at https://github.com/ankursikarwar/Cosmic.
Observational Study of Maternal and Fetal Outcome in Posterior Reversible Encephalopathy Syndrome in Eclamptic Women in a Tertiary Care Institute
Prerna Kailashchand Gupta
Meenal Shailesh Sarmalkar
Madhuri A Mehendale
From Where Things Are to What They Are For: Benchmarking Spatial–Functional Intelligence in Multimodal LLMs
Jihan Yang
Soundarya Krishnan
Jimit Majmudar
Xiou Ge
Prasoon Puri
Prathamesh Saraf
Shruti Bhargava
Dhivya Piraviperumal
Yinan Ling
Cindy Pan
Hong Yu
Bo-Hsiang Tseng
Human-level agentic intelligence transcends low-level geometric perception, evolving from knowing where things are to understanding what they are for. While existing benchmarks effectively evaluate these foundational geometric perception capabilities of multimodal LLMs, they fall short of probing the higher-order cognitive abilities essential for grounded intelligence. To bridge this gap, we introduce the Spatial-Functional Intelligence Benchmark (SFI-Bench), a video-based benchmark with over 1500 expert-annotated questions derived from diverse, egocentric indoor video scans. SFI-Bench is designed to systematically evaluate two complementary dimensions of advanced reasoning: 1) Structured Spatial Reasoning, understanding complex layouts and forming coherent spatial representations, and 2) Functional Reasoning, inferring object affordances and context-dependent utility. Its tasks, including conditional counting, multi-hop relational reasoning, functional pairing, and knowledge-grounded troubleshooting, directly challenge a model's ability to integrate perception, memory, and inference. Our experiments reveal that current MLLMs consistently struggle to integrate spatial memory with functional and external knowledge, highlighting a critical bottleneck. SFI-Bench thus provides an essential tool for measuring and driving progress towards more cognitively capable and truly grounded multimodal agents.
WebMMU: A Benchmark for Multimodal Multilingual Website Understanding and Code Generation
We present WebMMU, a multilingual benchmark that evaluates three core web tasks: (1) website visual question answering, (2) code editing involving HTML/CSS/JavaScript, and (3) mockup-to-code generation. Unlike prior benchmarks that treat these tasks separately, WebMMU unifies them using expert-annotated, real-world web data to assess models' abilities in complex multi-step reasoning, precise element grounding, and functional UI comprehension and coding. Our evaluation shows that while multimodal large language models (MLLMs) perform well on basic information extraction, they struggle with reasoning and grounding, editing code to preserve functionality, and generating design-to-code that maintains hierarchy and supports multilingual content. These findings reveal key limitations in current MLLMs and underscore the need for improved multimodal and cross-lingual reasoning to build future web agents capable of automating diverse web development tasks.
The Promise of RL for Autoregressive Image Editing
Ge Ya Luo
Juan A. Rodriguez
Sai Rajeswar
Christopher Pal
While image generation techniques are now capable of producing high-quality images that respect prompts which span multiple sentences, the task of text-guided image editing remains a challenge. Even edit requests that consist of only a few words often fail to be executed correctly. We explore three strategies to enhance performance on a wide range of image editing tasks: supervised fine-tuning (SFT), reinforcement learning (RL), and Chain-of-Thought (CoT) reasoning. In order to study all these components in one consistent framework, we adopt an autoregressive multimodal model that processes textual and visual tokens in a unified manner. We find RL combined with a large multi-modal LLM verifier to be the most effective of these strategies. As a result, we release EARL: Editing with Autoregression and RL, a strong RL-based image editing model that performs competitively on a diverse range of edits compared to strong baselines, despite using much less training data. Thus, EARL pushes the frontier of autoregressive multimodal models on image editing. We release our code, training data, and trained models at https://github.com/mair-lab/EARL.
UI-Vision: A Desktop-centric GUI Benchmark for Visual Perception and Interaction
Xiangru Jian
Kevin Qinghong Lin
Juan A. Rodriguez
Montek Kalsi
M. Tamer Özsu
Christopher Pal
Sai Rajeswar
Human Annotator
Autonomous agents that navigate Graphical User Interfaces (GUIs) to automate tasks like document editing and file management can greatly enhance computer workflows. While existing research focuses on online settings, desktop environments, critical for many professional and everyday tasks, remain underexplored due to data collection challenges and licensing issues. We introduce UI-Vision, the first comprehensive, license-permissive benchmark for offline, fine-grained evaluation of computer use agents in real-world desktop environments. Unlike online benchmarks, UI-Vision provides: (i) dense, high-quality annotations of human demonstrations, including bounding boxes, UI labels, and action trajectories (clicks, drags, and keyboard inputs) across 83 software applications, and (ii) three fine-to-coarse grained tasks (Element Grounding, Layout Grounding, and Action Prediction) with well-defined metrics to rigorously evaluate agents' performance in desktop environments. Our evaluation reveals critical limitations in state-of-the-art models like UI-TARS-72B, including issues with understanding professional software, spatial reasoning, and complex actions like drag-and-drop. These findings highlight the challenges in developing fully autonomous computer-use agents. With UI-Vision, we aim to advance the development of more capable agents for real-world desktop tasks.
Learning What Matters: Prioritized Concept Learning via Relative Error-driven Sample Selection
Instruction tuning has been central to the success of recent vision-language models (VLMs), but it remains expensive, requiring large-scale datasets, high-quality annotations, and large compute budgets. We propose PRioritized cOncept learninG via Relative Error-driven Sample Selection (PROGRESS), a data- and compute-efficient framework that enables VLMs to dynamically select what to learn next based on their evolving needs during training. At each stage, the model tracks its learning progress across skills and selects the most informative samples: those it has not already mastered and that are not too difficult to learn at the current stage of training. This strategy effectively controls skill acquisition and the order in which skills are learned. Specifically, we sample from skills showing the highest learning progress, prioritizing those with the most rapid improvement. Unlike prior methods, PROGRESS requires no upfront answer annotations, queries answers only on an as-needed basis, avoids reliance on additional supervision from auxiliary VLMs, and does not require compute-heavy gradient computations for data selection. Experiments across multiple instruction-tuning datasets of varying scales demonstrate that PROGRESS consistently outperforms state-of-the-art baselines with much less data and supervision. Additionally, we show strong cross-architecture generalization and transferability to larger models, validating PROGRESS as a scalable solution for efficient learning.
Assessing and Learning Alignment of Unimodal Vision and Language Models
How well are unimodal vision and language models aligned? Although prior work has approached answering this question, its assessment methods do not directly translate to how these models are used in practical vision-language tasks. In this paper, we propose a direct assessment method, inspired by linear probing, to assess vision-language alignment. We identify that the degree of alignment of the SSL vision models depends on their SSL training objective, and we find that the clustering quality of SSL representations has a stronger impact on alignment performance than their linear separability. Next, we introduce Swift Alignment of Image and Language (SAIL), an efficient transfer learning framework that aligns pretrained unimodal vision and language models for downstream vision-language tasks. Since SAIL leverages the strengths of pretrained unimodal models, it requires significantly less (6%) paired image-text data for the multimodal alignment compared to models like CLIP, which are trained from scratch. SAIL training only requires a single A100 GPU and 5 hours of training, and can accommodate a batch size of up to 32,768. SAIL achieves 73.4% zero-shot accuracy on ImageNet (vs. CLIP's 72.7%) and excels in zero-shot retrieval, complex reasoning, and semantic segmentation. Additionally, SAIL improves the language-compatibility of vision encoders that in turn enhance the performance of multimodal large language models. The entire codebase and model weights are open-source: https://lezhang7.github.io/sail.github.io/