
Aishwarya Agrawal

Core Academic Member
Canada CIFAR AI Chair
Assistant Professor, Université de Montréal, Department of Computer Science and Operations Research (DIRO)
Research Scientist, Google DeepMind, Montréal
Research Topics
Multimodal Learning
Deep Learning
Natural Language Processing
Computer Vision

Biography

Aishwarya Agrawal is an Assistant Professor in the Department of Computer Science and Operations Research (DIRO) at Université de Montréal. She also holds a Canada CIFAR AI Chair and is a core academic member of Mila – Quebec Artificial Intelligence Institute.

She also spends one day a week at DeepMind as a Research Scientist; from August 2019 to December 2020, she was a full-time Research Scientist there. Aishwarya holds a bachelor's degree in electrical engineering with a minor in computer science, and received her PhD from Georgia Tech in August 2019, where she worked with Dhruv Batra and Devi Parikh. Her research interests lie at the intersection of the following AI subfields: computer vision, deep learning, and natural language processing, with a focus on developing AI systems that can "see" (i.e., understand the content of an image: who, what, where, who is doing what?) and "talk" (i.e., communicate that understanding to humans in free-form natural language).

She has received several awards and fellowships, including a Canada CIFAR AI Chair, the 2020 Sigma Xi Best PhD Thesis Award, the 2020 Georgia Tech College of Computing Dissertation Award, the 2019 Google Fellowship and the 2019-2020 Facebook Fellowship (both declined due to graduation), and the 2018-2019 NVIDIA Graduate Fellowship. Aishwarya was one of two runners-up for the 2019 AAAI / ACM SIGAI Best Dissertation Award. She was also selected for the 2018 Rising Stars in EECS.

Current Students

Collaborating Researcher - UdeM
Collaborating Researcher - University of British Columbia
Independent Visiting Researcher - Michigan State University
Master's Research - UdeM
Collaborating Researcher - International Institute of Information Technology
Master's Research - UdeM

Publications

WebMMU: A Benchmark for Multimodal Multilingual Website Understanding and Code Generation
We present WebMMU, a multilingual benchmark that evaluates three core web tasks: (1) website visual question answering, (2) code editing involving HTML/CSS/JavaScript, and (3) mockup-to-code generation. Unlike prior benchmarks that treat these tasks separately, WebMMU unifies them using expert-annotated, real-world web data to assess models' abilities in complex multi-step reasoning, precise element grounding, and functional UI comprehension and coding. Our evaluation shows that while multimodal large language models (MLLMs) perform well on basic information extraction, they struggle with reasoning and grounding, editing code to preserve functionality, and generating design-to-code that maintains hierarchy and supports multilingual content. These findings reveal key limitations in current MLLMs and underscore the need for improved multimodal and cross-lingual reasoning to build future web agents capable of automating diverse web development tasks.
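
To make the evaluation setup concrete, here is a minimal, hypothetical sketch of scoring a model on WebMMU-style website visual question answering using normalized exact match. The record fields, the stub model, and the metric choice are assumptions for illustration; they are not the benchmark's actual schema or official scoring code.

# Hypothetical sketch: scoring answers on WebMMU-style website-VQA
# records with normalized exact match. Field names ("screenshot",
# "question", "answer", "language") are illustrative only.

def normalize(text: str) -> str:
    """Lowercase and strip surrounding whitespace/punctuation before matching."""
    return text.strip().strip(".!?").lower()

def exact_match_accuracy(records, predict) -> float:
    """Fraction of records where the model's answer matches the reference."""
    hits = sum(
        normalize(predict(r["screenshot"], r["question"])) == normalize(r["answer"])
        for r in records
    )
    return hits / len(records)

if __name__ == "__main__":
    # Tiny in-memory stand-in for the benchmark's expert-annotated data.
    records = [
        {"screenshot": "shop.png", "question": "What is the cart total?",
         "answer": "$42.10", "language": "en"},
        {"screenshot": "shop_fr.png", "question": "Quel est le total du panier ?",
         "answer": "42,10 $", "language": "fr"},
    ]
    # Stub model; a real evaluation would call an MLLM here.
    predict = lambda image, question: "$42.10"
    print(f"exact-match accuracy: {exact_match_accuracy(records, predict):.2f}")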
UI-Vision: A Desktop-centric GUI Benchmark for Visual Perception and Interaction
Xiangru Jian
Kevin Qinghong Lin
Juan A. Rodriguez
Montek Kalsi
M. Tamer Özsu
Sai Rajeswar
Autonomous agents that navigate Graphical User Interfaces (GUIs) to automate tasks like document editing and file management can greatly enhance computer workflows. While existing research focuses on online settings, desktop environments, critical for many professional and everyday tasks, remain underexplored due to data collection challenges and licensing issues. We introduce UI-Vision, the first comprehensive, license-permissive benchmark for offline, fine-grained evaluation of computer use agents in real-world desktop environments. Unlike online benchmarks, UI-Vision provides: (i) dense, high-quality annotations of human demonstrations, including bounding boxes, UI labels, and action trajectories (clicks, drags, and keyboard inputs) across 83 software applications, and (ii) three fine-to-coarse grained tasks—Element Grounding, Layout Grounding, and Action Prediction—with well-defined metrics to rigorously evaluate agents’ performance in desktop environments. Our evaluation reveals critical limitations in state-of-the-art models like UI-TARS-72B, including issues with understanding professional software, spatial reasoning, and complex actions like drag-and-drop. These findings highlight the challenges in developing fully autonomous computer-use agents. With UI-Vision, we aim to advance the development of more capable agents for real-world desktop tasks.
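
As an illustration of how element grounding is commonly scored in GUI benchmarks of this kind, the hypothetical sketch below counts a prediction as correct when the agent's predicted click point falls inside the target element's ground-truth bounding box. The data format and stub agent are assumptions, not UI-Vision's actual interface.

# Illustrative sketch of a point-in-box element-grounding metric.
# Field names and the stub agent are invented for this example.

from dataclasses import dataclass

@dataclass
class GroundingExample:
    instruction: str                          # e.g. "Click the Save button"
    bbox: tuple[float, float, float, float]   # (x1, y1, x2, y2) in pixels

def point_in_box(x: float, y: float, bbox) -> bool:
    """True when the point (x, y) lies inside the bounding box."""
    x1, y1, x2, y2 = bbox
    return x1 <= x <= x2 and y1 <= y <= y2

def grounding_accuracy(examples, predict_click) -> float:
    """Fraction of examples whose predicted click lands in the target box."""
    hits = sum(point_in_box(*predict_click(ex.instruction), ex.bbox) for ex in examples)
    return hits / len(examples)

if __name__ == "__main__":
    examples = [
        GroundingExample("Click the Save button", (100, 40, 180, 70)),
        GroundingExample("Open the File menu", (0, 0, 60, 25)),
    ]
    predict_click = lambda instruction: (140, 55)  # stub agent
    print(f"grounding accuracy: {grounding_accuracy(examples, predict_click):.2f}")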
The Promise of RL for Autoregressive Image Editing
While image generation techniques are now capable of producing high-quality images that respect prompts which span multiple sentences, the task of text-guided image editing remains a challenge. Even edit requests that consist of only a few words often fail to be executed correctly. We explore three strategies to enhance performance on a wide range of image editing tasks: supervised fine-tuning (SFT), reinforcement learning (RL), and Chain-of-Thought (CoT) reasoning. In order to study all these components in one consistent framework, we adopt an autoregressive multimodal model that processes textual and visual tokens in a unified manner. We find RL combined with a large multi-modal LLM verifier to be the most effective of these strategies. As a result, we release EARL: Editing with Autoregression and RL, a strong RL-based image editing model that performs competitively on a diverse range of edits compared to strong baselines, despite using much less training data. Thus, EARL pushes the frontier of autoregressive multimodal models on image editing. We release our code, training data, and trained models at https://github.com/mair-lab/EARL.
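
The core verifier-driven RL idea can be sketched in a few lines: sample candidate edits from a policy, score them with a verifier, and reinforce high-reward samples via REINFORCE. The toy four-action edit space and keyword verifier below are invented for illustration; the paper's actual autoregressive model and multimodal LLM verifier are far richer than this stand-in.

# Toy sketch of verifier-driven RL (REINFORCE): sample from a policy,
# score samples with a verifier, and push up the log-probability of
# high-reward samples. The action space and verifier are invented here.

import torch

EDITS = ["recolor", "remove", "add_object", "no_op"]
TARGET = "recolor"  # what the stub verifier rewards

def verifier(edit: str) -> float:
    """Stand-in for a large multimodal LLM verifier: 1 if the edit matches."""
    return 1.0 if edit == TARGET else 0.0

logits = torch.zeros(len(EDITS), requires_grad=True)  # tiny "policy"
opt = torch.optim.Adam([logits], lr=0.1)

for step in range(200):
    dist = torch.distributions.Categorical(logits=logits)
    actions = dist.sample((16,))                        # sample a batch of edits
    rewards = torch.tensor([verifier(EDITS[a]) for a in actions])
    baseline = rewards.mean()                           # variance reduction
    loss = -((rewards - baseline) * dist.log_prob(actions)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

probs = torch.softmax(logits, dim=0).tolist()
print({e: round(p, 2) for e, p in zip(EDITS, probs)})   # mass shifts to "recolor"

Subtracting the batch-mean reward as a baseline is a standard variance-reduction choice; the paper's actual training recipe may differ.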