David Vázquez

Associate Industry Member
Adjunct Professor, Polytechnique Montréal, Department of Computer Engineering and Software Engineering
ServiceNow
Research Topics
Computer Vision
Conversational AI
Deep Learning
Generative Models
Large Language Models (LLM)
Multimodal Learning
Representation Learning

Publications

UI-Vision: A Desktop-centric GUI Benchmark for Visual Perception and Interaction
Xiangru Jian
Kevin Qinghong Lin
Juan A. Rodriguez
Montek Kalsi
M. Tamer Özsu
Sai Rajeswar
Human Annotator
Autonomous agents that navigate Graphical User Interfaces (GUIs) to automate tasks like document editing and file management can greatly enhance computer workflows. While existing research focuses on online settings, desktop environments, critical for many professional and everyday tasks, remain underexplored due to data collection challenges and licensing issues. We introduce UI-Vision, the first comprehensive, license-permissive benchmark for offline, fine-grained evaluation of computer use agents in real-world desktop environments. Unlike online benchmarks, UI-Vision provides: (i) dense, high-quality annotations of human demonstrations, including bounding boxes, UI labels, and action trajectories (clicks, drags, and keyboard inputs) across 83 software applications, and (ii) three fine-to-coarse grained tasks (Element Grounding, Layout Grounding, and Action Prediction) with well-defined metrics to rigorously evaluate agents’ performance in desktop environments. Our evaluation reveals critical limitations in state-of-the-art models like UI-TARS-72B, including issues with understanding professional software, spatial reasoning, and complex actions like drag-and-drop. These findings highlight the challenges in developing fully autonomous computer-use agents. With UI-Vision, we aim to advance the development of more capable agents for real-world desktop tasks.
WebMMU: A Benchmark for Multimodal Multilingual Website Understanding and Code Generation
We present WebMMU, a multilingual benchmark that evaluates three core web tasks: (1) website visual question answering, (2) code editing involving HTML/CSS/JavaScript, and (3) mockup-to-code generation. Unlike prior benchmarks that treat these tasks separately, WebMMU unifies them using expert-annotated, real-world web data to assess models' abilities in complex multi-step reasoning, precise element grounding, and functional UI comprehension and coding. Our evaluation shows that while multimodal large language models (MLLMs) perform well on basic information extraction, they struggle with reasoning and grounding, editing code to preserve functionality, and generating design-to-code that maintains hierarchy and supports multilingual content. These findings reveal key limitations in current MLLMs and underscore the need for improved multimodal and cross-lingual reasoning to build future web agents capable of automating diverse web development tasks.
BigCharts-R1: Enhanced Chart Reasoning with Visual Reinforcement Finetuning
Masoud Hashemi
Juan A. Rodriguez
Khyati Mahajan
Vikas Yadav
Sathwik Tejaswi Madhusudhan
Enamul Hoque
Sai Rajeswar
Rendering-Aware Reinforcement Learning for Vector Graphics Generation
Juan A. Rodriguez
Haotian Zhang
Rishav Pramanik
Pascal Wichmann
Arnab Mondal
Mohammad Reza Samsami
Sai Rajeswar
Scalable Vector Graphics (SVG) offer a powerful format for representing visual designs as interpretable code. Recent advances in vision-language models (VLMs) have enabled high-quality SVG generation by framing the problem as a code generation task and leveraging large-scale pretraining. VLMs are particularly suitable for this task as they capture both global semantics and fine-grained visual patterns, while transferring knowledge across vision, natural language, and code domains. However, existing VLM approaches often struggle to produce faithful and efficient SVGs because they never observe the rendered images during training. Although differentiable rendering for autoregressive SVG code generation remains unavailable, rendered outputs can still be compared to original inputs, enabling evaluative feedback suitable for reinforcement learning (RL). We introduce RLRF (Reinforcement Learning from Rendering Feedback), an RL method that enhances SVG generation in autoregressive VLMs by leveraging feedback from rendered SVG outputs. Given an input image, the model generates SVG roll-outs that are rendered and compared to the original image to compute a reward. This visual fidelity feedback guides the model toward producing more accurate, efficient, and semantically coherent SVGs. RLRF significantly outperforms supervised fine-tuning, addressing common failure modes and enabling precise, high-quality SVG generation with strong structural understanding and generalization.
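The render-and-compare step described in the abstract can be illustrated with a minimal sketch. This is not the paper's reward: the abstract does not specify the formula, so the pixel-wise score (1 minus mean squared error), the grayscale `Image` type, the `render` callback, and the choice to give unrenderable roll-outs the lowest reward are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence

# Hypothetical stand-in for a rendered raster: grayscale pixels in [0, 1].
Image = List[List[float]]


def pixel_reward(rendered: Image, target: Image) -> float:
    """Return 1 - mean squared error, so identical images score 1.0.

    Assumed reward shape for illustration only; with pixels in [0, 1]
    the result also lies in [0, 1].
    """
    same_rows = len(rendered) == len(target)
    if not same_rows or any(len(r) != len(t) for r, t in zip(rendered, target)):
        raise ValueError("rendered and target must have the same shape")

    total = 0.0
    count = 0
    for row_r, row_t in zip(rendered, target):
        for a, b in zip(row_r, row_t):
            total += (a - b) ** 2
            count += 1
    if count == 0:
        raise ValueError("images must contain at least one pixel")
    return 1.0 - total / count


def score_rollouts(
    svg_rollouts: Sequence[str],
    target: Image,
    render: Callable[[str], Image],
) -> List[float]:
    """Render each sampled SVG string and score it against the target image.

    `render` is a caller-supplied (hypothetical) rasterizer. Roll-outs that
    fail to render, or render at the wrong size, get the lowest reward (0.0).
    """
    rewards = []
    for svg in svg_rollouts:
        try:
            rewards.append(pixel_reward(render(svg), target))
        except Exception:
            rewards.append(0.0)
    return rewards
```

In an RL loop, the returned list would supply one scalar per roll-out to the policy update. Because the reward is computed from the rendered output only, the renderer itself does not need to be differentiable, which matches the motivation given in the abstract.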
Distilling semantically aware orders for autoregressive image generation
Rishav Pramanik
Antoine Poupon
Juan A. Rodriguez
Masih Aminbeidokhti
Zhaozheng Yin
StarVector: Generating Scalable Vector Graphics Code from Images and Text
Juan A. Rodriguez
Issam Hadj Laradji
Pau Rodriguez
Sai Rajeswar
AgentAda: Skill-Adaptive Data Analytics for Tailored Insight Discovery
Amirhossein Abaskohi
Amrutha Varshini Ramesh
Shailesh Nanisetty
Giuseppe Carenini
Issam Hadj Laradji