Abhay Puri

Tianyang Zhang

Rishav Pramanik

Meng Lin

Xiaoqing Xie

Marco Terral

Aly Shariff

Sai Rajeswar

Christopher Pal

We introduce VectorGym, a comprehensive benchmark suite for Scalable Vector Graphics (SVG) that spans generation from text and sketches, com… (voir plus)plex editing, and visual understanding. VectorGym addresses the lack of realistic, challenging benchmarks aligned with professional design workflows. Our benchmark comprises four tasks with expert human-authored annotations: the novel Sketch2SVG task (VG-Sketch); a new SVG editing dataset (VG-Edit) featuring complex, multi-step edits with higher-order primitives; Text2SVG generation (VG-Text); and SVG captioning (VG-Cap). Unlike prior benchmarks that rely on synthetic edits, VectorGym provides gold-standard human annotations that require semantic understanding and design intent. We also propose a multi-task reinforcement learning approach that jointly optimizes across all four tasks using rendering-based rewards. Our method, built on GRPO with curriculum learning, trains a Qwen3-VL 8B model that achieves state-of-the-art performance among open-source models, surpassing much larger models including Qwen3-VL 235B and matching GPT-4o. We also introduce a VLM-as-a-Judge metric for SVG generation, validated through human correlation studies. Our evaluation of frontier VLMs reveals significant performance gaps, positioning VectorGym as a rigorous framework for advancing visual code generation. VectorGym is publicly available on huggingface.co/datasets/ServiceNow/VectorGym.

2026-02-21

arXiv (prépublication)

Indirect Prompt Injections: Are Firewalls All You Need, or Stronger Benchmarks?

Kevin Kasa

Graham W. Taylor

Krishnamurthy Dj Dvijotham

Alexandre Lacoste

AI agents are vulnerable to indirect prompt injection attacks, where malicious instructions embedded in external content or tool outputs cau… (voir plus)se unintended or harmful behavior. Inspired by the well-established concept of firewalls, we show that a simple, modular and model-agnostic defense operating at the agent--tool interface achieves perfect security (0% or the lowest possible attack success rate) with high utility (task success rate) across four public benchmarks: AgentDojo, Agent Security Bench, InjecAgent and tau-Bench, while achieving a state-of-the-art security-utility tradeoff compared to prior results. Specifically, we employ a defense based on two firewalls: a Tool-Input Firewall (Minimizer) and a Tool-Output Firewall (Sanitizer). Unlike prior complex approaches, this firewall defense makes minimal assumptions on the agent and can be deployed out-of-the-box, while maintaining strong performance without compromising utility. However, our analysis also reveals critical limitations in these existing benchmarks, including flawed success metrics, implementation bugs, and most importantly, weak attacks, hindering significant progress in the field. To foster more meaningful progress, we present targeted fixes to these issues for AgentDojo and Agent Security Bench while proposing best-practices for more robust benchmark design. Further, we demonstrate that although these firewalls push the state-of-the-art on existing benchmarks, it is still possible to bypass them in practice, underscoring the need to incorporate stronger attacks in security benchmarks. Overall, our work shows that existing agentic security benchmarks are easily saturated by a simple approach and highlights the need for stronger agentic security benchmarks with carefully chosen evaluation metrics and strong adaptive attacks.

2025-10-05

ArXiv (prépublication)

Malice in Agentland: Down the Rabbit Hole of Backdoors in the AI Supply Chain

Léo Boisvert

Chandra Kiran Reddy Evuru

Nazanin Sepahvand

Nicolas Chapados

Quentin Cappart

Alexandre Lacoste

Krishnamurthy (DJ) Dvijotham

Alexandre Drouin

The practice of fine-tuning AI agents on data from their own interactions--such as web browsing or tool use--, while being a strong general … (voir plus)recipe for improving agentic capabilities, also introduces a critical security vulnerability within the AI supply chain. In this work, we show that adversaries can easily poison the data collection pipeline to embed hard-to-detect backdoors that are triggerred by specific target phrases, such that when the agent encounters these triggers, it performs an unsafe or malicious action. We formalize and validate three realistic threat models targeting different layers of the supply chain: 1) direct poisoning of fine-tuning data, where an attacker controls a fraction of the training traces; 2) environmental poisoning, where malicious instructions are injected into webpages scraped or tools called while creating training data; and 3) supply chain poisoning, where a pre-backdoored base model is fine-tuned on clean data to improve its agentic capabilities. Our results are stark: by poisoning as few as 2% of the collected traces, an attacker can embed a backdoor causing an agent to leak confidential user information with over 80% success when a specific trigger is present. This vulnerability holds across all three threat models. Furthermore, we demonstrate that prominent safeguards, including two guardrail models and one weight-based defense, fail to detect or prevent the malicious behavior. These findings highlight an urgent threat to agentic AI development and underscore the critical need for rigorous security vetting of data collection processes and end-to-end model supply chains.

2025-10-02

arXiv (prépublication)

Rendering-Aware Reinforcement Learning for Vector Graphics Generation

Juan A. Rodriguez

Haotian Zhang

Aarash Feizi

Rishav Pramanik

Pascal Wichmann

Arnab Mondal

Mohammad Reza Samsami

Sai Rajeswar

Christopher Pal

Scalable Vector Graphics (SVG) offer a powerful format for representing visual designs as interpretable code. Recent advances in vision-lang… (voir plus)uage models (VLMs) have enabled high-quality SVG generation by framing the problem as a code generation task and leveraging large-scale pretraining. VLMs are particularly suitable for this task as they capture both global semantics and fine-grained visual patterns, while transferring knowledge across vision, natural language, and code domains. However, existing VLM approaches often struggle to produce faithful and efficient SVGs because they never observe the rendered images during training. Although differentiable rendering for autoregressive SVG code generation remains unavailable, rendered outputs can still be compared to original inputs, enabling evaluative feedback suitable for reinforcement learning (RL). We introduce RLRF (Reinforcement Learning from Rendering Feedback), an RL method that enhances SVG generation in autoregressive VLMs by leveraging feedback from rendered SVG outputs. Given an input image, the model generates SVG roll-outs that are rendered and compared to the original image to compute a reward. This visual fidelity feedback guides the model toward producing more accurate, efficient, and semantically coherent SVGs. RLRF significantly outperforms supervised fine-tuning, addressing common failure modes and enabling precise, high-quality SVG generation with strong structural understanding and generalization.

2025-09-17

NeurIPS.cc/2025/Conference (poster)

Evaluating and Improving LitLLMs with Deep Research

Gaurav Sahu

Shubham Agarwal

Issam Hadj Laradji

Krishnamurthy Dj Dvijotham

Jason Stanley

Laurent Charlin

Christopher Pal

Literature reviews are an essential component of scientific research, but they remain time-intensive and challenging to write, especially du… (voir plus)e to the recent influx of research papers. This paper explores the zero-shot abilities of recent Large Language Models (LLMs) in assisting with the writing of literature reviews based on an abstract. We decompose the task into two components: (1) Retrieving related works given a query abstract and (2) Writing a literature review based on the retrieved results. We analyze how effective LLMs are for both components. For retrieval, we introduce a novel two-step search strategy that first uses an LLM to extract meaningful keywords from the abstract of a paper and then retrieves potentially relevant papers by querying an external knowledge base. Additionally, we study a prompting-based re-ranking mechanism with attribution and show that re-ranking doubles the normalized recall compared to naive search methods while providing insights into the LLM's decision-making process. In the generation phase, we propose a two-step approach that first outlines a plan for the review and then executes steps in the plan to generate the actual review. To evaluate different LLM-based literature review methods, we create test sets from arXiv papers using a protocol designed for rolling use with newly released LLMs to avoid test set contamination in zero-shot evaluations. We release this evaluation protocol to promote additional research and development in this regard. Our empirical results suggest that LLMs show promising potential for writing literature reviews when the task is decomposed into smaller components of retrieval and planning. Particularly, our ``Deep Research" retrieval variant improves coverage by over 5x compared to standard keyword search, addressing a key bottleneck in the pipeline. Further, we demonstrate that our planning-based approach achieves higher-quality reviews by minimizing hallucinated references in the generated review by 18-26\% compared to existing simpler LLM-based generation methods.

2025-07-30

colmweb.org/COLM/2025/Workshop/LM4Sci (publié)

BIGCHARTS-R1: Enhanced Chart Reasoning with Visual Reinforcement Finetuning

Ahmed Masry

Masoud Hashemi

Juan A. Rodriguez

Megh Thakkar

Khyati Mahajan

Vikas Yadav

Sathwik Tejaswi Madhusudhan

Christopher Pal

Enamul Hoque

Sai Rajeswar

Charts are essential to data analysis, transforming raw data into clear visual representations that support human decision-making. Although … (voir plus)current vision-language models (VLMs) have made significant progress, they continue to struggle with chart comprehension due to training on datasets that lack diversity and real-world authenticity, or on automatically extracted underlying data tables of charts, which can contain numerous estimation errors. Furthermore, existing models only rely on supervised fine-tuning using these low-quality datasets, severely limiting their effectiveness. To address these issues, we first propose BigCharts, a dataset creation pipeline that generates visually diverse chart images by conditioning the rendering process on real-world charts sourced from multiple online platforms. Unlike purely synthetic datasets, BigCharts incorporates real-world data, ensuring authenticity and visual diversity, while still retaining accurate underlying data due to our proposed replotting process. Additionally, we introduce a comprehensive training framework that integrates supervised fine-tuning with Group Relative Policy Optimization (GRPO)-based reinforcement learning. By introducing novel reward signals specifically designed for chart reasoning, our approach enhances model robustness and generalization across diverse chart styles and domains, resulting in a state-of-the-art chart reasoning model, BigCharts-R1. Extensive experiments demonstrate that our models surpass existing methods on multiple chart question-answering benchmarks compared to even larger open-source and closed-source models.

2025-07-06

colmweb.org/COLM/2025/Conference (accepté)

DoomArena: A framework for Testing AI Agents Against Evolving Security Threats

Léo Boisvert

Mihir Bansal

Chandra Kiran Reddy Evuru

Gabriel Huang

Avinandan Bose

Maryam Fazel

Quentin Cappart

Jason Stanley

Alexandre Lacoste

Alexandre Drouin

Krishnamurthy Dj Dvijotham

We present DoomArena, a security evaluation framework for AI agents. DoomArena is designed on three principles: 1) It is a plug-in framework… (voir plus) and integrates easily into realistic agentic frameworks like BrowserGym (for web agents) and

2025-07-06

colmweb.org/COLM/2025/Conference (accepté)

Silent Sabotage: Injecting Backdoors into AI Agents Through Fine-Tuning

Léo Boisvert

Chandra Kiran Reddy Evuru

Joshua Kazdan

Avinandan Bose

Quentin Cappart

Maryam Fazel

Sai Rajeswar

Jason Stanley

Nicolas Chapados

Alexandre Drouin

Krishnamurthy Dj Dvijotham

The rise of AI agents that can use tools, browse the web and interact with computers on behalf of a user, has sparked strong interest in imp… (voir plus)roving these capabilities by explicitly fine-tuning the LLMs/VLMs that power these agents. Several researchers have proposed collecting data by letting the agents interact with their environment (e.g., a computer operating system, the web or a collection of APIs exposed as tools), and improve agent performance by fine tuning on this data. In this work, we show that such data collection can be manipulated by adversaries to insert poisoned traces. By modifying just 5% of collected traces, adversaries can embed stealthy bad behaviors into agents—like leaking confidential user information whenever the tool or webpage exposes a trigger. Our results raise important security concerns in the development of AI agents, and underscore the importance of careful scrutiny of all data collection processes used to improve agentic AI.

2025-06-07

ICML.cc/2025/Workshop/WCUA (poster)

StarVector: Generating Scalable Vector Graphics Code from Images and Text

Juan A. Rodriguez

Shubham Agarwal

Issam H. Laradji

Pau Rodríguez

Sai Rajeswar

David Vázquez

Christopher Pal

Marco Pedersoli

Scalable Vector Graphics (SVGs) are vital for modern image rendering due to their scalability and versatility. Previous SVG generation metho… (voir plus)ds have focused on curve-based vectorization, lacking semantic understanding, often producing artifacts, and struggling with SVG primitives beyond path curves. To address these issues, we introduce StarVector, a multimodal large language model for SVG generation. It performs image vectorization by understanding image semantics and using SVG primitives for compact, precise outputs. Unlike traditional methods, StarVector works directly in the SVG code space, leveraging visual understanding to apply accurate SVG primitives. To train StarVector, we create SVG-Stack, a diverse dataset of 2M samples that enables generalization across vectorization tasks and precise use of primitives like ellipses, polygons, and text. We address challenges in SVG evaluation, showing that pixel-based metrics like MSE fail to capture the unique qualities of vector graphics. We introduce SVG-Bench, a benchmark across 10 datasets, and 3 tasks: Image-to-SVG, Text-to-SVG generation, and diagram generation. Using this setup, StarVector achieves state-of-the-art performance, producing more compact and semantically rich SVGs.

2025-04-10

Proceedings of the AAAI Conference on Artificial Intelligence (publié)

LLMs for Literature Review: Are we there yet?

Shubham Agarwal

Gaurav Sahu

Issam Hadj Laradji

Krishnamurthy Dj Dvijotham

Jason Stanley

Laurent Charlin

Christopher Pal

Literature reviews are an essential component of scientific research, but they remain time-intensive and challenging to write, especially du… (voir plus)e to the recent influx of research papers. This paper explores the zero-shot abilities of recent Large Language Models (LLMs) in assisting with the writing of literature reviews based on an abstract. We decompose the task into two components: 1. Retrieving related works given a query abstract, and 2. Writing a literature review based on the retrieved results. We analyze how effective LLMs are for both components. For retrieval, we introduce a novel two-step search strategy that first uses an LLM to extract meaningful keywords from the abstract of a paper and then retrieves potentially relevant papers by querying an external knowledge base. Additionally, we study a prompting-based re-ranking mechanism with attribution and show that re-ranking doubles the normalized recall compared to naive search methods, while providing insights into the LLM's decision-making process. In the generation phase, we propose a two-step approach that first outlines a plan for the review and then executes steps in the plan to generate the actual review. To evaluate different LLM-based literature review methods, we create test sets from arXiv papers using a protocol designed for rolling use with newly released LLMs to avoid test set contamination in zero-shot evaluations. We release this evaluation protocol to promote additional research and development in this regard. Our empirical results suggest that LLMs show promising potential for writing literature reviews when the task is decomposed into smaller components of retrieval and planning. Further, we demonstrate that our planning-based approach achieves higher-quality reviews by minimizing hallucinated references in the generated review by 18-26% compared to existing simpler LLM-based generation methods.

2025-03-31

TMLR (accepté)

AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Document Understanding

Juan A. Rodriguez

Chao Wang

Akshay Kalkunte Suresh

Xiangru Jian

Pierre-Andre Noel

Sathwik Tejaswi Madhusudhan

Enamul Hoque

Christopher Pal

Issam H. Laradji

David Vázquez

Perouz Taslakian … (voir 2 de plus)

Spandana Gella

Sai Rajeswar

Aligning visual features with language embeddings is a key challenge in vision-language models (VLMs). The performance of such models hinges… (voir plus) on having a good connector that maps visual features generated by a vision encoder to a shared embedding space with the LLM while preserving semantic similarity. Existing connectors, such as multilayer perceptrons (MLPs), lack inductive bias to constrain visual features within the linguistic structure of the LLM's embedding space, making them data-hungry and prone to cross-modal misalignment. In this work, we propose a novel vision-text alignment method, AlignVLM, that maps visual features to a weighted average of LLM text embeddings. Our approach leverages the linguistic priors encoded by the LLM to ensure that visual features are mapped to regions of the space that the LLM can effectively interpret. AlignVLM is particularly effective for document understanding tasks, where visual and textual modalities are highly correlated. Our extensive experiments show that AlignVLM achieves state-of-the-art performance compared to prior alignment methods, with larger gains on document understanding tasks and under low-resource setups. We provide further analysis demonstrating its efficiency and robustness to noise.

2025-03-04

ICLR.cc/2025/Workshop/Re-Align (poster)

BigDocs: An Open Dataset for Training Multi-modal Models on Document and Code Tasks

Juan Rodriguez

Xiangru Jian

Siba Smarak Panigrahi

Akshay Kalkunte

Amirhossein Abaskohi

Pierre-Andre Noel

Sanket Biswas … (voir 23 de plus)

Sara Shanian

Ying Zhang

Noah Bolger

Kurt MacDonald

Simon Fauvel

Sathwik Tejaswi

Srinivas Sunkara

Joao Monteiro

Krishnamurthy Dj Dvijotham

Torsten Scholak

Nicolas Chapados

Sepideh Kharagani

Sean Hughes

M. Özsu

Christopher Pal

Sai Rajeswar

Multimodal AI has the potential to significantly enhance document-understanding tasks, such as processing receipts, understanding workflows,… (voir plus) extracting data from documents, and summarizing reports. Code generation tasks that require long-structured outputs can also be enhanced by multimodality. Despite this, their use in commercial applications is often limited due to limited access to training data and restrictive licensing, which hinders open access. To address these limitations, we introduce BigDocs-7.5M, a high-quality, open-access dataset comprising 7.5 million multimodal documents across 30 tasks. We use an efficient data curation process to ensure our data is high-quality and license-permissive. Our process emphasizes accountability, responsibility, and transparency through filtering rules, traceable metadata, and careful content analysis. Additionally, we introduce BigDocs-Bench, a benchmark suite with 10 novel tasks where we create datasets that reflect real-world use cases involving reasoning over Graphical User Interfaces (GUI) and code generation from images. Our experiments show that training with BigDocs-Bench improves average performance up to 25.8% over closed-source GPT-4o in document reasoning and structured output tasks such as Screenshot2HTML or Image2Latex generation. Finally, human evaluations showed a preference for outputs from models trained on BigDocs over GPT-4o. This suggests that BigDocs can help both academics and the open-source community utilize and improve AI tools to enhance multimodal capabilities and document reasoning. The project is hosted at https://bigdocs.github.io .

2025-01-21

International Conference on Learning Representations (poster)