
Perouz Taslakian

Associate Industry Member
Adjunct Professor, McGill University
Research Topics
Multimodal Learning
Deep Learning
Vision and Language

Publications

Learning to Defer for Causal Discovery with Imperfect Experts
Oscar Clivio
Sara Magliacane
Valentina Zantedeschi
Integrating expert knowledge, e.g. from large language models, into causal discovery algorithms can be challenging when the knowledge is not guaranteed to be correct. Expert recommendations may contradict data-driven results, and their reliability can vary significantly depending on the domain or specific query. Existing methods based on soft constraints or inconsistencies in predicted causal relationships fail to account for these variations in expertise. To remedy this, we propose L2D-CD, a method for gauging the correctness of expert recommendations and optimally combining them with data-driven causal discovery results. By adapting learning-to-defer (L2D) algorithms for pairwise causal discovery (CD), we learn a deferral function that selects whether to rely on classical causal discovery methods using numerical data or expert recommendations based on textual meta-data. We evaluate L2D-CD on the canonical Tübingen pairs dataset and demonstrate its superior performance compared to both the causal discovery method and the expert used in isolation. Moreover, our approach identifies domains where the expert's performance is strong or weak. Finally, we outline a strategy for generalizing this approach to causal discovery on graphs with more than two variables, paving the way for further research in this area.
WebMMU: A Benchmark for Multimodal Multilingual Website Understanding and Code Generation
Mahsa Massoud
David Vazquez
Juan A. Rodriguez
Sai Rajeswar
ServiceNow
WebMMU Benchmark
Understanding diverse web data and automating web development presents an exciting challenge for agentic AI. While existing benchmarks address isolated web-based tasks, such as website-based Visual Question Answering (VQA) and UI-to-code generation, they lack a unified evaluation suite for assessing web agents that interact with and reason about web environments. We introduce WebMMU, a large-scale benchmark for evaluating AI-driven web agents across multilingual website VQA, HTML/CSS/JavaScript code editing, and sketch-to-code generation. WebMMU provides a comprehensive evaluation suite with real-world website data, multi-step reasoning tasks, and functional UI understanding. Benchmarking state-of-the-art multimodal models on WebMMU reveals significant limitations in web-based reasoning, layout understanding, and structured code generation, particularly in preserving UI hierarchy, handling multilingual content, and producing robust, functional code. While most existing models are optimized for English-only settings, WebMMU highlights the challenges of cross-lingual adaptation in real-world web development. These findings expose critical gaps in current models’ ability to understand website structures, execute user instructions, and generate high-quality web code, underscoring the need for more advanced multimodal reasoning in AI-driven web understanding and development.
ReTreever: Tree-based Coarse-to-Fine Representations for Retrieval
Tianyi Chen
Valentina Zantedeschi
Document retrieval is a core component of question-answering systems, as it enables conditioning answer generation on new and large-scale corpora. While effective, the standard practice of encoding documents into high-dimensional embeddings for similarity search entails large memory and compute footprints, and also makes it hard to inspect the inner workings of the system. In this paper, we propose a tree-based method for organizing and representing reference documents at various granular levels, which offers the flexibility to balance cost and utility, and eases the inspection of the corpus content and retrieval operations. Our method, called ReTreever, jointly learns a routing function per internal node of a binary tree such that query and reference documents are assigned to similar tree branches, hence directly optimizing for retrieval performance. Our evaluations show that ReTreever generally preserves full representation accuracy. Its hierarchical structure further provides strong coarse representations and enhances transparency by indirectly learning meaningful semantic groupings. Among hierarchical retrieval methods, ReTreever achieves the best retrieval accuracy at the lowest latency, proving that this family of techniques can be viable in practical applications.
BigDocs: An Open Dataset for Training Multimodal Models on Document and Code Tasks
Juan A. Rodriguez
Xiangru Jian
Siba Smarak Panigrahi
Abhay Puri
Akshay Kalkunte Suresh
François Savard
Mahsa Massoud
Amirhossein Abaskohi
Pierre-Andre Noel
Mats Leon Richter
Saverio Vadacchino
Sanket Biswas
Sara Shanian
Ying Zhang
Sathwik Tejaswi Madhusudhan
Joao Monteiro
Krishnamurthy Dj Dvijotham
Torsten Scholak
Sepideh Kharaghani
Sean Hughes
M. Özsu
Issam Hadj Laradji
David Vazquez
Sai Rajeswar
Multimodal AI has the potential to significantly enhance document-understanding tasks, such as processing receipts, understanding workflows, extracting data from documents, and summarizing reports. Code generation tasks that require long-structured outputs can also be enhanced by multimodality. Despite this, their use in commercial applications is often limited due to limited access to relevant training data and restrictive licensing, which hinders open access. To address these limitations, we introduce BigDocs-7.5M, a high-quality, open-access dataset comprising 7.5 million multimodal documents across 30 tasks. We use an efficient data curation process to ensure that our data is high quality and license-permissive. Our process emphasizes accountability, responsibility, and transparency through filtering rules, traceable metadata, and careful content analysis. Additionally, we introduce BigDocs-Bench, a benchmark suite with 10 novel tasks where we carefully create datasets that reflect real-world use cases involving reasoning over Graphical User Interfaces (GUI) and code generation from images. Our experiments show that training with BigDocs-Bench improves average performance up to 25.8% over closed-source GPT-4o in document reasoning and structured output tasks such as Screenshot2HTML or Image2Latex generation. Finally, human evaluations revealed that participants preferred the outputs from models trained with BigDocs over those from GPT-4o. This suggests that BigDocs can help both academics and the open-source community utilize and improve AI tools to enhance multimodal capabilities and document reasoning.
BigDocs: An Open Dataset for Training Multimodal Models on Document and Code Tasks
Juan A. Rodriguez
Xiangru Jian
Siba Smarak Panigrahi
Abhay Puri
Akshay Kalkunte Suresh
François Savard
Mahsa Massoud
Amirhossein Abaskohi
Pierre-Andre Noel
Mats Leon Richter
Saverio Vadacchino
Sanket Biswas
Sara Shanian
Ying Zhang
Noah Bolger
Kurt MacDonald
Simon Fauvel
Sathwik Tejaswi Madhusudhan
Srinivas Sunkara
Joao Monteiro
Krishnamurthy Dj Dvijotham
Torsten Scholak
Sepideh Kharaghani
Sean Hughes
M. Özsu
Issam Hadj Laradji
David Vazquez
Sai Rajeswar
InsightBench: Evaluating Business Analytics Agents Through Multi-Step Insight Generation
Abhay Puri
Juan A. Rodriguez
Amirhossein Abaskohi
Mohammad Chegini
Valentina Zantedeschi
Alexandre Lacoste
David Vazquez
Sai Rajeswar
Issam Hadj Laradji
VCR: Pixel-Level Complex Reasoning by Restoring Occluded Text
We introduce Visual Caption Restoration (VCR), a novel vision-language task that challenges models to accurately restore partially obscured texts using pixel-level hints within images through complex reasoning. This task stems from the observation that text embedded in images intrinsically differs from common visual elements and text due to the need to align the modalities of vision, text, and text embedded in images. While many works incorporate text into images for visual question answering, they mostly rely on OCR or masked language modeling, reducing the task to text-based processing. However, text-based processing becomes ineffective in VCR as accurate text restoration depends on the combined information from provided images, context, and subtle cues from the tiny, exposed areas of masked texts. We develop a pipeline to generate synthetic images for the VCR task using image-caption pairs, with adjustable caption visibility to control the task difficulty. With this pipeline, we construct VCR-WIKI for VCR using Wikipedia images with captions, including 2.11M English and 346K Chinese training entities, plus 5K validation and 5K test entities in both languages, each in easy and hard configurations. We also make a hidden test set, VCR-HIDDEN, to avoid potential overfitting on VCR-WIKI. Our results reveal that current vision-language models significantly lag behind human performance in the VCR task, and merely fine-tuning the models on our dataset does not lead to notable improvements. We release VCR-WIKI and the data construction code to facilitate future research.
VCR: A Task for Pixel-Level Complex Reasoning in Vision Language Models via Restoring Occluded Text