Portrait de Zichao Li n'est pas disponible

Zichao Li

Doctorat - McGill
Superviseur⋅e principal⋅e
Co-supervisor
Sujets de recherche
Traitement du langage naturel

Publications

Partial Perspectives: How LLMs Handle Logically Inconsistent Knowledge in Reasoning Tasks
Most natural language reasoning tasks in the research community assume consistent input knowledge. Nevertheless, real-world scenarios often … (voir plus)involve inconsistent information, which might lead to divergent conclusions and are typically associated with varying levels of uncertainty. This raises a key research question: can large language models (LLMs) effectively handle uncertainty in their reasoning process to maximize knowledge consistency? In this paper, we propose a framework for evaluating reasoning over inconsistent knowledge. Our approach models uncertainty via weights of logical rules, leveraging Markov logic networks (MLN), which integrate probabilistic reasoning with first-order logic. This enables us to quantify inconsistencies in knowledge bases, and hence rigorously evaluate LLM reasoning. We introduce two tasks using this framework: 1) QA, which involves answering questions by integrating inconsistent knowledge; and 2) knowledge rectification, where we aim to rectify language models' acquired knowledge to improve consistency. We curate a dataset of 3,000 MLN-formatted knowledge bases to implement these tasks. We evaluate state-of-the-art LLMs on these tasks and highlight their limitations in uncertainty-aware reasoning over inconsistent logical knowledge.
WebMMU: A Benchmark for Multimodal Multilingual Website Understanding and Code Generation
David Vazquez
Juan A. Rodriguez
Perouz Taslakian
Sai Rajeswar
ServiceNow
WebMMU Benchmark
Understanding diverse web data and automating web development presents an exciting challenge for agentic AI. While existing benchmarks addre… (voir plus)ss isolated web-based tasks—such as website-based Visual Question Answering (VQA) and UI-to-code generation—they lack a unified evaluation suite for assessing web agents that interact with and reason about web environments. We introduce WebMMU, a large-scale benchmark for evaluating AI-driven web agents across multilingual website VQA, HTML/CSS/JavaScript code editing, and sketch-to-code generation. WebMMU provides a comprehensive evaluation suite with real-world website data, multi-step reasoning tasks, and functional UI understanding. Benchmarking state-of-the-art multimodal models on WebMMU reveals significant limitations in web-based reasoning, layout understanding, and structured code generation, particularly in preserving UI hierarchy, handling multilingual content, and producing robust, functional code. While most existing models are optimized for English-only settings, WebMMU highlights the challenges of cross-lingual adaptation in real-world web development. These findings expose critical gaps in current models’ ability to understand website structures, execute user instructions, and generate high-quality web code, underscoring the need for more advanced multimodal reasoning in AI-driven web understanding and development.
WebMMU: A Benchmark for Multimodal Multilingual Website Understanding and Code Generation
David Vazquez
Juan A. Rodriguez
Perouz Taslakian
Sai Rajeswar
Understanding diverse web data and automating web development presents an exciting challenge for agentic AI. While existing benchmarks addre… (voir plus)ss isolated web-based tasks—such as website-based Visual Question Answering (VQA) and UI-to-code generation—they lack a unified evaluation suite for assessing web agents that interact with and reason about web environments. We introduce WebMMU, a large-scale benchmark for evaluating AI-driven web agents across multilingual website VQA, HTML/CSS/JavaScript code editing, and sketch-to-code generation. WebMMU provides a comprehensive evaluation suite with real-world website data, multi-step reasoning tasks, and functional UI understanding. Benchmarking state-of-the-art multimodal models on WebMMU reveals significant limitations in web-based reasoning, layout understanding, and structured code generation, particularly in preserving UI hierarchy, handling multilingual content, and producing robust, functional code. While most existing models are optimized for English-only settings, WebMMU highlights the challenges of cross-lingual adaptation in real-world web development. These findings expose critical gaps in current models’ ability to understand website structures, execute user instructions, and generate high-quality web code, underscoring the need for more advanced multimodal reasoning in AI-driven web understanding and development.
ReTreever: Tree-based Coarse-to-Fine Representations for Retrieval
Tianyi Chen
Perouz Taslakian
Valentina Zantedeschi
Document retrieval is a core component of question-answering systems, as it enables conditioning answer generation on new and large-scale co… (voir plus)rpora. While effective, the standard practice of encoding documents into high-dimensional embeddings for similarity search entails large memory and compute footprints, and also makes it hard to inspect the inner workings of the system. In this paper, we propose a tree-based method for organizing and representing reference documents at various granular levels, which offers the flexibility to balance cost and utility, and eases the inspection of the corpus content and retrieval operations. Our method, called ReTreever, jointly learns a routing function per internal node of a binary tree such that query and reference documents are assigned to similar tree branches, hence directly optimizing for retrieval performance. Our evaluations show that ReTreever generally preserves full representation accuracy. Its hierarchical structure further provides strong coarse representations and enhances transparency by indirectly learning meaningful semantic groupings. Among hierarchical retrieval methods, ReTreever achieves the best retrieval accuracy at the lowest latency, proving that this family of techniques can be viable in practical applications.
ReTreever: Tree-based Coarse-to-Fine Representations for Retrieval
Tianyi Chen
Perouz Taslakian
Valentina Zantedeschi
BigDocs: An Open Dataset for Training Multimodal Models on Document and Code Tasks
Juan A. Rodriguez
Xiangru Jian
Siba Smarak Panigrahi
Abhay Puri
Akshay Kalkunte Suresh
François Savard
Ahmed Masry
Amirhossein Abaskohi
Pierre-Andre Noel
Mats Leon Richter
Saverio Vadacchino
Sanket Biswas … (voir 19 de plus)
Sara Shanian
Ying Zhang
Sathwik Tejaswi Madhusudhan
Joao Monteiro
Krishnamurthy Dj Dvijotham
Torsten Scholak
Sepideh Kharaghani
Sean Hughes
M. Özsu
Issam Hadj Laradji
Perouz Taslakian
David Vazquez
Sai Rajeswar
Multimodal AI has the potential to significantly enhance document-understanding tasks, such as processing receipts, understanding workflows,… (voir plus) extracting data from documents, and summarizing reports. Code generation tasks that require long-structured outputs can also be enhanced by multimodality. Despite this, their use in commercial applications is often limited due to limited access to relevant training data and restrictive licensing, which hinders open access. To address these limitations, we introduce BigDocs-7.5M, a high-quality, open-access dataset comprising 7.5 million multimodal documents across 30 tasks. We use an efficient data curation process to ensure that our data is high quality and license-permissive. Our process emphasizes accountability, responsibility, and transparency through filtering rules, traceable metadata, and careful content analysis. Additionally, we introduce BigDocs-Bench,, a benchmark suite with 10 novel tasks where we carefully create datasets that reflect real-world use cases involving reasoning over Graphical User Interfaces (GUI) and code generation from images. Our experiments show that training with BigDocs-Bench, improves average performance up to 25.8% over closed-source GPT-4o in document reasoning and structured output tasks such as Screenshot2HTML or Image2Latex generation. Finally, human evaluations revealed that participants preferred the outputs from models trained with BigDocs over those from GPT-4o. This suggests that BigDocs can help both academics and the open-source community utilize and improve AI tools to enhance multimodal capabilities and document reasoning.
BigDocs: An Open Dataset for Training Multimodal Models on Document and Code Tasks
Juan A. Rodriguez
Xiangru Jian
Siba Smarak Panigrahi
Abhay Puri
Akshay Kalkunte Suresh
François Savard
Ahmed Masry
Amirhossein Abaskohi
Pierre-Andre Noel
Mats Leon Richter
Saverio Vadacchino
Sanket Biswas … (voir 23 de plus)
Sara Shanian
Ying Zhang
Noah Bolger
Kurt MacDonald
Simon Fauvel
Sathwik Tejaswi Madhusudhan
Srinivas Sunkara
Joao Monteiro
Krishnamurthy Dj Dvijotham
Torsten Scholak
Sepideh Kharaghani
Sean Hughes
M. Özsu
Issam Hadj Laradji
Perouz Taslakian
David Vazquez
Sai Rajeswar
BigDocs: An Open and Permissively-Licensed Dataset for Training Multimodal Models on Document and Code Tasks
Xiangru Jian
Siba Smarak Panigrahi
Abhay Puri
Akshay Kalkunte
Franccois Savard
Ahmed Masry
Amirhossein Abaskohi
Pierre-Andre Noel
M. L. Richter
Saverio Vadacchino
Shubbam Agarwal
Sanket Biswas … (voir 23 de plus)
Sara Shanian
Ying Zhang
Noah Bolger
Kurt MacDonald
Simon Fauvel
Sathwik Tejaswi
Srinivas Sunkara
Joao Monteiro
Krishnamurthy Dj Dvijotham
Torsten Scholak
Sepideh Kharagani
Sean Hughes
M. Özsu
Issam Hadj Laradji
Spandanna Gella
Perouz Taslakian
David Vazquez
Sai Rajeswar
Multimodal AI has the potential to significantly enhance document-understanding tasks, such as processing receipts, understanding workflows,… (voir plus) extracting data from documents, and summarizing reports. Code generation tasks that require long-structured outputs can also be enhanced by multimodality. Despite this, their use in commercial applications is often limited due to limited access to training data and restrictive licensing, which hinders open access. To address these limitations, we introduce BigDocs-7.5M, a high-quality, open-access dataset comprising 7.5 million multimodal documents across 30 tasks. We use an efficient data curation process to ensure our data is high-quality and license-permissive. Our process emphasizes accountability, responsibility, and transparency through filtering rules, traceable metadata, and careful content analysis. Additionally, we introduce BigDocs-Bench, a benchmark suite with 10 novel tasks where we create datasets that reflect real-world use cases involving reasoning over Graphical User Interfaces (GUI) and code generation from images. Our experiments show that training with BigDocs-Bench improves average performance up to 25.8% over closed-source GPT-4o in document reasoning and structured output tasks such as Screenshot2HTML or Image2Latex generation. Finally, human evaluations showed a preference for outputs from models trained on BigDocs over GPT-4o. This suggests that BigDocs can help both academics and the open-source community utilize and improve AI tools to enhance multimodal capabilities and document reasoning. The project is hosted at https://bigdocs.github.io .
BigDocs: An Open and Permissively-Licensed Dataset for Training Multimodal Models on Document and Code Tasks
Xiangru Jian
Siba Smarak Panigrahi
Abhay Puri
Akshay Kalkunte
Franccois Savard
Ahmed Masry
Amirhossein Abaskohi
Pierre-Andre Noel
M. L. Richter
Saverio Vadacchino
Shubbam Agarwal
Sanket Biswas … (voir 23 de plus)
Sara Shanian
Ying Zhang
Noah Bolger
Kurt MacDonald
Simon Fauvel
Sathwik Tejaswi
Srinivas Sunkara
Joao Monteiro
Krishnamurthy Dj Dvijotham
Torsten Scholak
Sepideh Kharagani
Sean Hughes
M. Özsu
Issam Hadj Laradji
Spandanna Gella
Perouz Taslakian
David Vazquez
Sai Rajeswar
Multimodal AI has the potential to significantly enhance document-understanding tasks, such as processing receipts, understanding workflows,… (voir plus) extracting data from documents, and summarizing reports. Code generation tasks that require long-structured outputs can also be enhanced by multimodality. Despite this, their use in commercial applications is often limited due to limited access to training data and restrictive licensing, which hinders open access. To address these limitations, we introduce BigDocs-7.5M, a high-quality, open-access dataset comprising 7.5 million multimodal documents across 30 tasks. We use an efficient data curation process to ensure our data is high-quality and license-permissive. Our process emphasizes accountability, responsibility, and transparency through filtering rules, traceable metadata, and careful content analysis. Additionally, we introduce BigDocs-Bench, a benchmark suite with 10 novel tasks where we create datasets that reflect real-world use cases involving reasoning over Graphical User Interfaces (GUI) and code generation from images. Our experiments show that training with BigDocs-Bench improves average performance up to 25.8% over closed-source GPT-4o in document reasoning and structured output tasks such as Screenshot2HTML or Image2Latex generation. Finally, human evaluations showed a preference for outputs from models trained on BigDocs over GPT-4o. This suggests that BigDocs can help both academics and the open-source community utilize and improve AI tools to enhance multimodal capabilities and document reasoning. The project is hosted at https://bigdocs.github.io .
BigDocs: An Open and Permissively-Licensed Dataset for Training Multimodal Models on Document and Code Tasks
Xiangru Jian
Siba Smarak Panigrahi
Abhay Puri
Akshay Kalkunte
Franccois Savard
Ahmed Masry
Amirhossein Abaskohi
Pierre-Andre Noel
Mats Leon Richter
Saverio Vadacchino
Shubbam Agarwal
Sanket Biswas … (voir 23 de plus)
Sara Shanian
Ying Zhang
Noah Bolger
Kurt MacDonald
Simon Fauvel
Sathwik Tejaswi
Srinivas Sunkara
Joao Monteiro
Krishnamurthy Dj Dvijotham
Torsten Scholak
Sepideh Kharaghani
Sean Hughes
M. Özsu
Issam Hadj Laradji
Spandanna Gella
Perouz Taslakian
David Vazquez
Sai Rajeswar
Multimodal AI has the potential to significantly enhance document-understanding tasks, such as processing receipts, understanding workflows,… (voir plus) extracting data from documents, and summarizing reports. Code generation tasks that require long-structured outputs can also be enhanced by multimodality. Despite this, their use in commercial applications is often limited due to limited access to training data and restrictive licensing, which hinders open access. To address these limitations, we introduce BigDocs-7.5M, a high-quality, open-access dataset comprising 7.5 million multimodal documents across 30 tasks. We use an efficient data curation process to ensure our data is high-quality and license-permissive. Our process emphasizes accountability, responsibility, and transparency through filtering rules, traceable metadata, and careful content analysis. Additionally, we introduce BigDocs-Bench, a benchmark suite with 10 novel tasks where we create datasets that reflect real-world use cases involving reasoning over Graphical User Interfaces (GUI) and code generation from images. Our experiments show that training with BigDocs-Bench improves average performance up to 25.8% over closed-source GPT-4o in document reasoning and structured output tasks such as Screenshot2HTML or Image2Latex generation. Finally, human evaluations showed a preference for outputs from models trained on BigDocs over GPT-4o. This suggests that BigDocs can help both academics and the open-source community utilize and improve AI tools to enhance multimodal capabilities and document reasoning. The project is hosted at https://bigdocs.github.io .
BigDocs: An Open and Permissively-Licensed Dataset for Training Multimodal Models on Document and Code Tasks
Xiangru Jian
Siba Smarak Panigrahi
Abhay Puri
Akshay Kalkunte
Franccois Savard
Ahmed Masry
Amirhossein Abaskohi
Pierre-Andre Noel
Mats Leon Richter
Saverio Vadacchino
Shubbam Agarwal
Sanket Biswas … (voir 23 de plus)
Sara Shanian
Ying Zhang
Noah Bolger
Kurt MacDonald
Simon Fauvel
Sathwik Tejaswi
Srinivas Sunkara
Joao Monteiro
Krishnamurthy Dj Dvijotham
Torsten Scholak
Sepideh Kharaghani
Sean Hughes
M. Özsu
Issam Hadj Laradji
Spandanna Gella
Perouz Taslakian
David Vazquez
Sai Rajeswar
Multimodal AI has the potential to significantly enhance document-understanding tasks, such as processing receipts, understanding workflows,… (voir plus) extracting data from documents, and summarizing reports. Code generation tasks that require long-structured outputs can also be enhanced by multimodality. Despite this, their use in commercial applications is often limited due to limited access to training data and restrictive licensing, which hinders open access. To address these limitations, we introduce BigDocs-7.5M, a high-quality, open-access dataset comprising 7.5 million multimodal documents across 30 tasks. We use an efficient data curation process to ensure our data is high-quality and license-permissive. Our process emphasizes accountability, responsibility, and transparency through filtering rules, traceable metadata, and careful content analysis. Additionally, we introduce BigDocs-Bench, a benchmark suite with 10 novel tasks where we create datasets that reflect real-world use cases involving reasoning over Graphical User Interfaces (GUI) and code generation from images. Our experiments show that training with BigDocs-Bench improves average performance up to 25.8% over closed-source GPT-4o in document reasoning and structured output tasks such as Screenshot2HTML or Image2Latex generation. Finally, human evaluations showed a preference for outputs from models trained on BigDocs over GPT-4o. This suggests that BigDocs can help both academics and the open-source community utilize and improve AI tools to enhance multimodal capabilities and document reasoning. The project is hosted at https://bigdocs.github.io .
BigDocs: An Open and Permissively-Licensed Dataset for Training Multimodal Models on Document and Code Tasks
Xiangru Jian
Siba Smarak Panigrahi
Abhay Puri
Akshay Kalkunte
Franccois Savard
Ahmed Masry
Amirhossein Abaskohi
Pierre-Andre Noel
Mats Leon Richter
Saverio Vadacchino
Shubbam Agarwal
Sanket Biswas … (voir 23 de plus)
Sara Shanian
Ying Zhang
Noah Bolger
Kurt MacDonald
Simon Fauvel
Sathwik Tejaswi
Srinivas Sunkara
Joao Monteiro
Krishnamurthy Dj Dvijotham
Torsten Scholak
Sepideh Kharaghani
Sean Hughes
M. Özsu
Issam Hadj Laradji
Spandanna Gella
Perouz Taslakian
David Vazquez
Sai Rajeswar
Multimodal AI has the potential to significantly enhance document-understanding tasks, such as processing receipts, understanding workflows,… (voir plus) extracting data from documents, and summarizing reports. Code generation tasks that require long-structured outputs can also be enhanced by multimodality. Despite this, their use in commercial applications is often limited due to limited access to training data and restrictive licensing, which hinders open access. To address these limitations, we introduce BigDocs-7.5M, a high-quality, open-access dataset comprising 7.5 million multimodal documents across 30 tasks. We use an efficient data curation process to ensure our data is high-quality and license-permissive. Our process emphasizes accountability, responsibility, and transparency through filtering rules, traceable metadata, and careful content analysis. Additionally, we introduce BigDocs-Bench, a benchmark suite with 10 novel tasks where we create datasets that reflect real-world use cases involving reasoning over Graphical User Interfaces (GUI) and code generation from images. Our experiments show that training with BigDocs-Bench improves average performance up to 25.8% over closed-source GPT-4o in document reasoning and structured output tasks such as Screenshot2HTML or Image2Latex generation. Finally, human evaluations showed a preference for outputs from models trained on BigDocs over GPT-4o. This suggests that BigDocs can help both academics and the open-source community utilize and improve AI tools to enhance multimodal capabilities and document reasoning. The project is hosted at https://bigdocs.github.io .