Rabiul Awal

BigDocs: An Open Dataset for Training Multimodal Models on Document and Code Tasks

Juan A. Rodriguez

Xiangru Jian

Akshay Kalkunte Suresh

Amirhossein Abaskohi

Pierre-Andre Noel

Sanket Biswas … (voir 19 de plus)

Sara Shanian

Sathwik Tejaswi Madhusudhan

Joao Monteiro

Krishnamurthy Dj Dvijotham

Torsten Scholak

Sepideh Kharaghani

Sean Hughes

M. Özsu

Issam Hadj Laradji

Sai Rajeswar

Multimodal AI has the potential to significantly enhance document-understanding tasks, such as processing receipts, understanding workflows,… (voir plus) extracting data from documents, and summarizing reports. Code generation tasks that require long-structured outputs can also be enhanced by multimodality. Despite this, their use in commercial applications is often limited due to limited access to relevant training data and restrictive licensing, which hinders open access. To address these limitations, we introduce BigDocs-7.5M, a high-quality, open-access dataset comprising 7.5 million multimodal documents across 30 tasks. We use an efficient data curation process to ensure that our data is high quality and license-permissive. Our process emphasizes accountability, responsibility, and transparency through filtering rules, traceable metadata, and careful content analysis. Additionally, we introduce BigDocs-Bench,, a benchmark suite with 10 novel tasks where we carefully create datasets that reflect real-world use cases involving reasoning over Graphical User Interfaces (GUI) and code generation from images. Our experiments show that training with BigDocs-Bench, improves average performance up to 25.8% over closed-source GPT-4o in document reasoning and structured output tasks such as Screenshot2HTML or Image2Latex generation. Finally, human evaluations revealed that participants preferred the outputs from models trained with BigDocs over those from GPT-4o. This suggests that BigDocs can help both academics and the open-source community utilize and improve AI tools to enhance multimodal capabilities and document reasoning.

2025-01-22

ICLR.cc/2025/Conference (poster)

BigDocs: An Open Dataset for Training Multimodal Models on Document and Code Tasks

Juan A. Rodriguez

Xiangru Jian

Akshay Kalkunte Suresh

Amirhossein Abaskohi

Pierre-Andre Noel

Sanket Biswas … (voir 23 de plus)

Sara Shanian

Noah Bolger

Kurt MacDonald

Simon Fauvel

Sathwik Tejaswi Madhusudhan

Srinivas Sunkara

Joao Monteiro

Krishnamurthy Dj Dvijotham

Torsten Scholak

Sepideh Kharaghani

Sean Hughes

M. Özsu

Issam Hadj Laradji

Sai Rajeswar

2025-01-22

ICLR.cc/2025/Conference (poster)

BigDocs: An Open and Permissively-Licensed Dataset for Training Multimodal Models on Document and Code Tasks

Xiangru Jian

Akshay Kalkunte

Franccois Savard

Amirhossein Abaskohi

Pierre-Andre Noel

Shubbam Agarwal

Sanket Biswas … (voir 23 de plus)

Sara Shanian

Noah Bolger

Kurt MacDonald

Simon Fauvel

Sathwik Tejaswi

Srinivas Sunkara

Joao Monteiro

Krishnamurthy Dj Dvijotham

Torsten Scholak

Sepideh Kharaghani

Sean Hughes

M. Özsu

Issam Hadj Laradji

Spandanna Gella

Sai Rajeswar

Multimodal AI has the potential to significantly enhance document-understanding tasks, such as processing receipts, understanding workflows,… (voir plus) extracting data from documents, and summarizing reports. Code generation tasks that require long-structured outputs can also be enhanced by multimodality. Despite this, their use in commercial applications is often limited due to limited access to training data and restrictive licensing, which hinders open access. To address these limitations, we introduce BigDocs-7.5M, a high-quality, open-access dataset comprising 7.5 million multimodal documents across 30 tasks. We use an efficient data curation process to ensure our data is high-quality and license-permissive. Our process emphasizes accountability, responsibility, and transparency through filtering rules, traceable metadata, and careful content analysis. Additionally, we introduce BigDocs-Bench, a benchmark suite with 10 novel tasks where we create datasets that reflect real-world use cases involving reasoning over Graphical User Interfaces (GUI) and code generation from images. Our experiments show that training with BigDocs-Bench improves average performance up to 25.8% over closed-source GPT-4o in document reasoning and structured output tasks such as Screenshot2HTML or Image2Latex generation. Finally, human evaluations showed a preference for outputs from models trained on BigDocs over GPT-4o. This suggests that BigDocs can help both academics and the open-source community utilize and improve AI tools to enhance multimodal capabilities and document reasoning. The project is hosted at https://bigdocs.github.io .

2024-12-05

ArXiv (prépublication)

BigDocs: An Open and Permissively-Licensed Dataset for Training Multimodal Models on Document and Code Tasks

Xiangru Jian

Akshay Kalkunte

Franccois Savard

Amirhossein Abaskohi

Pierre-Andre Noel

Shubbam Agarwal

Sanket Biswas … (voir 23 de plus)

Sara Shanian

Noah Bolger

Kurt MacDonald

Simon Fauvel

Sathwik Tejaswi

Srinivas Sunkara

Joao Monteiro

Krishnamurthy Dj Dvijotham

Torsten Scholak

Sepideh Kharaghani

Sean Hughes

M. Özsu

Issam Hadj Laradji

Spandanna Gella

Sai Rajeswar

Multimodal AI has the potential to significantly enhance document-understanding tasks, such as processing receipts, understanding workflows,… (voir plus) extracting data from documents, and summarizing reports. Code generation tasks that require long-structured outputs can also be enhanced by multimodality. Despite this, their use in commercial applications is often limited due to limited access to training data and restrictive licensing, which hinders open access. To address these limitations, we introduce BigDocs-7.5M, a high-quality, open-access dataset comprising 7.5 million multimodal documents across 30 tasks. We use an efficient data curation process to ensure our data is high-quality and license-permissive. Our process emphasizes accountability, responsibility, and transparency through filtering rules, traceable metadata, and careful content analysis. Additionally, we introduce BigDocs-Bench, a benchmark suite with 10 novel tasks where we create datasets that reflect real-world use cases involving reasoning over Graphical User Interfaces (GUI) and code generation from images. Our experiments show that training with BigDocs-Bench improves average performance up to 25.8% over closed-source GPT-4o in document reasoning and structured output tasks such as Screenshot2HTML or Image2Latex generation. Finally, human evaluations showed a preference for outputs from models trained on BigDocs over GPT-4o. This suggests that BigDocs can help both academics and the open-source community utilize and improve AI tools to enhance multimodal capabilities and document reasoning. The project is hosted at https://bigdocs.github.io .

2024-12-05

ArXiv (prépublication)

BigDocs: An Open and Permissively-Licensed Dataset for Training Multimodal Models on Document and Code Tasks

Xiangru Jian

Akshay Kalkunte

Franccois Savard

Amirhossein Abaskohi

Pierre-Andre Noel

Shubbam Agarwal

Sanket Biswas … (voir 23 de plus)

Sara Shanian

Noah Bolger

Kurt MacDonald

Simon Fauvel

Sathwik Tejaswi

Srinivas Sunkara

Joao Monteiro

Krishnamurthy Dj Dvijotham

Torsten Scholak

Sepideh Kharaghani

Sean Hughes

M. Özsu

Issam Hadj Laradji

Spandanna Gella

Sai Rajeswar

Multimodal AI has the potential to significantly enhance document-understanding tasks, such as processing receipts, understanding workflows,… (voir plus) extracting data from documents, and summarizing reports. Code generation tasks that require long-structured outputs can also be enhanced by multimodality. Despite this, their use in commercial applications is often limited due to limited access to training data and restrictive licensing, which hinders open access. To address these limitations, we introduce BigDocs-7.5M, a high-quality, open-access dataset comprising 7.5 million multimodal documents across 30 tasks. We use an efficient data curation process to ensure our data is high-quality and license-permissive. Our process emphasizes accountability, responsibility, and transparency through filtering rules, traceable metadata, and careful content analysis. Additionally, we introduce BigDocs-Bench, a benchmark suite with 10 novel tasks where we create datasets that reflect real-world use cases involving reasoning over Graphical User Interfaces (GUI) and code generation from images. Our experiments show that training with BigDocs-Bench improves average performance up to 25.8% over closed-source GPT-4o in document reasoning and structured output tasks such as Screenshot2HTML or Image2Latex generation. Finally, human evaluations showed a preference for outputs from models trained on BigDocs over GPT-4o. This suggests that BigDocs can help both academics and the open-source community utilize and improve AI tools to enhance multimodal capabilities and document reasoning. The project is hosted at https://bigdocs.github.io .

2024-12-05

ArXiv (prépublication)

BigDocs: An Open and Permissively-Licensed Dataset for Training Multimodal Models on Document and Code Tasks

Xiangru Jian

Akshay Kalkunte

Franccois Savard

Amirhossein Abaskohi

Pierre-Andre Noel

M. L. Richter

Saverio Vadacchino

Shubbam Agarwal

Sanket Biswas … (voir 23 de plus)

Sara Shanian

Noah Bolger

Kurt MacDonald

Simon Fauvel

Sathwik Tejaswi

Srinivas Sunkara

Joao Monteiro

Krishnamurthy Dj Dvijotham

Torsten Scholak

Sepideh Kharagani

Sean Hughes

M. Özsu

Issam Hadj Laradji

Spandanna Gella

Sai Rajeswar

Multimodal AI has the potential to significantly enhance document-understanding tasks, such as processing receipts, understanding workflows,… (voir plus) extracting data from documents, and summarizing reports. Code generation tasks that require long-structured outputs can also be enhanced by multimodality. Despite this, their use in commercial applications is often limited due to limited access to training data and restrictive licensing, which hinders open access. To address these limitations, we introduce BigDocs-7.5M, a high-quality, open-access dataset comprising 7.5 million multimodal documents across 30 tasks. We use an efficient data curation process to ensure our data is high-quality and license-permissive. Our process emphasizes accountability, responsibility, and transparency through filtering rules, traceable metadata, and careful content analysis. Additionally, we introduce BigDocs-Bench, a benchmark suite with 10 novel tasks where we create datasets that reflect real-world use cases involving reasoning over Graphical User Interfaces (GUI) and code generation from images. Our experiments show that training with BigDocs-Bench improves average performance up to 25.8% over closed-source GPT-4o in document reasoning and structured output tasks such as Screenshot2HTML or Image2Latex generation. Finally, human evaluations showed a preference for outputs from models trained on BigDocs over GPT-4o. This suggests that BigDocs can help both academics and the open-source community utilize and improve AI tools to enhance multimodal capabilities and document reasoning. The project is hosted at https://bigdocs.github.io .

2024-12-05

ArXiv (prépublication)

BigDocs: An Open and Permissively-Licensed Dataset for Training Multimodal Models on Document and Code Tasks

Xiangru Jian

Akshay Kalkunte

Franccois Savard

Amirhossein Abaskohi

Pierre-Andre Noel

M. L. Richter

Saverio Vadacchino

Shubbam Agarwal

Sanket Biswas … (voir 23 de plus)

Sara Shanian

Noah Bolger

Kurt MacDonald

Simon Fauvel

Sathwik Tejaswi

Srinivas Sunkara

Joao Monteiro

Krishnamurthy Dj Dvijotham

Torsten Scholak

Sepideh Kharagani

Sean Hughes

M. Özsu

Issam Hadj Laradji

Spandanna Gella

Sai Rajeswar

Multimodal AI has the potential to significantly enhance document-understanding tasks, such as processing receipts, understanding workflows,… (voir plus) extracting data from documents, and summarizing reports. Code generation tasks that require long-structured outputs can also be enhanced by multimodality. Despite this, their use in commercial applications is often limited due to limited access to training data and restrictive licensing, which hinders open access. To address these limitations, we introduce BigDocs-7.5M, a high-quality, open-access dataset comprising 7.5 million multimodal documents across 30 tasks. We use an efficient data curation process to ensure our data is high-quality and license-permissive. Our process emphasizes accountability, responsibility, and transparency through filtering rules, traceable metadata, and careful content analysis. Additionally, we introduce BigDocs-Bench, a benchmark suite with 10 novel tasks where we create datasets that reflect real-world use cases involving reasoning over Graphical User Interfaces (GUI) and code generation from images. Our experiments show that training with BigDocs-Bench improves average performance up to 25.8% over closed-source GPT-4o in document reasoning and structured output tasks such as Screenshot2HTML or Image2Latex generation. Finally, human evaluations showed a preference for outputs from models trained on BigDocs over GPT-4o. This suggests that BigDocs can help both academics and the open-source community utilize and improve AI tools to enhance multimodal capabilities and document reasoning. The project is hosted at https://bigdocs.github.io .

2024-12-05

ArXiv (prépublication)

BigDocs: An Open Dataset for Training Multimodal Models on Document and Code Tasks

Juan A. Rodriguez

Xiangru Jian

Akshay Kalkunte Suresh

Amirhossein Abaskohi

Pierre-Andre Noel

Sanket Biswas … (voir 23 de plus)

Sara Shanian

Noah Bolger

Kurt MacDonald

Simon Fauvel

Sathwik Tejaswi Madhusudhan

Srinivas Sunkara

Joao Monteiro

Krishnamurthy Dj Dvijotham

Torsten Scholak

Sepideh Kharaghani

Sean Hughes

M. Özsu

Issam Hadj Laradji

Sai Rajeswar

Multimodal AI has the potential to significantly enhance document-understanding tasks, such as processing receipts, understanding workflows,… (voir plus) extracting data from documents, and summarizing reports. Code generation tasks that require long-structured outputs can also be enhanced by multimodality. Despite this, their use in commercial applications is often limited due to limited access to training data and restrictive licensing, which hinders open access. To address these limitations, we introduce BigDocs-7.5M, a high-quality, open-access dataset comprising 7.5 million multimodal documents across 30 tasks. We use an efficient data curation process to ensure our data is high-quality and license-permissive. Our process emphasizes accountability, responsibility, and transparency through filtering rules, traceable metadata, and careful content analysis. Additionally, we introduce BigDocs-Bench, a benchmark suite with 10 novel tasks where we create datasets that reflect real-world use cases involving reasoning over Graphical User Interfaces (GUI) and code generation from images. Our experiments show that training with BigDocs-Bench improves average performance up to 25.8% over closed-source GPT-4o in document reasoning and structured output tasks such as Screenshot2HTML or Image2Latex generation. Finally, human evaluations showed a preference for outputs from models trained on BigDocs over GPT-4o. This suggests that BigDocs can help both academics and the open-source community utilize and improve AI tools to enhance multimodal capabilities and document reasoning. The project is hosted at https://bigdocs.github.io .

2024-12-05

ArXiv (prépublication)

BigDocs: An Open Dataset for Training Multimodal Models on Document and Code Tasks

Juan A. Rodriguez

Xiangru Jian

Akshay Kalkunte Suresh

Amirhossein Abaskohi

Pierre-Andre Noel

Sanket Biswas … (voir 23 de plus)

Sara Shanian

Noah Bolger

Kurt MacDonald

Simon Fauvel

Sathwik Tejaswi Madhusudhan

Srinivas Sunkara

Joao Monteiro

Krishnamurthy Dj Dvijotham

Torsten Scholak

Sepideh Kharaghani

Sean Hughes

M. Özsu

Issam Hadj Laradji

Sai Rajeswar

Multimodal AI has the potential to significantly enhance document-understanding tasks, such as processing receipts, understanding workflows,… (voir plus) extracting data from documents, and summarizing reports. Code generation tasks that require long-structured outputs can also be enhanced by multimodality. Despite this, their use in commercial applications is often limited due to limited access to training data and restrictive licensing, which hinders open access. To address these limitations, we introduce BigDocs-7.5M, a high-quality, open-access dataset comprising 7.5 million multimodal documents across 30 tasks. We use an efficient data curation process to ensure our data is high-quality and license-permissive. Our process emphasizes accountability, responsibility, and transparency through filtering rules, traceable metadata, and careful content analysis. Additionally, we introduce BigDocs-Bench, a benchmark suite with 10 novel tasks where we create datasets that reflect real-world use cases involving reasoning over Graphical User Interfaces (GUI) and code generation from images. Our experiments show that training with BigDocs-Bench improves average performance up to 25.8% over closed-source GPT-4o in document reasoning and structured output tasks such as Screenshot2HTML or Image2Latex generation. Finally, human evaluations showed a preference for outputs from models trained on BigDocs over GPT-4o. This suggests that BigDocs can help both academics and the open-source community utilize and improve AI tools to enhance multimodal capabilities and document reasoning. The project is hosted at https://bigdocs.github.io .

2024-12-05

ArXiv (prépublication)

BigDocs: An Open and Permissively-Licensed Dataset for Training Multimodal Models on Document and Code Tasks

Juan A. Rodriguez

Xiangru Jian

Akshay Kalkunte Suresh

Amirhossein Abaskohi

Pierre-Andre Noel

Sanket Biswas … (voir 23 de plus)

Sara Shanian

Noah Bolger

Kurt MacDonald

Simon Fauvel

Sathwik Tejaswi Madhusudhan

Srinivas Sunkara

Joao Monteiro

Krishnamurthy Dj Dvijotham

Torsten Scholak

Sepideh Kharaghani

Sean Hughes

M. Özsu

Issam Hadj Laradji

Sai Rajeswar

Multimodal AI has the potential to significantly enhance document-understanding tasks, such as processing receipts, understanding workflows,… (voir plus) extracting data from documents, and summarizing reports. Code generation tasks that require long-structured outputs can also be enhanced by multimodality. Despite this, their use in commercial applications is often limited due to limited access to training data and restrictive licensing, which hinders open access. To address these limitations, we introduce BigDocs-7.5M, a high-quality, open-access dataset comprising 7.5 million multimodal documents across 30 tasks. We use an efficient data curation process to ensure our data is high-quality and license-permissive. Our process emphasizes accountability, responsibility, and transparency through filtering rules, traceable metadata, and careful content analysis. Additionally, we introduce BigDocs-Bench, a benchmark suite with 10 novel tasks where we create datasets that reflect real-world use cases involving reasoning over Graphical User Interfaces (GUI) and code generation from images. Our experiments show that training with BigDocs-Bench improves average performance up to 25.8% over closed-source GPT-4o in document reasoning and structured output tasks such as Screenshot2HTML or Image2Latex generation. Finally, human evaluations showed a preference for outputs from models trained on BigDocs over GPT-4o. This suggests that BigDocs can help both academics and the open-source community utilize and improve AI tools to enhance multimodal capabilities and document reasoning. The project is hosted at https://bigdocs.github.io .

2024-10-10

NeurIPS.cc/2024/Workshop/RBFM (poster)

VisMin: Visual Minimal-Change Understanding

Fine-grained understanding of objects, attributes, and relationships between objects is crucial for visual-language models (VLMs). To evalua… (voir plus)te VLMs' fine-grained understanding, existing benchmarks primarily focus on evaluating VLMs' capability to distinguish between two very similar captions given an image. In this paper, our focus is on evaluating VLMs' capability to distinguish between two very similar images given a caption. To this end, we introduce a new, challenging benchmark termed Visual Minimal-Change Understanding (VisMin), which requires models to predict the correct image-caption match given two images and two captions. Importantly, the image pair (as well as the caption pair) contains minimal changes, i.e., between the two images (as well as between the two captions), only one aspect changes at a time from among the following possible types of changes: object, attribute, count, and spatial relation. These four types of minimal changes are specifically designed to test the models' understanding of objects, attributes of objects (such as color, material, shape), counts of objects, and spatial relationships between objects. To curate our benchmark, we built an automatic pipeline using large language models and diffusion models, followed by a rigorous 4-step verification process by human annotators. Empirical experiments reveal that current VLMs exhibit notable deficiencies in understanding spatial relationships and counting abilities. Furthermore, leveraging the automated nature of our data creation process, we generate a large-scale training dataset, which we use to finetune CLIP (a foundational VLM) and Idefics2 (a multimodal large language model). Our findings show that both these models benefit significantly from fine-tuning on this data, as evident by marked improvements in fine-grained understanding across a wide range of benchmarks. Additionally, such fine-tuning improves CLIP's general image-text alignment capabilities too. All resources including the benchmark, the training data, and the finetuned model checkpoints will be released.

2024-09-25

NeurIPS.cc/2024/Conference (poster)