Portrait of Abhay Puri is unavailable

Abhay Puri

Alumni

Publications

BigDocs: An Open Dataset for Training Multimodal Models on Document and Code Tasks
Juan A. Rodriguez
Xiangru Jian
Akshay Kalkunte Suresh
Amirhossein Abaskohi
Pierre-Andre Noel
Sanket Biswas … (see 23 more)
Sara Shanian
Noah Bolger
Kurt MacDonald
Simon Fauvel
Sathwik Tejaswi Madhusudhan
Srinivas Sunkara
Joao Monteiro
Krishnamurthy Dj Dvijotham
Torsten Scholak
Sepideh Kharaghani
Sean Hughes
M. Özsu
Issam Hadj Laradji
David Vazquez
Sai Rajeswar
InsightBench: Evaluating Business Analytics Agents Through Multi-Step Insight Generation
Juan A. Rodriguez
Amirhossein Abaskohi
Mohammad Chegini
Valentina Zantedeschi
Alexandre Lacoste
David Vazquez
Sai Rajeswar
Issam Hadj Laradji
LitLLMs, LLMs for Literature Review: Are we there yet?
Issam Hadj Laradji
Krishnamurthy Dj Dvijotham
Jason Stanley
LLMs for Literature Review: Are we there yet?
Issam Hadj Laradji
Krishnamurthy Dj Dvijotham
Jason Stanley
Literature reviews are an essential component of scientific research, but they remain time-intensive and challenging to write, especially du… (see more)e to the recent influx of research papers. This paper explores the zero-shot abilities of recent Large Language Models (LLMs) in assisting with the writing of literature reviews based on an abstract. We decompose the task into two components: 1. Retrieving related works given a query abstract, and 2. Writing a literature review based on the retrieved results. We analyze how effective LLMs are for both components. For retrieval, we introduce a novel two-step search strategy that first uses an LLM to extract meaningful keywords from the abstract of a paper and then retrieves potentially relevant papers by querying an external knowledge base. Additionally, we study a prompting-based re-ranking mechanism with attribution and show that re-ranking doubles the normalized recall compared to naive search methods, while providing insights into the LLM's decision-making process. In the generation phase, we propose a two-step approach that first outlines a plan for the review and then executes steps in the plan to generate the actual review. To evaluate different LLM-based literature review methods, we create test sets from arXiv papers using a protocol designed for rolling use with newly released LLMs to avoid test set contamination in zero-shot evaluations. We release this evaluation protocol to promote additional research and development in this regard. Our empirical results suggest that LLMs show promising potential for writing literature reviews when the task is decomposed into smaller components of retrieval and planning. Further, we demonstrate that our planning-based approach achieves higher-quality reviews by minimizing hallucinated references in the generated review by 18-26% compared to existing simpler LLM-based generation methods.
LLMs for Literature Review: Are we there yet?
Issam Hadj Laradji
Krishnamurthy Dj Dvijotham
Jason Stanley
Literature reviews are an essential component of scientific research, but they remain time-intensive and challenging to write, especially du… (see more)e to the recent influx of research papers. This paper explores the zero-shot abilities of recent Large Language Models (LLMs) in assisting with the writing of literature reviews based on an abstract. We decompose the task into two components: 1. Retrieving related works given a query abstract, and 2. Writing a literature review based on the retrieved results. We analyze how effective LLMs are for both components. For retrieval, we introduce a novel two-step search strategy that first uses an LLM to extract meaningful keywords from the abstract of a paper and then retrieves potentially relevant papers by querying an external knowledge base. Additionally, we study a prompting-based re-ranking mechanism with attribution and show that re-ranking doubles the normalized recall compared to naive search methods, while providing insights into the LLM's decision-making process. In the generation phase, we propose a two-step approach that first outlines a plan for the review and then executes steps in the plan to generate the actual review. To evaluate different LLM-based literature review methods, we create test sets from arXiv papers using a protocol designed for rolling use with newly released LLMs to avoid test set contamination in zero-shot evaluations. We release this evaluation protocol to promote additional research and development in this regard. Our empirical results suggest that LLMs show promising potential for writing literature reviews when the task is decomposed into smaller components of retrieval and planning. Further, we demonstrate that our planning-based approach achieves higher-quality reviews by minimizing hallucinated references in the generated review by 18-26% compared to existing simpler LLM-based generation methods.
BigDocs: An Open and Permissively-Licensed Dataset for Training Multimodal Models on Document and Code Tasks
Xiangru Jian
Akshay Kalkunte
Franccois Savard
Amirhossein Abaskohi
Pierre-Andre Noel
Shubbam Agarwal
Sanket Biswas … (see 23 more)
Sara Shanian
Noah Bolger
Kurt MacDonald
Simon Fauvel
Sathwik Tejaswi
Srinivas Sunkara
Joao Monteiro
Krishnamurthy Dj Dvijotham
Torsten Scholak
Sepideh Kharaghani
Sean Hughes
M. Özsu
Issam Hadj Laradji
Spandanna Gella
David Vazquez
Sai Rajeswar
Multimodal AI has the potential to significantly enhance document-understanding tasks, such as processing receipts, understanding workflows,… (see more) extracting data from documents, and summarizing reports. Code generation tasks that require long-structured outputs can also be enhanced by multimodality. Despite this, their use in commercial applications is often limited due to limited access to training data and restrictive licensing, which hinders open access. To address these limitations, we introduce BigDocs-7.5M, a high-quality, open-access dataset comprising 7.5 million multimodal documents across 30 tasks. We use an efficient data curation process to ensure our data is high-quality and license-permissive. Our process emphasizes accountability, responsibility, and transparency through filtering rules, traceable metadata, and careful content analysis. Additionally, we introduce BigDocs-Bench, a benchmark suite with 10 novel tasks where we create datasets that reflect real-world use cases involving reasoning over Graphical User Interfaces (GUI) and code generation from images. Our experiments show that training with BigDocs-Bench improves average performance up to 25.8% over closed-source GPT-4o in document reasoning and structured output tasks such as Screenshot2HTML or Image2Latex generation. Finally, human evaluations showed a preference for outputs from models trained on BigDocs over GPT-4o. This suggests that BigDocs can help both academics and the open-source community utilize and improve AI tools to enhance multimodal capabilities and document reasoning. The project is hosted at https://bigdocs.github.io .
BigDocs: An Open and Permissively-Licensed Dataset for Training Multimodal Models on Document and Code Tasks
Xiangru Jian
Akshay Kalkunte
Franccois Savard
Amirhossein Abaskohi
Pierre-Andre Noel
M. L. Richter
Shubbam Agarwal
Sanket Biswas … (see 23 more)
Sara Shanian
Noah Bolger
Kurt MacDonald
Simon Fauvel
Sathwik Tejaswi
Srinivas Sunkara
Joao Monteiro
Krishnamurthy Dj Dvijotham
Torsten Scholak
Sepideh Kharagani
Sean Hughes
M. Özsu
Issam Hadj Laradji
Spandanna Gella
David Vazquez
Sai Rajeswar
Multimodal AI has the potential to significantly enhance document-understanding tasks, such as processing receipts, understanding workflows,… (see more) extracting data from documents, and summarizing reports. Code generation tasks that require long-structured outputs can also be enhanced by multimodality. Despite this, their use in commercial applications is often limited due to limited access to training data and restrictive licensing, which hinders open access. To address these limitations, we introduce BigDocs-7.5M, a high-quality, open-access dataset comprising 7.5 million multimodal documents across 30 tasks. We use an efficient data curation process to ensure our data is high-quality and license-permissive. Our process emphasizes accountability, responsibility, and transparency through filtering rules, traceable metadata, and careful content analysis. Additionally, we introduce BigDocs-Bench, a benchmark suite with 10 novel tasks where we create datasets that reflect real-world use cases involving reasoning over Graphical User Interfaces (GUI) and code generation from images. Our experiments show that training with BigDocs-Bench improves average performance up to 25.8% over closed-source GPT-4o in document reasoning and structured output tasks such as Screenshot2HTML or Image2Latex generation. Finally, human evaluations showed a preference for outputs from models trained on BigDocs over GPT-4o. This suggests that BigDocs can help both academics and the open-source community utilize and improve AI tools to enhance multimodal capabilities and document reasoning. The project is hosted at https://bigdocs.github.io .
BigDocs: An Open and Permissively-Licensed Dataset for Training Multimodal Models on Document and Code Tasks
Xiangru Jian
Akshay Kalkunte
Franccois Savard
Amirhossein Abaskohi
Pierre-Andre Noel
M. L. Richter
Shubbam Agarwal
Sanket Biswas … (see 23 more)
Sara Shanian
Noah Bolger
Kurt MacDonald
Simon Fauvel
Sathwik Tejaswi
Srinivas Sunkara
Joao Monteiro
Krishnamurthy Dj Dvijotham
Torsten Scholak
Sepideh Kharagani
Sean Hughes
M. Özsu
Issam Hadj Laradji
Spandanna Gella
David Vazquez
Sai Rajeswar
Multimodal AI has the potential to significantly enhance document-understanding tasks, such as processing receipts, understanding workflows,… (see more) extracting data from documents, and summarizing reports. Code generation tasks that require long-structured outputs can also be enhanced by multimodality. Despite this, their use in commercial applications is often limited due to limited access to training data and restrictive licensing, which hinders open access. To address these limitations, we introduce BigDocs-7.5M, a high-quality, open-access dataset comprising 7.5 million multimodal documents across 30 tasks. We use an efficient data curation process to ensure our data is high-quality and license-permissive. Our process emphasizes accountability, responsibility, and transparency through filtering rules, traceable metadata, and careful content analysis. Additionally, we introduce BigDocs-Bench, a benchmark suite with 10 novel tasks where we create datasets that reflect real-world use cases involving reasoning over Graphical User Interfaces (GUI) and code generation from images. Our experiments show that training with BigDocs-Bench improves average performance up to 25.8% over closed-source GPT-4o in document reasoning and structured output tasks such as Screenshot2HTML or Image2Latex generation. Finally, human evaluations showed a preference for outputs from models trained on BigDocs over GPT-4o. This suggests that BigDocs can help both academics and the open-source community utilize and improve AI tools to enhance multimodal capabilities and document reasoning. The project is hosted at https://bigdocs.github.io .
BigDocs: An Open and Permissively-Licensed Dataset for Training Multimodal Models on Document and Code Tasks
Xiangru Jian
Akshay Kalkunte
Franccois Savard
Amirhossein Abaskohi
Pierre-Andre Noel
Shubbam Agarwal
Sanket Biswas … (see 23 more)
Sara Shanian
Noah Bolger
Kurt MacDonald
Simon Fauvel
Sathwik Tejaswi
Srinivas Sunkara
Joao Monteiro
Krishnamurthy Dj Dvijotham
Torsten Scholak
Sepideh Kharaghani
Sean Hughes
M. Özsu
Issam Hadj Laradji
Spandanna Gella
David Vazquez
Sai Rajeswar
Multimodal AI has the potential to significantly enhance document-understanding tasks, such as processing receipts, understanding workflows,… (see more) extracting data from documents, and summarizing reports. Code generation tasks that require long-structured outputs can also be enhanced by multimodality. Despite this, their use in commercial applications is often limited due to limited access to training data and restrictive licensing, which hinders open access. To address these limitations, we introduce BigDocs-7.5M, a high-quality, open-access dataset comprising 7.5 million multimodal documents across 30 tasks. We use an efficient data curation process to ensure our data is high-quality and license-permissive. Our process emphasizes accountability, responsibility, and transparency through filtering rules, traceable metadata, and careful content analysis. Additionally, we introduce BigDocs-Bench, a benchmark suite with 10 novel tasks where we create datasets that reflect real-world use cases involving reasoning over Graphical User Interfaces (GUI) and code generation from images. Our experiments show that training with BigDocs-Bench improves average performance up to 25.8% over closed-source GPT-4o in document reasoning and structured output tasks such as Screenshot2HTML or Image2Latex generation. Finally, human evaluations showed a preference for outputs from models trained on BigDocs over GPT-4o. This suggests that BigDocs can help both academics and the open-source community utilize and improve AI tools to enhance multimodal capabilities and document reasoning. The project is hosted at https://bigdocs.github.io .
BigDocs: An Open and Permissively-Licensed Dataset for Training Multimodal Models on Document and Code Tasks
Xiangru Jian
Akshay Kalkunte
Franccois Savard
Amirhossein Abaskohi
Pierre-Andre Noel
Shubbam Agarwal
Sanket Biswas … (see 23 more)
Sara Shanian
Noah Bolger
Kurt MacDonald
Simon Fauvel
Sathwik Tejaswi
Srinivas Sunkara
Joao Monteiro
Krishnamurthy Dj Dvijotham
Torsten Scholak
Sepideh Kharaghani
Sean Hughes
M. Özsu
Issam Hadj Laradji
Spandanna Gella
David Vazquez
Sai Rajeswar
Multimodal AI has the potential to significantly enhance document-understanding tasks, such as processing receipts, understanding workflows,… (see more) extracting data from documents, and summarizing reports. Code generation tasks that require long-structured outputs can also be enhanced by multimodality. Despite this, their use in commercial applications is often limited due to limited access to training data and restrictive licensing, which hinders open access. To address these limitations, we introduce BigDocs-7.5M, a high-quality, open-access dataset comprising 7.5 million multimodal documents across 30 tasks. We use an efficient data curation process to ensure our data is high-quality and license-permissive. Our process emphasizes accountability, responsibility, and transparency through filtering rules, traceable metadata, and careful content analysis. Additionally, we introduce BigDocs-Bench, a benchmark suite with 10 novel tasks where we create datasets that reflect real-world use cases involving reasoning over Graphical User Interfaces (GUI) and code generation from images. Our experiments show that training with BigDocs-Bench improves average performance up to 25.8% over closed-source GPT-4o in document reasoning and structured output tasks such as Screenshot2HTML or Image2Latex generation. Finally, human evaluations showed a preference for outputs from models trained on BigDocs over GPT-4o. This suggests that BigDocs can help both academics and the open-source community utilize and improve AI tools to enhance multimodal capabilities and document reasoning. The project is hosted at https://bigdocs.github.io .
BigDocs: An Open and Permissively-Licensed Dataset for Training Multimodal Models on Document and Code Tasks
Juan A. Rodriguez
Xiangru Jian
Akshay Kalkunte Suresh
Amirhossein Abaskohi
Pierre-Andre Noel
Sanket Biswas … (see 23 more)
Sara Shanian
Noah Bolger
Kurt MacDonald
Simon Fauvel
Sathwik Tejaswi Madhusudhan
Srinivas Sunkara
Joao Monteiro
Krishnamurthy Dj Dvijotham
Torsten Scholak
Sepideh Kharaghani
Sean Hughes
M. Özsu
Issam Hadj Laradji
David Vazquez
Sai Rajeswar
Multimodal AI has the potential to significantly enhance document-understanding tasks, such as processing receipts, understanding workflows,… (see more) extracting data from documents, and summarizing reports. Code generation tasks that require long-structured outputs can also be enhanced by multimodality. Despite this, their use in commercial applications is often limited due to limited access to training data and restrictive licensing, which hinders open access. To address these limitations, we introduce BigDocs-7.5M, a high-quality, open-access dataset comprising 7.5 million multimodal documents across 30 tasks. We use an efficient data curation process to ensure our data is high-quality and license-permissive. Our process emphasizes accountability, responsibility, and transparency through filtering rules, traceable metadata, and careful content analysis. Additionally, we introduce BigDocs-Bench, a benchmark suite with 10 novel tasks where we create datasets that reflect real-world use cases involving reasoning over Graphical User Interfaces (GUI) and code generation from images. Our experiments show that training with BigDocs-Bench improves average performance up to 25.8% over closed-source GPT-4o in document reasoning and structured output tasks such as Screenshot2HTML or Image2Latex generation. Finally, human evaluations showed a preference for outputs from models trained on BigDocs over GPT-4o. This suggests that BigDocs can help both academics and the open-source community utilize and improve AI tools to enhance multimodal capabilities and document reasoning. The project is hosted at https://bigdocs.github.io .
InsightBench: Evaluating Business Analytics Agents Through Multi-Step Insight Generation
Juan A. Rodriguez
Valentina Zantedeschi
Alexandre Lacoste
David Vazquez
Sai Rajeswar
Issam Hadj Laradji
Data analytics is essential for extracting valuable insights from data that can assist organizations in making effective decisions. We intro… (see more)duce InsightBench, a benchmark dataset with three key features. First, it consists of 100 datasets representing diverse business use cases such as finance and incident management, each accompanied by a carefully curated set of insights planted in the datasets. Second, unlike existing benchmarks focusing on answering single queries, InsightBench evaluates agents based on their ability to perform end-to-end data analytics, including formulating questions, interpreting answers, and generating a summary of insights and actionable steps. Third, we conducted comprehensive quality assurance to ensure that each dataset in the benchmark had clear goals and included relevant and meaningful questions and analysis. Furthermore, we implement a two-way evaluation mechanism using LLaMA-3 as an effective, open-source evaluator to assess agents' ability to extract insights. We also propose AgentPoirot, our baseline data analysis agent capable of performing end-to-end data analytics. Our evaluation on InsightBench shows that AgentPoirot outperforms existing approaches (such as Pandas Agent) that focus on resolving single queries. We also compare the performance of open- and closed-source LLMs and various evaluation strategies. Overall, this benchmark serves as a testbed to motivate further development in comprehensive automated data analytics.