Ying Zhang

Alumni

Publications

BigDocs: An Open Dataset for Training Multimodal Models on Document and Code Tasks
Juan A. Rodriguez
Xiangru Jian
Akshay Kalkunte Suresh
Amirhossein Abaskohi
Pierre-Andre Noel
Sanket Biswas
Sara Shanian
Noah Bolger
Kurt MacDonald
Simon Fauvel
Sathwik Tejaswi Madhusudhan
Srinivas Sunkara
Joao Monteiro
Krishnamurthy Dj Dvijotham
Torsten Scholak
Sepideh Kharaghani
Sean Hughes
M. Özsu
Issam Hadj Laradji
David Vazquez
Sai Rajeswar
Multimodal AI has the potential to significantly enhance document-understanding tasks, such as processing receipts, understanding workflows, extracting data from documents, and summarizing reports. Code generation tasks that require long-structured outputs can also be enhanced by multimodality. Despite this, their use in commercial applications is often limited due to limited access to relevant training data and restrictive licensing, which hinders open access. To address these limitations, we introduce BigDocs-7.5M, a high-quality, open-access dataset comprising 7.5 million multimodal documents across 30 tasks. We use an efficient data curation process to ensure that our data is high quality and license-permissive. Our process emphasizes accountability, responsibility, and transparency through filtering rules, traceable metadata, and careful content analysis. Additionally, we introduce BigDocs-Bench, a benchmark suite with 10 novel tasks where we carefully create datasets that reflect real-world use cases involving reasoning over Graphical User Interfaces (GUI) and code generation from images. Our experiments show that training with BigDocs-Bench improves average performance up to 25.8% over closed-source GPT-4o in document reasoning and structured output tasks such as Screenshot2HTML or Image2Latex generation. Finally, human evaluations revealed that participants preferred the outputs from models trained with BigDocs over those from GPT-4o. This suggests that BigDocs can help both academics and the open-source community utilize and improve AI tools to enhance multimodal capabilities and document reasoning.
Evaluating Numeracy of Language Models as a Natural Language Inference Task
Rahmad Mahendra
Damiano Spina
Lawrence Cavedon
Karin Verspoor
While recent advancements in large language models (LLMs) have enhanced their capabilities to solve mathematical problems, other aspects of numeracy remain underexplored. In this paper, we propose a benchmark to evaluate the ability of language models to perform basic numeracy tasks. We frame numeracy as a Natural Language Inference (NLI) task to assess the models’ ability to understand both numbers and language contexts. We evaluate 49 language models (LMs), including fine-tuned LMs on NLI datasets, instruction-tuned LLMs, and specialized math-LLMs. Our findings reveal three main insights: (1) LLMs only clearly outperform smaller LMs in arithmetic tasks, indicating that mathematical reasoning cannot be generalized to other numeracy skills such as number comparison and normalization; (2) while most language models achieve fair to good accuracy for NLI entailment cases, they still struggle to predict contradiction and neutral cases; and (3) the robustness of language models’ numeracy capabilities needs improvement, particularly in understanding the semantics and pragmatics of numbers in linguistic contexts.
Correction: CEPC Technical Design Report: Accelerator
Waleed Abdallah
Tiago Carlos Adorno de Freitas
Konstantin Afanaciev
Shakeel Ahmad
Ijaz Ahmed
Xiaocong Ai
Abid Aleem
Wolfgang Altmannshofer
Fabio Alves
Weiming An
Rui An
Daniele Paolo Anderle
D. Anderle
Stefan Antusch
Yasuo Arai
Andrej Arbuzov
Abdesslam Arhrib
A. Arhrib
Mustafa Ashry
Sha Bai
Yang Bai
Vipul Bairathi
Csaba Balazs
Philip Bambade
Yong Ban
Triparno Bandyopadhyay
Shou-Shan Bao
Desmond P. Barber
Ayşe Bat
Varvara Batozskaya
Subash Chandra Behera
Alexander Belyaev
Michele Bertucci
Xiao-Jun Bi
Yuanjie Bi
Tianjian Bian
Tingting Bian
Fabrizio Bianchi
Thomas Biekötter
Michela Biglietti
Shalva Bilanishvili
Deng Binglin
Lingling Men
Denis Bodrov
Anton Bogomyagkov
Serge Bondarenko
Stewart Boogert
Maarten Boonekamp
Marcello Borri
M. Borri
Angelo Bosotti
Vincent Boudry
Mohammed Boukidi
Igor Boyko
Ivanka Bozovic
Giuseppe Bozzi
Jean-Claude Brient
J. Brient
Anastasiia Budzinskaya
Masroor Bukhari
Vladimir Bytev
Giacomo Cacciapaglia
Hua Cai
Wenyong Cai
Wujun Cai
Yijian Cai
Yizhou Cai
Yuchen Cai
Haiying Cai
Huacheng Cai
Lorenzo Calibbi
Junsong Cang
Guofu Cao
Jianshe Cao
Antoine Chance
Xuejun Chang
Yue Chang
Zhe Chang
Xinyuan Chang
Wei Chao
Auttakit Chatrabhuti
Yimin Che
Yuzhi Che
Bin Chen
Danping Chen
Fuqing Chen
Fusan Chen
Gang Chen
Guoming Chen
Hua-Xing Chen
Huirun Chen
Jinhui Chen
Ji-Yuan Chen
Kai Chen
Mali Chen
Mingjun Chen
Mingshui Chen
Ning Chen
Shanhong Chen
Shanzhen Chen
Shao-Long Chen
Shaomin Chen
Shiqiang Chen
Tianlu Chen
Wei Chen
Xiang Chen
Xiaoyu Chen
Xin Chen
Xun Chen
Xurong Chen
Ye Chen
Ying Chen
Yukai Chen
Zelin Chen
Zilin Chen
Boping Chen
Chunhui Chen
Haifeng Cheng
Huajie Cheng
Hok Chuen Cheng
Shan Cheng
Tongguang Cheng
Yunlong Chi
Pietro Chimenti
Wen Han Chiu
Guk Cho
Mingxing Chu
Ming-Chung Chu
X. Chu
Xiaotong Chu
Ziliang Chu
Guglielmo Coloretti
Andreas Crivellin
Hanhua Cui
Xiaohao Cui
Zhaoyuan Cui
B. D’Anzi
Brunella D’Anzi
Ling-Yun Dai
Xinchen Dai
Xuwen Dai
Antonio De Maria
Nicola De Filippis
Christophe De La Taille
Francesca De Mori
Chiara De Sio
Elisa Del Core
Shuangxue Deng
Wei Deng
Wei-Tian Deng
Zhi Deng
Ziyan Deng
Bhupal Dev
Tang Dewen
Biagio Di Micco
Ran Ding
Siqin Ding
Yadong Ding
Haiyi Dong
Jianing Dong
Jing Dong
Lan Dong
Mingyi Dong
Xu Dong
Yipei Dong
Yubing Dong
Milos Dordevic
Marco Drewes
Mingxuan Du
Qianqian Du
Xiaokang Du
Yanyan Du
Yong Du
Yunfei Du
Chun-Gui Duan
Zhe Duan
Yahor Dydyshka
Ulrik Egede
Walaa Elmetenawee
Yun Eo
Ka Yan Fan
Kuanjun Fan
Yunyun Fan
Bo Fang
Shuangshi Fang
Yuquan Fang
Ada Farilla
Riccardo Farinelli
Muhammad Farooq
A. F. Golfe
Almaz Fazliakhmetov
Angeles Faus Golfe
Rujun Fei
Bo Feng
Chong Feng
Junhua Feng
Xu Feng
Zhuoran Feng
Luis Roberto Flores Castillo
Etienne Forest
Andrew Fowlie
H. Fox
Harald Fox
Hai-Bing Fu
Jinyu Fu
Benjamin Fuks
Yoshihiro Funakoshi
Emidio Gabrielli
Nan Gan
Li Gang
Meisen Gao
Wenbin Gao
Wenchun Gao
Yu Gao
Yuanning Gao
Zhanxiang Gao
Yanyan Gao
Kun Ge
Shao-Feng Ge
Zhenwu Ge
Li-Sheng Geng
Qinglin Geng
Hao Zeng
Chao-Qiang Geng
Swagata Ghosh
Antonio Gioiosa
Leonid Gladilin
Ti Gong
Stefania Gori
Quanbu Gou
Sebastian Grinstein
Chenxi Gu
Gerardo Guillermo
Joao Guimaraes da Costa
Dizhou Guo
Fangyi Guo
Jiacheng Guo
Jun Guo
Lei Guo
Xia Guo
Xinyang Guo
Xin-Heng Guo
Yunqiang Guo
Yuping Guo
Yun Guo
Zhi-Hui Guo
Alejandro Gutiérrez-Rodríguez
Seungkyu Ha
Noman Habib
Jan Hajer
Francois Hammer
Chengcheng Han
Huayong Han
Jifeng Han
Liangliang Han
Liang Han
Rao Zhang
Yang Han
Ruixiong Han
Yezi Han
Yuanying Han
Tao Han
Jiankui Hao
Xiqing Hao
Qiang Zhao
Chuanqi He
Dayong He
Dongbing He
Guangyuan He
Hong-Jian He
Jibo He
Jun He
Longyan He
Xiang He
Xiao-Gang He
Zhenqiang He
Klaus Heinemann
Sven Heinemeyer
Yuekun Heng
María A. Hernández-Ruíz
Jiamin Hong
Yuenkeung Hor
George W. S. Hou
Xiantao Hou
X. Hou
Xiaonan Hou
Zhilong Hou
Suen Hou
Caishi Hu
Chen Hu
Dake Hu
Haiming Hu
Jiagen Hu
Jun Hu
Kun Hu
Shouyang Hu
Yongcai Hu
Yu Hu
Zhen Hu
Z. Hua
Zhehao Hua
Jianfei Hua
Chao-Shang Huang
Fa Peng Huang
Guangshun Huang
Jinshu Huang
Ke Huang
Liangsheng Huang
Shuhui Huang
Xingtao Huang
Xu-Guang Huang
Yanping Huang
Yonggang Huang
Yongsheng Huang
Zimiao Huang
Yuanyuan Wei
Chen Huanyuan
Changgi Huh
Jiaqi Hui
Lihua Huo
Talab Hussain
Kyuyeong Hwang
Ara Ioannisian
Munawar Iqbal
Paul Jackson
Shahriyar Jafarzade
Haeun Jang
Seoyun Jang
Daheng Ji
Q. Ji
Qingping Ji
Quan Ji
Xiaolu Ji
Jingguang Jia
Jinsheng Jia
X. Q. Jia
Xuewei Jia
Zihang Jia
Cailian Jiang
Han Ren Jiang
Houbing Jiang
Jun Jiang
Xiaowei Jiang
Xin Jiang
Xuhui Jiang
Yongcheng Jiang
Zhongjian Jiang
Cheng Jiang
Ruiqi Jiao
Dapeng Jin
Shan Jin
Song Jin
Yi Jin
Junji Jis
Sunghoon Jung
Goran Kacarevic
Eric Kajfasz
Lidia Kalinovskaya
Aleksei Kampf
Wen Kang
Xian-Wei Kang
Xiaolin Kang
Biswajit Karmakar
Zhiyong Ke
Rijeesh Keloth
Alamgir Khan
Hamzeh Khanpour
Khanchai Khosonthongkee
Bobae Kim
Dongwoon Kim
Mi Ran Kim
Minsuk Kim
Sungwon Kim
On Kim
Michael Klasen
Sanghyun Ko
S. Ko
Ivan Koop
Vitaliy Kornienko
Bryan Kortman
Gennady Kozlov
Shiqing Kuang
Mukesh Kumar
Chia Ming Kuo
Tsz Hong Kwok
François Sylvain Ren Lagarde
F. Lagarde
Pei-Zhu Lai
Imad Laktineh
Xiaofei Lan
Zuxiu Lan
Lia Lavezzi
Justin Lee
Junghyun Lee
Sehwook Lee
Ge Lei
Roy Lemmon
Yongxiang Leng
Sze Ching Leung
Hai Tao Li
Bingzhi Li
Bin Li
Changhong Li
Chao Li
Cheng Li
Chunhua Li
Cui Li
Dazhang Li
Dikai Li
Yi Wang
Gaosong Li
Haibo Li
Haifeng Li
Hai-Jun Li
Haotian Li
Hengne Li
Honglei Li
Huijing Li
Jialin Li
Jingyi Li
Jun Li
Leyi Li
Liang Li
Jinmian Li
Mei Li
Meng Li
Minxian Li
Ling Li
Pei-Rong Li
Qiang Li
Shaopeng Li
Shenghe Li
Shu Li
Shuo Li
Teng Li
Tiange Li
Tong Li
Weichang Li
Weidong Li
Wenjun Li
Xiaoling Li
Xiaomei Li
Xiao-Nan Li
Xiaoping Li
Xiaoting Li
Xin Li
Xinqiang Li
Xuekang Li
Yang Li
Yanwei Li
Yiming Li
Ying Li
Ying-Ying Li
Yonggang Li
Yonglin Li
Yufeng Li
Yuhui Li
Zhan Li
Zhao Li
Zhiji Li
Lingfeng Li
Jing Liang
Jinhan Liang
Zhijun Liang
Guangrui Liao
Hean Liao
Jiaming Yan
Fei Li
Libo Liao
Longzhou Liao
Yipu Liao
Ayut Limphirat
Jiajun Liao
Tao Lin
Weiping Lin
Yi Liao
Yufu Lin
Yugen Lin
Beijiang Liu
Bo Liu
Danning Liu
Dong Liu
Fu-Hu Liu
Hongbang Liu
Huangcheng Liu
H. Liu
Huiling Liu
Jia Liu
Jiaming Liu
Jianbei Liu
Jianyi Liu
Jingdong Liu
Jinhua Liu
Kai Liu
Kang Liu
Kun Liu
Mengyao Liu
Pengcheng Liu
Qibin Liu
Shan Liu
Shidong Liu
Shuang Liu
Shubin Liu
Peng Liu
Tao Liu
Tong Liu
W. M. Liu
Xiang Liu
Xiaohui Liu
Xiaoyu Liu
Jian Li
Xinglin Liu
Xingquan Liu
Yang Liu
Xiao-Hai Liu
Yanlin Liu
Yao-Bei Liu
Yi Liu
Yiming Liu
Yonglu Liu
Yubin Liu
Yudong Liu
Yulong Liu
Zhaofeng Liu
Zhenchao Liu
Zhi Liu
Zhi-Feng Liu
Zhiqing Liu
Zhongfu Liu
Zuowei Liu
Mia Liu
Xiaoyang Liu
Xinchou Lou
Cai-Dian Lu
Jun-Xu Lu
Qiu Zhen Lu
Shang Lu
Wenxi Lu
Xiaohan Lu
Yunpeng Lu
Zhiyong Lu
Xianguo Lu
Wei Lu
Bayarto Lubsandorzhiev
Sultim Lubsandorzhiev
Arslan Lukanov
Jinliang Luo
T. Luo
Xiaoan Luo
Xiaofeng Luo
Xiaolan Luo
Jindong Lv
Feng Lyu
Xiao-Rui Lyu
Kun-Feng Lyu
Ande Ma
Hong-Hao Ma
Jun-Li Ma
Kai Ma
Lishuang Ma
Na Ma
Renjie Ma
Weihu Ma
Xinpeng Ma
Yanling Ma
Yan-Qing Ma
Yongsheng Ma
Zhonghui Ma
Zhongjian Ma
Yang Ma
Mousam Maity
Lining Mao
Yanmin Mao
Yaxian Mao
Aurélien Martens
Caccia Massimo Luigi Maria
Shigeki Matsumoto
Bruce Mellado
Davide Meloni
Cai Meng
Lingxin Meng
Zhenghui Mi
Yuhui Miao
Mauro Migliorati
Lei Ming
Vasiliki A. Mitsou
Laura Monaco
Arthur Moraes
Karabo Mosala
Ahmad Moursy
Lichao Mu
Zhihui Mu
Nickolai Muchnoi
Daniel Muenstermann
Pankaj Munbodh
William John Murray
Jérôme Nanni
Dmitry Nanzanov
Changshan Nie
Sergei Nikitin
Feipeng Ning
Guozhu Ning
Jia-Shu Niu
Juan-Juan Niu
Yan Niu
Edward Khomotso Nkadimeng
Kazuhito Ohmi
Katsunobu Oide
Hideki Okawa
Mohamed Ouchemhou
Qun Ouyang
Daniele Paesani
Carlo Pagani
Stathes Paganis
Collette Pakuza
Jiangyang Pan
Juntong Pan
Tong Pan
Xiang Pan
Papia Panda
Saraswati Pandey
Mila Pandurovic
Rocco Paparella
Roman Pasechnik
Emilie Passemar
Hua Pei
Xiaohua Peng
Xinye Peng
Yuemei Peng
Jialun Ping
Ronggang Ping
Souvik Priyam Adhya
Baohua Qi
Hang Qi
Huirong Qi
Ming Qi
Sen Qian
Zhuoni Qian
Congfeng Qiao
Guangyou Qin
Jiajia Qin
Laishun Qin
Liqing Qin
Qin Qin
Xiaoshuai Qin
Zhonghua Qin
Guofeng Qu
Antonio Racioppi
Michael Ramsey-Musolf
Shabbar Raza
Vladimir Rekovic
Jing Ren
Jürgen Reuter
Tania Robens
Giancarlo Rossi
Manqi Ruan
Leonid Rumyantsev
Min Sang Ryu
Renat Sadykov
Minjing Sang
Juan José Sanz-Cillero
Miroslav Saur
Nishil Savla
Michael A. Schmidt
Daniele Sertore
Ron Settles
Peng Sha
Ding-Yu Shao
Ligang Shao
Hua-Sheng Shao
Xin She
Chuang Shen
Hong-Fei Shen
Jian-Ming Shen
Peixun Shen
Qiuping Shen
Zhongtao Shen
Shuqi Sheng
Haoyu Shi
Hua Shi
Qi Shi
Shusu Shi
Xiaolei Shi
Xin Shi
Yukun Shi
Zhan Shi
Ian Shipsey
Gary Shiu
Chang Shu
Zong-Guo Si
Andrei Sidorenkov
Ivan Smiljanić
Aodong Song
Huayang Song
Jiaojiao Song
Jinxing Song
Siyuan Song
Weimin Song
Weizheng Song
Zhi Song
Shashwat Sourav
Paolo Spruzzola
Feng Su
Shengsen Su
Wei Su
Shufang Su
Yanfeng Sui
Zexuan Sui
Michael Sullivan
Baiyang Sun
Guoqiang Sun
Hao Sun
Hao-Kai Sun
Junfeng Sun
Liang Sun
Mengcheng Sun
Pengfei Sun
Sichun Sun
Xianjing Sun
Xiaohu Sun
Xilei Sun
Xingyang Sun
Xin-Yuan Sun
Yanjun Sun
Yongzhao Sun
Yue Sun
Zheng Sun
Narumon Suwonjandee
Elsayed Tag Eldin
Biao Tan
Bo Tang
Chuanxiang Tang
Gao Tang
Guangyi Tang
Jingyu Tang
Liang Tang
Ying’Ao Tang
Junquan Tao
Abdel Nasser Tawfik
Geoffrey Taylor
Valery Telnov
Saike Tian
Riccardo Torre
Wladyslaw Henryk Trzaska
Dmitri Tsybychev
Yanjun Tu
Shengquan Tuo
Michael Tytgat
Ghalib Ul Islam
Nikita Ushakov
German Valencia
Jaap Velthuis
Alessandro Vicini
Trevor Vickey
Ivana Vidakovic
Henri Videau
Raymond Volkas
Dmitry Voronin
Natasa Vukasinovic
Xia Wan
Xuying Wan
X. Wang
Anqing Wang
B. Wang
Chengtao Wang
Chuanye Wang
Ci Wang
Dayong Wang
Dou Wang
En Wang
Guanwen Wang
Guo-Li Wang
Haijing Wang
Haolin Wang
Jianchun Wang
Jianli Wang
Jiawei Wang
Jin Wang
Jin-Wei Wang
Joseph Wang
Kechen Wang
Lechun Wang
Wei Wang
Liguo Wang
Lijiao Wang
Lu Wang
Meng Wang
Na Wang
Pengcheng Wang
Qi Wang
Qun Wang
Shu Lin Wang
Shudong Wang
Taofeng Wang
Tianhong Wang
Tianyang Wang
Xiaolong Wang
Xiaoning Wang
Xiao-Ping Wang
Xiongfei Wang
Xujian Wang
Yaping Wang
Yaqian Wang
Yiao Wang
Yifang Wang
Yilun Wang
Yiwei Wang
You-Kai Wang
Yuanping Wang
Yuexin Wang
Yuhao Wang
Yu-Ming Wang
Yuting Wang
Zhen Wang
Zhigang Wang
Weiping Wang
Zeren Simon Wang
Biao Wang
Hao Wang
Lian-Tao Wang
Zihui Wang
Zirui Wang
Jia Wang
Tong Wang
Daihui Wei
Shujun Wei
Wei Wei
Xiaomin Wei
Yingjie Wei
Liangjian Wen
Xuejun Wen
Yufeng Wen
Martin White
Peter Williams
Zef Wolffs
William John Womersley
Baona Wu
Bobing Wu
Guanjian Wu
Jinfei Wu
Lei Wu
Lina Wu
Linghui Wu
Minlin Wu
Peiwen Wu
Qi Wu
Qun Wu
Tianya Wu
Xiang Wu
Xiaohong Wu
Xing-Gang Wu
Xuehui Wu
Yaru Wu
Yongcheng Wu
Yuwen Wu
Zhi Wu
Xin Wu
Lei Xia
Ligang Xia
Shang Xia
Benhou Xiang
Dao Xiang
Zhiyu Xiang
Bo-Wen Xiao
Chu-Wen Xiao
Dunming Xiao
Guangyan Xiao
Han Xiao
Min Xiao
Ouzheng Xiao
Rui-Qing Xiao
Xiang Xiao
Yichen Xiao
Yu Xiao
Yunlong Xiao
Zhenjun Xiao
Hengyuan Xiao
Nian Xie
Yuehong Xie
Tianmu Xin
Ye Xing
Zhizhong Xing
Da Xu
Fang Xu
Fanrong Xu
Haisheng Xu
Haocheng Xu
Ji Xu
Miaofu Xu
Qingjin Xu
Qingnian Xu
W. Xu
Weixi Xu
Xinping Xu
Zijun Xu
Zehua Xu
Yaoyuan Xu
Feifei Xue
Baojun Yan
Bin Yan
Fen Yan
Fucheng Yan
Liang Yan
Qi-Shu Yan
Wenbiao Yan
Yupeng Yan
Luping Yan
Haoyue Yan
Dong Yang
Fengying Yang
Guicheng Yang
Haijun Yang
Jin Min Yang
Jing Yang
Lan Yang
Li Yang
Li Lin Yang
Lili Yang
Litao Yang
Mei Yang
Qiaoli Yang
Tiansen Yang
Xiaochen Yang
Yingjun Yang
Yueling Yang
Zhengyong Yang
Zhenwei Yang
Youhua Yang
Xiancong Yang
De-Liang Yao
Shi Yao
Lei Ye
Lingxi Ye
Mei Ye
Rui Ye
Yecheng Ye
Vitaly Yermolchyk
Kai Yi
Li Yi
Yang Yi
Di Yin
Peng-Fei Yin
Shenghua Yin
Ze Yin
Zhongbao Yin
Zhang Yinhong
Hwi Dong Yoo
Zhengyun You
Charles Young
Boxiang Yu
Chenghui Yu
Fusheng Yu
Jie-Sheng Yu
Jinqing Yu
Lingda Yu
Zhao-Huan Yu
Felix Yu
Bingrong Yu
Changzheng Yuan
Li Yuan
Xing-Bo Yuan
Youjin Yuan
Junhui Yue
Qian Yue
Baobiao Yue
Un Nisa Zaib
Riccardo Zanzottera
Min Zeng
Jian Zhai
Jiyuan Zhai
Xin Zhe Zhai
Xi-Jie Zhan
Ben-Wei Zhang
Bolun Zhang
Di Zhang
Guangyi Zhang
Hao Zhang
Hong-Hao Zhang
Huaqiao Zhang
Hui Zhang
Jian Wang
Jianzhong Zhang
Jiehao Zhang
Jielei Zhang
Jingru Zhang
Jinxian Zhang
Junsong Zhang
Junxing Zhang
Lei Zhang
Liang Zhang
Licheng Zhang
Liming Zhang
Linhao Zhang
Mengchao Zhang
Shulei Zhang
Wan Zhang
Wenchao Zhang
Xiangzhen Zhang
Xiaomei Zhang
Xiaoming Zhang
Xiaoxu Zhang
Xiaoyu Zhang
Xuantong Zhang
Xueyao Zhang
Yang Zhang
Yanxi Zhang
Yao Zhang
Yixiang Zhang
Yizhou Zhang
Yongchao Zhang
Yu Zhang
Yuan Zhang
Yujie Zhang
Yulei Zhang
Yumei Zhang
Yunlong Zhang
Zhandong Zhang
Zhaoru Zhang
Zhen-Hua Zhang
Zhenyu Zhang
Zhichao Zhang
Zhi-Qing Zhang
Zhuo Zhang
Zhiqing Zhang
Cong Zhang
Tianliang Zhang
Luyan Zhang
Guang Zhao
Hongyun Zhao
Jie Zhao
Jingxia Zhao
Jingyi Zhao
Ling Zhao
Luyang Zhao
Mei Zhao
Minggang Zhao
Mingrui Zhao
Ruiguang Zhao
Tongxian Zhao
Yaliang Zhao
Ying Zhao
Yue Zhao
Zhiyu Zhao
Zhuo Zhao
Alexey Zhemchugov
Hongjuan Zheng
Jinchao Zheng
Liang Zheng
Ran Zheng
Shanxi Zheng
Xu-Chang Zheng
Wang Zhile
Weicai Zhong
Yi-Ming Zhong
Chen Zhou
Daicui Zhou
Jianxin Zhou
Jing Zhou
Na Zhou
Qi-Dong Zhou
Shiyu Zhou
Shun Zhou
Sihong Zhou
Xiang Zhou
Xingyu Zhou
Yang Zhou
Yong Zhou
Yu-Feng Zhou
Zusheng Zhou
Demin Zhou
Dechong Zhu
Hongbo Zhu
Huaxing Zhu
Jingya Zhu
Kai Zhu
Pengxuan Zhu
Ruilin Zhu
Xianglei Zhu
Yingshun Zhu
Yongfeng Zhu
Xiao Zhuang
Xuai Zhuang
Mikhail Zobov
Zhanguo Zong
Cong Zou
Hongying Zou
BigDocs: An Open and Permissively-Licensed Dataset for Training Multimodal Models on Document and Code Tasks
Xiangru Jian
Akshay Kalkunte
François Savard
Amirhossein Abaskohi
Pierre-Andre Noel
Shubham Agarwal
Sanket Biswas
Sara Shanian
Noah Bolger
Kurt MacDonald
Simon Fauvel
Sathwik Tejaswi
Srinivas Sunkara
Joao Monteiro
Krishnamurthy Dj Dvijotham
Torsten Scholak
Sepideh Kharaghani
Sean Hughes
M. Özsu
Issam Hadj Laradji
Spandana Gella
David Vazquez
Sai Rajeswar
Multimodal AI has the potential to significantly enhance document-understanding tasks, such as processing receipts, understanding workflows, extracting data from documents, and summarizing reports. Code generation tasks that require long-structured outputs can also be enhanced by multimodality. Despite this, their use in commercial applications is often limited due to limited access to training data and restrictive licensing, which hinders open access. To address these limitations, we introduce BigDocs-7.5M, a high-quality, open-access dataset comprising 7.5 million multimodal documents across 30 tasks. We use an efficient data curation process to ensure our data is high-quality and license-permissive. Our process emphasizes accountability, responsibility, and transparency through filtering rules, traceable metadata, and careful content analysis. Additionally, we introduce BigDocs-Bench, a benchmark suite with 10 novel tasks where we create datasets that reflect real-world use cases involving reasoning over Graphical User Interfaces (GUI) and code generation from images. Our experiments show that training with BigDocs-Bench improves average performance up to 25.8% over closed-source GPT-4o in document reasoning and structured output tasks such as Screenshot2HTML or Image2Latex generation. Finally, human evaluations showed a preference for outputs from models trained on BigDocs over GPT-4o. This suggests that BigDocs can help both academics and the open-source community utilize and improve AI tools to enhance multimodal capabilities and document reasoning. The project is hosted at https://bigdocs.github.io.
BigDocs: An Open and Permissively-Licensed Dataset for Training Multimodal Models on Document and Code Tasks
Xiangru Jian
Akshay Kalkunte
François Savard
Amirhossein Abaskohi
Pierre-Andre Noel
M. L. Richter
Shubham Agarwal
Sanket Biswas
Sara Shanian
Noah Bolger
Kurt MacDonald
Simon Fauvel
Sathwik Tejaswi
Srinivas Sunkara
Joao Monteiro
Krishnamurthy Dj Dvijotham
Torsten Scholak
Sepideh Kharaghani
Sean Hughes
M. Özsu
Issam Hadj Laradji
Spandana Gella
David Vazquez
Sai Rajeswar
Multimodal AI has the potential to significantly enhance document-understanding tasks, such as processing receipts, understanding workflows, extracting data from documents, and summarizing reports. Code generation tasks that require long-structured outputs can also be enhanced by multimodality. Despite this, the use of such models in commercial applications is often limited by scarce training data and restrictive licensing, which hinders open access. To address these limitations, we introduce BigDocs-7.5M, a high-quality, open-access dataset comprising 7.5 million multimodal documents across 30 tasks. We use an efficient data curation process to ensure our data is high-quality and license-permissive. Our process emphasizes accountability, responsibility, and transparency through filtering rules, traceable metadata, and careful content analysis. Additionally, we introduce BigDocs-Bench, a benchmark suite with 10 novel tasks where we create datasets that reflect real-world use cases involving reasoning over Graphical User Interfaces (GUI) and code generation from images. Our experiments show that training with BigDocs-Bench improves average performance up to 25.8% over closed-source GPT-4o in document reasoning and structured output tasks such as Screenshot2HTML or Image2Latex generation. Finally, human evaluations showed a preference for outputs from models trained on BigDocs over GPT-4o. This suggests that BigDocs can help both academics and the open-source community utilize and improve AI tools to enhance multimodal capabilities and document reasoning. The project is hosted at https://bigdocs.github.io.
BigDocs: An Open and Permissively-Licensed Dataset for Training Multimodal Models on Document and Code Tasks
Juan A. Rodriguez
Xiangru Jian
Akshay Kalkunte Suresh
Amirhossein Abaskohi
Pierre-Andre Noel
Sanket Biswas
Sara Shanian
Noah Bolger
Kurt MacDonald
Simon Fauvel
Sathwik Tejaswi Madhusudhan
Srinivas Sunkara
Joao Monteiro
Krishnamurthy Dj Dvijotham
Torsten Scholak
Sepideh Kharaghani
Sean Hughes
M. Özsu
Issam Hadj Laradji
David Vazquez
Sai Rajeswar
Multimodal AI has the potential to significantly enhance document-understanding tasks, such as processing receipts, understanding workflows, extracting data from documents, and summarizing reports. Code generation tasks that require long-structured outputs can also be enhanced by multimodality. Despite this, the use of such models in commercial applications is often limited by scarce training data and restrictive licensing, which hinders open access. To address these limitations, we introduce BigDocs-7.5M, a high-quality, open-access dataset comprising 7.5 million multimodal documents across 30 tasks. We use an efficient data curation process to ensure our data is high-quality and license-permissive. Our process emphasizes accountability, responsibility, and transparency through filtering rules, traceable metadata, and careful content analysis. Additionally, we introduce BigDocs-Bench, a benchmark suite with 10 novel tasks where we create datasets that reflect real-world use cases involving reasoning over Graphical User Interfaces (GUI) and code generation from images. Our experiments show that training with BigDocs-Bench improves average performance up to 25.8% over closed-source GPT-4o in document reasoning and structured output tasks such as Screenshot2HTML or Image2Latex generation. Finally, human evaluations showed a preference for outputs from models trained on BigDocs over GPT-4o. This suggests that BigDocs can help both academics and the open-source community utilize and improve AI tools to enhance multimodal capabilities and document reasoning. The project is hosted at https://bigdocs.github.io.
Retrieving Signals with Deep Complex Extractors
Recent advances have made it possible to create deep complex-valued neural networks. Despite this progress, many challenging learning tasks have yet to leverage the power of complex representations. Building on recent advances, we propose a new deep complex-valued method for signal retrieval and extraction in the frequency domain. As a case study, we perform audio source separation in the Fourier domain. Our new method takes advantage of the convolution theorem, which states that the Fourier transform of two convolved signals is the elementwise product of their Fourier transforms. Our novel method is based on a complex-valued version of Feature-Wise Linear Modulation (FiLM) and serves as the keystone of our proposed signal extraction method. We also introduce a new and explicit amplitude- and phase-aware loss, which is scale- and time-invariant, taking into account the complex-valued components of the spectrogram. Using the Wall Street Journal Dataset, we compare our phase-aware loss to several others that operate in both the time and frequency domains, and demonstrate the effectiveness of our proposed signal extraction method and proposed loss.
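The convolution theorem cited in the abstract above can be checked numerically. The sketch below is only an illustration and is not taken from the paper. It assumes NumPy, two short random real signals, and circular convolution, which is the form of the theorem that holds exactly for the discrete Fourier transform. The helper `circular_convolve` is a hypothetical name introduced here.

```python
import numpy as np

def circular_convolve(x, h):
    """Direct circular convolution: y[n] = sum_k x[k] * h[(n - k) mod N]."""
    n = len(x)
    return np.array(
        [sum(x[k] * h[(i - k) % n] for k in range(n)) for i in range(n)]
    )

# Two arbitrary real signals of equal length (illustrative data only).
rng = np.random.default_rng(0)
x = rng.standard_normal(8)
h = rng.standard_normal(8)

# Time domain: convolve directly.
direct = circular_convolve(x, h)

# Frequency domain: multiply the DFTs elementwise, then invert.
via_fft = np.fft.ifft(np.fft.fft(x) * np.fft.fft(h)).real

# Both routes should agree up to floating-point error.
theorem_holds = np.allclose(direct, via_fft)
print(theorem_holds)
```

Because convolution becomes an elementwise product in the frequency domain, a multiplicative operation on spectra, such as the scaling term in a FiLM-style layer, corresponds to filtering in the time domain.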