David Ifeoluwa Adelani

Core Academic Member
Canada CIFAR AI Chair
McGill University
Research Topics
Deep Learning
Natural Language Processing
Representation Learning

Biography

David Adelani is an assistant professor at McGill University’s School of Computer Science under the Fighting Inequities initiative, and a core academic member of Mila – Quebec Artificial Intelligence Institute.

Adelani’s research focuses on multilingual natural language processing with special attention to under-resourced languages.

Current Students

PhD - McGill University
Master's Research - McGill University
Research Intern - McGill University
Master's Research - McGill University

Publications

The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset
Hugo Laurençon
Lucile Saulnier
Thomas Wang
Christopher Akiki
Albert Villanova del Moral
Teven Le Scao
Leandro Von Werra
Chenghao Mou
Eduardo González Ponferrada
Huu Nguyen
Jörg Frohberg
Mario Šaško
Quentin Lhoest
Angelina McMillan-Major
Gérard Dupont
Stella Biderman
Anna Rogers
Loubna Ben allal
Francesco De Toni
Giada Pistilli
Olivier Nguyen
Somaieh Nikpoor
Maraim Masoud
Pierre Colombo
Javier de la Rosa
Paulo Villegas
Tristan Thrush
Shayne Longpre
Sebastian Nagel
Leon Weber
Manuel Romero Muñoz
Jian Zhu
Daniel Van Strien
Zaid Alyafeai
Khalid Almubarak
Vu Minh Chien
Itziar Gonzalez-Dios
Aitor Soroa
Kyle Lo
Manan Dey
Pedro Ortiz Suarez
Aaron Gokaslan
Shamik Bose
Long Phan
Hieu Tran
Ian Yu
Suhas Pai
Jenny Chim
Violette Lepercq
Suzana Ilic
Margaret Mitchell
Sasha Luccioni
Yacine Jernite
As language models grow ever larger, the need for large-scale high-quality text datasets has never been more pressing, especially in multilingual settings. The BigScience workshop, a 1-year international and multidisciplinary initiative, was formed with the goal of researching and training large language models as a values-driven undertaking, putting issues of ethics, harm, and governance in the foreground. This paper documents the data creation and curation efforts undertaken by BigScience to assemble the Responsible Open-science Open-collaboration Text Sources (ROOTS) corpus, a 1.6TB dataset spanning 59 languages that was used to train the 176-billion-parameter BigScience Large Open-science Open-access Multilingual (BLOOM) language model. We further release a large initial subset of the corpus and analyses thereof, and hope to empower large-scale monolingual and multilingual modeling projects with both the data and the processing tools, as well as stimulate research around this large multilingual corpus.