
David Ifeoluwa Adelani

Core Academic Member
Canada CIFAR AI Chair
McGill University
Research Topics
Representation Learning
Deep Learning
Natural Language Processing

Biography

David Adelani is an assistant professor of computer science and fighting inequities at McGill University, and a core academic member at Mila – Quebec Artificial Intelligence Institute. His research focuses on multilingual natural language processing, with particular emphasis on under-resourced languages.

Current Students

PhD - McGill
Research Master's - McGill
Research Intern - McGill
Research Master's - McGill

Publications

The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset
Hugo Laurençon
Lucile Saulnier
Thomas Wang
Christopher Akiki
Albert Villanova del Moral
Teven Le Scao
Leandro Von Werra
Chenghao Mou
Eduardo González Ponferrada
Huu Nguyen
Jörg Frohberg
Mario Šaško
Quentin Lhoest
Angelina McMillan-Major
Gérard Dupont
Stella Biderman
Anna Rogers
Loubna Ben Allal
Francesco De Toni
Giada Pistilli
Olivier Nguyen
Somaieh Nikpoor
Maraim Masoud
Pierre Colombo
Javier de la Rosa
Paulo Villegas
Tristan Thrush
Shayne Longpre
Sebastian Nagel
Leon Weber
Manuel Romero Muñoz
Jian Zhu
Daniel Van Strien
Zaid Alyafeai
Khalid Almubarak
Vu Minh Chien
Itziar Gonzalez-Dios
Aitor Soroa
Kyle Lo
Manan Dey
Pedro Ortiz Suarez
Aaron Gokaslan
Shamik Bose
Long Phan
Hieu Tran
Ian Yu
Suhas Pai
Jenny Chim
Violette Lepercq
Suzana Ilic
Margaret Mitchell
Sasha Luccioni
Yacine Jernite
As language models grow ever larger, the need for large-scale high-quality text datasets has never been more pressing, especially in multilingual settings. The BigScience workshop, a 1-year international and multidisciplinary initiative, was formed with the goal of researching and training large language models as a values-driven undertaking, putting issues of ethics, harm, and governance in the foreground. This paper documents the data creation and curation efforts undertaken by BigScience to assemble the Responsible Open-science Open-collaboration Text Sources (ROOTS) corpus, a 1.6TB dataset spanning 59 languages that was used to train the 176-billion-parameter BigScience Large Open-science Open-access Multilingual (BLOOM) language model. We further release a large initial subset of the corpus and analyses thereof, and hope to empower large-scale monolingual and multilingual modeling projects with both the data and the processing tools, as well as stimulate research around this large multilingual corpus.