David Ifeoluwa Adelani

Biography

David Adelani is an assistant professor at McGill University’s School of Computer Science under the Fighting Inequities initiative, and a core academic member of Mila – Quebec Artificial Intelligence Institute.

Adelani’s research focuses on multilingual natural language processing with special attention to under-resourced languages.

Current Students

Senyu Li

PhD - McGill University

Jessica Ojo

Master's Research - McGill University

Fabian Schmidt

Research Intern - McGill University

Peter Yu

Master's Research - McGill University

Publications

Machine Translation Hallucination Detection for Low and High Resource Languages using Large Language Models

Kenza Benkirane

Laura Gongas

Shahar Pelles

Naomi Fuchs

Joshua Darmon

Pontus Stenetorp

Eduardo Sánchez

Meta

2024-07-23

ArXiv (preprint)

Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects

Orevaoghene Ahia

Aremu Anuoluwapo

Diana Abagyan

Hila Gonen

Daud Abolade

Noah A. Smith

Yulia Tsvetkov

2024-06-27

ArXiv (preprint)

The Responsible Foundation Model Development Cheatsheet: A Review of Tools & Resources

Shayne Longpre

Stella Biderman

Alon Albalak

Hailey Schoelkopf

Daniel McDuff

Sayash Kapoor

Kevin Klyman

Kyle Lo

Gabriel Ilharco

Nay San

Maribeth Rauh

Aviya Skowron

Bertie Vidgen

Laura Weidinger

Arvind Narayanan

Victor Sanh

Percy Liang

Rishi Bommasani

Peter Henderson

Sasha Luccioni

Yacine Jernite

Luca Soldaini

2024-06-24

ArXiv (preprint)

MINERS: Multilingual Language Models as Semantic Retrievers

Genta Indra Winata

Ruochen Zhang

Words have been represented in a high-dimensional vector space that encodes their semantic similarities, enabling downstream applications such as retrieving synonyms, antonyms, and relevant contexts. However, despite recent advances in multilingual language models (LMs), the effectiveness of these models' representations in semantic retrieval contexts has not been comprehensively explored. To fill this gap, this paper introduces MINERS, a benchmark designed to evaluate the ability of multilingual LMs in semantic retrieval tasks, including bitext mining and classification via retrieval-augmented contexts. We create a comprehensive framework to assess the robustness of LMs in retrieving samples across over 200 diverse languages, including extremely low-resource languages in challenging cross-lingual and code-switching settings. Our results demonstrate that solely retrieving semantically similar embeddings yields performance competitive with state-of-the-art approaches, without requiring any fine-tuning.

2024-06-11

ArXiv (preprint)

CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark

David Romero

Chenyang Lyu

Haryo Akbarianto Wibowo

Teresa Lynn

Injy Hamed

Aditya Nanda Kishore

Aishik Mandal

Alina Dragonetti

Artem Abzaliev

Atnafu Lambebo Tonja

Bontu Fufa Balcha

Chenxi Whitehouse

Christian Salamea

Dan John Velasco

D. Meur

Emilio Villa-Cueva

Fajri Koto

Fauzan Farooqui

Frederico Belcavello

Ganzorig Batnasan

Gisela Vallejo

Grainne Caulfield

Guido Ivetta

Haiyue Song

Henok Biadglign Ademtew

Hernán Maina

Holy Lovenia

Israel Abebe Azime

Jan Christian Blaise Cruz

Jay Gala

Jiahui Geng

Jesús-Germán Ortiz-Barajas

Jinheon Baek

Jocelyn Dunstan

Laura Alonso Alemany

Kumaranage Ravindu Yasas Nagasinghe

Luciana Benotti

Luis Fernando D'Haro

Marcelo Viridiano

Marcos Estecha-Garitagoitia

Maria Camila Buitrago Cabrera

Mario Rodríguez-Cantelar

Mélanie Jouitteau

Mihail Mihaylov

Mohamed Fazli Mohamed Imam

Muhammad Farid Adilazuarda

Munkhjargal Gochoo

Munkh-Erdene Otgonbold

Naome Etori

Olivier Niyomugisha

Paula Mónica Silva

Pranjal A. Chitale

Raj Dabre

Rendi Chevi

Ruochen Zhang

Ryandito Diandaru

Samuel Cahyawijaya

Santiago Góngora

Soyeong Jeong

Sukannya Purkayastha

Tatsuki Kuribayashi

Thanmay Jayakumar

Tiago Timponi Torrent

Toqeer Ehsan

Vladimir Araujo

Yova Kementchedjhieva

Zara Burzo

Zheng Wei Lim

Zheng-Xin Yong

O. Ignat

Joan Nwatu

Rada Mihalcea

Thamar Solorio

Alham Fikri Aji

2024-06-10

ArXiv (preprint)

IrokoBench: A New Benchmark for African Languages in the Age of Large Language Models

Jessica Ojo

Israel Abebe Azime

Zhuang Yun Jian

Jesujoba Oluwadara Alabi

Xuanli He

Millicent Ochieng

Sara Hooker

Andiswa Bukula

En-Shiun Annie Lee

Chiamaka Ijeoma Chukwuneke

Happy Buzaaba

Blessing Kudzaishe Sibanda

Godson Kalipe

Jonathan Mukiibi

Salomon Kabongo

Foutse Yuehgoh

M. Setaka

Lolwethu Ndolela

Nkiruka Bridget Odu

Rooweither Mabuya

Shamsuddeen Hassan Muhammad

Salomey Osei

Sokhar Samb

Tadesse Kebede Guge

Pontus Stenetorp

Despite the widespread adoption of large language models (LLMs), their remarkable capabilities remain limited to a few high-resource languages. Additionally, many low-resource languages (e.g. African languages) are often evaluated only on basic text classification tasks due to the lack of appropriate or comprehensive benchmarks outside of high-resource languages. In this paper, we introduce IrokoBench, a human-translated benchmark dataset for 16 typologically-diverse low-resource African languages covering three tasks: natural language inference (AfriXNLI), mathematical reasoning (AfriMGSM), and multi-choice knowledge-based QA (AfriMMLU). We use IrokoBench to evaluate zero-shot, few-shot, and translate-test settings (where test sets are translated into English) across 10 open and four proprietary LLMs. Our evaluation reveals a significant performance gap between high-resource languages (such as English and French) and low-resource African languages. We also observe a significant performance gap between open and proprietary models, with the highest-performing open model, Aya-101, reaching only 58% of the performance of the best proprietary model, GPT-4o. Machine translating the test set to English before evaluation helped to close the gap for larger models that are English-centric, like LLaMa 3 70B. These findings suggest that more efforts are needed to develop and adapt LLMs for African languages.

2024-06-05

ArXiv (preprint)

Meta's AI translation model embraces overlooked languages.

2024-06-05

Nature (published)

AfriMTE and AfriCOMET: Enhancing COMET to Embrace Under-resourced African Languages

Jiayi Wang

Sweta Agrawal

Marek Masiak

Ricardo Rei

Eleftheria Briakou

Marine Carpuat

Xuanli He

Sofia Bourhim

Andiswa Bukula

Muhidin A. Mohamed

Temitayo Olatoye

Tosin Adewumi

Hamam Mokayed

Christine Mwase

Wangui Kimotho

Foutse Yuehgoh

Aremu Anuoluwapo

Jessica Ojo

Shamsuddeen Hassan Muhammad

Salomey Osei

Abdul-Hakeem Omotayo

Chiamaka Ijeoma Chukwuneke

Perez Ogayo

Oumaima Hourrane

Salma El Anigri

Lolwethu Ndolela

Thabiso Mangwana

Shafie Abdi Mohamed

Hassan Ayinde

Ayinde Hassan

Oluwabusayo Olufunke Awoyomi

Lama Alkhaled

Sana Sabah Al-Azzawi

Naome Etori

Millicent Ochieng

Clemencia Siro

Samuel Njoroge

Njoroge Kiragu

Eric Muchiri

Wangari Kimotho

Lyse Naomi Wamba

Daud Abolade

Simbiat Ajao

Iyanuoluwa Shode

Ricky Macharm

Ruqayya Nasir Iro

Saheed Salahudeen Abdullahi

Stephen Moore

Bernard Opoku

Zainab Akinjobi

Abeeb Afolabi

Nnaemeka Casmir Obiefuna

Onyekachi Ogbu

Sam Brian

Sam Ochieng’

Verrah Akinyi Otiende

Chinedu Emmanuel Mbonu

Toadoum Sari Sakayo

Yao Lu

Pontus Stenetorp

Despite the recent progress on scaling multilingual machine translation (MT) to several under-resourced African languages, accurately measuring this progress remains challenging, since evaluation is often performed on n-gram matching metrics such as BLEU, which typically show a weaker correlation with human judgments. Learned metrics such as COMET have higher correlation; however, the lack of evaluation data with human ratings for under-resourced languages, the complexity of annotation guidelines like Multidimensional Quality Metrics (MQM), and the limited language coverage of multilingual encoders have hampered their applicability to African languages. In this paper, we address these challenges by creating high-quality human evaluation data with simplified MQM guidelines for error detection and direct assessment (DA) scoring for 13 typologically diverse African languages. Furthermore, we develop AfriCOMET: COMET evaluation metrics for African languages, built by leveraging DA data from well-resourced languages and an African-centric multilingual encoder (AfroXLM-R), which achieve state-of-the-art MT evaluation for African languages with respect to Spearman-rank correlation with human judgments (0.441).

2024-06-01

Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) (published)

Comparing LLM prompting with Cross-lingual transfer performance on Indigenous and Low-resource Brazilian Languages

A. Seza Doğruöz

André Coneglian

Atul Kr. Ojha

Large Language Models are transforming NLP for a variety of tasks. However, how LLMs perform NLP tasks for low-resource languages (LRLs) is less explored. In line with the goals of the AmericasNLP workshop, we focus on 12 LRLs from Brazil, 2 LRLs from Africa and 2 high-resource languages (HRLs) (e.g., English and Brazilian Portuguese). Our results indicate that the LLMs perform worse for the part of speech (POS) labeling of LRLs in comparison to HRLs. We explain the reasons behind this failure and provide an error analysis through examples observed in our data set.

2024-04-28

ArXiv (preprint)

EkoHate: Abusive Language and Hate Speech Detection for Code-switched Political Discussions on Nigerian Twitter

Comfort Eseohen Ilevbare

Jesujoba Oluwadara Alabi

Firdous Damilola Bakare

Oluwatoyin Bunmi Abiola

Oluwaseyi A. Adeyemo

Nigerians have a notable online presence and actively discuss political and topical matters. This was particularly evident throughout the 2023 general election, where Twitter was used for campaigning, fact-checking and verification, and even positive and negative discourse. However, little or no work has been done on the detection of abusive language and hate speech in Nigeria. In this paper, we curated code-switched Twitter data directed at three musketeers of the governorship election in the most populous and economically vibrant state in Nigeria, Lagos State, with the view to detect offensive speech in political discussions. We developed EkoHate, an abusive language and hate speech dataset for political discussions between the three candidates and their followers, using a binary (normal vs offensive) and a fine-grained four-label annotation scheme. We analysed our dataset and provided an empirical evaluation of state-of-the-art methods across both supervised and cross-lingual transfer learning settings. In the supervised setting, our evaluation results in both binary and four-label annotation schemes show that we can achieve 95.1 and 70.3 F1 points respectively. Furthermore, we show that our dataset transfers well to three publicly available offensive datasets (OLID, HateUS2020, and FountaHate), generalizing to political discussions in other regions like the US.

2024-04-28

ArXiv (preprint)

5th Workshop on African Natural Language Processing (AfricaNLP 2024)

Happy Buzaaba

Bonaventure F. P. Dossou

Hady Elsahar

Constantine Lignos

Atnafu Lambebo Tonja

Salomey Osei

Aremu Anuoluwapo

Clemencia Siro

Shamsuddeen Hassan Muhammad

Tajuddeen Gwadabe

Perez Ogayo

Israel Abebe Azime

Kayode Olaleye

Over 1 billion people live in Africa, and its residents speak more than 2,000 languages. But those languages are among the least represented in NLP research, and work on African languages is often sidelined at major venues. Over the past few years, a vibrant, collaborative community of researchers has formed around a sustained focus on NLP for the benefit of the African continent: national, regional, continental and even global collaborative efforts focused on African languages, African corpora, and tasks with importance in the African context. The AfricaNLP workshops have been a central venue in organizing, sustaining, and growing this focus, and we propose to continue this tradition with an AfricaNLP 2024 workshop in Vienna. Starting in 2020, the AfricaNLP workshop has become a core event for the African NLP community and has drawn global attendance and interest. Many of the participants are active in the Masakhane grassroots NLP community, allowing the community to convene, showcase and share experiences with each other. Large-scale collaborative works have been enabled by participants who joined from the AfricaNLP workshop, such as MasakhaNER (61 authors), Quality assessment of Multilingual Datasets (51 authors), Corpora Building for Twi (25 authors), and NLP for Ghanaian Languages (25 authors). Many first-time authors, through the mentorship program, found collaborators and published their first paper. Those mentorship relationships built trust and coherence within the community that continues to this day. We aim to continue this. In the contemporary AI landscape, generative AI has rapidly expanded with significant input and innovation from the global research community. This technology enables machines to generate novel content and showcases potential across a multitude of sectors. However, underrepresentation of African languages persists within this growth.
Recognizing the urgency to address this gap has inspired the theme for the 2024 workshop: Adaptation of Generative AI for African languages, which aspires to congregate experts, linguists, and AI enthusiasts to delve into solutions, collaborations, and strategies to amplify the presence of African languages in generative AI models.

2024-03-08

ICLR.cc/2024/Workshop_Proposals (published)

openreview.net

AfriHG: News Headline Generation for African Languages

Toyib Ogunremi

Serah sessi Akojenu

Anthony Soronnadi

Olubayo Adekanmbi