David Ifeoluwa Adelani

Core Academic Member
Canada CIFAR AI Chair
McGill University
Research Topics
Representation Learning
Deep Learning
Speech Processing
Natural Language Processing

Biography

David Adelani is an assistant professor in computer science and fighting inequities at McGill University, and a core academic member at Mila – Quebec Artificial Intelligence Institute. His research focuses on multilingual natural language processing, with a particular emphasis on under-resourced languages.

Current Students

Research Master's - McGill
Research Master's - McGill
Research Collaborator - McGill
Research Intern - McGill
Research Intern - McGill
Postdoctorate - McGill
PhD - McGill
Research Collaborator - McGill
PhD - McGill
PhD - McGill
Alumni Collaborator - McGill
Research Master's - McGill
Research Intern - McGill
Professional Master's - UdeM
Research Intern - McGill
Research Intern - McGill
Research Intern - McGill
Alumni Collaborator - McGill

Publications

AfriSUD: A Dependency Treebank Collection for Evaluating Models on African Languages
Happy Buzaaba
Cheikh Mouhamadou Bamba Dione
Sylvain Kahane
Kim Gerdes
Bruno Guillaume
Kevin Guan
Aremu Anuoluwapo
Naome A. Etori
Shamsuddeen Hassan Muhammad
Utitofon Inyang
Peter Nabende
David Sabiiti Bamutura
Andiswa Bukula
Chinedu Uchechukwu
Rooweither Mabuya
Idris Akinade
Christiane Fellbaum
Despite their linguistic diversity and global significance, African languages remain underrepresented in research and resources to support NLP. We aim to bridge this gap by introducing AfriSUD, the first large-scale collection of syntactically annotated treebanks for nine diverse African languages spanning major language families and regions across Sub-Saharan Africa. Using the Surface-Syntactic Universal Dependencies (SUD) framework, our community-led effort provides high-quality, native-speaker verified data that capture key typological features such as agglutination and tone. We evaluate a range of models on AfriSUD for part-of-speech tagging and dependency parsing, including non-transformer baselines, multilingual pretrained encoders, and LLMs. Our results reveal a significant syntax gap, where models still show clear limitations across the nine languages, suggesting that existing architectures may not fully capture the structural diversity of African-language syntax.
OpenBibleTTS: Large-Scale Speech Resources and TTS Models for Low-Resource Languages
David Guzmán
Jesujoba Oluwadara Alabi
Dietrich Klakow
Recent advances in neural text-to-speech (TTS) and multilingual speech generation have substantially improved synthetic speech quality, yet these gains remain unevenly distributed across the world's languages. Existing models are still dominated by a small set of high-resource languages, while many studies of low-resource TTS are simulated on artificially downsampled high-resource corpora that do not reflect the orthographic variation and limited phonetic coverage encountered in genuinely underrepresented settings. As such, we introduce OpenBibleTTS, which is a large-scale benchmark for low-resource speech synthesis spanning 37 underrepresented languages. Moreover, a systematic comparison of various TTS architectures and large-scale speech generation models is conducted across in-domain Biblical text and out-of-domain material. Results show that no single system dominates across languages and metrics: Gemini-TTS achieves the highest listener ratings on most evaluated languages, but monolingual EveryVoice models trained on OpenBibleTTS remain strongest for intelligibility and are preferred in several African languages, while open from-scratch systems degrade sharply on out-of-domain text, revealing a persistent gap between broad multilingual coverage and reliable synthesis quality in underserved linguistic communities. We complement automatic evaluation with subjective human judgments, and open-source all processed datasets, alignments, and trained models to support future low-resource TTS research.
SpeechJBB: Probing Safety Alignment and Comprehension in Large Audio Language Models under Code-Switched Speech
Large audio language models (LALMs) are increasingly deployed in real-world applications, yet their safety alignment is still primarily evaluated on monolingual, text-based harmful prompts. This leaves their generalizability under multilingual and spoken settings, particularly code-switched speech, largely underexplored. To address this gap, we introduce SpeechJBB, an audio jailbreak dataset for benchmarking across multiple state-of-the-art LALMs. The extent of safety weaknesses is further probed by introducing an augmented setting where phonologically plausible pseudo-words are inserted around safety-critical terms to simulate localized obfuscation. Across models, code-switched harmful audio yields substantially high jailbreak success rates (JSR), with non-English monolingual and non-English code-switched pairs exhibiting the highest attack success. Pseudo-word insertion further reduces refusal rates, which demonstrates that natural-sounding obfuscation can effectively bypass safety policies.
NaijaS2ST: A Multi-Accent Benchmark for Speech-to-Speech Translation in Low-Resource Nigerian Languages
Min Ma
Shamsuddeen Hassan Muhammad
Idris Abdulmumin
Maryam Ibrahim Mukhtar
Daud Abolade
Joel Okepefi
Johnson Sewedo
Speech translation for low-resource languages remains fundamentally limited by the scarcity of high-quality, diverse parallel speech data, a challenge that is especially pronounced in African linguistic contexts. To address this, we introduce NaijaS2ST, a parallel speech translation dataset spanning Igbo, Hausa, Yorùbá, and Nigerian Pidgin paired with English. The dataset comprises approximately 50 hours of speech per language and captures substantial variation in speakers and accents, reflecting realistic multilingual and multi-accent conditions. With NaijaS2ST, we conduct a comprehensive benchmark of cascaded, end-to-end (E2E), and AudioLLM-based approaches across bidirectional translation settings. Our results show that audio LLMs with few-shot examples are more effective for speech-to-text translation than cascaded and end-to-end methods trained on fine-tuned data. However, for speech-to-speech translation, the cascaded and audio LLM paradigms yield comparable performance, indicating that there is still considerable room for improvement in developing targeted, task-specific models for this setting. By providing both a high-quality dataset and a systematic benchmark, we hope that NaijaS2ST will serve as a strong foundation for advancing research in low-resource, multilingual speech translation.
YoNER: A New Yorùbá Multi-domain Named Entity Recognition Dataset
Peace Busola Falola
Jesujoba O. Alabi
Solomon O. Akinola
Folashade T. Ogunajo
Emmanuel Oluwadunsin Alabi
Named Entity Recognition (NER) is a foundational NLP task, yet research in Yorùbá has been constrained by limited and domain-specific resources. Existing resources, such as MasakhaNER (a manually annotated news-domain corpus) and WikiAnn (automatically created from Wikipedia), are valuable but restricted in domain coverage. To address this gap, we present YoNER, a new multidomain Yorùbá NER dataset that extends entity coverage beyond news and Wikipedia. The dataset comprises about 5,000 sentences and 100,000 tokens collected from five domains including Bible, Blogs, Movies, Radio broadcast and Wikipedia, and annotated with three entity types: Person (PER), Organization (ORG) and Location (LOC), following CoNLL-style guidelines. Annotation was conducted manually by three native Yorùbá speakers, with an inter-annotator agreement of over 0.70, ensuring high quality and consistency. We benchmark several transformer encoder models using cross-domain experiments with MasakhaNER 2.0, and we also assess the effect of few-shot in-domain data using YoNER and cross-lingual setups with English datasets. Our results show that African-centric models outperform general multilingual models for Yorùbá, but cross-domain performance drops substantially, particularly for blogs and movie domains. Furthermore, we observed that closely related formal domains, such as news and Wikipedia, transfer more effectively. In addition, we introduce a new Yorùbá-specific language model (OyoBERT) that outperforms multilingual models in in-domain evaluation. We publicly release the YoNER dataset and pretrained OyoBERT models to support future research on Yorùbá natural language processing.
Sudanese-Flores: Extending FLORES+ to Sudanese Arabic Dialect
Hadia Mohmmedosman Ahmed Samil
In this work, we introduce Sudanese-Flores, an extension of the popular Flores+ machine translation (MT) benchmark to the Sudanese Arabic dialect. We translate both the DEV and DEVTEST splits of the Modern Standard Arabic dataset into the corresponding Sudanese dialect, resulting in a total of 2,009 sentences. While the dialect was recently introduced in Google Translate, there is no available benchmark for this dialect despite it being spoken by over 40 million people. Our evaluation of two leading LLMs, GPT-4.1 and Gemini 2.5 Flash, showed that while their English-to-Arabic performance is impressive (more than 23 BLEU), they struggle on the Sudanese dialect (less than 11 BLEU) in zero-shot settings. In the few-shot scenario, we achieved only a slight boost in performance.
Multilinguality as Sense Adaptation
Jan Christian Blaise Cruz
Alham Fikri Aji
AfriqueLLM: How Data Mixing and Model Architecture Impact Continued Pre-training for African Languages
Hao Yu
Tianyi Xu
Michael A. Hedderich
Wassim Hamidouche
Syed Waqas Zamir
Afri-MCQA: Multimodal Cultural Question Answering for African Languages
Atnafu Lambebo Tonja
Srija Anand
Emilio Villa Cueva
Israel Abebe Azime
Jesujoba Oluwadara Alabi
Muhidin A. Mohamed
Debela Desalegn Yadeta
Negasi Haile Abadi
Abigail Oppong
Nnaemeka Casmir Obiefuna
Idris Abdulmumin
Naome Etori
Eric Peter Wairagala
Kanda Patrick Tshinu
Imanigirimbabazi Emmanuel
Gabofetswe Malema
Alham Fikri Aji
Thamar Solorio
Africa is home to over one-third of the world's languages, yet remains underrepresented in AI research. We introduce Afri-MCQA, the first Multilingual Cultural Question-Answering benchmark covering 7.5k Q&A pairs across 15 African languages from 12 countries. The benchmark offers parallel English-African language Q&A pairs across text and speech modalities and was entirely created by native speakers. Benchmarking large language models (LLMs) on Afri-MCQA shows that open-weight models perform poorly across evaluated cultures, with near-zero accuracy on open-ended VQA when queried in native language or speech. To evaluate linguistic competence, we include control experiments meant to assess this specific aspect separate from cultural knowledge, and we observe significant performance gaps between native languages and English for both text and speech. These findings underscore the need for speech-first approaches, culturally grounded pretraining, and cross-lingual cultural transfer. To support more inclusive multimodal AI development in African languages, we release our Afri-MCQA under academic license or CC BY-NC 4.0 on HuggingFace (https://huggingface.co/datasets/Atnafu/Afri-MCQA).
Ibom NLP: A Step Toward Inclusive Natural Language Processing for Nigeria's Minority Languages
Oluwadara Kalejaiye
Mmekut-Mfon Gabriel Edet
A. D. Akpan
Eno-Abasi Urua
Anietie U Andy
Evaluating WMT 2025 Metrics Shared Task Submissions on the SSA-MTE African Challenge Set
Senyu Li
Felermino Dario Mario Ali
Jiayi Wang
Rui Sousa-Silva
Henrique Lopes Cardoso
Pontus Stenetorp
Colin Cherry
Findings of the WMT25 Shared Task on Automated Translation Evaluation Systems: Linguistic Diversity is Challenging and References Still Help
Alon Lavie
Greg Hanneman
Sweta Agrawal
Diptesh Kanojia
Chi-kiu Lo
Vilém Zouhar
Frédéric Blain
Chrysoula Zerva
Eleftherios Avramidis
Sourabh Dattatray Deoghare
Archchana Sindhujan
Jiayi Wang
Brian Thompson
Tom Kocmi
Markus Freitag
Daniel Deutsch