
Chris Emezue

Master's Research - Université de Montréal
Research Topics
Causality
Deep Learning
Natural Language Processing
Representation Learning

Publications

The NaijaVoices Dataset: Cultivating Large-Scale, High-Quality, Culturally-Rich Speech Data for African Languages
The NaijaVoices Community
Busayo Awobade
Abraham Owodunni
Handel Emezue
Gloria Monica Tobechukwu Emezue
N. N. Emezue
Sewade Ogun
Bunmi Akinremi
The development of high-performing, robust, and reliable speech technologies depends on large, high-quality datasets. However, African languages -- including our focus, Igbo, Hausa, and Yoruba -- remain under-represented due to insufficient data. Popular voice-enabled technologies do not support any of the 2000+ African languages, limiting accessibility for circa one billion people. While previous dataset efforts exist for the target languages, they lack the scale and diversity needed for robust speech models. To bridge this gap, we introduce the NaijaVoices dataset, a 1,800-hour speech-text dataset with 5,000+ speakers. We outline our unique data collection approach, analyze its acoustic diversity, and demonstrate its impact through finetuning experiments on automatic speech recognition, achieving average WER improvements of 75.86% (Whisper), 52.06% (MMS), and 42.33% (XLSR). These results highlight NaijaVoices' potential to advance multilingual speech processing for African languages.
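The WER improvements reported above are relative reductions in word error rate after finetuning. A minimal sketch of that computation, using hypothetical baseline and finetuned WERs (the abstract does not give the underlying per-model numbers):

```python
def relative_wer_improvement(baseline_wer: float, finetuned_wer: float) -> float:
    """Relative reduction in word error rate, expressed in percent."""
    return 100.0 * (baseline_wer - finetuned_wer) / baseline_wer

# Hypothetical illustration: a baseline WER of 0.80 reduced to 0.20
# after finetuning is a 75% relative improvement -- comparable in
# scale to the 75.86% reported for Whisper above.
print(round(relative_wer_improvement(0.80, 0.20), 2))  # 75.0
```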
Benchmarking Bayesian Causal Discovery Methods for Downstream Treatment Effect Estimation
The practical utility of causality in decision-making is widespread and brought about by the intertwining of causal discovery and causal inference. Nevertheless, a notable gap exists in the evaluation of causal discovery methods, where insufficient emphasis is placed on downstream inference. To address this gap, we evaluate seven established baseline causal discovery methods, including a newly proposed method based on GFlowNets, on the downstream task of treatment effect estimation. Through the implementation of a distribution-level evaluation, we offer valuable and unique insights into the efficacy of these causal discovery methods for treatment effect estimation, considering both synthetic and real-world scenarios, as well as low-data scenarios. The results of our study demonstrate that some of the algorithms studied are able to effectively capture a wide range of useful and diverse ATE modes, while some tend to learn many low-probability modes, which impacts the (unrelaxed) recall and precision.
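The idea of "ATE modes" above can be illustrated with a toy distribution-level evaluation. This is an illustrative sketch only, not the paper's benchmark: suppose a Bayesian causal-discovery method returns posterior samples over two candidate graphs for variables (X, Y), and each graph implies a different average treatment effect of do(X=1) versus do(X=0).

```python
import random

# Under a hypothetical linear model Y = 2.0 * X + noise, the ATE on Y
# is 2.0 if the graph is X -> Y, and 0.0 if it is Y -> X (no causal
# path from X to Y).
def ate_for_graph(graph: str, coeff: float = 2.0) -> float:
    return coeff if graph == "X->Y" else 0.0

random.seed(0)
# Hypothetical posterior: the method puts 70% of its mass on the true graph.
posterior_samples = random.choices(["X->Y", "Y->X"], weights=[0.7, 0.3], k=1000)

ates = [ate_for_graph(g) for g in posterior_samples]
# The distinct values in `ates` are the ATE modes the method captures,
# and the fraction of samples at each mode is its posterior weight.
print(sorted(set(ates)))
print(ates.count(2.0) / len(ates))
```

A method that concentrates mass on many such low-probability spurious modes would score poorly on the (unrelaxed) recall and precision the abstract refers to.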
AfriQA: Cross-lingual Open-Retrieval Question Answering for African Languages
Odunayo Ogundepo
Tajuddeen Gwadabe
Clara E. Rivera
Jonathan H. Clark
Sebastian Ruder
Bonaventure F. P. Dossou
Abdoulahat Diop
Claytone Sikasote
Gilles HACHEME
Happy Buzaaba
Ignatius Ezeani
Rooweither Mabuya
Salomey Osei
Albert Kahira
Shamsuddeen Hassan Muhammad
Akintunde Oladipo
Abraham Toluwase Owodunni
Atnafu Lambebo Tonja … (32 more authors)
Iyanuoluwa Shode
Akari Asai
Tunde Oluwaseyi Ajayi
Clemencia Siro
Stephen Arthur
Mofetoluwa Adeyemi
Orevaoghene Ahia
Aremu Anuoluwapo
Oyinkansola Awosan
Chiamaka Ijeoma Chukwuneke
Bernard Opoku
A. Ayodele
Verrah Akinyi Otiende
Christine Mwase
Boyd Sinkala
Andre Niyongabo Rubungo
Daniel Ajisafe
Emeka Felix Onwuegbuzia
Habib Mbow
Emile Niyomutabazi
Eunice Mukonde
Falalu Lawan
Ibrahim Ahmad
Jesujoba Oluwadara Alabi
Martin Namukombo
Mbonu Chinedu
Mofya Phiri
Neo Putini
Ndumiso Mngoma
Priscilla A. Amuok
Ruqayya Nasir Iro
Sonia Adhiambo
Findings of the 1st Shared Task on Multi-lingual Multi-task Information Retrieval at MRL 2023
Francesco Tinner
Mammad Hajili
Omer Goldman
Muhammad Farid Adilazuarda
Muhammad Dehan Al Kautsar
Aziza Mirsaidova
Müge Kural
Dylan Massey
Chiamaka Ijeoma Chukwuneke
CHINEDU EMMANUEL MBONU
Damilola Oluwaseun Oloyede
Kayode Olaleye
Jonathan Atala
Benjamin A. Ajibade
Saksham Bassi
Rahul Aralikatte
Najoung Kim
Duygu Ataman
Large language models (LLMs) excel in language understanding and generation, especially in English which has ample public benchmarks for various natural language processing (NLP) tasks. Nevertheless, their reliability across different languages and domains remains uncertain. Our new shared task introduces a novel benchmark to assess the ability of multilingual LLMs to comprehend and produce language under sparse settings, particularly in scenarios with under-resourced languages, with an emphasis on the ability to capture logical, factual, or causal relationships within lengthy text contexts. The shared task consists of two sub-tasks crucial to information retrieval: Named Entity Recognition (NER) and Reading Comprehension (RC), in 7 data-scarce languages: Azerbaijani, Igbo, Indonesian, Swiss German, Turkish, Uzbek and Yorùbá, which previously lacked annotated resources in information retrieval tasks. Our evaluation of leading LLMs reveals that, despite their competitive performance, they still have notable weaknesses such as producing output in the non-target language or providing counterfactual information that cannot be inferred from the context. As more advanced models emerge, the benchmark will remain essential for supporting fairness and applicability in information retrieval systems.
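NER systems like those in the shared task above are conventionally scored with entity-level F1 over exact (type, span) matches. A minimal sketch of that metric (not the shared task's official scorer):

```python
def entity_f1(gold: set, pred: set) -> float:
    """Micro F1 over exact (entity_type, start, end) tuple matches."""
    if not gold and not pred:
        return 1.0
    tp = len(gold & pred)  # true positives: exact matches only
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy example: two gold entities; the model recovers one and adds one
# spurious prediction -> precision 0.5, recall 0.5, F1 0.5.
gold = {("PER", 0, 2), ("LOC", 5, 6)}
pred = {("PER", 0, 2), ("ORG", 8, 9)}
print(entity_f1(gold, pred))  # 0.5
```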
GFlowOut: Dropout with Generative Flow Networks
Dianbo Liu
Moksh J. Jain
Bonaventure F. P. Dossou
Qianli Shen
Anirudh Goyal
Xu Ji
Kenji Kawaguchi
MasakhaNEWS: News Topic Classification for African languages
Marek Masiak
Israel Abebe Azime
Jesujoba Oluwadara Alabi
Atnafu Lambebo Tonja
Christine Mwase
Odunayo Ogundepo
Bonaventure F. P. Dossou
Akintunde Oladipo
Doreen Nixdorf
sana Sabah al-azzawi
Blessing Kudzaishe Sibanda
Davis David
Lolwethu Ndolela
Jonathan Mukiibi
Tunde Oluwaseyi Ajayi
Tatiana Moteu Ngoli
Brian Odhiambo
Abraham Toluwase Owodunni … (42 more authors)
Nnaemeka Casmir Obiefuna
Shamsuddeen Hassan Muhammad
Saheed Salahudeen Abdullahi
Mesay Gemeda Yigezu
Tajuddeen Gwadabe
Idris Abdulmumin
Mahlet Taye Bame
Oluwabusayo Olufunke Awoyomi
Iyanuoluwa Shode
Tolulope Anu Adelani
Habiba Abdulganiy Kailani
Abdul-Hakeem Omotayo
Adetola Adeeko
Afolabi Abeeb
Aremu Anuoluwapo
Olanrewaju Samuel
Clemencia Siro
Wangari Kimotho
Onyekachi Ogbu
CHINEDU EMMANUEL MBONU
Chiamaka Ijeoma Chukwuneke
Samuel Fanijo
Oyinkansola Fiyinfoluwa Awosan
Tadesse Kebede Guge
Toadoum Sari Sakayo
Pamela Nyatsine
Freedmore Sidume
Oreen Yousuf
Mardiyyah Oduwole
USSEN ABRE KIMANUKA
Kanda Patrick Tshinu
Thina Diko
Siyanda Nxakama
Abdulmejid Tuni Johar
Sinodos Gebre
Muhidin A. Mohamed
Shafie Abdi Mohamed
Fuad Mire Hassan
Moges Ahmed Mehamed
Evrard Ngabire
Pontus Stenetorp
African languages are severely under-represented in NLP research due to a lack of datasets covering several NLP tasks. While there are individual language-specific datasets that are being expanded to different tasks, only a handful of NLP tasks (e.g. named entity recognition and machine translation) have standardized benchmark datasets covering several geographical and typologically-diverse African languages. In this paper, we develop MasakhaNEWS -- a new benchmark dataset for news topic classification covering 16 languages widely spoken in Africa. We provide an evaluation of baseline models by training classical machine learning models and fine-tuning several language models. Furthermore, we explore several alternatives to full fine-tuning of language models that are better suited for zero-shot and few-shot learning, such as cross-lingual parameter-efficient fine-tuning (like MAD-X), pattern exploiting training (PET), prompting language models (like ChatGPT), and prompt-free sentence transformer fine-tuning (SetFit and Cohere Embedding API). Our evaluation in the zero-shot setting shows the potential of prompting ChatGPT for news topic classification in low-resource African languages, achieving an average performance of 70 F1 points without leveraging additional supervision like MAD-X. In the few-shot setting, we show that with as little as 10 examples per label, we achieve more than 90% (i.e. 86.0 F1 points) of the performance of full supervised training (92.6 F1 points) leveraging the PET approach.
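The "more than 90%" claim above follows directly from the two F1 scores the abstract reports, and can be checked with one line of arithmetic:

```python
few_shot_f1 = 86.0   # PET with 10 examples per label (from the abstract)
full_f1 = 92.6       # full supervised training (from the abstract)

fraction = few_shot_f1 / full_f1
print(f"{100 * fraction:.1f}%")  # 92.9% -- i.e. more than 90% of full performance
```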
MasakhaPOS: Part-of-Speech Tagging for Typologically Diverse African languages
Cheikh M. Bamba Dione
Peter Nabende
Jesujoba Oluwadara Alabi
Thapelo Sindane
Happy Buzaaba
Shamsuddeen Hassan Muhammad
Perez Ogayo
Aremu Anuoluwapo
Catherine Gitau
Derguene Mbaye
Jonathan Mukiibi
Blessing Kudzaishe Sibanda
Bonaventure F. P. Dossou
Andiswa Bukula
Rooweither Mabuya
Allahsera Auguste Tapo
Edwin Munkoh-Buabeng
Victoire Memdjokam Koagne … (24 more authors)
Fatoumata Ouoba Kabore
Amelia Taylor
Godson Kalipe
Tebogo Macucwa
Vukosi Marivate
Tajuddeen Gwadabe
Mboning Tchiaze Elvis
Ikechukwu Onyenwe
Gratien Atindogbe
Tolulope Anu Adelani
Idris Akinade
Olanrewaju Samuel
Marien Nahimana
Théogène Musabeyezu
Emile Niyomutabazi
Ester Chimhenga
Kudzai Gotosa
Patrick Mizha
Apelete Agbolo
Seydou Traore
Chinedu Uchechukwu
Aliyu Yusuf
Muhammad Abdullahi
Dietrich Klakow
In this paper, we present AfricaPOS, the largest part-of-speech (POS) dataset for 20 typologically diverse African languages. We discuss the challenges in annotating POS for these languages using the universal dependencies (UD) guidelines. We conducted extensive POS baseline experiments using both a conditional random field and several multilingual pre-trained language models. We applied various cross-lingual transfer models trained with data available in the UD. Evaluating on the AfricaPOS dataset, we show that choosing the best transfer language(s) in both single-source and multi-source setups greatly improves the POS tagging performance of the target languages, in particular when combined with parameter-fine-tuning methods. Crucially, transferring knowledge from a language that matches the language family and morphosyntactic properties seems to be more effective for POS tagging in unseen languages.