
Jessica Ojo

Master's Research - McGill University
Research Topics
Large Language Models (LLM)
Linguistic Evaluation of Language Models
Machine Learning For Speech and Audio
Machine Translation
Scaling Engineering Infrastructure for Large Model Training

Publications

DIVERS-Bench: Evaluating Language Identification Across Domain Shifts and Code-Switching
Language Identification (LID) is a core task in multilingual NLP, yet current systems often overfit to clean, monolingual data. This work introduces DIVERS-Bench, a comprehensive evaluation of state-of-the-art LID models across diverse domains, including speech transcripts, web text, social media texts, children's stories, and code-switched text. Our findings reveal that while models achieve high accuracy on curated datasets, performance degrades sharply on noisy and informal inputs. We also introduce DIVERS-CS, a diverse code-switching benchmark dataset spanning 10 language pairs, and show that existing models struggle to detect multiple languages within the same sentence. These results highlight the need for more robust and inclusive LID systems in real-world settings.
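The domain-shift comparison described above reduces to scoring one LID system per domain and comparing sentence-level accuracies. Below is a minimal sketch of that loop, assuming a hypothetical predict_language function standing in for any LID model; the domain names and toy examples are illustrative, not the actual DIVERS-Bench data.

```python
def predict_language(text: str) -> str:
    # Placeholder for a real LID model (e.g. a fastText or transformer classifier).
    # Always answering "eng" mimics the English bias many systems show on noisy input.
    return "eng"

# Toy (text, gold-label) pairs per domain; DIVERS-Bench covers speech transcripts,
# web text, social media, children's stories, and code-switched sentences.
domains = {
    "web_text": [("The central bank raised interest rates today.", "eng")],
    "social_media": [("lol c'est trop drôle mdr", "fra")],
    "code_switched": [("I dey go market now, see you later", "pcm")],
}

for domain, examples in domains.items():
    correct = sum(predict_language(text) == gold for text, gold in examples)
    print(f"{domain}: {correct / len(examples):.0%} accuracy")
```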
AfroBench: How Good are Large Language Models on African Languages?
Kelechi Ogueji
Pontus Stenetorp
IrokoBench: A New Benchmark for African Languages in the Age of Large Language Models
Israel Abebe Azime
Zhuang Yun Jian
Jesujoba Oluwadara Alabi
Xuanli He
Millicent Ochieng
Sara Hooker
Andiswa Bukula
En-Shiun Annie Lee
Chiamaka Ijeoma Chukwuneke
Happy Buzaaba
Blessing Kudzaishe Sibanda
Godson Kalipe
Jonathan Mukiibi
Salomon Kabongo
Foutse Yuehgoh
M. Setaka
Lolwethu Ndolela
Nkiruka Bridget Odu …
Rooweither Mabuya
Shamsuddeen Hassan Muhammad
Salomey Osei
Sokhar Samb
Tadesse Kebede Guge
Pontus Stenetorp
Despite the widespread adoption of large language models (LLMs), their remarkable capabilities remain limited to a few high-resource languages. Additionally, many low-resource languages (e.g. African languages) are often evaluated only on basic text classification tasks due to the lack of appropriate or comprehensive benchmarks outside of high-resource languages. In this paper, we introduce IrokoBench -- a human-translated benchmark dataset for 16 typologically-diverse low-resource African languages covering three tasks: natural language inference (AfriXNLI), mathematical reasoning (AfriMGSM), and multi-choice knowledge-based QA (AfriMMLU). We use IrokoBench to evaluate zero-shot, few-shot, and translate-test settings (where test sets are translated into English) across 10 open and four proprietary LLMs. Our evaluation reveals a significant performance gap between high-resource languages (such as English and French) and low-resource African languages. We also observe a significant gap between open and proprietary models, with the best-performing open model, Aya-101, reaching only 58% of the performance of the best proprietary model, GPT-4o. Machine translating the test set to English before evaluation helped to close the gap for larger English-centric models like LLaMa 3 70B. These findings suggest that more effort is needed to develop and adapt LLMs for African languages.
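The translate-test setting mentioned in the abstract is straightforward to express in code: translate each test example into English, then query the LLM on the translated text and score the answer. The sketch below assumes hypothetical translate_to_english and ask_llm helpers, since the MT system and model endpoints used in the paper are not reproduced here.

```python
# Minimal translate-test evaluation loop (illustrative; helpers are hypothetical stand-ins).
def translate_to_english(text: str, source_lang: str) -> str:
    # Stand-in for an MT system (e.g. NLLB or a commercial API).
    return text

def ask_llm(prompt: str) -> str:
    # Stand-in for a call to an open or proprietary LLM.
    return "A"

test_set = [
    {
        "question": "<AfriMMLU question in Yoruba>",
        "choices": ["A. Abuja", "B. Lagos", "C. Kano", "D. Ibadan"],
        "answer": "A",
        "lang": "yor",
    },
]

correct = 0
for ex in test_set:
    question_en = translate_to_english(ex["question"], ex["lang"])
    prompt = question_en + "\n" + "\n".join(ex["choices"]) + "\nAnswer with the letter only."
    if ask_llm(prompt).strip().startswith(ex["answer"]):
        correct += 1

print(f"translate-test accuracy: {correct / len(test_set):.0%}")
```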
IrokoBench: A New Benchmark for African Languages in the Age of Large Language Models
Israel Abebe Azime
Zhuang Yun Jian
Jesujoba Oluwadara Alabi
Xuanli He
Millicent Ochieng
Sara Hooker
Andiswa Bukula
En-Shiun Annie Lee
Chiamaka Ijeoma Chukwuneke
Happy Buzaaba
Blessing Kudzaishe Sibanda
Godson Kalipe
Jonathan Mukiibi
Salomon Kabongo
Foutse Yuehgoh
M. Setaka
Lolwethu Ndolela
Nkiruka Bridget Odu …
Rooweither Mabuya
Shamsuddeen Hassan Muhammad
Salomey Osei
Sokhar Samb
Tadesse Kebede Guge
Pontus Stenetorp
Despite the widespread adoption of large language models (LLMs), their remarkable capabilities remain limited to a few high-resource languages. Additionally, many low-resource languages (e.g. African languages) are often evaluated only on basic text classification tasks due to the lack of appropriate or comprehensive benchmarks outside of high-resource languages. In this paper, we introduce IrokoBench -- a human-translated benchmark dataset for 17 typologically-diverse low-resource African languages covering three tasks: natural language inference (AfriXNLI), mathematical reasoning (AfriMGSM), and multi-choice knowledge-based question answering (AfriMMLU). We use IrokoBench to evaluate zero-shot, few-shot, and translate-test settings (where test sets are translated into English) across 10 open and six proprietary LLMs. Our evaluation reveals a significant performance gap between high-resource languages (such as English and French) and low-resource African languages. We also observe a significant gap between open and proprietary models, with the best-performing open model, Gemma 2 27B, reaching only 63% of the performance of the best proprietary model, GPT-4o. In addition, machine translating the test set to English before evaluation helped to close the gap for larger English-centric models such as Gemma 2 27B and LLaMa 3.1 70B. These findings suggest that more effort is needed to develop and adapt LLMs for African languages.
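For the zero-shot and few-shot settings, the main moving part is how the prompt is assembled before it is sent to the model. A rough sketch of that assembly for an AfriMMLU-style multiple-choice item follows; the template and field names are assumptions for illustration, not the exact prompts used in the paper.

```python
# Build zero-shot and few-shot prompts for a multiple-choice QA item (illustrative template).
def format_item(item: dict, include_answer: bool = False) -> str:
    lines = [item["question"]] + item["choices"]
    lines.append(f"Answer: {item['answer']}" if include_answer else "Answer:")
    return "\n".join(lines)

def build_prompt(test_item: dict, few_shot_examples: list[dict]) -> str:
    demos = [format_item(ex, include_answer=True) for ex in few_shot_examples]
    return "\n\n".join(demos + [format_item(test_item)])

demo = {"question": "<question 1>", "choices": ["A. ...", "B. ..."], "answer": "A"}
test = {"question": "<question 2>", "choices": ["A. ...", "B. ..."], "answer": "B"}

print(build_prompt(test, few_shot_examples=[]))      # zero-shot prompt
print(build_prompt(test, few_shot_examples=[demo]))  # one-shot prompt
```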
AfriMTE and AfriCOMET: Enhancing COMET to Embrace Under-resourced African Languages
Jiayi Wang
Sweta Agrawal
Marek Masiak
Ricardo Rei
Eleftheria Briakou
Marine Carpuat
Xuanli He
Sofia Bourhim
Andiswa Bukula
Muhidin A. Mohamed
Temitayo Olatoye
Tosin Adewumi
Hamam Mokayed
Christine Mwase
Wangui Kimotho
Foutse Yuehgoh
Aremu Anuoluwapo
Shamsuddeen Hassan Muhammad …
Salomey Osei
Abdul-Hakeem Omotayo
Chiamaka Ijeoma Chukwuneke
Perez Ogayo
Oumaima Hourrane
Salma El Anigri
Lolwethu Ndolela
Thabiso Mangwana
Shafie Abdi Mohamed
Hassan Ayinde
Ayinde Hassan
Oluwabusayo Olufunke Awoyomi
Lama Alkhaled
sana Sabah al-azzawi
Naome Etori
Millicent Ochieng
Clemencia Siro
Samuel Njoroge
Njoroge Kiragu
Eric Muchiri
Wangari Kimotho
Lyse Naomi Wamba
Daud Abolade
Simbiat Ajao
Iyanuoluwa Shode
Ricky Macharm
Ruqayya Nasir Iro
Saheed Salahudeen Abdullahi
Stephen Moore
Bernard Opoku
Zainab Akinjobi
Abeeb Afolabi
Nnaemeka Casmir Obiefuna
Onyekachi Ogbu
Sam Brian
Sam Ochieng’
Verrah Akinyi Otiende
CHINEDU EMMANUEL MBONU
Toadoum Sari Sakayo
Pontus Stenetorp
Despite the recent progress on scaling multilingual machine translation (MT) to several under-resourced African languages, accurately measuring this progress remains challenging, since evaluation is often performed on n-gram matching metrics such as BLEU, which typically show a weaker correlation with human judgments. Learned metrics such as COMET have higher correlation; however, the lack of evaluation data with human ratings for under-resourced languages, complexity of annotation guidelines like Multidimensional Quality Metrics (MQM), and limited language coverage of multilingual encoders have hampered their applicability to African languages. In this paper, we address these challenges by creating high-quality human evaluation data with simplified MQM guidelines for error detection and direct assessment (DA) scoring for 13 typologically diverse African languages. Furthermore, we develop AfriCOMET: COMET evaluation metrics for African languages by leveraging DA data from well-resourced languages and an African-centric multilingual encoder (AfroXLM-R) to create the state-of-the-art MT evaluation metrics for African languages with respect to Spearman-rank correlation with human judgments (0.441).
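Metric quality in this line of work is reported as a Spearman-rank correlation between the metric's segment scores and human direct assessment (DA) scores. A small sketch of that computation with SciPy, using made-up scores rather than the paper's data:

```python
from scipy.stats import spearmanr

# Made-up segment-level scores: one entry per translated sentence.
metric_scores = [0.71, 0.45, 0.88, 0.32, 0.60]  # scores from an MT metric (e.g. AfriCOMET)
human_da_scores = [78, 40, 92, 35, 55]           # human direct-assessment ratings (0-100)

corr, p_value = spearmanr(metric_scores, human_da_scores)
print(f"Spearman rank correlation: {corr:.3f} (p = {p_value:.3f})")
```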
McGill NLP Group Submission to the MRL 2024 Shared Task: Ensembling Enhances Effectiveness of Multilingual Small LMs
We present our systems for the three tasks and five languages included in the MRL 2024 Shared Task on Multilingual Multi-task Information Retrieval: (1) Named Entity Recognition, (2) Free-form Question Answering, and (3) Multiple-choice Question Answering. For each task, we explored the impact of selecting different multilingual language models for fine-tuning across various target languages, and implemented an ensemble system that generates final outputs based on predictions from multiple fine-tuned models. All models are large language models fine-tuned on task-specific data. Our experimental results show that a more balanced dataset would yield better results. However, when training data for certain languages are scarce, fine-tuning on a large amount of English data supplemented by a small amount of “triggering data” in the target language can produce decent results.
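The ensembling step described above can be as simple as a majority vote over the labels predicted by several fine-tuned models. A minimal sketch follows; the model predictions are made up, and the actual ensembling rule used in the submission may differ.

```python
from collections import Counter

# Predictions from three hypothetical fine-tuned models for the same NER tokens.
model_predictions = [
    ["B-PER", "O", "B-LOC"],      # model 1
    ["B-PER", "O", "O"],          # model 2
    ["B-PER", "B-ORG", "B-LOC"],  # model 3
]

# Majority vote per position; ties fall back to the first model's prediction.
ensemble = []
for position, votes in enumerate(zip(*model_predictions)):
    label, count = Counter(votes).most_common(1)[0]
    ensemble.append(label if count > 1 else model_predictions[0][position])

print(ensemble)  # ['B-PER', 'O', 'B-LOC']
```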
How good are Large Language Models on African Languages?
Kelechi Ogueji
Pontus Stenetorp
AfriMTE and AfriCOMET: Empowering COMET to Embrace Under-resourced African Languages
Jiayi Wang
Sweta Agrawal
Ricardo Rei
Eleftheria Briakou
Marine Carpuat
Marek Masiak
Xuanli He
Sofia Bourhim
Andiswa Bukula
Muhidin A. Mohamed
Temitayo Olatoye
Hamam Mokayede
Christine Mwase
Wangui Kimotho
Foutse Yuehgoh
Anuoluwapo Aremu
Shamsuddeen Hassan Muhammad
Salomey Osei …
Abdul-Hakeem Omotayo
Chiamaka Chukwuneke
Perez Ogayo
Oumaima Hourrane
Salma El Anigri
Lolwethu Ndolela
Thabiso Mangwana
Shafie Abdi Mohamed
Ayinde Hassan
Oluwabusayo Olufunke Awoyomi
Lama Alkhaled
sana Sabah al-azzawi
Naome A. Etori
Millicent A. Ochieng
Clemencia Siro
Samuel Njoroge
Eric Muchiri
Wangari Kimotho
Lyse Naomi Wamba Momo
Daud Abolade
Simbiat Ajao
Tosin P. Adewumi
Iyanuoluwa Shode
Ricky Macharm
Ruqayya Nasir Iro
Saheed Salahudeen Abdullahi
Stephen E. Moore
Bernard Opoku
Zainab Akinjobi
Abeeb Afolabi
Nnaemeka Casmir Obiefuna
Onyekachi Ogbu
Sam Brian
V. Otiende
CHINEDU EMMANUEL MBONU
Toadoum Sari Sakayo
Pontus Stenetorp
Despite the progress we have recorded in scaling multilingual machine translation (MT) models and evaluation data to several under-resourced African languages, it is difficult to measure this progress accurately because evaluation is often performed on n-gram matching metrics like BLEU, which often correlate poorly with human judgments. Embedding-based metrics such as COMET correlate better; however, the lack of evaluation data with human ratings for under-resourced languages, the complexity of annotation guidelines like Multidimensional Quality Metrics (MQM), and the limited language coverage of multilingual encoders have hampered their applicability to African languages. In this paper, we address these challenges by creating high-quality human evaluation data with a simplified MQM guideline for error-span annotation and direct assessment (DA) scoring for 13 typologically diverse African languages. Furthermore, we develop AfriCOMET, a COMET evaluation metric for African languages, by leveraging DA training data from high-resource languages and an African-centric multilingual encoder (AfroXLM-Roberta) to create the state-of-the-art MT evaluation metric for African languages with respect to Spearman-rank correlation with human judgments (+0.406).
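In practice, COMET-style metrics of this kind are applied by loading a trained checkpoint and scoring (source, hypothesis, reference) triples. The sketch below uses the open-source unbabel-comet package with a generic public checkpoint as a stand-in; the AfriCOMET checkpoints released with the paper would be substituted for it, and the segment texts here are made up.

```python
# pip install unbabel-comet
from comet import download_model, load_from_checkpoint

# Generic public COMET checkpoint used as a placeholder; swap in an AfriCOMET
# checkpoint identifier from the paper's release for African-language evaluation.
model_path = download_model("Unbabel/wmt22-comet-da")
model = load_from_checkpoint(model_path)

data = [
    {
        "src": "Example source sentence in an African language.",
        "mt": "Example machine translation output.",
        "ref": "Example human reference translation.",
    },
]

output = model.predict(data, batch_size=8, gpus=0)
print(output.scores)        # one score per segment
print(output.system_score)  # corpus-level average
```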