David Ifeoluwa Adelani

Clement Odoje

Idris Akinade

Iffat Maab

Davis David

Shamsuddeen Hassan Muhammad

Neo Putini

David O. Ademuyiwa

Andrew Caines

Dietrich Klakow

This paper introduces AFRIDOC-MT, a document-level multi-parallel translation dataset covering English and five African languages: Amharic, Hausa, Swahili, Yorùbá, and Zulu. The dataset comprises 334 health and 271 information technology news documents, all human-translated from English to these languages. We conduct document-level translation benchmark experiments by evaluating neural machine translation (NMT) models and large language models (LLMs) for translations between English and these languages, at both the sentence and pseudo-document levels. These outputs are realigned to form complete documents for evaluation. Our results indicate that NLLB-200 achieved the best average performance among the standard NMT models, while GPT-4o outperformed general-purpose LLMs. Fine-tuning selected models led to substantial performance gains, but models trained on sentences struggled to generalize effectively to longer documents. Furthermore, our analysis reveals that some LLMs exhibit issues such as under-generation, repetition of words or phrases, and off-target translations, especially for African languages.

2025-01-10

ArXiv (preprint)

AFRIDOC-MT: Document-level MT Corpus for African Languages

Jesujoba Oluwadara Alabi

Israel Abebe Azime

Miaoran Zhang

Cristina España-Bonet

Rachel Bawden

Dawei Zhu

Clement Odoje

Idris Akinade

Iffat Maab

Davis David

Shamsuddeen Hassan Muhammad

Neo Putini

David O. Ademuyiwa

Andrew Caines

Dietrich Klakow

This paper introduces AFRIDOC-MT, a document-level multi-parallel translation dataset covering English and five African languages: Amharic, Hausa, Swahili, Yorùbá, and Zulu. The dataset comprises 334 health and 271 information technology news documents, all human-translated from English to these languages. We conduct document-level translation benchmark experiments by evaluating neural machine translation (NMT) models and large language models (LLMs) for translations between English and these languages, at both the sentence and pseudo-document levels. These outputs are realigned to form complete documents for evaluation. Our results indicate that NLLB-200 achieved the best average performance among the standard NMT models, while GPT-4o outperformed general-purpose LLMs. Fine-tuning selected models led to substantial performance gains, but models trained on sentences struggled to generalize effectively to longer documents. Furthermore, our analysis reveals that some LLMs exhibit issues such as under-generation, repetition of words or phrases, and off-target translations, especially for African languages.

2025-01-10

ArXiv (preprint)

Fleurs-SLU: A Massively Multilingual Benchmark for Spoken Language Understanding

Fabian David Schmidt

Ivan Vulić

Goran Glavaš

Spoken language understanding (SLU) is indispensable for half of all living languages that lack a formal writing system, since these languages cannot pair automatic speech recognition (ASR) with language models to benefit from language technology. Even if low-resource languages possess a writing system, ASR for these languages remains unreliable due to limited bimodal speech and text training data. Better SLU can strengthen the robustness of massively multilingual ASR by leveraging language semantics to disambiguate utterances via context or by exploiting semantic similarities across languages. However, the evaluation of multilingual SLU remains limited to shallow tasks such as intent classification or language identification. To address this, we present Fleurs-SLU, a multilingual SLU benchmark that encompasses (i) 692 hours of speech for topical utterance classification in 102 languages and (ii) multiple-choice question answering through listening comprehension spanning 944 hours of speech across 92 languages. We extensively evaluate both end-to-end speech classification models and cascaded systems that combine speech-to-text transcription with subsequent classification by large language models on Fleurs-SLU. Our results show that cascaded systems exhibit greater robustness in multilingual SLU tasks, though speech encoders can achieve competitive performance in topical speech classification when appropriately pre-trained. We further find a strong correlation between robust multilingual ASR, effective speech-to-text translation, and strong multilingual SLU, highlighting the mutual benefits between acoustic and semantic speech representations.

2025-01-10

ArXiv (preprint)

AfriHG: News headline generation for African Languages

Toyib Ogunremi

Serah Akojenu

Anthony Soronnadi

Olubayo Adekanmbi

This paper introduces AfriHG, a news headline generation dataset created by combining the XLSum and MasakhaNEWS datasets, focusing on 16 languages widely spoken in Africa. We experimented with two seq2seq models (mT5-base and AfriTeVa V2) and the Aya-101 LLM. Our results show that Africa-centric seq2seq models such as AfriTeVa V2 outperform the massively multilingual mT5-base model. Finally, we show that fine-tuning AfriTeVa V2 with 313M parameters is competitive with prompting the Aya-101 LLM with more than 13B parameters.

2024-12-28

ArXiv (preprint)

The Responsible Foundation Model Development Cheatsheet: A Review of Tools & Resources

Shayne Longpre

Stella Biderman

Alon Albalak

Hailey Schoelkopf

Daniel McDuff

Sayash Kapoor

Kevin Klyman

Kyle Lo

Gabriel Ilharco

Nay San

Maribeth Rauh

Aviya Skowron

Bertie Vidgen

Laura Weidinger

Arvind Narayanan

Victor Sanh

Percy Liang

Rishi Bommasani

Peter Henderson

Sasha Luccioni

Yacine Jernite

Luca Soldaini

2024-12-07

TMLR (accepted)

openreview.net

Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation

Shivalika Singh

Angelika Romanou

Clémentine Fourrier

Jian Gang Ngui

Daniel Vila-Suero

Peerat Limkonchotiwat

Kelly Marchisio

Wei Qi Leong

Yosephine Susanto

Raymond Ng

Shayne Longpre

Wei-Yin Ko

Madeline Smith

Antoine Bosselut

Alice Oh

André F. T. Martins

Leshem Choshen

Daphne Ippolito

Enzo Ferrante

Marzieh Fadaee

Beyza Ermis

Sara Hooker

2024-12-04

ArXiv (preprint)

Uhura: A Benchmark for Evaluating Scientific Question Answering and Truthfulness in Low-Resource African Languages

Edward Bayes

Israel Abebe Azime

Jesujoba Oluwadara Alabi

Jonas Kgomo

Tyna Eloundou

Elizabeth Proehl

Kai Chen

Imaan Khadir

Naome Etori

Shamsuddeen Hassan Muhammad

Choice Mpanza

Igneciah Pocia Thete

Dietrich Klakow

Evaluations of Large Language Models (LLMs) on knowledge-intensive tasks and factual accuracy often focus on high-resource languages, primarily because datasets for low-resource languages (LRLs) are scarce. In this paper, we present Uhura, a new benchmark that focuses on two tasks in six typologically diverse African languages, created via human translation of existing English benchmarks. The first dataset, Uhura-ARC-Easy, is composed of multiple-choice science questions. The second, Uhura-TruthfulQA, is a safety benchmark testing the truthfulness of models on topics including health, law, finance, and politics. We highlight the challenges of creating benchmarks with highly technical content for LRLs and outline mitigation strategies. Our evaluation reveals a significant performance gap between proprietary models, such as GPT-4o, o1-preview, and the Claude models, and open-source models like Meta's LLaMA and Google's Gemma. Additionally, all models perform better in English than in African languages. These results indicate that LMs struggle with answering scientific questions and are more prone to generating false claims in low-resource African languages. Our findings underscore the necessity for continuous improvement of multilingual LM capabilities in LRL settings to ensure safe and reliable use in real-world contexts. We open-source the Uhura Benchmark and Uhura Platform to foster further research and development in NLP for LRLs.

2024-12-01

ArXiv (preprint)