
David Ifeoluwa Adelani

Core Academic Member
Canada CIFAR AI Chair
McGill University
Research Topics
Deep Learning
Natural Language Processing
Representation Learning
Speech Processing

Biography

David Adelani is an assistant professor at McGill University’s School of Computer Science under the Fighting Inequities initiative, and a core academic member of Mila – Quebec Artificial Intelligence Institute.

Adelani’s research focuses on multilingual natural language processing with special attention to under-resourced languages.

Current Students

Master's Research - McGill University
Research Intern - McGill University
Research Intern - McGill University
Research Intern - McGill University
Research Intern - McGill University
PhD - McGill University
Research Intern - McGill University
PhD - McGill University
PhD - McGill University
Master's Research - McGill University
Collaborating Alumni - McGill University
Research Intern - McGill University
Professional Master's - Université de Montréal
Research Intern - McGill University
Master's Research - McGill University

Publications

Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation
Shivalika Singh
Angelika Romanou
Clémentine Fourrier
Jian Gang Ngui
Daniel Vila-Suero
Peerat Limkonchotiwat
Kelly Marchisio
Wei Qi Leong
Yosephine Susanto
Raymond Ng
Shayne Longpre
Wei-Yin Ko
Madeline Smith
Antoine Bosselut
Alice Oh
André F. T. Martins
Leshem Choshen
Daphne Ippolito
Enzo Ferrante
Marzieh Fadaee
Beyza Ermis
Sara Hooker
Uhura: A Benchmark for Evaluating Scientific Question Answering and Truthfulness in Low-Resource African Languages
Edward Bayes
Israel Abebe Azime
Jesujoba Oluwadara Alabi
Jonas Kgomo
Tyna Eloundou
Elizabeth Proehl
Kai Chen
Imaan Khadir
Naome Etori
Shamsuddeen Hassan Muhammad
Choice Mpanza
Igneciah Pocia Thete
Dietrich Klakow
Evaluations of Large Language Models (LLMs) on knowledge-intensive tasks and factual accuracy often focus on high-resource languages, primarily because datasets for low-resource languages (LRLs) are scarce. In this paper, we present Uhura -- a new benchmark that focuses on two tasks in six typologically diverse African languages, created via human translation of existing English benchmarks. The first dataset, Uhura-ARC-Easy, is composed of multiple-choice science questions. The second, Uhura-TruthfulQA, is a safety benchmark testing the truthfulness of models on topics including health, law, finance, and politics. We highlight the challenges of creating benchmarks with highly technical content for LRLs and outline mitigation strategies. Our evaluation reveals a significant performance gap between proprietary models (such as GPT-4o, o1-preview, and Claude) and open-source models like Meta's LLaMA and Google's Gemma. Additionally, all models perform better in English than in African languages. These results indicate that LLMs struggle with answering scientific questions and are more prone to generating false claims in low-resource African languages. Our findings underscore the necessity for continuous improvement of multilingual LLM capabilities in LRL settings to ensure safe and reliable use in real-world contexts. We open-source the Uhura Benchmark and Uhura Platform to foster further research and development in NLP for LRLs.
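To make the evaluation setup concrete, here is a minimal sketch of how an ARC-style multiple-choice benchmark like Uhura-ARC-Easy is typically scored. The example schema (question, choices, answer index) and the `ask_model` callable are illustrative assumptions, not the released benchmark's actual interface.

```python
# Hypothetical sketch of a multiple-choice evaluation loop for an
# ARC-style benchmark such as Uhura-ARC-Easy. The example schema and
# the `ask_model` callable are illustrative assumptions.

def format_prompt(question, choices):
    """Render a question and lettered answer options as a single prompt."""
    options = "\n".join(f"{chr(65 + i)}. {c}" for i, c in enumerate(choices))
    return f"{question}\n{options}\nAnswer:"

def evaluate(examples, ask_model):
    """examples: dicts with 'question', 'choices' (list of str), 'answer' (int index).
    ask_model: callable mapping a prompt string to the model's chosen index."""
    correct = sum(
        ask_model(format_prompt(ex["question"], ex["choices"])) == ex["answer"]
        for ex in examples
    )
    return correct / len(examples)

# Toy usage with a trivial "model" that always picks option A:
examples = [{"question": "Which planet is closest to the Sun?",
             "choices": ["Mercury", "Venus", "Mars"], "answer": 0}]
print(evaluate(examples, ask_model=lambda prompt: 0))  # -> 1.0
```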
WorldCuisines: A Massive-Scale Benchmark for Multilingual and Multicultural Visual Question Answering on Global Cuisines
Genta Indra Winata
Frederikus Hudi
Patrick Amadeus Irawan
David Anugraha
Rifki Afina Putri
Yutong Wang
Adam Nohejl
Ubaidillah Ariq Prathama
Nedjma OUSIDHOUM
Afifa Amriani
Anar Rzayev
Anirban Das
Ashmari Pramodya
Aulia Adila
Bryan Wilie
Candy Olivia Mawalim
Ching Lam Cheng
Daud Abolade
Emmanuele Chersoni
Enrico Santus
Fariz Ikhwantri
Garry Kuwanto
Hanyang Zhao
Haryo Akbarianto Wibowo
Holy Lovenia
Jan Christian Blaise Cruz
Jan Wira Gotama Putra
Junho Myung
Lucky Susanto
Maria Angelica Riera Machin
Marina Zhukova
Michael Anugraha
Muhammad Farid Adilazuarda
Natasha Santosa
Peerat Limkonchotiwat
Raj Dabre
Rio Alexander Audino
Samuel Cahyawijaya
Shi-Xiong Zhang
Stephanie Yulia Salim
Yi Zhou
Yinxuan Gui
En-Shiun Annie Lee
Shogo Okada
Ayu Purwarianti
Alham Fikri Aji
Taro Watanabe
Derry Tanti Wijaya
Alice Oh
Chong-Wah Ngo
CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark
David Orlando Romero Mogrovejo
Chenyang Lyu
Haryo Akbarianto Wibowo
Santiago Góngora
Aishik Mandal
Sukannya Purkayastha
Jesus-German Ortiz-Barajas
Emilio Villa Cueva
Jinheon Baek
Soyeong Jeong
Injy Hamed
Zheng Xin Yong
Zheng Wei Lim
Paula Mónica Silva
Jocelyn Dunstan
D. Meur
Mélanie Jouitteau
David LE MEUR
Joan Nwatu
Ganzorig Batnasan
Munkh-Erdene Otgonbold
Munkhjargal Gochoo
Guido Ivetta
Luciana Benotti
Laura Alonso Alemany
Hernán Maina
Jiahui Geng
Tiago Timponi Torrent
Frederico Belcavello
Israel Abebe Azime
Marcelo Viridiano
Jan Christian Blaise Cruz
Dan John Velasco
Zara Burzo
Chenxi Whitehouse
Artem Abzaliev
Teresa Clifford
Gráinne Caulfield
Teresa Lynn
Christian Salamea-Palacios
Yova Kementchedjhieva
Mihail Minkov Mihaylov
Henok Biadglign Ademtew
Bontu Fufa Balcha
Rada Mihalcea
Atnafu Lambebo Tonja
Maria Camila Buitrago Cabrera
Naome Etori
Gisela Vallejo
Holy Lovenia
Ruochen Zhang
Marcos Estecha-Garitagoitia
Mario Rodríguez-Cantelar
Toqeer Ehsan
Rendi Chevi
Muhammad Farid Adilazuarda
Ryandito Diandaru
Samuel Cahyawijaya
Fajri Koto
Tatsuki Kuribayashi
Haiyue Song
Aditya Nanda Kishore Khandavally
Thanmay Jayakumar
Vladimir Araujo
Raj Dabre
Mohamed Fazli Mohamed Imam
Kumaranage Ravindu Yasas Nagasinghe
Alina Dragonetti
Luis Fernando D'Haro
Oana Ignat
Olivier NIYOMUGISHA
Pranjal A Chitale
Fauzan Farooqui
Alham Fikri Aji
Thamar Solorio
Machine Translation Hallucination Detection for Low and High Resource Languages using Large Language Models
Kenza Benkirane
Laura Gongas
Shahar Pelles
Naomi Fuchs
Joshua Darmon
Pontus Stenetorp
Eduardo Sánchez
Recent advancements in massively multilingual machine translation systems have significantly enhanced translation accuracy; however, even the best performing systems still generate hallucinations, severely impacting user trust. Detecting hallucinations in Machine Translation (MT) remains a critical challenge, particularly since existing methods excel with High-Resource Languages (HRLs) but exhibit substantial limitations when applied to Low-Resource Languages (LRLs). This paper evaluates sentence-level hallucination detection approaches using Large Language Models (LLMs) and semantic similarity within massively multilingual embeddings. Our study spans 16 language directions, covering HRLs and LRLs with diverse scripts. We find that the choice of model is essential for performance. On average, for HRLs, Llama3-70B outperforms the previous state of the art by as much as 0.16 MCC (Matthews Correlation Coefficient). However, for LRLs we observe that Claude Sonnet outperforms other LLMs on average by 0.03 MCC. The key takeaway from our study is that LLMs can achieve performance comparable to, or even better than, previously proposed models, despite not being explicitly trained for any machine translation task. However, their advantage is less significant for LRLs.
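For readers unfamiliar with this setup, the following is a minimal sketch (not the paper's released code) of the embedding-similarity baseline the abstract describes: flag a translation as a likely hallucination when its cosine similarity to the source in a massively multilingual embedding space falls below a threshold, then score the detector with MCC. The choice of LaBSE as the encoder and the 0.5 threshold are assumptions for illustration.

```python
# Illustrative sketch, not the paper's released code: embedding-similarity
# hallucination detection scored with the Matthews Correlation Coefficient.
# LaBSE and the 0.5 threshold are assumptions for illustration.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics import matthews_corrcoef

encoder = SentenceTransformer("sentence-transformers/LaBSE")  # multilingual sentence encoder

def flag_hallucinations(sources, translations, threshold=0.5):
    """Return 1 where a translation looks hallucinated (low source similarity)."""
    src = encoder.encode(sources, normalize_embeddings=True)
    tgt = encoder.encode(translations, normalize_embeddings=True)
    sims = np.sum(src * tgt, axis=1)  # cosine similarity of unit vectors
    return (sims < threshold).astype(int)

# gold = np.array([...])  # 1 where human annotators marked a hallucination
# preds = flag_hallucinations(sources, translations)
# print("MCC:", matthews_corrcoef(gold, preds))
```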
Mitigating Translationese in Low-resource Languages: The Storyboard Approach
Garry Kuwanto
Eno-Abasi Urua
Priscilla A. Amuok
Shamsuddeen Hassan Muhammad
Aremu Anuoluwapo
Verrah Akinyi Otiende
Loice Emma Nanyanga
T. Nyoike
A. D. Akpan
Nsima Ab Udouboh
Idongesit Udeme Archibong
Idara Effiong Moses
Ifeoluwatayo A. Ige
Benjamin A. Ajibade
Olumide Benjamin Awokoya
Idris Abdulmumin
Saminu Mohammad Aliyu
Ruqayya Nasir Iro
Ibrahim Ahmad
Deontae Smith … (see 4 more)
Praise-EL Michaels
Derry Tanti Wijaya
Anietie U Andy
Low-resource languages often face challenges in acquiring high-quality language data due to the reliance on translation-based methods, which can introduce the translationese effect. This phenomenon results in translated sentences that lack fluency and naturalness in the target language. In this paper, we propose a novel approach to data collection that leverages storyboards to elicit more fluent and natural sentences. Our method involves presenting native speakers with visual stimuli in the form of storyboards and collecting their descriptions without direct exposure to the source text. We conducted a comprehensive evaluation comparing our storyboard-based approach with traditional text translation-based methods in terms of accuracy and fluency, using human annotators and quantitative metrics to assess translation quality. The results indicate a preference for text translation in terms of accuracy, while our method demonstrates lower accuracy but better fluency in the target language.
Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects
Orevaoghene Ahia
Aremu Anuoluwapo
Diana Abagyan
Hila Gonen
Daud Abolade
Noah A. Smith
Yulia Tsvetkov
Yoruba, an African language with roughly 47 million speakers, encompasses a continuum with several dialects. Recent efforts to develop NLP technologies for African languages have focused on their standard dialects, resulting in disparities for dialects and varieties for which there are little to no resources or tools. We take steps towards bridging this gap by introducing a new high-quality parallel text and speech corpus, YORULECT, spanning three domains and four regional Yoruba dialects. To develop this corpus, we engaged native speakers, traveling to communities where these dialects are spoken, to collect text and speech data. Using our newly created corpus, we conducted extensive experiments on (text) machine translation, automatic speech recognition, and speech-to-text translation. Our results reveal substantial performance disparities between standard Yoruba and the other dialects across all tasks. However, we also show that with dialect-adaptive finetuning, we are able to narrow this gap. We believe our dataset and experimental analysis will contribute greatly to developing NLP tools for Yoruba and its dialects, and potentially for other African languages, by improving our understanding of existing challenges and offering a high-quality dataset for further development. We will release the YORULECT dataset and models publicly under an open license.
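As a rough illustration of what dialect-adaptive finetuning for the machine translation task could look like with off-the-shelf tooling, here is a short sketch; the NLLB checkpoint, hyperparameters, and toy data are illustrative assumptions, not the paper's actual training setup.

```python
# Rough sketch of dialect-adaptive MT finetuning with Hugging Face tooling.
# The checkpoint, hyperparameters, and toy data are illustrative assumptions.
from datasets import Dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

checkpoint = "facebook/nllb-200-distilled-600M"  # assumed multilingual MT base model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# Toy stand-in for parallel standard-Yoruba -> regional-dialect data.
pairs = Dataset.from_dict({
    "src": ["example sentence in standard Yoruba"],
    "tgt": ["the same sentence in a regional dialect"],
})

def preprocess(batch):
    return tokenizer(batch["src"], text_target=batch["tgt"],
                     truncation=True, max_length=128)

tokenized = pairs.map(preprocess, batched=True, remove_columns=["src", "tgt"])

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="dialect-adapted",
                                  per_device_train_batch_size=8,
                                  num_train_epochs=3),
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()  # continue training the base MT model on the dialect data
```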
The Responsible Foundation Model Development Cheatsheet: A Review of Tools & Resources
Shayne Longpre
Stella Biderman
Alon Albalak
Hailey Schoelkopf
Daniel McDuff
Sayash Kapoor
Kevin Klyman
Kyle Lo
Gabriel Ilharco
Nay San
Maribeth Rauh
Aviya Skowron
Bertie Vidgen
Laura Weidinger
Arvind Narayanan
Victor Sanh
Percy Liang
Rishi Bommasani
Yacine Jernite
Luca Soldaini
Foundation model development attracts a rapidly expanding body of contributors, scientists, and applications. To help shape responsible development practices, we introduce the Foundation Model Development Cheatsheet: a growing collection of 250+ tools and resources spanning text, vision, and speech modalities. We draw on a large body of prior work to survey resources (e.g. software, documentation, frameworks, guides, and practical tools) that support informed data selection, processing, and understanding; precise and limitation-aware artifact documentation; efficient model training; advance awareness of the environmental impact from training; careful model evaluation of capabilities, risks, and claims; as well as responsible model release, licensing, and deployment practices. We hope this curated collection of resources helps guide more responsible development. The process of curating this list enabled us to review the AI development ecosystem, revealing which tools are critically missing, misused, or over-used in existing practices. We find that (i) tools for data sourcing, model evaluation, and monitoring are critically under-serving ethical and real-world needs, (ii) evaluations for model safety, capabilities, and environmental impact all lack reproducibility and transparency, (iii) text, and particularly English-centric, analyses continue to dominate over multilingual and multi-modal analyses, and (iv) evaluation of systems, rather than just models, is needed so that capabilities and impact are assessed in context.
MINERS: Multilingual Language Models as Semantic Retrievers
Genta Indra Winata
Ruochen Zhang
Words have been represented in a high-dimensional vector space that encodes their semantic similarities, enabling downstream applications such as retrieving synonyms, antonyms, and relevant contexts. However, despite recent advances in multilingual language models (LMs), the effectiveness of these models' representations in semantic retrieval contexts has not been comprehensively explored. To fill this gap, this paper introduces MINERS, a benchmark designed to evaluate the ability of multilingual LMs in semantic retrieval tasks, including bitext mining and classification via retrieval-augmented contexts. We create a comprehensive framework to assess the robustness of LMs in retrieving samples across over 200 diverse languages, including extremely low-resource languages, in challenging cross-lingual and code-switching settings. Our results demonstrate that solely retrieving semantically similar embeddings yields performance competitive with state-of-the-art approaches, without requiring any fine-tuning.
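To make the retrieval setup concrete, the following is a minimal sketch of embedding-based bitext mining in the spirit the abstract describes. This is not the MINERS codebase; LaBSE as the encoder and exact nearest-neighbour search are assumptions for illustration.

```python
# Minimal sketch of retrieval-based bitext mining in the spirit of MINERS.
# LaBSE and exact nearest-neighbour search are assumptions for illustration.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/LaBSE")

def mine_bitext(src_sents, tgt_sents):
    """For each source sentence, retrieve the most similar target sentence."""
    src = model.encode(src_sents, normalize_embeddings=True)
    tgt = model.encode(tgt_sents, normalize_embeddings=True)
    sims = src @ tgt.T                # cosine similarity matrix (unit vectors)
    best = sims.argmax(axis=1)        # nearest neighbour per source sentence
    return [(s, tgt_sents[j]) for s, j in zip(src_sents, best)]

pairs = mine_bitext(["Bonjour le monde"], ["Good night", "Hello world"])
print(pairs)  # expected to pair the French sentence with "Hello world"
```

No fine-tuning is involved: the only operations are encoding and nearest-neighbour lookup, which is the point the abstract's final sentence makes.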