
David Ifeoluwa Adelani

Core Academic Member
Canada CIFAR AI Chair
McGill University

Biography

David Adelani is an incoming Assistant Professor in Computer Science and Fighting Inequities at McGill University and a Core Academic Member at Mila – Quebec Artificial Intelligence Institute. His research focuses on multilingual natural language processing, with a particular emphasis on under-resourced languages.

Publications

Cross-lingual Open-Retrieval Question Answering for African Languages
Odunayo Ogundepo
Tajuddeen Gwadabe
Clara E. Rivera
Jonathan H. Clark
Sebastian Ruder
Bonaventure F. P. Dossou
Abdou Aziz DIOP
Claytone Sikasote
Gilles Q. Hacheme
Happy Buzaaba
Ignatius Majesty Ezeani
Rooweither Mabuya
Salomey Osei
Chris Emezue
Albert Njoroge Kahira
Shamsuddeen Hassan Muhammad
Akintunde Oladipo
Abraham Toluwase Owodunni
Atnafu Lambebo Tonja
Iyanuoluwa Shode
Akari Asai
Aremu Anuoluwapo
Ayodele Awokoya
Bernard Opoku
Chiamaka Ijeoma Chukwuneke
Christine Mwase
Clemencia Siro
Stephen Arthur
Tunde Oluwaseyi Ajayi
V. Otiende
Andre Niyongabo Rubungo
B. Sinkala
Daniel A. Ajisafe
Emeka Onwuegbuzia
Falalu Lawan
Ibrahim Ahmad
Jesujoba Alabi
CHINEDU EMMANUEL MBONU
Mofetoluwa Adeyemi
Mofya Phiri
Orevaoghene Ahia
Ruqayya Nasir Iro
Sonia Adhiambo
XTREME-UP: A User-Centric Scarce-Data Benchmark for Under-Represented Languages
Sebastian Ruder
Jonathan H. Clark
Alexander Gutkin
Mihir Kale
Min Ma
Massimo Nicosia
Shruti Rijhwani
Parker Riley
Jean Michel Amath Sarr
Xinyi Wang
John Frederick Wieting
Nitish Gupta
Anna Katanova
Christo Kirov
Dana L Dickinson
Brian Roark
Bidisha Samanta
Connie Tao
Vera Axelrod
Isaac Rayburn Caswell
Colin Cherry
Dan Garrette
Reeve Ingle
Melvin Johnson
Dmitry Panteleev
Partha Talukdar
Data scarcity is a crucial issue for the development of highly multilingual NLP systems. Yet for many under-represented languages (ULs) -- languages for which NLP research is particularly far behind in meeting user needs -- it is feasible to annotate small amounts of data. Motivated by this, we propose XTREME-UP, a benchmark defined by: its focus on the scarce-data scenario rather than zero-shot; its focus on user-centric tasks -- tasks with broad adoption by speakers of high-resource languages; and its focus on under-represented languages where this scarce-data scenario tends to be most realistic. XTREME-UP evaluates the capabilities of language models across 88 under-represented languages over 9 key user-centric technologies including ASR, OCR, MT, and information access tasks that are of general utility. We create new datasets for OCR, autocomplete, semantic parsing, and transliteration, and build on and refine existing datasets for other tasks. XTREME-UP provides methodology for evaluating many modeling scenarios including text-only, multi-modal (vision, audio, and text), supervised parameter tuning, and in-context learning. We evaluate commonly used models on the benchmark. We release all code and scripts to train and evaluate models.
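The in-context learning setting described above can be illustrated with a minimal sketch: a handful of annotated examples for one user-centric task are assembled into a few-shot prompt and predictions are scored with exact match. The example data, prompt format, and predict() stub are illustrative assumptions, not part of the XTREME-UP release.

```python
# Minimal sketch of a scarce-data, in-context evaluation loop in the spirit of
# XTREME-UP. The examples and the predict() stub are illustrative placeholders;
# the real benchmark ships its own data, tasks, and evaluation scripts.

# A few annotated (input, target) pairs play the role of the scarce training data.
few_shot = [
    ("input 1", "target 1"),
    ("input 2", "target 2"),
]
test_set = [("input 3", "target 3"), ("input 4", "target 4")]

def build_prompt(examples, query):
    """Assemble a few-shot prompt from the scarce annotated examples."""
    lines = [f"Input: {x}\nOutput: {y}" for x, y in examples]
    lines.append(f"Input: {query}\nOutput:")
    return "\n\n".join(lines)

def predict(prompt: str) -> str:
    """Placeholder for a call to any language model (API or local)."""
    return ""  # a real model call would go here

# Exact-match accuracy over the test split.
correct = sum(predict(build_prompt(few_shot, x)).strip() == y for x, y in test_set)
print(f"exact match: {correct / len(test_set):.2f}")
```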
AfriMTE and AfriCOMET: Enhancing COMET to Embrace Under-resourced African Languages
Jiayi Wang
Sweta Agrawal
Marek Masiak
Ricardo Rei
Eleftheria Briakou
Marine Carpuat
Xuanli He
Sofia Bourhim
Andiswa Bukula
Muhidin A. Mohamed
Temitayo Olatoye
Tosin Adewumi
Hamam Mokayede
Christine Mwase
Wangui Kimotho
Foutse Yuehgoh
Aremu Anuoluwapo
Jessica Ojo
Shamsuddeen Hassan Muhammad
Salomey Osei
Abdul-Hakeem Omotayo
Chiamaka Ijeoma Chukwuneke
Perez Ogayo
Oumaima Hourrane
Salma El Anigri
Lolwethu Ndolela
Thabiso Mangwana
Shafie Abdi Mohamed
Ayinde Hassan
Oluwabusayo Olufunke Awoyomi
Lama Alkhaled
sana Sabah al-azzawi
Naome A. Etori
Millicent A. Ochieng
Clemencia Siro
Samuel Njoroge
Eric Muchiri
Wangari Kimotho
Lyse Naomi Wamba Momo
Daud Abolade
Simbiat Ajao
Iyanuoluwa Shode
Ricky Macharm
Ruqayya Nasir Iro
Saheed Salahudeen Abdullahi
Stephen E. Moore
Bernard Opoku
Zainab Akinjobi
Abeeb Afolabi
Nnaemeka Casmir Obiefuna
Onyekachi Ogbu
Sam Brian
Verrah Akinyi Otiende
CHINEDU EMMANUEL MBONU
Toadoum Sari Sakayo
Yao Lu
Pontus Stenetorp
Despite the recent progress on scaling multilingual machine translation (MT) to several under-resourced African languages, accurately measuring this progress remains challenging, since evaluation is often performed on n-gram matching metrics such as BLEU, which typically show a weaker correlation with human judgments. Learned metrics such as COMET have higher correlation; however, the lack of evaluation data with human ratings for under-resourced languages, the complexity of annotation guidelines like Multidimensional Quality Metrics (MQM), and the limited language coverage of multilingual encoders have hampered their applicability to African languages. In this paper, we address these challenges by creating high-quality human evaluation data with simplified MQM guidelines for error detection and direct assessment (DA) scoring for 13 typologically diverse African languages. Furthermore, we develop AfriCOMET: COMET evaluation metrics for African languages, by leveraging DA data from well-resourced languages and an African-centric multilingual encoder (AfroXLM-R) to create state-of-the-art MT evaluation metrics for African languages with respect to Spearman-rank correlation with human judgments (0.441).
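The headline number above is a Spearman rank correlation between metric scores and human judgments. Below is a minimal sketch of that comparison using scipy; the score lists are illustrative placeholders, and the commented checkpoint name is only an assumption about how an AfriCOMET-style model might be loaded with the unbabel-comet library.

```python
# Minimal sketch: correlating automatic MT metric scores with human DA scores.
# The score lists below are illustrative placeholders, not data from the paper.
from scipy.stats import spearmanr

human_da_scores = [78.0, 42.5, 90.0, 61.0, 55.5]   # human direct-assessment ratings
metric_scores   = [0.81, 0.40, 0.88, 0.66, 0.52]   # scores from a learned metric

rho, p_value = spearmanr(human_da_scores, metric_scores)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")

# A COMET-style learned metric would typically be loaded with the unbabel-comet
# package, e.g. (checkpoint name is an assumption, shown for illustration only):
#   from comet import download_model, load_from_checkpoint
#   model = load_from_checkpoint(download_model("masakhane/africomet-mtl"))
#   metric_scores = model.predict([{"src": ..., "mt": ..., "ref": ...}]).scores
```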
How good are Large Language Models on African Languages?
Jessica Ojo
Kelechi Ogueji
Pontus Stenetorp
Better Quality Pre-training Data and T5 Models for African Languages
Akintunde Oladipo
Mofetoluwa Adeyemi
Orevaoghene Ahia
Abraham Toluwase Owodunni
Odunayo Ogundepo
Jimmy Lin
In this study, we highlight the importance of enhancing the quality of pretraining data in multilingual language models. Existing web crawls have demonstrated quality issues, particularly in the context of low-resource languages. Consequently, we introduce a new multilingual pretraining corpus for …
Improving Language Plasticity via Pretraining with Active Forgetting
Yihong Chen
Kelly Marchisio
Roberta Raileanu
Pontus Stenetorp
Sebastian Riedel
Mikel Artetxe
Pretrained language models (PLMs) are today the primary model for natural language processing. Despite their impressive downstream performance, it can be difficult to apply PLMs to new languages, a barrier to making their capabilities universally accessible. While prior work has shown it possible to address this issue by learning a new embedding layer for the new language, doing so is both data and compute inefficient. We propose to use an active forgetting mechanism during pretraining, as a simple way of creating PLMs that can quickly adapt to new languages. Concretely, by resetting the embedding layer every K updates during pretraining, we encourage the PLM to improve its ability to learn new embeddings within a limited number of updates, similar to a meta-learning effect. Experiments with RoBERTa show that models pretrained with our forgetting mechanism not only demonstrate faster convergence during language adaptation, but also outperform standard ones in a low-data regime, particularly for languages that are distant from English. Code will be available at https://github.com/facebookresearch/language-model-plasticity.
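A minimal sketch of the periodic embedding reset described above is shown below, using a toy PyTorch classifier rather than RoBERTa; the model, random data, and reset interval K are illustrative assumptions rather than the paper's actual setup.

```python
# Minimal sketch of pretraining with "active forgetting": the token-embedding
# layer is re-initialized every K updates while the rest of the network keeps
# its weights. Toy model and random data for illustration only.
import torch
import torch.nn as nn

vocab_size, dim, K = 100, 32, 50

embedding = nn.Embedding(vocab_size, dim)
body = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, vocab_size))
optimizer = torch.optim.Adam(
    list(embedding.parameters()) + list(body.parameters()), lr=1e-3
)
loss_fn = nn.CrossEntropyLoss()

for step in range(1, 501):
    tokens = torch.randint(0, vocab_size, (16, 8))    # random toy "sentences"
    targets = torch.randint(0, vocab_size, (16,))     # random toy targets
    logits = body(embedding(tokens).mean(dim=1))      # mean-pooled toy LM head
    loss = loss_fn(logits, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if step % K == 0:
        # Active forgetting: re-initialize only the embedding layer; the body
        # keeps its weights, so it is pushed to relearn new embeddings quickly.
        nn.init.normal_(embedding.weight, mean=0.0, std=0.02)
```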
YORC: Yoruba Reading Comprehension dataset
Aremu Anuoluwapo
Jesujoba Oluwadara Alabi
In this paper, we create YORC: a new multi-choice Yoruba Reading Comprehension dataset that is based on Yoruba high-school reading comprehension examinations. We provide baseline results by performing cross-lingual transfer using the existing English RACE dataset based on a pre-trained encoder-only model. Additionally, we provide results by prompting large language models (LLMs) like GPT-4.
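The LLM-prompting baseline mentioned above amounts to formatting each passage, question, and option set as a single prompt. A minimal sketch of such a prompt builder is given below; the field names and example strings are illustrative assumptions, not the actual YORC schema.

```python
# Minimal sketch of a multiple-choice reading-comprehension prompt, in the
# spirit of the LLM baseline described above. Field names are illustrative;
# the real YORC examples have their own schema.

def build_mc_prompt(passage: str, question: str, options: list[str]) -> str:
    letters = "ABCD"
    option_lines = "\n".join(f"{letters[i]}. {opt}" for i, opt in enumerate(options))
    return (
        f"Passage:\n{passage}\n\n"
        f"Question: {question}\n"
        f"{option_lines}\n"
        "Answer with the letter of the correct option."
    )

prompt = build_mc_prompt(
    passage="<Yorùbá passage here>",
    question="<question here>",
    options=["option 1", "option 2", "option 3", "option 4"],
)
print(prompt)  # this string would be sent to an LLM such as GPT-4
```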
Consultative engagement of stakeholders toward a roadmap for African language technologies
Kathleen Siminyu
Jade Abbott
Kọ́lá Túbọ̀sún
Aremu Anuoluwapo
Blessing Kudzaishe Sibanda
Kofi Yeboah
Masabata Mokgesi-Selinga
Frederick R. Apina
Angela Thandizwe Mthembu
Arshath Ramkilowan
Babatunde Oladimeji
NollySenti: Leveraging Transfer Learning and Machine Translation for Nigerian Movie Sentiment Classification
Iyanuoluwa Shode
Jing Peng
Anna Feldman
Africa has over 2000 indigenous languages, but they are under-represented in NLP research due to a lack of datasets. In recent years, there has been progress in developing labelled corpora for African languages. However, they are often available in a single domain and may not generalize to other domains. In this paper, we focus on the task of sentiment classification for cross-domain adaptation. We create a new dataset, Nollywood movie reviews, for five languages widely spoken in Nigeria (English, Hausa, Igbo, Nigerian Pidgin, and Yoruba). We provide an extensive empirical evaluation using classical machine learning methods and pre-trained language models. By leveraging transfer learning, we compare the performance of cross-domain adaptation from the Twitter domain and cross-lingual adaptation from the English language. Our evaluation shows that transfer from English in the same target domain leads to more than 5% improvement in accuracy compared to transfer from Twitter in the same language. To further mitigate the domain difference, we leverage machine translation from English to other Nigerian languages, which leads to a further improvement of 7% over cross-lingual evaluation. While machine translation to low-resource languages is often of low quality, our analysis shows that sentiment-related words are often preserved.
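The cross-domain comparison above (train in one domain, test in another) can be sketched with one of the classical baselines the abstract mentions, a TF-IDF plus logistic-regression classifier; the handful of inline examples are illustrative placeholders, not NollySenti data.

```python
# Minimal sketch of cross-domain sentiment transfer with a classical baseline:
# train on one domain (e.g. tweets), evaluate on another (e.g. movie reviews).
# The tiny inline datasets are placeholders for illustration only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline

source_texts = ["great news today", "terrible service", "so happy", "very bad day"]
source_labels = [1, 0, 1, 0]                      # source domain (e.g. Twitter)
target_texts = ["a wonderful movie", "boring and bad plot"]
target_labels = [1, 0]                            # target domain (movie reviews)

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression(max_iter=1000))
clf.fit(source_texts, source_labels)              # train only on the source domain

preds = clf.predict(target_texts)                 # evaluate on the target domain
print("cross-domain accuracy:", accuracy_score(target_labels, preds))
```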
AfriQA: Cross-lingual Open-Retrieval Question Answering for African Languages
Odunayo Ogundepo
Tajuddeen Gwadabe
Clara E. Rivera
Jonathan H. Clark
Sebastian Ruder
Bonaventure F. P. Dossou
Abdoulahat Diop
Claytone Sikasote
Gilles HACHEME
Happy Buzaaba
Ignatius Ezeani
Rooweither Mabuya
Salomey Osei
Chris Emezue
Albert Kahira
Shamsuddeen Hassan Muhammad
Akintunde Oladipo
Abraham Toluwase Owodunni
Atnafu Lambebo Tonja
Iyanuoluwa Shode
Akari Asai
Tunde Oluwaseyi Ajayi
Clemencia Siro
Stephen Arthur
Mofetoluwa Adeyemi
Orevaoghene Ahia
Aremu Anuoluwapo
Oyinkansola Awosan
Chiamaka Ijeoma Chukwuneke
Bernard Opoku
A. Ayodele
Verrah Akinyi Otiende
Christine Mwase
Boyd Sinkala
Andre Niyongabo Rubungo
Daniel Ajisafe
Emeka Felix Onwuegbuzia
Habib Mbow
Emile Niyomutabazi
Eunice Mukonde
Falalu Lawan
Ibrahim Ahmad
Jesujoba Oluwadara Alabi
Martin Namukombo
Mbonu Chinedu
Mofya Phiri
Neo Putini
Ndumiso Mngoma
Priscilla A. Amuok
Ruqayya Nasir Iro
Sonia Adhiambo
SemEval-2023 Task 12: Sentiment Analysis for African Languages (AfriSenti-SemEval)
Shamsuddeen Hassan Muhammad
Idris Abdulmumin
Seid Muhie Yimam
Ibrahim Ahmad
Nedjma OUSIDHOUM
Abinew Ayele
Saif Mohammad
Meriem Beloucif
ε kú <mask>: Integrating Yorùbá cultural greetings into machine translation
Idris Akinade
Jesujoba Oluwadara Alabi
Clement Odoje
Dietrich Klakow
This paper investigates the performance of massively multilingual neural machine translation (NMT) systems in translating Yorùbá greetings (kú <mask>), which are a big part of Yorùbá language and culture, into English. To evaluate these models, we present IkiniYorùbá, a Yorùbá-English translation dataset containing some Yorùbá greetings and sample use cases. We analysed the performance of different multilingual NMT systems, including Google and NLLB, and show that these models struggle to accurately translate Yorùbá greetings into English. In addition, we trained a Yorùbá-English model by fine-tuning an existing NMT model on the training split of IkiniYorùbá, and this achieved better performance when compared to the pre-trained multilingual NMT models, even though the latter were trained on a large volume of data.
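A minimal sketch of querying one of the multilingual NMT systems evaluated above (NLLB, via the Hugging Face transformers library) is shown below; the checkpoint name and language codes follow NLLB's published conventions, and the input sentence is a placeholder rather than an IkiniYorùbá item.

```python
# Minimal sketch: translating a Yorùbá sentence to English with an off-the-shelf
# NLLB checkpoint via Hugging Face transformers. The input sentence is a
# placeholder; IkiniYorùbá has its own data splits.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "facebook/nllb-200-distilled-600M"   # publicly released NLLB checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="yor_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

text = "<a Yorùbá greeting here>"
inputs = tokenizer(text, return_tensors="pt")
generated = model.generate(
    **inputs,
    # Force English as the target language, per NLLB's language-code convention.
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("eng_Latn"),
    max_new_tokens=64,
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```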