Rahul Aralikatte

Vārta: A Large-Scale Headline-Generation Dataset for Indic Languages

Rahul Aralikatte

Ziling Cheng

Sumanth Doddapaneni

We present Vārta, a large-scale multilingual dataset for headline generation in Indic languages. This dataset includes 41.8 million news articles in 14 different Indic languages (and English), which come from a variety of high-quality sources. To the best of our knowledge, this is the largest collection of curated articles for Indic languages currently available. We use the data collected in a series of experiments to answer important questions related to Indic NLP and multilinguality research in general. We show that the dataset is challenging even for state-of-the-art abstractive models and that they perform only slightly better than extractive baselines. Owing to its size, we also show that the dataset can be used to pretrain strong language models that outperform competitive baselines in both NLU and NLG benchmarks.

2023-05-10

arXiv (preprint)

doi.org

arxiv.org

Findings of the 1st Shared Task on Multi-lingual Multi-task Information Retrieval at MRL 2023

Francesco Tinner

David Ifeoluwa Adelani

Chris Emezue

Mammad Hajili

Omer Goldman

Muhammad Farid Adilazuarda

Muhammad Dehan Al Kautsar

Aziza Mirsaidova

Müge Kural

Dylan Massey

Chiamaka Ijeoma Chukwuneke

Chinedu Emmanuel Mbonu

Damilola Oluwaseun Oloyede

Kayode Olaleye

Jonathan Atala

Benjamin A. Ajibade

Saksham Bassi

Rahul Aralikatte

Najoung Kim

Duygu Ataman

Large language models (LLMs) excel in language understanding and generation, especially in English which has ample public benchmarks for various natural language processing (NLP) tasks. Nevertheless, their reliability across different languages and domains remains uncertain. Our new shared task introduces a novel benchmark to assess the ability of multilingual LLMs to comprehend and produce language under sparse settings, particularly in scenarios with under-resourced languages, with an emphasis on the ability to capture logical, factual, or causal relationships within lengthy text contexts. The shared task consists of two sub-tasks crucial to information retrieval: Named Entity Recognition (NER) and Reading Comprehension (RC), in 7 data-scarce languages: Azerbaijani, Igbo, Indonesian, Swiss German, Turkish, Uzbek and Yorùbá, which previously lacked annotated resources in information retrieval tasks. Our evaluation of leading LLMs reveals that, despite their competitive performance, they still have notable weaknesses such as producing output in the non-target language or providing counterfactual information that cannot be inferred from the context. As more advanced models emerge, the benchmark will remain essential for supporting fairness and applicability in information retrieval systems.

2023-01-01

MRL (published)

doi.org

Minimax and Neyman–Pearson Meta-Learning for Outlier Languages

Edoardo Ponti

Rahul Aralikatte

Disha Shrivastava

Siva Reddy

Anders Søgaard

Model-agnostic meta-learning (MAML) has been recently put forth as a strategy to learn resource-poor languages in a sample-efficient fashion. Nevertheless, the properties of these languages are often not well represented by those available during training. Hence, we argue that the i.i.d. assumption ingrained in MAML makes it ill-suited for cross-lingual NLP. In fact, under a decision-theoretic framework, MAML can be interpreted as minimising the expected risk across training languages (with a uniform prior), which is known as Bayes criterion. To increase its robustness to outlier languages, we create two variants of MAML based on alternative criteria: Minimax MAML reduces the maximum risk across languages, while Neyman–Pearson MAML constrains the risk in each language to a maximum threshold. Both criteria constitute fully differentiable two-player games. In light of this, we propose a new adaptive optimiser solving for a local approximation to their Nash equilibrium. We evaluate both model variants on two popular NLP tasks, part-of-speech tagging and question answering. We report gains for their average and minimum performance across low-resource languages in zero- and few-shot settings, compared to joint multisource transfer and vanilla MAML. The code for our experiments is available at https://github.com/rahular/robust-maml.

2021-08-01

Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021 (published)

doi.org

arxiv.org
