Ivan Vulić

Fleurs-SLU: A Massively Multilingual Benchmark for Spoken Language Understanding

Fabian David Schmidt

Goran Glavaš

Spoken language understanding (SLU) is indispensable for half of all living languages that lack a formal writing system, since these languages cannot pair automatic speech recognition (ASR) with language models to benefit from language technology. Even if low-resource languages possess a writing system, ASR for these languages remains unreliable due to limited bimodal speech and text training data. Better SLU can strengthen the robustness of massively multilingual ASR by leveraging language semantics to disambiguate utterances via context or exploiting semantic similarities across languages. However, the evaluation of multilingual SLU remains limited to shallow tasks such as intent classification or language identification. To address this, we present Fleurs-SLU, a multilingual SLU benchmark that encompasses (i) 692 hours of speech for topical utterance classification in 102 languages and (ii) multiple-choice question answering through listening comprehension spanning 944 hours of speech across 92 languages. We extensively evaluate both end-to-end speech classification models and cascaded systems that combine speech-to-text transcription with subsequent classification by large language models on Fleurs-SLU. Our results show that cascaded systems exhibit greater robustness in multilingual SLU tasks, though speech encoders can achieve competitive performance in topical speech classification when appropriately pre-trained. We further find a strong correlation between robust multilingual ASR, effective speech-to-text translation, and strong multilingual SLU, highlighting the mutual benefits between acoustic and semantic speech representations.

2025-07-07

colmweb.org/COLM/2025/Conference (accepted)

doi.org

openreview.net

Training Plug-and-Play Knowledge Modules with Deep Context Distillation

Lucas Caccia

Alan Ansell

Edoardo Ponti

Ivan Vulić

Alessandro Sordoni

Dynamically integrating new or rapidly evolving information after (Large) Language Model pre-training remains challenging, particularly in low-data scenarios or when dealing with private and specialized documents. In-context learning and retrieval-augmented generation (RAG) face limitations, including their high inference costs and their inability to capture global document information. In this paper, we propose a way of modularizing knowledge by training document-level Knowledge Modules (KMs). KMs are lightweight components implemented as parameter-efficient LoRA modules, which are trained to store information about new documents and can be easily plugged into models on demand. We show that next-token prediction performs poorly as the training objective for KMs. We instead propose Deep Context Distillation: we learn KM parameters so as to simulate the hidden states and logits of a teacher that takes the document in context. Our method outperforms standard next-token prediction and pre-instruction training techniques across two datasets. Finally, we highlight synergies between KMs and retrieval-augmented generation.

2025-07-07

colmweb.org/COLM/2025/Conference (accepted)

openreview.net

Training Plug n' Play Knowledge Modules with Deep Context Distillation

Lucas Caccia

Alan Ansell

Ivan Vulić

Edoardo Ponti

Alessandro Sordoni

Dynamically integrating new or rapidly evolving information after Language Model (LM) pre-training remains challenging, particularly in low-data scenarios or when dealing with private and specialized documents. In-context learning and retrieval-augmented generation (RAG) face limitations, including their high inference costs and their inability to capture global document information. In this paper, we propose a way of modularizing knowledge by training Knowledge Modules (KMs). KMs are lightweight components implemented as parameter-efficient LoRA modules, which are trained to store information about new documents and can be easily plugged into models on demand. We show that next-token prediction performs poorly in training KMs. We instead propose Deep Context Distillation: we learn KM parameters so as to simulate the hidden states and logits of a teacher that takes the document in context. Our method outperforms standard next-token prediction and pre-instruction training techniques across two datasets. Finally, we highlight synergies between KMs and retrieval-augmented generation.

2025-03-05

ICLR.cc/2025/Workshop/MCDC (accepted)

openreview.net

Training Plug-n-Play Knowledge Modules with Deep Context Distillation

Lucas Caccia

Alan Ansell

Edoardo Ponti

Ivan Vulić

Alessandro Sordoni

2025-03-05

ICLR.cc/2025/Workshop/MCDC (accepted)

doi.org

openreview.net

Fleurs-SLU: A Massively Multilingual Benchmark for Spoken Language Understanding

Fabian David Schmidt

Ivan Vulić

Goran Glavaš

David Ifeoluwa Adelani

2025-01-10

ArXiv (preprint)

doi.org

arxiv.org

IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages

Emanuele Bugliarello

Fangyu Liu

Jonas Pfeiffer

Siva Reddy

Desmond Elliott

Edoardo Ponti

Ivan Vulić

Reliable evaluation benchmarks designed for replicability and comprehensiveness have driven progress in machine learning. Due to the lack of a multilingual benchmark, however, vision-and-language research has mostly focused on English-language tasks. To fill this gap, we introduce the Image-Grounded Language Understanding Evaluation benchmark. IGLUE brings together - by both aggregating pre-existing datasets and creating new ones - visual question answering, cross-modal retrieval, grounded reasoning, and grounded entailment tasks across 20 diverse languages. Our benchmark enables the evaluation of multilingual multimodal models for transfer learning, not only in a zero-shot setting, but also in newly defined few-shot learning setups. Based on the evaluation of the available state-of-the-art models, we find that translate-test transfer is superior to zero-shot transfer and that few-shot learning is hard to harness for many tasks. Moreover, downstream performance is partially explained by the amount of available unlabelled textual data for pretraining, and only weakly by the typological distance of target-source languages. We hope to encourage future research efforts in this area by releasing the benchmark to the community.

2022-01-01

ICML (published)

proceedings.mlr.press

arxiv.org

Modelling Latent Translations for Cross-Lingual Transfer

Edoardo Ponti

Julia Kreutzer

Ivan Vulić

Siva Reddy

2021-07-23

ArXiv (preprint)

arxiv.org