Salah Zaiem

Open-Source Conversational AI with SpeechBrain 1.0

Mirco Ravanelli

Titouan Parcollet

Adel Moumen

Sylvain de Langen

Yingzhi Wang

Zeyu Zhao

Shucong Zhang

Georgios Karakasidis

Sung-Lin Yeh

Pierre Champion

Aku Rouhe

Rudolf Braun … (voir 13 de plus)

Florian Mai

Juan Zuluaga-Gomez

Seyed Mahed Mousavi

Andreas Nautsch

Ha Nguyen

Xuechen Liu

Sangeet Sagar

Jarod Duret

Salima Mdhaffar

Gaëlle Laperrière

Mickael Rouvier

Renato De Mori

Yannick Estève

SpeechBrain is an open-source Conversational AI toolkit based on PyTorch, focused particularly on speech processing tasks such as speech rec… (voir plus)ognition, speech enhancement, speaker recognition, text-to-speech, and much more. It promotes transparency and replicability by releasing both the pre-trained models and the complete "recipes" of code and algorithms required for training them. This paper presents SpeechBrain 1.0, a significant milestone in the evolution of the toolkit, which now has over 200 recipes for speech, audio, and language processing tasks, and more than 100 models available on Hugging Face. SpeechBrain 1.0 introduces new technologies to support diverse learning modalities, Large Language Model (LLM) integration, and advanced decoding strategies, along with novel models, tasks, and modalities. It also includes a new benchmark repository, offering researchers a unified platform for evaluating models across diverse tasks.

2024-06-28

arXiv (prépublication)

doi.org

arxiv.org

CL-MASR: A Continual Learning Benchmark for Multilingual ASR

Luca Della Libera

Pooneh Mousavi

Salah Zaiem

Yusuf Cem Sübakan

Mirco Ravanaelli

2023-12-31

IEEE ACM Trans. Audio Speech Lang. Process. (publié)

doi.org

arxiv.org

How Should We Extract Discrete Audio Tokens from Self-Supervised Models?

Jarod Duret

Yusuf Cem Sübakan

Mirco Ravanaelli

2023-12-31

INTERSPEECH (publié)

doi.org

arxiv.org

TARIC-SLU: A Tunisian Benchmark Dataset for Spoken Language Understanding

Salima Mdhaffar

Fethi Bougares

Renato De Mori

Salah Zaiem

Mirco Ravanaelli

Yannick Estève

In recent years, there has been a significant increase in interest in developing Spoken Language Understanding (SLU) systems. SLU involves e… (voir plus)xtracting a list of semantic information from the speech signal. A major issue for SLU systems is the lack of sufficient amount of bi-modal (audio and textual semantic annotation) training data. Existing SLU resources are mainly available in high-resource languages such as English, Mandarin and French. However, one of the current challenges concerning low-resourced languages is data collection and annotation. In this work, we present a new freely available corpus, named TARIC-SLU, composed of railway transport conversations in Tunisian dialect that is continuously annotated in dialogue acts and slots. We describe the semantic model of the dataset, the data and experiments conducted to build ASR-based and SLU-based baseline models. To facilitate its use, a complete recipe, including data preparation, training and evaluation scripts, has been built and will be integrated to SpeechBrain, a popular open-source conversational AI toolkit based on PyTorch.

2023-12-31

International Conference on Language Resources and Evaluation (publié)

dblp.uni-trier.de

Speech Self-Supervised Representation Benchmarking: Are We Doing it Right?

Salah Zaiem

Youcef Kemiche

Titouan Parcollet

Slim Essid

Mirco Ravanelli

Self-supervised learning (SSL) has recently allowed leveraging large datasets of unlabeled speech signals to reach impressive performance on… (voir plus) speech tasks using only small amounts of annotated data. The high number of proposed approaches fostered the need and rise of extended benchmarks that evaluate their performance on a set of downstream tasks exploring various aspects of the speech signal. However, and while the number of considered tasks has been growing, most rely upon a single decoding architecture that maps the frozen SSL representations to the downstream labels. This work investigates the robustness of such benchmarking results to changes in the decoder architecture. Interestingly, it appears that varying the architecture of the downstream decoder leads to significant variations in the leaderboards of most tasks. Concerningly, our study reveals that benchmarking using limited decoders may cause a counterproductive increase in the sizes of the developed SSL models.

2023-08-19

INTERSPEECH 2023 (publié)

doi.org

arxiv.org

Fine-Tuning Strategies for Faster Inference Using Speech Self-Supervised Models: A Comparative Study

Salah Zaiem

Robin Algayres

Titouan Parcollet

Slim Essid

Mirco Ravanelli

Self-supervised learning (SSL) has allowed substantial progress in Automatic Speech Recognition (ASR) performance in low-resource settings. … (voir plus)In this context, it has been demonstrated that larger self-supervised feature extractors are crucial for achieving lower downstream ASR error rates. Thus, better performance might be sanctioned with longer inferences. This article explores different approaches that may be deployed during the fine-tuning to reduce the computations needed in the SSL encoder, leading to faster inferences. We adapt a number of existing techniques to common ASR settings and benchmark them, displaying performance drops and gains in inference times. Interestingly, we found that given enough downstream data, a simple downsampling of the input sequences outperforms the other methods with both low performance drops and high computational savings, reducing computations by 61.3% with an WER increase of only 0.81. Finally, we analyze the robustness of the comparison to changes in dataset conditions, revealing sensitivity to dataset size.

2023-06-03

2023 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW) (publié)

doi.org

arxiv.org

Speech Self-Supervised Representations Benchmarking: a Case for Larger Probing Heads

Salah Zaiem

Youcef Kemiche

Titouan Parcollet

Slim Essid

Mirco Ravanelli

Self-supervised learning (SSL) leverages large datasets of unlabeled speech to reach impressive performance with reduced amounts of annotate… (voir plus)d data. The high number of proposed approaches fostered the emergence of comprehensive benchmarks that evaluate their performance on a set of downstream tasks exploring various aspects of the speech signal. However, while the number of considered tasks has been growing, most proposals rely upon a single downstream architecture that maps the frozen SSL representations to the task labels. This study examines how benchmarking results are affected by changes in the probing head architecture. Interestingly, we found that altering the downstream architecture structure leads to significant fluctuations in the performance ranking of the evaluated models. Against common practices in speech SSL benchmarking, we evaluate larger-capacity probing heads, showing their impact on performance, inference costs, generalization and multi-level feature exploitation.

2022-12-31

arXiv (prépublication)

doi.org

arxiv.org

Publications du Fellowship en politiques de l'IA

La plateforme Mila Ventures

Boussole des politiques en IA

Salah Zaiem

Publications

Publications du Fellowship en politiques de l'IA

La plateforme Mila Ventures

Boussole des politiques en IA

Mots-clés populaires:

Salah Zaiem

Publications