Salah Zaiem

Alumni

Publications

Open-Source Conversational AI with SpeechBrain 1.0
Adel Moumen
Sylvain de Langen
Peter William VanHarn Plantinga
Yingzhi Wang
Zeyu Zhao
Shucong Zhang
Georgios Karakasidis
Pierre Champion
Aku Rouhe
Rudolf Braun
Florian Mai
Juan Pablo Zuluaga
Seyed Mahed Mousavi
Andreas Nautsch
Xuechen Liu
Sangeet Sagar
Jarod Duret
Salima Mdhaffar
G. Laperriere
Yannick Estève
SpeechBrain is an open-source Conversational AI toolkit based on PyTorch, focused particularly on speech processing tasks such as speech recognition, speech enhancement, speaker recognition, text-to-speech, and much more. It promotes transparency and replicability by releasing both the pre-trained models and the complete "recipes" of code and algorithms required for training them. This paper presents SpeechBrain 1.0, a significant milestone in the evolution of the toolkit, which now has over 200 recipes for speech, audio, and language processing tasks, and more than 100 models available on Hugging Face. SpeechBrain 1.0 introduces new technologies to support diverse learning modalities, Large Language Model (LLM) integration, and advanced decoding strategies, along with novel models, tasks, and modalities. It also includes a new benchmark repository, offering researchers a unified platform for evaluating models across diverse tasks.
How Should We Extract Discrete Audio Tokens from Self-Supervised Models?
Discrete audio tokens have recently gained attention for their potential to bridge the gap between audio and language processing. Ideal audio tokens must preserve content, paralinguistic elements, speaker identity, and many other audio details. Current audio tokenization methods fall into two categories: semantic tokens, acquired through quantization of Self-Supervised Learning (SSL) models, and neural compression-based tokens (codecs). Although previous studies have benchmarked codec models to identify optimal configurations, the ideal setup for quantizing pretrained SSL models remains unclear. This paper explores the optimal configuration of semantic tokens across discriminative and generative tasks. We propose a scalable solution to train a universal vocoder across multiple SSL layers. Furthermore, an attention mechanism is employed to identify task-specific influential layers, enhancing the adaptability and performance of semantic tokens in diverse audio applications.
CL-MASR: A Continual Learning Benchmark for Multilingual ASR
TARIC-SLU: A Tunisian Benchmark Dataset for Spoken Language Understanding
Salima Mdhaffar
Fethi Bougares
Yannick Estève
In recent years, there has been a significant increase in interest in developing Spoken Language Understanding (SLU) systems. SLU involves extracting a list of semantic information from the speech signal. A major issue for SLU systems is the lack of a sufficient amount of bi-modal (audio and textual semantic annotation) training data. Existing SLU resources are mainly available in high-resource languages such as English, Mandarin and French. However, one of the current challenges concerning low-resourced languages is data collection and annotation. In this work, we present a new freely available corpus, named TARIC-SLU, composed of railway transport conversations in Tunisian dialect that is continuously annotated in dialogue acts and slots. We describe the semantic model of the dataset, the data and experiments conducted to build ASR-based and SLU-based baseline models. To facilitate its use, a complete recipe, including data preparation, training and evaluation scripts, has been built and will be integrated into SpeechBrain, a popular open-source conversational AI toolkit based on PyTorch.
Speech Self-Supervised Representations Benchmarking: a Case for Larger Probing Heads
Youcef Kemiche
Slim Essid
Speech Self-Supervised Representation Benchmarking: Are We Doing it Right?
Youcef Kemiche
Slim Essid
Self-supervised learning (SSL) has recently allowed leveraging large datasets of unlabeled speech signals to reach impressive performance on speech tasks using only small amounts of annotated data. The high number of proposed approaches fostered the need and rise of extended benchmarks that evaluate their performance on a set of downstream tasks exploring various aspects of the speech signal. However, and while the number of considered tasks has been growing, most rely upon a single decoding architecture that maps the frozen SSL representations to the downstream labels. This work investigates the robustness of such benchmarking results to changes in the decoder architecture. Interestingly, it appears that varying the architecture of the downstream decoder leads to significant variations in the leaderboards of most tasks. Concerningly, our study reveals that benchmarking using limited decoders may cause a counterproductive increase in the sizes of the developed SSL models.
Fine-Tuning Strategies for Faster Inference Using Speech Self-Supervised Models: A Comparative Study
Robin Algayres
Slim Essid
Self-supervised learning (SSL) has allowed substantial progress in Automatic Speech Recognition (ASR) performance in low-resource settings. In this context, it has been demonstrated that larger self-supervised feature extractors are crucial for achieving lower downstream ASR error rates. Thus, better performance might be sanctioned with longer inferences. This article explores different approaches that may be deployed during fine-tuning to reduce the computations needed in the SSL encoder, leading to faster inferences. We adapt a number of existing techniques to common ASR settings and benchmark them, displaying performance drops and gains in inference times. Interestingly, we found that given enough downstream data, a simple downsampling of the input sequences outperforms the other methods with both low performance drops and high computational savings, reducing computations by 61.3% with a WER increase of only 0.81. Finally, we analyze the robustness of the comparison to changes in dataset conditions, revealing sensitivity to dataset size.