Titouan Parcollet

Open-Source Conversational AI with SpeechBrain 1.0

Mirco Ravanelli

Titouan Parcollet

Adel Moumen

Sylvain de Langen

Yingzhi Wang

Zeyu Zhao

Shucong Zhang

Georgios Karakasidis

Sung-Lin Yeh

Pierre Champion

Aku Rouhe

Rudolf Braun

Florian Mai

Juan Zuluaga-Gomez

Seyed Mahed Mousavi

Andreas Nautsch

Ha Nguyen

Xuechen Liu

Sangeet Sagar

Jarod Duret

Salima Mdhaffar

Gaëlle Laperrière

Mickael Rouvier

Renato De Mori

Yannick Estève

SpeechBrain is an open-source Conversational AI toolkit based on PyTorch, focused particularly on speech processing tasks such as speech recognition, speech enhancement, speaker recognition, text-to-speech, and much more. It promotes transparency and replicability by releasing both the pre-trained models and the complete "recipes" of code and algorithms required for training them. This paper presents SpeechBrain 1.0, a significant milestone in the evolution of the toolkit, which now has over 200 recipes for speech, audio, and language processing tasks, and more than 100 models available on Hugging Face. SpeechBrain 1.0 introduces new technologies to support diverse learning modalities, Large Language Model (LLM) integration, and advanced decoding strategies, along with novel models, tasks, and modalities. It also includes a new benchmark repository, offering researchers a unified platform for evaluating models across diverse tasks.

2024-06-28

arXiv (preprint)

doi.org

arxiv.org

Speech Self-Supervised Representation Benchmarking: Are We Doing it Right?

Salah Zaiem

Youcef Kemiche

Titouan Parcollet

Slim Essid

Mirco Ravanelli

Self-supervised learning (SSL) has recently allowed leveraging large datasets of unlabeled speech signals to reach impressive performance on speech tasks using only small amounts of annotated data. The high number of proposed approaches fostered the need and rise of extended benchmarks that evaluate their performance on a set of downstream tasks exploring various aspects of the speech signal. However, while the number of considered tasks has been growing, most rely upon a single decoding architecture that maps the frozen SSL representations to the downstream labels. This work investigates the robustness of such benchmarking results to changes in the decoder architecture. Interestingly, it appears that varying the architecture of the downstream decoder leads to significant variations in the leaderboards of most tasks. Concerningly, our study reveals that benchmarking using limited decoders may cause a counterproductive increase in the sizes of the developed SSL models.

2023-08-19

INTERSPEECH 2023 (published)

doi.org

arxiv.org

Fine-Tuning Strategies for Faster Inference Using Speech Self-Supervised Models: A Comparative Study

Salah Zaiem

Robin Algayres

Titouan Parcollet

Slim Essid

Mirco Ravanelli

Self-supervised learning (SSL) has allowed substantial progress in Automatic Speech Recognition (ASR) performance in low-resource settings. In this context, it has been demonstrated that larger self-supervised feature extractors are crucial for achieving lower downstream ASR error rates. Thus, better performance might be sanctioned with longer inferences. This article explores different approaches that may be deployed during the fine-tuning to reduce the computations needed in the SSL encoder, leading to faster inferences. We adapt a number of existing techniques to common ASR settings and benchmark them, displaying performance drops and gains in inference times. Interestingly, we found that given enough downstream data, a simple downsampling of the input sequences outperforms the other methods with both low performance drops and high computational savings, reducing computations by 61.3% with a WER increase of only 0.81. Finally, we analyze the robustness of the comparison to changes in dataset conditions, revealing sensitivity to dataset size.

2023-06-03

2023 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW) (published)

doi.org

arxiv.org

Speech Self-Supervised Representations Benchmarking: a Case for Larger Probing Heads

Salah Zaiem

Youcef Kemiche

Titouan Parcollet

Slim Essid

Mirco Ravanelli

Self-supervised learning (SSL) leverages large datasets of unlabeled speech to reach impressive performance with reduced amounts of annotated data. The high number of proposed approaches fostered the emergence of comprehensive benchmarks that evaluate their performance on a set of downstream tasks exploring various aspects of the speech signal. However, while the number of considered tasks has been growing, most proposals rely upon a single downstream architecture that maps the frozen SSL representations to the task labels. This study examines how benchmarking results are affected by changes in the probing head architecture. Interestingly, we found that altering the downstream architecture structure leads to significant fluctuations in the performance ranking of the evaluated models. Against common practices in speech SSL benchmarking, we evaluate larger-capacity probing heads, showing their impact on performance, inference costs, generalization and multi-level feature exploitation.

2022-12-31

arXiv (preprint)

doi.org

arxiv.org

SpeechBrain: A General-Purpose Speech Toolkit

Mirco Ravanelli

Titouan Parcollet

Peter Plantinga

Aku Rouhe

Samuele Cornell

Chien-Feng Liao

Elena Rastorgueva

François Grondin

William Aris

Hwidong Na

Yan Gao

Renato De Mori

Yoshua Bengio

SpeechBrain is an open-source and all-in-one speech toolkit. It is designed to facilitate the research and development of neural speech processing technologies by being simple, flexible, user-friendly, and well-documented. This paper describes the core architecture designed to support several tasks of common interest, allowing users to naturally conceive, compare and share novel speech processing pipelines. SpeechBrain achieves competitive or state-of-the-art performance in a wide range of speech benchmarks. It also provides training recipes, pretrained models, and inference scripts for popular speech datasets, as well as tutorials which allow anyone with basic Python proficiency to familiarize themselves with speech technologies.

2020-12-31

arXiv (preprint)

doi.org

arxiv.org

The PyTorch-Kaldi Speech Recognition Toolkit

Mirco Ravanelli

Titouan Parcollet

Yoshua Bengio

The availability of open-source software is playing a remarkable role in the popularization of speech recognition and deep learning. Kaldi, for instance, is nowadays an established framework used to develop state-of-the-art speech recognizers. PyTorch is used to build neural networks with the Python language and has recently spawned tremendous interest within the machine learning community thanks to its simplicity and flexibility. The PyTorch-Kaldi project aims to bridge the gap between these popular toolkits, trying to inherit the efficiency of Kaldi and the flexibility of PyTorch. PyTorch-Kaldi is not only a simple interface between these toolkits, but it also embeds several useful features for developing modern speech recognizers. For instance, the code is specifically designed to naturally plug in user-defined acoustic models. As an alternative, users can exploit several pre-implemented neural networks that can be customized using intuitive configuration files. PyTorch-Kaldi supports multiple feature and label streams as well as combinations of neural networks, enabling the use of complex neural architectures. The toolkit is publicly released along with rich documentation and is designed to work properly locally or on HPC clusters. Experiments conducted on several datasets and tasks show that PyTorch-Kaldi can effectively be used to develop modern state-of-the-art speech recognizers.

2019-04-16

ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (published)

doi.org

arxiv.org

Quaternion Recurrent Neural Networks

Titouan Parcollet

Mirco Ravanelli

Mohamed Morchid

Georges Linarès

Chiheb Trabelsi

Renato De Mori

Yoshua Bengio

Recurrent neural networks (RNNs) are powerful architectures to model sequential data, due to their capability to learn short and long-term dependencies between the basic elements of a sequence. Nonetheless, popular tasks such as speech or image recognition involve multi-dimensional input features that are characterized by strong internal dependencies between the dimensions of the input vector. We propose a novel quaternion recurrent neural network (QRNN), along with a quaternion long short-term memory neural network (QLSTM), that take into account both the external relations and these internal structural dependencies with the quaternion algebra. Similarly to capsules, quaternions allow the QRNN to code internal dependencies by composing and processing multidimensional features as single entities, while the recurrent operation reveals correlations between the elements composing the sequence. We show that both QRNN and QLSTM achieve better performance than RNN and LSTM in a realistic application of automatic speech recognition. Finally, we show that QRNN and QLSTM reduce the number of free parameters needed to reach better results by a maximum factor of 3.3x compared to real-valued RNNs and LSTMs, leading to a more compact representation of the relevant information.

2018-12-31

ICLR.cc/2019/Conference (poster)

openreview.net

Quaternion Convolutional Neural Networks for End-to-End Automatic Speech Recognition

Mohamed Morchid

Georges Linarès

Recently, the connectionist temporal classification (CTC) model coupled with recurrent (RNN) or convolutional neural networks (CNN) made it easier to train speech recognition systems in an end-to-end fashion. However, in real-valued models, time frame components such as mel-filter-bank energies and the cepstral coefficients obtained from them, together with their first and second order derivatives, are processed as individual elements, while a natural alternative is to process such components as composed entities. We propose to group such elements in the form of quaternions and to process these quaternions using the established quaternion algebra. Quaternion numbers and quaternion neural networks have shown their efficiency to process multidimensional inputs as entities, to encode internal dependencies, and to solve many tasks with fewer learning parameters than real-valued models. This paper proposes to integrate multiple feature views in a quaternion-valued convolutional neural network (QCNN), to be used for sequence-to-sequence mapping with the CTC model. Promising results are reported using simple QCNNs in phoneme recognition experiments with the TIMIT corpus. More precisely, QCNNs obtain a lower phoneme error rate (PER) with fewer learning parameters than a competing model based on real-valued CNNs.

2018-09-01

Interspeech 2018 (published)

doi.org

arxiv.org
