Introducing SpeechBrain: A general-purpose PyTorch speech processing toolkit

28/04/2021

by Mirco Ravanelli, Loren Lugosch

What is SpeechBrain?

SpeechBrain is an open-source and all-in-one speech toolkit. It is designed to make the research and development of neural speech processing technologies easier by being simple, flexible, user-friendly, and well-documented.

We designed it to natively support multiple speech tasks of common interest, including:

Speech Recognition, i.e. speech-to-text.
Spoken Language Understanding, i.e. speech-to-semantics.
Speaker Recognition, i.e. identifying or verifying speaker identities from speech recordings.
Speech Enhancement, i.e. improving the quality of the speech signal by removing noise.
Speech Separation, i.e. separating multiple speakers speaking at the same time.
Speaker Diarization, i.e. detecting who spoke when.
Multi-microphone signal processing, i.e. combining the information recorded by multiple microphones.

Many other tasks such as text-to-speech, sound event classification, and self-supervised learning will be supported soon. The toolkit provides training recipes for popular speech datasets. Pre-trained models are released on Hugging Face (https://huggingface.co/speechbrain/), along with intuitive functionalities for inference and fine-tuning. To help beginners familiarize themselves with the toolkit, we wrote several tutorials on Google Colab (https://speechbrain.github.io/tutorial_basics.html). SpeechBrain is released under the Apache License, version 2.0.

Website: https://speechbrain.github.io/
GitHub: https://github.com/speechbrain/speechbrain

Motivation

The availability of open-source software is playing a remarkable role in the deep learning community, as was demonstrated with Theano [1] and its Deep Learning Tutorials [2] in the early years of deep learning. Nowadays, one of the most commonly used toolkits is PyTorch [3], thanks to its modern and flexible design that supports GPU-based tensor computations and facilitates the development of dynamically structured neural architectures with proper routines for automatic gradient computation.

In parallel to general-purpose deep learning software, some speech processing toolkits have also gained popularity within the research community. Most of these toolkits are limited to specific speech tasks. For instance, Kaldi [4] is an established framework used to develop state-of-the-art speech recognizers.

Even though many of these frameworks work well for the specific task for which they are designed, our experience in the field suggests that having a single, efficient, and flexible toolkit can significantly speed up the research and development of speech and audio processing techniques. Familiarizing oneself with a single toolkit is considerably easier than learning several different frameworks, especially since all state-of-the-art speech processing techniques share the same underlying technology: deep learning. SpeechBrain therefore consolidates speech processing tasks within a single toolkit for the benefit of the research community.

Only recently have some excellent speech toolkits that support multiple speech tasks been publicly released. Examples are ESPnet [5] and NeMo [6]. Along this line, we recently released SpeechBrain, which we designed from scratch with the goal of making it simple, flexible, and modular. We want SpeechBrain to be suitable for educational purposes as well. We have thus put major effort into rich documentation and tutorials, to help beginners familiarize themselves with our toolkit.

Usage Example

You can simply install SpeechBrain in this way:

pip install speechbrain

If you prefer a local installation, you can type:

git clone https://github.com/speechbrain/speechbrain
cd speechbrain
pip install -r requirements.txt
pip install --editable .

Inference with a pre-trained model

Once installed, we can start playing with it. Let’s see first how easy it is to use one of our pre-trained models stored on Hugging Face (https://huggingface.co/speechbrain). For instance, you can use a speech recognition model (trained on LibriSpeech) to transcribe an audio recording:

 
from speechbrain.pretrained import EncoderDecoderASR

asr_model = EncoderDecoderASR.from_hparams(
    source="speechbrain/asr-crdnn-rnnlm-librispeech",
    savedir="pretrained_models/asr-crdnn-rnnlm-librispeech",
)
asr_model.transcribe_file("speechbrain/asr-crdnn-rnnlm-librispeech/example.wav")

You can also perform speaker verification, to check whether two recordings come from the same speaker or from different speakers.

 
from speechbrain.pretrained import SpeakerRecognition

verification = SpeakerRecognition.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    savedir="pretrained_models/spkrec-ecapa-voxceleb",
)
score, prediction = verification.verify_files(
    "speechbrain/spkrec-ecapa-voxceleb/example1.wav",
    "speechbrain/spkrec-ecapa-voxceleb/example2.flac",
)

We also provide some pre-trained models for speech separation (using the SepFormer architecture):

from speechbrain.pretrained import SepformerSeparation as separator
import torchaudio

model = separator.from_hparams(
    source="speechbrain/sepformer-wsj02mix",
    savedir="pretrained_models/sepformer-wsj02mix",
)

# For a custom file, change the path
est_sources = model.separate_file(path="speechbrain/sepformer-wsj02mix/test_mixture.wav")

torchaudio.save("source1hat.wav", est_sources[:, :, 0].detach().cpu(), 8000)
torchaudio.save("source2hat.wav", est_sources[:, :, 1].detach().cpu(), 8000)

We support many other tasks (see https://huggingface.co/speechbrain/). As you can see, you can easily use pre-trained SpeechBrain models with just a few lines of code.
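
Returning to the speaker verification example, it may help to see what a score and a prediction could represent. The snippet below is a minimal sketch that assumes verification compares two speaker embeddings with cosine similarity and thresholds the result. It is an illustration, not the SpeechBrain implementation: the function names, the toy 3-dimensional embeddings, and the 0.25 threshold are all assumptions made for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def verify(embedding_1, embedding_2, threshold=0.25):
    """Return (score, prediction); prediction is True for "same speaker".

    The threshold is a hypothetical value chosen for this sketch.
    """
    score = cosine_similarity(embedding_1, embedding_2)
    return score, score > threshold


# Toy embeddings (made up); real speaker embeddings have many more dimensions.
speaker_a_utt1 = [0.9, 0.1, 0.2]
speaker_a_utt2 = [0.8, 0.2, 0.1]
speaker_b_utt1 = [-0.1, 0.9, -0.4]

same_score, same_speaker = verify(speaker_a_utt1, speaker_a_utt2)
diff_score, diff_speaker = verify(speaker_a_utt1, speaker_b_utt1)
print(same_speaker, diff_speaker)
```

In practice, a threshold of this kind is typically tuned on held-out trials rather than fixed by hand.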

Training a model

In addition to commonly used speech processing building blocks and pre-trained models, SpeechBrain comes with many recipes for training state-of-the-art speech models from scratch on a variety of tasks.

If you go into the main project folder, you can type:

cd recipes/{dataset}/{task}

Here, {dataset} is the corpus that you would like to use for training (e.g., LibriSpeech) and {task} is the speech task that you want to solve with this dataset (e.g., automatic speech recognition).

Then, we run a simple command, like this:

#Train the model using the default recipe
python train.py hparams/train.yaml

This command trains and tests a model. All the hyperparameters are summarized in a yaml file, while the main script for training is train.py.

yaml allows us to specify the hyperparameters in an elegant, flexible, and transparent way. Let’s see for instance this yaml snippet:

dropout: 0.8
compute_features: !new:speechbrain.lobes.features.MFCC
    n_mels: 40
    left_frames: 5
    right_frames: 5

model: !new:speechbrain.lobes.models.CRDNN.CRDNN
   input_shape: [null, null, 440]
   activation: !name:torch.nn.LeakyReLU []
   dropout: !ref <dropout>
   cnn_blocks: 2
   cnn_channels: (32, 16)
   cnn_kernelsize: (3, 3)
   time_pooling: True
   rnn_layers: 2
   rnn_neurons: 512
   rnn_bidirectional: True
   dnn_blocks: 2
   dnn_neurons: 1024

As you can see, this is not just a plain list of hyperparameters. For each parameter, we specify the class (or function) that is going to use it. This makes the code more transparent and easier to debug.

The yaml file contains all the information to initialize the classes when loading them. In SpeechBrain we load it with a special function called load_hyperpyyaml, which initializes all the declared classes. This makes the code extremely readable and compact.
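
The idea of instantiating classes directly from a configuration can be sketched with the Python standard library alone. The snippet below is a simplified illustration, not how load_hyperpyyaml is implemented: the dictionary layout, the "new" and "kwargs" keys, the "<key>" reference convention, and the helper names resolve_reference and instantiate are all hypothetical, and textwrap.TextWrapper merely stands in for a model class.

```python
import importlib


def resolve_reference(value, config):
    """Replace a "<key>" string with config[key]; other values pass through."""
    if isinstance(value, str) and value.startswith("<") and value.endswith(">"):
        return config[value[1:-1]]
    return value


def instantiate(entry, config):
    """Import the class named by entry["new"] and call it with entry["kwargs"]."""
    module_name, _, class_name = entry["new"].rpartition(".")
    cls = getattr(importlib.import_module(module_name), class_name)
    kwargs = {
        key: resolve_reference(value, config)
        for key, value in entry.get("kwargs", {}).items()
    }
    return cls(**kwargs)


# Hypothetical configuration: one shared value and one object that refers to it.
config = {
    "width": 40,
    "wrapper": {
        "new": "textwrap.TextWrapper",
        "kwargs": {"width": "<width>", "max_lines": 2},
    },
}

wrapper = instantiate(config["wrapper"], config)
print(type(wrapper).__name__, wrapper.width, wrapper.max_lines)
```

Changing the shared width entry changes every object that refers to it, which is the same convenience that the !ref tag provides in the yaml snippet above.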

The experiment file (e.g., example_asr_ctc_experiment.py in the example) trains a model by combining the functions or classes declared in the yaml file. This script defines the data processing pipeline and defines all the computations from the input signal to the final cost function. Everything is designed to be easy to customize.

To make training easier, SpeechBrain includes the Brain class, which provides overridable routines for training a model over multiple epochs, validation, checkpointing, and data loading. Our flexible DynamicItemDataset class allows the data reading pipeline to be fully customized directly in the experiment file.
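
The pattern of a fixed training loop with overridable steps can be sketched in plain Python. The code below is a toy under stated assumptions, not the SpeechBrain Brain class: MiniBrain, ScaleBrain, and their method names are hypothetical, the data is made up, and the gradient is written by hand, whereas the real class operates on PyTorch models with automatic differentiation.

```python
class MiniBrain:
    """Toy trainer: the loop is fixed, the model-specific steps are overridable."""

    def compute_forward(self, batch):
        raise NotImplementedError

    def compute_objectives(self, predictions, batch):
        raise NotImplementedError

    def fit_batch(self, batch):
        raise NotImplementedError

    def fit(self, epochs, train_set):
        """Run fit_batch over the data for several epochs; return mean loss per epoch."""
        history = []
        for _ in range(epochs):
            losses = [self.fit_batch(batch) for batch in train_set]
            history.append(sum(losses) / len(losses))
        return history


class ScaleBrain(MiniBrain):
    """Learns a single weight w so that w * x approximates y."""

    def __init__(self, lr=0.05):
        self.w = 0.0
        self.lr = lr

    def compute_forward(self, batch):
        x, _ = batch
        return self.w * x

    def compute_objectives(self, predictions, batch):
        _, y = batch
        return (predictions - y) ** 2

    def fit_batch(self, batch):
        x, y = batch
        predictions = self.compute_forward(batch)
        loss = self.compute_objectives(predictions, batch)
        # Hand-written gradient of the squared error with respect to w.
        self.w -= self.lr * 2 * (predictions - y) * x
        return loss


train_set = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # made-up data with y = 2 * x
brain = ScaleBrain()
history = brain.fit(epochs=20, train_set=train_set)
print(round(brain.w, 3), history[0] > history[-1])
```

Only the subclass knows about the model and the loss; the loop in fit stays the same for every experiment, which is what keeps the experiment files compact.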

As a result, the code used for training is rather compact and organized in meaningful classes/functions with clear functionalities. Even for complex systems, you can run all the training experiments in all the recipes in this simple way. Right now, we have recipes for many speech datasets, including LibriSpeech, VoxCeleb, CommonVoice, AISHELL-1, AMI, DNS, Google Speech Commands, SLURP, TIMIT, Voicebank, WSJ0Mix, Fluent Speech Commands, and Timers and Such.

Future Plans

We plan to progressively build a community working on this open-source toolkit. In the future, we would like to extend the functionalities of the toolkit to include tasks such as text-to-speech, self-supervised learning, models for small footprint devices, and support for real-time online speech processing. The open-source community will play an important role in this ambitious growth plan.

Other help can come from sponsors. Sponsoring allows us to keep expanding the SpeechBrain team and to greatly increase the number of new features coming out. If you are interested in contributing, do not hesitate to contact us at speechbrainproject@gmail.com.

Acknowledgments

A special thank you to all of the contributors who made this project possible! This project would not have been possible without the generous contribution of our current industrial sponsors: Samsung, Nvidia, Dolby, Nuance, Via-Dialog.

Contributors

Mirco Ravanelli, Mila, University of Montréal (CA)
Titouan Parcollet, Avignon Université (LIA, FR)
Aku Rouhe, Aalto University (FI)
Peter Plantinga, Ohio State University (USA)
Elena Rastorgueva
Loren Lugosch, Mila, McGill University (CA)
Nauman Dawalatabad, Indian Institute of Technology Madras (IN)
Ju-Chieh Chou, National Taiwan University (TW)
Abdel Heba, Linagora / University of Toulouse (IRIT, FR)
Francois Grondin, University of Sherbrooke (CA)
William Aris, University of Sherbrooke (CA)
Chien-Feng Liao, National Taiwan University (TW)
Samuele Cornell, Università Politecnica delle Marche (IT)
Sung-Lin Yeh, National Tsing Hua University (TW)
Hwidong Na, Visiting Researcher Samsung SAIL (CA)
Yan Gao, University of Cambridge (UK)
Szu-Wei Fu, Academia Sinica (TW)
Cem Subakan, Mila, University of Montréal (CA)
Jianyuan Zhong, University of Rochester (USA)
Brecht Desplanques, Ghent University (BE)
Jenthe Thienpondt, Ghent University (BE)
Salima Mdhaffar, Avignon Université (LIA, FR)
Renato De Mori, McGill University (CA), Avignon Université (LIA, FR)
Yoshua Bengio, Mila, University of Montréal (CA)