Mirco Ravanelli

Eleonora Mancini

Research Intern - Université de Montréal

Principal supervisor:

eleonora.mancini@mila.quebec

gianfranco.bertucci@mila.quebec

Google Scholar

Fırat Öncel

PhD - Concordia University

Co-supervisor:

Laurent Charlin

firat.oncel@mila.quebec

Research Master's - Concordia University

Website

hiba.akhaddar@mila.quebec

Hiba Akhaddar

Research Master's - Concordia University

Jama Mohamud

PhD - Université de Montréal

Co-supervisor:

Yoshua Bengio

hussein-mohamu.jama@mila.quebec

PhD - Concordia University

Co-supervisor:

luca.dellalibera@mila.quebec

pooneh.mousavi@mila.quebec

Pooneh Mousavi

PhD - Concordia University

Website

salman.hussainali@mila.quebec

Google Scholar

Salman Sami Hussain Ali

Research Collaborator - Concordia University

seina.assadian@mila.quebec

Seina Assadian

Research Collaborator - Concordia University

tristan.lueger@mila.quebec

Tristan Lueger

Research Collaborator - Concordia University

Victor Cruz

Research Master's - Concordia University

victor.cruz@mila.quebec

Wagner Drew

Bachelor's - Concordia University

drew.wagner@mila.quebec

SpeechBrain 1.0: Making conversational AI accessible to everyone

Blog posts

June 13, 2024

by

Mirco Ravanelli

Read the article

Introducing SpeechBrain: A general-purpose PyTorch speech processing toolkit

April 28, 2021


by

Mirco Ravanelli

Loren Lugosch

Read the article

Publications

Speech Self-Supervised Representations Benchmarking: a Case for Larger Probing Heads

Salah Zaiem

Youcef Kemiche

Titouan Parcollet

Slim Essid

2023-08-28

ArXiv (preprint)

Speech Self-Supervised Representation Benchmarking: Are We Doing it Right?

Salah Zaiem

Youcef Kemiche

Titouan Parcollet

Slim Essid

Self-supervised learning (SSL) has recently allowed leveraging large datasets of unlabeled speech signals to reach impressive performance on speech tasks using only small amounts of annotated data. The high number of proposed approaches fostered the need and rise of extended benchmarks that evaluate their performance on a set of downstream tasks exploring various aspects of the speech signal. However, and while the number of considered tasks has been growing, most rely upon a single decoding architecture that maps the frozen SSL representations to the downstream labels. This work investigates the robustness of such benchmarking results to changes in the decoder architecture. Interestingly, it appears that varying the architecture of the downstream decoder leads to significant variations in the leaderboards of most tasks. Concerningly, our study reveals that benchmarking using limited decoders may cause a counterproductive increase in the sizes of the developed SSL models.

2023-08-20

INTERSPEECH 2023 (published)

Generalization Limits of Graph Neural Networks in Identity Effects Learning

Giuseppe Alessio D'inverno

Simone Brugiapaglia

Graph Neural Networks (GNNs) have emerged as a powerful tool for data-driven learning on various graph domains. They are usually based on a message-passing mechanism and have gained increasing popularity for their intuitive formulation, which is closely linked to the Weisfeiler-Lehman (WL) test for graph isomorphism to which they have been proven equivalent in terms of expressive power. In this work, we establish new generalization properties and fundamental limits of GNNs in the context of learning so-called identity effects, i.e., the task of determining whether an object is composed of two identical components or not. Our study is motivated by the need to understand the capabilities of GNNs when performing simple cognitive tasks, with potential applications in computational linguistics and chemistry. We analyze two case studies: (i) two-letters words, for which we show that GNNs trained via stochastic gradient descent are unable to generalize to unseen letters when utilizing orthogonal encodings like one-hot representations; (ii) dicyclic graphs, i.e., graphs composed of two cycles, for which we present positive existence results leveraging the connection between GNNs and the WL test. Our theoretical analysis is supported by an extensive numerical study.

2023-06-30

ArXiv (preprint)

Simulated Annealing in Early Layers Leads to Better Generalization

Amir M. Sarfi

Zahra Karimpour

Muawiz Chaudhary

Nasir M. Khalid

Sudhir Mudur

Eugene Belilovsky

Recently, a number of iterative learning methods have been introduced to improve generalization. These typically rely on training for longer periods of time in exchange for improved generalization. LLF (later-layer-forgetting) is a state-of-the-art method in this category. It strengthens learning in early layers by periodically re-initializing the last few layers of the network. Our principal innovation in this work is to use Simulated annealing in EArly Layers (SEAL) of the network in place of re-initialization of later layers. Essentially, later layers go through the normal gradient descent process, while the early layers go through short stints of gradient ascent followed by gradient descent. Extensive experiments on the popular Tiny-ImageNet dataset benchmark and a series of transfer learning and few-shot learning tasks show that we outperform LLF by a significant margin. We further show that, compared to normal training, LLF features, although improving on the target task, degrade the transfer learning performance across all datasets we explored. In comparison, our method outperforms LLF across the same target datasets by a large margin. We also show that the prediction depth of our method is significantly lower than that of LLF and normal training, indicating on average better prediction performance. The code to reproduce our results is publicly available at: https://github.com/amiiir-sarfi/SEAL

2023-06-17

2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (published)

Fine-Tuning Strategies for Faster Inference Using Speech Self-Supervised Models: A Comparative Study

Salah Zaiem

Robin Algayres

Titouan Parcollet

Slim Essid

Self-supervised learning (SSL) has allowed substantial progress in Automatic Speech Recognition (ASR) performance in low-resource settings. In this context, it has been demonstrated that larger self-supervised feature extractors are crucial for achieving lower downstream ASR error rates. Thus, better performance might be sanctioned with longer inferences. This article explores different approaches that may be deployed during the fine-tuning to reduce the computations needed in the SSL encoder, leading to faster inferences. We adapt a number of existing techniques to common ASR settings and benchmark them, displaying performance drops and gains in inference times. Interestingly, we found that given enough downstream data, a simple downsampling of the input sequences outperforms the other methods with both low performance drops and high computational savings, reducing computations by 61.3% with a WER increase of only 0.81. Finally, we analyze the robustness of the comparison to changes in dataset conditions, revealing sensitivity to dataset size.

2023-06-04

2023 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW) (published)

Speech Self-Supervised Representation Benchmarking: Are We Doing it Right?

Salah Zaiem

Youcef Kemiche

Titouan Parcollet

Slim Essid

2023-06-01

ArXiv (preprint)

Posthoc Interpretation via Quantization

Francesco Paissan

In this paper, we introduce a new approach, called Posthoc Interpretation via Quantization (PIQ), for interpreting decisions made by trained classifiers. Our method utilizes vector quantization to transform the representations of a classifier into a discrete, class-specific latent space. The class-specific codebooks act as a bottleneck that forces the interpreter to focus on the parts of the input data deemed relevant by the classifier for making a prediction. Our model formulation also enables learning concepts by incorporating the supervision of pretrained annotation models such as state-of-the-art image segmentation models. We evaluated our method through quantitative and qualitative studies involving black-and-white images, color images, and audio. As a result of these studies we found that PIQ generates interpretations that are more easily understood by participants to our user studies when compared to several other interpretation methods in the literature.

2023-03-22

ArXiv (preprint)

Exploring Self-Attention Mechanisms for Speech Separation

Samuele Cornell

François Grondin

Mirko Bronzi

Transformers have enabled impressive improvements in deep learning. They often outperform recurrent and convolutional models in many tasks while taking advantage of parallel processing. Recently, we proposed the SepFormer, which obtains state-of-the-art performance in speech separation with the WSJ0-2/3 Mix datasets. This paper studies in-depth Transformers for speech separation. In particular, we extend our previous findings on the SepFormer by providing results on more challenging noisy and noisy-reverberant datasets, such as LibriMix, WHAM!, and WHAMR!. Moreover, we extend our model to perform speech enhancement and provide experimental evidence on denoising and dereverberation tasks. Finally, we investigate, for the first time in speech separation, the use of efficient self-attention mechanisms such as Linformers, Longformers, and Reformers. We found that they reduce memory requirements significantly. For example, we show that the Reformer-based attention outperforms the popular Conv-TasNet model on the WSJ0-2Mix dataset while being faster at inference and comparable in terms of memory consumption.

2023-01-01

IEEE/ACM Transactions on Audio, Speech, and Language Processing (published)

OSSEM: one-shot speaker adaptive speech enhancement using meta learning

Cheng Yu

Szu-Wei Fu

Tsun-An Hsieh

Yu Tsao

2022-09-18

Interspeech 2022 (published)

SoundChoice: Grapheme-to-Phoneme Models with Semantic Disambiguation

Artem Ploujnikov

2022-09-18

Interspeech 2022 (published)

Real-M: Towards Speech Separation on Real Mixtures

Samuele Cornell

François Grondin

In recent years, deep learning based source separation has achieved impressive results. Most studies, however, still evaluate separation models on synthetic datasets, while the performance of state-of-the-art techniques on in-the-wild speech data remains an open question. This paper contributes to fill this gap in two ways. First, we release the REAL-M dataset, a crowd-sourced corpus of real-life mixtures. Secondly, we address the problem of performance evaluation of real-life mixtures, where the ground truth is not available. We bypass this issue by carefully designing a blind Scale-Invariant Signal-to-Noise Ratio (SI-SNR) neural estimator. Through a user study, we show that our estimator reliably evaluates the separation performance on real mixtures, i.e. we observe that the performance predictions of the SI-SNR estimator correlate well with human opinions. Moreover, when evaluating popular speech separation models, we observe that the performance trends predicted by our estimator on the REAL-M dataset closely follow the performance trends achieved on synthetic benchmarks.

2022-05-23

ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (published)

Learning Representations for New Sound Classes With Continual Self-Supervised Learning

Zhepei Wang

Xilin Jiang

Junkai Wu

Efthymios Tzinis

Paris Smaragdis

In this article, we work on a sound recognition system that continually incorporates new sound classes. Our main goal is to develop a framework where the model can be updated without relying on labeled data. For this purpose, we propose adopting representation learning, where an encoder is trained using unlabeled data. This learning framework enables the study and implementation of a practically relevant use case where only a small amount of the labels is available in a continual learning context. We also make the empirical observation that a similarity-based representation learning method within this framework is robust to forgetting even if no explicit mechanism against forgetting is employed. We show that this approach obtains similar performance compared to several distillation-based continual learning methods when employed on self-supervised representation learning methods.

2022-05-15

ArXiv (preprint)