Portrait de Mirco Ravanelli

Mirco Ravanelli

Membre académique associé
Professeur adjoint, Concordia University, École de génie et d'informatique Gina-Cody
Professeur associé, Université de Montréal, Département d'informatique et de recherche opérationnelle
Sujets de recherche
Apprentissage profond

Biographie

Mirco Ravanelli est professeur adjoint à l'Université Concordia, professeur associé à l'Université de Montréal et membre associé de Mila – Institut québécois d’intelligence artificielle. Lauréat du prix Amazon Research 2022, il est expert en apprentissage profond et en IA conversationnelle, et a publié plus de 60 articles dans ces domaines. Il se concentre principalement sur les nouveaux algorithmes d'apprentissage profond, y compris l'apprentissage autosupervisé, continu, multimodal, coopératif et économe en énergie. Mirco Ravanelli a effectué son postdoctorat à Mila, sous la direction du professeur Yoshua Bengio. Il est notamment le fondateur et le chef de file de SpeechBrain, l'une des boîtes à outils en code source ouvert les plus largement adoptées dans le domaine du traitement de la parole et de l'IA conversationnelle.

Étudiants actuels

Maîtrise recherche - Concordia
Collaborateur·rice de recherche - Concordia University
Collaborateur·rice de recherche - Concordia University
Doctorat - Concordia
Co-superviseur⋅e :
Baccalauréat - Concordia
Maîtrise recherche - Concordia
Doctorat - Concordia
Co-superviseur⋅e :
Doctorat - Concordia
Collaborateur·rice de recherche - Concordia University
Collaborateur·rice de recherche - Concordia University
Collaborateur·rice alumni - UdeM
Superviseur⋅e principal⋅e :
Doctorat - UdeM
Co-superviseur⋅e :
Doctorat - Concordia
Co-superviseur⋅e :
Postdoctorat - McGill
Doctorat - UdeM
Collaborateur·rice de recherche - Concordia University
Maîtrise recherche - Concordia

Publications

Exploring Self-Attention Mechanisms for Speech Separation
Samuele Cornell
François Grondin
Mirko Bronzi
Transformers have enabled impressive improvements in deep learning. They often outperform recurrent and convolutional models in many tasks w… (voir plus)hile taking advantage of parallel processing. Recently, we proposed the SepFormer, which obtains state-of-the-art performance in speech separation with the WSJ0-2/3 Mix datasets. This paper studies in-depth Transformers for speech separation. In particular, we extend our previous findings on the SepFormer by providing results on more challenging noisy and noisy-reverberant datasets, such as LibriMix, WHAM!, and WHAMR!. Moreover, we extend our model to perform speech enhancement and provide experimental evidence on denoising and dereverberation tasks. Finally, we investigate, for the first time in speech separation, the use of efficient self-attention mechanisms such as Linformers, Lonformers, and ReFormers. We found that they reduce memory requirements significantly. For example, we show that the Reformer-based attention outperforms the popular Conv-TasNet model on the WSJ0-2Mix dataset while being faster at inference and comparable in terms of memory consumption.
OSSEM: one-shot speaker adaptive speech enhancement using meta learning
Cheng Yu
Szu-Wei Fu
Tsun-An Hsieh
Yu Tsao
SoundChoice: Grapheme-to-Phoneme Models with Semantic Disambiguation
Artem Ploujnikov
Real-M: Towards Speech Separation on Real Mixtures
Samuele Cornell
François Grondin
In recent years, deep learning based source separation has achieved impressive results. Most studies, however, still evaluate separation mod… (voir plus)els on synthetic datasets, while the performance of state-of-the-art techniques on in-the-wild speech data remains an open question. This paper contributes to fill this gap in two ways. First, we release the REAL-M dataset, a crowd-sourced corpus of real-life mixtures. Secondly, we address the problem of performance evaluation of real-life mixtures, where the ground truth is not available. We bypass this issue by carefully designing a blind Scale-Invariant Signal-to-Noise Ratio (SI-SNR) neural estimator. Through a user study, we show that our estimator reliably evaluates the separation performance on real mixtures, i.e. we observe that the performance predictions of the SI-SNR estimator correlate well with human opinions. Moreover, when evaluating popular speech separation models, we observe that the performance trends predicted by our estimator on the REAL-M dataset closely follow the performance trends achieved on synthetic benchmarks.
Learning Representations for New Sound Classes With Continual Self-Supervised Learning
Zhepei Wang
Xilin Jiang
Junkai Wu
Efthymios Tzinis
Paris Smaragdis
In this article, we work on a sound recognition system that continually incorporates new sound classes. Our main goal is to develop a framew… (voir plus)ork where the model can be updated without relying on labeled data. For this purpose, we propose adopting representation learning, where an encoder is trained using unlabeled data. This learning framework enables the study and implementation of a practically relevant use case where only a small amount of the labels is available in a continual learning context. We also make the empirical observation that a similarity-based representation learning method within this framework is robust to forgetting even if no explicit mechanism against forgetting is employed. We show that this approach obtains similar performance compared to several distillation-based continual learning methods when employed on self-supervised representation learning methods.
Learning Representations for New Sound Classes With Continual Self-Supervised Learning
Zhepei Wang
Xilin Jiang
Junkai Wu
Efthymios Tzinis
Paris Smaragdis
In this article, we work on a sound recognition system that continually incorporates new sound classes. Our main goal is to develop a framew… (voir plus)ork where the model can be updated without relying on labeled data. For this purpose, we propose adopting representation learning, where an encoder is trained using unlabeled data. This learning framework enables the study and implementation of a practically relevant use case where only a small amount of the labels is available in a continual learning context. We also make the empirical observation that a similarity-based representation learning method within this framework is robust to forgetting even if no explicit mechanism against forgetting is employed. We show that this approach obtains similar performance compared to several distillation-based continual learning methods when employed on self-supervised representation learning methods.
OSSEM: one-shot speaker adaptive speech enhancement using meta learning
Cheng Yu
Szu‐wei Fu
Tsun-An Hsieh
Yu-shan Tsao
Although deep learning (DL) has achieved notable progress in speech enhancement (SE), further research is still required for a DL-based SE s… (voir plus)ystem to adapt effectively and efficiently to particular speakers. In this study, we propose a novel meta-learning-based speaker-adaptive SE approach (called OSSEM) that aims to achieve SE model adaptation in a one-shot manner. OSSEM consists of a modified transformer SE network and a speaker-specific masking (SSM) network. In practice, the SSM network takes an enrolled speaker embedding extracted using ECAPA-TDNN to adjust the input noisy feature through masking. To evaluate OSSEM, we designed a modified Voice Bank-DEMAND dataset, in which one utterance from the testing set was used for model adaptation, and the remaining utterances were used for testing the performance. Moreover, we set restrictions allowing the enhancement process to be conducted in real time, and thus designed OSSEM to be a causal SE system. Experimental results first show that OSSEM can effectively adapt a pretrained SE model to a particular speaker with only one utterance, thus yielding improved SE results. Meanwhile, OSSEM exhibits a competitive performance compared to state-of-the-art causal SE systems.
Real-M: Towards Speech Separation on Real Mixtures
Samuele Cornell
François Grondin
In recent years, deep learning based source separation has achieved impressive results. Most studies, however, still evaluate separation mod… (voir plus)els on synthetic datasets, while the performance of state-of-the-art techniques on in-the-wild speech data remains an open question. This paper contributes to fill this gap in two ways. First, we release the REAL-M dataset, a crowd-sourced corpus of real-life mixtures. Secondly, we address the problem of performance evaluation of real-life mixtures, where the ground truth is not available. We bypass this issue by carefully designing a blind Scale-Invariant Signal-to-Noise Ratio (SI-SNR) neural estimator. Through a user study, we show that our estimator reliably evaluates the separation performance on real mixtures, i.e. we observe that the performance predictions of the SI-SNR estimator correlate well with human opinions. Moreover, when evaluating popular speech separation models, we observe that the performance trends predicted by our estimator on the REAL-M dataset closely follow the performance trends achieved on synthetic benchmarks.