
Mirco Ravanelli

Associate Academic Member
Assistant Professor, Concordia University, Gina Cody School of Engineering and Computer Science
Adjunct Professor, Université de Montréal, Department of Computer Science and Operations Research
Research Topics
Deep Learning

Biography

Mirco Ravanelli is an Assistant Professor at Concordia University, an Adjunct Professor at Université de Montréal, and an Associate Member of Mila – Quebec Artificial Intelligence Institute. A recipient of a 2022 Amazon Research Award, he is an expert in deep learning and conversational AI and has published more than 60 papers in these fields. His research focuses primarily on novel deep learning algorithms, including self-supervised, continual, multimodal, cooperative, and energy-efficient learning. Mirco Ravanelli completed his postdoctoral fellowship at Mila under the supervision of Professor Yoshua Bengio. He is notably the founder and lead developer of SpeechBrain, one of the most widely adopted open-source toolkits for speech processing and conversational AI.

Current Students

Research Master's - Concordia
Undergraduate - Concordia
Research Collaborator - Concordia University
Research Collaborator - Concordia University
Research Collaborator - Concordia University
Research Intern - Concordia
Research Intern - Concordia
Research Master's - Concordia
PhD - Concordia
Co-supervisor:
Research Master's - Concordia
Co-supervisor:
Research Master's - Concordia
PhD - Concordia
Co-supervisor:
PhD - Concordia
Research Collaborator - International School for Advanced Studies (Trieste, Italy)
Research Collaborator - Concordia University
Alumni Collaborator - UdeM
Principal supervisor:
PhD - UdeM
Co-supervisor:
PhD - Concordia
Co-supervisor:
Postdoctorate - McGill
PhD - UdeM

Publications

Speech Self-Supervised Representation Benchmarking: Are We Doing it Right?
Salah Zaiem
Youcef Kemiche
Titouan Parcollet
Slim Essid
Self-supervised learning (SSL) has recently allowed leveraging large datasets of unlabeled speech signals to reach impressive performance on speech tasks using only small amounts of annotated data. The high number of proposed approaches fostered the need and rise of extended benchmarks that evaluate their performance on a set of downstream tasks exploring various aspects of the speech signal. However, while the number of considered tasks has been growing, most rely upon a single decoding architecture that maps the frozen SSL representations to the downstream labels. This work investigates the robustness of such benchmarking results to changes in the decoder architecture. Interestingly, it appears that varying the architecture of the downstream decoder leads to significant variations in the leaderboards of most tasks. Concerningly, our study reveals that benchmarking with limited decoders may cause a counterproductive increase in the sizes of the developed SSL models.
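To make the variable under study concrete, here is a minimal probing sketch (not the benchmark's actual code): the same frozen SSL features are fed to two downstream decoders of very different capacity, the design choice the paper shows can reshuffle leaderboards. All shapes, modules, and the random feature stand-in are illustrative.

```python
# Minimal sketch: probing frozen SSL features with two different decoder heads.
import torch
import torch.nn as nn

class LinearDecoder(nn.Module):
    """Lightweight linear head commonly used in SSL benchmarks."""
    def __init__(self, feat_dim, n_classes):
        super().__init__()
        self.head = nn.Linear(feat_dim, n_classes)
    def forward(self, feats):                 # feats: (batch, time, feat_dim)
        return self.head(feats.mean(dim=1))   # mean-pool over time, then classify

class BiLSTMDecoder(nn.Module):
    """Heavier recurrent head; may rank the same SSL encoders differently."""
    def __init__(self, feat_dim, n_classes, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_classes)
    def forward(self, feats):
        out, _ = self.rnn(feats)
        return self.head(out.mean(dim=1))

# Stand-in for frozen SSL representations (in practice: a pretrained encoder's output).
feats = torch.randn(8, 100, 768)              # (batch, time, feat_dim)
for decoder in (LinearDecoder(768, 10), BiLSTMDecoder(768, 10)):
    logits = decoder(feats)
    print(type(decoder).__name__, logits.shape)  # both map to (8, 10)
```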
Posthoc Interpretation via Quantization
Francesco Paissan
In this paper, we introduce a new approach, called Posthoc Interpretation via Quantization (PIQ), for interpreting decisions made by trained classifiers. Our method utilizes vector quantization to transform the representations of a classifier into a discrete, class-specific latent space. The class-specific codebooks act as a bottleneck that forces the interpreter to focus on the parts of the input data deemed relevant by the classifier for making a prediction. Our model formulation also enables learning concepts by incorporating the supervision of pretrained annotation models such as state-of-the-art image segmentation models. We evaluated our method through quantitative and qualitative studies involving black-and-white images, color images, and audio. These studies show that PIQ generates interpretations that participants in our user studies found easier to understand than several other interpretation methods from the literature.
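The core mechanism the abstract describes, a discrete codebook bottleneck, can be sketched in a few lines. This is a generic vector-quantization step with a straight-through gradient, not the paper's released code; shapes and names are illustrative.

```python
# Hedged sketch of a class-specific codebook acting as a discrete bottleneck.
import torch

def quantize(z, codebook):
    """Replace each vector in z with its nearest codebook entry.
    z: (n, d) representations; codebook: (k, d) learned class-specific codes."""
    dists = torch.cdist(z, codebook)      # (n, k) pairwise distances
    idx = dists.argmin(dim=1)             # nearest code per vector
    z_q = codebook[idx]                   # discrete bottleneck output
    # Straight-through estimator: gradients flow to z, bypassing the argmin.
    return z + (z_q - z).detach(), idx

z = torch.randn(16, 64, requires_grad=True)        # classifier representations
codebook = torch.randn(32, 64, requires_grad=True) # learnable codes
z_q, idx = quantize(z, codebook)
print(z_q.shape, idx[:5])                           # (16, 64), discrete code ids
```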
Exploring Self-Attention Mechanisms for Speech Separation
Samuele Cornell
François Grondin
Mirko Bronzi
Transformers have enabled impressive improvements in deep learning. They often outperform recurrent and convolutional models in many tasks while taking advantage of parallel processing. Recently, we proposed the SepFormer, which obtains state-of-the-art performance in speech separation on the WSJ0-2/3 Mix datasets. This paper studies Transformers in depth for speech separation. In particular, we extend our previous findings on the SepFormer by providing results on more challenging noisy and noisy-reverberant datasets, such as LibriMix, WHAM!, and WHAMR!. Moreover, we extend our model to perform speech enhancement and provide experimental evidence on denoising and dereverberation tasks. Finally, we investigate, for the first time in speech separation, the use of efficient self-attention mechanisms such as Linformers, Longformers, and Reformers. We found that they reduce memory requirements significantly. For example, we show that the Reformer-based attention outperforms the popular Conv-TasNet model on the WSJ0-2Mix dataset while being faster at inference and comparable in terms of memory consumption.
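To see why efficient attention matters here, the sketch below shows the memory hotspot of vanilla self-attention: the full (T x T) score matrix, which Linformer/Longformer/Reformer-style mechanisms avoid materializing. This is a generic illustration, not the SepFormer implementation.

```python
# Vanilla scaled dot-product attention on a long sequence.
import torch

def vanilla_attention(q, k, v):
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5  # (T, T) memory hotspot
    return torch.softmax(scores, dim=-1) @ v

T, d = 4000, 64                  # long sequence, as in waveform-level separation
q = k = v = torch.randn(T, d)
out = vanilla_attention(q, k, v)
print(out.shape)                 # (4000, 64); the (T, T) scores alone are ~64 MB in fp32
```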
SoundChoice: Grapheme-to-Phoneme Models with Semantic Disambiguation
Artem Ploujnikov
Real-M: Towards Speech Separation on Real Mixtures
Samuele Cornell
François Grondin
In recent years, deep learning based source separation has achieved impressive results. Most studies, however, still evaluate separation models on synthetic datasets, while the performance of state-of-the-art techniques on in-the-wild speech data remains an open question. This paper contributes to filling this gap in two ways. First, we release the REAL-M dataset, a crowd-sourced corpus of real-life mixtures. Second, we address the problem of performance evaluation of real-life mixtures, where the ground truth is not available. We bypass this issue by carefully designing a blind Scale-Invariant Signal-to-Noise Ratio (SI-SNR) neural estimator. Through a user study, we show that our estimator reliably evaluates the separation performance on real mixtures, i.e., we observe that the performance predictions of the SI-SNR estimator correlate well with human opinions. Moreover, when evaluating popular speech separation models, we observe that the performance trends predicted by our estimator on the REAL-M dataset closely follow the performance trends achieved on synthetic benchmarks.
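The paper's estimator predicts SI-SNR blindly, without a reference; for context, here is the standard reference-based SI-SNR it is trained to approximate. This follows the commonly used definition rather than code from the paper, and the signals below are synthetic stand-ins.

```python
# Reference-based Scale-Invariant SNR, the target of the blind neural estimator.
import torch

def si_snr(est, ref, eps=1e-8):
    """Scale-Invariant Signal-to-Noise Ratio in dB. est/ref: 1-D (time,) tensors."""
    est = est - est.mean()
    ref = ref - ref.mean()
    # Project the estimate onto the reference to get the scaled target.
    s_target = (torch.dot(est, ref) / (ref.pow(2).sum() + eps)) * ref
    e_noise = est - s_target
    return 10 * torch.log10(s_target.pow(2).sum() / (e_noise.pow(2).sum() + eps))

ref = torch.randn(16000)               # 1 s of 16 kHz "reference" speech
est = ref + 0.1 * torch.randn(16000)   # an estimate with mild residual noise
print(f"{float(si_snr(est, ref)):.1f} dB")
```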
Learning Representations for New Sound Classes With Continual Self-Supervised Learning
Zhepei Wang
Xilin Jiang
Junkai Wu
Efthymios Tzinis
Paris Smaragdis
In this article, we work on a sound recognition system that continually incorporates new sound classes. Our main goal is to develop a framework where the model can be updated without relying on labeled data. For this purpose, we propose adopting representation learning, where an encoder is trained using unlabeled data. This learning framework enables the study and implementation of a practically relevant use case where only a small amount of labeled data is available in a continual learning context. We also make the empirical observation that a similarity-based representation learning method within this framework is robust to forgetting even if no explicit mechanism against forgetting is employed. We show that this approach achieves performance similar to several distillation-based continual learning methods when applied to self-supervised representation learning.
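As a rough illustration of the similarity-based objective the abstract refers to, the sketch below pulls the embeddings of two augmented views of the same clip together with a cosine loss, using no labels. The encoder and the "augmentations" are toy stand-ins, not the paper's setup.

```python
# Hedged sketch of a similarity-based self-supervised objective.
import torch
import torch.nn.functional as F

def similarity_loss(z1, z2):
    """Negative cosine similarity between paired view embeddings."""
    return -F.cosine_similarity(z1, z2, dim=-1).mean()

encoder = torch.nn.Sequential(torch.nn.Linear(128, 64), torch.nn.ReLU(),
                              torch.nn.Linear(64, 32))
clip = torch.randn(8, 128)                     # stand-in for audio features
view1 = clip + 0.05 * torch.randn_like(clip)   # stand-in "augmentation" 1
view2 = clip + 0.05 * torch.randn_like(clip)   # stand-in "augmentation" 2
loss = similarity_loss(encoder(view1), encoder(view2))
loss.backward()                                # updates the encoder, no labels used
print(float(loss))
```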
Biasly: a machine learning based platform for automatic racial discrimination detection in online texts
Warning: this paper contains content that may be offensive or upsetting. Detecting hateful, toxic, and otherwise racist or sexist language in user-generated online content has become an increasingly important task in recent years. Indeed, the anonymity, the transience, the size of messages, and the difficulty of moderation facilitate the diffusion of racist or hateful messages across the Internet. The critical influence of this cyber-racism is no longer limited to social media but also has a significant effect on our society: corporate business operations, users' health, crime, etc. Traditional racist-speech reporting channels have proven inadequate given the enormous explosion of information, so there is an urgent need for a method to automatically and promptly detect texts containing racial discrimination. In this work, we propose a machine learning-based approach to enable automatic detection of racist text content on the Internet. State-of-the-art machine learning models that are able to grasp language structure are adapted in this study. Our main contributions include 1) a large-scale racial discrimination dataset collected from three distinct sources and annotated according to a guideline developed by specialists, 2) a set of machine learning models with various architectures for racial discrimination detection, and 3) web-browser-based software that assists users in debiasing their texts when using the Internet. All these resources are made publicly available.
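For readers unfamiliar with this model family, the sketch below fine-tunes a generic pretrained transformer for binary discriminatory-text detection, the kind of classifier the abstract describes. The model checkpoint, labels, and toy data are placeholders, not the paper's released models or dataset.

```python
# Illustrative sketch: fine-tuning a pretrained transformer text classifier.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)       # 0 = neutral, 1 = discriminatory

texts = ["example sentence one", "example sentence two"]  # placeholder data
labels = torch.tensor([0, 1])
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

out = model(**batch, labels=labels)          # returns cross-entropy loss + logits
out.loss.backward()                          # one step of standard fine-tuning
print(out.logits.softmax(dim=-1))            # per-class probabilities
```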
OSSEM: one-shot speaker adaptive speech enhancement using meta learning
Cheng Yu
Szu-Wei Fu
Tsun-An Hsieh
Yu Tsao
Although deep learning (DL) has achieved notable progress in speech enhancement (SE), further research is still required for a DL-based SE system to adapt effectively and efficiently to particular speakers. In this study, we propose a novel meta-learning-based speaker-adaptive SE approach (called OSSEM) that aims to achieve SE model adaptation in a one-shot manner. OSSEM consists of a modified transformer SE network and a speaker-specific masking (SSM) network. In practice, the SSM network takes an enrolled speaker embedding extracted using ECAPA-TDNN to adjust the input noisy feature through masking. To evaluate OSSEM, we designed a modified Voice Bank-DEMAND dataset, in which one utterance from the testing set was used for model adaptation, and the remaining utterances were used for testing the performance. Moreover, we set restrictions allowing the enhancement process to be conducted in real time, and thus designed OSSEM to be a causal SE system. Experimental results first show that OSSEM can effectively adapt a pretrained SE model to a particular speaker with only one utterance, thus yielding improved SE results. Meanwhile, OSSEM exhibits a competitive performance compared to state-of-the-art causal SE systems.
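The speaker-specific masking idea above can be sketched as a mask predicted from an enrolled speaker embedding that rescales the noisy input features before enhancement. Dimensions, layers, and the random inputs are illustrative stand-ins (ECAPA-TDNN embeddings are typically 192-dimensional), not the OSSEM implementation.

```python
# Hedged sketch of speaker-specific masking (SSM) conditioning.
import torch
import torch.nn as nn

class SpeakerMask(nn.Module):
    def __init__(self, emb_dim=192, feat_dim=257):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(emb_dim, feat_dim), nn.Sigmoid())
    def forward(self, noisy_feats, spk_emb):
        # noisy_feats: (batch, time, feat_dim); spk_emb: (batch, emb_dim)
        mask = self.proj(spk_emb).unsqueeze(1)   # (batch, 1, feat_dim)
        return noisy_feats * mask                # emphasize speaker-relevant bins

spk_emb = torch.randn(2, 192)      # stand-in for an ECAPA-TDNN speaker embedding
noisy = torch.randn(2, 100, 257)   # stand-in spectrogram features
print(SpeakerMask()(noisy, spk_emb).shape)  # masked features, same shape as input
```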
SpeechBrain: A General-Purpose Speech Toolkit
Titouan Parcollet
Peter William VanHarn Plantinga
Aku Rouhe
Samuele Cornell
Loren Lugosch
Nauman Dawalatabad
Abdelwahab Heba
Jianyuan Zhong
Ju-Chieh Chou
Sung-Lin Yeh
Szu-Wei Fu
Chien-Feng Liao
Elena Rastorgueva
François Grondin
William Aris
Hwidong Na
Yan Gao
Renato De Mori … (see 1 more)
SpeechBrain is an open-source and all-in-one speech toolkit. It is designed to facilitate the research and development of neural speech processing technologies by being simple, flexible, user-friendly, and well-documented. This paper describes the core architecture designed to support several tasks of common interest, allowing users to naturally conceive, compare, and share novel speech processing pipelines. SpeechBrain achieves competitive or state-of-the-art performance on a wide range of speech benchmarks. It also provides training recipes, pretrained models, and inference scripts for popular speech datasets, as well as tutorials that allow anyone with basic Python proficiency to familiarize themselves with speech technologies.
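As a quick taste of the toolkit's inference interface, the sketch below loads one of SpeechBrain's publicly released pretrained ASR models and transcribes a file. The model identifier is a real public recipe; note that the import path has moved across versions (older releases expose the same class under speechbrain.pretrained), and "example.wav" is a placeholder path.

```python
# Usage sketch: pretrained ASR inference with SpeechBrain.
from speechbrain.inference.ASR import EncoderDecoderASR

asr = EncoderDecoderASR.from_hparams(
    source="speechbrain/asr-crdnn-rnnlm-librispeech",  # public pretrained recipe
    savedir="pretrained_asr",                          # local cache for weights
)
print(asr.transcribe_file("example.wav"))  # path to a 16 kHz mono WAV of your choice
```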