
Mirco Ravanelli

Associate Academic Member
Assistant Professor, Concordia University, Gina Cody School of Engineering and Computer Science
Adjunct Professor, Université de Montréal, Department of Computer Science and Operations Research
Research Topics
Deep Learning

Biography

Mirco Ravanelli is an assistant professor at Concordia University, an adjunct professor at Université de Montréal, and an associate member of Mila – Quebec Artificial Intelligence Institute. A recipient of the 2022 Amazon Research Award, he is an expert in deep learning and conversational AI and has published more than 60 papers in these fields. His research focuses primarily on novel deep learning algorithms, including self-supervised, continual, multimodal, cooperative, and energy-efficient learning. He completed his postdoctoral research at Mila under the supervision of Professor Yoshua Bengio. He is the founder and lead developer of SpeechBrain, one of the most widely adopted open-source toolkits for speech processing and conversational AI.

Current Students

Bachelor's - Concordia
Research Master's - Concordia
Research Master's - Concordia
Principal supervisor:
Research Master's - Concordia
PhD - Concordia
Co-supervisor:
Research Master's - Concordia
Co-supervisor:
Research Master's - Concordia
Research Master's - Concordia
PhD - Concordia
Co-supervisor:
PhD - Concordia
Research Master's - Concordia
Research Intern - Université Laval
Principal supervisor:
PhD - Université Laval
Principal supervisor:
Professional Master's - Concordia
Alumni Collaborator - UdeM
Principal supervisor:
Research Collaborator - University of Toulon
Principal supervisor:
PhD - Concordia
Co-supervisor:
PhD - Université Laval
Principal supervisor:
Postdoctorate - McGill
PhD - UdeM
Research Master's - Concordia
Postdoctorate - Concordia

Publications

SoundChoice: Grapheme-to-Phoneme Models with Semantic Disambiguation
End-to-end speech synthesis models directly convert the input characters into an audio representation (e.g., spectrograms). Despite their impressive performance, such models have difficulty disambiguating the pronunciations of identically spelled words. To mitigate this issue, a separate Grapheme-to-Phoneme (G2P) model can be employed to convert the characters into phonemes before synthesizing the audio. This paper proposes SoundChoice, a novel G2P architecture that processes entire sentences rather than operating at the word level. The proposed architecture takes advantage of a weighted homograph loss (that improves disambiguation), exploits curriculum learning (that gradually switches from word-level to sentence-level G2P), and integrates word embeddings from BERT (for further performance improvement). Moreover, the model inherits the best practices in speech recognition, including multi-task learning with Connectionist Temporal Classification (CTC) and beam search with an embedded language model. As a result, SoundChoice achieves a Phoneme Error Rate (PER) of 2.65% on whole-sentence transcription using data from LibriSpeech and Wikipedia. Index Terms: grapheme-to-phoneme, speech synthesis, text-to-speech, phonetics, pronunciation, disambiguation.
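A minimal way to picture the weighted homograph loss is a per-token weighted cross-entropy over the predicted phoneme sequence. The sketch below is a hypothetical illustration, not the paper's implementation: the function name, the mask convention, and the default weight are assumptions.

```python
import torch
import torch.nn.functional as F

def weighted_homograph_loss(logits, targets, homograph_mask, homograph_weight=5.0):
    """Cross-entropy over phoneme tokens, up-weighting positions that belong
    to homographs so that disambiguation errors dominate the gradient.

    logits:         (batch, seq_len, n_phonemes) raw model outputs
    targets:        (batch, seq_len) gold phoneme indices
    homograph_mask: (batch, seq_len) 1.0 where the token comes from a homograph
    """
    per_token = F.cross_entropy(logits.transpose(1, 2), targets, reduction="none")
    weights = 1.0 + (homograph_weight - 1.0) * homograph_mask
    return (weights * per_token).mean()
```

The weight value trades off overall phoneme accuracy against homograph disambiguation and would normally be tuned on a validation set.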
Real-M: Towards Speech Separation on Real Mixtures
Samuele Cornell
François Grondin
In recent years, deep learning based source separation has achieved impressive results. Most studies, however, still evaluate separation models on synthetic datasets, while the performance of state-of-the-art techniques on in-the-wild speech data remains an open question. This paper contributes to filling this gap in two ways. First, we release the REAL-M dataset, a crowd-sourced corpus of real-life mixtures. Second, we address the problem of performance evaluation of real-life mixtures, where the ground truth is not available. We bypass this issue by carefully designing a blind Scale-Invariant Signal-to-Noise Ratio (SI-SNR) neural estimator. Through a user study, we show that our estimator reliably evaluates the separation performance on real mixtures. The performance predictions of the SI-SNR estimator indeed correlate well with human opinions. Moreover, we observe that the performance trends predicted by our estimator on the REAL-M dataset closely follow those achieved on synthetic benchmarks when evaluating popular speech separation models.
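The SI-SNR metric that the blind neural estimator is trained to predict has a standard closed form. The helper below computes it in PyTorch when a reference signal is available (i.e., on synthetic data); function and argument names are mine.

```python
import torch

def si_snr(estimate, reference, eps=1e-8):
    """Scale-Invariant Signal-to-Noise Ratio in dB (higher is better).

    estimate, reference: 1-D waveform tensors of equal length.
    """
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to isolate the target component.
    s_target = (torch.dot(estimate, reference) / (reference.pow(2).sum() + eps)) * reference
    e_noise = estimate - s_target
    return 10 * torch.log10(s_target.pow(2).sum() / (e_noise.pow(2).sum() + eps))
```

On real mixtures no reference exists, which is exactly why the paper trains a neural network to predict this quantity blindly.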
Learning Representations for New Sound Classes With Continual Self-Supervised Learning
Zhepei Wang
Xilin Jiang
Junkai Wu
Efthymios Tzinis
Paris Smaragdis
In this paper, we work on a sound recognition system that continually incorporates new sound classes. Our main goal is to develop a framework where the model can be updated without relying on labeled data. For this purpose, we propose adopting representation learning, where an encoder is trained using unlabeled data. This learning framework enables the study and implementation of a practically relevant use case where only a small amount of labeled data is available in a continual learning context. We also make the empirical observation that a similarity-based representation learning method within this framework is robust to forgetting, even if no explicit mechanism against forgetting is employed. We show that this approach obtains performance similar to that of several distillation-based continual learning methods when applied to self-supervised representation learning.
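"Similarity-based representation learning" in this context can be pictured as pulling together the embeddings of two augmented views of the same clip. The sketch below is a generic illustration under that assumption, not the paper's exact method; practical methods in this family (e.g., SimSiam or BYOL) add predictor networks and stop-gradients that are omitted here.

```python
import torch.nn.functional as F

def similarity_loss(encoder, view_a, view_b):
    """Negative cosine similarity between embeddings of two augmented
    views of the same audio batch; minimized when the views agree.

    encoder: any nn.Module mapping (batch, features) -> (batch, dim)
    """
    z_a = F.normalize(encoder(view_a), dim=-1)
    z_b = F.normalize(encoder(view_b), dim=-1)
    return -(z_a * z_b).sum(dim=-1).mean()
```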
Transformers with Competitive Ensembles of Independent Mechanisms
Di He
Guolin Ke
Chien-Feng Liao
An important development in deep learning from the earliest MLPs has been a move towards architectures with structural inductive biases which enable the model to keep distinct sources of information and routes of processing well-separated. This structure is linked to the notion of independent mechanisms from the causality literature, in which a mechanism is able to retain the same processing as irrelevant aspects of the world are changed. For example, convnets enable separation over positions, while attention-based architectures (especially Transformers) learn which combination of positions to process dynamically. In this work we explore a way in which the Transformer architecture is deficient: it represents each position with a large monolithic hidden representation and a single set of parameters which are applied over the entire hidden representation. This potentially throws unrelated sources of information together, and limits the Transformer's ability to capture independent mechanisms. To address this, we propose Transformers with Independent Mechanisms (TIM), a new Transformer layer which divides the hidden representation and parameters into multiple mechanisms, which only exchange information through attention. Additionally, we propose a competition mechanism which encourages these mechanisms to specialize over time steps, and thus be more independent. We study TIM on a large-scale BERT model, on the Image Transformer, and on speech enhancement and find evidence for semantically meaningful specialization as well as improved performance.
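The core structural idea, splitting one monolithic hidden state into several mechanisms with separate parameters, can be shown in a few lines. This sketch covers only that parameter-splitting slice of TIM; the inter-mechanism attention and the competition mechanism described above are omitted, and all sizes are assumptions.

```python
import torch
import torch.nn as nn

class MechanismFFN(nn.Module):
    """Divide the hidden state into n_mech groups, each processed by its
    own feed-forward parameters (a simplified slice of the TIM layer)."""

    def __init__(self, d_model=512, n_mech=4):
        super().__init__()
        assert d_model % n_mech == 0
        d_mech = d_model // n_mech
        self.n_mech = n_mech
        self.ffns = nn.ModuleList(
            nn.Sequential(nn.Linear(d_mech, 4 * d_mech), nn.ReLU(),
                          nn.Linear(4 * d_mech, d_mech))
            for _ in range(n_mech)
        )

    def forward(self, x):  # x: (batch, seq, d_model)
        chunks = x.chunk(self.n_mech, dim=-1)
        return torch.cat([f(c) for f, c in zip(self.ffns, chunks)], dim=-1)
```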
SpeechBrain: A General-Purpose Speech Toolkit
SpeechBrain is an open-source and all-in-one speech toolkit. It is designed to facilitate the research and development of neural speech processing technologies by being simple, flexible, user-friendly, and well-documented. This paper describes the core architecture designed to support several tasks of common interest, allowing users to naturally conceive, compare and share novel speech processing pipelines. SpeechBrain achieves competitive or state-of-the-art performance in a wide range of speech benchmarks. It also provides training recipes, pretrained models, and inference scripts for popular speech datasets, as well as tutorials which allow anyone with basic Python proficiency to familiarize themselves with speech technologies.
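Since SpeechBrain is the toolkit Ravanelli leads, a short usage example fits here. The snippet follows the documented pretrained-model interface; the exact import path and model identifier depend on the installed version (newer releases expose speechbrain.inference instead of speechbrain.pretrained), so check them against the current docs.

```python
# Requires: pip install speechbrain
from speechbrain.pretrained import EncoderDecoderASR  # speechbrain.inference.ASR in v1.x

# Download a pretrained LibriSpeech recognizer from HuggingFace and transcribe a file.
asr = EncoderDecoderASR.from_hparams(
    source="speechbrain/asr-crdnn-rnnlm-librispeech",
    savedir="pretrained_asr",
)
print(asr.transcribe_file("my_audio.wav"))  # path to a 16 kHz mono WAV
```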
Multi-Task Self-Supervised Learning for Robust Speech Recognition
Santiago Pascual
Pawel Swietojanski
Joao Monteiro
Jan Trmal
Despite the growing interest in unsupervised learning, extracting meaningful knowledge from unlabelled audio remains an open challenge. To take a step in this direction, we recently proposed a problem-agnostic speech encoder (PASE), which combines a convolutional encoder with multiple neural networks, called workers, tasked to solve self-supervised problems (i.e., ones that do not require manual annotations as ground truth). PASE was shown to capture relevant speech information, including speaker voice-print and phonemes. This paper proposes PASE+, an improved version of PASE for robust speech recognition in noisy and reverberant environments. To this end, we employ an online speech distortion module, that contaminates the input signals with a variety of random disturbances. We then propose a revised encoder that better learns short- and long-term speech dynamics with an efficient combination of recurrent and convolutional networks. Finally, we refine the set of workers used in self-supervision to encourage better cooperation. Results on TIMIT, DIRHA and CHiME-5 show that PASE+ significantly outperforms both the previous version of PASE as well as common acoustic features. Interestingly, PASE+ learns transferable representations suitable for highly mismatched acoustic conditions.
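The encoder-plus-workers pattern generalizes well beyond PASE+. The sketch below is a generic, hypothetical rendering of that pattern (layer sizes and worker names are invented), not the PASE+ architecture itself.

```python
import torch.nn as nn

class EncoderWithWorkers(nn.Module):
    """A shared convolutional encoder whose output feeds several small
    'worker' heads, each solving a different self-supervised task."""

    def __init__(self, dim=256, workers=("waveform", "mfcc", "prosody")):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(1, dim, kernel_size=10, stride=5), nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=3, stride=2), nn.ReLU(),
        )
        # One lightweight head per task; the names here are purely illustrative.
        self.workers = nn.ModuleDict(
            {name: nn.Conv1d(dim, dim, kernel_size=1) for name in workers}
        )

    def forward(self, wav):  # wav: (batch, 1, samples)
        h = self.encoder(wav)
        return h, {name: head(h) for name, head in self.workers.items()}
```

Training sums one loss per worker against its task-specific target; downstream, the workers are discarded and only the encoder is kept.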
Learning Speaker Representations with Mutual Information
Learning good representations is of crucial importance in deep learning. Mutual Information (MI) or similar measures of statistical dependence are promising tools for learning these representations in an unsupervised way. Even though the mutual information between two random variables is hard to measure directly in high dimensional spaces, some recent studies have shown that an implicit optimization of MI can be achieved with an encoder-discriminator architecture similar to that of Generative Adversarial Networks (GANs). In this work, we learn representations that capture speaker identities by maximizing the mutual information between the encoded representations of chunks of speech randomly sampled from the same sentence. The proposed encoder relies on the SincNet architecture and transforms raw speech waveform into a compact feature vector. The discriminator is fed by either positive samples (of the joint distribution of encoded chunks) or negative samples (from the product of the marginals) and is trained to separate them. We report experiments showing that this approach effectively learns useful speaker representations, leading to promising results on speaker identification and verification tasks. Our experiments consider both unsupervised and semi-supervised settings and compare the performance achieved with different objective functions.
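The encoder-discriminator training described above reduces to binary classification between joint samples and product-of-marginals samples. The sketch below assumes chunk pairs are pre-extracted and builds negatives by shuffling the batch; all names are mine, not the paper's code.

```python
import torch
import torch.nn.functional as F

def mi_discriminator_loss(encoder, discriminator, chunks_a, chunks_b):
    """chunks_a[i] and chunks_b[i] come from the same sentence (positives);
    rolling the batch pairs each chunk with a different sentence (negatives)."""
    z_a, z_b = encoder(chunks_a), encoder(chunks_b)                 # (batch, dim)
    pos = discriminator(torch.cat([z_a, z_b], dim=-1))              # joint samples
    neg = discriminator(torch.cat([z_a, z_b.roll(1, 0)], dim=-1))   # marginals
    return (F.binary_cross_entropy_with_logits(pos, torch.ones_like(pos))
            + F.binary_cross_entropy_with_logits(neg, torch.zeros_like(neg)))
```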
Speech Model Pre-training for End-to-End Spoken Language Understanding
Patrick Ignoto
Vikrant Singh Tomar
Whereas conventional spoken language understanding (SLU) systems map speech to text, and then text to intent, end-to-end SLU systems map speech directly to intent through a single trainable model. Achieving high accuracy with these end-to-end models without a large amount of training data is difficult. We propose a method to reduce the data requirements of end-to-end SLU in which the model is first pre-trained to predict words and phonemes, thus learning good features for SLU. We introduce a new SLU dataset, Fluent Speech Commands, and show that our method improves performance both when the full dataset is used for training and when only a small subset is used. We also describe preliminary experiments to gauge the model's ability to generalize to new phrases not heard during training.
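The pre-train-then-swap-the-head recipe can be summarized in a short sketch. Everything below (dimensions, pooling, phase comments) is an assumed simplification of the idea, not the paper's exact model.

```python
import torch.nn as nn

ENC_DIM = 256  # assumed encoder output size

class SLUModel(nn.Module):
    def __init__(self, encoder, n_targets):
        super().__init__()
        self.encoder = encoder             # any acoustic encoder (nn.Module)
        self.head = nn.Linear(ENC_DIM, n_targets)

    def forward(self, speech):             # speech: (batch, time, feats)
        return self.head(self.encoder(speech).mean(dim=1))  # pool over time

# Phase 1: pre-train with n_targets = word/phoneme vocabulary size.
# Phase 2: replace the head and fine-tune the whole model on intents:
#   model.head = nn.Linear(ENC_DIM, n_intents)
```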
The PyTorch-Kaldi Speech Recognition Toolkit
The availability of open-source software is playing a remarkable role in the popularization of speech recognition and deep learning. Kaldi, for instance, is nowadays an established framework used to develop state-of-the-art speech recognizers. PyTorch is used to build neural networks with the Python language and has recently spawned tremendous interest within the machine learning community thanks to its simplicity and flexibility. The PyTorch-Kaldi project aims to bridge the gap between these popular toolkits, trying to inherit the efficiency of Kaldi and the flexibility of PyTorch. PyTorch-Kaldi is not merely a simple interface between these tools: it embeds several useful features for developing modern speech recognizers. For instance, the code is specifically designed to naturally plug in user-defined acoustic models. As an alternative, users can exploit several pre-implemented neural networks that can be customized using intuitive configuration files. PyTorch-Kaldi supports multiple feature and label streams as well as combinations of neural networks, enabling the use of complex neural architectures. The toolkit is publicly released along with rich documentation and is designed to work properly both locally and on HPC clusters. Experiments conducted on several datasets and tasks show that PyTorch-Kaldi can effectively be used to develop modern state-of-the-art speech recognizers.
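"Plugging in a user-defined acoustic model" means, in practice, supplying an ordinary PyTorch module that maps frame-level features to state posteriors. The class below illustrates the kind of model involved; the actual registration happens through the toolkit's configuration files, so treat this as a standalone sketch with invented sizes.

```python
import torch.nn as nn

class MyAcousticModel(nn.Module):
    """Frame-level features in, log-posteriors over context-dependent
    HMM states out (all dimensions are illustrative assumptions)."""

    def __init__(self, feat_dim=40, hidden=1024, n_states=3456):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(), nn.Dropout(0.15),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(0.15),
            nn.Linear(hidden, n_states), nn.LogSoftmax(dim=-1),
        )

    def forward(self, feats):  # feats: (batch, feat_dim), one row per frame
        return self.net(feats)
```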
Speaker Recognition from Raw Waveform with SincNet
Deep learning is progressively gaining popularity as a viable alternative to i-vectors for speaker recognition. Promising results have recently been obtained with Convolutional Neural Networks (CNNs) fed directly by raw speech samples. Rather than employing standard hand-crafted features, such CNNs learn low-level speech representations from waveforms, potentially allowing the network to better capture important narrow-band speaker characteristics such as pitch and formants. Proper design of the neural network is crucial to achieve this goal. This paper proposes a novel CNN architecture, called SincNet, that encourages the first convolutional layer to discover more meaningful filters. SincNet is based on parametrized sinc functions, which implement band-pass filters. In contrast to standard CNNs, which learn all elements of each filter, SincNet directly learns only the low and high cutoff frequencies from data. This offers a very compact and efficient way to derive a customized filter bank specifically tuned for the desired application. Our experiments, conducted on both speaker identification and speaker verification tasks, show that the proposed architecture converges faster and performs better than a standard CNN on raw waveforms.
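Because only the two cutoff frequencies are learned, every other filter tap can be derived analytically. The helper below builds one SincNet-style band-pass kernel from its cutoffs, following the sinc-difference construction the abstract describes; the kernel length, window choice, and normalization are assumptions.

```python
import torch

def sinc_bandpass_kernel(f1, f2, kernel_size=251, sample_rate=16000):
    """One band-pass FIR filter from two cutoff frequencies f1 < f2 (Hz):
    the difference of two ideal low-pass (sinc) filters, Hamming-windowed."""
    t = torch.arange(-(kernel_size // 2), kernel_size // 2 + 1) / sample_rate

    def lowpass(fc):  # ideal low-pass impulse response with cutoff fc
        return 2 * fc * torch.sinc(2 * fc * t)

    band = lowpass(float(f2)) - lowpass(float(f1))
    band = band * torch.hamming_window(kernel_size, periodic=False)
    return band / band.abs().max()  # peak-normalize

kernel = sinc_bandpass_kernel(300.0, 3400.0)  # e.g., a telephone-band filter
```

In SincNet proper, f1 and f2 would be trainable parameters and this construction would run inside the first convolutional layer on every forward pass.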