Mirco Ravanelli

Associate Academic Member

Assistant Professor, Concordia University, Gina Cody School of Engineering and Computer Science

Adjunct Professor, Université de Montréal, Department of Computer Science and Operations Research

Research Topics

Deep Learning

Biography

Mirco Ravanelli is an assistant professor at Concordia University, an adjunct professor at Université de Montréal, and an associate member of Mila (Quebec Artificial Intelligence Institute). A recipient of a 2022 Amazon Research Award, he is an expert in deep learning and conversational AI and has published more than 60 papers in these fields. His research focuses primarily on novel deep learning algorithms, including self-supervised, continual, multimodal, cooperative, and energy-efficient learning. He completed his postdoctoral work at Mila under the supervision of Professor Yoshua Bengio. He is also the founder and leader of SpeechBrain, one of the most widely adopted open-source toolkits for speech processing and conversational AI.

Current Students

Undergraduate - Concordia

Dehestani Amirali

Research Intern - Concordia University

Research Master's - Concordia University

Research Master's - Concordia

Matthieu Cervera

Principal supervisor:

Research Master's - Concordia

Luca Della Libera

PhD - Concordia

Co-supervisor:

Research Master's - Concordia

Co-supervisor:

Gianfranco Dumoulin Bertucci

Research Master's - Concordia

Nadine El-Mufti

Research Master's - Concordia

Maab Elrashid Ahmed Mohamed

PhD - Concordia

Co-supervisor:

Bonzi Francesco

PhD - Concordia

Salman Sami Hussain Ali

Research Master's - Concordia University

PhD - Université Laval

Principal supervisor:

Professional Master's - Concordia University

Eleonora Mancini

Alumni collaborator - UdeM

Principal supervisor:

Research collaborator - University of Toulon

Principal supervisor:

Pablo Piantanida

PhD - UdeM

Co-supervisor:

PhD - Concordia

PhD - Concordia

Co-supervisor:

Laurent Charlin

Francesco Paissan

PhD - Université Laval

Principal supervisor:

Postdoctorate - McGill

Artem Ploujnikov

PhD - UdeM

Research Master's - Concordia

Benjamin Van Niekerk

Postdoctorate - Concordia

Blog Posts

January 23, 2026

FocalCodec: Giving LLMs Hearing and Speech at Ultra-Low Bitrates

by

Luca Della Libera

Francesco Paissan

Mirco Ravanelli

June 13, 2024

SpeechBrain 1.0: Making Conversational AI Accessible to Everyone

by

Mirco Ravanelli

April 28, 2021

Introducing SpeechBrain: A General-Purpose PyTorch Speech Processing Toolkit

by

Mirco Ravanelli

Loren Lugosch

Publications

Progres: Prompted Generative Rescoring on ASR N-Best

Ada Defne Tur

Adel Moumen

Mirco Ravanelli

Large Language Models (LLMs) have shown their ability to improve the performance of speech recognizers by effectively rescoring the n-best hypotheses generated during the beam search process. However, the best way to exploit recent generative instruction-tuned LLMs for hypothesis rescoring is still unclear. This paper proposes a novel method that uses instruction-tuned LLMs to dynamically expand the n-best speech recognition hypotheses with new hypotheses generated through appropriately-prompted LLMs. Specifically, we introduce a new zero-shot method for ASR n-best rescoring, which combines confidence scores, LLM sequence scoring, and prompt-based hypothesis generation. We compare Llama-3-Instruct, GPT-3.5 Turbo, and GPT-4 Turbo as prompt-based generators with Llama-3 as sequence scorer LLM. We evaluated our approach using different speech recognizers and observed significant relative improvement in the word error rate (WER) ranging from 5% to 25%.

2024-08-30

arXiv (preprint)

Listenable Maps for Audio Classifiers

Francesco Paissan

Mirco Ravanelli

2024-07-08

Proceedings of the 41st International Conference on Machine Learning (published)

Open-Source Conversational AI with SpeechBrain 1.0

Mirco Ravanelli

Titouan Parcollet

Adel Moumen

Sylvain de Langen

Peter Plantinga

Yingzhi Wang

Luca Della Libera

Artem Ploujnikov

Francesco Paissan

Zeyu Zhao

Shucong Zhang

Georgios Karakasidis

Pierre Champion

Aku Rouhe

Rudolf Braun

Florian Mai

Juan Pablo Zuluaga

Seyed Mahed Mousavi

Andreas Nautsch

Xuechen Liu

Sangeet Sagar

Jarod Duret

Salima Mdhaffar

G. Laperriere

Yannick Estève

SpeechBrain is an open-source Conversational AI toolkit based on PyTorch, focused particularly on speech processing tasks such as speech recognition, speech enhancement, speaker recognition, text-to-speech, and much more. It promotes transparency and replicability by releasing both the pre-trained models and the complete "recipes" of code and algorithms required for training them. This paper presents SpeechBrain 1.0, a significant milestone in the evolution of the toolkit, which now has over 200 recipes for speech, audio, and language processing tasks, and more than 100 models available on Hugging Face. SpeechBrain 1.0 introduces new technologies to support diverse learning modalities, Large Language Model (LLM) integration, and advanced decoding strategies, along with novel models, tasks, and modalities. It also includes a new benchmark repository, offering researchers a unified platform for evaluating models across diverse tasks.

2024-06-29

arXiv (preprint)

DASB -- Discrete Audio and Speech Benchmark

Luca Della Libera

Jarod Duret

Artem Ploujnikov

Mirco Ravanelli

Discrete audio tokens have recently gained considerable attention for their potential to connect audio and language processing, enabling the creation of modern multimodal large language models. Ideal audio tokens must effectively preserve phonetic and semantic content along with paralinguistic information, speaker identity, and other details. While several types of audio tokens have been recently proposed, identifying the optimal tokenizer for various tasks is challenging due to the inconsistent evaluation settings in existing studies. To address this gap, we release the Discrete Audio and Speech Benchmark (DASB), a comprehensive leaderboard for benchmarking discrete audio tokens across a wide range of discriminative tasks, including speech recognition, speaker identification and verification, emotion recognition, keyword spotting, and intent classification, as well as generative tasks such as speech enhancement, separation, and text-to-speech. Our results show that, on average, semantic tokens outperform compression tokens across most discriminative and generative tasks. However, the performance gap between semantic tokens and standard continuous representations remains substantial, highlighting the need for further research in this field.

2024-06-20

arXiv (preprint)

How Should We Extract Discrete Audio Tokens from Self-Supervised Models?

Jarod Duret

Luca Della Libera

Artem Ploujnikov

Mirco Ravanelli

Discrete audio tokens have recently gained attention for their potential to bridge the gap between audio and language processing. Ideal audio tokens must preserve content, paralinguistic elements, speaker identity, and many other audio details. Current audio tokenization methods fall into two categories: Semantic tokens, acquired through quantization of Self-Supervised Learning (SSL) models, and Neural compression-based tokens (codecs). Although previous studies have benchmarked codec models to identify optimal configurations, the ideal setup for quantizing pretrained SSL models remains unclear. This paper explores the optimal configuration of semantic tokens across discriminative and generative tasks. We propose a scalable solution to train a universal vocoder across multiple SSL layers. Furthermore, an attention mechanism is employed to identify task-specific influential layers, enhancing the adaptability and performance of semantic tokens in diverse audio applications.

2024-06-15

arXiv (preprint)

Phoneme Discretized Saliency Maps for Explainable Detection of AI-Generated Voice

Mirco Ravanelli

Pascal Germain

2024-06-14

arXiv (preprint)

Focal Modulation Networks for Interpretable Sound Classification

Luca Della Libera

Mirco Ravanelli

The increasing success of deep neural networks has raised concerns about their inherent black-box nature, posing challenges related to interpretability and trust. While there has been extensive exploration of interpretation techniques in vision and language, interpretability in the audio domain has received limited attention, primarily focusing on post-hoc explanations. This paper addresses the problem of interpretability by-design in the audio domain by utilizing the recently proposed attention-free focal modulation networks (FocalNets). We apply FocalNets to the task of environmental sound classification for the first time and evaluate their interpretability properties on the popular ESC-50 dataset. Our method outperforms a similarly sized vision transformer both in terms of accuracy and interpretability. Furthermore, it is competitive against PIQ, a method specifically designed for post-hoc interpretation in the audio domain.

2024-04-14

2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW) (published)

Resource-Efficient Separation Transformer

Luca Della Libera

Mirco Ravanelli

Samuele Cornell

Frédéric Lepoutre

François Grondin

Transformers have recently achieved state-of-the-art performance in speech separation. These models, however, are computationally demanding and require a lot of learnable parameters. This paper explores Transformer-based speech separation with a reduced computational cost. Our main contribution is the development of the Resource-Efficient Separation Transformer (RE-SepFormer), a self-attention-based architecture that reduces the computational burden in two ways. First, it uses non-overlapping blocks in the latent space. Second, it operates on compact latent summaries calculated from each chunk. The RE-SepFormer reaches a competitive performance on the popular WSJ0-2Mix and WHAM! datasets in both causal and non-causal settings. Remarkably, it scales significantly better than the previous Transformer-based architectures in terms of memory and inference time, making it more suitable for processing long mixtures.

2024-04-14

ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (published)

SKILL: Similarity-aware Knowledge distILLation for Speech Self-Supervised Learning

Luca Zampierin

Ghouthi Boukli hacene

Bac Nguyen

Mirco Ravanelli

2024-04-14

2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW) (published)

Bayesian Deep Learning for Remaining Useful Life Estimation via Stein Variational Gradient Descent

Luca Della Libera

Jacopo Andreoli

Davide Dalle Pezze

Mirco Ravanelli

Gian Antonio Susto

A crucial task in predictive maintenance is estimating the remaining useful life of physical systems. In the last decade, deep learning has improved considerably upon traditional model-based and statistical approaches in terms of predictive performance. However, in order to optimally plan maintenance operations, it is also important to quantify the uncertainty inherent to the predictions. This issue can be addressed by turning standard frequentist neural networks into Bayesian neural networks, which are naturally capable of providing confidence intervals around the estimates. Several methods exist for training those models. Researchers have focused mostly on parametric variational inference and sampling-based techniques, which notoriously suffer from limited approximation power and large computational burden, respectively. In this work, we use Stein variational gradient descent, a recently proposed algorithm for approximating intractable distributions that overcomes the drawbacks of the aforementioned techniques. In particular, we show through experimental studies on simulated run-to-failure turbofan engine degradation data that Bayesian deep learning models trained via Stein variational gradient descent consistently outperform with respect to convergence speed and predictive performance both the same models trained via parametric variational inference and their frequentist counterparts trained via backpropagation. Furthermore, we propose a method to enhance performance based on the uncertainty information provided by the Bayesian models. We release the source code at https://github.com/lucadellalib/bdl-rul-svgd.

2024-02-02

arXiv (preprint)
