Mirco Ravanelli

Associate Academic Member
Assistant Professor, Concordia University, Gina Cody School of Engineering and Computer Science
Adjunct Professor, Université de Montréal, Department of Computer Science and Operations Research
Research Topics
Deep Learning

Biography

Mirco Ravanelli is an Assistant Professor at Concordia University, an Adjunct Professor at Université de Montréal, and an Associate Member of Mila – Quebec Artificial Intelligence Institute. A recipient of the 2022 Amazon Research Award, he is an expert in deep learning and conversational AI and has published more than 60 papers in these fields. His research focuses primarily on novel deep learning algorithms, including self-supervised, continual, multimodal, cooperative, and energy-efficient learning. He completed his postdoctoral research at Mila under the supervision of Professor Yoshua Bengio. He is the founder and lead developer of SpeechBrain, one of the most widely adopted open-source toolkits for speech processing and conversational AI.

Current Students

Research Master's - Concordia
Research Collaborator - Concordia University
Research Collaborator - Concordia University
Research Master's - Concordia
PhD - Concordia
Co-supervisor:
Research Master's - Concordia
Co-supervisor:
Research Master's - Concordia
PhD - Concordia
Co-supervisor:
PhD - Concordia
Research Collaborator - International School for Advanced Studies (Trieste, Italy)
Research Collaborator - Concordia University
Research Collaborator - Concordia University
Alumni Collaborator - UdeM
Principal supervisor:
PhD - UdeM
Co-supervisor:
PhD - Concordia
Co-supervisor:
Postdoctoral Fellow - McGill
PhD - UdeM
Research Collaborator - Concordia University

Publications

Explaining Network Decision Provides Insights on the Causal Interaction Between Brain Regions in a Motor Imagery Task
Davide Borra
Multi-modal Decoding of Reach-to-Grasping from EEG and EMG via Neural Networks
Davide Borra
Matteo Fraternali
Elisa Magosso
LMAC-TD: Producing Time Domain Explanations for Audio Classifiers
Eleonora Mancini
Francesco Paissan
Audio Editing with Non-Rigid Text Prompts
Francesco Paissan
Zhepei Wang
Paris Smaragdis
In this paper, we explore audio editing with non-rigid text edits. We show that the proposed editing pipeline is able to create audio edits that remain faithful to the input audio. We explore text prompts that perform addition, style transfer, and in-painting. We quantitatively and qualitatively show that the edits are able to obtain results which outperform Audio-LDM, a recently released text-prompted audio generation model. Qualitative inspection of the results points out that the edits given by our approach remain more faithful to the input audio in terms of keeping the original onsets and offsets of the audio events.
Progres: Prompted Generative Rescoring on ASR N-Best
Ada Defne Tur
Adel Moumen
Large Language Models (LLMs) have shown their ability to improve the performance of speech recognizers by effectively rescoring the n-best hypotheses generated during the beam search process. However, the best way to exploit recent generative instruction-tuned LLMs for hypothesis rescoring is still unclear. This paper proposes a novel method that uses instruction-tuned LLMs to dynamically expand the n-best speech recognition hypotheses with new hypotheses generated through appropriately-prompted LLMs. Specifically, we introduce a new zero-shot method for ASR n-best rescoring, which combines confidence scores, LLM sequence scoring, and prompt-based hypothesis generation. We compare Llama-3-Instruct, GPT-3.5 Turbo, and GPT-4 Turbo as prompt-based generators with Llama-3 as sequence scorer LLM. We evaluated our approach using different speech recognizers and observed significant relative improvement in the word error rate (WER) ranging from 5% to 25%.
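
A minimal illustrative sketch of the score-combination idea behind this kind of n-best rescoring, assuming a Hugging Face causal LM as the sequence scorer; the model name, weights, and helper functions below are stand-ins (not the authors' implementation), and the prompt-based hypothesis-generation step is omitted:

# Hypothetical sketch of LLM-based n-best rescoring (not the paper's code).
# Each ASR hypothesis gets a combined score: the recognizer's confidence plus
# a length-normalized log-likelihood from a causal language model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; the paper uses Llama-3 as the sequence scorer
tokenizer = AutoTokenizer.from_pretrained(model_name)
lm = AutoModelForCausalLM.from_pretrained(model_name).eval()

def llm_log_likelihood(text: str) -> float:
    """Length-normalized log-likelihood of a hypothesis under the LM."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = lm(ids, labels=ids)
    return -out.loss.item()  # average log-probability per token

def rescore(nbest, alpha=0.5):
    """nbest: list of (hypothesis, asr_confidence). Returns re-ranked list."""
    scored = [(hyp, alpha * conf + (1 - alpha) * llm_log_likelihood(hyp))
              for hyp, conf in nbest]
    return sorted(scored, key=lambda x: x[1], reverse=True)

nbest = [("i want to recognize speech", -0.4),
         ("i want to wreck a nice beach", -0.6)]
print(rescore(nbest)[0][0])  # best hypothesis after rescoring

In practice the interpolation weight alpha would be tuned on a development set, and the expanded hypotheses produced by the prompted generator would be added to the list before rescoring.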
Listenable Maps for Audio Classifiers
Francesco Paissan
Open-Source Conversational AI with SpeechBrain 1.0
Titouan Parcollet
Adel Moumen
Sylvain de Langen
Peter William VanHarn Plantinga
Yingzhi Wang
Pooneh Mousavi
Luca Della Libera
Artem Ploujnikov
Francesco Paissan
Davide Borra
Salah Zaiem
Zeyu Zhao
Shucong Zhang
Georgios Karakasidis
Sung-Lin Yeh
Pierre Champion
Aku Rouhe
Rudolf Braun
Florian Mai
Juan Pablo Zuluaga
Seyed Mahed Mousavi
Andreas Nautsch
Xuechen Liu
Sangeet Sagar
Jarod Duret
Salima Mdhaffar
G. Laperriere
Renato De Mori
Yannick Estève
SpeechBrain is an open-source Conversational AI toolkit based on PyTorch, focused particularly on speech processing tasks such as speech recognition, speech enhancement, speaker recognition, text-to-speech, and much more. It promotes transparency and replicability by releasing both the pre-trained models and the complete "recipes" of code and algorithms required for training them. This paper presents SpeechBrain 1.0, a significant milestone in the evolution of the toolkit, which now has over 200 recipes for speech, audio, and language processing tasks, and more than 100 models available on Hugging Face. SpeechBrain 1.0 introduces new technologies to support diverse learning modalities, Large Language Model (LLM) integration, and advanced decoding strategies, along with novel models, tasks, and modalities. It also includes a new benchmark repository, offering researchers a unified platform for evaluating models across diverse tasks.
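
As a usage illustration, SpeechBrain exposes its pretrained models through simple inference interfaces; the sketch below assumes SpeechBrain 1.0 (where pretrained interfaces live under speechbrain.inference) and one of the public ASR checkpoints hosted on Hugging Face, and is a minimal example rather than a full recipe:

# Minimal sketch: transcribe a local audio file with a pretrained SpeechBrain model.
from speechbrain.inference.ASR import EncoderDecoderASR

asr = EncoderDecoderASR.from_hparams(
    source="speechbrain/asr-crdnn-rnnlm-librispeech",   # any public ASR model card works similarly
    savedir="pretrained_models/asr-crdnn-rnnlm-librispeech",
)
print(asr.transcribe_file("example.wav"))  # path to a local 16 kHz audio file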
DASB -- Discrete Audio and Speech Benchmark
Pooneh Mousavi
Luca Della Libera
Jarod Duret
Artem Ploujnikov
Discrete audio tokens have recently gained considerable attention for their potential to connect audio and language processing, enabling the creation of modern multimodal large language models. Ideal audio tokens must effectively preserve phonetic and semantic content along with paralinguistic information, speaker identity, and other details. While several types of audio tokens have been recently proposed, identifying the optimal tokenizer for various tasks is challenging due to the inconsistent evaluation settings in existing studies. To address this gap, we release the Discrete Audio and Speech Benchmark (DASB), a comprehensive leaderboard for benchmarking discrete audio tokens across a wide range of discriminative tasks, including speech recognition, speaker identification and verification, emotion recognition, keyword spotting, and intent classification, as well as generative tasks such as speech enhancement, separation, and text-to-speech. Our results show that, on average, semantic tokens outperform compression tokens across most discriminative and generative tasks. However, the performance gap between semantic tokens and standard continuous representations remains substantial, highlighting the need for further research in this field.
How Should We Extract Discrete Audio Tokens from Self-Supervised Models?
Pooneh Mousavi
Jarod Duret
Salah Zaiem
Luca Della Libera
Artem Ploujnikov
Discrete audio tokens have recently gained attention for their potential to bridge the gap between audio and language processing. Ideal audio tokens must preserve content, paralinguistic elements, speaker identity, and many other audio details. Current audio tokenization methods fall into two categories: Semantic tokens, acquired through quantization of Self-Supervised Learning (SSL) models, and Neural compression-based tokens (codecs). Although previous studies have benchmarked codec models to identify optimal configurations, the ideal setup for quantizing pretrained SSL models remains unclear. This paper explores the optimal configuration of semantic tokens across discriminative and generative tasks. We propose a scalable solution to train a universal vocoder across multiple SSL layers. Furthermore, an attention mechanism is employed to identify task-specific influential layers, enhancing the adaptability and performance of semantic tokens in diverse audio applications.
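
A hedged sketch of the general "semantic token" recipe this line of work studies, i.e., quantizing hidden states of a pretrained self-supervised speech encoder with k-means; the encoder, layer index, and codebook size below are arbitrary placeholders, not the configurations explored in the paper:

# Illustrative only: discretize SSL features into token IDs with k-means.
import torch
from sklearn.cluster import KMeans
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

model_name = "facebook/wav2vec2-base"  # stand-in SSL encoder
extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name)
ssl = Wav2Vec2Model.from_pretrained(model_name).eval()

waveform = torch.randn(16000)  # 1 s of dummy 16 kHz audio; replace with real speech
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    hidden = ssl(inputs.input_values, output_hidden_states=True).hidden_states

layer = 6                                   # which layer matters is task-dependent
frames = hidden[layer].squeeze(0).numpy()   # (num_frames, feature_dim)

# In practice the codebook is fit on a large corpus; here it is fit on one clip.
tokens = KMeans(n_clusters=16, n_init=10).fit_predict(frames)
print(tokens[:20])  # discrete token IDs, one per ~20 ms frame

The choice of SSL layer and codebook size strongly affects downstream performance, which is precisely the configuration question the paper investigates.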