Mirco Ravanelli

Biographie

Mirco Ravanelli est professeur adjoint à l'Université Concordia, professeur associé à l'Université de Montréal et membre associé de Mila – Institut québécois d’intelligence artificielle. Lauréat du prix Amazon Research 2022, il est expert en apprentissage profond et en IA conversationnelle, et a publié plus de 60 articles dans ces domaines. Il se concentre principalement sur les nouveaux algorithmes d'apprentissage profond, y compris l'apprentissage autosupervisé, continu, multimodal, coopératif et économe en énergie. Mirco Ravanelli a effectué son postdoctorat à Mila, sous la direction du professeur Yoshua Bengio. Il est notamment le fondateur et le chef de file de SpeechBrain, l'une des boîtes à outils en code source ouvert les plus largement adoptées dans le domaine du traitement de la parole et de l'IA conversationnelle.

Étudiants actuels

Hiba Akhaddar

Maîtrise recherche - Concordia

Allan Allan

Baccalauréat - Concordia

Dehestani Amirali

Stagiaire de recherche - Concordia University

Seina Assadian

Collaborateur·rice de recherche - Concordia University

Cordelle Briac

Maîtrise recherche - Concordia University

Leo Brodeur

Stagiaire de recherche - Concordia

Matthieu Cervera

Superviseur⋅e principal⋅e :

Victor Cruz

Maîtrise recherche - Concordia

Luca Della Libera

Doctorat - Concordia

Co-superviseur⋅e :

Wagner Drew

Maîtrise recherche - Concordia

Co-superviseur⋅e :

Irina Rish

Gianfranco Dumoulin Bertucci

Maîtrise recherche - Concordia

Nadine El-Mufti

Maîtrise recherche - Concordia

Maab Elrashid Ahmed Mohamed

Doctorat - Concordia

Co-superviseur⋅e :

Doctorat - Concordia

Doctorat - Université Laval

Superviseur⋅e principal⋅e :

Haoyu Li

Maîtrise professionnelle - Concordia Univesity

Eleonora Mancini

Collaborateur·rice alumni - UdeM

Superviseur⋅e principal⋅e :

Collaborateur·rice de recherche - University of Toulon

Superviseur⋅e principal⋅e :

Doctorat - UdeM

Co-superviseur⋅e :

Doctorat - Concordia

Doctorat - Concordia

Co-superviseur⋅e :

Peter Peter

Postdoctorat - McGill

Doctorat - UdeM

Maîtrise recherche - Concordia

SpeechBrain 1.0 : rendre l’IA conversationnelle accessible à tout le monde

Tianmu Yang Tianmu

Collaborateur·rice de recherche - Concordia University

Francesco Verdini

Stagiaire de recherche - Sapienza University of Rome

Billets de blogue

13 juin 2024

par

Mirco Ravanelli

Lire l'article

Introducing SpeechBrain: A general-purpose PyTorch speech processing toolkit

28 avril 2021

Voici SpeechBrain : Une boîte à outils polyvalente de traitement de la parole basée sur PyTorch

par

Mirco Ravanelli

Loren Lugosch

Lire l'article

Publications

Comparison of Speech Tasks in Human Expert and Machine Detection of Parkinson's Disease

Peter William VanHarn Plantinga

Roozbeh Sattari

Karine Marcotte

Carla Di Gironimo

Madeleine Sharp

Liziane Bouvier

Maiya Geddes

Ingrid Verduyckt

'Etienne de Villers-Sidani

Denise Klein

The speech of people with Parkinson's Disease (PD) has been shown to hold important clues about the presence and progression of the disease.… (voir plus) We investigate the factors based on which humans experts make judgments of the presence of disease in speech samples over five different speech tasks: phonations, sentence repetition, reading, recall, and picture description. We make comparisons by conducting listening tests to determine clinicians accuracy at recognizing signs of PD from audio alone, and we conduct experiments with a machine learning system for detection based on Whisper. Across tasks, Whisper performs on par or better than human experts when only audio is available, especially on challenging but important subgroups of the data: younger patients, mild cases, and female patients. Whisper's ability to recognize acoustic cues in difficult cases complements the multimodal and contextual strengths of human experts.

2025-10-08

ArXiv (prépublication)

Comparison of Speech Tasks in Human Expert and Machine Detection of Parkinson's Disease

Peter Plantinga

Roozbeh Sattari

Karine Marcotte

Carla Di Gironimo

Madeleine Sharp

Liziane Bouvier

Maiya Geddes

Ingrid Verduyckt

'Etienne de Villers-Sidani

Denise Klein

2025-10-08

ArXiv (prépublication)

Investigating Faithfulness in Large Audio Language Models

Lovenya Jain

Faithfulness measures whether chain-of-thought (CoT) representations accurately reflect a model's decision process and can be used as reliab… (voir plus)le explanations. Prior work has shown that CoTs from text-based LLMs are often unfaithful. This question has not been explored for large audio-language models (LALMs), where faithfulness is critical for safety-sensitive applications. Reasoning in LALMs is also more challenging, as models must first extract relevant clues from audio before reasoning over them. In this paper, we investigate the faithfulness of CoTs produced by several LALMs by applying targeted interventions, including paraphrasing, filler token injection, early answering, and introducing mistakes, on two challenging reasoning datasets: SAKURA and MMAR. After going through the aforementioned interventions across several datasets and tasks, our experiments suggest that, LALMs generally produce CoTs that appear to be faithful to their underlying decision processes.

2025-09-26

ArXiv (prépublication)

Investigating Faithfulness in Large Audio Language Models

Lovenya Jain

2025-09-26

ArXiv (prépublication)

Virtual Consistency for Audio Editing

Free-form, text-based audio editing remains a persistent challenge, despite progress in inversion-based neural methods. Current approaches r… (voir plus)ely on slow inversion procedures, limiting their practicality. We present a virtual-consistency based audio editing system that bypasses inversion by adapting the sampling process of diffusion models. Our pipeline is model-agnostic, requiring no fine-tuning or architectural changes, and achieves substantial speed-ups over recent neural editing baselines. Crucially, it achieves this efficiency without compromising quality, as demonstrated by quantitative benchmarks and a user study involving 16 participants.

2025-09-21

ArXiv (prépublication)

Virtual Consistency for Audio Editing

2025-09-21

ArXiv (prépublication)

FocalCodec-Stream: Streaming Low-Bitrate Speech Coding via Causal Distillation

Luca Della Libera

Neural audio codecs are a fundamental component of modern generative audio pipelines. Although recent codecs achieve strong low-bitrate reco… (voir plus)nstruction and provide powerful representations for downstream tasks, most are non-streamable, limiting their use in real-time applications. We present FocalCodec-Stream, a hybrid codec based on focal modulation that compresses speech into a single binary codebook at 0.55 - 0.80 kbps with a theoretical latency of 80 ms. Our approach combines multi-stage causal distillation of WavLM with targeted architectural improvements, including a lightweight refiner module that enhances quality under latency constraints. Experiments show that FocalCodec-Stream outperforms existing streamable codecs at comparable bitrates, while preserving both semantic and acoustic information. The result is a favorable trade-off between reconstruction quality, downstream task performance, latency, and efficiency. Code and checkpoints will be released at https://github.com/lucadellalib/focalcodec.

2025-09-01

arXiv (publié)

Autoregressive Speech Enhancement via Acoustic Tokens

Luca Della Libera

2025-07-01

arXiv (publié)

From Black Box to Biomarker: Sparse Autoencoders for Interpreting Speech Models of Parkinson's Disease

Peter Plantinga

Jen-Kai Chen

Roozbeh Sattari

Denise Klein

Speech holds promise as a cost-effective and non-invasive biomarker for neurological conditions such as Parkinson's disease (PD). While deep… (voir plus) learning systems trained on raw audio can find subtle signals not available from hand-crafted features, their black-box nature hinders clinical adoption. To address this, we apply sparse autoencoders (SAEs) to uncover interpretable internal representations from a speech-based PD detection system. We introduce a novel mask-based activation for adapting SAEs to small biomedical datasets, creating sparse disentangled dictionary representations. These dictionary entries are found to have strong associations with characteristic articulatory deficits in PD speech, such as reduced spectral flux and increased spectral flatness in the low-energy regions highlighted by the model attention. We further show that the spectral flux is related to volumetric measurements of the putamen from MRI scans, demonstrating the potential of SAEs to reveal clinically relevant biomarkers for disease monitoring and diagnosis.

2025-07-01

arXiv (publié)

Discrete Audio Tokens: More Than a Survey!

Gallil Maimon

Adel Moumen

Darius Petermann

Jiatong Shi

Haibin Wu

Haici Yang

Anastasia Kuznetsova

Artem Ploujnikov

Ricard Marxer

Bhuvana Ramabhadran

Benjamin Elizalde

Loren Lugosch

Jinyu Li

Phil Woodland

Minje Kim

Hung-yi Lee

Shinji Watanabe

Yossi Adi … (voir 1 de plus)

Discrete audio tokens are compact representations that aim to preserve perceptual quality, phonetic content, and speaker characteristics whi… (voir plus)le enabling efficient storage and inference, as well as competitive performance across diverse downstream tasks.They provide a practical alternative to continuous features, enabling the integration of speech and audio into modern large language models (LLMs). As interest in token-based audio processing grows, various tokenization methods have emerged, and several surveys have reviewed the latest progress in the field. However, existing studies often focus on specific domains or tasks and lack a unified comparison across various benchmarks. This paper presents a systematic review and benchmark of discrete audio tokenizers, covering three domains: speech, music, and general audio. We propose a taxonomy of tokenization approaches based on encoder-decoder, quantization techniques, training paradigm, streamability, and application domains. We evaluate tokenizers on multiple benchmarks for reconstruction, downstream performance, and acoustic language modeling, and analyze trade-offs through controlled ablation studies. Our findings highlight key limitations, practical considerations, and open challenges, providing insight and guidance for future research in this rapidly evolving area. For more information, including our main results and tokenizer database, please refer to our website: https://poonehmousavi.github.io/dates-website/.

2025-06-12

ArXiv (prépublication)

Discrete Audio Tokens: More Than a Survey!

Gallil Maimon

Adel Moumen

Darius Petermann

Jiatong Shi

Haibin Wu

Haici Yang

Anastasia Kuznetsova

Artem Ploujnikov

Ricard Marxer

Bhuvana Ramabhadran

Benjamin Elizalde

Loren Lugosch

Jinyu Li

Phil Woodland

Minje Kim

Hung-yi Lee

Shinji Watanabe

Yossi Adi … (voir 1 de plus)

Discrete audio tokens are compact representations that aim to preserve perceptual quality, phonetic content, and speaker characteristics whi… (voir plus)le enabling efficient storage and inference, as well as competitive performance across diverse downstream tasks. They provide a practical alternative to continuous features, enabling the integration of speech and audio into modern large language models (LLMs). As interest in token-based audio processing grows, various tokenization methods have emerged, and several surveys have reviewed the latest progress in the field. However, existing studies often focus on specific domains or tasks and lack a unified comparison across various benchmarks. This paper presents a systematic review and benchmark of discrete audio tokenizers, covering three domains: speech, music, and general audio. We propose a taxonomy of tokenization approaches based on encoder-decoder, quantization techniques, training paradigm, streamability, and application domains. We evaluate tokenizers on multiple benchmarks for reconstruction, downstream performance, and acoustic language modeling, and analyze trade-offs through controlled ablation studies. Our findings highlight key limitations, practical considerations, and open challenges, providing insight and guidance for future research in this rapidly evolving area. For more information, including our main results and tokenizer database, please refer to our website: https://poonehmousavi.github.io/dates-website/.

2025-06-12

ArXiv (prépublication)

LiSTEN: Learning Soft Token Embeddings for Neural Audio LLMs

Foundation models based on large language models (LLMs) have shown great success in handling various tasks and modalities. However, adapting… (voir plus) these models for general-purpose audio-language tasks is challenging due to differences in acoustic environments and task variations. In this work, we introduce LiSTEN Learning Soft Token Embeddings for Neural Audio LLMs), a framework for adapting LLMs to speech and audio tasks. LiSTEN uses a dynamic prompt selection strategy with learnable key-value pairs, allowing the model to balance general and task-specific knowledge while avoiding overfitting in a multitask setting. Our approach reduces dependence on large-scale ASR or captioning datasets, achieves competitive performance with fewer trainable parameters, and simplifies training by using a single-stage process. Additionally, LiSTEN enhances interpretability by analyzing the diversity and overlap of selected prompts across different tasks.

2025-05-24

ArXiv (prépublication)