Mirco Ravanelli

Biography

Mirco Ravanelli is an assistant professor at Concordia University, adjunct professor at Université de Montréal and associate member of Mila – Quebec Artificial Intelligence Institute.

Ravanelli is an expert in deep learning and conversational AI, publishing over sixty papers in these fields. His contributions were honoured with a 2022 Amazon Research Award.

His research focuses primarily on novel deep learning algorithms, including self-supervised, continual, multimodal, cooperative and energy-efficient learning.

Formerly a postdoctoral fellow at Mila under Yoshua Bengio, he founded and now leads SpeechBrain, one of the most extensively used open-source toolkits in the field of speech processing and conversational AI.

Current Students

Hiba Akhaddar

Master's Research - Concordia University

Allan Allan

Undergraduate - Concordia University

Dehestani Amirali

Research Intern - Concordia University University

Seina Assadian

Collaborating researcher - Concordia University University

Cordelle Briac

Master's Research - Concordia University University

Leo Brodeur

Research Intern - Concordia University

Matthieu Cervera

Principal supervisor :

Victor Cruz

Master's Research - Concordia University

Luca Della Libera

PhD - Concordia University

Co-supervisor :

Wagner Drew

Master's Research - Concordia University

Co-supervisor :

Irina Rish

Gianfranco Dumoulin Bertucci

Master's Research - Concordia University

Nadine El-Mufti

Master's Research - Concordia University

Maab Elrashid Ahmed Mohamed

PhD - Concordia University

Co-supervisor :

Bonzi Francesco

PhD - Concordia University

Jihoon Jeong

PhD - Université Laval

Principal supervisor :

Haoyu Li

Professional Master's - Concordia University Univesity

Eleonora Mancini

Collaborating Alumni - Université de Montréal

Principal supervisor :

Collaborating researcher - University of Toulon

Principal supervisor :

PhD - Université de Montréal

Co-supervisor :

PhD - Concordia University

PhD - Concordia University

Co-supervisor :

Peter Peter

Postdoctorate - McGill University

PhD - Université de Montréal

Arpnik Singh

Master's Research - Concordia University

SpeechBrain 1.0: Making Conversational AI Accessible to Everyone

Tianmu Yang Tianmu

Collaborating researcher - Concordia University University

Francesco Verdini

Research Intern - Sapienza University of Rome

Blog Posts

June 13, 2024

Mirco Ravanelli

Read the article

April 28, 2021

Introducing SpeechBrain: A General-Purpose PyTorch Speech Processing Toolkit

Mirco Ravanelli

Loren Lugosch

Read the article

Publications

Comparison of Speech Tasks in Human Expert and Machine Detection of Parkinson's Disease

Peter William VanHarn Plantinga

Roozbeh Sattari

Karine Marcotte

Carla Di Gironimo

Madeleine Sharp

Liziane Bouvier

Maiya Geddes

Ingrid Verduyckt

'Etienne de Villers-Sidani

Denise Klein

The speech of people with Parkinson's Disease (PD) has been shown to hold important clues about the presence and progression of the disease.… (see more) We investigate the factors based on which humans experts make judgments of the presence of disease in speech samples over five different speech tasks: phonations, sentence repetition, reading, recall, and picture description. We make comparisons by conducting listening tests to determine clinicians accuracy at recognizing signs of PD from audio alone, and we conduct experiments with a machine learning system for detection based on Whisper. Across tasks, Whisper performs on par or better than human experts when only audio is available, especially on challenging but important subgroups of the data: younger patients, mild cases, and female patients. Whisper's ability to recognize acoustic cues in difficult cases complements the multimodal and contextual strengths of human experts.

2025-10-08

ArXiv (preprint)

Comparison of Speech Tasks in Human Expert and Machine Detection of Parkinson's Disease

Peter Plantinga

Roozbeh Sattari

Karine Marcotte

Carla Di Gironimo

Madeleine Sharp

Liziane Bouvier

Maiya Geddes

Ingrid Verduyckt

'Etienne de Villers-Sidani

Denise Klein

2025-10-08

ArXiv (preprint)

Investigating Faithfulness in Large Audio Language Models

Lovenya Jain

Faithfulness measures whether chain-of-thought (CoT) representations accurately reflect a model's decision process and can be used as reliab… (see more)le explanations. Prior work has shown that CoTs from text-based LLMs are often unfaithful. This question has not been explored for large audio-language models (LALMs), where faithfulness is critical for safety-sensitive applications. Reasoning in LALMs is also more challenging, as models must first extract relevant clues from audio before reasoning over them. In this paper, we investigate the faithfulness of CoTs produced by several LALMs by applying targeted interventions, including paraphrasing, filler token injection, early answering, and introducing mistakes, on two challenging reasoning datasets: SAKURA and MMAR. After going through the aforementioned interventions across several datasets and tasks, our experiments suggest that, LALMs generally produce CoTs that appear to be faithful to their underlying decision processes.

2025-09-26

ArXiv (preprint)

Investigating Faithfulness in Large Audio Language Models

Lovenya Jain

2025-09-26

ArXiv (preprint)

Virtual Consistency for Audio Editing

Free-form, text-based audio editing remains a persistent challenge, despite progress in inversion-based neural methods. Current approaches r… (see more)ely on slow inversion procedures, limiting their practicality. We present a virtual-consistency based audio editing system that bypasses inversion by adapting the sampling process of diffusion models. Our pipeline is model-agnostic, requiring no fine-tuning or architectural changes, and achieves substantial speed-ups over recent neural editing baselines. Crucially, it achieves this efficiency without compromising quality, as demonstrated by quantitative benchmarks and a user study involving 16 participants.

2025-09-21

ArXiv (preprint)

Virtual Consistency for Audio Editing

2025-09-21

ArXiv (preprint)

FocalCodec-Stream: Streaming Low-Bitrate Speech Coding via Causal Distillation

Luca Della Libera

Neural audio codecs are a fundamental component of modern generative audio pipelines. Although recent codecs achieve strong low-bitrate reco… (see more)nstruction and provide powerful representations for downstream tasks, most are non-streamable, limiting their use in real-time applications. We present FocalCodec-Stream, a hybrid codec based on focal modulation that compresses speech into a single binary codebook at 0.55 - 0.80 kbps with a theoretical latency of 80 ms. Our approach combines multi-stage causal distillation of WavLM with targeted architectural improvements, including a lightweight refiner module that enhances quality under latency constraints. Experiments show that FocalCodec-Stream outperforms existing streamable codecs at comparable bitrates, while preserving both semantic and acoustic information. The result is a favorable trade-off between reconstruction quality, downstream task performance, latency, and efficiency. Code and checkpoints will be released at https://github.com/lucadellalib/focalcodec.

2025-09-01

arXiv (published)

Autoregressive Speech Enhancement via Acoustic Tokens

Luca Della Libera

2025-07-01

arXiv (published)

From Black Box to Biomarker: Sparse Autoencoders for Interpreting Speech Models of Parkinson's Disease

Peter Plantinga

Jen-Kai Chen

Roozbeh Sattari

Denise Klein

Speech holds promise as a cost-effective and non-invasive biomarker for neurological conditions such as Parkinson's disease (PD). While deep… (see more) learning systems trained on raw audio can find subtle signals not available from hand-crafted features, their black-box nature hinders clinical adoption. To address this, we apply sparse autoencoders (SAEs) to uncover interpretable internal representations from a speech-based PD detection system. We introduce a novel mask-based activation for adapting SAEs to small biomedical datasets, creating sparse disentangled dictionary representations. These dictionary entries are found to have strong associations with characteristic articulatory deficits in PD speech, such as reduced spectral flux and increased spectral flatness in the low-energy regions highlighted by the model attention. We further show that the spectral flux is related to volumetric measurements of the putamen from MRI scans, demonstrating the potential of SAEs to reveal clinically relevant biomarkers for disease monitoring and diagnosis.

2025-07-01

arXiv (published)

Discrete Audio Tokens: More Than a Survey!

Gallil Maimon

Adel Moumen

Darius Petermann

Jiatong Shi

Haibin Wu

Haici Yang

Anastasia Kuznetsova

Artem Ploujnikov

Ricard Marxer

Bhuvana Ramabhadran

Benjamin Elizalde

Loren Lugosch

Jinyu Li

Phil Woodland

Minje Kim

Hung-yi Lee

Shinji Watanabe

Yossi Adi … (see 1 more)

Discrete audio tokens are compact representations that aim to preserve perceptual quality, phonetic content, and speaker characteristics whi… (see more)le enabling efficient storage and inference, as well as competitive performance across diverse downstream tasks. They provide a practical alternative to continuous features, enabling the integration of speech and audio into modern large language models (LLMs). As interest in token-based audio processing grows, various tokenization methods have emerged, and several surveys have reviewed the latest progress in the field. However, existing studies often focus on specific domains or tasks and lack a unified comparison across various benchmarks. This paper presents a systematic review and benchmark of discrete audio tokenizers, covering three domains: speech, music, and general audio. We propose a taxonomy of tokenization approaches based on encoder-decoder, quantization techniques, training paradigm, streamability, and application domains. We evaluate tokenizers on multiple benchmarks for reconstruction, downstream performance, and acoustic language modeling, and analyze trade-offs through controlled ablation studies. Our findings highlight key limitations, practical considerations, and open challenges, providing insight and guidance for future research in this rapidly evolving area. For more information, including our main results and tokenizer database, please refer to our website: https://poonehmousavi.github.io/dates-website/.

2025-06-12

ArXiv (preprint)

Discrete Audio Tokens: More Than a Survey!

Gallil Maimon

Adel Moumen

Darius Petermann

Jiatong Shi

Haibin Wu

Haici Yang

Anastasia Kuznetsova

Artem Ploujnikov

Ricard Marxer

Bhuvana Ramabhadran

Benjamin Elizalde

Loren Lugosch

Jinyu Li

Phil Woodland

Minje Kim

Hung-yi Lee

Shinji Watanabe

Yossi Adi … (see 1 more)

Discrete audio tokens are compact representations that aim to preserve perceptual quality, phonetic content, and speaker characteristics whi… (see more)le enabling efficient storage and inference, as well as competitive performance across diverse downstream tasks.They provide a practical alternative to continuous features, enabling the integration of speech and audio into modern large language models (LLMs). As interest in token-based audio processing grows, various tokenization methods have emerged, and several surveys have reviewed the latest progress in the field. However, existing studies often focus on specific domains or tasks and lack a unified comparison across various benchmarks. This paper presents a systematic review and benchmark of discrete audio tokenizers, covering three domains: speech, music, and general audio. We propose a taxonomy of tokenization approaches based on encoder-decoder, quantization techniques, training paradigm, streamability, and application domains. We evaluate tokenizers on multiple benchmarks for reconstruction, downstream performance, and acoustic language modeling, and analyze trade-offs through controlled ablation studies. Our findings highlight key limitations, practical considerations, and open challenges, providing insight and guidance for future research in this rapidly evolving area. For more information, including our main results and tokenizer database, please refer to our website: https://poonehmousavi.github.io/dates-website/.

2025-06-12

ArXiv (preprint)

LiSTEN: Learning Soft Token Embeddings for Neural Audio LLMs

Foundation models based on large language models (LLMs) have shown great success in handling various tasks and modalities. However, adapting… (see more) these models for general-purpose audio-language tasks is challenging due to differences in acoustic environments and task variations. In this work, we introduce LiSTEN Learning Soft Token Embeddings for Neural Audio LLMs), a framework for adapting LLMs to speech and audio tasks. LiSTEN uses a dynamic prompt selection strategy with learnable key-value pairs, allowing the model to balance general and task-specific knowledge while avoiding overfitting in a multitask setting. Our approach reduces dependence on large-scale ASR or captioning datasets, achieves competitive performance with fewer trainable parameters, and simplifies training by using a single-stage process. Additionally, LiSTEN enhances interpretability by analyzing the diversity and overlap of selected prompts across different tasks.

2025-05-24

ArXiv (preprint)