Mirco Ravanelli

Biography

Mirco Ravanelli is an assistant professor at Concordia University, an adjunct professor at Université de Montréal, and an associate member of Mila – Quebec Artificial Intelligence Institute. A recipient of the 2022 Amazon Research Award, he is an expert in deep learning and conversational AI and has published more than 60 papers in these areas. His research focuses primarily on novel deep learning algorithms, including self-supervised, continual, multimodal, cooperative, and energy-efficient learning. He completed his postdoctoral work at Mila under the supervision of Professor Yoshua Bengio. He is also the founder and leader of SpeechBrain, one of the most widely adopted open-source toolkits for speech processing and conversational AI.

Current Students

Allan Allan

Undergraduate - Concordia

Gaspard Botté

Research Intern - Concordia

Cordelle Briac

Research Master's - Concordia University

Leo Brodeur

Research Master's - Concordia

Matthieu Cervera

Principal supervisor:

Victor Cruz

Research Master's - Concordia

PhD - Concordia

Co-supervisor:

Wagner Drew

Research Master's - Concordia

Co-supervisor:

Irina Rish

Gianfranco Dumoulin Bertucci

Research Master's - Concordia

Nadine El-Mufti

Research Master's - Concordia

Maab Elrashid Ahmed Mohamed

PhD - Concordia

Co-supervisor:

PhD - Concordia

Salman Sami Hussain Ali

Research Master's - Concordia University

Lovenya Jain

Research Intern - Université Laval

Principal supervisor:

PhD - Université Laval

Principal supervisor:

Haoyu Li

Professional Master's - Concordia University

Eleonora Mancini

Alumni collaborator - UdeM

Principal supervisor:

Research collaborator - University of Toulon

Principal supervisor:

Pierfrancesco Melucci

Research Intern - Concordia

PhD - UdeM

PhD - Concordia

PhD - Concordia

Co-supervisor:

Peter Peter

Postdoctorate - McGill

PhD - UdeM

Postdoctorate - Concordia

Blog posts

FocalCodec: Giving LLMs Hearing and Speech at Ultra-Low Bitrates

Visual of FocalCodec, a new method for compressing speech without sacrificing quality, with a view to more efficient multimodal LLMs.

January 23, 2026

by

Luca Della Libera

Francesco Paissan

Cem Subakan

Mirco Ravanelli


June 13, 2024

SpeechBrain 1.0: Making Conversational AI Accessible to Everyone

by

Mirco Ravanelli


April 28, 2021

Introducing SpeechBrain: A General-Purpose PyTorch Speech Processing Toolkit

by

Mirco Ravanelli

Loren Lugosch


Publications

Cough acoustic analysis using artificial intelligence for COVID-19 detection: A comparative study of patient cohorts from Lima, Peru and Montreal, Canada

A. Zimmer

Vijay Ravi

Patricia Espinoza-Lopez

George P. Kafentzis

Samira Abbasgholizadeh Rahimi

Madhukar Pai

César Ugarte-Gil

Serge Lapierre

2026-05-31

Annals of Epidemiology (published)

Exploring Token-Space Manipulation in Latent Audio Tokenizers

Neural audio codecs provide compact discrete representations for speech generation and manipulation. However, most codecs organize tokens as frame-level sequences, making it difficult to study or intervene on global factors of variation. In this work, we propose the Latent Audio Tokenizer for Token-space Editing (LATTE) that appends a fixed set of learnable latent tokens to the audio feature sequence and retains only these tokens for quantization and decoding. This design produces a compact, non-temporally aligned bottleneck in which each token can aggregate global information across the full utterance. We show that the resulting tokenizer preserves competitive reconstruction quality in low-bitrate speech coding settings while enabling simple token-space interventions. In particular, we find that swapping selected latent token positions between utterances can modify global attributes, such as speaker identity and background noise, and we evaluate these interventions on voice conversion and denoising tasks. Our results suggest that compact latent audio tokenizers can support controllable audio manipulation without supervision in task-specific editing models.

2026-05-10

arXiv (preprint)
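The token-space intervention described in the abstract above (swapping selected latent token positions between two utterances) can be illustrated with a minimal sketch. This is not the authors' code: the function name `swap_latent_tokens`, the token values, and the choice of positions are all hypothetical, and the sketch assumes each utterance is already encoded as a fixed-length array of discrete token indices.

```python
import numpy as np

def swap_latent_tokens(tokens_a, tokens_b, positions):
    """Return copies of two token arrays with the given positions exchanged."""
    a = np.array(tokens_a, copy=True)
    b = np.array(tokens_b, copy=True)
    if a.shape != b.shape:
        raise ValueError("both utterances must use the same number of latent tokens")
    idx = np.asarray(positions, dtype=int)
    # The right-hand side is evaluated first, so this is a true exchange.
    a[idx], b[idx] = b[idx].copy(), a[idx].copy()
    return a, b

# Hypothetical token indices for two utterances with 8 latent tokens each.
utt_a = np.array([11, 12, 13, 14, 15, 16, 17, 18])
utt_b = np.array([21, 22, 23, 24, 25, 26, 27, 28])

# Assumption for illustration: positions 0 and 3 carry a global attribute.
edited_a, edited_b = swap_latent_tokens(utt_a, utt_b, positions=[0, 3])
print(edited_a)  # [21 12 13 24 15 16 17 18]
print(edited_b)  # [11 22 23 14 25 26 27 28]
```

In a real system the edited token arrays would then be passed to the tokenizer's decoder; which positions encode which attribute would have to be found empirically.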

DASB - Discrete Audio and Speech Benchmark

Pooneh Mousavi

Jarod Duret

Darius Petermann

Anastasia Kuznetsova

Discrete audio tokens have recently gained considerable attention for their potential to bridge audio and language processing, enabling multimodal language models that can both generate and understand audio. However, preserving key information such as phonetic content, speaker identity, and paralinguistic cues remains a major challenge. Identifying the optimal tokenizer and configuration is further complicated by inconsistent evaluation settings across existing studies. To address this, we introduce the Discrete Audio and Speech Benchmark (DASB), a comprehensive framework for benchmarking discrete audio tokens across speech, general audio, and music domains on a range of discriminative and generative tasks. Our results show that discrete representations are less robust than continuous ones and require careful tuning of factors such as model architecture, data size, learning rate, and capacity. Semantic tokens generally outperform acoustic tokens, but a gap remains between discrete tokens and continuous features, highlighting the need for further research. DASB codes, evaluation setup, and leaderboards are publicly available at https://poonehmousavi.github.io/DASB-website/.

2026-04-12

Transactions on Machine Learning Research (accepted)

openreview.net

Listen First, Then Answer: Timestamp-Grounded Speech Reasoning

Large audio-language models (LALMs) can generate reasoning chains for their predictions, but it remains unclear whether these reasoning chains remain grounded in the input audio. In this paper, we propose an RL-based strategy that grounds the reasoning outputs of LALMs with explicit timestamp annotations referring to relevant segments of the audio signal. Our analysis shows that timestamp grounding leads the model to attend more strongly to audio tokens during reasoning generation. Experiments on four speech-based benchmark datasets demonstrate that our approach improves performance compared to both zero-shot reasoning and fine-tuning without timestamp grounding. Additionally, grounding amplifies desirable reasoning behaviors, such as region exploration, audiology verification, and consistency, underscoring the importance of grounding mechanisms for faithful multimodal reasoning.

2026-03-18

arXiv (preprint)

LL-SDR: Low-Latency Speech enhancement through Discrete Representations

Jingyi Li

Many speech enhancement (SE) methods rely on continuous representations. Recently, discrete audio tokens have been explored to enable autoregressive generation for SE. However, it remains unclear whether discretization itself consistently improves SE performance. In this paper, we introduce LL-SDR, a token-based speech enhancement framework that explicitly leverages discretization to better separate speech and noise. Our first contribution is a Variance-Ordered Residual Vector Quantizer (VO-RVQ), designed to disentangle speech and noise distributions during tokenization. Second, we propose a latent-space discriminator to better align enhanced embeddings with semantic embeddings. Experiments show that LL-SDR outperforms continuous baselines and matches the performance of autoregressive token-based approaches, while enabling lightweight, low-latency speech enhancement in both reverberant and non-reverberant noisy environments. Demos and source code are available at our project websites.

2026-03-09

arXiv (preprint)

WavSLM: Single-Stream Speech Language Modeling via WavLM Distillation

Large language models show that simple autoregressive training can yield scalable and coherent generation, but extending this paradigm to speech remains challenging due to the entanglement of semantic and acoustic information. Most existing speech language models rely on text supervision, hierarchical token streams, or complex hybrid architectures, departing from the single-stream generative pretraining paradigm that has proven effective in text. In this work, we introduce WavSLM, a speech language model trained by quantizing and distilling self-supervised WavLM representations into a single codebook and optimizing an autoregressive next-chunk prediction objective. WavSLM jointly models semantic and acoustic information within a single token stream without text supervision or text pretraining. Despite its simplicity, it achieves competitive performance on consistency benchmarks and speech generation while using fewer parameters, less training data, and supporting streaming inference. Demo samples are available at https://lucadellalib.github.io/wavslm-web/.

2026-03-04

arXiv (preprint)

Beyond Fixed Frames: Dynamic Character-Aligned Speech Tokenization

Neural audio codecs are at the core of modern conversational speech technologies, converting continuous speech into sequences of discrete tokens that can be processed by LLMs. However, existing codecs typically operate at fixed frame rates, allocating tokens uniformly in time and producing unnecessarily long sequences. In this work, we introduce DyCAST, a Dynamic Character-Aligned Speech Tokenizer that enables variable-frame-rate tokenization through soft character-level alignment and explicit duration modeling. DyCAST learns to associate tokens with character-level linguistic units during training and supports alignment-free inference with direct control over token durations at decoding time. To improve speech resynthesis quality at low frame rates, we further introduce a retrieval-augmented decoding mechanism that enhances reconstruction fidelity without increasing bitrate. Experiments show that DyCAST achieves competitive speech resynthesis quality and downstream performance while using significantly fewer tokens than fixed-frame-rate codecs. Code and checkpoints will be released publicly at https://github.com/lucadellalib/dycast.

2026-01-29

Open MIND (preprint)

From Speech to Sonography: Spectral Networks for Ultrasound Microstructure Classification

Ali K. Z. Tehrani

An Tang

Guy Cloutier

Iman Rafati

Bich Ngoc Nguyen

Quoc-Huy Trinh

Ivan Rosado-Mendez

Hassan Rivaz

The frequency dependence of backscattered radiofrequency (RF) signals produced by ultrasound scanners carries rich information related to the tissue microstructure (i.e., scatterer size, attenuation). This information can be used to classify tissues based on microstructural changes associated with disease onset and progression. Conventional convolutional neural networks (CNNs) can learn this information directly from RF data, but they often struggle to achieve adequate frequency selectivity. This increases model complexity and convergence time, and limits generalization. To overcome these challenges, SincNet, originally developed for speech processing, was adapted to classify RF data based on differences in frequency properties. Rather than learning every filter coefficient, SincNet learns only each filter's low cutoff frequency and bandwidth, dramatically reducing the number of parameters and improving frequency resolution. For model interpretability, a Gradient-Weighted Filter Contribution is introduced, which highlights the importance of spectral bands. The approach was validated on three datasets: simulated data with different scatterer sizes, experimental phantom data, and in vivo data from rats fed a methionine- and choline-deficient diet to develop liver steatosis, inflammation, and fibrosis. The modified SincNet consistently achieved the best results in material/tissue classifications.

2025-11-26

IEEE Transactions on Biomedical Engineering (published)
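To make the SincNet idea in the abstract above concrete, the following is a minimal NumPy sketch of a band-pass filter that is fully determined by two numbers, a low cutoff frequency and a bandwidth, instead of one learnable weight per tap. It is an illustration of the general parametrization only, not the modified network used in the paper; the function name `sinc_bandpass`, the kernel size of 251 taps, the Hamming window, and the 16 kHz sample rate are assumptions chosen for the example (ultrasound RF data would use a far higher sample rate).

```python
import numpy as np

def sinc_bandpass(low_hz, band_hz, kernel_size=251, sample_rate=16000):
    """Build a windowed band-pass FIR filter from a low cutoff and a bandwidth."""
    f1 = abs(low_hz) / sample_rate            # normalized low cutoff (cycles/sample)
    f2 = f1 + abs(band_hz) / sample_rate      # normalized high cutoff
    n = np.arange(kernel_size) - (kernel_size - 1) / 2
    # A band-pass filter is the difference of two ideal low-pass filters.
    # np.sinc(x) computes sin(pi*x)/(pi*x), so pi is already accounted for.
    h = 2 * f2 * np.sinc(2 * f2 * n) - 2 * f1 * np.sinc(2 * f1 * n)
    return h * np.hamming(kernel_size)        # window to reduce ripple

def gain_at(h, hz, sample_rate=16000, n_fft=4096):
    """Magnitude response of filter h at the FFT bin closest to hz."""
    spectrum = np.abs(np.fft.rfft(h, n_fft))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    return spectrum[np.argmin(np.abs(freqs - hz))]

# Hypothetical filter passing roughly 1000-2000 Hz.
h = sinc_bandpass(low_hz=1000.0, band_hz=1000.0)
print(len(h), "taps defined by 2 parameters")
print("gain inside the band :", gain_at(h, 1500.0))
print("gain outside the band:", gain_at(h, 4000.0))
```

In a trainable layer, `low_hz` and `band_hz` would be the only learned quantities per filter, which is where the parameter reduction mentioned in the abstract comes from.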

FocalCodec: Low-Bitrate Speech Coding via Focal Modulation Networks

Large language models have revolutionized natural language processing through self-supervised pretraining on massive datasets. Inspired by this success, researchers have explored adapting these methods to speech by discretizing continuous audio into tokens using neural audio codecs. However, existing approaches face limitations, including high bitrates, the loss of either semantic or acoustic information, and the reliance on multi-codebook designs when trying to capture both, which increases architectural complexity for downstream tasks. To address these challenges, we introduce FocalCodec, an efficient low-bitrate codec based on focal modulation that utilizes a single binary codebook to compress speech between 0.16 and 0.65 kbps. FocalCodec delivers competitive performance in speech resynthesis and voice conversion at lower bitrates than the current state-of-the-art, while effectively handling multilingual speech and noisy environments. Evaluation on downstream tasks shows that FocalCodec successfully preserves sufficient semantic and acoustic information, while also being well-suited for generative modeling. Demo samples and code are available at https://lucadellalib.github.io/focalcodec-web/.

2025-09-17

NeurIPS 2025 (poster)

openreview.net
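For a single-codebook codec, the bitrate is simply the token rate multiplied by the number of bits per token. The short sketch below works through that arithmetic for the 0.16 to 0.65 kbps range quoted in the FocalCodec abstract above. The abstract states neither the token rates nor the codebook size, so the values used here (13 bits per token, and token rates of 50, 25, and 12.5 Hz) are assumptions picked because they reproduce the quoted endpoints; the helper `bitrate_kbps` is likewise hypothetical.

```python
def bitrate_kbps(token_rate_hz, bits_per_token):
    """Bitrate of a single-codebook codec in kilobits per second."""
    return token_rate_hz * bits_per_token / 1000.0

# Assumed, not taken from the abstract: a binary code of 13 bits per token.
BITS_PER_TOKEN = 13
# Assumed token rates in tokens per second.
TOKEN_RATES_HZ = (50.0, 25.0, 12.5)

rates = {rate: bitrate_kbps(rate, BITS_PER_TOKEN) for rate in TOKEN_RATES_HZ}
for rate, kbps in rates.items():
    print(f"{rate:5.1f} tokens/s x {BITS_PER_TOKEN} bits = {kbps:.4f} kbps")
```

Under these assumptions the highest rate gives 0.65 kbps and the lowest gives 0.1625 kbps, which rounds to the 0.16 kbps lower bound in the abstract.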

Investigating the Effectiveness of Explainability Methods in Parkinson's Detection from Speech

Paolo Torroni

Speech impairments in Parkinson's disease (PD) provide significant early indicators for diagnosis. While models for speech-based PD detection have shown strong performance, their interpretability remains underexplored. This study systematically evaluates several explainability methods to identify PD-specific speech features, aiming to support the development of accurate, interpretable models for clinical decision-making in PD diagnosis and monitoring. Our methodology involves (i) obtaining attributions and saliency maps using mainstream interpretability techniques, (ii) quantitatively evaluating the faithfulness of these maps and their combinations obtained via union and intersection through a range of established metrics, and (iii) assessing the information conveyed by the saliency maps for PD detection from an auxiliary classifier. Our results reveal that, while explanations are aligned with the classifier, they often fail to provide valuable information for domain experts.

2025-04-05

2025 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW) (published)
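Step (ii) of the methodology above combines saliency maps via union and intersection. One plausible reading, sketched below, is to binarize each map and combine the resulting masks element-wise. The abstract does not say how the maps are combined in practice, so the thresholding step, the threshold of 0.5, the function name `combine_saliency`, and the toy maps are all assumptions made for illustration.

```python
import numpy as np

def combine_saliency(map_a, map_b, threshold=0.5):
    """Binarize two saliency maps and return their union and intersection."""
    mask_a = np.asarray(map_a) >= threshold
    mask_b = np.asarray(map_b) >= threshold
    union = mask_a | mask_b             # salient under either method
    intersection = mask_a & mask_b      # salient under both methods
    return union, intersection

# Toy 2x2 saliency maps (e.g. frequency x time), values in [0, 1].
map_a = np.array([[0.9, 0.2], [0.6, 0.1]])
map_b = np.array([[0.8, 0.7], [0.3, 0.0]])

union, intersection = combine_saliency(map_a, map_b)
print(union)         # [[ True  True] [ True False]]
print(intersection)  # [[ True False] [False False]]
```

A continuous alternative would take the element-wise maximum and minimum of the two maps instead of thresholding first.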

LMAC-TD: Producing Time Domain Explanations for Audio Classifiers

Neural networks are typically black boxes that remain opaque with regard to their decision mechanisms. Several works in the literature have proposed post-hoc explanation methods to alleviate this issue. This paper proposes LMAC-TD, a post-hoc explanation method that trains a decoder to produce explanations directly in the time domain. This methodology builds upon the foundation of L-MAC, Listenable Maps for Audio Classifiers, a method that produces faithful and listenable explanations. We incorporate SepFormer, a popular transformer-based time-domain source separation architecture. We show through a user study that LMAC-TD significantly improves the audio quality of the produced explanations without sacrificing faithfulness.

2025-04-05

ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (published)

What Are They Doing? Joint Audio-Speech Co-Reasoning

Yingzhi Wang

Pooneh Mousavi

Artem Ploujnikov

In audio and speech processing, tasks usually focus on either the audio or speech modality, even when both sounds and human speech are present in the same audio clip. Recent Auditory Large Language Models (ALLMs) have made it possible to process audio and speech simultaneously within a single model, leading to further considerations of joint audio-speech tasks. In this paper, we establish a novel benchmark to investigate how well ALLMs can perform joint audio-speech processing. Specifically, we introduce Joint Audio-Speech Co-Reasoning (JASCO), a novel task that unifies audio and speech processing, strictly requiring co-reasoning across both modalities. We also release a scene-reasoning dataset called "What Are They Doing". Additionally, we provide deeper insights into the models' behaviors by analyzing their dependence on each modality.

2025-04-05

ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (published)