Mirco Ravanelli

Biography

Mirco Ravanelli is an assistant professor at Concordia University, adjunct professor at Université de Montréal and associate member of Mila – Quebec Artificial Intelligence Institute.

Ravanelli is an expert in deep learning and conversational AI, publishing over sixty papers in these fields. His contributions were honoured with a 2022 Amazon Research Award.

His research focuses primarily on novel deep learning algorithms, including self-supervised, continual, multimodal, cooperative and energy-efficient learning.

Formerly a postdoctoral fellow at Mila under Yoshua Bengio, he founded and now leads SpeechBrain, one of the most extensively used open-source toolkits in the field of speech processing and conversational AI.

Current Students

Allan Allan

Undergraduate - Concordia University

Cordelle Briac

Master's Research - Concordia University University

Leo Brodeur

Master's Research - Concordia University

Matthieu Cervera

Principal supervisor :

Victor Cruz

Master's Research - Concordia University

PhD - Concordia University

Co-supervisor :

Wagner Drew

Master's Research - Concordia University

Co-supervisor :

Irina Rish

Gianfranco Dumoulin Bertucci

Master's Research - Concordia University

Nadine El-Mufti

Master's Research - Concordia University

Maab Elrashid Ahmed Mohamed

PhD - Concordia University

Co-supervisor :

Bonzi Francesco

PhD - Concordia University

Salman Sami Hussain Ali

Master's Research - Concordia University University

Jihoon Jeong

PhD - Université Laval

Principal supervisor :

Haoyu Li

Professional Master's - Concordia University Univesity

Eleonora Mancini

Collaborating Alumni - Université de Montréal

Principal supervisor :

Collaborating researcher - University of Toulon

Principal supervisor :

PhD - Université de Montréal

PhD - Concordia University

PhD - Concordia University

Co-supervisor :

PhD - Université Laval

Principal supervisor :

Peter Peter

Postdoctorate - McGill University

PhD - Université de Montréal

Arpnik Singh

Master's Research - Concordia University

Benjamin Van Niekerk

Postdoctorate - Concordia University

FocalCodec: Giving LLMs Ears and a Voice at Ultra-Low Bitrates

Blog Posts

Visual of FocalCodec,nouvelle méthode pour compresser la parole sans sacrifier la qualité, en vue d'obtenir des LLM multimodaux plus efficaces.

January 23, 2026

Luca Della Libera

Francesco Paissan

Cem Subakan

Mirco Ravanelli

Read the article

June 13, 2024

SpeechBrain 1.0: Making Conversational AI Accessible to Everyone

Mirco Ravanelli

Read the article

April 28, 2021

Introducing SpeechBrain: A General-Purpose PyTorch Speech Processing Toolkit

Mirco Ravanelli

Loren Lugosch

Read the article

Publications

Beyond Fixed Frames: Dynamic Character-Aligned Speech Tokenization

Neural audio codecs are at the core of modern conversational speech technologies, converting continuous speech into sequences of discrete to… (see more)kens that can be processed by LLMs. However, existing codecs typically operate at fixed frame rates, allocating tokens uniformly in time and producing unnecessarily long sequences. In this work, we introduce DyCAST, a Dynamic Character-Aligned Speech Tokenizer that enables variable-frame-rate tokenization through soft character-level alignment and explicit duration modeling. DyCAST learns to associate tokens with character-level linguistic units during training and supports alignment-free inference with direct control over token durations at decoding time. To improve speech resynthesis quality at low frame rates, we further introduce a retrieval-augmented decoding mechanism that enhances reconstruction fidelity without increasing bitrate. Experiments show that DyCAST achieves competitive speech resynthesis quality and downstream performance while using significantly fewer tokens than fixed-frame-rate codecs. Code and checkpoints will be released publicly at https://github.com/lucadellalib/dycast.

2026-01-29

ArXiv (preprint)

Bayesian Deep Learning for Remaining Useful Life Estimation via Stein Variational Gradient Descent

Jacopo Andreoli

Davide Dalle Pezze

Gian Antonio Susto

A crucial task in predictive maintenance is estimating the remaining useful life of physical systems. In the last decade, deep learning has … (see more)improved considerably upon traditional model-based and statistical approaches in terms of predictive performance. However, in order to optimally plan maintenance operations, it is also important to quantify the uncertainty inherent to the predictions. This issue can be addressed by turning standard frequentist neural networks into Bayesian neural networks, which are naturally capable of providing confidence intervals around the estimates. Several methods exist for training those models. Researchers have focused mostly on parametric variational inference and sampling-based techniques, which notoriously suffer from limited approximation power and large computational burden, respectively. In this work, we use Stein variational gradient descent, a recently proposed algorithm for approximating intractable distributions that overcomes the drawbacks of the aforementioned techniques. In particular, we show through experimental studies on simulated run-to-failure turbofan engine degradation data that Bayesian deep learning models trained via Stein variational gradient descent consistently outperform with respect to convergence speed and predictive performance both the same models trained via parametric variational inference and their frequentist counterparts trained via backpropagation. Furthermore, we propose a method to enhance performance based on the uncertainty information provided by the Bayesian models. We release the source code at https://github.com/lucadellalib/bdl-rul-svgd.

2025-12-31

IEEE Trans Autom. Sci. Eng. (published)

Toward Faithful Explanations in Acoustic Anomaly Detection

Maab Elrashid

Rémi Georges

Interpretability is essential for user trust in real-world anomaly detection applications. However, deep learning models, despite their stro… (see more)ng performance, often lack transparency. In this work, we study the interpretability of autoencoder-based models for audio anomaly detection, by comparing a standard autoencoder (AE) with a mask autoencoder (MAE) in terms of detection performance and interpretability. We applied several attribution methods, including error maps, saliency maps, SmoothGrad, Integrated Gradients, GradSHAP, and Grad-CAM. Although MAE shows a slightly lower detection, it consistently provides more faithful and temporally precise explanations, suggesting a better alignment with true anomalies. To assess the relevance of the regions highlighted by the explanation method, we propose a perturbation-based faithfulness metric that replaces them with their reconstructions to simulate normal input. Our findings, based on experiments in a real industrial scenario, highlight the importance of incorporating interpretability into anomaly detection pipelines and show that masked training improves explanation quality without compromising performance.

2025-12-31

arXiv (published)

From Speech to Sonography: Spectral Networks for Ultrasound Microstructure Classification

Ali K. Z. Tehrani

An Tang

Guy Cloutier

Iman Rafati

Bich Ngoc Nguyen

Quoc-Huy Trinh

Ivan Rosado-Mendez

Hassan Rivaz

The frequency dependence of backscattered radiofrequency (RF) signals produced by ultrasound scanners carries rich information related to th… (see more)e tissue microstructure (i.e., scatterer size, attenuation). This information can be sue to classify tissues based on microstructural changes associated to disease onset and progression. Conventional convolutional neural networks (CNNs) can learn this information directly from radio-frequency (RF) data, but they often struggle to achieve adequate frequency selectivity. This increases model complexity and convergence time, and limits generalization. To overcome these challenges, SincNet, originally developed for speech processing, was adapted to classify RF data based on differences in frequency properties. Rather than learning every filter coefficient, SincNet only learns each filter's low frequency and bandwidth, dramatically reducing the number of parameters and improving frequency resolution. For model interpretability, a Gradient-Weighted Filter Contribution is introduced, which highlights the importance of spectral bands. The approach was validated on three datasets: simulated data with different scatterer sizes, experimental phantom data, and in vivo data of rats which were fed a methionine and choline- deficient diet to develop liver steatosis, inflammation, and fibrosis. The modified SincNet consistently achieved the best results in material/tissue classifications.

2025-11-26

IEEE transactions on bio-medical engineering (published)

Comparison of Speech Tasks in Human Expert and Machine Detection of Parkinson's Disease

Peter Plantinga

Roozbeh Sattari

Karine Marcotte

Carla Di Gironimo

Madeleine Sharp

Liziane Bouvier

Maiya Geddes

Ingrid Verduyckt

'Etienne de Villers-Sidani

Denise Klein

2025-10-08

ArXiv (preprint)

Comparison of Speech Tasks in Human Expert and Machine Detection of Parkinson's Disease

Peter William VanHarn Plantinga

Roozbeh Sattari

Karine Marcotte

Carla Di Gironimo

Madeleine Sharp

Liziane Bouvier

Maiya Geddes

Ingrid Verduyckt

'Etienne de Villers-Sidani

Denise Klein

The speech of people with Parkinson's Disease (PD) has been shown to hold important clues about the presence and progression of the disease.… (see more) We investigate the factors based on which humans experts make judgments of the presence of disease in speech samples over five different speech tasks: phonations, sentence repetition, reading, recall, and picture description. We make comparisons by conducting listening tests to determine clinicians accuracy at recognizing signs of PD from audio alone, and we conduct experiments with a machine learning system for detection based on Whisper. Across tasks, Whisper performs on par or better than human experts when only audio is available, especially on challenging but important subgroups of the data: younger patients, mild cases, and female patients. Whisper's ability to recognize acoustic cues in difficult cases complements the multimodal and contextual strengths of human experts.

2025-10-08

ArXiv (preprint)

Investigating Faithfulness in Large Audio Language Models

Lovenya Jain

Pooneh Mousavi

Faithfulness measures whether chain-of-thought (CoT) representations accurately reflect a model's decision process and can be used as reliab… (see more)le explanations. Prior work has shown that CoTs from text-based LLMs are often unfaithful. This question has not been explored for large audio-language models (LALMs), where faithfulness is critical for safety-sensitive applications. Reasoning in LALMs is also more challenging, as models must first extract relevant clues from audio before reasoning over them. In this paper, we investigate the faithfulness of CoTs produced by several LALMs by applying targeted interventions, including paraphrasing, filler token injection, early answering, and introducing mistakes, on two challenging reasoning datasets: SAKURA and MMAR. After going through the aforementioned interventions across several datasets and tasks, our experiments suggest that, LALMs generally produce CoTs that appear to be faithful to their underlying decision processes.

2025-09-26

ArXiv (preprint)

Investigating Faithfulness in Large Audio Language Models

Lovenya Jain

Pooneh Mousavi

2025-09-26

ArXiv (preprint)

Virtual Consistency for Audio Editing

Free-form, text-based audio editing remains a persistent challenge, despite progress in inversion-based neural methods. Current approaches r… (see more)ely on slow inversion procedures, limiting their practicality. We present a virtual-consistency based audio editing system that bypasses inversion by adapting the sampling process of diffusion models. Our pipeline is model-agnostic, requiring no fine-tuning or architectural changes, and achieves substantial speed-ups over recent neural editing baselines. Crucially, it achieves this efficiency without compromising quality, as demonstrated by quantitative benchmarks and a user study involving 16 participants.

2025-09-21

ArXiv (preprint)

Virtual Consistency for Audio Editing

2025-09-21

ArXiv (preprint)

FocalCodec: Low-Bitrate Speech Coding via Focal Modulation Networks

Large language models have revolutionized natural language processing through self-supervised pretraining on massive datasets. Inspired by t… (see more)his success, researchers have explored adapting these methods to speech by discretizing continuous audio into tokens using neural audio codecs. However, existing approaches face limitations, including high bitrates, the loss of either semantic or acoustic information, and the reliance on multi-codebook designs when trying to capture both, which increases architectural complexity for downstream tasks. To address these challenges, we introduce FocalCodec, an efficient low-bitrate codec based on focal modulation that utilizes a single binary codebook to compress speech between 0.16 and 0.65 kbps. FocalCodec delivers competitive performance in speech resynthesis and voice conversion at lower bitrates than the current state-of-the-art, while effectively handling multilingual speech and noisy environments. Evaluation on downstream tasks shows that FocalCodec successfully preserves sufficient semantic and acoustic information, while also being well-suited for generative modeling. Demo samples and code are available at https://lucadellalib.github.io/focalcodec-web/.

2025-09-17

NeurIPS.cc/2025/Conference (poster)

openreview.net

FocalCodec-Stream: Streaming Low-Bitrate Speech Coding via Causal Distillation

Neural audio codecs are a fundamental component of modern generative audio pipelines. Although recent codecs achieve strong low-bitrate reco… (see more)nstruction and provide powerful representations for downstream tasks, most are non-streamable, limiting their use in real-time applications. We present FocalCodec-Stream, a hybrid codec based on focal modulation that compresses speech into a single binary codebook at 0.55 - 0.80 kbps with a theoretical latency of 80 ms. Our approach combines multi-stage causal distillation of WavLM with targeted architectural improvements, including a lightweight refiner module that enhances quality under latency constraints. Experiments show that FocalCodec-Stream outperforms existing streamable codecs at comparable bitrates, while preserving both semantic and acoustic information. The result is a favorable trade-off between reconstruction quality, downstream task performance, latency, and efficiency. Code and checkpoints will be released at https://github.com/lucadellalib/focalcodec.

2025-09-01

arXiv (published)