Mirco Ravanelli

Associate Academic Member
Assistant Professor, Concordia University, Gina Cody School of Engineering and Computer Science
Adjunct Professor, Université de Montréal, Department of Computer Science and Operations Research
Research Topics
Deep Learning

Biography

Mirco Ravanelli is an assistant professor at Concordia University, adjunct professor at Université de Montréal and associate member of Mila – Quebec Artificial Intelligence Institute.

Ravanelli is an expert in deep learning and conversational AI, publishing over sixty papers in these fields. His contributions were honoured with a 2022 Amazon Research Award.

His research focuses primarily on novel deep learning algorithms, including self-supervised, continual, multimodal, cooperative and energy-efficient learning.

Formerly a postdoctoral fellow at Mila under Yoshua Bengio, he founded and now leads SpeechBrain, one of the most extensively used open-source toolkits in the field of speech processing and conversational AI.

Current Students

Master's Research - Concordia University
Collaborating researcher - Concordia University
Collaborating researcher - Concordia University
Master's Research - Concordia University
PhD - Concordia University
Co-supervisor :
Master's Research - Concordia University
Co-supervisor :
Master's Research - Concordia University
Master's Research - Concordia University
PhD - Concordia University
Co-supervisor :
PhD - Concordia University
Collaborating researcher - International School for Advanced Studies (Trieste, Italy)
Collaborating researcher - Concordia University
Collaborating researcher - Concordia University
Collaborating Alumni - Université de Montréal
Principal supervisor :
PhD - Université de Montréal
Co-supervisor :
PhD - Concordia University
PhD - Concordia University
Co-supervisor :
Postdoctorate - McGill University
PhD - Université de Montréal
Collaborating researcher - Concordia University

Publications

FocalCodec: Low-Bitrate Speech Coding via Focal Modulation Networks
Luca Della Libera
Francesco Paissan
Large language models have revolutionized natural language processing through self-supervised pretraining on massive datasets. Inspired by this success, researchers have explored adapting these methods to speech by discretizing continuous audio into tokens using neural audio codecs. However, existing approaches face limitations, including high bitrates, the loss of either semantic or acoustic information, and the reliance on multi-codebook designs when trying to capture both, which increases architectural complexity for downstream tasks. To address these challenges, we introduce FocalCodec, an efficient low-bitrate codec based on focal modulation that utilizes a single binary codebook to compress speech between 0.16 and 0.65 kbps. FocalCodec delivers competitive performance in speech resynthesis and voice conversion at lower bitrates than the current state-of-the-art, while effectively handling multilingual speech and noisy environments. Evaluation on downstream tasks shows that FocalCodec successfully preserves sufficient semantic and acoustic information, while also being well-suited for generative modeling. Demo samples, code and checkpoints are available at https://lucadellalib.github.io/focalcodec-web/.
Generalization Limits of Graph Neural Networks in Identity Effects Learning
Giuseppe Alessio D’Inverno
Simone Brugiapaglia
Graph Neural Networks (GNNs) have emerged as a powerful tool for data-driven learning on various graph domains. They are usually based on a message-passing mechanism and have gained increasing popularity for their intuitive formulation, which is closely linked to the Weisfeiler-Lehman (WL) test for graph isomorphism to which they have been proven equivalent in terms of expressive power. In this work, we establish new generalization properties and fundamental limits of GNNs in the context of learning so-called identity effects, i.e., the task of determining whether an object is composed of two identical components or not. Our study is motivated by the need to understand the capabilities of GNNs when performing simple cognitive tasks, with potential applications in computational linguistics and chemistry. We analyze two case studies: (i) two-letter words, for which we show that GNNs trained via stochastic gradient descent are unable to generalize to unseen letters when utilizing orthogonal encodings like one-hot representations; (ii) dicyclic graphs, i.e., graphs composed of two cycles, for which we present positive existence results leveraging the connection between GNNs and the WL test. Our theoretical analysis is supported by an extensive numerical study.
ProGRes: Prompted Generative Rescoring on ASR n-Best
Ada Defne Tur
Adel Moumen
Investigating the Effectiveness of Explainability Methods in Parkinson's Detection from Speech
Eleonora Mancini
Francesco Paissan
Paolo Torroni
Speech impairments in Parkinson's disease (PD) provide significant early indicators for diagnosis. While models for speech-based PD detection have shown strong performance, their interpretability remains underexplored. This study systematically evaluates several explainability methods to identify PD-specific speech features, aiming to support the development of accurate, interpretable models for clinical decision-making in PD diagnosis and monitoring. Our methodology involves (i) obtaining attributions and saliency maps using mainstream interpretability techniques, (ii) quantitatively evaluating the faithfulness of these maps and their combinations obtained via union and intersection through a range of established metrics, and (iii) assessing the information conveyed by the saliency maps for PD detection from an auxiliary classifier. Our results reveal that, while explanations are aligned with the classifier, they often fail to provide valuable information for domain experts.
A protocol for trustworthy EEG decoding with neural networks
Davide Borra
Elisa Magosso
SpeechBrain-MOABB: An open-source Python library for benchmarking deep neural networks applied to EEG signals
Davide Borra
Francesco Paissan
Listenable Maps for Zero-Shot Audio Classifiers
Francesco Paissan
Luca Della Libera
Interpreting the decisions of deep learning models, including audio classifiers, is crucial for ensuring the transparency and trustworthiness of this technology. In this paper, we introduce LMAC-ZS (Listenable Maps for Audio Classifiers in the Zero-Shot context), which, to the best of our knowledge, is the first decoder-based post-hoc interpretation method for explaining the decisions of zero-shot audio classifiers. The proposed method utilizes a novel loss function that maximizes the faithfulness to the original similarity between a given text-and-audio pair. We provide an extensive evaluation using the Contrastive Language-Audio Pretraining (CLAP) model to showcase that our interpreter remains faithful to the decisions in a zero-shot classification context. Moreover, we qualitatively show that our method produces meaningful explanations that correlate well with different text prompts.
What Are They Doing? Joint Audio-Speech Co-Reasoning
Yingzhi Wang
Pooneh Mousavi
Artem Ploujnikov
In audio and speech processing, tasks usually focus on either the audio or speech modality, even when both sounds and human speech are present in the same audio clip. Recent Auditory Large Language Models (ALLMs) have made it possible to process audio and speech simultaneously within a single model, leading to further considerations of joint audio-speech tasks. In this paper, we establish a novel benchmark to investigate how well ALLMs can perform joint audio-speech processing. Specifically, we introduce Joint Audio-Speech Co-Reasoning (JASCO), a novel task that unifies audio and speech processing, strictly requiring co-reasoning across both modalities. We also release a scene-reasoning dataset called "What Are They Doing". Additionally, we provide deeper insights into the models' behaviors by analyzing their dependence on each modality.
Dynamic HumTrans: Humming Transcription Using CNNs and Dynamic Programming
Shubham Gupta
Isaac Neri Gomez-Sarmiento
Faez Amjed Mezdari