Portrait of Mirco Ravanelli

Mirco Ravanelli

Associate Academic Member
Assistant Professor, Concordia University, Gina Cody School of Engineering and Computer Science
Adjunct Professor, Université de Montréal, Department of Computer Science and Operations Research


Mirco Ravanelli is an assistant professor at Concordia University, adjunct professor at Université de Montréal and associate member of Mila – Quebec Artificial Intelligence Institute.

Ravanelli is an expert in deep learning and conversational AI, publishing over sixty papers in these fields. His contributions were honoured with a 2022 Amazon Research Award.

His research focuses primarily on novel deep learning algorithms, including self-supervised, continual, multimodal, cooperative and energy-efficient learning.

Formerly a postdoctoral fellow at Mila under Yoshua Bengio, he founded and now leads SpeechBrain, one of the most extensively used open-source toolkits in the field of speech processing and conversational AI.

Current Students

PhD - Université de Montréal
Collaborating researcher - Concordia University University
Collaborating researcher - Concordia University University
Research Intern - Université de Montréal
Principal supervisor :
Master's Research - Concordia University
Master's Research - Concordia University
PhD - Concordia University
Co-supervisor :
Collaborating researcher - Concordia University University
Collaborating researcher - Concordia University University
Collaborating researcher - Concordia University University
Master's Research - Concordia University
Research Intern - McGill University
Principal supervisor :
Undergraduate - Concordia University


Listenable Maps for Audio Classifiers
SKILL: Similarity-aware Knowledge distILLation for Speech Self-Supervised Learning
Luca Zampierin
Ghouthi Boukli Hacene
Bac Nguyen
Focal Modulation Networks for Interpretable Sound Classification
The increasing success of deep neural networks has raised concerns about their inherent black-box nature, posing challenges related to inter… (see more)pretability and trust. While there has been extensive exploration of interpretation techniques in vision and language, interpretability in the audio domain has received limited attention, primarily focusing on post-hoc explanations. This paper addresses the problem of interpretability by-design in the audio domain by utilizing the recently proposed attention-free focal modulation networks (FocalNets). We apply FocalNets to the task of environmental sound classification for the first time and evaluate their interpretability properties on the popular ESC-50 dataset. Our method outperforms a similarly sized vision transformer both in terms of accuracy and interpretability. Furthermore, it is competitive against PIQ, a method specifically designed for post-hoc interpretation in the audio domain.
Bayesian Deep Learning for Remaining Useful Life Estimation via Stein Variational Gradient Descent
Luca Della Libera
Jacopo Andreoli
Davide Dalle Pezze
Gian Antonio Susto
A crucial task in predictive maintenance is estimating the remaining useful life of physical systems. In the last decade, deep learning has … (see more)improved considerably upon traditional model-based and statistical approaches in terms of predictive performance. However, in order to optimally plan maintenance operations, it is also important to quantify the uncertainty inherent to the predictions. This issue can be addressed by turning standard frequentist neural networks into Bayesian neural networks, which are naturally capable of providing confidence intervals around the estimates. Several methods exist for training those models. Researchers have focused mostly on parametric variational inference and sampling-based techniques, which notoriously suffer from limited approximation power and large computational burden, respectively. In this work, we use Stein variational gradient descent, a recently proposed algorithm for approximating intractable distributions that overcomes the drawbacks of the aforementioned techniques. In particular, we show through experimental studies on simulated run-to-failure turbofan engine degradation data that Bayesian deep learning models trained via Stein variational gradient descent consistently outperform with respect to convergence speed and predictive performance both the same models trained via parametric variational inference and their frequentist counterparts trained via backpropagation. Furthermore, we propose a method to enhance performance based on the uncertainty information provided by the Bayesian models. We release the source code at https://github.com/lucadellalib/bdl-rul-svgd.
Towards Foundational Models for Molecular Learning on Large-Scale Multi-Task Datasets
Shenyang Huang
Joao Alex Cunha
Zhiyi Li
Gabriela Moisescu-Pareja
Oleksandr Dymov
Samuel Maddrell-Mander
Callum McLean
Frederik Wenkel
Luis Müller
Jama Hussein Mohamud
Ali Parviz
Michael Craig
Michał Koziarski
Jiarui Lu
Zhaocheng Zhu
Cristian Gabellini
Kerstin Klaser
Josef Dean
Cas Wognum … (see 15 more)
Maciej Sypetkowski
Christopher Morris
Ioannis Koutis
Prudencio Tossou
Hadrien Mary
Therence Bois
Andrew William Fitzgibbon
Blazej Banaszewski
Chad Martin
Dominic Masters
Recently, pre-trained foundation models have enabled significant advancements in multiple fields. In molecular machine learning, however, wh… (see more)ere datasets are often hand-curated, and hence typically small, the lack of datasets with labeled features, and codebases to manage those datasets, has hindered the development of foundation models. In this work, we present seven novel datasets categorized by size into three distinct categories: ToyMix, LargeMix and UltraLarge. These datasets push the boundaries in both the scale and the diversity of supervised labels for molecular learning. They cover nearly 100 million molecules and over 3000 sparsely defined tasks, totaling more than 13 billion individual labels of both quantum and biological nature. In comparison, our datasets contain 300 times more data points than the widely used OGB-LSC PCQM4Mv2 dataset, and 13 times more than the quantum-only QM1B dataset. In addition, to support the development of foundational models based on our proposed datasets, we present the Graphium graph machine learning library which simplifies the process of building and training molecular machine learning models for multi-task and multi-level molecular datasets. Finally, we present a range of baseline results as a starting point of multi-task and multi-level training on these datasets. Empirically, we observe that performance on low-resource biological datasets show improvement by also training on large amounts of quantum data. This indicates that there may be potential in multi-task and multi-level training of a foundation model and fine-tuning it to resource-constrained downstream tasks. The Graphium library is publicly available on Github and the dataset links are available in Part 1 and Part 2.
Are LLMs Robust for Spoken Dialogues?
Seyed Mahed Mousavi
Gabriel Roccabruna
Simone Alghisi
Massimo Rizzoli
Giuseppe Riccardi
Large Pre-Trained Language Models have demonstrated state-of-the-art performance in different downstream tasks, including dialogue state tra… (see more)cking and end-to-end response generation. Nevertheless, most of the publicly available datasets and benchmarks on task-oriented dialogues focus on written conversations. Consequently, the robustness of the developed models to spoken interactions is unknown. In this work, we have evaluated the performance of LLMs for spoken task-oriented dialogues on the DSTC11 test sets. Due to the lack of proper spoken dialogue datasets, we have automatically transcribed a development set of spoken dialogues with a state-of-the-art ASR engine. We have characterized the ASR-error types and their distributions and simulated these errors in a large dataset of dialogues. We report the intrinsic (perplexity) and extrinsic (human evaluation) performance of fine-tuned GPT-2 and T5 models in two subtasks of response generation and dialogue state tracking, respectively. The results show that LLMs are not robust to spoken noise by default, however, fine-tuning/training such models on a proper dataset of spoken TODs can result in a more robust performance.
Rescuespeech: A German Corpus for Speech Recognition in Search and Rescue Domain
Sangeet Sagar
Bernd Kiefer
Ivana Kruijff-Korbayová
Josef van Genabith
Despite the recent advancements in speech recognition, there are still difficulties in accurately transcribing conversational and emotional … (see more)speech in noisy and reverberant acoustic environments. This poses a particular challenge in the search and rescue (SAR) domain, where transcribing conversations among rescue team members is crucial to support real-time decision-making. The scarcity of speech data and associated background noise in SAR scenarios make it difficult to deploy robust speech recognition systems.To address this issue, we have created and made publicly available a German speech dataset called RescueSpeech. This dataset includes real speech recordings from simulated rescue exercises. Additionally, we have released competitive training recipes and pre-trained models. Our study highlights that the performance attained by state-of-the-art methods in this challenging scenario is still far from reaching an acceptable level.
Speech Emotion Diarization: Which Emotion Appears When?
Yingzhi Wang
Alaa Nfissi
Alya Yacoubi
Speech Emotion Recognition (SER) typically relies on utterance-level solutions. However, emotions conveyed through speech should be consider… (see more)ed as discrete speech events with definite temporal boundaries, rather than attributes of the entire utterance. To reflect the fine-grained nature of speech emotions and to unify various fine-grained methods under a single objective, we propose a new task: Speech Emotion Diarization (SED). Just as Speaker Diarization answers the question of “Who speaks when?”, Speech Emotion Diarization answers the question of “Which emotion appears when?”. To facilitate the evaluation of the performance and establish a common benchmark, we introduce the Zaion Emotion Dataset (ZED), an openly accessible speech emotion dataset that includes non-acted emotions recorded in real-life conditions, along with manually annotated boundaries of emotion segments within the utterance. We provide competitive baselines and open-source the code and the pre-trained models.
TorchAudio 2.1: Advancing Speech Recognition, Self-Supervised Learning, and Audio Processing Components for Pytorch
Jeff Hwang
Moto Hira
Caroline Chen
Xiaohui Zhang
Zhaoheng Ni
Guangzhi Sun
Pingchuan Ma
Ruizhe Huang
Vineel Pratap
Yuekai Zhang
Anurag Kumar
Chin-Yun Yu
Chuang Zhu
Chunxi Liu
Jacob Kahn
Peng Sun
Shinji Watanabe
Yangyang Shi
Yumeng Tao … (see 4 more)
Robin Scheibler
Samuele Cornell
Sean Kim
Stavros Petridis
TorchAudio is an open-source audio and speech processing library built for PyTorch. It aims to accelerate the research and development of au… (see more)dio and speech technologies by providing well-designed, easy-to-use, and performant PyTorch components. Its contributors routinely engage with users to understand their needs and fulfill them by developing impactful features. Here, we survey TorchAudio’s development principles and contents and highlight key features we include in its latest version (2.1): self-supervised learning pre-trained pipelines and training recipes, high-performance CTC decoders, speech recognition models and training recipes, advanced media I/O capabilities, and tools for performing forced alignment, multi-channel speech enhancement, and reference-less speech assessment. For a selection of these features, through empirical studies, we demonstrate their efficacy and show that they achieve competitive or state-of-the-art performance.
Parameter-Efficient Transfer Learning of Audio Spectrogram Transformers
Umberto Cappellazzo
Daniele Falavigna
Alessio Brutti
The common modus operandi of fine-tuning large pre-trained Transformer models entails the adaptation of all their parameters (i.e., full fin… (see more)e-tuning). While achieving striking results on multiple tasks, this approach becomes unfeasible as the model size and the number of downstream tasks increase. In natural language processing and computer vision, parameter-efficient approaches like prompt-tuning and adapters have emerged as solid alternatives by fine-tuning only a small number of extra parameters, without sacrificing performance accuracy. For audio classification tasks, the Audio Spectrogram Transformer model shows impressive results. However, surprisingly, how to efficiently adapt it to several downstream tasks has not been tackled before. In this paper, we bridge this gap and present a detailed investigation of common parameter-efficient methods, revealing that adapters and LoRA consistently outperform the other methods across four benchmarks. Whereas adapters prove to be more efficient in few-shot learning settings, LoRA turns out to scale better as we increase the number of learnable parameters. We finally carry out ablation studies to find the best configuration for adapters and LoRA.
CL-MASR: A Continual Learning Benchmark for Multilingual ASR
Luca Della Libera
Pooneh Mousavi
Salah Zaiem
Audio Editing with Non-Rigid Text Prompts
Francesco Paissan
Zhepei Wang
Paris Smaragdis
In this paper, we explore audio-editing with non-rigid text edits. We show that the proposed editing pipeline is able to create audio edits … (see more)that remain faithful to the input audio. We explore text prompts that perform addition, style transfer, and in-painting. We quantitatively and qualitatively show that the edits are able to obtain results which outperform Audio-LDM, a recently released text-prompted audio generation model. Qualitative inspection of the results points out that the edits given by our approach remain more faithful to the input audio in terms of keeping the original onsets and offsets of the audio events.