Portrait of Mirco Ravanelli

Mirco Ravanelli

Associate Academic Member

Assistant Professor, Concordia University, Gina Cody School of Engineering and Computer Science

Adjunct Professor, Université de Montréal, Department of Computer Science and Operations Research

Research Topics

Deep Learning

Biography

Mirco Ravanelli is an assistant professor at Concordia University, adjunct professor at Université de Montréal and associate member of Mila – Quebec Artificial Intelligence Institute.

Ravanelli is an expert in deep learning and conversational AI, publishing over sixty papers in these fields. His contributions were honoured with a 2022 Amazon Research Award.

His research focuses primarily on novel deep learning algorithms, including self-supervised, continual, multimodal, cooperative and energy-efficient learning.

Formerly a postdoctoral fellow at Mila under Yoshua Bengio, he founded and now leads SpeechBrain, one of the most extensively used open-source toolkits in the field of speech processing and conversational AI.

Current Students

Undergraduate - Concordia University

Dehestani Amirali

Research Intern - Concordia University University

Master's Research - Concordia University University

Master's Research - Concordia University

Matthieu Cervera

Principal supervisor :

Master's Research - Concordia University

Luca Della Libera

PhD - Concordia University

Co-supervisor :

Master's Research - Concordia University

Co-supervisor :

Gianfranco Dumoulin Bertucci

Master's Research - Concordia University

Nadine El-Mufti

Master's Research - Concordia University

Maab Elrashid Ahmed Mohamed

PhD - Concordia University

Co-supervisor :

Bonzi Francesco

PhD - Concordia University

Salman Sami Hussain Ali

Master's Research - Concordia University University

PhD - Université Laval

Principal supervisor :

Professional Master's - Concordia University Univesity

Eleonora Mancini

Collaborating Alumni - Université de Montréal

Principal supervisor :

Collaborating researcher - University of Toulon

Principal supervisor :

Pablo Piantanida

PhD - Université de Montréal

Co-supervisor :

PhD - Concordia University

PhD - Concordia University

Co-supervisor :

Laurent Charlin

Francesco Paissan

PhD - Université Laval

Principal supervisor :

Postdoctorate - McGill University

Artem Ploujnikov

PhD - Université de Montréal

Master's Research - Concordia University

Benjamin Van Niekerk

Postdoctorate - Concordia University

Blog Posts

Visual of FocalCodec,nouvelle méthode pour compresser la parole sans sacrifier la qualité, en vue d'obtenir des LLM multimodaux plus efficaces.

January 23, 2026

FocalCodec: Giving LLMs Ears and a Voice at Ultra-Low Bitrates

by

Luca Della Libera

Francesco Paissan

Mirco Ravanelli

Read the article

June 13, 2024

SpeechBrain 1.0: Making Conversational AI Accessible to Everyone

by

Mirco Ravanelli

Read the article

Introducing SpeechBrain: A general-purpose PyTorch speech processing toolkit

April 28, 2021

Introducing SpeechBrain: A General-Purpose PyTorch Speech Processing Toolkit

by

Mirco Ravanelli

Loren Lugosch

Read the article

Publications

Towards Foundational Models for Molecular Learning on Large-Scale Multi-Task Datasets

Dominique Beaini

Joao Alex Cunha

Zhiyi Li

Gabriela Moisescu-Pareja

Oleksandr Dymov

Samuel Maddrell-Mander

Callum McLean

Frederik Wenkel

Jama Hussein Mohamud

Michael Craig

Michał Koziarski

Cristian Gabellini

Kerstin Klaser

Josef Dean

Cas Wognum … (see 15 more)

Maciej Sypetkowski

Guillaume Rabusseau

Reihaneh Rabbany

Christopher Morris

Ioannis Koutis

Mirco Ravanelli

Prudencio Tossou

Hadrien Mary

Therence Bois

Andrew William Fitzgibbon

Blazej Banaszewski

Chad Martin

Dominic Masters

Recently, pre-trained foundation models have enabled significant advancements in multiple fields. In molecular machine learning, however, wh… (see more)ere datasets are often hand-curated, and hence typically small, the lack of datasets with labeled features, and codebases to manage those datasets, has hindered the development of foundation models. In this work, we present seven novel datasets categorized by size into three distinct categories: ToyMix, LargeMix and UltraLarge. These datasets push the boundaries in both the scale and the diversity of supervised labels for molecular learning. They cover nearly 100 million molecules and over 3000 sparsely defined tasks, totaling more than 13 billion individual labels of both quantum and biological nature. In comparison, our datasets contain 300 times more data points than the widely used OGB-LSC PCQM4Mv2 dataset, and 13 times more than the quantum-only QM1B dataset. In addition, to support the development of foundational models based on our proposed datasets, we present the Graphium graph machine learning library which simplifies the process of building and training molecular machine learning models for multi-task and multi-level molecular datasets. Finally, we present a range of baseline results as a starting point of multi-task and multi-level training on these datasets. Empirically, we observe that performance on low-resource biological datasets show improvement by also training on large amounts of quantum data. This indicates that there may be potential in multi-task and multi-level training of a foundation model and fine-tuning it to resource-constrained downstream tasks. The Graphium library is publicly available on Github and the dataset links are available in Part 1 and Part 2.

2024-01-16

ICLR.cc/2024/Conference (poster)

Are LLMs Robust for Spoken Dialogues?

Seyed Mahed Mousavi

Gabriel Roccabruna

Massimo Rizzoli

Mirco Ravanelli

Giuseppe Riccardi

Large Pre-Trained Language Models have demonstrated state-of-the-art performance in different downstream tasks, including dialogue state tra… (see more)cking and end-to-end response generation. Nevertheless, most of the publicly available datasets and benchmarks on task-oriented dialogues focus on written conversations. Consequently, the robustness of the developed models to spoken interactions is unknown. In this work, we have evaluated the performance of LLMs for spoken task-oriented dialogues on the DSTC11 test sets. Due to the lack of proper spoken dialogue datasets, we have automatically transcribed a development set of spoken dialogues with a state-of-the-art ASR engine. We have characterized the ASR-error types and their distributions and simulated these errors in a large dataset of dialogues. We report the intrinsic (perplexity) and extrinsic (human evaluation) performance of fine-tuned GPT-2 and T5 models in two subtasks of response generation and dialogue state tracking, respectively. The results show that LLMs are not robust to spoken noise by default, however, fine-tuning/training such models on a proper dataset of spoken TODs can result in a more robust performance.

2024-01-04

ArXiv (preprint)

Adaptation Odyssey in LLMs: Why Does Additional Pretraining Sometimes Fail to Improve?

Matthias Bethge

Beyza Ermis

Mirco Ravanelli

cCaugatay Yildiz

2024-01-01

EMNLP (published)

CL-MASR: A Continual Learning Benchmark for Multilingual ASR

Luca Della Libera

Mirco Ravanelli

2024-01-01

IEEE ACM Trans. Audio Speech Lang. Process. (published)

How Should We Extract Discrete Audio Tokens from Self-Supervised Models?

Jarod Duret

Luca Della Libera

Artem Ploujnikov

Mirco Ravanelli

2024-01-01

INTERSPEECH (published)

Parameter-Efficient Transfer Learning of Audio Spectrogram Transformers

Umberto Cappellazzo

Daniele Falavigna

Alessio Brutti

Mirco Ravanelli

The common modus operandi of fine-tuning large pre-trained Transformer models entails the adaptation of all their parameters (i.e., full fin… (see more)e-tuning). While achieving striking results on multiple tasks, this approach becomes unfeasible as the model size and the number of downstream tasks increase. In natural language processing and computer vision, parameter-efficient approaches like prompt-tuning and adapters have emerged as solid alternatives by fine-tuning only a small number of extra parameters, without sacrificing performance accuracy. For audio classification tasks, the Audio Spectrogram Transformer model shows impressive results. However, surprisingly, how to efficiently adapt it to several downstream tasks has not been tackled before. In this paper, we bridge this gap and present a detailed investigation of common parameter-efficient methods, revealing that adapters and LoRA consistently outperform the other methods across four benchmarks. Whereas adapters prove to be more efficient in few-shot learning settings, LoRA turns out to scale better as we increase the number of learnable parameters. We finally carry out ablation studies to find the best configuration for adapters and LoRA.

2024-01-01

MLSP (published)

TARIC-SLU: A Tunisian Benchmark Dataset for Spoken Language Understanding

Salima Mdhaffar

Fethi Bougares

Mirco Ravanelli

Yannick Estève

In recent years, there has been a significant increase in interest in developing Spoken Language Understanding (SLU) systems. SLU involves e… (see more)xtracting a list of semantic information from the speech signal. A major issue for SLU systems is the lack of sufficient amount of bi-modal (audio and textual semantic annotation) training data. Existing SLU resources are mainly available in high-resource languages such as English, Mandarin and French. However, one of the current challenges concerning low-resourced languages is data collection and annotation. In this work, we present a new freely available corpus, named TARIC-SLU, composed of railway transport conversations in Tunisian dialect that is continuously annotated in dialogue acts and slots. We describe the semantic model of the dataset, the data and experiments conducted to build ASR-based and SLU-based baseline models. To facilitate its use, a complete recipe, including data preparation, training and evaluation scripts, has been built and will be integrated to SpeechBrain, a popular open-source conversational AI toolkit based on PyTorch.

2024-01-01

International Conference on Language Resources and Evaluation (published)

dblp.uni-trier.de

Rescuespeech: A German Corpus for Speech Recognition in Search and Rescue Domain

Sangeet Sagar

Mirco Ravanelli

Bernd Kiefer

Ivana Kruijff-KorbayovÃ¡

Josef van Genabith

Despite the recent advancements in speech recognition, there are still difficulties in accurately transcribing conversational and emotional … (see more)speech in noisy and reverberant acoustic environments. This poses a particular challenge in the search and rescue (SAR) domain, where transcribing conversations among rescue team members is crucial to support real-time decision-making. The scarcity of speech data and associated background noise in SAR scenarios make it difficult to deploy robust speech recognition systems.To address this issue, we have created and made publicly available a German speech dataset called RescueSpeech. This dataset includes real speech recordings from simulated rescue exercises. Additionally, we have released competitive training recipes and pre-trained models. Our study highlights that the performance attained by state-of-the-art methods in this challenging scenario is still far from reaching an acceptable level.

2023-12-16

2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) (published)

Speech Emotion Diarization: Which Emotion Appears When?

Yingzhi Wang

Mirco Ravanelli

Alaa Nfissi

Alya Yacoubi

Speech Emotion Recognition (SER) typically relies on utterance-level solutions. However, emotions conveyed through speech should be consider… (see more)ed as discrete speech events with definite temporal boundaries, rather than attributes of the entire utterance. To reflect the fine-grained nature of speech emotions and to unify various fine-grained methods under a single objective, we propose a new task: Speech Emotion Diarization (SED). Just as Speaker Diarization answers the question of “Who speaks when?”, Speech Emotion Diarization answers the question of “Which emotion appears when?”. To facilitate the evaluation of the performance and establish a common benchmark, we introduce the Zaion Emotion Dataset (ZED), an openly accessible speech emotion dataset that includes non-acted emotions recorded in real-life conditions, along with manually annotated boundaries of emotion segments within the utterance. We provide competitive baselines and open-source the code and the pre-trained models.

2023-12-16

2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) (published)

TorchAudio 2.1: Advancing Speech Recognition, Self-Supervised Learning, and Audio Processing Components for Pytorch

Jeff Hwang

Moto Hira

Caroline Chen

Xiaohui Zhang

Zhaoheng Ni

Guangzhi Sun

Pingchuan Ma

Ruizhe Huang

Vineel Pratap

Yuekai Zhang

Anurag Kumar

Chin-Yun Yu

Chuang Zhu

Chunxi Liu

Jacob Kahn

Mirco Ravanelli

Peng Sun

Shinji Watanabe

Yangyang Shi

Yumeng Tao … (see 4 more)

Robin Scheibler

Samuele Cornell

Sean Kim

Stavros Petridis

TorchAudio is an open-source audio and speech processing library built for PyTorch. It aims to accelerate the research and development of au… (see more)dio and speech technologies by providing well-designed, easy-to-use, and performant PyTorch components. Its contributors routinely engage with users to understand their needs and fulfill them by developing impactful features. Here, we survey TorchAudio’s development principles and contents and highlight key features we include in its latest version (2.1): self-supervised learning pre-trained pipelines and training recipes, high-performance CTC decoders, speech recognition models and training recipes, advanced media I/O capabilities, and tools for performing forced alignment, multi-channel speech enhancement, and reference-less speech assessment. For a selection of these features, through empirical studies, we demonstrate their efficacy and show that they achieve competitive or state-of-the-art performance.

2023-12-16

2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) (published)

Parameter-Efficient Transfer Learning of Audio Spectrogram Transformers

Umberto Cappellazzo

Daniele Falavigna

Alessio Brutti

Mirco Ravanelli

Parameter-efficient transfer learning (PETL) methods have emerged as a solid alternative to the standard full fine-tuning approach. They onl… (see more)y train a few extra parameters for each downstream task, without sacrificing performance and dispensing with the issue of storing a copy of the pre-trained model for each task. For audio classification tasks, the Audio Spectrogram Transformer (AST) model shows impressive results. However, surprisingly, how to efficiently adapt it to several downstream tasks has not been tackled before. In this paper, we bridge this gap and present a detailed investigation of common PETL methods for the adaptation of the AST model to audio/speech tasks. Furthermore, we propose a new adapter design that exploits the convolution module of the Conformer model, leading to superior performance over the standard PETL approaches and surpassing or achieving performance parity with full fine-tuning by updating only 0.29% of the parameters. Finally, we provide ablation studies revealing that our proposed adapter: 1) proves to be effective in few-shot efficient transfer learning, 2) attains optimal results regardless of the amount of the allocated parameters, and 3) can be applied to other pre-trained models. Our code is available at https:/github.com/umbertocappellazzo/PETL_AST.

2023-12-06

ArXiv (preprint)

Speech Self-Supervised Representations Benchmarking: a Case for Larger Probing Heads

Youcef Kemiche

Titouan Parcollet

Slim Essid

Mirco Ravanelli

2023-08-28

ArXiv (preprint)