Portrait of Mirco Ravanelli

Mirco Ravanelli

Associate Academic Member
Assistant Professor, Concordia University, Gina Cody School of Engineering and Computer Science
Adjunct Professor, Université de Montréal, Department of Computer Science and Operations Research
Research Topics
Deep Learning

Biography

Mirco Ravanelli is an assistant professor at Concordia University, adjunct professor at Université de Montréal and associate member of Mila – Quebec Artificial Intelligence Institute.

Ravanelli is an expert in deep learning and conversational AI, publishing over sixty papers in these fields. His contributions were honoured with a 2022 Amazon Research Award.

His research focuses primarily on novel deep learning algorithms, including self-supervised, continual, multimodal, cooperative and energy-efficient learning.

Formerly a postdoctoral fellow at Mila under Yoshua Bengio, he founded and now leads SpeechBrain, one of the most extensively used open-source toolkits in the field of speech processing and conversational AI.

Current Students

Master's Research - Concordia University
Collaborating researcher - Concordia University University
Collaborating researcher - Concordia University University
PhD - Concordia University
Co-supervisor :
Undergraduate - Concordia University
Master's Research - Concordia University
PhD - Concordia University
Co-supervisor :
PhD - Concordia University
Collaborating researcher - Concordia University University
Collaborating researcher - Concordia University University
Collaborating Alumni - Université de Montréal
Principal supervisor :
PhD - Université de Montréal
Co-supervisor :
PhD - Concordia University
PhD - Concordia University
Co-supervisor :
Postdoctorate - McGill University
PhD - Université de Montréal
Collaborating researcher - Concordia University University
Master's Research - Concordia University

Publications

TARIC-SLU: A Tunisian Benchmark Dataset for Spoken Language Understanding
Salima Mdhaffar
Fethi Bougares
Renato de Mori
Salah Zaiem
Yannick Estève
In recent years, there has been a significant increase in interest in developing Spoken Language Understanding (SLU) systems. SLU involves e… (see more)xtracting a list of semantic information from the speech signal. A major issue for SLU systems is the lack of sufficient amount of bi-modal (audio and textual semantic annotation) training data. Existing SLU resources are mainly available in high-resource languages such as English, Mandarin and French. However, one of the current challenges concerning low-resourced languages is data collection and annotation. In this work, we present a new freely available corpus, named TARIC-SLU, composed of railway transport conversations in Tunisian dialect that is continuously annotated in dialogue acts and slots. We describe the semantic model of the dataset, the data and experiments conducted to build ASR-based and SLU-based baseline models. To facilitate its use, a complete recipe, including data preparation, training and evaluation scripts, has been built and will be integrated to SpeechBrain, a popular open-source conversational AI toolkit based on PyTorch.
Rescuespeech: A German Corpus for Speech Recognition in Search and Rescue Domain
Sangeet Sagar
Bernd Kiefer
Ivana Kruijff-Korbayová
Josef van Genabith
Despite the recent advancements in speech recognition, there are still difficulties in accurately transcribing conversational and emotional … (see more)speech in noisy and reverberant acoustic environments. This poses a particular challenge in the search and rescue (SAR) domain, where transcribing conversations among rescue team members is crucial to support real-time decision-making. The scarcity of speech data and associated background noise in SAR scenarios make it difficult to deploy robust speech recognition systems.To address this issue, we have created and made publicly available a German speech dataset called RescueSpeech. This dataset includes real speech recordings from simulated rescue exercises. Additionally, we have released competitive training recipes and pre-trained models. Our study highlights that the performance attained by state-of-the-art methods in this challenging scenario is still far from reaching an acceptable level.
Speech Emotion Diarization: Which Emotion Appears When?
Yingzhi Wang
Alaa Nfissi
Alya Yacoubi
Speech Emotion Recognition (SER) typically relies on utterance-level solutions. However, emotions conveyed through speech should be consider… (see more)ed as discrete speech events with definite temporal boundaries, rather than attributes of the entire utterance. To reflect the fine-grained nature of speech emotions and to unify various fine-grained methods under a single objective, we propose a new task: Speech Emotion Diarization (SED). Just as Speaker Diarization answers the question of “Who speaks when?”, Speech Emotion Diarization answers the question of “Which emotion appears when?”. To facilitate the evaluation of the performance and establish a common benchmark, we introduce the Zaion Emotion Dataset (ZED), an openly accessible speech emotion dataset that includes non-acted emotions recorded in real-life conditions, along with manually annotated boundaries of emotion segments within the utterance. We provide competitive baselines and open-source the code and the pre-trained models.
TorchAudio 2.1: Advancing Speech Recognition, Self-Supervised Learning, and Audio Processing Components for Pytorch
Jeff Hwang
Moto Hira
Caroline Chen
Xiaohui Zhang
Zhaoheng Ni
Guangzhi Sun
Pingchuan Ma
Ruizhe Huang
Vineel Pratap
Yuekai Zhang
Anurag Kumar
Chin-Yun Yu
Chuang Zhu
Chunxi Liu
Jacob Kahn
Peng Sun
Shinji Watanabe
Yangyang Shi
Yumeng Tao … (see 4 more)
Robin Scheibler
Samuele Cornell
Sean Kim
Stavros Petridis
TorchAudio is an open-source audio and speech processing library built for PyTorch. It aims to accelerate the research and development of au… (see more)dio and speech technologies by providing well-designed, easy-to-use, and performant PyTorch components. Its contributors routinely engage with users to understand their needs and fulfill them by developing impactful features. Here, we survey TorchAudio’s development principles and contents and highlight key features we include in its latest version (2.1): self-supervised learning pre-trained pipelines and training recipes, high-performance CTC decoders, speech recognition models and training recipes, advanced media I/O capabilities, and tools for performing forced alignment, multi-channel speech enhancement, and reference-less speech assessment. For a selection of these features, through empirical studies, we demonstrate their efficacy and show that they achieve competitive or state-of-the-art performance.
CL-MASR: A Continual Learning Benchmark for Multilingual ASR
Luca Della Libera
Pooneh Mousavi
Salah Zaiem
Speech Self-Supervised Representations Benchmarking: a Case for Larger Probing Heads
Salah Zaiem
Youcef Kemiche
Titouan Parcollet
Slim Essid
Speech Self-Supervised Representation Benchmarking: Are We Doing it Right?
Salah Zaiem
Youcef Kemiche
Titouan Parcollet
Slim Essid
Self-supervised learning (SSL) has recently allowed leveraging large datasets of unlabeled speech signals to reach impressive performance on… (see more) speech tasks using only small amounts of annotated data. The high number of proposed approaches fostered the need and rise of extended benchmarks that evaluate their performance on a set of downstream tasks exploring various aspects of the speech signal. However, and while the number of considered tasks has been growing, most rely upon a single decoding architecture that maps the frozen SSL representations to the downstream labels. This work investigates the robustness of such benchmarking results to changes in the decoder architecture. Interestingly, it appears that varying the architecture of the downstream decoder leads to significant variations in the leaderboards of most tasks. Concerningly, our study reveals that benchmarking using limited decoders may cause a counterproductive increase in the sizes of the developed SSL models.
Generalization Limits of Graph Neural Networks in Identity Effects Learning
Giuseppe Alessio D'inverno
Simone Brugiapaglia
Graph Neural Networks (GNNs) have emerged as a powerful tool for data-driven learning on various graph domains. They are usually based on a … (see more)message-passing mechanism and have gained increasing popularity for their intuitive formulation, which is closely linked to the Weisfeiler-Lehman (WL) test for graph isomorphism to which they have been proven equivalent in terms of expressive power. In this work, we establish new generalization properties and fundamental limits of GNNs in the context of learning so-called identity effects, i.e., the task of determining whether an object is composed of two identical components or not. Our study is motivated by the need to understand the capabilities of GNNs when performing simple cognitive tasks, with potential applications in computational linguistics and chemistry. We analyze two case studies: (i) two-letters words, for which we show that GNNs trained via stochastic gradient descent are unable to generalize to unseen letters when utilizing orthogonal encodings like one-hot representations; (ii) dicyclic graphs, i.e., graphs composed of two cycles, for which we present positive existence results leveraging the connection between GNNs and the WL test. Our theoretical analysis is supported by an extensive numerical study.
Simulated Annealing in Early Layers Leads to Better Generalization
Amir M. Sarfi
Zahra Karimpour
Muawiz Chaudhary
Nasir M. Khalid
Sudhir Mudur
Recently, a number of iterative learning methods have been introduced to improve generalization. These typically rely on training for longer… (see more) periods of time in exchange for improved generalization. LLF (later-layer-forgetting) is a state-of-the-art method in this category. It strengthens learning in early layers by periodically re-initializing the last few layers of the network. Our principal innovation in this work is to use Simulated annealing in EArly Layers (SEAL) of the network in place of re-initialization of later layers. Essentially, later layers go through the normal gradient descent process, while the early layers go through short stints of gradient ascent followed by gradient descent. Extensive experiments on the popular Tiny-ImageNet dataset benchmark and a series of transfer learning and few-shot learning tasks show that we outperform LLF by a significant margin. We further show that, compared to normal training, LLF features, although improving on the target task, degrade the transfer learning performance across all datasets we explored. In comparison, our method outperforms LLF across the same target datasets by a large margin. We also show that the prediction depth of our method is significantly lower than that of LLF and normal training, indicating on average better prediction performance. 11The code to reproduce our results is publicly available at: https://github.com/amiiir-sarfi/SEAL
Fine-Tuning Strategies for Faster Inference Using Speech Self-Supervised Models: A Comparative Study
Salah Zaiem
Robin Algayres
Titouan Parcollet
Slim Essid
Self-supervised learning (SSL) has allowed substantial progress in Automatic Speech Recognition (ASR) performance in low-resource settings. … (see more)In this context, it has been demonstrated that larger self-supervised feature extractors are crucial for achieving lower downstream ASR error rates. Thus, better performance might be sanctioned with longer inferences. This article explores different approaches that may be deployed during the fine-tuning to reduce the computations needed in the SSL encoder, leading to faster inferences. We adapt a number of existing techniques to common ASR settings and benchmark them, displaying performance drops and gains in inference times. Interestingly, we found that given enough downstream data, a simple downsampling of the input sequences outperforms the other methods with both low performance drops and high computational savings, reducing computations by 61.3% with an WER increase of only 0. 81. Finally, we analyze the robustness of the comparison to changes in dataset conditions, revealing sensitivity to dataset size.
Speech Self-Supervised Representation Benchmarking: Are We Doing it Right?
Salah Zaiem
Youcef Kemiche
Titouan Parcollet
Slim Essid
Self-supervised learning (SSL) has recently allowed leveraging large datasets of unlabeled speech signals to reach impressive performance on… (see more) speech tasks using only small amounts of annotated data. The high number of proposed approaches fostered the need and rise of extended benchmarks that evaluate their performance on a set of downstream tasks exploring various aspects of the speech signal. However, and while the number of considered tasks has been growing, most rely upon a single decoding architecture that maps the frozen SSL representations to the downstream labels. This work investigates the robustness of such benchmarking results to changes in the decoder architecture. Interestingly, it appears that varying the architecture of the downstream decoder leads to significant variations in the leaderboards of most tasks. Concerningly, our study reveals that benchmarking using limited decoders may cause a counterproductive increase in the sizes of the developed SSL models.
Posthoc Interpretation via Quantization
Francesco Paissan
In this paper, we introduce a new approach, called Posthoc Interpretation via Quantization (PIQ), for interpreting decisions made by trained… (see more) classifiers. Our method utilizes vector quantization to transform the representations of a classifier into a discrete, class-specific latent space. The class-specific codebooks act as a bottleneck that forces the interpreter to focus on the parts of the input data deemed relevant by the classifier for making a prediction. Our model formulation also enables learning concepts by incorporating the supervision of pretrained annotation models such as state-of-the-art image segmentation models. We evaluated our method through quantitative and qualitative studies involving black-and-white images, color images, and audio. As a result of these studies we found that PIQ generates interpretations that are more easily understood by participants to our user studies when compared to several other interpretation methods in the literature.