Mirco Ravanelli

Dehestani Amirali

Collaborating researcher - Concordia University

Seina Assadian

Collaborating researcher - Concordia University

Cordelle Briac

Collaborating researcher - Concordia University

Gallegati Caterina

Research Intern - Concordia University

Victor Cruz

Master's Research - Concordia University

Luca Della Libera

PhD - Concordia University

Co-supervisor:

Wagner Drew

Master's Research - Concordia University

Co-supervisor:

Irina Rish

Gianfranco Dumoulin Bertucci

Master's Research - Concordia University

nadine.el-mufti@mila.quebec

Nadine El-Mufti

Master's Research - Concordia University

Maab Elrashid Ahmed Mohamed

PhD - Concordia University

Co-supervisor:

Bonzi Francesco

PhD - Concordia University

Alessio Giuseppe Alessio

Collaborating researcher - International School for Advanced Studies (Trieste, Italy)

Salman Sami Hussain Ali

Collaborating researcher - Concordia University

SpeechBrain 1.0: Making Conversational AI Accessible to Everyone

Eleonora Mancini

Collaborating Alumni - Université de Montréal

Principal supervisor:

PhD - Université de Montréal

Co-supervisor:

PhD - Concordia University

PhD - Concordia University

Co-supervisor:

Peter Peter

Postdoctorate - McGill University

PhD - Université de Montréal

Blog Posts

June 13, 2024

Mirco Ravanelli

April 28, 2021

Introducing SpeechBrain: A General-Purpose PyTorch Speech Processing Toolkit

Mirco Ravanelli

Loren Lugosch

Publications

Open-Source Conversational AI with SpeechBrain 1.0

Titouan Parcollet

Adel Moumen

Sylvain de Langen

Peter William VanHarn Plantinga

Yingzhi Wang

Pooneh Mousavi

Luca Della Libera

Artem Ploujnikov

Francesco Paissan

Davide Borra

Salah Zaiem

Zeyu Zhao

Shucong Zhang

Georgios Karakasidis

Sung-Lin Yeh

Pierre Champion

Aku Rouhe

Rudolf Braun

Florian Mai

Juan Pablo Zuluaga

Seyed Mahed Mousavi

Andreas Nautsch

Xuechen Liu

Sangeet Sagar

Jarod Duret

Salima Mdhaffar

G. Laperriere

Renato De Mori

Yannick Estève

SpeechBrain is an open-source Conversational AI toolkit based on PyTorch, focused particularly on speech processing tasks such as speech recognition, speech enhancement, speaker recognition, text-to-speech, and much more. It promotes transparency and replicability by releasing both the pre-trained models and the complete "recipes" of code and algorithms required for training them. This paper presents SpeechBrain 1.0, a significant milestone in the evolution of the toolkit, which now has over 200 recipes for speech, audio, and language processing tasks, and more than 100 models available on Hugging Face. SpeechBrain 1.0 introduces new technologies to support diverse learning modalities, Large Language Model (LLM) integration, and advanced decoding strategies, along with novel models, tasks, and modalities. It also includes a new benchmark repository, offering researchers a unified platform for evaluating models across diverse tasks.

2024-06-29

ArXiv (preprint)

DASB -- Discrete Audio and Speech Benchmark

Pooneh Mousavi

Luca Della Libera

Jarod Duret

Artem Ploujnikov

Discrete audio tokens have recently gained considerable attention for their potential to connect audio and language processing, enabling the creation of modern multimodal large language models. Ideal audio tokens must effectively preserve phonetic and semantic content along with paralinguistic information, speaker identity, and other details. While several types of audio tokens have been recently proposed, identifying the optimal tokenizer for various tasks is challenging due to the inconsistent evaluation settings in existing studies. To address this gap, we release the Discrete Audio and Speech Benchmark (DASB), a comprehensive leaderboard for benchmarking discrete audio tokens across a wide range of discriminative tasks, including speech recognition, speaker identification and verification, emotion recognition, keyword spotting, and intent classification, as well as generative tasks such as speech enhancement, separation, and text-to-speech. Our results show that, on average, semantic tokens outperform compression tokens across most discriminative and generative tasks. However, the performance gap between semantic tokens and standard continuous representations remains substantial, highlighting the need for further research in this field.

2024-06-20

ArXiv (preprint)

openreview.net

How Should We Extract Discrete Audio Tokens from Self-Supervised Models?

Pooneh Mousavi

Jarod Duret

Salah Zaiem

Luca Della Libera

Artem Ploujnikov

Discrete audio tokens have recently gained attention for their potential to bridge the gap between audio and language processing. Ideal audio tokens must preserve content, paralinguistic elements, speaker identity, and many other audio details. Current audio tokenization methods fall into two categories: semantic tokens, acquired through quantization of Self-Supervised Learning (SSL) models, and neural compression-based tokens (codecs). Although previous studies have benchmarked codec models to identify optimal configurations, the ideal setup for quantizing pretrained SSL models remains unclear. This paper explores the optimal configuration of semantic tokens across discriminative and generative tasks. We propose a scalable solution to train a universal vocoder across multiple SSL layers. Furthermore, an attention mechanism is employed to identify task-specific influential layers, enhancing the adaptability and performance of semantic tokens in diverse audio applications.

2024-06-15

ArXiv (preprint)

Phoneme Discretized Saliency Maps for Explainable Detection of AI-Generated Voice

Shubham Gupta

Pascal Germain

2024-06-14

ArXiv (preprint)

Focal Modulation Networks for Interpretable Sound Classification

Luca Della Libera

The increasing success of deep neural networks has raised concerns about their inherent black-box nature, posing challenges related to interpretability and trust. While there has been extensive exploration of interpretation techniques in vision and language, interpretability in the audio domain has received limited attention, primarily focusing on post-hoc explanations. This paper addresses the problem of interpretability by-design in the audio domain by utilizing the recently proposed attention-free focal modulation networks (FocalNets). We apply FocalNets to the task of environmental sound classification for the first time and evaluate their interpretability properties on the popular ESC-50 dataset. Our method outperforms a similarly sized vision transformer both in terms of accuracy and interpretability. Furthermore, it is competitive against PIQ, a method specifically designed for post-hoc interpretation in the audio domain.

2024-04-14

2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW) (published)

Resource-Efficient Separation Transformer

Luca Della Libera

Samuele Cornell

Frédéric Lepoutre

François Grondin

Transformers have recently achieved state-of-the-art performance in speech separation. These models, however, are computationally demanding and require a lot of learnable parameters. This paper explores Transformer-based speech separation with a reduced computational cost. Our main contribution is the development of the Resource-Efficient Separation Transformer (RE-SepFormer), a self-attention-based architecture that reduces the computational burden in two ways. First, it uses non-overlapping blocks in the latent space. Second, it operates on compact latent summaries calculated from each chunk. The RE-SepFormer reaches a competitive performance on the popular WSJ0-2Mix and WHAM! datasets in both causal and non-causal settings. Remarkably, it scales significantly better than the previous Transformer-based architectures in terms of memory and inference time, making it more suitable for processing long mixtures.

2024-04-14

ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (published)

SKILL: Similarity-aware Knowledge distILLation for Speech Self-Supervised Learning

Luca Zampierin

Ghouthi Boukli Hacene

Bac Nguyen

2024-04-14

2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW) (published)

Bayesian Deep Learning for Remaining Useful Life Estimation via Stein Variational Gradient Descent

Luca Della Libera

Jacopo Andreoli

Davide Dalle Pezze

Gian Antonio Susto

A crucial task in predictive maintenance is estimating the remaining useful life of physical systems. In the last decade, deep learning has improved considerably upon traditional model-based and statistical approaches in terms of predictive performance. However, in order to optimally plan maintenance operations, it is also important to quantify the uncertainty inherent to the predictions. This issue can be addressed by turning standard frequentist neural networks into Bayesian neural networks, which are naturally capable of providing confidence intervals around the estimates. Several methods exist for training those models. Researchers have focused mostly on parametric variational inference and sampling-based techniques, which notoriously suffer from limited approximation power and large computational burden, respectively. In this work, we use Stein variational gradient descent, a recently proposed algorithm for approximating intractable distributions that overcomes the drawbacks of the aforementioned techniques. In particular, we show through experimental studies on simulated run-to-failure turbofan engine degradation data that Bayesian deep learning models trained via Stein variational gradient descent consistently outperform with respect to convergence speed and predictive performance both the same models trained via parametric variational inference and their frequentist counterparts trained via backpropagation. Furthermore, we propose a method to enhance performance based on the uncertainty information provided by the Bayesian models. We release the source code at https://github.com/lucadellalib/bdl-rul-svgd.

2024-02-02

ArXiv (preprint)

Towards Foundational Models for Molecular Learning on Large-Scale Multi-Task Datasets

Dominique Beaini

Shenyang Huang

Joao Alex Cunha

Zhiyi Li

Gabriela Moisescu-Pareja

Oleksandr Dymov

Samuel Maddrell-Mander

Callum McLean

Frederik Wenkel

Luis Müller

Jama Hussein Mohamud

Ali Parviz

Michael Craig

Michał Koziarski

Jiarui Lu

Zhaocheng Zhu

Cristian Gabellini

Kerstin Klaser

Josef Dean

Cas Wognum

Maciej Sypetkowski

Guillaume Rabusseau

Reihaneh Rabbany

Jian Tang

Christopher Morris

Ioannis Koutis

Guy Wolf

Prudencio Tossou

Hadrien Mary

Therence Bois

Andrew William Fitzgibbon

Blazej Banaszewski

Chad Martin

Dominic Masters

Recently, pre-trained foundation models have enabled significant advancements in multiple fields. In molecular machine learning, however, where datasets are often hand-curated, and hence typically small, the lack of datasets with labeled features, and codebases to manage those datasets, has hindered the development of foundation models. In this work, we present seven novel datasets categorized by size into three distinct categories: ToyMix, LargeMix and UltraLarge. These datasets push the boundaries in both the scale and the diversity of supervised labels for molecular learning. They cover nearly 100 million molecules and over 3000 sparsely defined tasks, totaling more than 13 billion individual labels of both quantum and biological nature. In comparison, our datasets contain 300 times more data points than the widely used OGB-LSC PCQM4Mv2 dataset, and 13 times more than the quantum-only QM1B dataset. In addition, to support the development of foundational models based on our proposed datasets, we present the Graphium graph machine learning library which simplifies the process of building and training molecular machine learning models for multi-task and multi-level molecular datasets. Finally, we present a range of baseline results as a starting point of multi-task and multi-level training on these datasets. Empirically, we observe that performance on low-resource biological datasets shows improvement by also training on large amounts of quantum data. This indicates that there may be potential in multi-task and multi-level training of a foundation model and fine-tuning it to resource-constrained downstream tasks. The Graphium library is publicly available on GitHub and the dataset links are available in Part 1 and Part 2.

2024-01-16

ICLR.cc/2024/Conference (poster)

openreview.net

Are LLMs Robust for Spoken Dialogues?

Seyed Mahed Mousavi

Gabriel Roccabruna

Simone Alghisi

Massimo Rizzoli

Giuseppe Riccardi

Large Pre-Trained Language Models have demonstrated state-of-the-art performance in different downstream tasks, including dialogue state tracking and end-to-end response generation. Nevertheless, most of the publicly available datasets and benchmarks on task-oriented dialogues focus on written conversations. Consequently, the robustness of the developed models to spoken interactions is unknown. In this work, we have evaluated the performance of LLMs for spoken task-oriented dialogues on the DSTC11 test sets. Due to the lack of proper spoken dialogue datasets, we have automatically transcribed a development set of spoken dialogues with a state-of-the-art ASR engine. We have characterized the ASR-error types and their distributions and simulated these errors in a large dataset of dialogues. We report the intrinsic (perplexity) and extrinsic (human evaluation) performance of fine-tuned GPT-2 and T5 models in two subtasks of response generation and dialogue state tracking, respectively. The results show that LLMs are not robust to spoken noise by default; however, fine-tuning/training such models on a proper dataset of spoken TODs can result in a more robust performance.

2024-01-04

ArXiv (preprint)

Adaptation Odyssey in LLMs: Why Does Additional Pretraining Sometimes Fail to Improve?

Firat Oncel

Matthias Bethge

Beyza Ermis

Çağatay Yıldız

2024-01-01

EMNLP (published)