Luca Della Libera

PhD - Concordia University
Supervisor
Co-supervisor
Research Topics
Deep Learning

Publications

Open-Source Conversational AI with SpeechBrain 1.0
Adel Moumen
Sylvain de Langen
Yingzhi Wang
Zeyu Zhao
Shucong Zhang
Georgios Karakasidis
Pierre Champion
Aku Rouhe
Rudolf Braun
Florian Mai
Juan Zuluaga-Gomez
Seyed Mahed Mousavi
Andreas Nautsch
Ha Nguyen
Xuechen Liu
Sangeet Sagar
Jarod Duret
Salima Mdhaffar
Gaëlle Laperrière
Mickael Rouvier
Yannick Estève
SpeechBrain is an open-source Conversational AI toolkit based on PyTorch, focused particularly on speech processing tasks such as speech recognition, speech enhancement, speaker recognition, text-to-speech, and much more. It promotes transparency and replicability by releasing both the pre-trained models and the complete "recipes" of code and algorithms required for training them. This paper presents SpeechBrain 1.0, a significant milestone in the evolution of the toolkit, which now has over 200 recipes for speech, audio, and language processing tasks, and more than 100 models available on Hugging Face. SpeechBrain 1.0 introduces new technologies to support diverse learning modalities, Large Language Model (LLM) integration, and advanced decoding strategies, along with novel models, tasks, and modalities. It also includes a new benchmark repository, offering researchers a unified platform for evaluating models across diverse tasks.
DASB -- Discrete Audio and Speech Benchmark
Jarod Duret
Yusuf Cem Sübakan
Mirco Ravanelli
Discrete audio tokens have recently gained considerable attention for their potential to connect audio and language processing, enabling the creation of modern multimodal large language models. Ideal audio tokens must effectively preserve phonetic and semantic content along with paralinguistic information, speaker identity, and other details. While several types of audio tokens have been recently proposed, identifying the optimal tokenizer for various tasks is challenging due to the inconsistent evaluation settings in existing studies. To address this gap, we release the Discrete Audio and Speech Benchmark (DASB), a comprehensive leaderboard for benchmarking discrete audio tokens across a wide range of discriminative tasks, including speech recognition, speaker identification and verification, emotion recognition, keyword spotting, and intent classification, as well as generative tasks such as speech enhancement, separation, and text-to-speech. Our results show that, on average, semantic tokens outperform compression tokens across most discriminative and generative tasks. However, the performance gap between semantic tokens and standard continuous representations remains substantial, highlighting the need for further research in this field.
Focal Modulation Networks for Interpretable Sound Classification
The increasing success of deep neural networks has raised concerns about their inherent black-box nature, posing challenges related to interpretability and trust. While there has been extensive exploration of interpretation techniques in vision and language, interpretability in the audio domain has received limited attention, primarily focusing on post-hoc explanations. This paper addresses the problem of interpretability by-design in the audio domain by utilizing the recently proposed attention-free focal modulation networks (FocalNets). We apply FocalNets to the task of environmental sound classification for the first time and evaluate their interpretability properties on the popular ESC-50 dataset. Our method outperforms a similarly sized vision transformer both in terms of accuracy and interpretability. Furthermore, it is competitive against PIQ, a method specifically designed for post-hoc interpretation in the audio domain.
Resource-Efficient Separation Transformer
Samuele Cornell
Frédéric Lepoutre
François Grondin
Transformers have recently achieved state-of-the-art performance in speech separation. These models, however, are computationally demanding and require a lot of learnable parameters. This paper explores Transformer-based speech separation with a reduced computational cost. Our main contribution is the development of the Resource-Efficient Separation Transformer (RE-SepFormer), a self-attention-based architecture that reduces the computational burden in two ways. First, it uses non-overlapping blocks in the latent space. Second, it operates on compact latent summaries calculated from each chunk. The RE-SepFormer reaches a competitive performance on the popular WSJ0-2Mix and WHAM! datasets in both causal and non-causal settings. Remarkably, it scales significantly better than the previous Transformer-based architectures in terms of memory and inference time, making it more suitable for processing long mixtures.
CL-MASR: A Continual Learning Benchmark for Multilingual ASR
Yusuf Cem Sübakan
Mirco Ravanelli
How Should We Extract Discrete Audio Tokens from Self-Supervised Models?
Jarod Duret
Yusuf Cem Sübakan
Mirco Ravanelli