Cem (Yusuf) Subakan

eleonora.mancini@mila.quebec

Francesco Paissan

Independent visiting researcher

francesco.paissan@mila.quebec

Master's Research - Université Laval

Co-supervisor:

Pascal Germain

jacob.comeau@mila.quebec

Luca Della Libera

PhD - Concordia University

Principal supervisor:

luca.dellalibera@mila.quebec

Shubham Gupta

PhD - Université Laval

Co-supervisor:

Laurent Charlin

shubham.gupta@mila.quebec

Publications

Listenable Maps for Audio Classifiers

Francesco Paissan

2024-05-01

ICML.cc/2024/Conference (oral)

openreview.net

Focal Modulation Networks for Interpretable Sound Classification

Luca Della Libera

The increasing success of deep neural networks has raised concerns about their inherent black-box nature, posing challenges related to interpretability and trust. While there has been extensive exploration of interpretation techniques in vision and language, interpretability in the audio domain has received limited attention, primarily focusing on post-hoc explanations. This paper addresses the problem of interpretability by-design in the audio domain by utilizing the recently proposed attention-free focal modulation networks (FocalNets). We apply FocalNets to the task of environmental sound classification for the first time and evaluate their interpretability properties on the popular ESC-50 dataset. Our method outperforms a similarly sized vision transformer both in terms of accuracy and interpretability. Furthermore, it is competitive against PIQ, a method specifically designed for post-hoc interpretation in the audio domain.

2024-02-05

ArXiv (preprint)

CL-MASR: A Continual Learning Benchmark for Multilingual ASR

Luca Della Libera

Pooneh Mousavi

Salah Zaiem

2023-10-25

ArXiv (preprint)

Unsupervised Improvement of Audio-Text Cross-Modal Representations

Zhepei Wang

Krishna Subramani

Junkai Wu

Tiago Tavares

Fabio Ayres

Paris Smaragdis

Recent advances in using language models to obtain cross-modal audio-text representations have overcome the limitations of conventional training approaches that use predefined labels. This has allowed the community to make progress in tasks like zero-shot classification, which would otherwise not be possible. However, learning such representations requires a large amount of human-annotated audio-text pairs. In this paper, we study unsupervised approaches to improve the learning framework of such representations with unpaired text and audio. We explore domain-unspecific and domain-specific curation methods to create audio-text pairs that we use to further improve the model. We also show that when domain-specific curation is used in conjunction with a soft-labeled contrastive loss, we are able to obtain significant improvement in terms of zero-shot classification performance on downstream sound event classification or acoustic scene classification tasks.

2023-10-22

2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) (published)

Audio Editing with Non-Rigid Text Prompts

Francesco Paissan

Zhepei Wang

Paris Smaragdis

In this paper, we explore audio-editing with non-rigid text edits. We show that the proposed editing pipeline is able to create audio edits that remain faithful to the input audio. We explore text prompts that perform addition, style transfer, and in-painting. We quantitatively and qualitatively show that the edits are able to obtain results which outperform Audio-LDM, a recently released text-prompted audio generation model. Qualitative inspection of the results points out that the edits given by our approach remain more faithful to the input audio in terms of keeping the original onsets and offsets of the audio events.

2023-10-19

ArXiv (preprint)

Self-Supervised Learning for Infant Cry Analysis

Arsenii Gorin

Sajjad Abdoli

Junhao Wang

Samantha Latremouille

Charles Onu

In this paper, we explore self-supervised learning (SSL) for analyzing a first-of-its-kind database of cry recordings containing clinical indications of more than a thousand newborns. Specifically, we target cry-based detection of neurological injury as well as identification of cry triggers such as pain, hunger, and discomfort. Annotating a large database in the medical setting is expensive and time-consuming, typically requiring the collaboration of several experts over years. Leveraging large amounts of unlabeled audio data to learn useful representations can lower the cost of building robust models and, ultimately, clinical solutions. In this work, we experiment with self-supervised pre-training of a convolutional neural network on large audio datasets. We show that pre-training with SSL contrastive loss (SimCLR) performs significantly better than supervised pre-training for both neuro injury and cry triggers. In addition, we demonstrate further performance gains through SSL-based domain adaptation using unlabeled infant cries. We also show that using such SSL-based pre-training for adaptation to cry sounds decreases the need for labeled data of the overall system.

2023-06-04

2023 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW) (published)

CryCeleb: A Speaker Verification Dataset Based on Infant Cry Sounds

David Budaghyan

Arsenii Gorin

Charles Onu

This paper describes the Ubenwa CryCeleb dataset - a labeled collection of infant cries - and the accompanying CryCeleb 2023 task, which is a public speaker verification challenge based on cry sounds. We released more than 6 hours of manually segmented cry sounds from 786 newborns for academic use, aiming to encourage research in infant cry analysis. The inaugural public competition attracted 59 participants, 11 of whom improved the baseline performance. The top-performing system achieved a significant improvement, scoring a 25.8% equal error rate, which is still far from the performance of state-of-the-art adult speaker verification systems. Therefore, we believe there is room for further research on this dataset, potentially extending beyond the verification task.

2023-05-01

ArXiv (preprint)

Posthoc Interpretation via Quantization

Francesco Paissan

In this paper, we introduce a new approach, called Posthoc Interpretation via Quantization (PIQ), for interpreting decisions made by trained classifiers. Our method utilizes vector quantization to transform the representations of a classifier into a discrete, class-specific latent space. The class-specific codebooks act as a bottleneck that forces the interpreter to focus on the parts of the input data deemed relevant by the classifier for making a prediction. Our model formulation also enables learning concepts by incorporating the supervision of pretrained annotation models such as state-of-the-art image segmentation models. We evaluated our method through quantitative and qualitative studies involving black-and-white images, color images, and audio. As a result of these studies, we found that PIQ generates interpretations that are more easily understood by participants in our user studies when compared to several other interpretation methods in the literature.

2023-03-22

ArXiv (preprint)

Exploring Self-Attention Mechanisms for Speech Separation

Samuele Cornell

François Grondin

Mirko Bronzi

Transformers have enabled impressive improvements in deep learning. They often outperform recurrent and convolutional models in many tasks while taking advantage of parallel processing. Recently, we proposed the SepFormer, which obtains state-of-the-art performance in speech separation with the WSJ0-2/3 Mix datasets. This paper studies Transformers for speech separation in depth. In particular, we extend our previous findings on the SepFormer by providing results on more challenging noisy and noisy-reverberant datasets, such as LibriMix, WHAM!, and WHAMR!. Moreover, we extend our model to perform speech enhancement and provide experimental evidence on denoising and dereverberation tasks. Finally, we investigate, for the first time in speech separation, the use of efficient self-attention mechanisms such as Linformers, Longformers, and Reformers. We found that they reduce memory requirements significantly. For example, we show that the Reformer-based attention outperforms the popular Conv-TasNet model on the WSJ0-2Mix dataset while being faster at inference and comparable in terms of memory consumption.

2023-01-01

IEEE/ACM Transactions on Audio, Speech, and Language Processing (published)

Real-M: Towards Speech Separation on Real Mixtures

Samuele Cornell

François Grondin

In recent years, deep learning based source separation has achieved impressive results. Most studies, however, still evaluate separation models on synthetic datasets, while the performance of state-of-the-art techniques on in-the-wild speech data remains an open question. This paper helps fill this gap in two ways. First, we release the REAL-M dataset, a crowd-sourced corpus of real-life mixtures. Second, we address the problem of performance evaluation of real-life mixtures, where the ground truth is not available. We bypass this issue by carefully designing a blind Scale-Invariant Signal-to-Noise Ratio (SI-SNR) neural estimator. Through a user study, we show that our estimator reliably evaluates the separation performance on real mixtures, i.e., we observe that the performance predictions of the SI-SNR estimator correlate well with human opinions. Moreover, when evaluating popular speech separation models, we observe that the performance trends predicted by our estimator on the REAL-M dataset closely follow the performance trends achieved on synthetic benchmarks.

2022-05-23

ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (published)

Learning Representations for New Sound Classes With Continual Self-Supervised Learning

Zhepei Wang

Xilin Jiang

Junkai Wu

Efthymios Tzinis

Paris Smaragdis

In this article, we work on a sound recognition system that continually incorporates new sound classes. Our main goal is to develop a framework where the model can be updated without relying on labeled data. For this purpose, we propose adopting representation learning, where an encoder is trained using unlabeled data. This learning framework enables the study and implementation of a practically relevant use case where only a small number of labels is available in a continual learning context. We also make the empirical observation that a similarity-based representation learning method within this framework is robust to forgetting even if no explicit mechanism against forgetting is employed. We show that this approach obtains similar performance compared to several distillation-based continual learning methods when employed on self-supervised representation learning methods.

2022-05-15

ArXiv (preprint)

Learning Representations for New Sound Classes With Continual Self-Supervised Learning

Zhepei Wang

Xilin Jiang

Junkai Wu

Efthymios Tzinis

Paris Smaragdis

2022-01-01

IEEE Signal Processing Letters (published)