Cem Subakan

Unsupervised Improvement of Audio-Text Cross-Modal Representations

Zhepei Wang

Krishna Subramani

Junkai Wu

Tiago Tavares

Fabio Ayres

Paris Smaragdis

Recent advances in using language models to obtain cross-modal audio-text representations have overcome the limitations of conventional training approaches that use predefined labels. This has allowed the community to make progress in tasks like zero-shot classification, which would otherwise not be possible. However, learning such representations requires a large amount of human-annotated audio-text pairs. In this paper, we study unsupervised approaches to improve the learning framework of such representations with unpaired text and audio. We explore domain-unspecific and domain-specific curation methods to create audio-text pairs that we use to further improve the model. We also show that when domain-specific curation is used in conjunction with a soft-labeled contrastive loss, we are able to obtain significant improvement in terms of zero-shot classification performance on downstream sound event classification or acoustic scene classification tasks.

2023-10-22

2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) (published)

doi.org

arxiv.org

Self-Supervised Learning for Infant Cry Analysis

Arsenii Gorin

Cem Subakan

Sajjad Abdoli

Junhao Wang

Samantha Latremouille

Charles Onu

In this paper, we explore self-supervised learning (SSL) for analyzing a first-of-its-kind database of cry recordings containing clinical indications of more than a thousand newborns. Specifically, we target cry-based detection of neurological injury as well as identification of cry triggers such as pain, hunger, and discomfort. Annotating a large database in the medical setting is expensive and time-consuming, typically requiring the collaboration of several experts over years. Leveraging large amounts of unlabeled audio data to learn useful representations can lower the cost of building robust models and, ultimately, clinical solutions. In this work, we experiment with self-supervised pre-training of a convolutional neural network on large audio datasets. We show that pre-training with SSL contrastive loss (SimCLR) performs significantly better than supervised pre-training for both neuro injury and cry triggers. In addition, we demonstrate further performance gains through SSL-based domain adaptation using unlabeled infant cries. We also show that using such SSL-based pre-training for adaptation to cry sounds decreases the need for labeled data of the overall system.

2023-06-04

2023 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW) (published)

doi.org

arxiv.org
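
The abstract above names the SimCLR contrastive loss. As a minimal sketch of that loss (commonly called NT-Xent), assuming two lists of non-zero embedding vectors where `view_a[i]` and `view_b[i]` come from two augmentations of the same recording: the function names, the temperature value, and the toy embeddings are illustrative assumptions, not details taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent(view_a, view_b, temperature=0.5):
    """Average NT-Xent loss over all 2N embeddings of a batch.

    view_a[i] and view_b[i] are treated as a positive pair; every other
    embedding in the batch acts as a negative for both of them.
    """
    z = view_a + view_b
    n = len(view_a)
    total = 0.0
    for i in range(2 * n):
        j = (i + n) % (2 * n)  # index of the positive partner of i
        positive = cosine(z[i], z[j]) / temperature
        # The denominator sums over every embedding except i itself.
        logits = [cosine(z[i], z[k]) / temperature for k in range(2 * n) if k != i]
        total += math.log(sum(math.exp(l) for l in logits)) - positive
    return total / (2 * n)


# Toy embeddings (assumed): aligned views agree, mismatched views are swapped.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = nt_xent(anchors, [[1.0, 0.0], [0.0, 1.0]])
mismatched_loss = nt_xent(anchors, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is lower when the two views of each item agree, which is the behaviour the pre-training objective rewards.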

Posthoc Interpretation via Quantization

Cem Subakan

Francesco Paissan

Mirco Ravanelli

In this paper, we introduce a new approach, called Posthoc Interpretation via Quantization (PIQ), for interpreting decisions made by trained classifiers. Our method utilizes vector quantization to transform the representations of a classifier into a discrete, class-specific latent space. The class-specific codebooks act as a bottleneck that forces the interpreter to focus on the parts of the input data deemed relevant by the classifier for making a prediction. Our model formulation also enables learning concepts by incorporating the supervision of pretrained annotation models such as state-of-the-art image segmentation models. We evaluated our method through quantitative and qualitative studies involving black-and-white images, color images, and audio. As a result of these studies, we found that PIQ generates interpretations that are more easily understood by participants in our user studies when compared to several other interpretation methods in the literature.

2023-03-22

ArXiv (preprint)

doi.org

arxiv.org
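
The PIQ abstract above relies on vector quantization with class-specific codebooks. The sketch below illustrates only the generic lookup step, replacing each latent vector with its nearest entry in the codebook of the predicted class. The codebooks, class names, and function names are hypothetical and hand-written here; in a trained system the codebooks would be learned, and this is not the authors' implementation.

```python
def nearest_code(vector, codebook):
    """Return (index, code) of the entry closest to `vector` in squared Euclidean distance."""
    best_index = min(
        range(len(codebook)),
        key=lambda i: sum((v - c) ** 2 for v, c in zip(vector, codebook[i])),
    )
    return best_index, codebook[best_index]


def quantize(vectors, predicted_class, codebooks):
    """Quantize each vector using only the codebook of the predicted class."""
    codebook = codebooks[predicted_class]
    return [nearest_code(v, codebook)[1] for v in vectors]


# Hypothetical two-entry codebooks for two classes.
codebooks = {
    "cat": [[0.0, 0.0], [1.0, 1.0]],
    "dog": [[5.0, 5.0], [-5.0, -5.0]],
}
latents = [[0.2, -0.1], [0.9, 1.2]]
quantized_as_cat = quantize(latents, "cat", codebooks)
quantized_as_dog = quantize(latents, "dog", codebooks)
```

Because the lookup is restricted to one class's codebook, the same latents map to different discrete codes depending on the classifier's decision, which is the bottleneck effect the abstract describes.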

Exploring Self-Attention Mechanisms for Speech Separation

Cem Subakan

Mirco Ravanelli

Samuele Cornell

François Grondin

Mirko Bronzi

Transformers have enabled impressive improvements in deep learning. They often outperform recurrent and convolutional models in many tasks while taking advantage of parallel processing. Recently, we proposed the SepFormer, which obtains state-of-the-art performance in speech separation with the WSJ0-2/3 Mix datasets. This paper studies Transformers for speech separation in depth. In particular, we extend our previous findings on the SepFormer by providing results on more challenging noisy and noisy-reverberant datasets, such as LibriMix, WHAM!, and WHAMR!. Moreover, we extend our model to perform speech enhancement and provide experimental evidence on denoising and dereverberation tasks. Finally, we investigate, for the first time in speech separation, the use of efficient self-attention mechanisms such as Linformers, Longformers, and Reformers. We found that they reduce memory requirements significantly. For example, we show that the Reformer-based attention outperforms the popular Conv-TasNet model on the WSJ0-2Mix dataset while being faster at inference and comparable in terms of memory consumption.

2023-01-01

IEEE/ACM Transactions on Audio, Speech, and Language Processing (published)

doi.org

arxiv.org

Real-M: Towards Speech Separation on Real Mixtures

Cem Subakan

Mirco Ravanelli

Samuele Cornell

François Grondin

In recent years, deep learning based source separation has achieved impressive results. Most studies, however, still evaluate separation models on synthetic datasets, while the performance of state-of-the-art techniques on in-the-wild speech data remains an open question. This paper helps fill this gap in two ways. First, we release the REAL-M dataset, a crowd-sourced corpus of real-life mixtures. Second, we address the problem of performance evaluation of real-life mixtures, where the ground truth is not available. We bypass this issue by carefully designing a blind Scale-Invariant Signal-to-Noise Ratio (SI-SNR) neural estimator. Through a user study, we show that our estimator reliably evaluates the separation performance on real mixtures, i.e., we observe that the performance predictions of the SI-SNR estimator correlate well with human opinions. Moreover, when evaluating popular speech separation models, we observe that the performance trends predicted by our estimator on the REAL-M dataset closely follow the performance trends achieved on synthetic benchmarks.

2022-05-23

ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (published)

doi.org

arxiv.org
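
For readers unfamiliar with the metric named in the REAL-M abstract, here is a minimal sketch of the standard reference-based SI-SNR, the quantity a blind estimator would have to predict without access to the ground truth. It assumes two equal-length signals given as plain Python lists; the function name, the `eps` stabiliser, and the toy signals are illustrative assumptions and do not come from the paper.

```python
import math


def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-noise ratio in dB for two equal-length signals."""
    if len(estimate) != len(reference):
        raise ValueError("signals must have the same length")
    n = len(reference)
    # Remove the mean so that a constant offset does not affect the score.
    est_mean = sum(estimate) / n
    ref_mean = sum(reference) / n
    est = [x - est_mean for x in estimate]
    ref = [x - ref_mean for x in reference]
    # Project the estimate onto the reference to obtain the target component.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref)
    scale = dot / (ref_energy + eps)
    target = [scale * r for r in ref]
    # Whatever remains after the projection is treated as noise.
    noise = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(v * v for v in noise)
    return 10.0 * math.log10((target_energy + eps) / (noise_energy + eps))


# Toy signals (assumed): the interference is orthogonal to the reference
# and carries 1/100 of its energy, so the score is close to 20 dB.
reference = [1.0, -1.0, 1.0, -1.0]
interference = [1.0, 1.0, -1.0, -1.0]
estimate = [r + 0.1 * i for r, i in zip(reference, interference)]
score = si_snr(estimate, reference)
rescaled_score = si_snr([3.0 * x for x in estimate], reference)
```

Rescaling the estimate leaves the score essentially unchanged, which is what "scale-invariant" refers to.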

Learning Representations for New Sound Classes With Continual Self-Supervised Learning

Zhepei Wang

Cem Subakan

Xilin Jiang

Junkai Wu

Efthymios Tzinis

Mirco Ravanelli

Paris Smaragdis

In this article, we work on a sound recognition system that continually incorporates new sound classes. Our main goal is to develop a framework where the model can be updated without relying on labeled data. For this purpose, we propose adopting representation learning, where an encoder is trained using unlabeled data. This learning framework enables the study and implementation of a practically relevant use case where only a small number of labels are available in a continual learning context. We also make the empirical observation that a similarity-based representation learning method within this framework is robust to forgetting even if no explicit mechanism against forgetting is employed. We show that this approach obtains similar performance compared to several distillation-based continual learning methods when employed on self-supervised representation learning methods.

2022-05-15

ArXiv (preprint)

doi.org

arxiv.org

Learning Representations for New Sound Classes With Continual Self-Supervised Learning

Zhepei Wang

Cem Subakan

Xilin Jiang

Junkai Wu

Efthymios Tzinis

Mirco Ravanelli

Paris Smaragdis

In this article, we work on a sound recognition system that continually incorporates new sound classes. Our main goal is to develop a framework where the model can be updated without relying on labeled data. For this purpose, we propose adopting representation learning, where an encoder is trained using unlabeled data. This learning framework enables the study and implementation of a practically relevant use case where only a small number of labels are available in a continual learning context. We also make the empirical observation that a similarity-based representation learning method within this framework is robust to forgetting even if no explicit mechanism against forgetting is employed. We show that this approach obtains similar performance compared to several distillation-based continual learning methods when employed on self-supervised representation learning methods.

2022-01-01

IEEE Signal Processing Letters (published)

doi.org

arxiv.org

Real-M: Towards Speech Separation on Real Mixtures

Cem Subakan

Mirco Ravanelli

Samuele Cornell

François Grondin

In recent years, deep learning based source separation has achieved impressive results. Most studies, however, still evaluate separation models on synthetic datasets, while the performance of state-of-the-art techniques on in-the-wild speech data remains an open question. This paper helps fill this gap in two ways. First, we release the REAL-M dataset, a crowd-sourced corpus of real-life mixtures. Second, we address the problem of performance evaluation of real-life mixtures, where the ground truth is not available. We bypass this issue by carefully designing a blind Scale-Invariant Signal-to-Noise Ratio (SI-SNR) neural estimator. Through a user study, we show that our estimator reliably evaluates the separation performance on real mixtures, i.e., we observe that the performance predictions of the SI-SNR estimator correlate well with human opinions. Moreover, when evaluating popular speech separation models, we observe that the performance trends predicted by our estimator on the REAL-M dataset closely follow the performance trends achieved on synthetic benchmarks.

2021-10-20

ArXiv (preprint)

doi.org

arxiv.org

SpeechBrain: A General-Purpose Speech Toolkit

Mirco Ravanelli

Titouan Parcollet

Peter William VanHarn Plantinga

Aku Rouhe

Samuele Cornell

Loren Lugosch

Cem Subakan

Nauman Dawalatabad

Abdelwahab HEBA

Jianyuan Zhong

Ju-Chieh Chou

Sung-Lin Yeh

Szu-Wei Fu

Chien-Feng Liao

Elena Rastorgueva

François Grondin

William Aris

Hwidong Na

Yan Gao

Renato De Mori

Yoshua Bengio

SpeechBrain is an open-source and all-in-one speech toolkit. It is designed to facilitate the research and development of neural speech processing technologies by being simple, flexible, user-friendly, and well-documented. This paper describes the core architecture designed to support several tasks of common interest, allowing users to naturally conceive, compare and share novel speech processing pipelines. SpeechBrain achieves competitive or state-of-the-art performance in a wide range of speech benchmarks. It also provides training recipes, pretrained models, and inference scripts for popular speech datasets, as well as tutorials which allow anyone with basic Python proficiency to familiarize themselves with speech technologies.

2021-06-08

ArXiv (preprint)

arxiv.org

On the Effectiveness of Two-Step Learning for Latent-Variable Models

Cem Subakan

Maxime Gasse

Laurent Charlin

Latent-variable generative models offer a principled solution for modeling and sampling from complex probability distributions. Implementing a joint training objective with a complex prior, however, can be a tedious task, as one is typically required to derive and code a specific cost function for each new type of prior distribution. In this work, we propose a general framework for learning latent variable generative models in a two-step fashion. In the first step of the framework, we train an autoencoder, and in the second step we fit a prior model on the resulting latent distribution. This two-step approach offers a convenient alternative to joint training, as it allows for a straightforward combination of existing models without the hassle of deriving new cost functions or coding the joint training objectives. Through a set of experiments, we demonstrate that two-step learning results in performances similar to joint training, and in some cases even results in more accurate modeling.

2020-01-01

2020 IEEE 30th International Workshop on Machine Learning for Signal Processing (MLSP) (published)

doi.org

Continual Learning of New Sound Classes Using Generative Replay

Zhepei Wang

Cem Subakan

Efthymios Tzinis

Paris Smaragdis

Laurent Charlin

Continual learning consists in incrementally training a model on a sequence of datasets and testing on the union of all datasets. In this paper, we examine continual learning for the problem of sound classification, in which we wish to refine already trained models to learn new sound classes. In practice one does not want to maintain all past training data and retrain from scratch, but naively updating a model with new data(sets) results in a degradation of already learned tasks, which is referred to as "catastrophic forgetting." We develop a generative replay procedure for generating training audio spectrogram data, in place of keeping older training datasets. We show that by incrementally refining a classifier with generative replay, a generator that is 4% of the size of all previous training data matches the performance of refining the classifier keeping 20% of all previous training data. We thus conclude that we can extend a trained sound classifier to learn new classes without having to keep previously used datasets.

2019-10-20

2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) (published)

doi.org

arxiv.org
