Cem Subakan

Associate Academic Member
Assistant Professor, Université Laval, Department of Computer Science and Software Engineering
Affiliate Assistant Professor, Concordia University, Gina Cody School of Engineering and Computer Science
Research Topics
Multimodal Learning

Biography

Cem Subakan is an assistant professor in the Computer Science and Software Engineering Department at Université Laval, and an affiliate assistant professor in the Computer Science and Software Engineering Department at Concordia University. He is also an associate academic member of Mila – Quebec Artificial Intelligence Institute. After receiving his PhD in computer science from the University of Illinois at Urbana-Champaign (UIUC), Subakan did a postdoc at Mila. He serves as a reviewer for many conferences, including NeurIPS, ICML, ICLR, ICASSP and MLSP, as well as for journals such as IEEE Signal Processing Letters and IEEE Transactions on Audio, Speech, and Language Processing. His principal research interest is machine learning for speech and audio. More specifically, he works on deep learning for source separation and speech enhancement under realistic conditions, neural network interpretability, continual learning and multimodal learning.

Subakan was awarded the Best Student Paper Award at the 2017 IEEE Machine Learning for Signal Processing Conference, and also obtained a Saburo Muroga Fellowship from UIUC’s Department of Computer Science. He is a core contributor to the SpeechBrain project, leading the speech separation component.

Publications

Self-Supervised Learning for Infant Cry Analysis
Arsenii Gorin
Sajjad Abdoli
Junhao Wang
Samantha Latremouille
Charles Onu
In this paper, we explore self-supervised learning (SSL) for analyzing a first-of-its-kind database of cry recordings containing clinical indications of more than a thousand newborns. Specifically, we target cry-based detection of neurological injury as well as identification of cry triggers such as pain, hunger, and discomfort. Annotating a large database in the medical setting is expensive and time-consuming, typically requiring the collaboration of several experts over years. Leveraging large amounts of unlabeled audio data to learn useful representations can lower the cost of building robust models and, ultimately, clinical solutions. In this work, we experiment with self-supervised pre-training of a convolutional neural network on large audio datasets. We show that pre-training with SSL contrastive loss (SimCLR) performs significantly better than supervised pre-training for both neuro injury and cry triggers. In addition, we demonstrate further performance gains through SSL-based domain adaptation using unlabeled infant cries. We also show that using such SSL-based pre-training for adaptation to cry sounds decreases the need for labeled data of the overall system.
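To make the SimCLR-style contrastive loss mentioned in this abstract concrete, here is a minimal sketch of the normalized temperature-scaled cross-entropy (NT-Xent) objective in plain Python. It is an illustration only, not the authors' training code: the function names, the toy embeddings, and the temperature value are assumptions, and a real system would compute this on batches of network outputs with an array library.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nt_xent(views_a, views_b, temperature=0.5):
    """NT-Xent loss; views_a[i] and views_b[i] embed two augmentations of item i."""
    z = list(views_a) + list(views_b)
    n = len(views_a)
    total = 0.0
    for i in range(2 * n):
        positive = (i + n) % (2 * n)  # index of the other view of the same item
        # Similarities to every other embedding (the positive and all negatives).
        logits = [cosine(z[i], z[k]) / temperature for k in range(2 * n) if k != i]
        log_denominator = math.log(sum(math.exp(value) for value in logits))
        total += log_denominator - cosine(z[i], z[positive]) / temperature
    return total / (2 * n)

# Toy, made-up 2-D embeddings for two items.
views_a = [[1.0, 0.0], [0.0, 1.0]]
views_b = [[1.0, 0.1], [0.1, 1.0]]
aligned_loss = nt_xent(views_a, views_b)
mismatched_loss = nt_xent(views_a, list(reversed(views_b)))
```

The loss is lower when the two views of each item are close to each other and far from the views of other items, which is what pre-training with this objective encourages.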
Posthoc Interpretation via Quantization
Francesco Paissan
In this paper, we introduce a new approach, called Posthoc Interpretation via Quantization (PIQ), for interpreting decisions made by trained classifiers. Our method utilizes vector quantization to transform the representations of a classifier into a discrete, class-specific latent space. The class-specific codebooks act as a bottleneck that forces the interpreter to focus on the parts of the input data deemed relevant by the classifier for making a prediction. Our model formulation also enables learning concepts by incorporating the supervision of pretrained annotation models such as state-of-the-art image segmentation models. We evaluated our method through quantitative and qualitative studies involving black-and-white images, color images, and audio. As a result of these studies, we found that PIQ generates interpretations that are more easily understood by participants in our user studies when compared to several other interpretation methods in the literature.
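The idea of quantizing representations with a class-specific codebook can be sketched as a nearest-neighbor lookup. The example below is a toy illustration under stated assumptions, not the PIQ implementation: the function names, the dictionary layout of `codebooks`, and the numeric values are all hypothetical, and the real method learns its codebooks jointly with an interpreter network.

```python
def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sq_dist(vector, codebook[k]))

def quantize(latents, codebooks, predicted_class):
    """Quantize each latent vector using only the codebook of the predicted class."""
    codebook = codebooks[predicted_class]
    indices = [nearest_code(v, codebook) for v in latents]
    return [codebook[i] for i in indices], indices

# Toy, made-up codebooks: two classes with two 2-D codes each.
codebooks = {
    0: [[0.0, 0.0], [1.0, 1.0]],
    1: [[5.0, 5.0], [-5.0, 5.0]],
}
latents = [[0.9, 1.2], [0.1, -0.2]]
quantized, indices = quantize(latents, codebooks, predicted_class=0)
```

Because each class can only express its latents through its own small set of codes, the quantized representation acts as the bottleneck described in the abstract.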
Exploring Self-Attention Mechanisms for Speech Separation
Samuele Cornell
François Grondin
Mirko Bronzi
Transformers have enabled impressive improvements in deep learning. They often outperform recurrent and convolutional models in many tasks while taking advantage of parallel processing. Recently, we proposed the SepFormer, which obtains state-of-the-art performance in speech separation with the WSJ0-2/3 Mix datasets. This paper studies Transformers for speech separation in depth. In particular, we extend our previous findings on the SepFormer by providing results on more challenging noisy and noisy-reverberant datasets, such as LibriMix, WHAM!, and WHAMR!. Moreover, we extend our model to perform speech enhancement and provide experimental evidence on denoising and dereverberation tasks. Finally, we investigate, for the first time in speech separation, the use of efficient self-attention mechanisms such as Linformers, Longformers, and Reformers. We found that they reduce memory requirements significantly. For example, we show that the Reformer-based attention outperforms the popular Conv-TasNet model on the WSJ0-2Mix dataset while being faster at inference and comparable in terms of memory consumption.
Real-M: Towards Speech Separation on Real Mixtures
Samuele Cornell
François Grondin
In recent years, deep learning based source separation has achieved impressive results. Most studies, however, still evaluate separation models on synthetic datasets, while the performance of state-of-the-art techniques on in-the-wild speech data remains an open question. This paper helps fill this gap in two ways. First, we release the REAL-M dataset, a crowd-sourced corpus of real-life mixtures. Second, we address the problem of performance evaluation of real-life mixtures, where the ground truth is not available. We bypass this issue by carefully designing a blind Scale-Invariant Signal-to-Noise Ratio (SI-SNR) neural estimator. Through a user study, we show that our estimator reliably evaluates the separation performance on real mixtures, i.e., we observe that the performance predictions of the SI-SNR estimator correlate well with human opinions. Moreover, when evaluating popular speech separation models, we observe that the performance trends predicted by our estimator on the REAL-M dataset closely follow the performance trends achieved on synthetic benchmarks.
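For reference, the quantity that the blind estimator is trained to predict, SI-SNR, can be computed directly when a ground-truth source is available. The sketch below uses the standard reference-based definition in plain Python; it is not the neural estimator from the paper, and the function name, the `eps` constant, and the toy signals are assumptions made for illustration.

```python
import math

def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between an estimated and a reference signal."""
    if len(estimate) != len(target):
        raise ValueError("signals must have the same length")
    # Remove the mean so that a constant offset does not affect the score.
    est_mean = sum(estimate) / len(estimate)
    tgt_mean = sum(target) / len(target)
    est = [x - est_mean for x in estimate]
    tgt = [x - tgt_mean for x in target]
    # Project the estimate onto the target to obtain the scaled target component.
    dot = sum(e * t for e, t in zip(est, tgt))
    scale = dot / (sum(t * t for t in tgt) + eps)
    s_target = [scale * t for t in tgt]
    # Whatever is left after the projection counts as error.
    e_noise = [e - s for e, s in zip(est, s_target)]
    signal_energy = sum(s * s for s in s_target)
    noise_energy = sum(n * n for n in e_noise)
    return 10.0 * math.log10((signal_energy + eps) / (noise_energy + eps))

# Toy, made-up signals.
target = [1.0, -1.0, 1.0, -1.0]
estimate = [1.1, -0.9, 0.8, -1.2]
score = si_snr(estimate, target)
rescaled_score = si_snr([2.0 * x for x in estimate], target)
```

Rescaling the estimate leaves the score essentially unchanged, which is the scale invariance the metric is named for. On real mixtures no `target` exists, which is why the paper resorts to a learned estimator.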
Learning Representations for New Sound Classes With Continual Self-Supervised Learning
Zhepei Wang
Xilin Jiang
Junkai Wu
Efthymios Tzinis
Paris Smaragdis
In this article, we work on a sound recognition system that continually incorporates new sound classes. Our main goal is to develop a framework where the model can be updated without relying on labeled data. For this purpose, we propose adopting representation learning, where an encoder is trained using unlabeled data. This learning framework enables the study and implementation of a practically relevant use case where only a small amount of the labels is available in a continual learning context. We also make the empirical observation that a similarity-based representation learning method within this framework is robust to forgetting even if no explicit mechanism against forgetting is employed. We show that this approach obtains similar performance compared to several distillation-based continual learning methods when employed on self-supervised representation learning methods.
SpeechBrain: A General-Purpose Speech Toolkit
Titouan Parcollet
Peter William VanHarn Plantinga
Aku Rouhe
Samuele Cornell
Loren Lugosch
Nauman Dawalatabad
Abdelwahab Heba
Jianyuan Zhong
Ju-Chieh Chou
Sung-Lin Yeh
Szu-Wei Fu
Chien-Feng Liao
E. Rastorgueva
François Grondin
William Aris
Hwidong Na
Yan Gao
Renato De Mori … (see 1 more)
SpeechBrain is an open-source and all-in-one speech toolkit. It is designed to facilitate the research and development of neural speech processing technologies by being simple, flexible, user-friendly, and well-documented. This paper describes the core architecture designed to support several tasks of common interest, allowing users to naturally conceive, compare and share novel speech processing pipelines. SpeechBrain achieves competitive or state-of-the-art performance in a wide range of speech benchmarks. It also provides training recipes, pretrained models, and inference scripts for popular speech datasets, as well as tutorials which allow anyone with basic Python proficiency to familiarize themselves with speech technologies.
On the Effectiveness of Two-Step Learning for Latent-Variable Models
Latent-variable generative models offer a principled solution for modeling and sampling from complex probability distributions. Implementing a joint training objective with a complex prior, however, can be a tedious task, as one is typically required to derive and code a specific cost function for each new type of prior distribution. In this work, we propose a general framework for learning latent variable generative models in a two-step fashion. In the first step of the framework, we train an autoencoder, and in the second step we fit a prior model on the resulting latent distribution. This two-step approach offers a convenient alternative to joint training, as it allows for a straightforward combination of existing models without the hassle of deriving new cost functions or coding the joint training objectives. Through a set of experiments, we demonstrate that two-step learning results in performance similar to joint training, and in some cases even results in more accurate modeling.
Continual Learning of New Sound Classes Using Generative Replay
Zhepei Wang
Efthymios Tzinis
Paris Smaragdis
Continual learning consists of incrementally training a model on a sequence of datasets and testing on the union of all datasets. In this paper, we examine continual learning for the problem of sound classification, in which we wish to refine already trained models to learn new sound classes. In practice one does not want to maintain all past training data and retrain from scratch, but naively updating a model with new data(sets) results in a degradation of already learned tasks, which is referred to as "catastrophic forgetting." We develop a generative replay procedure for generating training audio spectrogram data, in place of keeping older training datasets. We show that, by incrementally refining a classifier with generative replay, a generator that is 4% of the size of all previous training data matches the performance of refining the classifier while keeping 20% of all previous training data. We thus conclude that we can extend a trained sound classifier to learn new classes without having to keep previously used datasets.