Cem (Yusuf) Subakan

Associate Academic Member
Assistant Professor, Université Laval, Department of Computer Science and Software Engineering
Adjunct Professor, Concordia University, Gina Cody School of Engineering and Computer Science

Biography

Cem Subakan is an Assistant Professor at Université Laval, in the Department of Computer Science and Software Engineering. He is also an Affiliate Assistant Professor in the Department of Computer Science and Software Engineering at Concordia University, and an Associate Academic Member of Mila – Quebec Artificial Intelligence Institute. He received his PhD in computer science from the University of Illinois at Urbana-Champaign (UIUC) and completed a postdoc at Mila. He serves as a reviewer for several conferences, including NeurIPS, ICML, ICLR, ICASSP, and MLSP, as well as for journals such as IEEE Signal Processing Letters (SPL) and IEEE Transactions on Audio, Speech, and Language Processing (TASL). His research focuses on machine learning for speech and audio. More specifically, he works on deep learning for source separation and speech enhancement under realistic conditions, interpretability of neural networks, continual learning, and multimodal learning. He received the Best Student Paper Award at the IEEE Machine Learning for Signal Processing (MLSP) conference in 2017, as well as the Sabura Muroga Fellowship from the UIUC Department of Computer Science. He is also a core contributor to the SpeechBrain project, where he leads the speech separation effort.

Current Students

Research Intern - Université de Montréal
Co-supervisor:
Master's (Research) - Université Laval
Co-supervisor:
PhD - Concordia University
Principal supervisor:
PhD - Université Laval
Co-supervisor:

Publications

Listenable Maps for Audio Classifiers
Focal Modulation Networks for Interpretable Sound Classification
The increasing success of deep neural networks has raised concerns about their inherent black-box nature, posing challenges related to interpretability and trust. While there has been extensive exploration of interpretation techniques in vision and language, interpretability in the audio domain has received limited attention, primarily focusing on post-hoc explanations. This paper addresses the problem of interpretability by design in the audio domain by utilizing the recently proposed attention-free focal modulation networks (FocalNets). We apply FocalNets to the task of environmental sound classification for the first time and evaluate their interpretability properties on the popular ESC-50 dataset. Our method outperforms a similarly sized vision transformer both in terms of accuracy and interpretability. Furthermore, it is competitive against PIQ, a method specifically designed for post-hoc interpretation in the audio domain.
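As a rough illustration of the focal modulation mechanism this paper builds on, here is a minimal single-block PyTorch sketch adapted to 1-D audio feature sequences. The kernel size, number of focal levels, and projection layout are illustrative assumptions, not the paper's exact configuration.

```python
# A minimal sketch of a focal modulation block (FocalNets), adapted to
# 1-D audio features. Hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn

class FocalModulation1d(nn.Module):
    def __init__(self, dim, focal_levels=3, kernel_size=3):
        super().__init__()
        # A single projection yields the query, the initial context, and the gates.
        self.f = nn.Linear(dim, 2 * dim + focal_levels + 1)
        self.focal_layers = nn.ModuleList(
            nn.Sequential(
                nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2,
                          groups=dim, bias=False),
                nn.GELU(),
            )
            for _ in range(focal_levels)
        )
        self.h = nn.Conv1d(dim, dim, 1)   # fuses gated contexts into the modulator
        self.proj = nn.Linear(dim, dim)
        self.focal_levels = focal_levels

    def forward(self, x):                 # x: (batch, time, dim)
        dim = x.size(-1)
        q, ctx, gates = torch.split(
            self.f(x), [dim, dim, self.focal_levels + 1], dim=-1)
        ctx = ctx.transpose(1, 2)         # (batch, dim, time) for the convolutions
        ctx_all = 0.0
        for level, layer in enumerate(self.focal_layers):
            ctx = layer(ctx)              # progressively larger receptive field
            ctx_all = ctx_all + ctx * gates[..., level].unsqueeze(1)
        # Global (whole-clip) context, gated separately.
        ctx_all = ctx_all + ctx.mean(-1, keepdim=True) * gates[..., -1].unsqueeze(1)
        modulator = self.h(ctx_all).transpose(1, 2)
        return self.proj(q * modulator)   # attention-free token interaction
```

The gated hierarchical aggregation is what makes the interaction attention-free: each output token is a query modulated by multi-scale context, rather than a softmax-weighted sum over all tokens.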
CL-MASR: A Continual Learning Benchmark for Multilingual ASR
Luca Della Libera
Pooneh Mousavi
Salah Zaiem
Unsupervised Improvement of Audio-Text Cross-Modal Representations
Zhepei Wang
Krishna Subramani
Junkai Wu
Tiago Tavares
Fabio Ayres
Paris Smaragdis
Recent advances in using language models to obtain cross-modal audio-text representations have overcome the limitations of conventional training approaches that use predefined labels. This has allowed the community to make progress in tasks like zero-shot classification, which would otherwise not be possible. However, learning such representations requires a large amount of human-annotated audio-text pairs. In this paper, we study unsupervised approaches to improve the learning framework of such representations with unpaired text and audio. We explore domain-unspecific and domain-specific curation methods to create audio-text pairs that we use to further improve the model. We also show that when domain-specific curation is used in conjunction with a soft-labeled contrastive loss, we obtain a significant improvement in zero-shot classification performance on downstream sound event classification and acoustic scene classification tasks.
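A minimal sketch of what a soft-labeled contrastive loss for audio-text alignment can look like. The particular soft-labeling scheme below (mixing the identity target with text-text similarities) is an illustrative assumption, not necessarily the paper's exact formulation.

```python
# A minimal sketch of a soft-labeled contrastive loss for audio-text alignment.
# The soft targets below are an illustrative choice, not the paper's exact scheme.
import torch
import torch.nn.functional as F

def soft_contrastive_loss(audio_emb, text_emb, temperature=0.07, alpha=0.9):
    a = F.normalize(audio_emb, dim=-1)        # (batch, dim)
    t = F.normalize(text_emb, dim=-1)         # (batch, dim)
    logits = a @ t.T / temperature            # audio-to-text similarities
    with torch.no_grad():
        # Soften the usual identity target with text-text similarities.
        soft = F.softmax(t @ t.T / temperature, dim=-1)
        target = alpha * torch.eye(len(a), device=a.device) + (1 - alpha) * soft
    loss_a2t = torch.sum(-target * F.log_softmax(logits, dim=-1), dim=-1).mean()
    loss_t2a = torch.sum(-target.T * F.log_softmax(logits.T, dim=-1), dim=-1).mean()
    return 0.5 * (loss_a2t + loss_t2a)
```

With alpha = 1 this reduces to the standard hard-labeled CLIP/CLAP-style objective, which makes the soft-labeling a drop-in relaxation.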
Audio Editing with Non-Rigid Text Prompts
Francesco Paissan
Zhepei Wang
Paris Smaragdis
In this paper, we explore audio editing with non-rigid text edits. We show that the proposed editing pipeline is able to create audio edits that remain faithful to the input audio. We explore text prompts that perform addition, style transfer, and in-painting. We show quantitatively and qualitatively that the edits outperform Audio-LDM, a recently released text-prompted audio generation model. Qualitative inspection of the results shows that the edits produced by our approach remain more faithful to the input audio in terms of keeping the original onsets and offsets of the audio events.
Self-Supervised Learning for Infant Cry Analysis
Arsenii Gorin
Sajjad Abdoli
Junhao Wang
Samantha Latremouille
Charles Onu
In this paper, we explore self-supervised learning (SSL) for analyzing a first-of-its-kind database of cry recordings containing clinical indications for more than a thousand newborns. Specifically, we target cry-based detection of neurological injury as well as identification of cry triggers such as pain, hunger, and discomfort. Annotating a large database in the medical setting is expensive and time-consuming, typically requiring the collaboration of several experts over years. Leveraging large amounts of unlabeled audio data to learn useful representations can lower the cost of building robust models and, ultimately, clinical solutions. In this work, we experiment with self-supervised pre-training of a convolutional neural network on large audio datasets. We show that pre-training with an SSL contrastive loss (SimCLR) performs significantly better than supervised pre-training for both neurological injury detection and cry-trigger identification. In addition, we demonstrate further performance gains through SSL-based domain adaptation using unlabeled infant cries. We also show that using such SSL-based pre-training for adaptation to cry sounds decreases the overall system's need for labeled data.
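A minimal sketch of the SimCLR (NT-Xent) contrastive objective used in the pre-training described above: two augmented views of each clip are encoded, matching views are pulled together, and all other pairs in the batch are pushed apart. The temperature value is illustrative.

```python
# A minimal sketch of the SimCLR NT-Xent loss; the temperature is illustrative.
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    # z1, z2: (batch, dim) embeddings of two augmented views of the same clips.
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=-1)   # (2B, dim)
    sim = z @ z.T / temperature                           # pairwise cosine logits
    sim.fill_diagonal_(float("-inf"))                     # exclude self-similarity
    batch = z1.size(0)
    # The positive for row i is its other view, at offset +/- batch.
    targets = torch.cat([torch.arange(batch) + batch, torch.arange(batch)])
    return F.cross_entropy(sim, targets.to(z.device))
```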
CryCeleb: A Speaker Verification Dataset Based on Infant Cry Sounds
David Budaghyan
Arsenii Gorin
Charles Onu
This paper describes the Ubenwa CryCeleb dataset - a labeled collection of infant cries - and the accompanying CryCeleb 2023 task, a public speaker verification challenge based on cry sounds. We released more than 6 hours of manually segmented cry sounds from 786 newborns for academic use, aiming to encourage research in infant cry analysis. The inaugural public competition attracted 59 participants, 11 of whom improved on the baseline performance. The top-performing system achieved a significant improvement, scoring a 25.8% equal error rate, which is still far from the performance of state-of-the-art adult speaker verification systems. We therefore believe there is room for further research on this dataset, potentially extending beyond the verification task.
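For readers unfamiliar with the metric, here is a minimal sketch of how an equal error rate like the 25.8% reported above is computed from verification trial scores: it is the operating point where the false-acceptance and false-rejection rates cross. The score and label arrays are illustrative placeholders.

```python
# A minimal sketch of equal error rate (EER) computation for verification.
import numpy as np

def equal_error_rate(scores, labels):
    # scores: similarity scores for trial pairs; labels: 1 = same infant, 0 = different.
    scores, labels = np.asarray(scores, float), np.asarray(labels, int)
    thresholds = np.sort(np.unique(scores))
    far = np.array([(scores[labels == 0] >= t).mean() for t in thresholds])
    frr = np.array([(scores[labels == 1] < t).mean() for t in thresholds])
    idx = np.argmin(np.abs(far - frr))    # where the two error rates cross
    return 0.5 * (far[idx] + frr[idx])

# e.g. equal_error_rate([0.9, 0.4, 0.7, 0.2], [1, 0, 1, 0]) -> 0.0
```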
Posthoc Interpretation via Quantization
In this paper, we introduce a new approach, called Posthoc Interpretation via Quantization (PIQ), for interpreting decisions made by trained classifiers. Our method utilizes vector quantization to transform the representations of a classifier into a discrete, class-specific latent space. The class-specific codebooks act as a bottleneck that forces the interpreter to focus on the parts of the input data deemed relevant by the classifier for making a prediction. Our model formulation also enables learning concepts by incorporating the supervision of pretrained annotation models such as state-of-the-art image segmentation models. We evaluated our method through quantitative and qualitative studies involving black-and-white images, color images, and audio. These studies found that participants in our user studies understood PIQ's interpretations more easily than those of several other interpretation methods from the literature.
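A minimal sketch of the vector-quantization bottleneck idea underlying PIQ: representations are snapped to the nearest entry of a learned codebook, with a straight-through gradient estimator. The codebook size and the class-specific bookkeeping of the actual method are simplified assumptions here.

```python
# A minimal VQ bottleneck sketch; PIQ's class-specific codebooks are simplified
# to a single codebook here for illustration.
import torch
import torch.nn as nn

class VQBottleneck(nn.Module):
    def __init__(self, num_codes=256, dim=128):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, h):                       # h: (batch, dim) representations
        dists = torch.cdist(h, self.codebook.weight)   # distance to each code
        codes = dists.argmin(dim=-1)                   # nearest codebook entry
        q = self.codebook(codes)
        # Straight-through estimator: gradients flow to h as if q == h.
        q_st = h + (q - h).detach()
        commit_loss = torch.mean((h - q.detach()) ** 2)
        return q_st, codes, commit_loss
```

The discrete codes are what make the interpretations listenable/visualizable: each prediction is forced through a small set of class-specific prototypes.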
Exploring Self-Attention Mechanisms for Speech Separation
Samuele Cornell
François Grondin
Mirko Bronzi
Transformers have enabled impressive improvements in deep learning. They often outperform recurrent and convolutional models in many tasks while taking advantage of parallel processing. Recently, we proposed the SepFormer, which obtains state-of-the-art performance in speech separation on the WSJ0-2/3Mix datasets. This paper studies Transformers for speech separation in depth. In particular, we extend our previous findings on the SepFormer by providing results on more challenging noisy and noisy-reverberant datasets, such as LibriMix, WHAM!, and WHAMR!. Moreover, we extend our model to perform speech enhancement and provide experimental evidence on denoising and dereverberation tasks. Finally, we investigate, for the first time in speech separation, the use of efficient self-attention mechanisms such as Linformers, Longformers, and Reformers. We find that they significantly reduce memory requirements. For example, we show that Reformer-based attention outperforms the popular Conv-TasNet model on the WSJ0-2Mix dataset while being faster at inference and comparable in terms of memory consumption.
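Since the SepFormer is distributed through SpeechBrain (where, as noted in the biography, the speech separation effort is led), a short usage sketch may be helpful. The import path and model identifier follow SpeechBrain's documented pattern but can vary across library versions, so treat them as illustrative.

```python
# Usage sketch: separating a 2-speaker mixture with a pretrained SepFormer.
# Import path and model id follow SpeechBrain's documented pattern and may
# differ across versions.
import torchaudio
from speechbrain.inference.separation import SepformerSeparation

model = SepformerSeparation.from_hparams(
    source="speechbrain/sepformer-wsj02mix",            # 2-speaker model (8 kHz)
    savedir="pretrained_models/sepformer-wsj02mix",
)
est_sources = model.separate_file(path="mixture.wav")   # (batch, time, n_speakers)
for spk in range(est_sources.shape[-1]):
    torchaudio.save(f"speaker{spk + 1}.wav",
                    est_sources[:, :, spk].detach().cpu(), 8000)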
REAL-M: Towards Speech Separation on Real Mixtures
Samuele Cornell
François Grondin
In recent years, deep learning based source separation has achieved impressive results. Most studies, however, still evaluate separation models on synthetic datasets, while the performance of state-of-the-art techniques on in-the-wild speech data remains an open question. This paper contributes to filling this gap in two ways. First, we release the REAL-M dataset, a crowd-sourced corpus of real-life mixtures. Second, we address the problem of performance evaluation on real-life mixtures, where the ground truth is not available. We bypass this issue by carefully designing a blind Scale-Invariant Signal-to-Noise Ratio (SI-SNR) neural estimator. Through a user study, we show that our estimator reliably evaluates separation performance on real mixtures: its performance predictions correlate well with human opinions. Moreover, when evaluating popular speech separation models, we observe that the performance trends predicted by our estimator on the REAL-M dataset closely follow the trends achieved on synthetic benchmarks.
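For reference, the SI-SNR metric that the blind estimator is trained to predict has a simple closed form when a reference signal is available; a minimal PyTorch sketch follows. The paper's contribution is estimating this quantity without access to the reference.

```python
# A minimal sketch of reference-based SI-SNR; the blind estimator in the paper
# predicts this value from the mixture and the estimate alone.
import torch

def si_snr(estimate, reference, eps=1e-8):
    # Zero-mean both signals so the measure is invariant to DC offsets.
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)
    reference = reference - reference.mean(dim=-1, keepdim=True)
    # Project the estimate onto the reference (scale-invariant target).
    s_target = (torch.sum(estimate * reference, -1, keepdim=True)
                / (torch.sum(reference ** 2, -1, keepdim=True) + eps)) * reference
    e_noise = estimate - s_target
    return 10 * torch.log10(torch.sum(s_target ** 2, -1)
                            / (torch.sum(e_noise ** 2, -1) + eps))
```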
Learning Representations for New Sound Classes With Continual Self-Supervised Learning
Zhepei Wang
Xilin Jiang
Junkai Wu
Efthymios Tzinis
Paris Smaragdis
In this article, we work on a sound recognition system that continually incorporates new sound classes. Our main goal is to develop a framework where the model can be updated without relying on labeled data. For this purpose, we propose adopting representation learning, where an encoder is trained using unlabeled data. This learning framework enables the study and implementation of a practically relevant use case where only a small fraction of the labels is available in a continual learning context. We also make the empirical observation that a similarity-based representation learning method within this framework is robust to forgetting, even when no explicit mechanism against forgetting is employed. We show that this approach obtains performance similar to several distillation-based continual learning methods when applied to self-supervised representation learning.
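A minimal sketch of a similarity-based self-supervised objective of the kind the article finds robust to forgetting: a SimSiam-style negative cosine similarity with a stop-gradient (the article's exact method may differ). New sound classes can be streamed through such a loss without labels, updating the encoder continually.

```python
# A minimal SimSiam-style similarity loss sketch; illustrative of the
# "similarity-based" family, not necessarily the article's exact objective.
import torch.nn.functional as F

def similarity_loss(p1, p2, z1, z2):
    # p*: predictor outputs, z*: encoder outputs, for two views of each clip.
    def neg_cos(p, z):
        return -F.cosine_similarity(p, z.detach(), dim=-1).mean()  # stop-grad on z
    return 0.5 * (neg_cos(p1, z2) + neg_cos(p2, z1))
```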