Portrait of Cem Subakan

Cem Subakan

Associate Academic Member
Assistant Professor, Université Laval, Department of Computer Science and Software Engineering
Affiliate Assistant Professor, Concordia University, Gina Cody School of Engineering and Computer Science
Research Topics
Multimodal Learning

Biography

Cem Subakan is an assistant professor in the Computer Science and Software Engineering Department at Université Laval, and an affiliate assistant professor in the Computer Science and Software Engineering Department at Concordia University. He is also an associate academic member of Mila – Quebec Artificial Intelligence Institute. After receiving his PhD in computer science from the University of Illinois at Urbana-Champaign (UIUC), Subakan did a postdoc at Mila. He serves as a reviewer for many conferences including NeurIPS, ICML, ICLR, ICASSP and MLSP, as well as for journals, such as IEEE Signal Processing Letters and IEEE Transactions on Audio, Speech, and Language Processing. His principal research interest is machine learning for speech and audio. More specifically, he works on deep learning for source separation and speech enhancement under realistic conditions, neural network interpretability, continual learning and multi-modal learning.

Subakan was awarded the Best Student Paper Award at the 2017 IEEE Machine Learning for Signal Processing Conference, and also obtained a Sabura Muroga Fellowship from UIUC’s Department of Computer Science. He is a core contributor to the SpeechBrain project, leading the speech separation component.

Current Students

Master's Research - Université Laval
PhD - Concordia University
Principal supervisor :
PhD - Concordia University
Principal supervisor :
PhD - Université Laval
Co-supervisor :
Research Intern - Université Laval
Co-supervisor :
PhD - Université Laval
Co-supervisor :
Collaborating Alumni - Saarland University
Collaborating Alumni - Université de Montréal
Co-supervisor :
PhD - Université Laval
Co-supervisor :

Publications

Hierarchical Retrieval at Scale: Bridging Transparency and Efficiency
Tianyi Chen
Valentina Zantedeschi
Information retrieval is a core component of many intelligent systems as it enables conditioning of outputs on new and large-scale datasets.… (see more) While effective, the standard practice of encoding data into high-dimensional representations for similarity search entails large memory and compute footprints, and also makes it hard to inspect the inner workings of the system. Hierarchical retrieval methods offer an interpretable alternative by organizing data at multiple granular levels, yet do not match the efficiency and performance of flat retrieval approaches. In this paper, we propose ReTreever, a tree-based method that makes hierarchical retrieval viable at scale by directly optimizing its structure for retrieval performance while naturally providing transparency through meaningful semantic groupings. Our method offers the flexibility to balance cost and utility by indexing data using representations from any tree level. We show that ReTreever delivers strong coarse (intermediate levels) and fine representations (terminal level), while achieving the highest retrieval accuracy at the lowest latency among hierarchical methods. These results demonstrate that this family of techniques is viable in practical applications.
FocalCodec: Low-Bitrate Speech Coding via Focal Modulation Networks
Large language models have revolutionized natural language processing through self-supervised pretraining on massive datasets. Inspired by t… (see more)his success, researchers have explored adapting these methods to speech by discretizing continuous audio into tokens using neural audio codecs. However, existing approaches face limitations, including high bitrates, the loss of either semantic or acoustic information, and the reliance on multi-codebook designs when trying to capture both, which increases architectural complexity for downstream tasks. To address these challenges, we introduce FocalCodec, an efficient low-bitrate codec based on focal modulation that utilizes a single binary codebook to compress speech between 0.16 and 0.65 kbps. FocalCodec delivers competitive performance in speech resynthesis and voice conversion at lower bitrates than the current state-of-the-art, while effectively handling multilingual speech and noisy environments. Evaluation on downstream tasks shows that FocalCodec successfully preserves sufficient semantic and acoustic information, while also being well-suited for generative modeling. Demo samples and code are available at https://lucadellalib.github.io/focalcodec-web/.
Investigating the Effectiveness of Explainability Methods in Parkinson's Detection from Speech
Speech impairments in Parkinson's disease (PD) provide significant early indicators for diagnosis. While models for speech-based PD detectio… (see more)n have shown strong performance, their interpretability remains underexplored. This study systematically evaluates several explainability methods to identify PD-specific speech features, aiming to support the development of accurate, interpretable models for clinical decision-making in PD diagnosis and monitoring. Our methodology involves (i) obtaining attributions and saliency maps using mainstream interpretability techniques, (ii) quantitatively evaluating the faithfulness of these maps and their combinations obtained via union and intersection through a range of established metrics, and (iii) assessing the information conveyed by the saliency maps for PD detection from an auxiliary classifier. Our results reveal that, while explanations are aligned with the classifier, they often fail to provide valuable information for domain experts.
LMAC-TD: Producing Time Domain Explanations for Audio Classifiers
Neural networks are typically black-boxes that remain opaque with regards to their decision mechanisms. Several works in the literature have… (see more) proposed post-hoc explanation methods to alleviate this issue. This paper proposes LMAC-TD, a post-hoc explanation method that trains a decoder to produce explanations directly in the time domain. This methodology builds upon the foundation of L-MAC, Listenable Maps for Audio Classifiers, a method that produces faithful and listenable explanations. We incorporate SepFormer, a popular transformer-based time-domain source separation architecture. We show through a user study that LMAC-TD significantly improves the audio quality of the produced explanations while not sacrificing from faithfulness.
Audio Prototypical Network For Controllable Music Recommendation
Traditional recommendation systems represent user preferences in dense representations obtained through black-box encoder models. While thes… (see more)e models often provide strong recommendation performance, they lack interpretability for users, leaving users unable to understand or control the system's modeling of their preferences. This limitation is especially challenging in music recommendation, where user preferences are highly personal and often evolve based on nuanced qualities like mood, genre, tempo, or instrumentation. In this paper, we propose an audio prototypical network for controllable music recommendation. This network expresses user preferences in terms of prototypes representative of semantically meaningful features pertaining to musical qualities. We show that the model obtains competitive recommendation performance compared to popular baseline models while also providing interpretable and controllable user profiles.
Retreever: Tree-Based Coarse-to-Fine Representations for Retrieval
Tianyi Chen
Valentina Zantedeschi
Document retrieval is a core component of question-answering systems, as it enables conditioning answer generation on new and large-scale co… (see more)rpora. While effective, the standard practice of encoding documents into high-dimensional embeddings for similarity search entails large memory and compute footprints, and also makes it hard to inspect the inner workings of the system. In this paper, we propose a tree-based method for organizing and representing reference documents at various granular levels, which offers the flexibility to balance cost and utility, and eases the inspection of the corpus content and retrieval operations. Our method, called ReTreever, jointly learns a routing function per internal node of a binary tree such that query and reference documents are assigned to similar tree branches, hence directly optimizing for retrieval performance. Our evaluations show that ReTreever generally preserves full representation accuracy. Its hierarchical structure further provides strong coarse representations and enhances transparency by indirectly learning meaningful semantic groupings. Among hierarchical retrieval methods, ReTreever achieves the best retrieval accuracy at the lowest latency, proving that this family of techniques can be viable in practical applications.
Listenable Maps for Zero-Shot Audio Classifiers
Interpreting the decisions of deep learning models, including audio classifiers, is crucial for ensuring the transparency and trustworthines… (see more)s of this technology. In this paper, we introduce LMAC-ZS (Listenable Maps for Audio Classifiers in the Zero-Shot context), which, to the best of our knowledge, is the first decoder-based post-hoc interpretation method for explaining the decisions of zero-shot audio classifiers. The proposed method utilizes a novel loss function that maximizes the faithfulness to the original similarity between a given text-and-audio pair. We provide an extensive evaluation using the Contrastive Language-Audio Pretraining (CLAP) model to showcase that our interpreter remains faithful to the decisions in a zero-shot classification context. Moreover, we qualitatively show that our method produces meaningful explanations that correlate well with different text prompts.
Listenable Maps for Audio Classifiers
Despite the impressive performance of deep learning models across diverse tasks, their complexity poses challenges for interpretation. This … (see more)challenge is particularly evident for audio signals, where conveying interpretations becomes inherently difficult. To address this issue, we introduce Listenable Maps for Audio Classifiers (L-MAC), a posthoc interpretation method that generates faithful and listenable interpretations. L-MAC utilizes a decoder on top of a pretrained classifier to generate binary masks that highlight relevant portions of the input audio. We train the decoder with a loss function that maximizes the confidence of the classifier decision on the masked-in portion of the audio while minimizing the probability of model output for the masked-out portion. Quantitative evaluations on both in-domain and out-of-domain data demonstrate that L-MAC consistently produces more faithful interpretations than several gradient and masking-based methodologies. Furthermore, a user study confirms that, on average, users prefer the interpretations generated by the proposed technique.
Phoneme Discretized Saliency Maps for Explainable Detection of AI-Generated Voice
In this paper, we propose Phoneme Discretized Saliency Maps (PDSM), a discretization algorithm for saliency maps that takes advantage of pho… (see more)neme boundaries for explainable detection of AI-generated voice. We experimentally show with two different Text-to-Speech systems (i.e., Tacotron2 and Fastspeech2) that the proposed algorithm produces saliency maps that result in more faithful explanations compared to standard posthoc explanation methods. Moreover, by associating the saliency maps to the phoneme representations, this methodology generates explanations that tend to be more understandable than standard saliency maps on magnitude spectrograms.
CryCeleb: A Speaker Verification Dataset Based on Infant Cry Sounds
David Budaghyan
Arsenii Gorin
Charles C. Onu
This paper describes the Ubenwa CryCeleb dataset - a labeled collection of infant cries - and the accompanying CryCeleb 2023 task, which is … (see more)a public speaker verification challenge based on cry sounds. We released more than 6 hours of manually segmented cry sounds from 786 newborns for academic use, aiming to encourage research in infant cry analysis. The inaugural public competition attracted 59 participants, 11 of whom improved the baseline performance. The top-performing system achieved a significant improvement scoring 25.8% equal error rate, which is still far from the performance of state-of-the-art adult speaker verification systems. Therefore, we believe there is room for further research on this dataset, potentially extending beyond the verification task.
Focal Modulation Networks for Interpretable Sound Classification
The increasing success of deep neural networks has raised concerns about their inherent black-box nature, posing challenges related to inter… (see more)pretability and trust. While there has been extensive exploration of interpretation techniques in vision and language, interpretability in the audio domain has received limited attention, primarily focusing on post-hoc explanations. This paper addresses the problem of interpretability by-design in the audio domain by utilizing the recently proposed attention-free focal modulation networks (FocalNets). We apply FocalNets to the task of environmental sound classification for the first time and evaluate their interpretability properties on the popular ESC-50 dataset. Our method outperforms a similarly sized vision transformer both in terms of accuracy and interpretability. Furthermore, it is competitive against PIQ, a method specifically designed for post-hoc interpretation in the audio domain.
Resource-Efficient Separation Transformer
Samuele Cornell
Frédéric Lepoutre
François Grondin
Transformers have recently achieved state-of-the-art performance in speech separation. These models, however, are computationally demanding … (see more)and require a lot of learnable parameters. This paper explores Transformer-based speech separation with a reduced computational cost. Our main contribution is the development of the Resource-Efficient Separation Transformer (RE-SepFormer), a self-attention-based architecture that reduces the computational burden in two ways. First, it uses non-overlapping blocks in the latent space. Second, it operates on compact latent summaries calculated from each chunk. The RE-SepFormer reaches a competitive performance on the popular WSJ0-2Mix and WHAM! datasets in both causal and non-causal settings. Remarkably, it scales significantly better than the previous Transformer-based architectures in terms of memory and inference time, making it more suitable for processing long mixtures.