
Cem Subakan

Associate Academic Member

Assistant Professor, Université Laval, Department of Computer Science and Software Engineering

Affiliate Assistant Professor, Concordia University, Gina Cody School of Engineering and Computer Science

Research Topics

Multimodal Learning

Biography

Cem Subakan is an assistant professor in the Computer Science and Software Engineering Department at Université Laval, and an affiliate assistant professor in the Computer Science and Software Engineering Department at Concordia University. He is also an associate academic member of Mila – Quebec Artificial Intelligence Institute. After receiving his PhD in computer science from the University of Illinois at Urbana-Champaign (UIUC), Subakan did a postdoc at Mila. He serves as a reviewer for many conferences, including NeurIPS, ICML, ICLR, ICASSP and MLSP, as well as for journals such as IEEE Signal Processing Letters and IEEE Transactions on Audio, Speech, and Language Processing. His principal research interest is machine learning for speech and audio. More specifically, he works on deep learning for source separation and speech enhancement under realistic conditions, neural network interpretability, continual learning and multimodal learning.

Subakan received the Best Student Paper Award at the 2017 IEEE Machine Learning for Signal Processing Conference and a Saburo Muroga Fellowship from UIUC’s Department of Computer Science. He is a core contributor to the SpeechBrain project, leading the speech separation component.

Current Students

Matthieu Cervera

Co-supervisor:

Mirco Ravanelli

Master's Research - Université Laval

Luca Della Libera

PhD - Concordia University

Principal supervisor:

Mirco Ravanelli

Maab Elrashid Ahmed Mohamed

PhD - Concordia University

Principal supervisor:

Mirco Ravanelli

PhD - Université Laval

Co-supervisor:

Laurent Charlin

Research Intern - Université Laval

Co-supervisor:

Mirco Ravanelli

PhD - Université Laval

Co-supervisor:

Mirco Ravanelli

Aravind Krishnan

Collaborating Alumni - Saarland University

Eleonora Mancini

Collaborating Alumni - Université de Montréal

Co-supervisor:

Mirco Ravanelli

Francesco Paissan

PhD - Université Laval

Co-supervisor:

Aishwarya Agrawal

Blog Posts

Visual of FocalCodec, a new method for compressing speech without sacrificing quality, with the aim of making multimodal LLMs more efficient.

January 23, 2026

FocalCodec: Giving LLMs Ears and a Voice at Ultra-Low Bitrates

by

Luca Della Libera

Francesco Paissan

Mirco Ravanelli


Publications

ReTreever: Tree-Based Coarse-to-Fine Representations for Retrieval

Tianyi Chen

Perouz Taslakian

Valentina Zantedeschi

Document retrieval is a core component of question-answering systems, as it enables conditioning answer generation on new and large-scale corpora. While effective, the standard practice of encoding documents into high-dimensional embeddings for similarity search entails large memory and compute footprints, and also makes it hard to inspect the inner workings of the system. In this paper, we propose a tree-based method for organizing and representing reference documents at various granular levels, which offers the flexibility to balance cost and utility, and eases the inspection of the corpus content and retrieval operations. Our method, called ReTreever, jointly learns a routing function per internal node of a binary tree such that query and reference documents are assigned to similar tree branches, hence directly optimizing for retrieval performance. Our evaluations show that ReTreever generally preserves full representation accuracy. Its hierarchical structure further provides strong coarse representations and enhances transparency by indirectly learning meaningful semantic groupings. Among hierarchical retrieval methods, ReTreever achieves the best retrieval accuracy at the lowest latency, proving that this family of techniques can be viable in practical applications.

2024-12-31

arXiv (preprint)

Listenable Maps for Zero-Shot Audio Classifiers

Francesco Paissan

Luca Della Libera

Mirco Ravanelli

Interpreting the decisions of deep learning models, including audio classifiers, is crucial for ensuring the transparency and trustworthiness of this technology. In this paper, we introduce LMAC-ZS (Listenable Maps for Audio Classifiers in the Zero-Shot context), which, to the best of our knowledge, is the first decoder-based post-hoc interpretation method for explaining the decisions of zero-shot audio classifiers. The proposed method utilizes a novel loss function that maximizes the faithfulness to the original similarity between a given text-and-audio pair. We provide an extensive evaluation using the Contrastive Language-Audio Pretraining (CLAP) model to showcase that our interpreter remains faithful to the decisions in a zero-shot classification context. Moreover, we qualitatively show that our method produces meaningful explanations that correlate well with different text prompts.

2024-09-24

NeurIPS 2024 (poster)

Audio Editing with Non-Rigid Text Prompts

Francesco Paissan

Luca Della Libera

Zhepei Wang

Paris Smaragdis

Mirco Ravanelli

In this paper, we explore audio editing with non-rigid text edits. We show that the proposed editing pipeline is able to create audio edits that remain faithful to the input audio. We explore text prompts that perform addition, style transfer, and in-painting. We quantitatively and qualitatively show that the edits outperform Audio-LDM, a recently released text-prompted audio generation model. Qualitative inspection of the results indicates that the edits given by our approach remain more faithful to the input audio in terms of keeping the original onsets and offsets of the audio events.

2024-08-31

Interspeech (published)

Listenable Maps for Audio Classifiers

Francesco Paissan

Mirco Ravanelli

Despite the impressive performance of deep learning models across diverse tasks, their complexity poses challenges for interpretation. This challenge is particularly evident for audio signals, where conveying interpretations becomes inherently difficult. To address this issue, we introduce Listenable Maps for Audio Classifiers (L-MAC), a post-hoc interpretation method that generates faithful and listenable interpretations. L-MAC utilizes a decoder on top of a pretrained classifier to generate binary masks that highlight relevant portions of the input audio. We train the decoder with a loss function that maximizes the confidence of the classifier decision on the masked-in portion of the audio while minimizing the probability of model output for the masked-out portion. Quantitative evaluations on both in-domain and out-of-domain data demonstrate that L-MAC consistently produces more faithful interpretations than several gradient and masking-based methodologies. Furthermore, a user study confirms that, on average, users prefer the interpretations generated by the proposed technique.

2024-07-07

Proceedings of the 41st International Conference on Machine Learning (published)

proceedings.mlr.press

Open-Source Conversational AI with SpeechBrain 1.0

Mirco Ravanelli

Titouan Parcollet

Adel Moumen

Sylvain de Langen

Peter Plantinga

Yingzhi Wang

Luca Della Libera

Artem Ploujnikov

Francesco Paissan

Zeyu Zhao

Shucong Zhang

Georgios Karakasidis

Pierre Champion

Aku Rouhe

Rudolf Braun

Florian Mai

Juan Zuluaga-Gomez

Seyed Mahed Mousavi

Andreas Nautsch

Ha Nguyen

Xuechen Liu

Sangeet Sagar

Jarod Duret

Salima Mdhaffar

Gaëlle Laperrière

Mickael Rouvier

Yannick Estève

SpeechBrain is an open-source Conversational AI toolkit based on PyTorch, focused particularly on speech processing tasks such as speech recognition, speech enhancement, speaker recognition, text-to-speech, and much more. It promotes transparency and replicability by releasing both the pre-trained models and the complete "recipes" of code and algorithms required for training them. This paper presents SpeechBrain 1.0, a significant milestone in the evolution of the toolkit, which now has over 200 recipes for speech, audio, and language processing tasks, and more than 100 models available on Hugging Face. SpeechBrain 1.0 introduces new technologies to support diverse learning modalities, Large Language Model (LLM) integration, and advanced decoding strategies, along with novel models, tasks, and modalities. It also includes a new benchmark repository, offering researchers a unified platform for evaluating models across diverse tasks.

2024-06-28

arXiv (preprint)

Phoneme Discretized Saliency Maps for Explainable Detection of AI-Generated Voice

Mirco Ravanelli

Pascal Germain

In this paper, we propose Phoneme Discretized Saliency Maps (PDSM), a discretization algorithm for saliency maps that takes advantage of phoneme boundaries for explainable detection of AI-generated voice. We experimentally show with two different Text-to-Speech systems (i.e., Tacotron2 and Fastspeech2) that the proposed algorithm produces saliency maps that result in more faithful explanations compared to standard post-hoc explanation methods. Moreover, by associating the saliency maps to the phoneme representations, this methodology generates explanations that tend to be more understandable than standard saliency maps on magnitude spectrograms.

2024-06-13

arXiv (preprint)

CryCeleb: A Speaker Verification Dataset Based on Infant Cry Sounds

David Budaghyan

Arsenii Gorin

Charles C. Onu

This paper describes the Ubenwa CryCeleb dataset - a labeled collection of infant cries - and the accompanying CryCeleb 2023 task, which is a public speaker verification challenge based on cry sounds. We released more than 6 hours of manually segmented cry sounds from 786 newborns for academic use, aiming to encourage research in infant cry analysis. The inaugural public competition attracted 59 participants, 11 of whom improved the baseline performance. The top-performing system achieved a significant improvement, scoring a 25.8% equal error rate, which is still far from the performance of state-of-the-art adult speaker verification systems. Therefore, we believe there is room for further research on this dataset, potentially extending beyond the verification task.

2024-04-13

ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (published)

Focal Modulation Networks for Interpretable Sound Classification

Luca Della Libera

Mirco Ravanelli

The increasing success of deep neural networks has raised concerns about their inherent black-box nature, posing challenges related to interpretability and trust. While there has been extensive exploration of interpretation techniques in vision and language, interpretability in the audio domain has received limited attention, primarily focusing on post-hoc explanations. This paper addresses the problem of interpretability by design in the audio domain by utilizing the recently proposed attention-free focal modulation networks (FocalNets). We apply FocalNets to the task of environmental sound classification for the first time and evaluate their interpretability properties on the popular ESC-50 dataset. Our method outperforms a similarly sized vision transformer both in terms of accuracy and interpretability. Furthermore, it is competitive against PIQ, a method specifically designed for post-hoc interpretation in the audio domain.

2024-04-13

2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW) (published)

Resource-Efficient Separation Transformer

Luca Della Libera

Mirco Ravanelli

Samuele Cornell

Frédéric Lepoutre

François Grondin

Transformers have recently achieved state-of-the-art performance in speech separation. These models, however, are computationally demanding and require a lot of learnable parameters. This paper explores Transformer-based speech separation with a reduced computational cost. Our main contribution is the development of the Resource-Efficient Separation Transformer (RE-SepFormer), a self-attention-based architecture that reduces the computational burden in two ways. First, it uses non-overlapping blocks in the latent space. Second, it operates on compact latent summaries calculated from each chunk. The RE-SepFormer reaches competitive performance on the popular WSJ0-2Mix and WHAM! datasets in both causal and non-causal settings. Remarkably, it scales significantly better than the previous Transformer-based architectures in terms of memory and inference time, making it more suitable for processing long mixtures.

2024-04-13

ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (published)

Exploring Self-Attention Mechanisms for Speech Separation

Mirco Ravanelli

Samuele Cornell

François Grondin

Mirko Bronzi

Transformers have enabled impressive improvements in deep learning. They often outperform recurrent and convolutional models in many tasks while taking advantage of parallel processing. Recently, we proposed the SepFormer, which obtains state-of-the-art performance in speech separation with the WSJ0-2/3 Mix datasets. This paper studies Transformers for speech separation in depth. In particular, we extend our previous findings on the SepFormer by providing results on more challenging noisy and noisy-reverberant datasets, such as LibriMix, WHAM!, and WHAMR!. Moreover, we extend our model to perform speech enhancement and provide experimental evidence on denoising and dereverberation tasks. Finally, we investigate, for the first time in speech separation, the use of efficient self-attention mechanisms such as Linformers, Longformers, and Reformers. We found that they reduce memory requirements significantly. For example, we show that the Reformer-based attention outperforms the popular Conv-TasNet model on the WSJ0-2Mix dataset while being faster at inference and comparable in terms of memory consumption.

2022-12-31

IEEE/ACM Transactions on Audio, Speech, and Language Processing (published)

Unsupervised Improvement of Audio-Text Cross-Modal Representations

Zhepei Wang

Krishna Subramani

Junkai Wu

Tiago Tavares

Fabio Ayres

Paris Smaragdis

Recent advances in using language models to obtain cross-modal audio-text representations have overcome the limitations of conventional training approaches that use predefined labels. This has allowed the community to make progress in tasks like zero-shot classification, which would otherwise not be possible. However, learning such representations requires a large amount of human-annotated audio-text pairs. In this paper, we study unsupervised approaches to improve the learning framework of such representations with unpaired text and audio. We explore domain-unspecific and domain-specific curation methods to create audio-text pairs that we use to further improve the model. We also show that when domain-specific curation is used in conjunction with a soft-labeled contrastive loss, we are able to obtain significant improvement in terms of zero-shot classification performance on downstream sound event classification or acoustic scene classification tasks.

2022-12-31

WASPAA (published)

Real-M: Towards Speech Separation on Real Mixtures

Mirco Ravanelli

Samuele Cornell

François Grondin

In recent years, deep learning-based source separation has achieved impressive results. Most studies, however, still evaluate separation models on synthetic datasets, while the performance of state-of-the-art techniques on in-the-wild speech data remains an open question. This paper helps fill this gap in two ways. First, we release the REAL-M dataset, a crowd-sourced corpus of real-life mixtures. Second, we address the problem of performance evaluation of real-life mixtures, where the ground truth is not available. We bypass this issue by carefully designing a blind Scale-Invariant Signal-to-Noise Ratio (SI-SNR) neural estimator. Through a user study, we show that our estimator reliably evaluates the separation performance on real mixtures. The performance predictions of the SI-SNR estimator indeed correlate well with human opinions. Moreover, we observe that the performance trends predicted by our estimator on the REAL-M dataset closely follow those achieved on synthetic benchmarks when evaluating popular speech separation models.

2022-05-22

ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (published)
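
For readers unfamiliar with the SI-SNR metric mentioned in the REAL-M abstract above, the following is a minimal sketch of the standard reference-based definition, which needs the ground-truth source. It is not the blind neural estimator proposed in the paper, which predicts this quantity without a reference. The function name `si_snr` and the `eps` stabilizer are illustrative assumptions, and the example uses plain Python lists rather than any particular audio library.

```python
import math


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between an estimate and a reference signal.

    Illustrative helper (hypothetical name); both inputs are equal-length
    sequences of floats.
    """
    if len(estimate) != len(target):
        raise ValueError("signals must have the same length")
    n = len(target)

    # Remove the mean so a constant offset does not affect the score.
    est_mean = sum(estimate) / n
    tgt_mean = sum(target) / n
    est = [x - est_mean for x in estimate]
    tgt = [x - tgt_mean for x in target]

    # Project the estimate onto the target: this is the "signal" part.
    dot = sum(e * t for e, t in zip(est, tgt))
    tgt_energy = sum(t * t for t in tgt) + eps
    scale = dot / tgt_energy
    projection = [scale * t for t in tgt]

    # Everything not explained by the projection is treated as noise.
    noise = [e - p for e, p in zip(est, projection)]
    signal_energy = sum(p * p for p in projection)
    noise_energy = sum(v * v for v in noise) + eps

    return 10.0 * math.log10(signal_energy / noise_energy + eps)


# Toy usage: a sinusoid corrupted by a small and a large disturbance.
target = [math.sin(0.1 * i) for i in range(200)]
disturbance = [math.cos(0.37 * i) for i in range(200)]
mild = [t + 0.1 * d for t, d in zip(target, disturbance)]
heavy = [t + 0.5 * d for t, d in zip(target, disturbance)]

print(f"mild:  {si_snr(mild, target):.2f} dB")
print(f"heavy: {si_snr(heavy, target):.2f} dB")
```

Because the estimate is projected onto the target before the ratio is taken, rescaling the estimate leaves the score essentially unchanged, which is what makes the measure scale-invariant.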