Cem Subakan

Associate Academic Member
Assistant Professor, Université Laval, Department of Computer Science and Software Engineering
Affiliate Professor, Concordia University, Gina Cody School of Engineering and Computer Science
Research Topics
Multimodal Learning

Biography

Cem Subakan is an assistant professor at Université Laval in the Department of Computer Science and Software Engineering. He is also an affiliate assistant professor in the Department of Computer Science and Software Engineering at Concordia University, and an associate academic member at Mila - Quebec Artificial Intelligence Institute. He received his PhD in computer science from the University of Illinois at Urbana-Champaign (UIUC) and completed a postdoctoral fellowship at Mila. He serves as a reviewer for several conferences, including NeurIPS, ICML, ICLR, ICASSP, and MLSP, as well as for journals such as IEEE Signal Processing Letters (SPL) and IEEE Transactions on Audio, Speech, and Language Processing (TASL). His research focuses primarily on machine learning for speech and audio. More specifically, he works on deep learning for source separation and speech enhancement under realistic conditions, neural network interpretability, continual learning, and multimodal learning. He received the Best Student Paper Award at the 2017 IEEE Machine Learning for Signal Processing (MLSP) conference, as well as the Saburo Muroga Fellowship from the UIUC Department of Computer Science. He is also a core contributor to the SpeechBrain project, where he leads the speech separation effort.

Current Students

Co-supervisor:
Research Master's - Université Laval
PhD - Concordia
Principal supervisor:
PhD - Concordia
Principal supervisor:
PhD - Université Laval
Co-supervisor:
Research Intern - Université Laval
Co-supervisor:
PhD - Université Laval
Co-supervisor:
Alumni collaborator - Saarland University
Alumni collaborator - UdeM
Co-supervisor:
PhD - Université Laval
Co-supervisor:

Publications

Exploring Token-Space Manipulation in Latent Audio Tokenizers
Neural audio codecs provide compact discrete representations for speech generation and manipulation. However, most codecs organize tokens as frame-level sequences, making it difficult to study or intervene on global factors of variation. In this work, we propose the Latent Audio Tokenizer for Token-space Editing (LATTE) that appends a fixed set of learnable latent tokens to the audio feature sequence and retains only these tokens for quantization and decoding. This design produces a compact, non-temporally aligned bottleneck in which each token can aggregate global information across the full utterance. We show that the resulting tokenizer preserves competitive reconstruction quality in low-bitrate speech coding settings while enabling simple token-space interventions. In particular, we find that swapping selected latent token positions between utterances can modify global attributes, such as speaker identity and background noise, and we evaluate these interventions on voice conversion and denoising tasks. Our results suggest that compact latent audio tokenizers can support controllable audio manipulation without supervision or task-specific editing models.
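The token-space intervention described in the abstract above (swapping selected latent token positions between two utterances) can be illustrated with a minimal sketch. This is not the LATTE implementation: the function name `swap_latent_tokens`, the use of NumPy integer arrays as token sequences, and the example token values and positions are all hypothetical, chosen only to show the swap operation itself.

```python
import numpy as np


def swap_latent_tokens(tokens_a, tokens_b, positions):
    """Exchange the tokens at `positions` between two equal-length sequences.

    Hypothetical helper: each input is assumed to be a fixed-length array of
    latent token indices, one array per utterance.
    """
    a = np.asarray(tokens_a)
    b = np.asarray(tokens_b)
    if a.shape != b.shape:
        raise ValueError("both utterances must use the same number of latent tokens")
    idx = np.asarray(positions, dtype=int)
    new_a, new_b = a.copy(), b.copy()
    new_a[idx] = b[idx]  # utterance A receives B's tokens at the chosen positions
    new_b[idx] = a[idx]  # utterance B receives A's tokens at the same positions
    return new_a, new_b


# Illustrative token indices for two utterances with 8 latent tokens each.
utt_a = np.array([11, 12, 13, 14, 15, 16, 17, 18])
utt_b = np.array([21, 22, 23, 24, 25, 26, 27, 28])

# Assumption: positions 0 and 3 are the ones selected for the intervention.
edited_a, edited_b = swap_latent_tokens(utt_a, utt_b, positions=[0, 3])
print(edited_a)
print(edited_b)
```

In an actual tokenizer, the edited sequences would then be passed to the decoder; which positions carry which global attribute is something the abstract reports as an empirical finding, not something this sketch determines.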
DASB - Discrete Audio and Speech Benchmark
Jarod Duret
Darius Petermann
Anastasia Kuznetsova
Discrete audio tokens have recently gained considerable attention for their potential to bridge audio and language processing, enabling multimodal language models that can both generate and understand audio. However, preserving key information such as phonetic content, speaker identity, and paralinguistic cues remains a major challenge. Identifying the optimal tokenizer and configuration is further complicated by inconsistent evaluation settings across existing studies. To address this, we introduce the Discrete Audio and Speech Benchmark (DASB), a comprehensive framework for benchmarking discrete audio tokens across speech, general audio, and music domains on a range of discriminative and generative tasks. Our results show that discrete representations are less robust than continuous ones and require careful tuning of factors such as model architecture, data size, learning rate, and capacity. Semantic tokens generally outperform acoustic tokens, but a gap remains between discrete tokens and continuous features, highlighting the need for further research. DASB codes, evaluation setup, and leaderboards are publicly available at https://poonehmousavi.github.io/DASB-website/.
Listen First, Then Answer: Timestamp-Grounded Speech Reasoning
Large audio-language models (LALMs) can generate reasoning chains for their predictions, but it remains unclear whether these reasoning chains remain grounded in the input audio. In this paper, we propose an RL-based strategy that grounds the reasoning outputs of LALMs with explicit timestamp annotations referring to relevant segments of the audio signal. Our analysis shows that timestamp grounding leads the model to attend more strongly to audio tokens during reasoning generation. Experiments on four speech-based benchmark datasets demonstrate that our approach improves performance compared to both zero-shot reasoning and fine-tuning without timestamp grounding. Additionally, grounding amplifies desirable reasoning behaviors, such as region exploration, audio verification, and consistency, underscoring the importance of grounding mechanisms for faithful multimodal reasoning.
LL-SDR: Low-Latency Speech enhancement through Discrete Representations
Many speech enhancement (SE) methods rely on continuous representations. Recently, discrete audio tokens have been explored to enable autoregressive generation for SE. However, it remains unclear whether discretization itself consistently improves SE performance. In this paper, we introduce LL-SDR, a token-based speech enhancement framework that explicitly leverages discretization to better separate speech and noise. Our first contribution is a Variance-Ordered Residual Vector Quantizer (VO-RVQ), designed to disentangle speech and noise distributions during tokenization. Second, we propose a latent-space discriminator to better align enhanced embeddings with semantic embeddings. Experiments show that LL-SDR outperforms continuous baselines and matches the performance of autoregressive token-based approaches, while enabling lightweight, low-latency speech enhancement in both reverberant and non-reverberant noisy environments. Demos and source code are available at our project websites.
WavSLM: Single-Stream Speech Language Modeling via WavLM Distillation
Large language models show that simple autoregressive training can yield scalable and coherent generation, but extending this paradigm to speech remains challenging due to the entanglement of semantic and acoustic information. Most existing speech language models rely on text supervision, hierarchical token streams, or complex hybrid architectures, departing from the single-stream generative pretraining paradigm that has proven effective in text. In this work, we introduce WavSLM, a speech language model trained by quantizing and distilling self-supervised WavLM representations into a single codebook and optimizing an autoregressive next-chunk prediction objective. WavSLM jointly models semantic and acoustic information within a single token stream without text supervision or text pretraining. Despite its simplicity, it achieves competitive performance on consistency benchmarks and speech generation while using fewer parameters, less training data, and supporting streaming inference. Demo samples are available at https://lucadellalib.github.io/wavslm-web/.
Hierarchical Retrieval at Scale: Bridging Transparency and Efficiency
Tianyi Chen
Valentina Zantedeschi
Information retrieval is a core component of many intelligent systems as it enables conditioning of outputs on new and large-scale datasets. While effective, the standard practice of encoding data into high-dimensional representations for similarity search entails large memory and compute footprints, and also makes it hard to inspect the inner workings of the system. Hierarchical retrieval methods offer an interpretable alternative by organizing data at multiple granular levels, yet do not match the efficiency and performance of flat retrieval approaches. In this paper, we propose ReTreever, a tree-based method that makes hierarchical retrieval viable at scale by directly optimizing its structure for retrieval performance while naturally providing transparency through meaningful semantic groupings. Our method offers the flexibility to balance cost and utility by indexing data using representations from any tree level. We show that ReTreever delivers strong coarse (intermediate levels) and fine representations (terminal level), while achieving the highest retrieval accuracy at the lowest latency among hierarchical methods. These results demonstrate that this family of techniques is viable in practical applications.
Knowing When to Answer: Adaptive Confidence Refinement for Reliable Audio-Visual Question Answering
Dinh Phu Tran
Saad Wazir
Seongah Kim
Thao Do
Daeyoung Kim
We present a formal problem formulation for Reliable Audio-Visual Question Answering (…
Beyond Fixed Frames: Dynamic Character-Aligned Speech Tokenization
Neural audio codecs are at the core of modern conversational speech technologies, converting continuous speech into sequences of discrete tokens that can be processed by LLMs. However, existing codecs typically operate at fixed frame rates, allocating tokens uniformly in time and producing unnecessarily long sequences. In this work, we introduce DyCAST, a Dynamic Character-Aligned Speech Tokenizer that enables variable-frame-rate tokenization through soft character-level alignment and explicit duration modeling. DyCAST learns to associate tokens with character-level linguistic units during training and supports alignment-free inference with direct control over token durations at decoding time. To improve speech resynthesis quality at low frame rates, we further introduce a retrieval-augmented decoding mechanism that enhances reconstruction fidelity without increasing bitrate. Experiments show that DyCAST achieves competitive speech resynthesis quality and downstream performance while using significantly fewer tokens than fixed-frame-rate codecs. Code and checkpoints will be released publicly at https://github.com/lucadellalib/dycast.
FocalCodec: Low-Bitrate Speech Coding via Focal Modulation Networks
Large language models have revolutionized natural language processing through self-supervised pretraining on massive datasets. Inspired by this success, researchers have explored adapting these methods to speech by discretizing continuous audio into tokens using neural audio codecs. However, existing approaches face limitations, including high bitrates, the loss of either semantic or acoustic information, and the reliance on multi-codebook designs when trying to capture both, which increases architectural complexity for downstream tasks. To address these challenges, we introduce FocalCodec, an efficient low-bitrate codec based on focal modulation that utilizes a single binary codebook to compress speech between 0.16 and 0.65 kbps. FocalCodec delivers competitive performance in speech resynthesis and voice conversion at lower bitrates than the current state-of-the-art, while effectively handling multilingual speech and noisy environments. Evaluation on downstream tasks shows that FocalCodec successfully preserves sufficient semantic and acoustic information, while also being well-suited for generative modeling. Demo samples and code are available at https://lucadellalib.github.io/focalcodec-web/.
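As a worked illustration of how a single-codebook codec's bitrate follows from its token rate, the sketch below computes bitrate as token rate times bits per token. The abstract above states only the 0.16 to 0.65 kbps range; the 13-bit codebook (2**13 entries) and the 12.5, 25, and 50 Hz token rates used here are assumptions, picked because they reproduce that range. The function name `bitrate_kbps` is hypothetical.

```python
import math


def bitrate_kbps(token_rate_hz: float, codebook_size: int, num_codebooks: int = 1) -> float:
    """Bitrate in kbps for a codec emitting `num_codebooks` tokens per frame."""
    bits_per_token = math.log2(codebook_size)
    return token_rate_hz * bits_per_token * num_codebooks / 1000.0


# Assumed configuration (not stated in the abstract): one codebook of 2**13 entries.
ASSUMED_CODEBOOK_SIZE = 2 ** 13

for rate_hz in (12.5, 25.0, 50.0):  # assumed token rates in Hz
    print(f"{rate_hz:>5} Hz -> {bitrate_kbps(rate_hz, ASSUMED_CODEBOOK_SIZE):.4f} kbps")
```

Under these assumptions the lowest rate gives 12.5 × 13 = 162.5 bits per second and the highest gives 50 × 13 = 650 bits per second, matching the endpoints quoted in the abstract after rounding.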
Investigating the Effectiveness of Explainability Methods in Parkinson's Detection from Speech
Speech impairments in Parkinson's disease (PD) provide significant early indicators for diagnosis. While models for speech-based PD detection have shown strong performance, their interpretability remains underexplored. This study systematically evaluates several explainability methods to identify PD-specific speech features, aiming to support the development of accurate, interpretable models for clinical decision-making in PD diagnosis and monitoring. Our methodology involves (i) obtaining attributions and saliency maps using mainstream interpretability techniques, (ii) quantitatively evaluating the faithfulness of these maps and their combinations obtained via union and intersection through a range of established metrics, and (iii) assessing the information conveyed by the saliency maps for PD detection from an auxiliary classifier. Our results reveal that, while explanations are aligned with the classifier, they often fail to provide valuable information for domain experts.
LMAC-TD: Producing Time Domain Explanations for Audio Classifiers
Neural networks are typically black boxes that remain opaque with regard to their decision mechanisms. Several works in the literature have proposed post-hoc explanation methods to alleviate this issue. This paper proposes LMAC-TD, a post-hoc explanation method that trains a decoder to produce explanations directly in the time domain. This methodology builds upon the foundation of L-MAC, Listenable Maps for Audio Classifiers, a method that produces faithful and listenable explanations. We incorporate SepFormer, a popular transformer-based time-domain source separation architecture. We show through a user study that LMAC-TD significantly improves the audio quality of the produced explanations without sacrificing faithfulness.
Audio Prototypical Network For Controllable Music Recommendation
Traditional recommendation systems represent user preferences in dense representations obtained through black-box encoder models. While these models often provide strong recommendation performance, they lack interpretability for users, leaving users unable to understand or control the system's modeling of their preferences. This limitation is especially challenging in music recommendation, where user preferences are highly personal and often evolve based on nuanced qualities like mood, genre, tempo, or instrumentation. In this paper, we propose an audio prototypical network for controllable music recommendation. This network expresses user preferences in terms of prototypes representative of semantically meaningful features pertaining to musical qualities. We show that the model obtains competitive recommendation performance compared to popular baseline models while also providing interpretable and controllable user profiles.