Cem Subakan

Associate Academic Member
Assistant Professor, Université Laval, Department of Computer Science and Software Engineering
Affiliate Professor, Concordia University, Gina Cody School of Engineering and Computer Science
Research Topics
Multimodal Learning

Biography

Cem Subakan is an Assistant Professor at Université Laval in the Department of Computer Science and Software Engineering. He is also an Affiliate Assistant Professor in the Department of Computer Science and Software Engineering at Concordia University, and an Associate Academic Member at Mila – Quebec Artificial Intelligence Institute. He received his PhD in Computer Science from the University of Illinois at Urbana-Champaign (UIUC) and completed a postdoc at Mila. He serves as a reviewer for several conferences, including NeurIPS, ICML, ICLR, ICASSP, and MLSP, as well as for journals such as IEEE Signal Processing Letters (SPL) and IEEE Transactions on Audio, Speech, and Language Processing (TASL). His research focuses mainly on machine learning for speech and audio. More specifically, he works on deep learning for source separation and speech enhancement under realistic conditions, neural network interpretability, continual learning, and multimodal learning. He received the Best Student Paper Award at the IEEE Machine Learning for Signal Processing (MLSP) conference in 2017, as well as the Saburo Muroga Fellowship from the UIUC Department of Computer Science. He is also a core contributor to the SpeechBrain project, where he leads the speech separation part.

Current Students

Co-supervisor:
Research Master's - Université Laval
PhD - Concordia
Principal supervisor:
PhD - Concordia
Principal supervisor:
PhD - Université Laval
Co-supervisor:
PhD - Université Laval
Co-supervisor:
Alumni collaborator - UdeM
Co-supervisor:
Independent visiting researcher
Research Master's - Université Laval

Publications

Investigating Faithfulness in Large Audio Language Models
Faithfulness measures whether chain-of-thought (CoT) representations accurately reflect a model's decision process and can be used as reliable explanations. Prior work has shown that CoTs from text-based LLMs are often unfaithful. This question has not been explored for large audio-language models (LALMs), where faithfulness is critical for safety-sensitive applications. Reasoning in LALMs is also more challenging, as models must first extract relevant clues from audio before reasoning over them. In this paper, we investigate the faithfulness of CoTs produced by several LALMs by applying targeted interventions, including paraphrasing, filler token injection, early answering, and introducing mistakes, on two challenging reasoning datasets: SAKURA and MMAR. After applying these interventions across several datasets and tasks, our experiments suggest that LALMs generally produce CoTs that appear to be faithful to their underlying decision processes.
Virtual Consistency for Audio Editing
Free-form, text-based audio editing remains a persistent challenge, despite progress in inversion-based neural methods. Current approaches rely on slow inversion procedures, limiting their practicality. We present a virtual-consistency based audio editing system that bypasses inversion by adapting the sampling process of diffusion models. Our pipeline is model-agnostic, requiring no fine-tuning or architectural changes, and achieves substantial speed-ups over recent neural editing baselines. Crucially, it achieves this efficiency without compromising quality, as demonstrated by quantitative benchmarks and a user study involving 16 participants.
FocalCodec-Stream: Streaming Low-Bitrate Speech Coding via Causal Distillation
Neural audio codecs are a fundamental component of modern generative audio pipelines. Although recent codecs achieve strong low-bitrate reconstruction and provide powerful representations for downstream tasks, most are non-streamable, limiting their use in real-time applications. We present FocalCodec-Stream, a hybrid codec based on focal modulation that compresses speech into a single binary codebook at 0.55 - 0.80 kbps with a theoretical latency of 80 ms. Our approach combines multi-stage causal distillation of WavLM with targeted architectural improvements, including a lightweight refiner module that enhances quality under latency constraints. Experiments show that FocalCodec-Stream outperforms existing streamable codecs at comparable bitrates, while preserving both semantic and acoustic information. The result is a favorable trade-off between reconstruction quality, downstream task performance, latency, and efficiency. Code and checkpoints will be released at https://github.com/lucadellalib/focalcodec.
Discrete Audio Tokens: More Than a Survey!
Gallil Maimon
Adel Moumen
Darius Petermann
Jiatong Shi
Haibin Wu
Haici Yang
Anastasia Kuznetsova
Bhuvana Ramabhadran
Benjamin Elizalde
Loren Lugosch
Jinyu Li
Phil Woodland
Minje Kim
Hung-yi Lee
Shinji Watanabe
Yossi Adi … (see 1 more)
Discrete audio tokens are compact representations that aim to preserve perceptual quality, phonetic content, and speaker characteristics while enabling efficient storage and inference, as well as competitive performance across diverse downstream tasks. They provide a practical alternative to continuous features, enabling the integration of speech and audio into modern large language models (LLMs). As interest in token-based audio processing grows, various tokenization methods have emerged, and several surveys have reviewed the latest progress in the field. However, existing studies often focus on specific domains or tasks and lack a unified comparison across various benchmarks. This paper presents a systematic review and benchmark of discrete audio tokenizers, covering three domains: speech, music, and general audio. We propose a taxonomy of tokenization approaches based on encoder-decoder, quantization techniques, training paradigm, streamability, and application domains. We evaluate tokenizers on multiple benchmarks for reconstruction, downstream performance, and acoustic language modeling, and analyze trade-offs through controlled ablation studies. Our findings highlight key limitations, practical considerations, and open challenges, providing insight and guidance for future research in this rapidly evolving area. For more information, including our main results and tokenizer database, please refer to our website: https://poonehmousavi.github.io/dates-website/.
Audio Prototypical Network for Controllable Music Recommendation
Traditional recommendation systems represent user preferences in dense representations obtained through black-box encoder models. While these models often provide strong recommendation performance, they lack interpretability for users, leaving users unable to understand or control the system’s modeling of their preferences. This limitation is especially challenging in music recommendation, where user preferences are highly personal and often evolve based on nuanced qualities like mood, genre, tempo, or instrumentation. In this paper, we propose an audio prototypical network for controllable music recommendation. This network expresses user preferences in terms of prototypes representative of semantically meaningful features pertaining to musical qualities. We show that the model obtains competitive recommendation performance compared to popular baseline models while also providing interpretable and controllable user profiles.
Autoregressive Speech Enhancement via Acoustic Tokens
LiSTEN: Learning Soft Token Embeddings for Neural Audio LLMs
Foundation models based on large language models (LLMs) have shown great success in handling various tasks and modalities. However, adapting these models for general-purpose audio-language tasks is challenging due to differences in acoustic environments and task variations. In this work, we introduce LiSTEN (Learning Soft Token Embeddings for Neural Audio LLMs), a framework for adapting LLMs to speech and audio tasks. LiSTEN uses a dynamic prompt selection strategy with learnable key-value pairs, allowing the model to balance general and task-specific knowledge while avoiding overfitting in a multitask setting. Our approach reduces dependence on large-scale ASR or captioning datasets, achieves competitive performance with fewer trainable parameters, and simplifies training by using a single-stage process. Additionally, LiSTEN enhances interpretability by analyzing the diversity and overlap of selected prompts across different tasks.
ALAS: Measuring Latent Speech-Text Alignment For Spoken Language Understanding In Multimodal LLMs
Investigating the Effectiveness of Explainability Methods in Parkinson's Detection from Speech
Speech impairments in Parkinson's disease (PD) provide significant early indicators for diagnosis. While models for speech-based PD detection have shown strong performance, their interpretability remains underexplored. This study systematically evaluates several explainability methods to identify PD-specific speech features, aiming to support the development of accurate, interpretable models for clinical decision-making in PD diagnosis and monitoring. Our methodology involves (i) obtaining attributions and saliency maps using mainstream interpretability techniques, (ii) quantitatively evaluating the faithfulness of these maps and their combinations obtained via union and intersection through a range of established metrics, and (iii) assessing the information conveyed by the saliency maps for PD detection from an auxiliary classifier. Our results reveal that, while explanations are aligned with the classifier, they often fail to provide valuable information for domain experts.
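
As a rough illustration of step (ii) in the abstract above, the sketch below shows one way two saliency maps could be combined via union and intersection. It is not the paper's implementation: the binarization rule (keeping the top-scoring fraction of cells), the toy maps, and all function names are assumptions made for this example.

```python
from typing import List

Map = List[List[float]]
Mask = List[List[bool]]


def top_k_mask(saliency: Map, keep_fraction: float) -> Mask:
    """Binarize a saliency map by keeping roughly the top `keep_fraction` of cells.

    Assumption: thresholding at the k-th largest value. Ties at the
    threshold are all kept, so slightly more than k cells may survive.
    """
    flat = sorted((v for row in saliency for v in row), reverse=True)
    k = max(1, round(keep_fraction * len(flat)))
    threshold = flat[k - 1]
    return [[v >= threshold for v in row] for row in saliency]


def union(a: Mask, b: Mask) -> Mask:
    """Cell is salient if either explanation method marks it."""
    return [[x or y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def intersection(a: Mask, b: Mask) -> Mask:
    """Cell is salient only if both explanation methods mark it."""
    return [[x and y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def coverage(mask: Mask) -> float:
    """Fraction of cells marked salient."""
    cells = [c for row in mask for c in row]
    return sum(cells) / len(cells)


# Hypothetical 2x3 saliency maps (e.g. frequency x time) from two methods.
map_a = [[0.9, 0.1, 0.4], [0.2, 0.8, 0.3]]
map_b = [[0.7, 0.6, 0.1], [0.1, 0.9, 0.2]]

mask_a = top_k_mask(map_a, keep_fraction=0.5)
mask_b = top_k_mask(map_b, keep_fraction=0.5)

combined_union = union(mask_a, mask_b)
combined_intersection = intersection(mask_a, mask_b)

print("union coverage:", coverage(combined_union))
print("intersection coverage:", coverage(combined_intersection))
```

The union is the more permissive combination and the intersection the stricter one. Either combined mask could then be passed to whichever faithfulness metric is being evaluated.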