Artem Ploujnikov

DASB - Discrete Audio and Speech Benchmark

Pooneh Mousavi

Jarod Duret

Darius Petermann

Anastasia Kuznetsova

Discrete audio tokens have recently gained considerable attention for their potential to bridge audio and language processing, enabling multimodal language models that can both generate and understand audio. However, preserving key information such as phonetic content, speaker identity, and paralinguistic cues remains a major challenge. Identifying the optimal tokenizer and configuration is further complicated by inconsistent evaluation settings across existing studies. To address this, we introduce the Discrete Audio and Speech Benchmark (DASB), a comprehensive framework for benchmarking discrete audio tokens across speech, general audio, and music domains on a range of discriminative and generative tasks. Our results show that discrete representations are less robust than continuous ones and require careful tuning of factors such as model architecture, data size, learning rate, and capacity. Semantic tokens generally outperform acoustic tokens, but a gap remains between discrete tokens and continuous features, highlighting the need for further research. DASB codes, evaluation setup, and leaderboards are publicly available at https://poonehmousavi.github.io/DASB-website/.

2026-04-12

Transactions on Machine Learning Research (accepted)

doi.org

openreview.net

What Are They Doing? Joint Audio-Speech Co-Reasoning

Yingzhi Wang

Pooneh Mousavi

Artem Ploujnikov

Mirco Ravanelli

In audio and speech processing, tasks usually focus on either the audio or speech modality, even when both sounds and human speech are present in the same audio clip. Recent Auditory Large Language Models (ALLMs) have made it possible to process audio and speech simultaneously within a single model, leading to further considerations of joint audio-speech tasks. In this paper, we establish a novel benchmark to investigate how well ALLMs can perform joint audio-speech processing. Specifically, we introduce Joint Audio-Speech Co-Reasoning (JASCO), a novel task that unifies audio and speech processing, strictly requiring co-reasoning across both modalities. We also release a scene-reasoning dataset called "What Are They Doing". Additionally, we provide deeper insights into the models' behaviors by analyzing their dependence on each modality.

2025-04-05

ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (published)

doi.org

arxiv.org

Discrete Audio Tokens: More Than a Survey!

Pooneh Mousavi

Gallil Maimon

Adel Moumen

Darius Petermann

Jiatong Shi

Haibin Wu

Haici Yang

Anastasia Kuznetsova

Artem Ploujnikov

Ricard Marxer

Bhuvana Ramabhadran

Benjamin Elizalde

Loren Lugosch

Jinyu Li

Yusuf Cem Sübakan

Phil Woodland

Minje Kim

Hung-yi Lee

Shinji Watanabe

Yossi Adi

Mirco Ravanelli

Discrete audio tokens are compact representations that aim to preserve perceptual quality, phonetic content, and speaker characteristics while enabling efficient storage and inference, as well as competitive performance across diverse downstream tasks. They provide a practical alternative to continuous features, enabling the integration of speech and audio into modern large language models (LLMs). As interest in token-based audio processing grows, various tokenization methods have emerged, and several surveys have reviewed the latest progress in the field. However, existing studies often focus on specific domains or tasks and lack a unified comparison across various benchmarks. This paper presents a systematic review and benchmark of discrete audio tokenizers, covering three domains: speech, music, and general audio. We propose a taxonomy of tokenization approaches based on encoder-decoder, quantization techniques, training paradigm, streamability, and application domains. We evaluate tokenizers on multiple benchmarks for reconstruction, downstream performance, and acoustic language modeling, and analyze trade-offs through controlled ablation studies. Our findings highlight key limitations, practical considerations, and open challenges, providing insight and guidance for future research in this rapidly evolving area. For more information, including our main results and tokenizer database, please refer to our website: https://poonehmousavi.github.io/dates-website/.

2024-12-31

Transactions on Machine Learning Research (published)

doi.org

openreview.net

Open-Source Conversational AI with SpeechBrain 1.0

Mirco Ravanelli

Titouan Parcollet

Adel Moumen

Sylvain de Langen

Yingzhi Wang

Zeyu Zhao

Shucong Zhang

Georgios Karakasidis

Sung-Lin Yeh

Pierre Champion

Aku Rouhe

Rudolf Braun

Florian Mai

Juan Zuluaga-Gomez

Seyed Mahed Mousavi

Andreas Nautsch

Ha Nguyen

Xuechen Liu

Sangeet Sagar

Jarod Duret

Salima Mdhaffar

Gaëlle Laperrière

Mickael Rouvier

Renato De Mori

Yannick Estève

SpeechBrain is an open-source Conversational AI toolkit based on PyTorch, focused particularly on speech processing tasks such as speech recognition, speech enhancement, speaker recognition, text-to-speech, and much more. It promotes transparency and replicability by releasing both the pre-trained models and the complete "recipes" of code and algorithms required for training them. This paper presents SpeechBrain 1.0, a significant milestone in the evolution of the toolkit, which now has over 200 recipes for speech, audio, and language processing tasks, and more than 100 models available on Hugging Face. SpeechBrain 1.0 introduces new technologies to support diverse learning modalities, Large Language Model (LLM) integration, and advanced decoding strategies, along with novel models, tasks, and modalities. It also includes a new benchmark repository, offering researchers a unified platform for evaluating models across diverse tasks.

2024-06-28

arXiv (preprint)

doi.org

arxiv.org

DASB -- Discrete Audio and Speech Benchmark

Pooneh Mousavi

Luca Della Libera

Jarod Duret

Artem Ploujnikov

Yusuf Cem Sübakan

Mirco Ravanelli

Discrete audio tokens have recently gained considerable attention for their potential to connect audio and language processing, enabling the creation of modern multimodal large language models. Ideal audio tokens must effectively preserve phonetic and semantic content along with paralinguistic information, speaker identity, and other details. While several types of audio tokens have been recently proposed, identifying the optimal tokenizer for various tasks is challenging due to the inconsistent evaluation settings in existing studies. To address this gap, we release the Discrete Audio and Speech Benchmark (DASB), a comprehensive leaderboard for benchmarking discrete audio tokens across a wide range of discriminative tasks, including speech recognition, speaker identification and verification, emotion recognition, keyword spotting, and intent classification, as well as generative tasks such as speech enhancement, separation, and text-to-speech. Our results show that, on average, semantic tokens outperform compression tokens across most discriminative and generative tasks. However, the performance gap between semantic tokens and standard continuous representations remains substantial, highlighting the need for further research in this field.

2024-06-19

arXiv (preprint)

doi.org

openreview.net

How Should We Extract Discrete Audio Tokens from Self-Supervised Models?

Jarod Duret

Yusuf Cem Sübakan

Mirco Ravanelli

2023-12-31

INTERSPEECH (published)

doi.org

arxiv.org

SoundChoice: Grapheme-to-Phoneme Models with Semantic Disambiguation

Artem Ploujnikov

Mirco Ravanelli

End-to-end speech synthesis models directly convert the input characters into an audio representation (e.g., spectrograms). Despite their impressive performance, such models have difficulty disambiguating the pronunciations of identically spelled words. To mitigate this issue, a separate Grapheme-to-Phoneme (G2P) model can be employed to convert the characters into phonemes before synthesizing the audio. This paper proposes SoundChoice, a novel G2P architecture that processes entire sentences rather than operating at the word level. The proposed architecture takes advantage of a weighted homograph loss (that improves disambiguation), exploits curriculum learning (that gradually switches from word-level to sentence-level G2P), and integrates word embeddings from BERT (for further performance improvement). Moreover, the model inherits the best practices in speech recognition, including multi-task learning with Connectionist Temporal Classification (CTC) and beam search with an embedded language model. As a result, SoundChoice achieves a Phoneme Error Rate (PER) of 2.65% on whole-sentence transcription using data from LibriSpeech and Wikipedia. Index Terms: grapheme-to-phoneme, speech synthesis, text-to-speech, phonetics, pronunciation, disambiguation.

2022-09-17

Interspeech 2022 (published)

doi.org

arxiv.org
