Cem Subakan

Investigating Faithfulness in Large Audio Language Models

Lovenya Jain

Faithfulness measures whether chain-of-thought (CoT) representations accurately reflect a model's decision process and can be used as reliable explanations. Prior work has shown that CoTs from text-based LLMs are often unfaithful. This question has not been explored for large audio-language models (LALMs), where faithfulness is critical for safety-sensitive applications. Reasoning in LALMs is also more challenging, as models must first extract relevant clues from audio before reasoning over them. In this paper, we investigate the faithfulness of CoTs produced by several LALMs by applying targeted interventions, including paraphrasing, filler token injection, early answering, and introducing mistakes, on two challenging reasoning datasets: SAKURA and MMAR. Across these interventions, datasets, and tasks, our experiments suggest that LALMs generally produce CoTs that appear to be faithful to their underlying decision processes.

2025-09-26

arXiv (preprint)

Virtual Consistency for Audio Editing

Free-form, text-based audio editing remains a persistent challenge, despite progress in inversion-based neural methods. Current approaches rely on slow inversion procedures, limiting their practicality. We present a virtual-consistency-based audio editing system that bypasses inversion by adapting the sampling process of diffusion models. Our pipeline is model-agnostic, requiring no fine-tuning or architectural changes, and achieves substantial speed-ups over recent neural editing baselines. Crucially, it achieves this efficiency without compromising quality, as demonstrated by quantitative benchmarks and a user study involving 16 participants.

2025-09-21

arXiv (preprint)

FocalCodec-Stream: Streaming Low-Bitrate Speech Coding via Causal Distillation

Luca Della Libera

Neural audio codecs are a fundamental component of modern generative audio pipelines. Although recent codecs achieve strong low-bitrate reconstruction and provide powerful representations for downstream tasks, most are non-streamable, limiting their use in real-time applications. We present FocalCodec-Stream, a hybrid codec based on focal modulation that compresses speech into a single binary codebook at 0.55-0.80 kbps with a theoretical latency of 80 ms. Our approach combines multi-stage causal distillation of WavLM with targeted architectural improvements, including a lightweight refiner module that enhances quality under latency constraints. Experiments show that FocalCodec-Stream outperforms existing streamable codecs at comparable bitrates, while preserving both semantic and acoustic information. The result is a favorable trade-off between reconstruction quality, downstream task performance, latency, and efficiency. Code and checkpoints will be released at https://github.com/lucadellalib/focalcodec.

2025-09-01

arXiv (published)

Autoregressive Speech Enhancement via Acoustic Tokens

Luca Della Libera

2025-07-01

arXiv (published)

Discrete Audio Tokens: More Than a Survey!

Gallil Maimon

Adel Moumen

Darius Petermann

Jiatong Shi

Haibin Wu

Haici Yang

Anastasia Kuznetsova

Artem Ploujnikov

Ricard Marxer

Bhuvana Ramabhadran

Benjamin Elizalde

Loren Lugosch

Jinyu Li

Phil Woodland

Minje Kim

Hung-yi Lee

Shinji Watanabe

Yossi Adi … (1 more)

Discrete audio tokens are compact representations that aim to preserve perceptual quality, phonetic content, and speaker characteristics while enabling efficient storage and inference, as well as competitive performance across diverse downstream tasks. They provide a practical alternative to continuous features, enabling the integration of speech and audio into modern large language models (LLMs). As interest in token-based audio processing grows, various tokenization methods have emerged, and several surveys have reviewed the latest progress in the field. However, existing studies often focus on specific domains or tasks and lack a unified comparison across various benchmarks. This paper presents a systematic review and benchmark of discrete audio tokenizers, covering three domains: speech, music, and general audio. We propose a taxonomy of tokenization approaches based on encoder-decoder, quantization techniques, training paradigm, streamability, and application domains. We evaluate tokenizers on multiple benchmarks for reconstruction, downstream performance, and acoustic language modeling, and analyze trade-offs through controlled ablation studies. Our findings highlight key limitations, practical considerations, and open challenges, providing insight and guidance for future research in this rapidly evolving area. For more information, including our main results and tokenizer database, please refer to our website: https://poonehmousavi.github.io/dates-website/.

2025-06-12

arXiv (preprint)

LiSTEN: Learning Soft Token Embeddings for Neural Audio LLMs

Foundation models based on large language models (LLMs) have shown great success in handling various tasks and modalities. However, adapting these models for general-purpose audio-language tasks is challenging due to differences in acoustic environments and task variations. In this work, we introduce LiSTEN (Learning Soft Token Embeddings for Neural Audio LLMs), a framework for adapting LLMs to speech and audio tasks. LiSTEN uses a dynamic prompt selection strategy with learnable key-value pairs, allowing the model to balance general and task-specific knowledge while avoiding overfitting in a multitask setting. Our approach reduces dependence on large-scale ASR or captioning datasets, achieves competitive performance with fewer trainable parameters, and simplifies training by using a single-stage process. Additionally, LiSTEN enhances interpretability by analyzing the diversity and overlap of selected prompts across different tasks.

2025-05-24

arXiv (preprint)

ALAS: Measuring Latent Speech-Text Alignment For Spoken Language Understanding In Multimodal LLMs

Yingzhi Wang

2025-05-01

arXiv (published)

LiSTEN: Learning Soft Token Embeddings for Neural Audio LLMs

Foundation models based on large language models (LLMs) have shown great success in handling various tasks and modalities. However, adapting these models for general-purpose audio-language tasks is challenging due to differences in acoustic environments and task variations. In this work, we introduce LiSTEN (Learning Soft Token Embeddings for Neural Audio LLMs), a framework for adapting LLMs to speech and audio tasks. LiSTEN uses a dynamic prompt selection strategy with learnable key-value pairs, allowing the model to balance general and task-specific knowledge while avoiding overfitting in a multitask setting. Our approach reduces dependence on large-scale ASR or captioning datasets, achieves competitive performance with fewer trainable parameters, and simplifies training by using a single-stage process. Additionally, LiSTEN enhances interpretability by analyzing the diversity and overlap of selected prompts across different tasks.

2025-05-01

arXiv (published)

Investigating the Effectiveness of Explainability Methods in Parkinson's Detection from Speech

Paolo Torroni

Speech impairments in Parkinson's disease (PD) provide significant early indicators for diagnosis. While models for speech-based PD detection have shown strong performance, their interpretability remains underexplored. This study systematically evaluates several explainability methods to identify PD-specific speech features, aiming to support the development of accurate, interpretable models for clinical decision-making in PD diagnosis and monitoring. Our methodology involves (i) obtaining attributions and saliency maps using mainstream interpretability techniques, (ii) quantitatively evaluating the faithfulness of these maps and their combinations obtained via union and intersection through a range of established metrics, and (iii) assessing the information conveyed by the saliency maps for PD detection from an auxiliary classifier. Our results reveal that, while explanations are aligned with the classifier, they often fail to provide valuable information for domain experts.

2025-04-06

2025 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW) (published)