Cem Subakan

ReTreever: Tree-based Coarse-to-Fine Representations for Retrieval

Tianyi Chen

Perouz Taslakian

Valentina Zantedeschi

Document retrieval is a core component of question-answering systems, as it enables conditioning answer generation on new and large-scale co… (see more)rpora. While effective, the standard practice of encoding documents into high-dimensional embeddings for similarity search entails large memory and compute footprints, and also makes it hard to inspect the inner workings of the system. In this paper, we propose a tree-based method for organizing and representing reference documents at various granular levels, which offers the flexibility to balance cost and utility, and eases the inspection of the corpus content and retrieval operations. Our method, called ReTreever, jointly learns a routing function per internal node of a binary tree such that query and reference documents are assigned to similar tree branches, hence directly optimizing for retrieval performance. Our evaluations show that ReTreever generally preserves full representation accuracy. Its hierarchical structure further provides strong coarse representations and enhances transparency by indirectly learning meaningful semantic groupings. Among hierarchical retrieval methods, ReTreever achieves the best retrieval accuracy at the lowest latency, proving that this family of techniques can be viable in practical applications.

2025-02-11

ArXiv (preprint)

ReTreever: Tree-based Coarse-to-Fine Representations for Retrieval

Tianyi Chen

Perouz Taslakian

Valentina Zantedeschi

2025-02-11

ArXiv (preprint)

FocalCodec: Low-Bitrate Speech Coding via Focal Modulation Networks

Large language models have revolutionized natural language processing through self-supervised pretraining on massive datasets. Inspired by t… (see more)his success, researchers have explored adapting these methods to speech by discretizing continuous audio into tokens using neural audio codecs. However, existing approaches face limitations, including high bitrates, the loss of either semantic or acoustic information, and the reliance on multi-codebook designs when trying to capture both, which increases architectural complexity for downstream tasks. To address these challenges, we introduce FocalCodec, an efficient low-bitrate codec based on focal modulation that utilizes a single binary codebook to compress speech between 0.16 and 0.65 kbps. FocalCodec delivers competitive performance in speech resynthesis and voice conversion at lower bitrates than the current state-of-the-art, while effectively handling multilingual speech and noisy environments. Evaluation on downstream tasks shows that FocalCodec successfully preserves sufficient semantic and acoustic information, while also being well-suited for generative modeling. Demo samples, code and checkpoints are available at https://lucadellalib.github.io/focalcodec-web/.

2025-02-06

ArXiv (preprint)

FocalCodec: Low-Bitrate Speech Coding via Focal Modulation Networks

Large language models have revolutionized natural language processing through self-supervised pretraining on massive datasets. Inspired by t… (see more)his success, researchers have explored adapting these methods to speech by discretizing continuous audio into tokens using neural audio codecs. However, existing approaches face limitations, including high bitrates, the loss of either semantic or acoustic information, and the reliance on multi-codebook designs when trying to capture both, which increases architectural complexity for downstream tasks. To address these challenges, we introduce FocalCodec, an efficient low-bitrate codec based on focal modulation that utilizes a single binary codebook to compress speech between 0.16 and 0.65 kbps. FocalCodec delivers competitive performance in speech resynthesis and voice conversion at lower bitrates than the current state-of-the-art, while effectively handling multilingual speech and noisy environments. Evaluation on downstream tasks shows that FocalCodec successfully preserves sufficient semantic and acoustic information, while also being well-suited for generative modeling. Demo samples, code and checkpoints are available at https://lucadellalib.github.io/focalcodec-web/.

2025-02-06

ArXiv (preprint)

Investigating the Effectiveness of Explainability Methods in Parkinson's Detection from Speech

Paolo Torroni

Speech impairments in Parkinson's disease (PD) provide significant early indicators for diagnosis. While models for speech-based PD detectio… (see more)n have shown strong performance, their interpretability remains underexplored. This study systematically evaluates several explainability methods to identify PD-specific speech features, aiming to support the development of accurate, interpretable models for clinical decision-making in PD diagnosis and monitoring. Our methodology involves (i) obtaining attributions and saliency maps using mainstream interpretability techniques, (ii) quantitatively evaluating the faithfulness of these maps and their combinations obtained via union and intersection through a range of established metrics, and (iii) assessing the information conveyed by the saliency maps for PD detection from an auxiliary classifier. Our results reveal that, while explanations are aligned with the classifier, they often fail to provide valuable information for domain experts.

2024-11-12

ArXiv (preprint)

Listenable Maps for Zero-Shot Audio Classifiers

Interpreting the decisions of deep learning models, including audio classifiers, is crucial for ensuring the transparency and trustworthines… (see more)s of this technology. In this paper, we introduce LMAC-ZS (Listenable Maps for Audio Classifiers in the Zero-Shot context), which, to the best of our knowledge, is the first decoder-based post-hoc interpretation method for explaining the decisions of zero-shot audio classifiers. The proposed method utilizes a novel loss function that maximizes the faithfulness to the original similarity between a given text-and-audio pair. We provide an extensive evaluation using the Contrastive Language-Audio Pretraining (CLAP) model to showcase that our interpreter remains faithful to the decisions in a zero-shot classification context. Moreover, we qualitatively show that our method produces meaningful explanations that correlate well with different text prompts.

2024-09-25

NeurIPS.cc/2024/Conference (poster)

openreview.net

Dynamic HumTrans: Humming Transcription Using CNNs and Dynamic Programming

Shubham Gupta

Isaac Neri Gomez-Sarmiento

Faez Amjed Mezdari

2024-09-19

Lecture Notes in Computer Science (published)

Audio Editing with Non-Rigid Text Prompts

Francesco Paissan

Zhepei Wang

Paris Smaragdis

In this paper, we explore audio-editing with non-rigid text edits. We show that the proposed editing pipeline is able to create audio edits … (see more)that remain faithful to the input audio. We explore text prompts that perform addition, style transfer, and in-painting. We quantitatively and qualitatively show that the edits are able to obtain results which outperform Audio-LDM, a recently released text-prompted audio generation model. Qualitative inspection of the results points out that the edits given by our approach remain more faithful to the input audio in terms of keeping the original onsets and offsets of the audio events.

2024-09-01

Interspeech 2024 (published)

Listenable Maps for Audio Classifiers

Francesco Paissan

2024-07-08

Proceedings of the 41st International Conference on Machine Learning (published)

openreview.net

Open-Source Conversational AI with SpeechBrain 1.0

Titouan Parcollet

Adel Moumen

Sylvain de Langen

Peter William VanHarn Plantinga

Yingzhi Wang

Davide Borra

Salah Zaiem

Zeyu Zhao

Shucong Zhang

Georgios Karakasidis

Sung-Lin Yeh

Pierre Champion

Aku Rouhe

Rudolf Braun … (see 11 more)

Florian Mai

Juan Pablo Zuluaga

Seyed Mahed Mousavi

Andreas Nautsch

Xuechen Liu

Sangeet Sagar

Jarod Duret

Salima Mdhaffar

G. Laperriere

Renato De Mori

Yannick Estève

SpeechBrain is an open-source Conversational AI toolkit based on PyTorch, focused particularly on speech processing tasks such as speech rec… (see more)ognition, speech enhancement, speaker recognition, text-to-speech, and much more. It promotes transparency and replicability by releasing both the pre-trained models and the complete"recipes"of code and algorithms required for training them. This paper presents SpeechBrain 1.0, a significant milestone in the evolution of the toolkit, which now has over 200 recipes for speech, audio, and language processing tasks, and more than 100 models available on Hugging Face. SpeechBrain 1.0 introduces new technologies to support diverse learning modalities, Large Language Model (LLM) integration, and advanced decoding strategies, along with novel models, tasks, and modalities. It also includes a new benchmark repository, offering researchers a unified platform for evaluating models across diverse tasks

2024-06-29

ArXiv (preprint)

Open-Source Conversational AI with SpeechBrain 1.0

Titouan Parcollet

Adel Moumen

Sylvain de Langen

Peter William VanHarn Plantinga

Yingzhi Wang

Davide Borra

Salah Zaiem

Zeyu Zhao

Shucong Zhang

Georgios Karakasidis

Sung-Lin Yeh

Pierre Champion

Aku Rouhe

Rudolf Braun … (see 11 more)

Florian Mai

Juan Pablo Zuluaga

Seyed Mahed Mousavi

Andreas Nautsch

Xuechen Liu

Sangeet Sagar

Jarod Duret

Salima Mdhaffar

G. Laperriere

Renato De Mori

Yannick Estève

SpeechBrain is an open-source Conversational AI toolkit based on PyTorch, focused particularly on speech processing tasks such as speech rec… (see more)ognition, speech enhancement, speaker recognition, text-to-speech, and much more. It promotes transparency and replicability by releasing both the pre-trained models and the complete"recipes"of code and algorithms required for training them. This paper presents SpeechBrain 1.0, a significant milestone in the evolution of the toolkit, which now has over 200 recipes for speech, audio, and language processing tasks, and more than 100 models available on Hugging Face. SpeechBrain 1.0 introduces new technologies to support diverse learning modalities, Large Language Model (LLM) integration, and advanced decoding strategies, along with novel models, tasks, and modalities. It also includes a new benchmark repository, offering researchers a unified platform for evaluating models across diverse tasks.

2024-06-29

ArXiv (preprint)