Mirco Ravanelli

Dehestani Amirali

Collaborating researcher - Concordia University

Seina Assadian

Collaborating researcher - Concordia University

Cordelle Briac

Collaborating researcher - Concordia University

Gallegati Caterina

Research Intern - Concordia University

Victor Cruz

Master's Research - Concordia University

Luca Della Libera

PhD - Concordia University

Co-supervisor :

Wagner Drew

Master's Research - Concordia University

Co-supervisor :

Irina Rish

Gianfranco Dumoulin Bertucci

Master's Research - Concordia University

nadine.el-mufti@mila.quebec

Nadine El-Mufti

Master's Research - Concordia University

Maab Elrashid Ahmed Mohamed

PhD - Concordia University

Co-supervisor :

Bonzi Francesco

PhD - Concordia University

Alessio Giuseppe Alessio

Collaborating researcher - International School for Advanced Studies (Trieste, Italy)

Salman Sami Hussain Ali

Collaborating researcher - Concordia University

Eleonora Mancini

Collaborating Alumni - Université de Montréal

Principal supervisor :

PhD - Université de Montréal

Co-supervisor :

PhD - Concordia University

PhD - Concordia University

Co-supervisor :

Peter Peter

Postdoctorate - McGill University

PhD - Université de Montréal

Blog Posts

June 13, 2024

SpeechBrain 1.0: Making Conversational AI Accessible to Everyone

Mirco Ravanelli

April 28, 2021

Introducing SpeechBrain: A General-Purpose PyTorch Speech Processing Toolkit

Mirco Ravanelli

Loren Lugosch

Publications

Listenable Maps for Zero-Shot Audio Classifiers

Francesco Paissan

Luca Della Libera

Interpreting the decisions of deep learning models, including audio classifiers, is crucial for ensuring the transparency and trustworthiness of this technology. In this paper, we introduce LMAC-ZS (Listenable Maps for Audio Classifiers in the Zero-Shot context), which, to the best of our knowledge, is the first decoder-based post-hoc interpretation method for explaining the decisions of zero-shot audio classifiers. The proposed method utilizes a novel loss function that maximizes the faithfulness to the original similarity between a given text-and-audio pair. We provide an extensive evaluation using the Contrastive Language-Audio Pretraining (CLAP) model to showcase that our interpreter remains faithful to the decisions in a zero-shot classification context. Moreover, we qualitatively show that our method produces meaningful explanations that correlate well with different text prompts.

2024-09-25

NeurIPS 2024 (poster)

openreview.net

What Are They Doing? Joint Audio-Speech Co-Reasoning

Yingzhi Wang

Pooneh Mousavi

Artem Ploujnikov

In audio and speech processing, tasks usually focus on either the audio or speech modality, even when both sounds and human speech are present in the same audio clip. Recent Auditory Large Language Models (ALLMs) have made it possible to process audio and speech simultaneously within a single model, leading to further considerations of joint audio-speech tasks. In this paper, we establish a novel benchmark to investigate how well ALLMs can perform joint audio-speech processing. Specifically, we introduce Joint Audio-Speech Co-Reasoning (JASCO), a novel task that unifies audio and speech processing, strictly requiring co-reasoning across both modalities. We also release a scene-reasoning dataset called "What Are They Doing". Additionally, we provide deeper insights into the models' behaviors by analyzing their dependence on each modality.

2024-09-22

ArXiv (preprint)

Dynamic HumTrans: Humming Transcription Using CNNs and Dynamic Programming

Shubham Gupta

Isaac Neri Gomez-Sarmiento

Faez Amjed Mezdari

2024-09-19

Lecture Notes in Computer Science (published)

Explaining Network Decision Provides Insights on the Causal Interaction Between Brain Regions in a Motor Imagery Task

Davide Borra

2024-09-19

Lecture Notes in Computer Science (published)

Multi-modal Decoding of Reach-to-Grasping from EEG and EMG via Neural Networks

Davide Borra

Matteo Fraternali

Elisa Magosso

2024-09-19

Lecture Notes in Computer Science (published)

LMAC-TD: Producing Time Domain Explanations for Audio Classifiers

Eleonora Mancini

Francesco Paissan

2024-09-13

ArXiv (preprint)

Audio Editing with Non-Rigid Text Prompts

Francesco Paissan

Zhepei Wang

Paris Smaragdis

In this paper, we explore audio-editing with non-rigid text edits. We show that the proposed editing pipeline is able to create audio edits that remain faithful to the input audio. We explore text prompts that perform addition, style transfer, and in-painting. We quantitatively and qualitatively show that the edits are able to obtain results which outperform Audio-LDM, a recently released text-prompted audio generation model. Qualitative inspection of the results points out that the edits given by our approach remain more faithful to the input audio in terms of keeping the original onsets and offsets of the audio events.

2024-09-01

Interspeech 2024 (published)

Progres: Prompted Generative Rescoring on ASR N-Best

Ada Defne Tur

Adel Moumen

Large Language Models (LLMs) have shown their ability to improve the performance of speech recognizers by effectively rescoring the n-best hypotheses generated during the beam search process. However, the best way to exploit recent generative instruction-tuned LLMs for hypothesis rescoring is still unclear. This paper proposes a novel method that uses instruction-tuned LLMs to dynamically expand the n-best speech recognition hypotheses with new hypotheses generated through appropriately-prompted LLMs. Specifically, we introduce a new zero-shot method for ASR n-best rescoring, which combines confidence scores, LLM sequence scoring, and prompt-based hypothesis generation. We compare Llama-3-Instruct, GPT-3.5 Turbo, and GPT-4 Turbo as prompt-based generators with Llama-3 as sequence scorer LLM. We evaluated our approach using different speech recognizers and observed significant relative improvement in the word error rate (WER) ranging from 5% to 25%.

2024-08-30

ArXiv (preprint)

Listenable Maps for Audio Classifiers

Francesco Paissan

2024-07-08

Proceedings of the 41st International Conference on Machine Learning (published)

openreview.net

Open-Source Conversational AI with SpeechBrain 1.0

Titouan Parcollet

Adel Moumen

Sylvain de Langen

Peter William VanHarn Plantinga

Yingzhi Wang

Pooneh Mousavi

Luca Della Libera

Artem Ploujnikov

Francesco Paissan

Davide Borra

Salah Zaiem

Zeyu Zhao

Shucong Zhang

Georgios Karakasidis

Sung-Lin Yeh

Pierre Champion

Aku Rouhe

Rudolf Braun

Florian Mai

Juan Pablo Zuluaga

Seyed Mahed Mousavi

Andreas Nautsch

Xuechen Liu

Sangeet Sagar

Jarod Duret

Salima Mdhaffar

G. Laperriere

Renato De Mori

Yannick Estève

SpeechBrain is an open-source Conversational AI toolkit based on PyTorch, focused particularly on speech processing tasks such as speech recognition, speech enhancement, speaker recognition, text-to-speech, and much more. It promotes transparency and replicability by releasing both the pre-trained models and the complete "recipes" of code and algorithms required for training them. This paper presents SpeechBrain 1.0, a significant milestone in the evolution of the toolkit, which now has over 200 recipes for speech, audio, and language processing tasks, and more than 100 models available on Hugging Face. SpeechBrain 1.0 introduces new technologies to support diverse learning modalities, Large Language Model (LLM) integration, and advanced decoding strategies, along with novel models, tasks, and modalities. It also includes a new benchmark repository, offering researchers a unified platform for evaluating models across diverse tasks.

2024-06-29

ArXiv (preprint)