Anna (Cheng-Zhi) Huang

Christos Tsirigotis

Ke Chen

Oriol Nieto

Prem Seetharaman

Justin Salamon

Recent multi-modal audio-language models (ALMs) excel at text-audio retrieval but struggle with frame-wise audio understanding. Prior works … (see more)use temporal-aware labels or unsupervised training to improve frame-wise capabilities, but they still lack fine-grained labeling capability to pinpoint when an event occurs. While traditional sound event detection models can precisely localize events, they are limited to pre-defined categories, making them ineffective for real-world scenarios with out-of-distribution events. In this work, we introduce FLAM, an open-vocabulary contrastive audio-language model capable of localizing specific sound events. FLAM employs a memory-efficient and calibrated frame-wise objective with logit adjustment to address spurious correlations, such as event dependencies and label imbalances during training. To enable frame-wise supervision, we leverage a large-scale dataset with diverse audio events, LLM-generated captions and simulation. Experimental results and case studies demonstrate that FLAM significantly improves the open-vocabulary localization capability while maintaining strong performance in global retrieval and downstream tasks.

2025-10-06

Proceedings of the 42nd International Conference on Machine Learning (published)

proceedings.mlr.press

FLAM: Frame-Wise Language-Audio Modeling

Christos Tsirigotis

Ke Chen

Oriol Nieto

Prem Seetharaman

Justin Salamon

2025-05-01

ICML.cc/2025/Conference (poster)

proceedings.mlr.press

openreview.net

Adaptive Accompaniment with ReaLchords

Tim Cooijmans

Kyle Kastner

Adam Roberts

Ian Simon

Alexander Scarlatos

Chris Donahue

Cassie Tarakajian

Shayegan Omidshafiei

Pablo Samuel Castro

Natasha Jaques

Jamming requires coordination, anticipation, and collaborative creativity between musicians. Current generative models of music produce expr… (see more)essive output but are not able to generate in an online manner, meaning simultaneously with other musicians (human or otherwise). We propose ReaLchords, an online generative model for improvising chord accompaniment to user melody. We start with an online model pretrained by maximum likelihood, and use reinforcement learning to finetune the model for online use. The finetuning objective leverages both a novel reward model that provides feedback on both harmonic and temporal coherency between melody and chord, and a divergence term that implements a novel type of distillation from a teacher model that can see the future melody. Through quantitative experiments and listening tests, we demonstrate that the resulting model adapts well to unfamiliar input and produce fitting accompaniment. ReaLchords opens the door to live jamming, as well as simultaneous co-creation in other modalities.

2024-07-08

Proceedings of the 41st International Conference on Machine Learning (published)

proceedings.mlr.press

openreview.net

Grammar Generative Models for Music Notation

Deep generative models have been successfully applied in many learning experiments with digital data, such as images or audio. In the field … (see more)of music, they can also be used to generate symbolic representations, in the context of problems such as automatic music generation or transcription [1-3]. A significant challenge for generating structured symbolic data in general is obtaining well-formed results. This is especially true in the case of music. It is indeed widely accepted that musical notation represents, well beyond simple sequences of notes, a hierarchical organization of melodic and harmonic information, inducing non-local dependencies between musical objects [4]. A good representation of this information is essential for the interpretation and analysis of music pieces.

Improving Source Separation by Explicitly Modeling Dependencies between Sources

Ethan Manilow

Curtis Hawthorne

Bryan Pardo

Jesse Engel

We propose a new method for training a supervised source separation system that aims to learn the interdependent relationships between all c… (see more)ombinations of sources in a mixture. Rather than independently estimating each source from a mix, we reframe the source separation problem as an Orderless Neural Autoregressive Density Estimator (NADE), and estimate each source from both the mix and a random subset of the other sources. We adapt a standard source separation architecture, Demucs, with additional inputs for each individual source, in addition to the input mixture. We randomly mask these input sources during training so that the network learns the conditional dependencies between the sources. By pairing this training method with a blocked Gibbs sampling procedure at inference time, we demonstrate that the network can iteratively improve its separation performance by conditioning a source estimate on its earlier source estimates. Experiments on two source separation datasets show that training a Demucs model with an Orderless NADE approach and using Gibbs sampling (up to 512 steps) at inference time strongly outperforms a Demucs baseline that uses a standard regression loss and direct (one step) estimation of sources.

2022-05-23

ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (published)

doi.org

arxiv.org

Improving Source Separation by Explicitly Modeling Dependencies between Sources

Ethan Manilow

Curtis Hawthorne

Bryan A. Pardo

Jesse Engel

We propose a new method for training a supervised source separation system that aims to learn the interdependent relationships between all c… (see more)ombinations of sources in a mixture. Rather than independently estimating each source from a mix, we reframe the source separation problem as an Orderless Neural Autoregressive Density Estimator (NADE), and estimate each source from both the mix and a random subset of the other sources. We adapt a standard source separation architecture, Demucs, with additional inputs for each individual source, in addition to the input mixture. We randomly mask these input sources during training so that the network learns the conditional dependencies between the sources. By pairing this training method with a blocked Gibbs sampling procedure at inference time, we demonstrate that the network can iteratively improve its separation performance by conditioning a source estimate on its earlier source estimates. Experiments on two source separation datasets show that training a Demucs model with an Orderless NADE approach and using Gibbs sampling (up to 512 steps) at inference time strongly outperforms a Demucs baseline that uses a standard regression loss and direct (one step) estimation of sources.

2022-03-28

ArXiv (preprint)

doi.org

arxiv.org

MIDI-DDSP: Detailed Control of Musical Performance via Hierarchical Modeling

Ethan Manilow

Yi Deng

Rigel Swavely

Kyle Kastner

Tim Cooijmans