Yusong Wu

Christos Tsirigotis

Ke Chen

Oriol Nieto

Prem Seetharaman

Justin Salamon

Recent multi-modal audio-language models (ALMs) excel at text-audio retrieval but struggle with frame-wise audio understanding. Prior works use temporal-aware labels or unsupervised training to improve frame-wise capabilities, but they still lack fine-grained labeling capability to pinpoint when an event occurs. While traditional sound event detection models can precisely localize events, they are limited to pre-defined categories, making them ineffective for real-world scenarios with out-of-distribution events. In this work, we introduce FLAM, an open-vocabulary contrastive audio-language model capable of localizing specific sound events. FLAM employs a memory-efficient and calibrated frame-wise objective with logit adjustment to address spurious correlations, such as event dependencies and label imbalances during training. To enable frame-wise supervision, we leverage a large-scale dataset with diverse audio events, LLM-generated captions and simulation. Experimental results and case studies demonstrate that FLAM significantly improves the open-vocabulary localization capability while maintaining strong performance in global retrieval and downstream tasks.

2025-10-06

Proceedings of the 42nd International Conference on Machine Learning (published)

proceedings.mlr.press

FLAM: Frame-Wise Language-Audio Modeling

Christos Tsirigotis

Ke Chen

Oriol Nieto

Prem Seetharaman

Justin Salamon

2025-05-01

ICML.cc/2025/Conference (poster)

proceedings.mlr.press

openreview.net

Adaptive Accompaniment with ReaLchords

Tim Cooijmans

Kyle Kastner

Adam Roberts

Ian Simon

Alexander Scarlatos

Chris Donahue

Cassie Tarakajian

Shayegan Omidshafiei

Pablo Samuel Castro

Natasha Jaques

Jamming requires coordination, anticipation, and collaborative creativity between musicians. Current generative models of music produce expressive output but are not able to generate in an online manner, meaning simultaneously with other musicians (human or otherwise). We propose ReaLchords, an online generative model for improvising chord accompaniment to user melody. We start with an online model pretrained by maximum likelihood, and use reinforcement learning to finetune the model for online use. The finetuning objective leverages both a novel reward model that provides feedback on both harmonic and temporal coherency between melody and chord, and a divergence term that implements a novel type of distillation from a teacher model that can see the future melody. Through quantitative experiments and listening tests, we demonstrate that the resulting model adapts well to unfamiliar input and produces fitting accompaniment. ReaLchords opens the door to live jamming, as well as simultaneous co-creation in other modalities.

2024-07-08

Proceedings of the 41st International Conference on Machine Learning (published)

proceedings.mlr.press

openreview.net

MIDI-DDSP: Detailed Control of Musical Performance via Hierarchical Modeling

Ethan Manilow

Yi Deng

Rigel Swavely

Kyle Kastner

Tim Cooijmans