Kundan Kumar

Chunked Autoregressive GAN for Conditional Waveform Synthesis

Max Morrison

Prem Seetharaman

Conditional waveform synthesis models learn a distribution of audio waveforms given conditioning such as text, mel-spectrograms, or MIDI. These systems employ deep generative models that model the waveform via either sequential (autoregressive) or parallel (non-autoregressive) sampling. Generative adversarial networks (GANs) have become a common choice for non-autoregressive waveform synthesis. However, state-of-the-art GAN-based models produce artifacts when performing mel-spectrogram inversion. In this paper, we demonstrate that these artifacts correspond with an inability for the generator to learn accurate pitch and periodicity. We show that simple pitch and periodicity conditioning is insufficient for reducing this error relative to using autoregression. We discuss the inductive bias that autoregression provides for learning the relationship between instantaneous frequency and phase, and show that this inductive bias holds even when autoregressively sampling large chunks of the waveform during each forward pass. Relative to prior state-of-the-art GAN-based models, our proposed model, Chunked Autoregressive GAN (CARGAN), reduces pitch error by 40-60%, reduces training time by 58%, maintains a fast generation speed suitable for real-time or interactive applications, and maintains or improves subjective quality.

2022-04-24

International Conference on Learning Representations (Accept (Poster))

doi.org

openreview.net

NU-GAN: High resolution neural upsampling with GAN

In this paper, we propose NU-GAN, a new method for resampling audio from lower to higher sampling rates (upsampling). Audio upsampling is an important problem since productionizing generative speech technology requires operating at high sampling rates. Such applications use audio at a resolution of 44.1 kHz or 48 kHz, whereas current speech synthesis methods are equipped to handle a maximum of 24 kHz resolution. NU-GAN takes a leap towards solving audio upsampling as a separate component in the text-to-speech (TTS) pipeline by leveraging techniques for audio generation using GANs. ABX preference tests indicate that our NU-GAN resampler is capable of resampling 22 kHz to 44.1 kHz audio that is distinguishable from original audio only 7.4% higher than random chance for single speaker dataset, and 10.8% higher than chance for multi-speaker dataset.

2020-10-21

arXiv (preprint)

doi.org

arxiv.org

MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis

Kundan Kumar

Rithesh Kumar

Thibault de Boissiere

Lucas Gestin

Wei Zhen Teoh

Jose Sotelo

Alexandre De Brébisson

Yoshua Bengio

Aaron Courville

2018-12-31

Neural Information Processing Systems (published)

doi.org

arxiv.org

Probability Distillation: A Caveat and Alternatives

Alexandre Lacoste

Due to Van den Oord et al. (2018), probability distillation has recently been of interest to deep learning practitioners, where, as a practical workaround for deploying autoregressive models in real-time applications, a student network is used to obtain quality samples in parallel. We identify a pathological optimization issue with the adopted stochastic minimization of the reverse-KL divergence: the curse of dimensionality results in a skewed gradient distribution that renders training inefficient. This means that KL-based “evaluative” training can be susceptible to poor exploration if the target distribution is highly structured. We then explore alternative principles for distillation, including one with an “instructive” signal, and show that it is possible to achieve qualitatively better results than with KL minimization.

2018-12-31

Conference on Uncertainty in Artificial Intelligence (published)

proceedings.mlr.press

On Difficulties of Probability Distillation

Alexandre Lacoste

2018-09-26

(published)

openreview.net

ObamaNet: Photo-realistic lip-sync from text

Rithesh Kumar

Jose Sotelo

Kundan Kumar

Alexandre De Brébisson

Yoshua Bengio

We present ObamaNet, the first architecture that generates both audio and synchronized photo-realistic lip-sync videos from any new text. Contrary to other published lip-sync approaches, ours is only composed of fully trainable neural modules and does not rely on any traditional computer graphics methods. More precisely, we use three main modules: a text-to-speech network based on Char2Wav, a time-delayed LSTM to generate mouth-keypoints synced to the audio, and a network based on Pix2Pix to generate the video frames conditioned on the keypoints.

2017-12-05

ArXiv (preprint)

arxiv.org

Char2Wav: End-to-End Speech Synthesis

Jose Sotelo

2017-02-16

International Conference on Learning Representations (unknown)

openreview.net

SampleRNN: An Unconditional End-to-End Neural Audio Generation Model

Jose Sotelo

In this paper we propose a novel model for unconditional audio generation task that generates one audio sample at a time. We show that our model which profits from combining memory-less modules, namely autoregressive multilayer perceptron, and stateful recurrent neural networks in a hierarchical structure is de facto powerful to capture the underlying sources of variations in temporal domain for very long time on three datasets of different nature. Human evaluation on the generated samples indicate that our model is preferred over competing models. We also show how each component of the model contributes to the exhibited performance.

2017-02-05

ICLR.cc/2017/conference (poster)

openreview.net

PixelVAE: A Latent Variable Model for Natural Images

Natural image modeling is a landmark challenge of unsupervised learning. Variational Autoencoders (VAEs) learn a useful latent representation and model global structure well but have difficulty capturing small details. PixelCNN models details very well, but lacks a latent code and is difficult to scale for capturing large structures. We present PixelVAE, a VAE model with an autoregressive decoder based on PixelCNN. Our model requires very few expensive autoregressive layers compared to PixelCNN and learns latent codes that are more compressed than a standard VAE while still capturing most non-trivial structure. Finally, we extend our model to a hierarchy of latent variables at different scales. Our model achieves state-of-the-art performance on binarized MNIST, competitive performance on 64 × 64 ImageNet, and high-quality samples on the LSUN bedrooms dataset.

2016-12-31

ICLR (Poster) (published)

openreview.net
