Publications

Improved Training of Wasserstein GANs
Ishaan Gulrajani
Faruk Ahmed
Martin Arjovsky
Vincent Dumoulin
Generative Adversarial Networks (GANs) are powerful generative models, but suffer from training instability. The recently proposed Wasserstein GAN (WGAN) makes progress toward stable training of GANs, but sometimes can still generate only low-quality samples or fail to converge. We find that these problems are often due to the use of weight clipping in WGAN to enforce a Lipschitz constraint on the critic, which can lead to undesired behavior. We propose an alternative to clipping weights: penalize the norm of the gradient of the critic with respect to its input. Our proposed method performs better than standard WGAN and enables stable training of a wide variety of GAN architectures with almost no hyperparameter tuning, including 101-layer ResNets and language models over discrete data. We also achieve high quality generations on CIFAR-10 and LSUN bedrooms.
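As a rough illustration of penalizing the critic's input-gradient norm, the following minimal sketch computes such a penalty for an arbitrary scalar critic. Several things here are assumptions not stated in the abstract: the penalty is taken as the squared deviation of the gradient norm from a target of 1, the evaluation points are supplied by the caller, and central finite differences stand in for the automatic differentiation a real training loop would use. The names `input_gradient` and `gradient_penalty` are hypothetical.

```python
import math


def input_gradient(critic, x, eps=1e-5):
    """Approximate the gradient of a scalar critic with respect to its input x
    using central finite differences (a stand-in for autodiff)."""
    grad = []
    for i in range(len(x)):
        hi = list(x)
        lo = list(x)
        hi[i] += eps
        lo[i] -= eps
        grad.append((critic(hi) - critic(lo)) / (2 * eps))
    return grad


def gradient_penalty(critic, points, target_norm=1.0):
    """Mean squared deviation of the critic's input-gradient norm from
    target_norm over the given points (target_norm=1 is an assumption)."""
    total = 0.0
    for x in points:
        grad = input_gradient(critic, x)
        norm = math.sqrt(sum(g * g for g in grad))
        total += (norm - target_norm) ** 2
    return total / len(points)
```

In a GAN setup this term would be added, with some weighting coefficient, to the critic loss; a critic whose gradient norm is already at the target everywhere incurs no penalty.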
Modulating early visual processing by language
Harm de Vries
Florian Strub
Jérémie Mary
Olivier Pietquin
It is commonly assumed that language refers to high-level visual concepts while leaving low-level visual processing unaffected. This view dominates the current literature in computational models for language-vision tasks, where visual and linguistic input are mostly processed independently before being fused into a single representation. In this paper, we deviate from this classic pipeline and propose to modulate the entire visual processing by linguistic input. Specifically, we condition the batch normalization parameters of a pretrained residual network (ResNet) on a language embedding. This approach, which we call MOdulated RESnet, significantly improves strong baselines on two visual question answering tasks. Our ablation study shows that modulating from the early stages of the visual processing is beneficial.
Piecewise Latent Variables for Neural Variational Text Processing
Iulian V. Serban
Alexander G. Ororbia II
Advances in neural variational inference have facilitated the learning of powerful directed graphical models with continuous latent variables, such as variational autoencoders. The hope is that such models will learn to represent rich, multi-modal latent factors in real-world data, such as natural language text. However, current models often assume simplistic priors on the latent variables - such as the uni-modal Gaussian distribution - which are incapable of representing complex latent factors efficiently. To overcome this restriction, we propose the simple, but highly flexible, piecewise constant distribution. This distribution has the capacity to represent an exponential number of modes of a latent target distribution, while remaining mathematically tractable. Our results demonstrate that incorporating this new latent distribution into different models yields substantial improvements in natural language processing tasks such as document modeling and natural language generation for dialogue.
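To make the idea of a piecewise constant distribution concrete, here is a minimal sketch of its density and inverse-CDF sampler. The parameterization is an assumption for illustration only and is not taken from the abstract: the support is [0, 1], split into equal-width pieces, each with a positive unnormalized height. The function names are hypothetical.

```python
def piecewise_pdf(x, heights):
    """Density at x in [0, 1] for equal-width pieces with positive,
    unnormalized heights (assumed parameterization)."""
    n = len(heights)
    total = sum(heights)
    k = min(int(x * n), n - 1)  # index of the piece containing x
    return n * heights[k] / total


def piecewise_sample(u, heights):
    """Map a uniform draw u in [0, 1] to a sample via the inverse CDF."""
    n = len(heights)
    target = u * sum(heights)  # position in unnormalized cumulative mass
    cum = 0.0
    for k, a in enumerate(heights):
        if target <= cum + a or k == n - 1:
            # Fraction of the way through piece k, clamped against rounding.
            frac = min((target - cum) / a, 1.0)
            return (k + frac) / n
        cum += a
```

Because the sample is a deterministic function of a uniform draw, this form lends itself to reparameterized sampling, and each piece can carry its own mode, which is how many pieces per latent dimension yield many modes overall.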
PixelVAE: A Latent Variable Model for Natural Images
Ishaan Gulrajani
Kundan Kumar
Faruk Ahmed
Adrien Ali Taiga
Francesco Visin
David Vazquez
Natural image modeling is a landmark challenge of unsupervised learning. Variational Autoencoders (VAEs) learn a useful latent representation and model global structure well but have difficulty capturing small details. PixelCNN models details very well, but lacks a latent code and is difficult to scale for capturing large structures. We present PixelVAE, a VAE model with an autoregressive decoder based on PixelCNN. Our model requires very few expensive autoregressive layers compared to PixelCNN and learns latent codes that are more compressed than a standard VAE while still capturing most non-trivial structure. Finally, we extend our model to a hierarchy of latent variables at different scales. Our model achieves state-of-the-art performance on binarized MNIST, competitive performance on 64 × 64 ImageNet, and high-quality samples on the LSUN bedrooms dataset.
Recurrent Batch Normalization
Tim Cooijmans
Nicolas Ballas
César Laurent
Caglar Gulcehre
We propose a reparameterization of LSTM that brings the benefits of batch normalization to recurrent neural networks. Whereas previous works only apply batch normalization to the input-to-hidden transformation of RNNs, we demonstrate that it is both possible and beneficial to batch-normalize the hidden-to-hidden transition, thereby reducing internal covariate shift between time steps. We evaluate our proposal on various sequential problems such as sequence classification, language modeling and question answering. Our empirical results show that our batch-normalized LSTM consistently leads to faster convergence and improved generalization.
SampleRNN: An Unconditional End-to-End Neural Audio Generation Model
Soroush Mehri
Kundan Kumar
Ishaan Gulrajani
Rithesh Kumar
Shubham Jain
Jose Sotelo
In this paper we propose a novel model for unconditional audio generation that generates one audio sample at a time. We show that our model, which combines memory-less modules, namely autoregressive multilayer perceptrons, with stateful recurrent neural networks in a hierarchical structure, is able to capture the underlying sources of variation in the temporal domain over very long time spans, on three datasets of different nature. Human evaluation of the generated samples indicates that our model is preferred over competing models. We also show how each component of the model contributes to the exhibited performance.
Sequentialized Sampling Importance Resampling and Scalable IWAE
Chin-Wei Huang
We propose a new sequential algorithm for Sampling Importance Resampling. The algorithm serves as a solution to the expensive evaluation of importance weights, and can be interpreted as stochastically and iteratively refining the particles by correcting them towards the target distribution as the pool size increases. We apply this algorithm to variational inference with the Importance Weighted Lower Bound and propose a memory-scalable training procedure that implicitly improves the variational proposal.

1 Sequentializing Sampling Importance Resampling

1.1 Sampling Importance Resampling

Given an unnormalized target distribution $\tilde{p}(x)$ and a proposal distribution $q(x)$, Sampling Importance Resampling (SIR) proceeds as follows:

1. draw $x_i$ for $1 \le i \le n$ from $q(x)$
2. calculate the importance weight $w_i = \tilde{p}(x_i) / q(x_i)$
3. calculate the normalized importance weight $\bar{w}_i = w_i / \sum_i w_i$
4. draw index variable $y_j \sim \mathrm{mul}(\bar{w}_1, \ldots, \bar{w}_n)$ for $1 \le j \le m$

The density of the set of resampled particles $x_{y_1}, \ldots, x_{y_m}$ should resemble the pdf of the target distribution, and the new samples will be approximately distributed according to $p(x)$ (Bishop, 2007). On average, the samples improve as the pool size $n$ increases, and become correct as $n \to \infty$. The procedure is visualized in Figure 1a.

1.2 SeqSIR

The above procedure can be combined with the idea of reservoir sampling, so that we need not evaluate all $n$ samples at the same time, which is an issue when $n$ is large or when evaluation of a sample (i.e. computation of $w_i$) is expensive. The intuition is to keep a running sum of the importance weights while we evaluate the pool samples sequentially, and then decide whether to keep the old sample or replace it with the new one based on the ratio of the new sample's importance weight to the running sum. This is what we call Sequentialized Sampling Importance Resampling (SEQSIR), which is summarized in Algorithm 1. See Figure 1b for illustration.
Note that the density and importance weight are computed on a log scale to deal with numerical instability, and the log-sum-exp operation (LSE) is used in place of addition to calculate the running sum of the importance weights. See https://github.com/CW-Huang/SeqIWAE for an implementation.

Second workshop on Bayesian Deep Learning (NIPS 2017), Long Beach, CA, USA.

Algorithm 1 Sequentialized Sampling Importance Resampling and Stochastic Iterative Refinement

procedure SEQSIR(logp, logq, ss)
    ▷ logp, logq: unnormalized target density function and proposal density function
    ▷ ss: n samples to be evaluated
    A ← −∞    ▷ initialize accumulated sum of importance weight on log scale
    s_old ← 0    ▷ initialize sample
    n ← len(ss)
    for i = 1, ..., n do
        s_new ← ss[i]
        A, s_old ← STOCHREFINE(logp, logq, A, s_old, s_new)
    return s_old

procedure STOCHREFINE(logp, logq, A, s_old, s_new)
    ▷ logp, logq: unnormalized target density function and proposal density function
    ▷ A: accumulated sum of importance weight on log scale
    ▷ s_old, s_new: old and new samples
    w_new ← logp(s_new) − logq(s_new)
    A ← LSE(A, w_new)
    u ← unif(0, 1)
    if w_new − A >= log u then
        return A, s_new
    else
        return A, s_old
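Algorithm 1 can be sketched in plain Python as follows. This is an illustrative transcription using only the standard library, not the authors' implementation (which is at the repository linked above). The function names, the `rng` parameter, and the choice to draw `u` from (0, 1] so that `log u` is always defined are assumptions made for this sketch.

```python
import math
import random


def log_sum_exp(a, b):
    """Numerically stable log(exp(a) + exp(b)), allowing -inf arguments."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def stoch_refine(logp, logq, acc, s_old, s_new, rng=random):
    """Fold s_new into the running log-sum of importance weights and keep it
    with probability (its weight) / (running sum including it)."""
    w_new = logp(s_new) - logq(s_new)  # log importance weight
    acc = log_sum_exp(acc, w_new)      # running sum on log scale
    u = 1.0 - rng.random()             # uniform on (0, 1], so log(u) is finite
    if w_new - acc >= math.log(u):
        return acc, s_new
    return acc, s_old


def seq_sir(logp, logq, samples, rng=random):
    """Process proposal samples one at a time and return one resampled particle."""
    acc = -math.inf  # log of the accumulated importance weight
    s_old = None     # no particle selected yet
    for s_new in samples:
        acc, s_old = stoch_refine(logp, logq, acc, s_old, s_new, rng)
    return s_old
```

The first sample is always accepted because its weight equals the running sum. After all n samples, sample i survives with probability w_i divided by the total weight, which matches step 4 of SIR for m = 1, while only one particle and one scalar are held in memory at any time.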
Z-Forcing: Training Stochastic Recurrent Networks
Anirudh Goyal
Marc-Alexandre Côté
Nan Rosemary Ke
Many efforts have been devoted to training generative latent variable models with autoregressive decoders, such as recurrent neural networks (RNN). Stochastic recurrent models have been successful in capturing the variability observed in natural sequential data such as speech. We unify successful ideas from recently proposed architectures into a stochastic recurrent model: each step in the sequence is associated with a latent variable that is used to condition the recurrent dynamics for future steps. Training is performed with amortized variational inference where the approximate posterior is augmented with an RNN that runs backward through the sequence. In addition to maximizing the variational lower bound, we ease training of the latent variables by adding an auxiliary cost which forces them to reconstruct the state of the backward recurrent network. This provides the latent variables with a task-independent objective that enhances the performance of the overall model. We found this strategy to perform better than alternative approaches such as KL annealing. Although conceptually simple, our model achieves state-of-the-art results on standard speech benchmarks such as TIMIT and Blizzard and competitive performance on sequential MNIST. Finally, we apply our model to language modeling on the IMDB dataset where the auxiliary cost helps in learning interpretable latent variables.
A Benchmark for Endoluminal Scene Segmentation of Colonoscopy Images
David Vazquez
Jorge Bernal
F. Javier Sánchez
Gloria Fernández-Esparrach
Antonio M. López
Michal Drozdzal
Colorectal cancer (CRC) is the third cause of cancer death worldwide. Currently, the standard approach to reduce CRC-related mortality is to perform regular screening in search for polyps and colonoscopy is the screening tool of choice. The main limitations of this screening procedure are polyp miss rate and the inability to perform visual assessment of polyp malignancy. These drawbacks can be reduced by designing decision support systems (DSS) aiming to help clinicians in the different stages of the procedure by providing endoluminal scene segmentation. Thus, in this paper, we introduce an extended benchmark of colonoscopy image segmentation, with the hope of establishing a new strong benchmark for colonoscopy image analysis research. The proposed dataset consists of 4 relevant classes to inspect the endoluminal scene, targeting different clinical needs. Together with the dataset and taking advantage of advances in semantic segmentation literature, we provide new baselines by training standard fully convolutional networks (FCNs). We perform a comparative study to show that FCNs significantly outperform, without any further postprocessing, prior results in endoluminal scene segmentation, especially with respect to polyp segmentation and localization.
Professor Forcing: A New Algorithm for Training Recurrent Networks
Anirudh Goyal
Alex Lamb
Ying Zhang
Saizheng Zhang
The Teacher Forcing algorithm trains recurrent networks by supplying observed sequence values as inputs during training and using the network’s own one-step-ahead predictions to do multi-step sampling. We introduce the Professor Forcing algorithm, which uses adversarial domain adaptation to encourage the dynamics of the recurrent network to be the same when training the network and when sampling from the network over multiple time steps. We apply Professor Forcing to language modeling, vocal synthesis on raw waveforms, handwriting generation, and image generation. Empirically we find that Professor Forcing acts as a regularizer, improving test likelihood on character level Penn Treebank and sequential MNIST. We also find that the model qualitatively improves samples, especially when sampling for a large number of time steps. This is supported by human evaluation of sample quality. Trade-offs between Professor Forcing and Scheduled Sampling are discussed. We produce T-SNEs showing that Professor Forcing successfully makes the dynamics of the network during training and sampling more similar.
Towards End-to-End Speech Recognition with Deep Convolutional Neural Networks
Ying Zhang
Philemon Brakel
Saizheng Zhang
César Laurent
Convolutional Neural Networks (CNNs) are effective models for reducing spectral variations and modeling spectral correlations in acoustic features for automatic speech recognition (ASR). Hybrid speech recognition systems incorporating CNNs with Hidden Markov Models/Gaussian Mixture Models (HMMs/GMMs) have achieved the state-of-the-art in various benchmarks. Meanwhile, Connectionist Temporal Classification (CTC) with Recurrent Neural Networks (RNNs), which is proposed for labeling unsegmented sequences, makes it feasible to train an end-to-end speech recognition system instead of hybrid settings. However, RNNs are computationally expensive and sometimes difficult to train. In this paper, inspired by the advantages of both CNNs and the CTC approach, we propose an end-to-end speech framework for sequence labeling, by combining hierarchical CNNs with CTC directly without recurrent connections. By evaluating the approach on the TIMIT phoneme recognition task, we show that the proposed model is not only computationally efficient, but also competitive with the existing baseline systems. Moreover, we argue that CNNs have the capability to model temporal correlations with appropriate context information.
Generating Factoid Questions With Recurrent Neural Networks: The 30M Factoid Question-Answer Corpus
Iulian V. Serban
Alberto García-Durán
Caglar Gulcehre
Sungjin Ahn
Over the past decade, large-scale supervised learning corpora have enabled machine learning researchers to make substantial advances. However, to date, there are no large-scale question-answer corpora available. In this paper we present the 30M Factoid Question-Answer Corpus, an enormous question-answer pair corpus produced by applying a novel neural network architecture on the knowledge base Freebase to transduce facts into natural language questions. The produced question-answer pairs are evaluated both by human evaluators and using automatic evaluation metrics, including well-established machine translation and sentence similarity metrics. Across all evaluation criteria the question-generation model outperforms the competing template-based baseline. Furthermore, when presented to human evaluators, the generated questions appear comparable in quality to real human-generated questions.