Publications

Sequentialized Sampling Importance Resampling and Scalable IWAE
We propose a new sequential algorithm for Sampling Importance Resampling. The algorithm serves as a solution to the expensive evaluation of importance weights, and can be interpreted as stochastically and iteratively refining the particles by correcting them towards the target distribution as the pool size increases. We apply this algorithm to variational inference with the Importance Weighted Lower Bound and propose a memory-scalable training procedure that implicitly improves the variational proposal. (Presented at the Second workshop on Bayesian Deep Learning, NIPS 2017, Long Beach, CA, USA. See https://github.com/CW-Huang/SeqIWAE for an implementation.)

1 Sequentializing Sampling Importance Resampling

1.1 Sampling Importance Resampling

Given an unnormalized target distribution p̃(x) and a proposal distribution q(x), Sampling Importance Resampling (SIR) proceeds as follows:

1. Draw x_i from q(x) for 1 ≤ i ≤ n.
2. Calculate the importance weights w_i = p̃(x_i) / q(x_i).
3. Calculate the normalized importance weights w̄_i = w_i / Σ_j w_j.
4. Draw index variables y_j ~ Mul(w̄_1, ..., w̄_n) for 1 ≤ j ≤ m.

The density of the set of resampled particles x_{y_1}, ..., x_{y_m} should resemble the pdf of the target distribution, and the new samples will be approximately distributed according to p(x) (Bishop, 2007). On average, the samples improve as the pool size n increases, and become exact as n → ∞. The procedure is visualized in Figure 1a.

1.2 SeqSIR

The above procedure can be combined with the idea of reservoir sampling, so that we need not evaluate all n samples at the same time, which is an issue when n is large or when evaluating a sample (i.e., computing w_i) is expensive. The intuition is to keep a running sum of the importance weights while evaluating the pool samples sequentially, and then to decide whether to keep the old sample or replace it with the new one based on the ratio of the new sample's importance weight to the running sum. We call this Sequentialized Sampling Importance Resampling (SeqSIR), summarized in Algorithm 1; see Figure 1b for an illustration. Note that densities and importance weights are computed on the log scale to deal with numerical instability, and the log-sum-exp operation (LSE) is used in place of addition to maintain the running sum of importance weights.

Algorithm 1 Sequentialized Sampling Importance Resampling and Stochastic Iterative Refinement

procedure SEQSIR(logp, logq, ss)
    # logp, logq: unnormalized target and proposal log-density functions
    # ss: the n samples to be evaluated
    A ← -∞                      # accumulated sum of importance weights, log scale
    s_old ← 0                   # initialize sample
    n ← len(ss)
    for i = 1, ..., n do
        s_new ← ss[i]
        A, s_old ← STOCHREFINE(logp, logq, A, s_old, s_new)
    return s_old

procedure STOCHREFINE(logp, logq, A, s_old, s_new)
    # A: accumulated sum of importance weights, log scale
    # s_old, s_new: old and new samples
    w_new ← logp(s_new) - logq(s_new)
    A ← LSE(A, w_new)
    u ← Unif(0, 1)
    if w_new - A ≥ log u then
        return A, s_new
    else
        return A, s_old
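As a concrete illustration, here is a minimal NumPy sketch of SeqSIR following Algorithm 1. The two functions mirror the pseudocode; the Gaussian target and proposal in the usage example are illustrative assumptions, not part of the paper.

```python
import numpy as np

def stoch_refine(logp, logq, A, s_old, s_new, rng):
    """One refinement step: keep the running sample or replace it."""
    w_new = logp(s_new) - logq(s_new)   # log importance weight of the new sample
    A = np.logaddexp(A, w_new)          # LSE running sum of weights, on log scale
    # Replace with probability w_new / (running sum), computed on the log scale.
    if w_new - A >= np.log(rng.uniform()):
        return A, s_new
    return A, s_old

def seq_sir(logp, logq, samples, rng):
    """Sequentialized SIR: one pass over the pool, O(1) memory in pool size."""
    A, s_old = -np.inf, samples[0]
    for s_new in samples:
        A, s_old = stoch_refine(logp, logq, A, s_old, s_new, rng)
    return s_old

# Illustrative setup (assumed): standard-normal target, N(0, 9) proposal.
rng = np.random.default_rng(0)
logp = lambda x: -0.5 * x**2                        # unnormalized N(0, 1)
logq = lambda x: -0.5 * (x / 3.0) ** 2 - np.log(3.0)
draws = [seq_sir(logp, logq, rng.normal(0.0, 3.0, size=1000), rng)
         for _ in range(200)]                        # each approximately from p
```

Because each pool sample is touched once and then discarded, memory stays constant as the pool size n grows, which is the point of sequentializing the resampling step.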
Z-Forcing: Training Stochastic Recurrent Networks
Many efforts have been devoted to training generative latent variable models with autoregressive decoders, such as recurrent neural networks (RNNs). Stochastic recurrent models have been successful in capturing the variability observed in natural sequential data such as speech. We unify successful ideas from recently proposed architectures into a stochastic recurrent model: each step in the sequence is associated with a latent variable that is used to condition the recurrent dynamics for future steps. Training is performed with amortized variational inference, where the approximate posterior is augmented with an RNN that runs backward through the sequence. In addition to maximizing the variational lower bound, we ease training of the latent variables by adding an auxiliary cost which forces them to reconstruct the state of the backward recurrent network. This provides the latent variables with a task-independent objective that enhances the performance of the overall model. We found this strategy to perform better than alternative approaches such as KL annealing. Although conceptually simple, our model achieves state-of-the-art results on standard speech benchmarks such as TIMIT and Blizzard, and competitive performance on sequential MNIST. Finally, we apply our model to language modeling on the IMDB dataset, where the auxiliary cost helps in learning interpretable latent variables. Source code: https://github.com/anirudh9119/zforcing_nips17
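To make the auxiliary cost concrete, here is a minimal PyTorch sketch of the idea: a small network maps each latent sample to a prediction of the corresponding backward-RNN state. The shapes and the squared-error form are assumptions for illustration; the paper parameterizes this as a likelihood term.

```python
import torch
import torch.nn as nn

class AuxiliaryCost(nn.Module):
    """Force latents z_t to reconstruct the backward-RNN state b_t (sketch)."""
    def __init__(self, z_dim, b_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim, hidden), nn.Tanh(), nn.Linear(hidden, b_dim)
        )

    def forward(self, z, b):
        # z: (T, batch, z_dim) latent samples; b: (T, batch, b_dim) backward states.
        # Gradients are stopped through b so the cost shapes the latents only.
        return ((self.net(z) - b.detach()) ** 2).mean()
```

Added to the variational lower bound, this gives each latent a task-independent target: summarize the future of the sequence as seen by the backward network.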
A Matrix Splitting Perspective on Planning with Options
Smart Classifier Selection for Activity Recognition on Wearable Devices
Negar Ghourchian
Activity recognition is a key component of human-machine interaction applications. Information obtained from sensors in smart wearable devices is especially valuable, because these devices have become ubiquitous, and they record large amounts of data. Machine learning algorithms can then be used to process this data. However, wearable devices impose restrictions in terms of computation and energy resources, which need to be taken into account by a learning algorithm. We propose to use a real-time learning approach, which interactively determines the most effective set of modalities (or features) for classification, given the task at hand. Our algorithm optimizes sensor selection, in order to consume less power, while still maintaining good accuracy in classifying sequences of activities. Performance on a large, noisy dataset including four different sensing modalities shows that this is a promising approach.
Deep Learning
Deep learning is a form of machine learning that enables computers to learn from experience and understand the world in terms of a hierarchy of concepts. Because the computer gathers knowledge from experience, there is no need for a human computer operator to formally specify all the knowledge that the computer needs. The hierarchy of concepts allows the computer to learn complicated concepts by building them out of simpler ones; a graph of these hierarchies would be many layers deep. This book introduces a broad range of topics in deep learning. The text offers mathematical and conceptual background, covering relevant concepts in linear algebra, probability theory and information theory, numerical computation, and machine learning. It describes deep learning techniques used by practitioners in industry, including deep feedforward networks, regularization, optimization algorithms, convolutional networks, sequence modeling, and practical methodology; and it surveys such applications as natural language processing, speech recognition, computer vision, online recommendation systems, bioinformatics, and videogames. Finally, the book offers research perspectives, covering such theoretical topics as linear factor models, autoencoders, representation learning, structured probabilistic models, Monte Carlo methods, the partition function, approximate inference, and deep generative models. Deep Learning can be used by undergraduate or graduate students planning careers in either industry or research, and by software engineers who want to begin using deep learning in their products or platforms. A website offers supplementary material for both readers and instructors.
Verb Phrase Ellipsis Resolution Using Discriminative and Margin-Infused Algorithms
Jackie CK Cheung
Verb Phrase Ellipsis (VPE) is an anaphoric construction in which a verb phrase has been elided. It occurs frequently in dialogue and informal conversational settings, but despite its evident impact on event coreference resolution and extraction, there has been relatively little work on computational methods for identifying and resolving VPE. Here, we present a novel approach to detecting and resolving VPE by using supervised discriminative machine learning techniques trained on features extracted from an automatically parsed, publicly available dataset. Our approach yields state-of-the-art results for VPE detection by improving F1 score by over 11%; additionally, we explore an approach to antecedent identification that uses the Margin-Infused Relaxed Algorithm (MIRA), which shows promising results.
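For reference, a single MIRA update has a closed form; the sketch below shows it for an antecedent-scoring setting like the one described here, with the feature extraction and loss definition left abstract. The function names and the cap C are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def mira_update(w, feats_gold, feats_pred, loss, C=1.0):
    """One Margin-Infused Relaxed Algorithm step (illustrative sketch).

    w: weight vector; feats_gold / feats_pred: feature vectors of the correct
    and the currently predicted antecedent; loss: task loss of that prediction.
    """
    delta = feats_gold - feats_pred
    norm_sq = float(delta @ delta)
    if norm_sq == 0.0:
        return w  # identical features: nothing to separate
    # Smallest step that makes the gold antecedent win by a margin of `loss`,
    # capped at C (the "relaxed" part of the constraint).
    tau = min(C, max(0.0, (loss - w @ delta) / norm_sq))
    return w + tau * delta
```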
HeMIS: Hetero-Modal Image Segmentation
Prediction of Cell Type Specific Transcription Factor Binding Site Occupancy
Towards End-to-End Speech Recognition with Deep Convolutional Neural Networks
Convolutional Neural Networks (CNNs) are effective models for reducing spectral variations and modeling spectral correlations in acoustic features for automatic speech recognition (ASR). Hybrid speech recognition systems incorporating CNNs with Hidden Markov Models/Gaussian Mixture Models (HMMs/GMMs) have achieved state-of-the-art results on various benchmarks. Meanwhile, Connectionist Temporal Classification (CTC) with Recurrent Neural Networks (RNNs), which was proposed for labeling unsegmented sequences, makes it feasible to train an end-to-end speech recognition system rather than a hybrid one. However, RNNs are computationally expensive and sometimes difficult to train. In this paper, inspired by the advantages of both CNNs and the CTC approach, we propose an end-to-end speech framework for sequence labeling, by combining hierarchical CNNs with CTC directly, without recurrent connections. By evaluating the approach on the TIMIT phoneme recognition task, we show that the proposed model is not only computationally efficient, but also competitive with the existing baseline systems. Moreover, we argue that CNNs have the capability to model temporal correlations with appropriate context information.
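A minimal PyTorch sketch of the recurrence-free idea follows: stacked convolutions over acoustic features feed directly into a CTC loss. The layer sizes, the use of 1-D convolutions, and the label count are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ConvCTC(nn.Module):
    """Convolutional acoustic model with no recurrent connections (sketch)."""
    def __init__(self, n_feats=40, n_labels=62):  # e.g. 61 phones + CTC blank
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_feats, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(128, 128, kernel_size=5, padding=2), nn.ReLU(),
        )
        self.out = nn.Conv1d(128, n_labels, kernel_size=1)

    def forward(self, x):                    # x: (batch, n_feats, time)
        logits = self.out(self.conv(x))      # (batch, n_labels, time)
        # CTCLoss expects (time, batch, labels) log-probabilities.
        return logits.permute(2, 0, 1).log_softmax(-1)

model = ConvCTC()
ctc = nn.CTCLoss(blank=0)
x = torch.randn(8, 40, 200)                  # a batch of feature sequences
targets = torch.randint(1, 62, (8, 30))      # label ids, 0 reserved for blank
loss = ctc(model(x), targets,
           input_lengths=torch.full((8,), 200),
           target_lengths=torch.full((8,), 30))
```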
Generating Factoid Questions With Recurrent Neural Networks: The 30M Factoid Question-Answer Corpus
Iulian V. Serban
Alberto García-Durán
A. Chandar
Over the past decade, large-scale supervised learning corpora have enabled machine learning researchers to make substantial advances. However, to date, there are no large-scale question-answer corpora available. In this paper we present the 30M Factoid Question-Answer Corpus, an enormous corpus of question-answer pairs produced by applying a novel neural network architecture to the knowledge base Freebase to transduce facts into natural language questions. The produced question-answer pairs are evaluated both by human evaluators and using automatic evaluation metrics, including well-established machine translation and sentence similarity metrics. Across all evaluation criteria the question-generation model outperforms the competing template-based baseline. Furthermore, when presented to human evaluators, the generated questions appear comparable in quality to real human-generated questions.
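As a rough illustration of fact-to-question transduction, here is a minimal encoder-decoder sketch in PyTorch: the (subject, relationship, object) triple is embedded and pooled into the decoder's initial state, and the question is decoded word by word. All sizes and the pooling choice are assumptions; the paper's architecture and its handling of entity placeholders are not reproduced.

```python
import torch
import torch.nn as nn

class FactToQuestion(nn.Module):
    """Embed a knowledge-base triple, decode a question (illustrative sketch)."""
    def __init__(self, fact_vocab=10_000, word_vocab=20_000, dim=256):
        super().__init__()
        self.fact_emb = nn.Embedding(fact_vocab, dim)
        self.word_emb = nn.Embedding(word_vocab, dim)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, word_vocab)

    def forward(self, fact, question):
        # fact: (batch, 3) ids for subject / relationship / object;
        # question: (batch, T) target question ids, used with teacher forcing.
        h0 = self.fact_emb(fact).mean(dim=1, keepdim=True)        # pool triple
        h0 = h0.transpose(0, 1).contiguous()                      # (1, batch, dim)
        out, _ = self.decoder(self.word_emb(question), h0)
        return self.out(out)                                      # (batch, T, vocab)
```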
ReSeg: A Recurrent Neural Network-based Model for Semantic Segmentation
Adriana Romero
Matteo Matteucci
Marco Ciccone
We propose a structured prediction architecture which exploits the local generic features extracted by Convolutional Neural Networks and the capacity of Recurrent Neural Networks (RNNs) to retrieve distant dependencies. The proposed architecture, called ReSeg, is based on the recently introduced ReNet model for image classification. We modify and extend it to perform the more challenging task of semantic segmentation. Each ReNet layer is composed of four RNNs that sweep the image horizontally and vertically in both directions, encoding patches or activations, and providing relevant global information. Moreover, ReNet layers are stacked on top of pre-trained convolutional layers, benefiting from generic local features. Upsampling layers follow ReNet layers to recover the original image resolution in the final predictions. The proposed ReSeg architecture is efficient, flexible and suitable for a variety of semantic segmentation tasks. We evaluate ReSeg on several widely-used semantic segmentation datasets: Weizmann Horse, Oxford Flowers, and CamVid, achieving state-of-the-art performance. Results show that ReSeg can act as a suitable architecture for semantic segmentation tasks, and may have further applications in other structured prediction problems. The source code and model hyperparameters are available on https://github.com/fvisin/reseg.
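To illustrate the core building block, here is a minimal PyTorch sketch of one ReNet layer: two bidirectional RNNs give the four sweep directions, first along rows, then along columns. Dimensions and the choice of GRUs are illustrative assumptions, and the patch-extraction step is omitted.

```python
import torch
import torch.nn as nn

class ReNetLayer(nn.Module):
    """One ReNet layer: horizontal then vertical bidirectional sweeps (sketch)."""
    def __init__(self, in_ch, hid):
        super().__init__()
        self.horiz = nn.GRU(in_ch, hid, bidirectional=True, batch_first=True)
        self.vert = nn.GRU(2 * hid, hid, bidirectional=True, batch_first=True)

    def sweep(self, rnn, x):
        # x: (batch, H, W, C) -> run the RNN along the W axis of every row.
        b, h, w, c = x.shape
        out, _ = rnn(x.reshape(b * h, w, c))
        return out.reshape(b, h, w, -1)

    def forward(self, x):                             # x: (batch, H, W, in_ch)
        x = self.sweep(self.horiz, x)                 # left-right / right-left
        x = self.sweep(self.vert, x.transpose(1, 2))  # top-down / bottom-up
        return x.transpose(1, 2)                      # (batch, H, W, 2 * hid)
```

Stacking such layers after pre-trained convolutions, then upsampling back to the input resolution, gives the overall ReSeg pipeline described above.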
Deconstructing the Ladder Network Architecture
The manual labeling of data is, and will remain, a costly endeavor. For this reason, semi-supervised learning remains a topic of practical importance. The recently proposed Ladder Network is one such approach that has proven to be very successful. In addition to the supervised objective, the Ladder Network also adds an unsupervised objective corresponding to the reconstruction costs of a stack of denoising autoencoders. Although the empirical results are impressive, the Ladder Network has many components intertwined, whose contributions are not obvious in such a complex architecture. In order to help elucidate and disentangle the different ingredients in the Ladder Network recipe, this paper presents an extensive experimental investigation of variants of the Ladder Network in which we replace or remove individual components to gain more insight into their relative importance. We find that all of the components are necessary for achieving optimal performance, but they do not contribute equally. For semi-supervised tasks, we conclude that the most important contribution is made by the lateral connections, followed by the application of noise, and finally the choice of what we refer to as the "combinator function" in the decoder path. We also find that as the number of labeled training examples increases, the lateral connections and reconstruction criterion become less important, with most of the improvement in generalization being due to the injection of noise in each layer. Furthermore, we present a new type of combinator function that outperforms the original design in both fully- and semi-supervised tasks, reducing record test error rates on Permutation-Invariant MNIST to 0.57% for the supervised setting, and to 0.97% and 1.0% for semi-supervised settings with 1000 and 100 labeled examples respectively.
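For concreteness, here is a minimal PyTorch sketch of an MLP-style combinator in the spirit of the one investigated here: each unit combines its lateral (noisy encoder) activation z̃, the top-down decoder signal u, and their product through a tiny shared MLP. The hidden width and activation are illustrative assumptions; the paper's exact parameterization may differ.

```python
import torch
import torch.nn as nn

class MLPCombinator(nn.Module):
    """Merge lateral activation z_tilde with top-down signal u (sketch)."""
    def __init__(self, hidden=4):
        super().__init__()
        # Applied element-wise: every unit feeds (z_tilde, u, z_tilde * u)
        # through the same small MLP to produce its denoised estimate.
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.LeakyReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, z_tilde, u):
        # Stack the three per-unit inputs on a trailing axis, mix, and squeeze.
        inp = torch.stack([z_tilde, u, z_tilde * u], dim=-1)
        return self.net(inp).squeeze(-1)
```

The squared error between this output and the clean encoder activation then serves as the layer-wise reconstruction cost in the unsupervised objective.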