Aaron Courville

Colorectal cancer (CRC) is the third cause of cancer death worldwide. Currently, the standard approach to reduce CRC-related mortality is to perform regular screening in search for polyps and colonoscopy is the screening tool of choice. The main limitations of this screening procedure are polyp miss-rate and inability to perform visual assessment of polyp malignancy. These drawbacks can be reduced by designing Decision Support Systems (DSS) aiming to help clinicians in the different stages of the procedure by providing endoluminal scene segmentation. Thus, in this paper, we introduce an extended benchmark of colonoscopy image, with the hope of establishing a new strong benchmark for colonoscopy image analysis research. We provide new baselines on this dataset by training standard fully convolutional networks (FCN) for semantic segmentation and significantly outperforming, without any further post-processing, prior results in endoluminal scene segmentation.

2017-07-25

Journal of Healthcare Engineering (published)

Self-organized Hierarchical Softmax

Yikang Shen

Shawn Tan

Christopher Pal

We propose a new self-organizing hierarchical softmax formulation for neural-network-based language models over large vocabularies. Instead of using a predefined hierarchical structure, our approach is capable of learning word clusters with clear syntactical and semantic meaning during the language model training process. We provide experiments on standard benchmarks for language modeling and sentence compression tasks. We find that this approach is as fast as other efficient softmax approximations, while achieving comparable or even better performance relative to similar full softmax models.

2017-07-25

arXiv (preprint)

A dataset and exploration of models for understanding video data through fill-in-the-blank question-answering

Tegan Maharaj

Nicolas Ballas

Anna Rohrbach

Christopher Pal

While deep convolutional neural networks frequently approach or exceed human-level performance at benchmark tasks involving static images, extending this success to moving images is not straightforward. Having models which can learn to understand video is of interest for many applications, including content recommendation, prediction, summarization, event/object detection and understanding human visual perception, but many domains lack sufficient data to explore and perfect video models. In order to address the need for a simple, quantitative benchmark for developing and understanding video, we present MovieFIB, a fill-in-the-blank question-answering dataset with over 300,000 examples, based on descriptive video annotations for the visually impaired. In addition to presenting statistics and a description of the dataset, we perform a detailed analysis of 5 different models' predictions, and compare these with human performance. We investigate the relative importance of language, static (2D) visual features, and moving (3D) visual features; the effects of increasing dataset size, the number of frames sampled; and of vocabulary size. We illustrate that: this task is not solvable by a language model alone; our model combining 2D and 3D visual information indeed provides the best result; all models perform significantly worse than human-level. We provide human evaluations for responses given by different models and find that accuracy on the MovieFIB evaluation corresponds well with human judgement. We suggest avenues for improving video models, and hope that the proposed dataset can be useful for measuring and encouraging progress in this very interesting field.

2017-07-20

2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (published)

GuessWhat?! Visual Object Discovery through Multi-modal Dialogue

Harm de Vries

Florian Strub

A. Chandar

Olivier Pietquin

Hugo Larochelle

We introduce GuessWhat?!, a two-player guessing game as a testbed for research on the interplay of computer vision and dialogue systems. The goal of the game is to locate an unknown object in a rich image scene by asking a sequence of questions. Higher-level image understanding, like spatial reasoning and language grounding, is required to solve the proposed task. Our key contribution is the collection of a large-scale dataset consisting of 150K human-played games with a total of 800K visual question-answer pairs on 66K images. We explain our design decisions in collecting the dataset and introduce the oracle and questioner tasks that are associated with the two players of the game. We prototyped deep learning models to establish initial baselines of the introduced tasks.

2017-07-20

2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (published)

Multi-Modal Variational Encoder-Decoders

Iulian V. Serban

Alexander G. Ororbia II

Joelle Pineau

2017-04-23

arXiv.org (preprint)

Char2Wav: End-to-End Speech Synthesis

Jose Sotelo

2017-02-16

International Conference on Learning Representations (unknown)

Deep Nets Don't Learn Via Memorization

David Krueger

Nicolas Ballas

Stanislaw Jastrzebski

Maxinder S. Kanwal

2017-02-16

International Conference on Learning Representations (unknown)

A Hierarchical Latent Variable Encoder-Decoder Model for Generating Dialogues

Iulian V. Serban

Sequential data often possesses a hierarchical structure with complex dependencies between subsequences, such as found between the utterances in a dialogue. In an effort to model this kind of generative process, we propose a neural network-based generative architecture, with latent stochastic variables that span a variable number of time steps. We apply the proposed model to the task of dialogue response generation and compare it with recent neural network architectures. We evaluate the model performance through automatic evaluation metrics and by carrying out a human evaluation. The experiments demonstrate that our model improves upon recently proposed models and that the latent variables facilitate the generation of long outputs and maintain the context.

2017-02-11

Proceedings of the AAAI Conference on Artificial Intelligence (published)

Multiresolution Recurrent Neural Networks: An Application to Dialogue Response Generation

Iulian V. Serban

Tim Klinger

Gerald Tesauro

Kartik Talamadupula

Bowen Zhou

Yoshua Bengio

We introduce a new class of models called multiresolution recurrent neural networks, which explicitly model natural language generation at multiple levels of abstraction. The models extend the sequence-to-sequence framework to generate two parallel stochastic processes: a sequence of high-level coarse tokens, and a sequence of natural language words (e.g. sentences). The coarse sequences follow a latent stochastic process with a factorial representation, which helps the models generalize to new examples. The coarse sequences can also incorporate task-specific knowledge, when available. In our experiments, the coarse sequences are extracted using automatic procedures, which are designed to capture compositional structure and semantics. These procedures enable training the multiresolution recurrent neural networks by maximizing the exact joint log-likelihood over both sequences. We apply the models to dialogue response generation in the technical support domain and compare them with several competing models. The multiresolution recurrent neural networks outperform competing models by a substantial margin, achieving state-of-the-art results according to both a human evaluation study and automatic evaluation metrics. Furthermore, experiments show the proposed models generate more fluent, relevant and goal-oriented responses.

2017-02-11

Proceedings of the AAAI Conference on Artificial Intelligence (published)

Multiresolution Recurrent Neural Networks: An Application to Dialogue Response Generation

Iulian Vlad Serban

Tim Klinger

Gerald Tesauro

Kartik Talamadupula

Bowen Zhou

Yoshua Bengio

We introduce the multiresolution recurrent neural network, which extends the sequence-to-sequence framework to model natural language generation as two parallel discrete stochastic processes: a sequence of high-level coarse tokens, and a sequence of natural language tokens. There are many ways to estimate or learn the high-level coarse tokens, but we argue that a simple extraction procedure is sufficient to capture a wealth of high-level discourse semantics. Such a procedure allows training the multiresolution recurrent neural network by maximizing the exact joint log-likelihood over both sequences. In contrast to the standard log-likelihood objective w.r.t. natural language tokens (word perplexity), optimizing the joint log-likelihood biases the model towards modeling high-level abstractions. We apply the proposed model to the task of dialogue response generation in two challenging domains: the Ubuntu technical support domain, and Twitter conversations. On Ubuntu, the model outperforms competing approaches by a substantial margin, achieving state-of-the-art results according to both automatic evaluation metrics and a human evaluation study. On Twitter, the model appears to generate more relevant and on-topic responses according to automatic evaluation metrics. Finally, our experiments demonstrate that the proposed model is more adept at overcoming the sparsity of natural language and is better able to capture long-term structure.

2017-02-11

AAAI Conference on Artificial Intelligence (published)

Adversarially Learned Inference

Vincent Dumoulin

Ishmael Belghazi

Ben Poole

We introduce the adversarially learned inference (ALI) model, which jointly learns a generation network and an inference network using an adversarial process. The generation network maps samples from stochastic latent variables to the data space while the inference network maps training examples in data space to the space of latent variables. An adversarial game is cast between these two networks and a discriminative network is trained to distinguish between joint latent/data-space samples from the generative network and joint samples from the inference network. We illustrate the ability of the model to learn mutually coherent inference and generation networks through the inspections of model samples and reconstructions and confirm the usefulness of the learned representations by obtaining a performance competitive with state-of-the-art on the semi-supervised SVHN and CIFAR10 tasks.

2017-02-05

International Conference on Learning Representations (poster)

Calibrating Energy-based Generative Adversarial Networks

Zihang Dai

Amjad Almahairi

Philip Bachman

Eduard Hovy

In this paper, we propose to equip Generative Adversarial Networks with the ability to produce direct energy estimates for samples. Specifically, we propose a flexible adversarial training framework, and prove this framework not only ensures the generator converges to the true data distribution, but also enables the discriminator to retain the density information at the global optimal. We derive the analytic form of the induced solution, and analyze the properties. In order to make the proposed framework trainable in practice, we introduce two effective approximation techniques. Empirically, the experiment results closely match our theoretical analysis, verifying the discriminator is able to recover the energy of data distribution.

2017-02-05

ICLR.cc/2017/conference (poster)