Hugo Larochelle

Publications

Blindfold Baselines for Embodied QA

We explore blindfold (question-only) baselines for Embodied Question Answering. The EmbodiedQA task requires an agent to answer a question by intelligently navigating in a simulated environment, gathering necessary visual information only through first-person vision before finally answering. Consequently, a blindfold baseline which ignores the environment and visual information is a degenerate solution, yet we show through our experiments on the EQAv1 dataset that a simple question-only baseline achieves state-of-the-art results on the EmbodiedQA task in all cases except when the agent is spawned extremely close to the object.

2018-11-12

ArXiv (preprint)

HoME: a Household Multimodal Environment

Simon Brodeur

Luca Celotti

Jean Rouat

We introduce HoME: a Household Multimodal Environment for artificial agents to learn from vision, audio, semantics, physics, and interaction with objects and other agents, all within a realistic context. HoME integrates over 45,000 diverse 3D house layouts based on the SUNCG dataset, a scale which may facilitate learning, generalization, and transfer. HoME is an open-source, OpenAI Gym-compatible platform extensible to tasks in reinforcement learning, language grounding, sound-based navigation, robotics, multi-agent learning, and more. We hope HoME better enables artificial agents to learn as humans do: in an interactive, multimodal, and richly contextualized setting.

2018-01-01

ICLR (Workshop) (published)

openreview.net

GuessWhat?! Visual Object Discovery through Multi-modal Dialogue

Olivier Pietquin

We introduce GuessWhat?!, a two-player guessing game as a testbed for research on the interplay of computer vision and dialogue systems. The goal of the game is to locate an unknown object in a rich image scene by asking a sequence of questions. Higher-level image understanding, like spatial reasoning and language grounding, is required to solve the proposed task. Our key contribution is the collection of a large-scale dataset consisting of 150K human-played games with a total of 800K visual question-answer pairs on 66K images. We explain our design decisions in collecting the dataset and introduce the oracle and questioner tasks that are associated with the two players of the game. We prototyped deep learning models to establish initial baselines of the introduced tasks.

2017-07-21

2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (published)

Modulating early visual processing by language

Jérémie Mary

Olivier Pietquin

It is commonly assumed that language refers to high-level visual concepts while leaving low-level visual processing unaffected. This view dominates the current literature in computational models for language-vision tasks, where visual and linguistic input are mostly processed independently before being fused into a single representation. In this paper, we deviate from this classic pipeline and propose to modulate the entire visual processing by linguistic input. Specifically, we condition the batch normalization parameters of a pretrained residual network (ResNet) on a language embedding. This approach, which we call MOdulated RESnet (MODERN), significantly improves strong baselines on two visual question answering tasks. Our ablation study shows that modulating from the early stages of the visual processing is beneficial.

2017-07-02

ArXiv (preprint)

Movie Description

Anna Rohrbach

Marcus Rohrbach

Niket Tandon

Bernt Schiele

2017-01-25

International Journal of Computer Vision (published)

Brain tumor segmentation with Deep Neural Networks

Pierre-Marc Jodoin

2017-01-01

Medical Image Analysis (published)

Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations

János Kramár

Nan Rosemary Ke

We propose zoneout, a novel method for regularizing RNNs. At each timestep, zoneout stochastically forces some hidden units to maintain their previous values. Like dropout, zoneout uses random noise to train a pseudo-ensemble, improving generalization. But by preserving instead of dropping hidden units, gradient information and state information are more readily propagated through time, as in feedforward stochastic depth networks. We perform an empirical investigation of various RNN regularizers, and find that zoneout gives significant performance improvements across tasks. We achieve competitive results with relatively simple models in character- and word-level language modelling on the Penn Treebank and Text8 datasets, and combining with recurrent batch normalization yields state-of-the-art results on permuted sequential MNIST.
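As an illustration only, the per-timestep update described in this abstract might be sketched as follows. This is a minimal sketch, not the authors' implementation: the function name `zoneout`, its parameters, and the use of plain Python lists for hidden states are hypothetical, and the inference branch (blending old and new states by the expected value of the mask, as is done for dropout) is an assumption that the abstract does not state.

```python
import random


def zoneout(h_prev, h_new, rate, training=True, rng=random):
    """Combine the previous and candidate hidden states for one timestep.

    h_prev, h_new: equal-length lists of floats (hypothetical representation).
    rate: probability that a unit keeps its previous value.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must lie in [0, 1]")
    if len(h_prev) != len(h_new):
        raise ValueError("hidden states must have the same length")
    if training:
        # Each unit independently keeps its previous value with probability
        # `rate`; otherwise it takes the newly computed value.
        return [p if rng.random() < rate else n for p, n in zip(h_prev, h_new)]
    # Assumption: at inference, use the expectation of the random mask.
    return [rate * p + (1.0 - rate) * n for p, n in zip(h_prev, h_new)]
```

In a recurrent loop, `h_new` would be the output of the ordinary RNN cell and the returned list would become `h_prev` for the next timestep. Passing a seeded `random.Random` instance as `rng` makes the masks reproducible.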

2016-06-03

ArXiv (preprint)

Movie Description

Anna Rohrbach

Marcus Rohrbach

Niket Tandon

Bernt Schiele

Audio description (AD) provides linguistic descriptions of movies and allows visually impaired people to follow a movie along with their peers. Such descriptions are by design mainly visual and thus naturally form an interesting data source for computer vision and computational linguistics. In this work we propose a novel dataset which contains transcribed ADs, which are temporally aligned to full length movies. In addition we also collected and aligned movie scripts used in prior work and compare the two sources of descriptions. We introduce the Large Scale Movie Description Challenge (LSMDC) which contains a parallel corpus of 128,118 sentences aligned to video clips from 200 movies (around 150 h of video in total). The goal of the challenge is to automatically generate descriptions for the movie clips. First we characterize the dataset by benchmarking different approaches for generating video descriptions. Comparing ADs to scripts, we find that ADs are more visual and describe precisely what is shown rather than what should happen according to the scripts created prior to movie production. Furthermore, we present and compare the results of several teams who participated in the challenges organized in the context of two workshops at ICCV 2015 and ECCV 2016.

2016-05-12

ArXiv (preprint)

Movie Description

Anna Rohrbach

Marcus Rohrbach

Niket Tandon