Publications

Unsupervised State Representation Learning in Atari

Evan Racah

R Devon Hjelm

State representation learning, or the ability to capture latent generative factors of an environment, is crucial for building intelligent agents that can perform a wide variety of tasks. Learning such representations without supervision from rewards is a challenging open problem. We introduce a method that learns state representations by maximizing mutual information across spatially and temporally distinct features of a neural encoder of the observations. We also introduce a new benchmark based on Atari 2600 games where we evaluate representations based on how well they capture the ground truth state variables. We believe this new framework for evaluating representation learning models will be crucial for future representation learning research. Finally, we compare our technique with other state-of-the-art generative and contrastive representation learning methods. The code associated with this work is available at this https URL

2018-12-31

Neural Information Processing Systems (published)

Updates of Equilibrium Prop Match Gradients of Backprop Through Time in an RNN with Static Input

Maxence Ernoult

Julie Grollier

Damien Querlioz

Benjamin Scellier

Equilibrium Propagation (EP) is a biologically inspired learning algorithm for convergent recurrent neural networks, i.e. RNNs that are fed by a static input x and settle to a steady state. Training convergent RNNs consists in adjusting the weights until the steady state of output neurons coincides with a target y. Convergent RNNs can also be trained with the more conventional Backpropagation Through Time (BPTT) algorithm. In its original formulation, EP was described in the case of real-time neuronal dynamics, which is computationally costly. In this work, we introduce a discrete-time version of EP with simplified equations and with reduced simulation time, bringing EP closer to practical machine learning tasks. We first prove theoretically, as well as numerically, that the neural and weight updates of EP, computed by forward-time dynamics, are step-by-step equal to the ones obtained by BPTT, with gradients computed backward in time. The equality is strict when the transition function of the dynamics derives from a primitive function and the steady state is maintained long enough. We then show for more standard discrete-time neural network dynamics that the same property is approximately respected and we subsequently demonstrate training with EP with equivalent performance to BPTT. In particular, we define the first convolutional architecture trained with EP achieving ~1% test error on MNIST, which is the lowest error reported with EP. These results can guide the development of deep neural networks trained with EP.

2018-12-31

Advances in Neural Information Processing Systems 32 (NeurIPS 2019) (published)

Variational Temporal Abstraction

Taesup Kim

Sungjin Ahn

We introduce a variational approach to learning and inference of temporally hierarchical structure and representation for sequential data. We propose the Variational Temporal Abstraction (VTA), a hierarchical recurrent state space model that can infer the latent temporal structure and thus perform the stochastic state transition hierarchically. We also propose to apply this model to implement the jumpy imagination ability in imagination-augmented agent-learning in order to improve the efficiency of the imagination. In experiments, we demonstrate that our proposed method can model 2D and 3D visual sequence datasets with interpretable temporal structure discovery and that its application to jumpy imagination enables more efficient agent-learning in a 3D navigation task.

2018-12-31

Neural Information Processing Systems (published)

VideoNavQA: Bridging the Gap between Visual and Embodied Question Answering

Cătălina Cangea

Eugene Belilovsky

Pietro Lio

Aaron Courville

Embodied Question Answering (EQA) is a recently proposed task, where an agent is placed in a rich 3D environment and must act based solely on its egocentric input to answer a given question. The desired outcome is that the agent learns to combine capabilities such as scene understanding, navigation and language understanding in order to perform complex reasoning in the visual world. However, initial advancements combining standard vision and language methods with imitation and reinforcement learning algorithms have shown EQA might be too complex and challenging for these techniques. In order to investigate the feasibility of EQA-type tasks, we build the VideoNavQA dataset that contains pairs of questions and videos generated in the House3D environment. The goal of this dataset is to assess question-answering performance from nearly-ideal navigation paths, while considering a much more complete variety of questions than current instantiations of the EQA task. We investigate several models, adapted from popular VQA methods, on this new benchmark. This establishes an initial understanding of how well VQA-style methods can perform within this novel EQA paradigm.

2018-12-31

ViGIL@NeurIPS (published)

Wasserstein Dependency Measure for Representation Learning

Sherjil Ozair

Corey Lynch

Aäron van den Oord

Sergey Levine

Pierre Sermanet

Mutual information maximization has emerged as a powerful learning objective for unsupervised representation learning, obtaining state-of-the-art performance in applications such as object recognition, speech recognition, and reinforcement learning. However, such approaches are fundamentally limited since a tight lower bound of mutual information requires sample size exponential in the mutual information. This limits the applicability of these approaches for prediction tasks with high mutual information, such as in video understanding or reinforcement learning. In these settings, such techniques are prone to overfit, both in theory and in practice, and capture only a few of the relevant factors of variation. This leads to incomplete representations that are not optimal for downstream tasks. In this work, we empirically demonstrate that mutual information-based representation learning approaches do fail to learn complete representations on a number of designed and real-world tasks. To mitigate these problems we introduce the Wasserstein dependency measure, which learns more complete representations by using the Wasserstein distance instead of the KL divergence in the mutual information estimator. We show that a practical approximation to this theoretically motivated solution, constructed using Lipschitz constraint techniques from the GAN literature, achieves substantially improved results on tasks where incomplete representations are a major challenge.

2018-12-31

Advances in Neural Information Processing Systems 32 (NeurIPS 2019) (published)

»Deep Learning ist keine Religion«

Andreas Sudmann

2018-12-30

Machine-mediated learning (published)

Building a Neural Semantic Parser from a Domain Ontology

Jianpeng Cheng

Siva Reddy

Mirella Lapata

Semantic parsing is the task of converting natural language utterances into machine-interpretable meaning representations which can be executed against a real-world environment such as a database. Scaling semantic parsing to arbitrary domains faces two interrelated challenges: obtaining broad coverage training data effectively and cheaply; and developing a model that generalizes to compositional utterances and complex intentions. We address these challenges with a framework which allows us to elicit training data from a domain ontology and bootstrap a neural parser which recursively builds derivations of logical forms. In our framework, meaning representations are described by sequences of natural language templates, where each template corresponds to a decomposed fragment of the underlying meaning representation. Although artificial, templates can be understood and paraphrased by humans to create natural utterances, resulting in parallel triples of utterances, meaning representations, and their decompositions. These allow us to train a neural semantic parser which learns to compose rules in deriving meaning representations. We crowdsource training data on six domains, covering both single-turn utterances which exhibit rich compositionality, and sequential utterances where a complex task is procedurally performed in steps. We then develop neural semantic parsers which perform such compositional tasks. In general, our approach allows neural semantic parsers to be deployed quickly and cheaply from a given domain ontology.

2018-12-24

ArXiv (preprint)

Quantized Guided Pruning for Efficient Hardware Implementations of Convolutional Neural Networks

Ghouthi Boukli Hacene

Vincent Gripon

Matthieu Arzel

Nicolas Farrugia

Convolutional Neural Networks (CNNs) are state-of-the-art in numerous computer vision tasks such as object classification and detection. However, the large number of parameters they contain leads to a high computational complexity and strongly limits their usability in budget-constrained devices such as embedded devices. In this paper, we propose a combination of a new pruning technique and a quantization scheme that effectively reduces the complexity and memory usage of convolutional layers of CNNs, and replaces the complex convolutional operation with a low-cost multiplexer. We perform experiments on the CIFAR10, CIFAR100 and SVHN datasets and show that the proposed method achieves almost state-of-the-art accuracy, while drastically reducing the computational and memory footprints. We also propose an efficient hardware architecture to accelerate CNN operations. The proposed hardware architecture is a pipeline and accommodates multiple layers working at the same time to speed up the inference process.

2018-12-24

ArXiv (preprint)

Clustering-Oriented Representation Learning with Attractive-Repulsive Loss

Lucas Caccia

Jackie CK Cheung

The standard loss function used to train neural network classifiers, categorical cross-entropy (CCE), seeks to maximize accuracy on the training data; building useful representations is not a necessary byproduct of this objective. In this work, we propose clustering-oriented representation learning (COREL) as an alternative to CCE in the context of a generalized attractive-repulsive loss framework. COREL has the consequence of building latent representations that collectively exhibit the quality of natural clustering within the latent space of the final hidden layer, according to a predefined similarity function. Despite being simple to implement, COREL variants outperform or perform equivalently to CCE in a variety of scenarios, including image and news article classification using both feed-forward and convolutional neural networks. Analysis of the latent spaces created with different similarity functions facilitates insights on the different use cases COREL variants can satisfy, where the Cosine-COREL variant makes a consistently clusterable latent space, while Gaussian-COREL consistently obtains better classification accuracy than CCE.

2018-12-17

ArXiv (preprint)

Speaker Recognition from Raw Waveform with SincNet

Mirco Ravanelli

Deep learning is progressively gaining popularity as a viable alternative to i-vectors for speaker recognition. Promising results have been recently obtained with Convolutional Neural Networks (CNNs) when fed by raw speech samples directly. Rather than employing standard hand-crafted features, the latter CNNs learn low-level speech representations from waveforms, potentially allowing the network to better capture important narrow-band speaker characteristics such as pitch and formants. Proper design of the neural network is crucial to achieve this goal. This paper proposes a novel CNN architecture, called SincNet, that encourages the first convolutional layer to discover more meaningful filters. SincNet is based on parametrized sinc functions, which implement band-pass filters. In contrast to standard CNNs, which learn all elements of each filter, only low and high cutoff frequencies are directly learned from data with the proposed method. This offers a very compact and efficient way to derive a customized filter bank specifically tuned for the desired application. Our experiments, conducted on both speaker identification and speaker verification tasks, show that the proposed architecture converges faster and performs better than a standard CNN on raw waveforms.

2018-12-17

2018 IEEE Spoken Language Technology Workshop (SLT) (published)
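
As an illustration of the idea summarized in the SincNet abstract above, a band-pass filter can be described by just two numbers, its low and high cutoff frequencies, by taking the difference of two sinc-shaped low-pass filters. The following is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the function names `sinc_bandpass` and `gain_at`, the kernel length of 251 taps, the 16 kHz sample rate, and the Hamming window are all choices made here for illustration, and the cutoffs are fixed constants rather than parameters learned from data.

```python
import numpy as np


def sinc_bandpass(f_low, f_high, kernel_size=251, sample_rate=16000):
    """Windowed-sinc band-pass FIR kernel defined only by two cutoffs in Hz.

    Illustrative helper (hypothetical name); in a SincNet-style layer the
    two cutoffs would be trainable, here they are plain constants.
    """
    # Normalised cutoffs in cycles per sample.
    f1 = f_low / sample_rate
    f2 = f_high / sample_rate
    # Symmetric time axis centred on zero.
    n = np.arange(kernel_size) - (kernel_size - 1) / 2
    # np.sinc(x) is sin(pi*x) / (pi*x). An ideal band-pass response is the
    # difference of two ideal low-pass responses with cutoffs f2 and f1.
    kernel = 2 * f2 * np.sinc(2 * f2 * n) - 2 * f1 * np.sinc(2 * f1 * n)
    # Taper the truncated kernel to reduce ripple (assumed window choice).
    return kernel * np.hamming(kernel_size)


def gain_at(kernel, freq_hz, sample_rate=16000):
    """Magnitude of the kernel's frequency response at a single frequency."""
    n = np.arange(len(kernel))
    phase = np.exp(-2j * np.pi * freq_hz / sample_rate * n)
    return abs(np.sum(kernel * phase))


kernel = sinc_bandpass(1000.0, 2000.0)
print(gain_at(kernel, 1500.0))  # inside the pass band
print(gain_at(kernel, 200.0))   # below the pass band
```

With these settings the kernel passes a 1500 Hz tone almost unchanged and strongly attenuates a 200 Hz tone, even though the whole filter is determined by the two cutoff values rather than by 251 free coefficients.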

Object Detection using Deep Learning

Chamarty Anusha

P. Avadhani

P. S.

Mohannad Elhamod

Martin D. Levine

Ajeet Ram Pathak

Manjusha Pandey

Siddharth S. Rautaray

Christian Szegedy

Alexander T Toshev

Dumitru Erhan

Xiao Ning

Wen Zhu

Shifeng Chen

Zhong-Qiu Zhao

Peng Zheng

Shou-tao Xu

Xindong Wu

Sakshi Indolia

Anil Kumar Goswani

S. P. Mishra

Pooja Asopa

Yann LeCun

Joseph Redmon

Santosh Kumar Divvala

Ross Girshick

Ali Farhadi

M. Kruithof

Henri Bouma

Noelle M. Fischer

Klamer Schutte

Autonomous vehicles, surveillance systems, and face detection systems have led to the development of accurate object detection systems [1]. These systems recognize, classify and localize every object in an image by drawing bounding boxes around the object [2]. They use existing classification models as the backbone for object detection. Object detection is the process of finding instances of real-world objects, such as human faces, animals and vehicles, in images or videos. An object detection algorithm uses extracted features and learning techniques to recognize the objects in an image. In this paper, various object detection techniques have been studied and some of them are implemented. As a part of this paper, three algorithms for object detection in an image were implemented and their results were compared. The algorithms are “Object Detection using Deep Learning Framework by OpenCV”, “Object Detection using Tensorflow” and “Object Detection using Keras models”.

2018-12-16

International Journal of Computer Applications (published)