Publications

Community size effect in artificial learning systems

Olivier Tieleman

Angeliki Lazaridou

Shibl Mourad

Charles Blundell

Motivated by theories of language and communication that explain why communities with large numbers of speakers have, on average, simpler languages with more regularity, we cast the representation learning problem in terms of learning to communicate. Our starting point sees the traditional autoencoder setup as a single encoder with a fixed decoder partner that must learn to communicate. Generalizing from there, we introduce community-based autoencoders in which multiple encoders and decoders collectively learn representations by being randomly paired up on successive training iterations. We find that increasing community sizes reduce idiosyncrasies in the learned codes, resulting in representations that better encode concept categories and correlate with human feature norms.
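The random pairing described in this abstract can be illustrated with a deliberately tiny sketch. It is not the authors' model: each "encoder" and "decoder" here is a single scalar weight, the loss is the squared reconstruction error, and the function names, community size, learning rate and step count are all assumptions chosen for illustration.

```python
import random


def train_community(encoders, decoders, steps=20000, lr=0.05, seed=0):
    """Train scalar encoders and decoders with random pairing.

    encoders[i] maps x -> a_i * x and decoders[j] maps z -> b_j * z.
    On every step one encoder and one decoder are drawn at random and
    both take a gradient step on the squared reconstruction error.
    """
    rng = random.Random(seed)
    for _ in range(steps):
        i = rng.randrange(len(encoders))  # randomly chosen encoder
        j = rng.randrange(len(decoders))  # randomly chosen decoder partner
        x = rng.uniform(-1.0, 1.0)        # toy one-dimensional input
        a, b = encoders[i], decoders[j]
        err = a * b * x - x               # reconstruction error
        # Gradients of err**2 with respect to a and b.
        encoders[i] = a - lr * 2.0 * err * b * x
        decoders[j] = b - lr * 2.0 * err * a * x
    return encoders, decoders


def worst_pair_error(encoders, decoders):
    """Largest |a_i * b_j - 1| over all encoder/decoder pairs."""
    return max(abs(a * b - 1.0) for a in encoders for b in decoders)


# Hypothetical community of four encoders and four decoders.
init_rng = random.Random(1)
encoders = [init_rng.uniform(0.5, 1.5) for _ in range(4)]
decoders = [init_rng.uniform(0.5, 1.5) for _ in range(4)]
error_before = worst_pair_error(encoders, decoders)
train_community(encoders, decoders)
error_after = worst_pair_error(encoders, decoders)
```

In this toy setting, perfect reconstruction for every pair requires a_i * b_j = 1 for all i and j, which is only possible if all encoders agree on one shared scale. That is a crude analogue of the reduced idiosyncrasy reported above, not a reproduction of it.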

2018-12-31

ViGIL@NeurIPS (published)

dblp.uni-trier.de

Connecting Weighted Automata and Recurrent Neural Networks through Spectral Learning (Supplementary Material) A Proofs

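More precisely, the WFA $A = (\alpha, \{A^\sigma\}_{\sigma \in \Sigma}, \Omega)$ with $n$ states and the linear 2-RNN $M = (\alpha, \mathcal{A}, \Omega)$ with $n$ hidden units, where $\mathcal{A} \in \mathbb{R}^{n \times \Sigma \times n}$ is defined by $\mathcal{A}_{:,\sigma,:} = A^\sigma$ for all $\sigma \in \Sigma$, are such that $f_A(\sigma_1 \sigma_2 \cdots \sigma_k) = f_M(x_1, x_2, \cdots, x_k)$ for all sequences of input symbols $\sigma_1, \cdots, \sigma_k \in \Sigma$, where for each $i \in [k]$ the input vector $x_i \in \mathbb{R}^\Sigma$ is the one-hot encoding of the symbol $\sigma_i$. Proof. We first show by induction on $k$ that, for any sequence $\sigma_1 \cdots \sigma_k \in \Sigma^*$, the hidden state $h_k$ computed by $M$ (see Eq. (1)) on the corresponding one-hot encoded sequence $x_1, \cdots, x_k \in \mathbb{R}^\Sigma$ satisfies $h_k = (A^{\sigma_1} \cdots A^{\sigma_k})^\top \alpha$. The case $k = 0$ is immediate. Suppose the result is true for sequences of length up to $k$. One can easily check that $\mathcal{A} \bullet_2 x_i = A^{\sigma_i}$ for any index $i$. Using the induction hypothesis, it then follows that $h_{k+1} = \mathcal{A} \bullet_1 h_k \bullet_2 x_{k+1} = A^{\sigma_{k+1}} \bullet_1 h_k = (A^{\sigma_{k+1}})^\top h_k = (A^{\sigma_{k+1}})^\top (A^{\sigma_1} \cdots A^{\sigma_k})^\top \alpha = (A^{\sigma_1} \cdots A^{\sigma_{k+1}})^\top \alpha$.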
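The equivalence stated in this excerpt can be checked numerically on a small random example. The sketch below is only an illustration under stated assumptions: the output is scalar, so the final weights are treated as a vector (called `omega` here), and the function names, sizes and random values are hypothetical rather than taken from the paper.

```python
import math
import random


def wfa_value(alpha, mats, omega, symbols):
    """WFA value: alpha^T A^{s_1} ... A^{s_k} omega."""
    row = list(alpha)
    n = len(row)
    for s in symbols:
        # row^T <- row^T A^s (vector-matrix product)
        row = [sum(row[i] * mats[s][i][j] for i in range(n)) for j in range(n)]
    return sum(r * w for r, w in zip(row, omega))


def linear_2rnn_value(alpha, tensor, omega, symbols, n_symbols):
    """Linear 2-RNN: h_t[j] = sum_{i,t'} T[i][t'][j] * h_{t-1}[i] * x_t[t']."""
    h = list(alpha)
    n = len(h)
    for s in symbols:
        x = [1.0 if t == s else 0.0 for t in range(n_symbols)]  # one-hot input
        h = [
            sum(tensor[i][t][j] * h[i] * x[t]
                for i in range(n) for t in range(n_symbols))
            for j in range(n)
        ]
    return sum(hj * w for hj, w in zip(h, omega))


rng = random.Random(0)
n, n_symbols = 3, 2
alpha = [rng.uniform(-1, 1) for _ in range(n)]
omega = [rng.uniform(-1, 1) for _ in range(n)]
# One n x n transition matrix per symbol.
mats = [[[rng.uniform(-1, 1) for _ in range(n)] for _ in range(n)]
        for _ in range(n_symbols)]
# Third-order tensor whose slice [:, s, :] equals the matrix for symbol s.
tensor = [[[mats[s][i][j] for j in range(n)] for s in range(n_symbols)]
          for i in range(n)]

sequence = [0, 1, 1, 0, 1]
value_wfa = wfa_value(alpha, mats, omega, sequence)
value_rnn = linear_2rnn_value(alpha, tensor, omega, sequence, n_symbols)
```

Both functions apply the same recursion: with a one-hot input, contracting the tensor along its second mode simply selects the transition matrix of the current symbol.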

2018-12-31

(published)

www.semanticscholar.org

Data-driven Chance Constrained Programming based Electric Vehicle Penetration Analysis

Di Wu

Tracy Can Cui

Benoit Boulet

Transportation electrification has been growing rapidly in recent years. The adoption of electric vehicles (EVs) could help to reduce the dependency on oil and cut greenhouse gas emissions. However, increasing EV adoption will also impose a high demand on the power grid and may jeopardize the grid network infrastructure. In certain areas with high EV penetration, the EV charging demand may lead to transformer overloading at peak hours, which makes maximal EV penetration analysis an urgent problem to solve. This paper proposes a data-driven chance constrained programming based framework for maximal EV penetration analysis. Simulation results are presented for a real-world neighborhood level network. The proposed framework could serve as guidance for utility companies scheduling infrastructure upgrades.

2018-12-31

(published)

www.semanticscholar.org

An Empirical Study of Batch Normalization and Group Normalization in Conditional Computation

Vincent Michalski

Vikram Voleti

Samira Ebrahimi Kahou

Anthony Ortiz

Pascal Vincent

Chris Pal

Batch normalization has been widely used to improve optimization in deep neural networks. While the uncertainty in batch statistics can act as a regularizer, using these dataset statistics specific to the training set impairs generalization in certain tasks. Recently, alternative methods for normalizing feature activations in neural networks have been proposed. Among them, group normalization has been shown to yield performance similar, and in some domains even superior, to batch normalization. All these methods utilize a learned affine transformation after the normalization operation to increase representational power. Methods used in conditional computation define the parameters of these transformations as learnable functions of conditioning information. In this work, we study whether and where the conditional formulation of group normalization can improve generalization compared to conditional batch normalization. We evaluate performance on the tasks of visual question answering, few-shot learning, and conditional image generation.
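As a rough illustration of the conditional formulation described in this abstract, the sketch below normalizes channels within groups and then applies a scale and shift that are linear functions of a conditioning vector. It is a simplification under stated assumptions: a single sample with one value per channel (no spatial dimensions), a linear map for the conditioning, and hypothetical names (`w_gamma`, `w_beta`, `cond`) that do not come from the paper.

```python
import math


def conditional_group_norm(features, cond, w_gamma, w_beta, num_groups, eps=1e-5):
    """Group-normalize one sample, then apply a condition-dependent affine map.

    features: list of C channel values for a single sample.
    cond: conditioning vector.
    w_gamma, w_beta: C rows of weights, one row per channel (assumed linear maps).
    """
    num_channels = len(features)
    if num_channels % num_groups != 0:
        raise ValueError("num_groups must divide the number of channels")
    size = num_channels // num_groups
    out = []
    for g in range(num_groups):
        group = features[g * size:(g + 1) * size]
        mean = sum(group) / size
        var = sum((v - mean) ** 2 for v in group) / size
        for k, v in enumerate(group):
            c = g * size + k
            normalized = (v - mean) / math.sqrt(var + eps)
            # Scale and shift predicted from the conditioning information.
            gamma = 1.0 + sum(w * z for w, z in zip(w_gamma[c], cond))
            beta = sum(w * z for w, z in zip(w_beta[c], cond))
            out.append(gamma * normalized + beta)
    return out


features = [1.0, 3.0, -2.0, 2.0]
cond = [1.0, 0.0]
zeros = [[0.0, 0.0] for _ in features]
shift = [[0.5, 0.0] for _ in features]

plain = conditional_group_norm(features, cond, zeros, zeros, num_groups=2)
shifted = conditional_group_norm(features, cond, zeros, shift, num_groups=2)
```

With zero conditioning weights the function reduces to plain group normalization. Unlike batch normalization, the statistics are computed per sample, so they do not depend on the other examples in a batch.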

2018-12-31

arXiv (preprint)

An Empirical Study of Example Forgetting During Deep Neural Network Learning

Mariya Toneva

Alessandro Sordoni

Remi Tachet des Combes

Adam Trischler

Yoshua Bengio

Geoffrey J. Gordon

Inspired by the phenomenon of catastrophic forgetting, we investigate the learning dynamics of neural networks as they train on single classification tasks. Our goal is to understand whether a related phenomenon occurs when data does not undergo a clear distributional shift. We define a 'forgetting event' to have occurred when an individual training example transitions from being classified correctly to incorrectly over the course of learning. Across several benchmark data sets, we find that: (i) certain examples are forgotten with high frequency, and some not at all; (ii) a data set's (un)forgettable examples generalize across neural architectures; and (iii) based on forgetting dynamics, a significant fraction of examples can be omitted from the training data set while still maintaining state-of-the-art generalization performance.
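The definition of a forgetting event given in this abstract translates directly into a counting procedure. The sketch below assumes that correctness has been recorded for every training example at successive evaluation points; the function name and the toy history are hypothetical.

```python
def count_forgetting_events(correct_history):
    """Count correct -> incorrect transitions for each training example.

    correct_history[t][i] is True if example i is classified correctly
    at evaluation point t.
    """
    if not correct_history:
        return []
    counts = [0] * len(correct_history[0])
    for previous, current in zip(correct_history, correct_history[1:]):
        for i, (was_correct, is_correct) in enumerate(zip(previous, current)):
            if was_correct and not is_correct:
                counts[i] += 1  # one forgetting event for example i
    return counts


# Toy history: five evaluation points, three examples.
history = [
    [False, True, False],
    [True, True, False],
    [False, True, True],
    [True, True, True],
    [False, True, False],
]
counts = count_forgetting_events(history)
```

In this toy history the first example is forgotten twice, the second is never forgotten, and the third is forgotten once.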

2018-12-31

ICLR.cc/2019/Conference (poster)

openreview.net

A Geometric Perspective on Optimal Representations for Reinforcement Learning

Bellemare Marc-Emmanuel

Will Dabney

Robert Dadashi

Adrien Ali Taiga

Pablo Samuel Castro

Nicolas Roux

Dale Schuurmans

Tor Lattimore

Clare Lyle

2018-12-31

NeurIPS (published)

openreview.net

Gradient based sample selection for online continual learning

A continual learning agent learns online with a non-stationary and never-ending stream of data. The key to such a learning process is to overcome the catastrophic forgetting of previously seen data, which is a well-known problem of neural networks. To prevent forgetting, a replay buffer is usually employed to store the previous data for the purpose of rehearsal. Previous works often depend on task boundary and i.i.d. assumptions to properly select samples for the replay buffer. In this work, we formulate sample selection as a constraint reduction problem based on the constrained optimization view of continual learning. The goal is to select a fixed subset of constraints that best approximates the feasible region defined by the original constraints. We show that this is equivalent to maximizing the diversity of samples in the replay buffer, with the parameter gradients as the features. We further develop a greedy alternative that is cheap and efficient. The advantage of the proposed method is demonstrated by comparing it to other alternatives under the continual learning setting. Further comparisons are made against state-of-the-art methods that rely on task boundaries, which show comparable or even better results for our method.
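The idea of filling a replay buffer with samples whose gradients are diverse can be sketched with a simple greedy heuristic. This is not the authors' algorithm: it is a farthest-point style selection that repeatedly adds the sample whose gradient is least similar (by cosine similarity) to those already chosen, and the function names and toy gradients are assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two gradient vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def select_diverse(gradients, budget):
    """Greedily pick sample indices whose gradients are mutually dissimilar."""
    if not gradients or budget <= 0:
        return []
    selected = [0]  # arbitrary starting sample
    while len(selected) < min(budget, len(gradients)):
        best_index, best_score = None, None
        for i in range(len(gradients)):
            if i in selected:
                continue
            # Similarity to the closest sample already in the buffer.
            score = max(cosine(gradients[i], gradients[j]) for j in selected)
            if best_score is None or score < best_score:
                best_index, best_score = i, score
        selected.append(best_index)
    return selected


# Toy per-sample gradients (hypothetical two-parameter model).
gradients = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [-1.0, 0.0]]
buffer_indices = select_diverse(gradients, budget=3)
```

Here the near-duplicate gradient at index 1 is left out of the buffer, since it adds little beyond the sample at index 0.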

2018-12-31

Neural Information Processing Systems (published)

h-detach: Modifying the LSTM Gradient Towards Better Optimization

Nan Rosemary Ke

Recurrent neural networks are known for their notorious exploding and vanishing gradient problem (EVGP). This problem becomes more evident in tasks where the information needed to correctly solve them exists over long time scales, because EVGP prevents important gradient components from being back-propagated adequately over a large number of steps. We introduce a simple stochastic algorithm (h-detach) that is specific to LSTM optimization and targeted towards addressing this problem. Specifically, we show that when the LSTM weights are large, the gradient components through the linear path (cell state) in the LSTM computational graph get suppressed. Based on the hypothesis that these components carry information about long term dependencies (which we show empirically), their suppression can prevent LSTMs from capturing them. Our algorithm prevents gradients flowing through this path from getting suppressed, thus allowing the LSTM to capture such dependencies better. We show significant improvements over vanilla LSTM gradient based training in terms of convergence speed, robustness to seed and learning rate, and generalization using our modification of LSTM gradient on various benchmark datasets.

2018-12-31

ICLR.cc/2019/Conference (poster)

openreview.net

Hindsight Credit Assignment

Anna Harutyunyan

Will Dabney

Thomas Mesnard

Mohammad Gheshlaghi Azar

Bilal Piot

Nicolas Heess

Hado van Hasselt

Greg Wayne

Satinder Singh

Remi Munos

2018-12-31

Advances in Neural Information Processing Systems 32 (NeurIPS 2019) (published)

How to Initialize your Network? Robust Initialization for WeightNorm & ResNets

Devansh Arpit

Víctor Campos

Yoshua Bengio

Residual networks (ResNet) and weight normalization play an important role in various deep learning applications. However, parameter initialization strategies have not been studied previously for weight normalized networks and, in practice, initialization methods designed for un-normalized networks are used as a proxy. Similarly, initialization for ResNets has also been studied for un-normalized networks and often under simplified settings ignoring the shortcut connection. To address these issues, we propose a novel parameter initialization strategy that avoids explosion/vanishing of information across layers for weight normalized networks with and without residual connections. The proposed strategy is based on a theoretical analysis using mean field approximation. We run over 2,500 experiments and evaluate our proposal on image datasets, showing that the proposed initialization outperforms existing initialization methods in terms of generalization performance, robustness to hyper-parameter values and variance between seeds, especially when networks get deeper, in which case existing methods fail to even start training. Finally, we show that using our initialization in conjunction with learning rate warmup is able to reduce the gap between the performance of weight normalized and batch normalized networks.

2018-12-31

NeurIPS (published)