Publications

Break the Ceiling: Stronger Multi-scale Deep Graph Convolutional Networks
Mingde Zhao
Xiao-Wen Chang
Recently, neural network based approaches have achieved significant improvement for solving large, complex, graph-structured problems. Howev… (voir plus)er, their bottlenecks still need to be addressed, and the advantages of multi-scale information and deep architectures have not been sufficiently exploited. In this paper, we theoretically analyze how existing Graph Convolutional Networks (GCNs) have limited expressive power due to the constraint of the activation functions and their architectures. We generalize spectral graph convolution and deep GCN in block Krylov subspace forms and devise two architectures, both with the potential to be scaled deeper but each making use of the multi-scale information in different ways. We further show that the equivalence of these two architectures can be established under certain conditions. On several node classification tasks, with or without the help of validation, the two new architectures achieve better performance compared to many state-of-the-art methods.
Community size effect in artificial learning systems
Olivier Tieleman
Angeliki Lazaridou
Shibl Mourad
Charles Blundell
Motivated by theories of language and communication that explain why communities with large numbers of speakers have, on average, simpler la… (voir plus)nguages with more regularity, we cast the representation learning problem in terms of learning to communicate . Our starting point sees the traditional autoencoder setup as a single encoder with a fixed decoder partner that must learn to communicate. Generalizing from there, we introduce community -based autoencoders in which multiple encoders and decoders collectively learn representations by being randomly paired up on successive training iterations. We find that increasing community sizes reduce idiosyncrasies in the learned codes, resulting in representations that better encode concept categories and correlate with human feature norms.
Connecting Weighted Automata and Recurrent Neural Networks through Spectral Learning ( Supplementary Material ) A Proofs
More precisely, the WFA A = (α, {A}σ∈Σ,Ω) with n states and the linear 2-RNN M = (α,A,Ω) with n hidden units, where A ∈ Rn×Σ×n … (voir plus)is defined by A:,σ,: = A for all σ ∈ Σ, are such that fA(σ1σ2 · · ·σk) = fM (x1,x2, · · · ,xk) for all sequences of input symbols σ1, · · · , σk ∈ Σ, where for each i ∈ [k] the input vector xi ∈ RΣ is the one-hot encoding of the symbol σi. Proof. We first show by induction on k that, for any sequence σ1 · · ·σk ∈ Σ∗, the hidden state hk computed by M (see Eq. (1)) on the corresponding one-hot encoded sequence x1, · · · ,xk ∈ R satisfies hk = (A1 · · ·Ak )>α. The case k = 0 is immediate. Suppose the result true for sequences of length up to k. One can check easily check that A •2 xi = Ai for any index i. Using the induction hypothesis it then follows that hk+1 = A •1 hk •2 xk+1 = Ak+1 •1 hk = (Ak+1)hk = (Aσk+1)>(Aσ1 · · ·Ak )>α = (A1 · · ·Aσk+1)>α.
Data-driven Chance Constrained Programming based Electric Vehicle Penetration Analysis
Tracy Can Cui
Benoit Boulet
Transportation electrification has been growing rapidly in recent years. The adoption of electric vehicles (EVs) could help to release the d… (voir plus)ependency on oil and reduce greenhouse gas emission. However, the increasing EV adoption will also impose a high demand on the power grid and may jeopardize the grid network infrastructures. For certain high EV penetration areas, the EV charging demand may lead to transformer overloading at peak hours which makes the maximal EV penetration analysis an urgent problem to solve. This paper proposes a data-driven chance constrained programming based framework for maximal EV penetration analysis. Simulation results are presented for a real-world neighborhood level network. The proposed framework could serve as a guidance for utility companies to schedule infrastructure upgrades.
An Empirical Study of Batch Normalization and Group Normalization in Conditional Computation
Batch normalization has been widely used to improve optimization in deep neural networks. While the uncertainty in batch statistics can act … (voir plus)as a regularizer, using these dataset statistics specific to the training set impairs generalization in certain tasks. Recently, alternative methods for normalizing feature activations in neural networks have been proposed. Among them, group normalization has been shown to yield similar, in some domains even superior performance to batch normalization. All these methods utilize a learned affine transformation after the normalization operation to increase representational power. Methods used in conditional computation define the parameters of these transformations as learnable functions of conditioning information. In this work, we study whether and where the conditional formulation of group normalization can improve generalization compared to conditional batch normalization. We evaluate performances on the tasks of visual question answering, few-shot learning, and conditional image generation.
An Empirical Study of Example Forgetting During Deep Neural Network Learning
Mariya Toneva
Remi Tachet des Combes
Adam Trischler
Geoffrey J. Gordon
Inspired by the phenomenon of catastrophic forgetting, we investigate the learning dynamics of neural networks as they train on single class… (voir plus)ification tasks. Our goal is to understand whether a related phenomenon occurs when data does not undergo a clear distributional shift. We define a `forgetting event' to have occurred when an individual training example transitions from being classified correctly to incorrectly over the course of learning. Across several benchmark data sets, we find that: (i) certain examples are forgotten with high frequency, and some not at all; (ii) a data set's (un)forgettable examples generalize across neural architectures; and (iii) based on forgetting dynamics, a significant fraction of examples can be omitted from the training data set while still maintaining state-of-the-art generalization performance.
A Geometric Perspective on Optimal Representations for Reinforcement Learning
Bellemare Marc-Emmanuel
Will Dabney
Robert Dadashi
Nicolas Roux
Dale Schuurmans
Tor Lattimore
Clare Lyle
Gradient based sample selection for online continual learning
A continual learning agent learns online with a non-stationary and never-ending stream of data. The key to such learning process is to overc… (voir plus)ome the catastrophic forgetting of previously seen data, which is a well known problem of neural networks. To prevent forgetting, a replay buffer is usually employed to store the previous data for the purpose of rehearsal. Previous works often depend on task boundary and i.i.d. assumptions to properly select samples for the replay buffer. In this work, we formulate sample selection as a constraint reduction problem based on the constrained optimization view of continual learning. The goal is to select a fixed subset of constraints that best approximate the feasible region defined by the original constraints. We show that it is equivalent to maximizing the diversity of samples in the replay buffer with parameters gradient as the feature. We further develop a greedy alternative that is cheap and efficient. The advantage of the proposed method is demonstrated by comparing to other alternatives under the continual learning setting. Further comparisons are made against state of the art methods that rely on task boundaries which show comparable or even better results for our method.
h-detach: Modifying the LSTM Gradient Towards Better Optimization
Recurrent neural networks are known for their notorious exploding and vanishing gradient problem (EVGP). This problem becomes more evident i… (voir plus)n tasks where the information needed to correctly solve them exist over long time scales, because EVGP prevents important gradient components from being back-propagated adequately over a large number of steps. We introduce a simple stochastic algorithm (\textit{h}-detach) that is specific to LSTM optimization and targeted towards addressing this problem. Specifically, we show that when the LSTM weights are large, the gradient components through the linear path (cell state) in the LSTM computational graph get suppressed. Based on the hypothesis that these components carry information about long term dependencies (which we show empirically), their suppression can prevent LSTMs from capturing them. Our algorithm\footnote{Our code is available at this https URL.} prevents gradients flowing through this path from getting suppressed, thus allowing the LSTM to capture such dependencies better. We show significant improvements over vanilla LSTM gradient based training in terms of convergence speed, robustness to seed and learning rate, and generalization using our modification of LSTM gradient on various benchmark datasets.
Hindsight Credit Assignment
Anna Harutyunyan
Will Dabney
Mohammad Gheshlaghi Azar
Bilal Piot
Nicolas Heess
Hado van Hasselt
Greg Wayne
Satinder Singh
Remi Munos
How to Initialize your Network? Robust Initialization for WeightNorm & ResNets
Vı́ctor Campos
Residual networks (ResNet) and weight normalization play an important role in various deep learning applications. However, parameter initial… (voir plus)ization strategies have not been studied previously for weight normalized networks and, in practice, initialization methods designed for un-normalized networks are used as a proxy. Similarly, initialization for ResNets have also been studied for un-normalized networks and often under simplified settings ignoring the shortcut connection. To address these issues, we propose a novel parameter initialization strategy that avoids explosion/vanishment of information across layers for weight normalized networks with and without residual connections. The proposed strategy is based on a theoretical analysis using mean field approximation. We run over 2,500 experiments and evaluate our proposal on image datasets showing that the proposed initialization outperforms existing initialization methods in terms of generalization performance, robustness to hyper-parameter values and variance between seeds, especially when networks get deeper in which case existing methods fail to even start training. Finally, we show that using our initialization in conjunction with learning rate warmup is able to reduce the gap between the performance of weight normalized and batch normalized networks.
Improving advance medical directives: lessons from Quebec
Louise G. Bernier
Policy-makers’ efforts to increase the uptake of advance medical directives (AMDs), and the legal constraints they impose on health profes… (voir plus)sionals, are bringing greater scrutiny to provincial AMD regimes. In 2015, Quebec introduced a new, legally binding form to be filled out for AMDs, which limits individuals’ expression of their wishes to narrow, checklist responses to questions on specific medical interventions. This form-focused regime has other shortcomings: it relies on individuals to self-inform and it does not provide them the opportunity to meaningfully convey their preferences for end-of-life care. A more values-based and collaborative approach provides a better path forward for Quebec and for other provinces.