Alex Lamb

Factorizing Declarative and Procedural Knowledge in Structured, Dynamical Environments

Anirudh Goyal

Phanideep Gampa

Philippe Beaudoin

Charles Blundell

Sergey Levine

Michael Mozer

2020-12-31

ICLR (published)

Object Files and Schemata: Factorizing Declarative and Procedural Knowledge in Dynamical Systems

Anirudh Goyal

Phanideep Gampa

Philippe Beaudoin

Sergey Levine

Charles Blundell

Michael Curtis Mozer

2020-06-28

ArXiv (preprint)

arxiv.org

Discrete-Valued Neural Communication in Structured Architectures Enhances Generalization

Dianbo Liu

Kenji Kawaguchi

Anirudh Goyal

Chen Sun

Michael C. Mozer

Deep learning has advanced from fully connected architectures to structured models organized into components, e.g., the transformer composed of positional elements, modular architectures divided into slots, and graph neural nets made up of nodes. In structured models, an interesting question is how to conduct dynamic and possibly sparse communication among the separate components. Here, we explore the hypothesis that restricting the transmitted information among components to discrete representations is a beneficial bottleneck. The motivating intuition is human language in which communication occurs through discrete symbols. Even though individuals have different understandings of what a "cat" is based on their specific experiences, the shared discrete token makes it possible for communication among individuals to be unimpeded by individual differences in internal representation. To discretize the values of concepts dynamically communicated among specialist components, we extend the quantization mechanism from the Vector-Quantized Variational Autoencoder to multi-headed discretization with shared codebooks and use it for discrete-valued neural communication (DVNC). Our experiments show that DVNC substantially improves systematic generalization in a variety of architectures: transformers, modular architectures, and graph neural networks. We also show that the DVNC is robust to the choice of hyperparameters, making the method very useful in practice. Moreover, we establish a theoretical justification of our discretization process, proving that it has the ability to increase noise robustness and reduce the underlying dimensionality of the model.
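The multi-headed discretization named in this abstract can be illustrated with a small sketch. It assumes the message vector is split into equal-length segments ("heads") and that each segment is replaced by its nearest entry in one codebook shared by all heads. The function names, the toy codebook, and the use of squared Euclidean distance are illustrative assumptions, not the authors' implementation. Only the forward lookup is shown; the training machinery of a VQ-VAE (straight-through gradients, codebook and commitment losses) is omitted.

```python
from typing import List, Sequence, Tuple


def nearest_code(segment: Sequence[float],
                 codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the codebook entry closest to `segment`
    under squared Euclidean distance."""
    def squared_distance(code: Sequence[float]) -> float:
        return sum((s - c) ** 2 for s, c in zip(segment, code))

    return min(range(len(codebook)), key=lambda i: squared_distance(codebook[i]))


def discretize(vector: Sequence[float],
               codebook: Sequence[Sequence[float]],
               num_heads: int) -> Tuple[List[int], List[float]]:
    """Split `vector` into `num_heads` equal segments and quantize each
    segment against the same (shared) codebook.

    Returns the chosen code indices and the concatenated quantized vector.
    """
    if len(vector) % num_heads != 0:
        raise ValueError("vector length must be divisible by num_heads")
    segment_len = len(vector) // num_heads
    if any(len(code) != segment_len for code in codebook):
        raise ValueError("every code must have the segment length")

    indices: List[int] = []
    quantized: List[float] = []
    for head in range(num_heads):
        segment = vector[head * segment_len:(head + 1) * segment_len]
        index = nearest_code(segment, codebook)
        indices.append(index)
        quantized.extend(codebook[index])
    return indices, quantized


# Hypothetical toy codebook with three 2-dimensional codes.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
indices, quantized = discretize([0.9, 1.2, -0.1, 0.2], codebook, num_heads=2)
print(indices, quantized)  # [1, 0] [1.0, 1.0, 0.0, 0.0]
```

With two heads and three codes, a message can take at most 3 × 3 = 9 distinct values, which is the sense in which discretization acts as a bottleneck in this sketch.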

2019-12-31

International Conference on Machine Learning (published)

GraphMix: Improved Training of Graph Neural Networks for Semi-Supervised Learning

Juho Kannala

We present GraphMix, a regularized training scheme for Graph Neural Network based semi-supervised object classification, leveraging the recent advances in the regularization of classical deep neural networks. Specifically, we propose a unified approach in which we train a fully-connected network jointly with the graph neural network via parameter sharing, interpolation-based regularization and self-predicted-targets. Our proposed method is architecture agnostic in the sense that it can be applied to any variant of graph neural networks which applies a parametric transformation to the features of the graph nodes. Despite its simplicity, with GraphMix we can consistently improve results and achieve or closely match state-of-the-art performance using even simpler architectures such as Graph Convolutional Networks, across three established graph benchmarks: Cora, Citeseer and Pubmed citation network datasets, as well as three newly proposed datasets: Cora-Full, Co-author-CS and Co-author-Physics.

2019-12-31

(published)

www.semanticscholar.org

Learning to Combine Top-Down and Bottom-Up Signals in Recurrent Neural Networks with Attention over Modules

Murray Shanahan

Michael Mozer

Robust perception relies on both bottom-up and top-down signals. Bottom-up signals consist of what's directly observed through sensation. Top-down signals consist of beliefs and expectations based on past experience and short-term memory, such as how the phrase 'peanut butter and ...' will be completed. The optimal combination of bottom-up and top-down information remains an open question, but the manner of combination must be dynamic and both context and task dependent. To effectively utilize the wealth of potential top-down information available, and to prevent the cacophony of intermixed signals in a bidirectional architecture, mechanisms are needed to restrict information flow. We explore deep recurrent neural net architectures in which bottom-up and top-down signals are dynamically combined using attention. Modularity of the architecture further restricts the sharing and communication of information. Together, attention and modularity direct information flow, which leads to reliable performance improvements in perceptual and language tasks, and in particular improves robustness to distractions and noisy data. We demonstrate on a variety of benchmarks in language modeling, sequential image classification, video prediction and reinforcement learning that the bidirectional information flow can improve results over strong baselines.

2019-12-31

ICML (published)

proceedings.mlr.press

Interpolation Consistency Training for Semi-Supervised Learning

Juho Kannala

David Lopez-Paz

Arno Solin

2019-08-09

Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (published)

arxiv.org

State-Reification Networks: Improving Generalization by Modeling the Distribution of Hidden Representations

Denis Kazakov

Michael C. Mozer

Machine learning promises methods that generalize well from finite labeled data. However, the brittleness of existing neural net approaches is revealed by notable failures, such as the existence of adversarial examples that are misclassified despite being nearly identical to a training example, or the inability of recurrent sequence-processing nets to stay on track without teacher forcing. We introduce a method, which we refer to as state reification, that involves modeling the distribution of hidden states over the training data and then projecting hidden states observed during testing toward this distribution. Our intuition is that if the network can remain in a familiar manifold of hidden space, subsequent layers of the net should be well trained to respond appropriately. We show that this state-reification method helps neural nets to generalize better, especially when labeled data are sparse, and also helps overcome the challenge of achieving robust generalization with adversarial training.

2019-05-23

International Conference on Machine Learning (unknown)

proceedings.mlr.press

On Adversarial Mixup Resynthesis

R Devon Hjelm

Christopher Pal

In this paper, we explore new approaches to combining information encoded within the learned representations of auto-encoders. We explore models that are capable of combining the attributes of multiple inputs such that a resynthesised output is trained to fool an adversarial discriminator for real versus synthesised data. Furthermore, we explore the use of such an architecture in the context of semi-supervised learning, where we learn a mixing function whose objective is to produce interpolations of hidden states, or masked combinations of latent representations that are consistent with a conditioned class label. We show quantitative and qualitative evidence that such a formulation is an interesting avenue of research.

2018-12-31

NeurIPS (published)

dblp.uni-trier.de

Adversarial Mixup Resynthesizers

R Devon Hjelm

Christopher Pal

In this paper, we explore new approaches to combining information encoded within the learned representations of autoencoders. We explore models that are capable of combining the attributes of multiple inputs such that a resynthesised output is trained to fool an adversarial discriminator for real versus synthesised data. Furthermore, we explore the use of such an architecture in the context of semi-supervised learning, where we learn a mixing function whose objective is to produce interpolations of hidden states, or masked combinations of latent representations that are consistent with a conditioned class label. We show quantitative and qualitative evidence that such a formulation is an interesting avenue of research.

2018-12-31

DGS@ICLR (published)

Interpolated Adversarial Training: Achieving Robust Neural Networks without Sacrificing Accuracy

Vikas Verma

Juho Kannala

Adversarial robustness has become a central goal in deep learning, both in theory and practice. However, successful methods to improve adversarial robustness (such as adversarial training) greatly hurt generalization performance on the clean data. This could have a major impact on how adversarial robustness affects real world systems (i.e. many may opt to forego robustness if it can improve performance on the clean data). We propose Interpolated Adversarial Training, which employs recently proposed interpolation based training methods in the framework of adversarial training. On CIFAR-10, adversarial training increases clean test error from 5.8% to 16.7%, whereas with our Interpolated adversarial training we retain adversarial robustness while achieving a clean test error of only 6.5%. With our technique, the relative error increase for the robust model is reduced from 187.9% to just 12.1%.
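The two relative figures quoted at the end of this abstract follow from the three clean test errors it reports. The short check below recomputes them, assuming "relative error increase" means the increase over the 5.8% baseline divided by that baseline. The function name is illustrative.

```python
def relative_increase(baseline: float, value: float) -> float:
    """Percentage increase of `value` over `baseline`."""
    return (value - baseline) / baseline * 100.0


baseline_error = 5.8        # clean test error without adversarial training (%)
adversarial_error = 16.7    # clean test error with adversarial training (%)
interpolated_error = 6.5    # clean test error with Interpolated Adversarial Training (%)

print(round(relative_increase(baseline_error, adversarial_error), 1))   # 187.9
print(round(relative_increase(baseline_error, interpolated_error), 1))  # 12.1
```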

2018-12-31

arXiv.org (preprint)

dblp.uni-trier.de

Manifold Mixup: Better Representations by Interpolating Hidden States

Amir Najafi

David Lopez-Paz

Deep neural networks excel at learning the training data, but often provide incorrect and confident predictions when evaluated on slightly different test examples. This includes distribution shifts, outliers, and adversarial examples. To address these issues, we propose Manifold Mixup, a simple regularizer that encourages neural networks to predict less confidently on interpolations of hidden representations. Manifold Mixup leverages semantic interpolations as additional training signal, obtaining neural networks with smoother decision boundaries at multiple levels of representation. As a result, neural networks trained with Manifold Mixup learn class-representations with fewer directions of variance. We prove theory on why this flattening happens under ideal conditions, validate it on practical situations, and connect it to previous works on information theory and generalization. In spite of incurring no significant computation and being implemented in a few lines of code, Manifold Mixup improves strong baselines in supervised learning, robustness to single-step adversarial attacks, and test log-likelihood.
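The core operation described in this abstract, interpolating hidden representations and training on correspondingly softened targets, can be sketched in a few lines. The sketch assumes the usual mixup convention of a mixing coefficient drawn from a Beta(alpha, alpha) distribution and one-hot labels mixed with the same coefficient. The function names and the default `alpha` are illustrative assumptions, and the surrounding network, the choice of layer, and the loss are omitted.

```python
import random
from typing import List, Optional, Sequence, Tuple


def mix(a: Sequence[float], b: Sequence[float], lam: float) -> List[float]:
    """Convex combination lam * a + (1 - lam) * b, element by element."""
    return [lam * x + (1.0 - lam) * y for x, y in zip(a, b)]


def manifold_mixup_pair(hidden_a: Sequence[float],
                        hidden_b: Sequence[float],
                        label_a: Sequence[float],
                        label_b: Sequence[float],
                        alpha: float = 2.0,
                        rng: Optional[random.Random] = None
                        ) -> Tuple[List[float], List[float], float]:
    """Interpolate two hidden states and their one-hot labels with a
    shared coefficient lam ~ Beta(alpha, alpha)."""
    rng = rng or random.Random()
    lam = rng.betavariate(alpha, alpha)
    return mix(hidden_a, hidden_b, lam), mix(label_a, label_b, lam), lam


# Hypothetical hidden states of two examples from different classes.
hidden, target, lam = manifold_mixup_pair(
    [0.2, 1.5, -0.3], [1.0, 0.1, 0.4],
    [1.0, 0.0], [0.0, 1.0],
    rng=random.Random(0),
)
print(lam, hidden, target)
```

Because the target is mixed with the same coefficient as the hidden state, a point halfway between two classes is trained toward a 50/50 prediction, which is one way to read "predict less confidently on interpolations".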

2018-06-12

arXiv (preprint)