Anirudh Goyal

S2RMs: Spatially Structured Recurrent Modules

Nasim Rahaman

Muhammad Waleed Gondal

Manuel Wüthrich

Stefan Bauer

Y. Sharma

Bernhard Schölkopf

2020-07-13

ArXiv (prépublication)

Object Files and Schemata: Factorizing Declarative and Procedural Knowledge in Dynamical Systems

Alex Lamb

Phanideep Gampa

Philippe Beaudoin

Sergey Levine

Charles Blundell

Michael Curtis Mozer

2020-06-29

ArXiv (prépublication)

Untangling tradeoffs between recurrence and self-attention in neural networks

Attention and self-attention mechanisms, inspired by cognitive processes, are now central to state-of-the-art deep learning on sequential ta… (voir plus)sks. However, most recent progress hinges on heuristic approaches with limited understanding of attention's role in model optimization and computation, and rely on considerable memory and computational resources that scale poorly. In this work, we present a formal analysis of how self-attention affects gradient propagation in recurrent networks, and prove that it mitigates the problem of vanishing gradients when trying to capture long-term dependencies. Building on these results, we propose a relevancy screening mechanism, inspired by the cognitive process of memory consolidation, that allows for a scalable use of sparse self-attention with recurrence. While providing guarantees to avoid vanishing gradients, we use simple numerical experiments to demonstrate the tradeoffs in performance and computational resources by efficiently balancing attention and recurrence. Based on our results, we propose a concrete direction of research to improve scalability of attentive networks.

2020-06-16

ArXiv (prépublication)

Learning to Combine Top-Down and Bottom-Up Signals in Recurrent Neural Networks with Attention over Modules

Murray P. Shanahan

Michael Curtis Mozer

2020-01-01

ICML (publié)

proceedings.mlr.press

Learning Long-term Dependencies Using Cognitive Inductive Biases in Self-attention RNNs

Attention and self-attention mechanisms, inspired by cognitive processes, are now central to state-of-the-art deep learning on sequential ta… (voir plus)sks. However, most recent progress hinges on heuristic approaches that rely on considerable memory and computational resources that scale poorly. In this work, we propose a relevancy screening mechanism, inspired by the cognitive process of memory consolidation, that allows for a scalable use of sparse self-attention with recurrence. We use simple numerical experiments to demonstrate that this mechanism helps enable recurrent systems on generalization and transfer learning tasks. Based on our results, we propose a concrete direction of research to improve scalability and generalization of attentive recurrent networks.

Learning the Arrow of Time for Problems in Reinforcement Learning.

Nasim Rahaman

Steffen Wolf

Roman Remme

2020-01-01

ICLR (publié)

Meta Attention Networks: Meta Learning Attention To Modulate Information Between Sparsely Interacting Recurrent Modules

Kanika Madan

Nan Rosemary Ke

Decomposing knowledge into interchangeable pieces promises a generalization advantage when, at some level of representation, the learner is … (voir plus)likely to be faced with situations requiring novel combinations of existing pieces of knowledge or computation. We hypothesize that such a decomposition of knowledge is particularly relevant for higher levels of representation as we see this at work in human cognition and natural language in the form of systematicity or systematic generalization. To study these ideas, we propose a particular training framework in which we assume that the pieces of knowledge an agent needs, as well as its reward function are stationary and can be re-used across tasks and changes in distribution. As the learner is confronted with variations in experiences, the attention selects which modules should be adapted and the parameters of those selected modules are adapted fast, while the parameters of attention mechanisms are updated slowly as meta-parameters. We ﬁnd that both the meta-learning and the modular aspects of the proposed system greatly help achieve faster learning in experiments with reinforcement learning setup involving navigation in a partially observed grid world.

A Meta-Transfer Objective for Learning to Disentangle Causal Mechanisms

Tristan Deleu

Nasim Rahaman

Nan Rosemary Ke

We propose to meta-learn causal structures based on how fast a learner adapts to new distributions arising from sparse distributional change… (voir plus)s, e.g. due to interventions, actions of agents and other sources of non-stationarities. We show that under this assumption, the correct causal structural choices lead to faster adaptation to modified distributions because the changes are concentrated in one or just a few mechanisms when the learned knowledge is modularized appropriately. This leads to sparse expected gradients and a lower effective number of degrees of freedom needing to be relearned while adapting to the change. It motivates using the speed of adaptation to a modified distribution as a meta-learning objective. We demonstrate how this can be used to determine the cause-effect relationship between two observed variables. The distributional changes do not need to correspond to standard interventions (clamping a variable), and the learner has no direct knowledge of these interventions. We show that causal structures can be parameterized via continuous variables and learned end-to-end. We then explore how these ideas could be used to also learn an encoder that would map low-level observed variables to unobserved causal variables leading to faster adaptation out-of-distribution, learning a representation space where one can satisfy the assumptions of independent mechanisms and of small and sparse changes in these mechanisms due to actions and non-stationarities.

2020-01-01

ICLR (publié)

Reinforcement Learning with Competitive Ensembles of Information-Constrained Primitives

Shagun Sodhani

Jonathan Binas

Xue Bin Peng

Sergey Levine

Reinforcement learning agents that operate in diverse and complex environments can benefit from the structured decomposition of their behavi… (voir plus)or. Often, this is addressed in the context of hierarchical reinforcement learning, where the aim is to decompose a policy into lower-level primitives or options, and a higher-level meta-policy that triggers the appropriate behaviors for a given situation. However, the meta-policy must still produce appropriate decisions in all states. In this work, we propose a policy design that decomposes into primitives, similarly to hierarchical reinforcement learning, but without a high-level meta-policy. Instead, each primitive can decide for themselves whether they wish to act in the current state. We use an information-theoretic mechanism for enabling this decentralized decision: each primitive chooses how much information it needs about the current state to make a decision and the primitive that requests the most information about the current state acts in the world. The primitives are regularized to use as little information as possible, which leads to natural competition and specialization. We experimentally demonstrate that this policy architecture improves over both flat and hierarchical policies in terms of generalization.

2020-01-01

ICLR.cc/2020/Conference (poster)

Small-GAN: Speeding Up GAN Training Using Core-sets

Samarth Sinha

Han Zhang

Hugo Larochelle

Augustus Odena

Recent work by Brock et al. (2018) suggests that Generative Adversarial Networks (GANs) benefit disproportionately from large mini-batch siz… (voir plus)es. Unfortunately, using large batches is slow and expensive on conventional hardware. Thus, it would be nice if we could generate batches that were effectively large though actually small. In this work, we propose a method to do this, inspired by the use of Coreset-selection in active learning. When training a GAN, we draw a large batch of samples from the prior and then compress that batch using Coreset-selection. To create effectively large batches of 'real' images, we create a cached dataset of Inception activations of each training image, randomly project them down to a smaller dimension, and then use Coreset-selection on those projected activations at training time. We conduct experiments showing that this technique substantially reduces training time and memory usage for modern GAN variants, that it reduces the fraction of dropped modes in a synthetic dataset, and that it allows GANs to reach a new state of the art in anomaly detection.

2020-01-01

ICML (publié)

proceedings.mlr.press

The Variational Bandwidth Bottleneck: Stochastic Evaluation on an Information Budget

Matthew Botvinick

Sergey Levine

In many applications, it is desirable to extract only the relevant information from complex input data, which involves making a decision abo… (voir plus)ut which input features are relevant. The information bottleneck method formalizes this as an information-theoretic optimization problem by maintaining an optimal tradeoff between compression (throwing away irrelevant input information), and predicting the target. In many problem settings, including the reinforcement learning problems we consider in this work, we might prefer to compress only part of the input. This is typically the case when we have a standard conditioning input, such as a state observation, and a ``privileged'' input, which might correspond to the goal of a task, the output of a costly planning algorithm, or communication with another agent. In such cases, we might prefer to compress the privileged input, either to achieve better generalization (e.g., with respect to goals) or to minimize access to costly information (e.g., in the case of communication). Practical implementations of the information bottleneck based on variational inference require access to the privileged input in order to compute the bottleneck variable, so although they perform compression, this compression operation itself needs unrestricted, lossless access. In this work, we propose the variational bandwidth bottleneck, which decides for each example on the estimated value of the privileged information before seeing it, i.e., only based on the standard input, and then accordingly chooses stochastically, whether to access the privileged input or not. We formulate a tractable approximation to this framework and demonstrate in a series of reinforcement learning experiments that it can improve generalization and reduce access to computationally costly information.

2020-01-01

ICLR (publié)