Portrait de Anirudh Goyal n'est pas disponible

Anirudh Goyal

Alumni

Publications

Neural Function Modules with Sparse Arguments: A Dynamic Approach to Integrating Information across Layers
A. Slowik
Michael Curtis Mozer
Philippe Beaudoin
Feed-forward neural networks consist of a sequence of layers, in which each layer performs some processing on the information from the previ… (voir plus)ous layer. A downside to this approach is that each layer (or module, as multiple modules can operate in parallel) is tasked with processing the entire hidden state, rather than a particular part of the state which is most relevant for that module. Methods which only operate on a small number of input variables are an essential part of most programming languages, and they allow for improved modularity and code re-usability. Our proposed method, Neural Function Modules (NFM), aims to introduce the same structural capability into deep learning. Most of the work in the context of feed-forward networks combining top-down and bottom-up feedback is limited to classification problems. The key contribution of our work is to combine attention, sparsity, top-down and bottom-up feedback, in a flexible algorithm which, as we show, improves the results in standard classification, out-of-domain generalization, generative modeling, and learning representations in the context of reinforcement learning.
S2RMs: Spatially Structured Recurrent Modules
Nasim Rahaman
Muhammad Waleed Gondal
Manuel Wüthrich
Y. Sharma
Bernhard Schölkopf
Object Files and Schemata: Factorizing Declarative and Procedural Knowledge in Dynamical Systems
Philippe Beaudoin
Sergey Levine
Charles Blundell
Michael Curtis Mozer
Untangling tradeoffs between recurrence and self-attention in neural networks
Attention and self-attention mechanisms, inspired by cognitive processes, are now central to state-of-the-art deep learning on sequential ta… (voir plus)sks. However, most recent progress hinges on heuristic approaches with limited understanding of attention's role in model optimization and computation, and rely on considerable memory and computational resources that scale poorly. In this work, we present a formal analysis of how self-attention affects gradient propagation in recurrent networks, and prove that it mitigates the problem of vanishing gradients when trying to capture long-term dependencies. Building on these results, we propose a relevancy screening mechanism, inspired by the cognitive process of memory consolidation, that allows for a scalable use of sparse self-attention with recurrence. While providing guarantees to avoid vanishing gradients, we use simple numerical experiments to demonstrate the tradeoffs in performance and computational resources by efficiently balancing attention and recurrence. Based on our results, we propose a concrete direction of research to improve scalability of attentive networks.
Learning to Combine Top-Down and Bottom-Up Signals in Recurrent Neural Networks with Attention over Modules
Learning Long-term Dependencies Using Cognitive Inductive Biases in Self-attention RNNs
Attention and self-attention mechanisms, inspired by cognitive processes, are now central to state-of-the-art deep learning on sequential ta… (voir plus)sks. However, most recent progress hinges on heuristic approaches that rely on considerable memory and computational resources that scale poorly. In this work, we propose a relevancy screening mechanism, inspired by the cognitive process of memory consolidation, that allows for a scalable use of sparse self-attention with recurrence. We use simple numerical experiments to demonstrate that this mechanism helps enable recurrent systems on generalization and transfer learning tasks. Based on our results, we propose a concrete direction of research to improve scalability and generalization of attentive recurrent networks.
Learning the Arrow of Time for Problems in Reinforcement Learning.
Nasim Rahaman
Steffen Wolf
Roman Remme
Meta Attention Networks: Meta Learning Attention To Modulate Information Between Sparsely Interacting Recurrent Modules
Decomposing knowledge into interchangeable pieces promises a generalization advantage when, at some level of representation, the learner is … (voir plus)likely to be faced with situations requiring novel combinations of existing pieces of knowledge or computation. We hypothesize that such a decomposition of knowledge is particularly relevant for higher levels of representation as we see this at work in human cognition and natural language in the form of systematicity or systematic generalization. To study these ideas, we propose a particular training framework in which we assume that the pieces of knowledge an agent needs, as well as its reward function are stationary and can be re-used across tasks and changes in distribution. As the learner is confronted with variations in experiences, the attention selects which modules should be adapted and the parameters of those selected modules are adapted fast, while the parameters of attention mechanisms are updated slowly as meta-parameters. We find that both the meta-learning and the modular aspects of the proposed system greatly help achieve faster learning in experiments with reinforcement learning setup involving navigation in a partially observed grid world.
A Meta-Transfer Objective for Learning to Disentangle Causal Mechanisms
We propose to meta-learn causal structures based on how fast a learner adapts to new distributions arising from sparse distributional change… (voir plus)s, e.g. due to interventions, actions of agents and other sources of non-stationarities. We show that under this assumption, the correct causal structural choices lead to faster adaptation to modified distributions because the changes are concentrated in one or just a few mechanisms when the learned knowledge is modularized appropriately. This leads to sparse expected gradients and a lower effective number of degrees of freedom needing to be relearned while adapting to the change. It motivates using the speed of adaptation to a modified distribution as a meta-learning objective. We demonstrate how this can be used to determine the cause-effect relationship between two observed variables. The distributional changes do not need to correspond to standard interventions (clamping a variable), and the learner has no direct knowledge of these interventions. We show that causal structures can be parameterized via continuous variables and learned end-to-end. We then explore how these ideas could be used to also learn an encoder that would map low-level observed variables to unobserved causal variables leading to faster adaptation out-of-distribution, learning a representation space where one can satisfy the assumptions of independent mechanisms and of small and sparse changes in these mechanisms due to actions and non-stationarities.
Reinforcement Learning with Competitive Ensembles of Information-Constrained Primitives
Reinforcement learning agents that operate in diverse and complex environments can benefit from the structured decomposition of their behavi… (voir plus)or. Often, this is addressed in the context of hierarchical reinforcement learning, where the aim is to decompose a policy into lower-level primitives or options, and a higher-level meta-policy that triggers the appropriate behaviors for a given situation. However, the meta-policy must still produce appropriate decisions in all states. In this work, we propose a policy design that decomposes into primitives, similarly to hierarchical reinforcement learning, but without a high-level meta-policy. Instead, each primitive can decide for themselves whether they wish to act in the current state. We use an information-theoretic mechanism for enabling this decentralized decision: each primitive chooses how much information it needs about the current state to make a decision and the primitive that requests the most information about the current state acts in the world. The primitives are regularized to use as little information as possible, which leads to natural competition and specialization. We experimentally demonstrate that this policy architecture improves over both flat and hierarchical policies in terms of generalization.
Small-GAN: Speeding Up GAN Training Using Core-sets
Samarth Sinha
Han Zhang
Augustus Odena
Recent work by Brock et al. (2018) suggests that Generative Adversarial Networks (GANs) benefit disproportionately from large mini-batch siz… (voir plus)es. Unfortunately, using large batches is slow and expensive on conventional hardware. Thus, it would be nice if we could generate batches that were effectively large though actually small. In this work, we propose a method to do this, inspired by the use of Coreset-selection in active learning. When training a GAN, we draw a large batch of samples from the prior and then compress that batch using Coreset-selection. To create effectively large batches of 'real' images, we create a cached dataset of Inception activations of each training image, randomly project them down to a smaller dimension, and then use Coreset-selection on those projected activations at training time. We conduct experiments showing that this technique substantially reduces training time and memory usage for modern GAN variants, that it reduces the fraction of dropped modes in a synthetic dataset, and that it allows GANs to reach a new state of the art in anomaly detection.
The Variational Bandwidth Bottleneck: Stochastic Evaluation on an Information Budget
Matthew Botvinick
Sergey Levine
In many applications, it is desirable to extract only the relevant information from complex input data, which involves making a decision abo… (voir plus)ut which input features are relevant. The information bottleneck method formalizes this as an information-theoretic optimization problem by maintaining an optimal tradeoff between compression (throwing away irrelevant input information), and predicting the target. In many problem settings, including the reinforcement learning problems we consider in this work, we might prefer to compress only part of the input. This is typically the case when we have a standard conditioning input, such as a state observation, and a ``privileged'' input, which might correspond to the goal of a task, the output of a costly planning algorithm, or communication with another agent. In such cases, we might prefer to compress the privileged input, either to achieve better generalization (e.g., with respect to goals) or to minimize access to costly information (e.g., in the case of communication). Practical implementations of the information bottleneck based on variational inference require access to the privileged input in order to compute the bottleneck variable, so although they perform compression, this compression operation itself needs unrestricted, lossless access. In this work, we propose the variational bandwidth bottleneck, which decides for each example on the estimated value of the privileged information before seeing it, i.e., only based on the standard input, and then accordingly chooses stochastically, whether to access the privileged input or not. We formulate a tractable approximation to this framework and demonstrate in a series of reinforcement learning experiments that it can improve generalization and reduce access to computationally costly information.