Anirudh Goyal

Untangling tradeoffs between recurrence and self-attention in artificial neural networks

Learning the Arrow of Time

Nasim Rahaman

Steffen Wolf

Roman Remme

We humans seem to have an innate understanding of the asymmetric progression of time, which we use to efficiently and safely perceive and manipulate our environment. Drawing inspiration from that, we address the problem of learning an arrow of time in a Markov (Decision) Process. We illustrate how a learned arrow of time can capture meaningful information about the environment, which in turn can be used to measure reachability, detect side effects, and obtain an intrinsic reward signal. We show empirical results on a selection of discrete and continuous environments, and demonstrate for a class of stochastic processes that the learned arrow of time agrees reasonably well with a known notion of an arrow of time given by the celebrated Jordan-Kinderlehrer-Otto result.

2019-06-12

OpenReview.net/Anonymous_Preprint (unknown)

openreview.net

Learning Powerful Policies by Using Consistent Dynamics Model

Sergey Levine

Model-based Reinforcement Learning approaches have the promise of being sample efficient. Much of the progress in learning dynamics models in RL has been made by learning models via supervised learning. But traditional model-based approaches lead to "compounding errors" when the model is unrolled step by step. Essentially, the state transitions that the learner predicts (by unrolling the model for multiple steps) and the state transitions that the learner experiences (by acting in the environment) may not be consistent. There is enough evidence that humans build a model of the environment, not only by observing the environment but also by interacting with the environment. Interaction with the environment allows humans to carry out experiments: taking actions that help uncover true causal relationships which can be used for building better dynamics models. Analogously, we would expect such interactions to be helpful for a learning agent while learning to model the environment dynamics. In this paper, we build upon this intuition by using an auxiliary cost function to ensure consistency between what the agent observes (by acting in the real world) and what it imagines (by acting in the "learned" world). We consider several tasks (MuJoCo-based control tasks and Atari games) and show that the proposed approach helps to train powerful policies and better dynamics models.

2019-06-11

ArXiv (preprint)

arxiv.org

State-Reification Networks: Improving Generalization by Modeling the Distribution of Hidden Representations

Denis Kazakov

Michael Curtis Mozer

Machine learning promises methods that generalize well from finite labeled data. However, the brittleness of existing neural net approaches is revealed by notable failures, such as the existence of adversarial examples that are misclassified despite being nearly identical to a training example, or the inability of recurrent sequence-processing nets to stay on track without teacher forcing. We introduce a method, which we refer to as state reification, that involves modeling the distribution of hidden states over the training data and then projecting hidden states observed during testing toward this distribution. Our intuition is that if the network can remain in a familiar manifold of hidden space, subsequent layers of the net should be well trained to respond appropriately. We show that this state-reification method helps neural nets to generalize better, especially when labeled data are sparse, and also helps overcome the challenge of achieving robust generalization with adversarial training.

2019-05-24

Proceedings of the 36th International Conference on Machine Learning (published)

proceedings.mlr.press

arxiv.org

Learning Dynamics Model in Reinforcement Learning by Incorporating the Long Term Future

Nan Rosemary Ke

Amanpreet Singh

Ahmed Touati

Anirudh Goyal

Yoshua Bengio

Devi Parikh

Dhruv Batra

In model-based reinforcement learning, the agent interleaves between model learning and planning. These two components are inextricably intertwined. If the model is not able to provide sensible long-term prediction, the executed planner would exploit model flaws, which can yield catastrophic failures. This paper focuses on building a model that reasons about the long-term future and demonstrates how to use this for efficient planning and exploration. To this end, we build a latent-variable autoregressive model by leveraging recent ideas in variational inference. We argue that forcing latent variables to carry future information through an auxiliary task substantially improves long-term predictions. Moreover, by planning in the latent space, the planner's solution is ensured to be within regions where the model is valid. An exploration strategy can be devised by searching for unlikely trajectories under the model. Our method achieves higher reward faster compared to baselines on a variety of tasks and environments in both the imitation learning and model-based reinforcement learning settings.

2019-03-05

ArXiv (preprint)

arxiv.org

Maximum Entropy Generators for Energy-Based Models

Maximum likelihood estimation of energy-based models is a challenging problem due to the intractability of the log-likelihood gradient. In this work, we propose learning both the energy function and an amortized approximate sampling mechanism using a neural generator network, which provides an efficient approximation of the log-likelihood gradient. The resulting objective requires maximizing entropy of the generated samples, which we perform using recently proposed nonparametric mutual information estimators. Finally, to stabilize the resulting adversarial game, we use a zero-centered gradient penalty derived as a necessary condition from the score matching literature. The proposed technique can generate sharp images with Inception and FID scores competitive with recent GAN techniques, does not suffer from mode collapse, and is competitive with state-of-the-art anomaly detection techniques.

2019-01-24

ArXiv (preprint)

arxiv.org

InfoBot: Structured Exploration in Reinforcement Learning Using Information Bottleneck

D. Strouse

Matthew Botvinick

Sergey Levine

InfoBot: Transfer and Exploration via the Information Bottleneck

DJ Strouse

Matthew Botvinick

Sergey Levine

A central challenge in reinforcement learning is discovering effective policies for tasks where rewards are sparsely distributed. We postulate that in the absence of useful reward signals, an effective exploration strategy should seek out decision states. These states lie at critical junctions in the state space from where the agent can transition to new, potentially unexplored regions. We propose to learn about decision states from prior experience. By training a goal-conditioned policy with an information bottleneck, we can identify decision states by examining where the model actually leverages the goal state. We find that this simple mechanism effectively identifies decision states, even in partially observed settings. In effect, the model learns the sensory cues that correlate with potential subgoals. In new environments, this model can then identify novel subgoals for further exploration, guiding the agent through a sequence of potential decision states and through new regions of the state space.

2019-01-01

ICLR.cc/2019/Conference (poster)

openreview.net

Recall Traces: Backtracking Models for Efficient Reinforcement Learning

Anirudh Goyal

Philemon Brakel

William Fedus

Soumye Singhal

Timothy P. Lillicrap

Sergey Levine

Hugo Larochelle

Yoshua Bengio

In many environments only a tiny subset of all states yield high reward. In these cases, few of the interactions with the environment provide a relevant learning signal. Hence, we may want to preferentially train on those high-reward states and the probable trajectories leading to them. To this end, we advocate for the use of a backtracking model that predicts the preceding states that terminate at a given high-reward state. We can train a model which, starting from a high value state (or one that is estimated to have high value), predicts and samples the (state, action) tuples that may have led to that high value state. These traces of (state, action) pairs, which we refer to as Recall Traces, sampled from this backtracking model starting from a high value state, are informative as they terminate in good states, and hence we can use these traces to improve a policy. We provide a variational interpretation for this idea and a practical algorithm in which the backtracking model samples from an approximate posterior distribution over trajectories which lead to large rewards. Our method improves the sample efficiency of both on- and off-policy RL algorithms across several environments and tasks.

2019-01-01

ICLR.cc/2019/Conference (poster)

openreview.net

Towards Jumpy Planning

Akilesh

Suriya Singh

Anirudh Goyal

Alexander Neitz

Aaron Courville

Model-free reinforcement learning (RL) is a powerful paradigm for learning complex tasks but suffers from high sample inefficiency as well as ignorance of the environment dynamics. On the other hand, a model-based RL agent learns dynamical causal models of the environment and uses them to plan. However, using a model at the scale of time-steps (usually tens of milliseconds) is mostly unfeasible in practice due to compounding prediction errors and computational requirements for making vast numbers of model queries during the planning process. We propose to use a model-based planner together with a goal-conditioned policy trained with model-free learning. We use a model-based planner that operates at higher levels of abstraction, i.e., decision states, and use model-free RL between the decision states. We validate our approach in terms of transfer and generalization performance and show that it leads to improvement over a model-based planner that jumps to states that are a fixed number of timesteps ahead.

2019-01-01

(published)

www.semanticscholar.org

Modeling the Long Term Future in Model-Based Reinforcement Learning

Nan Rosemary Ke

Amanpreet Singh