Eric Crawford

Self-supervised Learning of Distance Functions for Goal-Conditioned Reinforcement Learning

Srinivas Venkattaramanujam

Thang Doan

Goal-conditioned policies are used in order to break down complex reinforcement learning (RL) problems by using subgoals, which can be defined either in state space or in a latent feature space. This can increase the efficiency of learning by using a curriculum, and also enables simultaneous learning and generalization across goals. A crucial requirement of goal-conditioned policies is to be able to determine whether the goal has been achieved. Having a notion of distance to a goal is thus a crucial component of this approach. However, it is not straightforward to come up with an appropriate distance, and in some tasks, the goal space may not even be known a priori. In this work we learn a distance-to-goal estimate which is computed in terms of the number of actions that would need to be carried out in a self-supervised approach. Our method solves complex tasks without prior domain knowledge in the online setting in three different scenarios in the context of goal-conditioned policies a) the goal space is the same as the state space b) the goal space is given but an appropriate distance is unknown and c) the state space is accessible, but only a subset of the state space represents desired goals, and this subset is known a priori. We also propose a goal-generation mechanism as a secondary contribution.

2019-07-04

ArXiv (preprint)

arxiv.org

BanditSum: Extractive Summarization as a Contextual Bandit

Yue Dong

Yikang Shen

Eric Crawford

Herke van Hoof

Jackie CK Cheung

In this work, we propose a novel method for training neural networks to perform single-document extractive summarization without heuristically-generated extractive labels. We call our approach BanditSum as it treats extractive summarization as a contextual bandit (CB) problem, where the model receives a document to summarize (the context), and chooses a sequence of sentences to include in the summary (the action). A policy gradient reinforcement learning algorithm is used to train the model to select sequences of sentences that maximize ROUGE score. We perform a series of experiments demonstrating that BanditSum is able to achieve ROUGE scores that are better than or comparable to the state-of-the-art for extractive summarization, and converges using significantly fewer update steps than competing approaches. In addition, we show empirically that BanditSum performs significantly better than competing approaches when good summary sentences appear late in the source document.

2018-09-30

Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (published)

doi.org

arxiv.org

Sequential Coordination of Deep Models for Learning Visual Arithmetic

Eric Crawford

Guillaume Rabusseau

Joelle Pineau

Achieving machine intelligence requires a smooth integration of perception and reasoning, yet models developed to date tend to specialize in one or the other; sophisticated manipulation of symbols acquired from rich perceptual spaces has so far proved elusive. Consider a visual arithmetic task, where the goal is to carry out simple arithmetical algorithms on digits presented under natural conditions (e.g. hand-written, placed randomly). We propose a two-tiered architecture for tackling this problem. The lower tier consists of a heterogeneous collection of information processing modules, which can include pre-trained deep neural networks for locating and extracting characters from the image, as well as modules performing symbolic transformations on the representations extracted by perception. The higher tier consists of a controller, trained using reinforcement learning, which coordinates the modules in order to solve the high-level task. For instance, the controller may learn in what contexts to execute the perceptual networks and what symbolic transformations to apply to their outputs. The resulting model is able to solve a variety of tasks in the visual arithmetic domain, and has several advantages over standard, architecturally homogeneous feedforward networks including improved sample efficiency.

2018-02-14

ArXiv (preprint)

openreview.net

