Publications

Deep Reinforcement Learning that Matters
In recent years, significant progress has been made in solving challenging problems across various domains using deep reinforcement learning… (voir plus) (RL). Reproducing existing work and accurately judging the improvements offered by novel methods is vital to sustaining this progress. Unfortunately, reproducing results for state-of-the-art deep RL methods is seldom straightforward. In particular, non-determinism in standard benchmark environments, combined with variance intrinsic to the methods, can make reported results tough to interpret. Without significance metrics and tighter standardization of experimental reporting, it is difficult to determine whether improvements over the prior state-of-the-art are meaningful. In this paper, we investigate challenges posed by reproducibility, proper experimental techniques, and reporting procedures. We illustrate the variability in reported metrics and results when comparing against common baselines and suggest guidelines to make future results in deep RL more reproducible. We aim to spur discussion about how to ensure continued progress in the field by minimizing wasted effort stemming from results that are non-reproducible and easily misinterpreted.
Imitation Upper Confidence Bound for Bandits on a Graph
We consider a graph of interconnected agents implementing a common policy and each playing a bandit problem with identical reward distributi… (voir plus)ons. We restrict the information propagated in the graph such that agents can uniquely observe each other's actions. We propose an extension of the Upper Confidence Bound (UCB) algorithm to this setting and empirically demonstrate that our solution improves the performance over UCB according to multiple metrics and within various graph configurations.
Learning Predictive State Representations From Non-Uniform Sampling
Yuri Grinberg
Melanie Lyman-Abramovitch
Borja Balle
Predictive state representations (PSR) have emerged as a powerful method for modelling partially observable environments. PSR learning algor… (voir plus)ithms can build models for predicting all observable variables, or predicting only some of them conditioned on others (e.g., actions or exogenous variables). In the latter case, which we call conditional modelling, the accuracy of different estimates of the conditional probabilities for a fixed dataset can vary significantly, due to the limited sampling of certain conditions. This can have negative consequences on the PSR parameter estimation process, which are not taken into account by the current state-of-the-art PSR spectral learning algorithms. In this paper, we examine closely conditional modelling within the PSR framework. We first establish a new positive but surprisingly non-trivial result: a conditional model can never be larger than the complete model. Then, we address the core shortcoming of existing PSR spectral learning methods for conditional models by incorporating an additional step in the process, which can be seen as a type of matrix denoising. We further refine this objective by adding penalty terms for violations of the system dynamics matrix structure, which improves the PSR predictive performance. Empirical evaluations on both synthetic and real datasets highlight the advantages of the proposed approach.
Learning Visual Reasoning Without Strong Priors
We introduce a general-purpose conditioning method for neural networks called FiLM: Feature-wise Linear Modulation. FiLM layers influence ne… (voir plus)ural network computation via a simple, feature-wise affine transformation based on conditioning information. We show that FiLM layers are highly effective for visual reasoning - answering image-related questions which require a multi-step, high-level process - a task which has proven difficult for standard deep learning methods that do not explicitly model reasoning. Specifically, we show on visual reasoning tasks that FiLM layers 1) halve state-of-the-art error for the CLEVR benchmark, 2) modulate features in a coherent manner, 3) are robust to ablations and architectural modifications, and 4) generalize well to challenging, new data from few examples or even zero-shot.
Learning with Options that Terminate Off-Policy
Anna Harutyunyan
Peter Vrancx
Ann Nowé
A temporally abstract action, or an option, is specified by a policy and a termination condition: the policy guides option behavior, and the… (voir plus) termination condition roughly determines its length. Generally, learning with longer options (like learning with multi-step returns) is known to be more efficient. However, if the option set for the task is not ideal, and cannot express the primitive optimal policy exactly, shorter options offer more flexibility and can yield a better solution. Thus, the termination condition puts learning efficiency at odds with solution quality. We propose to resolve this dilemma by decoupling the behavior and target terminations, just like it is done with policies in off-policy learning. To this end, we give a new algorithm, Q(β), that learns the solution with respect to any termination condition, regardless of how the options actually terminate. We derive Q(β) by casting learning with options into a common framework with well-studied multi-step off-policy learning. We validate our algorithm empirically, and show that it holds up to its motivating claims.
OptionGAN: Learning Joint Reward-Policy Options using Generative Adversarial Inverse Reinforcement Learning
Reinforcement learning has shown promise in learning policies that can solve complex problems. However, manually specifying a good reward fu… (voir plus)nction can be difficult, especially for intricate tasks. Inverse reinforcement learning offers a useful paradigm to learn the underlying reward function directly from expert demonstrations. Yet in reality, the corpus of demonstrations may contain trajectories arising from a diverse set of underlying reward functions rather than a single one. Thus, in inverse reinforcement learning, it is useful to consider such a decomposition. The options framework in reinforcement learning is specifically designed to decompose policies in a similar light. We therefore extend the options framework and propose a method to simultaneously recover reward options in addition to policy options. We leverage adversarial methods to learn joint reward-policy options using only observed expert states. We show that this approach works well in both simple and complex continuous control tasks and shows significant performance increases in one-shot transfer learning.
When Waiting is not an Option: Learning Options with a Deliberation Cost
Recent work has shown that temporally extended actions (options) can be learned fully end-to-end as opposed to being specified in advance. W… (voir plus)hile the problem of "how" to learn options is increasingly well understood, the question of "what" good options should be has remained elusive. We formulate our answer to what "good" options should be in the bounded rationality framework (Simon, 1957) through the notion of deliberation cost. We then derive practical gradient-based learning algorithms to implement this objective. Our results in the Arcade Learning Environment (ALE) show increased performance and interpretability.
Low-memory convolutional neural networks through incremental depth-first processing
We introduce an incremental processing scheme for convolutional neural network (CNN) inference, targeted at embedded applications with limit… (voir plus)ed memory budgets. Instead of processing layers one by one, individual input pixels are propagated through all parts of the network they can influence under the given structural constraints. This depth-first updating scheme comes with hard bounds on the memory footprint: the memory required is constant in the case of 1D input and proportional to the square root of the input dimension in the case of 2D input.
How Do the Open Source Communities Address Usability and UX Issues?: An Exploratory Study
Jinghui Cheng
Jin L.C. Guo
Usability and user experience (UX) issues are often not well emphasized and addressed in open source software (OSS) development. There is an… (voir plus) imperative need for supporting OSS communities to collaboratively identify, understand, and fix UX design issues in a distributed environment. In this paper, we provide an initial step towards this effort and report on an exploratory study that investigated how the OSS communities currently reported, discussed, negotiated, and eventually addressed usability and UX issues. We conducted in-depth qualitative analysis of selected issue tracking threads from three OSS projects hosted on GitHub. Our findings indicated that discussions about usability and UX issues in OSS communities were largely influenced by the personal opinions and experiences of the participants. Moreover, the characteristics of the community may have greatly affected the focus of such discussion.
Minimization of Graph Weighted Models over Circular Strings
Dynamic Frame Skipping for Fast Speech Recognition in Recurrent Neural Network Based Acoustic Models
A recurrent neural network is a powerful tool for modeling sequential data such as text and speech. While recurrent neural networks have ach… (voir plus)ieved record-breaking results in speech recognition, one remaining challenge is their slow processing speed. The main cause comes from the nature of recurrent neural networks that read only one frame at each time step. Therefore, reducing the number of reads is an effective approach to reducing processing time. In this paper, we propose a novel recurrent neural network architecture called Skip-RNN, which dynamically skips speech frames that are less important. The Skip-RNN consists of an acoustic model network and skip-policy network that are jointly trained to classify speech frames and determine how many frames to skip. We evaluate our proposed approach on the Wall Street Journal corpus and show that it can accelerate acoustic model computation by up to 2.4 times without any noticeable degradation in transcription accuracy.
Monaural Singing Voice Separation with Skip-Filtering Connections and Recurrent Inference of Time-Frequency Mask
Stylianos Ioannis Mimilakis
Gerald Schuller
Tuomas Virtanen
Singing voice separation based on deep learning relies on the usage of time-frequency masking. In many cases the masking process is not a le… (voir plus)arnable function or is not encapsulated into the deep learning optimization. Consequently, most of the existing methods rely on a post processing step using the generalized Wiener filtering. This work proposes a method that learns and optimizes (during training) a source-dependent mask and does not need the aforementioned post processing step. We introduce a recurrent inference algorithm, a sparse transformation step to improve the mask generation process, and a learned denoising filter. Obtained results show an increase of 0.49 dB for the signal to distortion ratio and 0.30 dB for the signal to interference ratio, compared to previous state-of-the-art approaches for monaural singing voice separation.