Publications

Hyperbolic Discounting and Learning over Multiple Horizons

William Fedus

Carles Gelada

Bellemare Marc-Emmanuel

Reinforcement learning (RL) typically defines a discount factor as part of the Markov Decision Process. The discount factor values future rewards by an exponential scheme that leads to theoretical convergence guarantees of the Bellman equation. However, evidence from psychology, economics and neuroscience suggests that humans and animals instead have hyperbolic time-preferences. In this work we revisit the fundamentals of discounting in RL and bridge this disconnect by implementing an RL agent that acts via hyperbolic discounting. We demonstrate that a simple approach approximates hyperbolic discount functions while still using familiar temporal-difference learning techniques in RL. Additionally, and independent of hyperbolic discounting, we make a surprising discovery that simultaneously learning value functions over multiple time-horizons is an effective auxiliary task which often improves over a strong value-based RL agent, Rainbow.

2019-02-18

ArXiv (preprint)

openreview.net
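
For the hyperbolic discounting entry above, a minimal sketch of why a family of exponential discounts can stand in for a hyperbolic one. It assumes the standard identity that the integral of gamma^(k*t) over gamma in [0, 1] equals 1 / (1 + k*t). The abstract does not spell out the construction, so the function names, the value k = 0.1 and the midpoint sum are illustrative assumptions, not the authors' implementation.

```python
def hyperbolic_discount(t, k=0.1):
    """Hyperbolic discount applied to a reward t steps in the future."""
    return 1.0 / (1.0 + k * t)


def averaged_exponential_discount(t, k=0.1, n=1000):
    """Approximate the hyperbolic discount by averaging exponential ones.

    Midpoint Riemann sum of gamma ** (k * t) over n values of gamma
    spread evenly across (0, 1). Each term is an ordinary exponential
    discount with its own discount factor (hypothetical helper).
    """
    width = 1.0 / n
    total = 0.0
    for i in range(n):
        gamma = (i + 0.5) * width  # midpoint of the i-th sub-interval
        total += gamma ** (k * t)
    return total * width


if __name__ == "__main__":
    for t in (0, 1, 5, 10, 50):
        exact = hyperbolic_discount(t)
        approx = averaged_exponential_discount(t)
        print(f"t={t:3d}  hyperbolic={exact:.4f}  averaged={approx:.4f}")
```

Each term in the sum is an exponential discount of the kind used in standard temporal-difference learning, which is one way to read the abstract's claim that familiar techniques still apply.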

Predicting conversion to psychosis in clinical high risk patients using resting-state functional MRI features

Jolie Mcdonnell

W. Hord

Jenna Reinen

Pablo Polosecki

Irina Rish

Guillermo Cecchi

Recent progress in artificial intelligence provides researchers with a powerful set of machine learning tools for analyzing brain imaging data. In this work, we explore a variety of classification algorithms and functional network features derived from resting-state fMRI data collected from clinical high-risk (prodromal schizophrenia) patients and controls, trying to identify features predictive of conversion to psychosis among a subset of CHR patients. While there are many existing studies suggesting that functional network features can be highly discriminative of schizophrenia when analyzing fMRI of patients suffering from the disease vs controls, few studies attempt to explore a similar approach to actual prediction of future psychosis development ahead of time, in the prodromal stage. Our preliminary results demonstrate the potential of fMRI functional network features to predict the conversion to psychosis in CHR patients. However, given the high variance of our results across different classifiers and subsets of data, a more extensive empirical investigation is required to reach more robust conclusions.

2019-02-15

Medical Imaging 2019: Biomedical Applications in Molecular, Structural, and Functional Imaging (published)

doi.org

Anytime Tail Averaging

Nicolas Roux

Tail averaging consists of averaging the last examples in a stream. Common techniques either have a memory requirement which grows with the number of samples to average, are not available at every timestep or do not accommodate growing windows. We propose two techniques with a low constant memory cost that perform tail averaging with access to the average at every time step. We also show how one can improve the accuracy of that average at the cost of increased memory consumption.

2019-02-12

ArXiv (preprint)

arxiv.org
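
For the tail averaging entry above, a minimal sketch of the kind of baseline the abstract contrasts against: an exact average over the last few samples that keeps every sample of the window in memory. This is not one of the two proposed techniques, which the abstract does not describe. The class name and the fixed `window` parameter are hypothetical.

```python
from collections import deque


class WindowTailAverage:
    """Exact average of the last `window` samples of a stream.

    Memory grows linearly with `window`, and the window cannot grow
    with the stream, which are limitations the abstract points out.
    """

    def __init__(self, window):
        if window < 1:
            raise ValueError("window must be at least 1")
        self.buffer = deque(maxlen=window)
        self.total = 0.0

    def update(self, x):
        """Add one sample and return the current tail average."""
        if len(self.buffer) == self.buffer.maxlen:
            # The oldest sample is about to be evicted by append().
            self.total -= self.buffer[0]
        self.buffer.append(x)
        self.total += x
        return self.total / len(self.buffer)


if __name__ == "__main__":
    averager = WindowTailAverage(window=3)
    print([averager.update(x) for x in (1, 2, 3, 4, 5)])
    # prints [1.0, 1.5, 2.0, 3.0, 4.0]
```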

Dendritic solutions to the credit assignment problem

Blake Aaron Richards

Timothy P Lillicrap

2019-01-31

Current Opinion in Neurobiology (published)

doi.org

Equivalence of Equilibrium Propagation and Recurrent Backpropagation

Benjamin Scellier

Yoshua Bengio

Recurrent backpropagation and equilibrium propagation are supervised learning algorithms for fixed-point recurrent neural networks, which differ in their second phase. In the first phase, both algorithms converge to a fixed point that corresponds to the configuration where the prediction is made. In the second phase, equilibrium propagation relaxes to another nearby fixed point corresponding to smaller prediction error, whereas recurrent backpropagation uses a side network to compute error derivatives iteratively. In this work, we establish a close connection between these two algorithms. We show that at every moment in the second phase, the temporal derivatives of the neural activities in equilibrium propagation are equal to the error derivatives computed iteratively by recurrent backpropagation in the side network. This work shows that it is not required to have a side network for the computation of error derivatives and supports the hypothesis that in biological neural networks, temporal derivatives of neural activities may code for error signals.

2019-01-31

Neural Computation (published)

doi.org

openreview.net

The Impact of Time Interval between Extubation and Reintubation on Death or Bronchopulmonary Dysplasia in Extremely Preterm Infants

Wissam Shalish

Lara Kanbar

Lajos Kovacs

Sanjay Chawla

Martin Keszler

Smita Rao

Bogdan Panaitescu

Alyse Laliberte

Doina Precup

Karen Brown

Robert E. Kearney

Guilherme M. Sant'Anna

2019-01-31

The Journal of Pediatrics (published)

doi.org

Author Correction: Why rankings of biomedical image analysis competitions should be interpreted with care

Lena Maier-Hein

Matthias Eisenmann

Annika Reinke

Sinan Onogur

Marko Stankovic

Patrick Scholz

Tal Arbel

Hrvoje Bogunovic

Andrew P. Bradley

Aaron Carass

Carolin Feldmann

Alejandro F. Frangi

Peter M. Full

Bram van Ginneken

Allan Hanbury

Katrin Honauer

Michal Kozubek

Bennett Landman

Keno März

Oskar Maier

Klaus Maier-Hein

Bjoern Menze

Henning Müller

Peter F. Neher

Wiro Niessen

Nasir Rajpoot

Gregory C. Sharp

Korsuk Sirinukunwattana

Stefanie Speidel

Christian Stock

Danail Stoyanov

Abdel Aziz Taha

Fons van der Sommen

Ching-Wei Wang

Marc-André Weber

Guoyan Zheng

Pierre Jannin

Annette Kopp-Schneider

2019-01-29

Nature Communications (published)

doi.org

Session-Based Social Recommendation via Dynamic Graph Attention Networks

Weiping Song

Zhiping Xiao

Yifan Wang

Laurent Charlin

Ming Zhang

Jian Tang

Online communities such as Facebook and Twitter are enormously popular and have become an essential part of the daily life of many of their users. Through these platforms, users can discover and create information that others will then consume. In that context, recommending relevant information to users becomes critical for viability. However, recommendation in online communities is a challenging problem: 1) users' interests are dynamic, and 2) users are influenced by their friends. Moreover, the influencers may be context-dependent. That is, different friends may be relied upon for different topics. Modeling both signals is therefore essential for recommendations. We propose a recommender system for online communities based on a dynamic-graph-attention neural network. We model dynamic user behaviors with a recurrent neural network, and context-dependent social influence with a graph-attention neural network, which dynamically infers the influencers based on users' current interests. The whole model can be efficiently fit on large-scale data. Experimental results on several real-world data sets demonstrate the effectiveness of our proposed approach over several competitive baselines including state-of-the-art models.

2019-01-29

Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining (published)

doi.org

arxiv.org

Maximum Entropy Generators for Energy-Based Models

Maximum likelihood estimation of energy-based models is a challenging problem due to the intractability of the log-likelihood gradient. In this work, we propose learning both the energy function and an amortized approximate sampling mechanism using a neural generator network, which provides an efficient approximation of the log-likelihood gradient. The resulting objective requires maximizing entropy of the generated samples, which we perform using recently proposed nonparametric mutual information estimators. Finally, to stabilize the resulting adversarial game, we use a zero-centered gradient penalty derived as a necessary condition from the score matching literature. The proposed technique can generate sharp images with Inception and FID scores competitive with recent GAN techniques, does not suffer from mode collapse, and is competitive with state-of-the-art anomaly detection techniques.

2019-01-23

ArXiv (preprint)

doi.org

arxiv.org

What comes next? Extractive summarization by next-sentence prediction

Jingyun Liu

Jackie CK Cheung

Annie Priyadarshini Louis

Existing approaches to automatic summarization assume that a length limit for the summary is given, and view content selection as an optimization problem to maximize informativeness and minimize redundancy within this budget. This framework ignores the fact that human-written summaries have rich internal structure which can be exploited to train a summarization system. We present NEXTSUM, a novel approach to summarization based on a model that predicts the next sentence to include in the summary using not only the source article, but also the summary produced so far. We show that such a model successfully captures summary-specific discourse moves, and leads to better content selection performance, in addition to automatically predicting how long the target summary should be. We perform experiments on the New York Times Annotated Corpus of summaries, where NEXTSUM outperforms lead and content-model summarization baselines by significant margins. We also show that the lengths of summaries produced by our system correlate with the lengths of the human-written gold standards.

2019-01-11

ArXiv (preprint)

arxiv.org

The Benefits of Over-parameterization at Initialization in Deep ReLU Networks

Devansh Arpit

Yoshua Bengio

It has been noted in existing literature that over-parameterization in ReLU networks generally improves performance. While there could be several factors involved behind this, we prove some desirable theoretical properties at initialization which may be enjoyed by ReLU networks. Specifically, it is known that He initialization in deep ReLU networks asymptotically preserves variance of activations in the forward pass and variance of gradients in the backward pass for infinitely wide networks, thus preserving the flow of information in both directions. Our paper goes beyond these results and shows novel properties that hold under He initialization: i) the norm of hidden activation of each layer is equal to the norm of the input, and, ii) the norm of weight gradient of each layer is equal to the product of norm of the input vector and the error at output layer. These results are derived using the PAC analysis framework, and hold true for finitely sized datasets such that the width of the ReLU network only needs to be larger than a certain finite lower bound. As we show, this lower bound depends on the depth of the network and the number of samples, and by virtue of being a lower bound, over-parameterized ReLU networks are endowed with these desirable properties. For the aforementioned hidden activation norm property under He initialization, we further extend our theory and show that this property holds for a finite width network even when the number of data samples is infinite. Thus we overcome several limitations of existing papers, and show new properties of deep ReLU networks at initialization.

2019-01-10

ArXiv (preprint)

openreview.net
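
For the over-parameterization entry above, a small numerical illustration of property i), that the hidden activation norm stays close to the input norm under He initialization. It is a sketch under stated assumptions, not the paper's analysis: a single square ReLU layer, Gaussian weights with variance 2 / width, a Gaussian input vector, and a hypothetical function name. The ratio is only approximately 1 at finite width.

```python
import math
import random


def relu_layer_norm_ratio(width=512, seed=0):
    """Return ||relu(W x)|| / ||x|| for one He-initialized square layer.

    Assumptions for this illustration: W is width x width with entries
    drawn from N(0, 2 / width), and x has standard normal entries.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    std = math.sqrt(2.0 / width)  # He initialization standard deviation

    hidden = []
    for _ in range(width):
        # One row of W is sampled on the fly and applied to x.
        pre_activation = sum(rng.gauss(0.0, std) * xi for xi in x)
        hidden.append(max(0.0, pre_activation))  # ReLU

    norm_in = math.sqrt(sum(v * v for v in x))
    norm_out = math.sqrt(sum(v * v for v in hidden))
    return norm_out / norm_in


if __name__ == "__main__":
    for width in (64, 256, 1024):
        print(width, round(relu_layer_norm_ratio(width=width), 3))
```

Running the loop at increasing widths gives a rough feel for how the ratio behaves as the layer gets wider, which is the regime the abstract's lower bound on width refers to.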

1. Searching for Big-Oh in the Data: Inferring Asymptotic Complexity from Experiments

Catherine McGeoch

Peter Sanders

Rudolf Fleischer

Paul R. Cohen

Doina Precup

2018-12-31

(published)

www.semanticscholar.org
