Publications

Residual Connections Encourage Iterative Inference

Stanisław Jastrzębski

Tong Che

Residual networks (Resnets) have become a prominent architecture in deep learning. However, a comprehensive understanding of Resnets is still a topic of ongoing research. A recent view argues that Resnets perform iterative refinement of features. We attempt to further expose properties of this aspect. To this end, we study Resnets both analytically and empirically. We formalize the notion of iterative refinement in Resnets by showing that residual connections naturally encourage features of residual blocks to move along the negative gradient of loss as we go from one block to the next. In addition, our empirical analysis suggests that Resnets are able to perform both representation learning and iterative refinement. In general, a Resnet block tends to concentrate representation learning behavior in the first few layers while higher layers perform iterative refinement of features. Finally, we observe that sharing residual layers naively leads to representation explosion and, counterintuitively, overfitting, and we show that simple existing strategies can help alleviate this problem.

2017-12-31

ICLR.cc/2018/Conference (poster)

openreview.net

Sparse Attentive Backtracking: Temporal Credit Assignment Through Reminding

Nan Rosemary Ke

Michael C. Mozer

Learning long-term dependencies in extended temporal sequences requires credit assignment to events far back in the past. The most common method for training recurrent neural networks, back-propagation through time (BPTT), requires credit information to be propagated backwards through every single step of the forward computation, potentially over thousands or millions of time steps. This becomes computationally expensive or even infeasible when used with long sequences. Importantly, biological brains are unlikely to perform such detailed reverse replay over very long sequences of internal states (consider days, months, or years). However, humans are often reminded of past memories or mental states which are associated with the current mental state. We consider the hypothesis that such memory associations between past and present could be used for credit assignment through arbitrarily long sequences, propagating the credit assigned to the current state to the associated past state. Based on this principle, we study a novel algorithm which only back-propagates through a few of these temporal skip connections, realized by a learned attention mechanism that associates current states with relevant past states. We demonstrate in experiments that our method matches or outperforms regular BPTT and truncated BPTT in tasks involving particularly long-term dependencies, but without requiring the biologically implausible backward replay through the whole history of states. Additionally, we demonstrate that the proposed method transfers to longer sequences significantly better than LSTMs trained with BPTT and LSTMs trained with full self-attention.

2017-12-31

Advances in Neural Information Processing Systems 31 (NeurIPS 2018) (published)

doi.org

arxiv.org

Streaming kernel regression with provably adaptive mean, variance, and regularization

Audrey Durand

Odalric-Ambrym Maillard

Joelle Pineau

We consider the problem of streaming kernel regression, when the observations arrive sequentially and the goal is to recover the underlying mean function, assumed to belong to an RKHS. The variance of the noise is not assumed to be known. In this context, we tackle the problem of tuning the regularization parameter adaptively at each time step, while maintaining tight confidence bound estimates on the value of the mean function at each point. To this end, we first generalize existing results for finite-dimensional linear regression with fixed regularization and known variance to the kernel setup with a regularization parameter allowed to be a measurable function of past observations. Then, using appropriate self-normalized inequalities, we build upper and lower bound estimates for the variance, leading to Bernstein-like concentration bounds. The latter are used to define the adaptive regularization. The bounds resulting from our technique are valid uniformly over all observation points and all time steps, and are compared against the literature with numerical experiments. Finally, the potential of these tools is illustrated by an application to kernelized bandits, where we revisit the Kernel UCB and Kernel Thompson Sampling procedures, and show the benefits of the novel adaptive kernel tuning strategy.

2017-12-31

Journal of Machine Learning Research (published)

arxiv.org

Temporal Regularization for Markov Decision Process

Several applications of Reinforcement Learning suffer from instability due to high variance. This is especially prevalent in high-dimensional domains. Regularization is a commonly used technique in machine learning to reduce variance, at the cost of introducing some bias. Most existing regularization techniques focus on spatial (perceptual) regularization. Yet in reinforcement learning, due to the nature of the Bellman equation, there is an opportunity to also exploit temporal regularization based on smoothness in value estimates over trajectories. This paper explores a class of methods for temporal regularization. We formally characterize the bias induced by this technique using Markov chain concepts. We illustrate the various characteristics of temporal regularization via a sequence of simple discrete and continuous MDPs, and show that the technique provides improvement even in high-dimensional Atari games.

2017-12-31

Advances in Neural Information Processing Systems 31 (NeurIPS 2018) (published)

dblp.uni-trier.de

Towards Deep Conversational Recommendations

Raymond Li

Samira Ebrahimi Kahou

Hannes Schulz

Vincent Michalski

Laurent Charlin

Chris Pal

There has been growing interest in using neural networks and deep learning techniques to create dialogue systems. Conversational recommendation is an interesting setting for the scientific exploration of dialogue with natural language, as the associated discourse involves goal-driven dialogue that often transforms naturally into more free-form chat. This paper provides two contributions. First, until now there has been no publicly available large-scale dataset consisting of real-world dialogues centered around recommendations. To address this issue and to facilitate our exploration here, we have collected ReDial, a dataset consisting of over 10,000 conversations centered around the theme of providing movie recommendations. We make this data available to the community for further research. Second, we use this dataset to explore multiple facets of conversational recommendations. In particular, we explore new neural architectures, mechanisms, and methods suitable for composing conversational recommendation systems. Our dataset allows us to systematically probe model sub-components addressing different parts of the overall problem domain, ranging from sentiment analysis and cold-start recommendation generation to detailed aspects of how natural language is used in this setting in the real world. We combine such sub-components into a full-blown dialogue system and examine its behavior.

2017-12-31

Advances in Neural Information Processing Systems 31 (NeurIPS 2018) (published)

doi.org

arxiv.org

Towards Text Generation with Adversarially Learned Neural Outlines

Sandeep Subramanian

Sai Rajeswar

Alessandro Sordoni

Adam Trischler

Aaron C. Courville

Chris Pal

2017-12-31

Advances in Neural Information Processing Systems 31 (NeurIPS 2018) (published)

dblp.uni-trier.de

Trends and Applications in Knowledge Discovery and Data Mining

Lida Rashidi

Benjamin C. M. Fung

Can Wang

2017-12-31

Lecture Notes in Computer Science (published)

doi.org

Twin Networks: Matching the Future for Sequence Generation

Dmitriy Serdyuk

Nan Rosemary Ke

Alessandro Sordoni

Adam Trischler

Christopher Pal

Yoshua Bengio

We propose a simple technique for encouraging generative RNNs to plan ahead. We train a "backward" recurrent network to generate a given sequence in reverse order, and we encourage states of the forward model to predict cotemporal states of the backward model. The backward network is used only during training, and plays no role during sampling or inference. We hypothesize that our approach eases modeling of long-term dependencies by implicitly forcing the forward states to hold information about the longer-term future (as contained in the backward states). We show empirically that our approach achieves 9% relative improvement for a speech recognition task, and achieves significant improvement on a COCO caption generation task.

2017-12-31

ICLR (Poster) (published)

openreview.net

Universal Successor Representations for Transfer Reinforcement Learning

Chen Ma

Junfeng Wen

Yoshua Bengio

The objective of transfer reinforcement learning is to generalize from a set of previous tasks to unseen new tasks. In this work, we focus on the transfer scenario where the dynamics among tasks are the same, but their goals differ. Although general value functions (Sutton et al., 2011) have been shown to be useful for knowledge transfer, learning a universal value function can be challenging in practice. To address this, we propose (1) to use universal successor representations (USR) to represent the transferable knowledge and (2) a USR approximator (USRA) that can be trained by interacting with the environment. Our experiments show that USR can be effectively applied to new tasks, and the agent initialized by the trained USRA can achieve the goal considerably faster than random initialization.

2017-12-31

ICLR (Workshop) (published)

openreview.net

Unsupervised Depth Estimation, 3D Face Rotation and Replacement

Joel Ruben Antony Moniz

Christopher Beckham

Simon Rajotte

Sina Honari

Christopher Pal

We present an unsupervised approach for learning to estimate three-dimensional (3D) facial structure from a single image while also predicting 3D viewpoint transformations that match a desired pose and facial geometry. We achieve this by inferring the depth of facial keypoints of an input image in an unsupervised manner, without using any form of ground-truth depth information. We show how it is possible to use these depths as intermediate computations within a new backpropable loss to predict the parameters of a 3D affine transformation matrix that maps inferred 3D keypoints of an input face to the corresponding 2D keypoints on a desired target facial geometry or pose. Our resulting approach, called DepthNets, can therefore be used to infer plausible 3D transformations from one face pose to another, allowing faces to be frontalized, transformed into 3D models or even warped to another pose and facial geometry. Lastly, we identify certain shortcomings with our formulation, and explore adversarial image translation techniques as a post-processing step to re-synthesize complete head shots for faces re-targeted to different poses or identities.

2017-12-31

Advances in Neural Information Processing Systems 31 (NeurIPS 2018) (published)

arxiv.org

Boosting Based Multiple Kernel Learning and Transfer Regression for Electricity Load Forecasting

Di Wu

Boyu Wang

Doina Precup

Benoit Boulet

2017-12-29

Machine Learning and Knowledge Discovery in Databases (published)

doi.org

Dendritic error backpropagation in deep cortical microcircuits

João Sacramento

Rui Ponte Costa

Yoshua Bengio

Walter Senn

Animal behaviour depends on learning to associate sensory stimuli with the desired motor command. Understanding how the brain orchestrates the necessary synaptic modifications across different brain areas has remained a longstanding puzzle. Here, we introduce a multi-area neuronal network model in which synaptic plasticity continuously adapts the network towards a global desired output. In this model, synaptic learning is driven by a local dendritic prediction error that arises from a failure to predict the top-down input given the bottom-up activities. Such errors occur at apical dendrites of pyramidal neurons where both long-range excitatory feedback and local inhibitory predictions are integrated. When local inhibition fails to match excitatory feedback, an error occurs which triggers plasticity at bottom-up synapses at basal dendrites of the same pyramidal neurons. We demonstrate the learning capabilities of the model in a number of tasks and show that it approximates the classical error backpropagation algorithm. Finally, complementing this cortical circuit with a disinhibitory mechanism enables attention-like stimulus denoising and generation. Our framework makes several experimental predictions on the function of dendritic integration and cortical microcircuits, is consistent with recent observations of cross-area learning, and suggests a biological implementation of deep learning.

2017-12-29

ArXiv (preprint)

arxiv.org
