David Scott Krueger

Charlotte Stix

Peter Mark Henderson

Logan Graham

Carina E. A. Prunkl

Bianca Martin

Elizabeth Seger

Noa Zilberman

Sean O hEigeartaigh

Frens Kroeger

Girish Sastry

R. Kagan

Adrian Weller

Brian Shek-kam Tse

Elizabeth Barnes

Allan Dafoe

Paul D. Scharre

Ariel Herbert-Voss

Martijn Rasser

Shagun Sodhani

Carrick Flynn

Thomas Krendl Gilbert

Lisa Dyer

Saif M. Khan

Yoshua Bengio

Markus Anderljung

2020-04-15

ArXiv (preprint)

Out-of-Distribution Generalization via Risk Extrapolation (REx)

Ethan Caballero

Joern-Henrik Jacobsen

Amy Zhang

Jonathan Binas

Rémi LE PRIOL

Generalizing outside of the training distribution is an open challenge for current machine learning systems. A weak form of out-of-distribution (OoD) generalization is the ability to successfully interpolate between multiple observed distributions. One way to achieve this is through robust optimization, which seeks to minimize the worst-case risk over convex combinations of the training distributions. However, a much stronger form of OoD generalization is the ability of models to extrapolate beyond the distributions observed during training. In pursuit of strong OoD generalization, we introduce the principle of Risk Extrapolation (REx). REx can be viewed as encouraging robustness over affine combinations of training risks, by encouraging strict equality between training risks. We show conceptually how this principle enables extrapolation, and demonstrate the effectiveness and scalability of instantiations of REx on various OoD generalization tasks. Our code can be found at this https URL.

2020-03-02

ArXiv (preprint)

Neural Autoregressive Flows

Alexandre Lacoste

Normalizing flows and autoregressive models have been successfully combined to produce state-of-the-art results in density estimation, via Masked Autoregressive Flows (MAF), and to accelerate state-of-the-art WaveNet-based speech synthesis to 20x faster than real-time, via Inverse Autoregressive Flows (IAF). We unify and generalize these approaches, replacing the (conditionally) affine univariate transformations of MAF/IAF with a more general class of invertible univariate transformations expressed as monotonic neural networks. We demonstrate that the proposed neural autoregressive flows (NAF) are universal approximators for continuous probability distributions, and their greater expressivity allows them to better capture multimodal target distributions. Experimentally, NAF yields state-of-the-art performance on a suite of density estimation tasks and outperforms IAF in variational autoencoders trained on binarized MNIST.

2018-07-03

Proceedings of the 35th International Conference on Machine Learning (published)

proceedings.mlr.press

Neural Autoregressive Flows

Alexandre Lacoste

Normalizing flows and autoregressive models have been successfully combined to produce state-of-the-art results in density estimation, via Masked Autoregressive Flows (MAF), and to accelerate state-of-the-art WaveNet-based speech synthesis to 20x faster than real-time, via Inverse Autoregressive Flows (IAF). We unify and generalize these approaches, replacing the (conditionally) affine univariate transformations of MAF/IAF with a more general class of invertible univariate transformations expressed as monotonic neural networks. We demonstrate that the proposed neural autoregressive flows (NAF) are universal approximators for continuous probability distributions, and their greater expressivity allows them to better capture multimodal target distributions. Experimentally, NAF yields state-of-the-art performance on a suite of density estimation tasks and outperforms IAF in variational autoencoders trained on binarized MNIST.

2018-04-03

ArXiv (preprint)

Bayesian Hypernetworks

Ryan Turner

Alexandre Lacoste

We propose Bayesian hypernetworks: a framework for approximate Bayesian inference in neural networks. A Bayesian hypernetwork, h, is a neural network which learns to transform a simple noise distribution, p(e) = N(0,I), to a distribution q(t) := q(h(e)) over the parameters t of another neural network (the "primary network"). We train q with variational inference, using an invertible h to enable efficient estimation of the variational lower bound on the posterior p(t | D) via sampling. In contrast to most methods for Bayesian deep learning, Bayesian hypernets can represent a complex multimodal approximate posterior with correlations between parameters, while enabling cheap iid sampling of q(t). In practice, Bayesian hypernets provide a better defense against adversarial examples than dropout, and also exhibit competitive performance on a suite of tasks which evaluate model uncertainty, including regularization, active learning, and anomaly detection.

2017-10-13

ArXiv (preprint)

A Closer Look at Memorization in Deep Networks

Devansh Arpit

Stanisław Jastrzębski

Maxinder S. Kanwal

We examine the role of memorization in deep learning, drawing connections to capacity, generalization, and adversarial robustness. While deep networks are capable of memorizing noise data, our results suggest that they tend to prioritize learning simple patterns first. In our experiments, we expose qualitative differences in gradient-based optimization of deep neural networks (DNNs) on noise vs. real data. We also demonstrate that for appropriately tuned explicit regularization (e.g., dropout) we can degrade DNN training performance on noise datasets without compromising generalization on real data. Our analysis suggests that the notions of effective capacity which are dataset independent are unlikely to explain the generalization performance of deep networks when trained with gradient based methods because training data itself plays an important role in determining the degree of memorization.

2017-07-17

Proceedings of the 34th International Conference on Machine Learning (published)

proceedings.mlr.press

A Closer Look at Memorization in Deep Networks

Devansh Arpit

Stanisław Jastrzębski

Maxinder S. Kanwal

We examine the role of memorization in deep learning, drawing connections to capacity, generalization, and adversarial robustness. While deep networks are capable of memorizing noise data, our results suggest that they tend to prioritize learning simple patterns first. In our experiments, we expose qualitative differences in gradient-based optimization of deep neural networks (DNNs) on noise vs. real data. We also demonstrate that for appropriately tuned explicit regularization (e.g., dropout) we can degrade DNN training performance on noise datasets without compromising generalization on real data. Our analysis suggests that the notions of effective capacity which are dataset independent are unlikely to explain the generalization performance of deep networks when trained with gradient based methods because training data itself plays an important role in determining the degree of memorization.

2017-06-16

ArXiv (preprint)

Deep Nets Don't Learn via Memorization