Portrait of Nicolas Ballas is unavailable

Nicolas Ballas

Alumni

Publications

Width of Minima Reached by Stochastic Gradient Descent is Influenced by Learning Rate to Batch Size Ratio
Stanisław Jastrzębski
Amos Storkey
Finding Flatter Minima with SGD
Stanisław Jastrzębski
Amos Storkey
SGD S MOOTHS THE S HARPEST D IRECTIONS
Stanisław Jastrzębski
Amos Storkey
Stochastic gradient descent (SGD) is able to find regions that generalize well, even in drastically over-parametrized models such as deep ne… (see more)ural networks. We observe that noise in SGD controls the spectral norm and conditioning of the Hessian throughout the training. We hypothesize the cause of this phenomenon is due to the dynamics of neurons saturating their non-linearity along the largest curvature directions, thus leading to improved conditioning.
LATTER M INIMA WITH SGD
Stanisław Jastrzębski
Amos Storkey
It has been discussed that over-parameterized deep neural networks (DNNs) trained using stochastic gradient descent (SGD) with smaller batch… (see more) sizes generalize better compared with those trained with larger batch sizes. Additionally, model parameters found by small batch size SGD tend to be in flatter regions. We extend these empirical observations and experimentally show that both large learning rate and small batch size contribute towards SGD finding flatter minima that generalize well. Conversely, we find that small learning rates and large batch sizes lead to sharper minima that correlate with poor generalization in DNNs.
LATTER M INIMA WITH SGD
Stanisław Jastrzębski
Amos Storkey
Three Factors Influencing Minima in SGD
Stanisław Jastrzębski
Amos Storkey
We study the statistical properties of the endpoint of stochastic gradient descent (SGD). We approximate SGD as a stochastic differential eq… (see more)uation (SDE) and consider its Boltzmann Gibbs equilibrium distribution under the assumption of isotropic variance in loss gradients.. Through this analysis, we find that three factors – learning rate, batch size and the variance of the loss gradients – control the trade-off between the depth and width of the minima found by SGD, with wider minima favoured by a higher ratio of learning rate to batch size. In the equilibrium distribution only the ratio of learning rate to batch size appears, implying that it’s invariant under a simultaneous rescaling of each by the same amount. We experimentally show how learning rate and batch size affect SGD from two perspectives: the endpoint of SGD and the dynamics that lead up to it. For the endpoint, the experiments suggest the endpoint of SGD is similar under simultaneous rescaling of batch size and learning rate, and also that a higher ratio leads to flatter minima, both findings are consistent with our theoretical analysis. We note experimentally that the dynamics also seem to be similar under the same rescaling of learning rate and batch size, which we explore showing that one can exchange batch size and learning rate in a cyclical learning rate schedule. Next, we illustrate how noise affects memorization, showing that high noise levels lead to better generalization. Finally, we find experimentally that the similarity under simultaneous rescaling of learning rate and batch size breaks down if the learning rate gets too large or the batch size gets too small.
A Dataset and Exploration of Models for Understanding Video Data through Fill-in-the-Blank Question-Answering
While deep convolutional neural networks frequently approach or exceed human-level performance in benchmark tasks involving static images, e… (see more)xtending this success to moving images is not straightforward. Video understanding is of interest for many applications, including content recommendation, prediction, summarization, event/object detection, and understanding human visual perception. However, many domains lack sufficient data to explore and perfect video models. In order to address the need for a simple, quantitative benchmark for developing and understanding video, we present MovieFIB, a fill-in-the-blank question-answering dataset with over 300,000 examples, based on descriptive video annotations for the visually impaired. In addition to presenting statistics and a description of the dataset, we perform a detailed analysis of 5 different models predictions, and compare these with human performance. We investigate the relative importance of language, static (2D) visual features, and moving (3D) visual features, the effects of increasing dataset size, the number of frames sampled, and of vocabulary size. We illustrate that: this task is not solvable by a language model alone, our model combining 2D and 3D visual information indeed provides the best result, all models perform significantly worse than human-level. We provide human evaluation for responses given by different models and find that accuracy on the MovieFIB evaluation corresponds well with human judgment. We suggest avenues for improving video models, and hope that the MovieFIB challenge can be useful for measuring and encouraging progress in this very interesting field.
A Closer Look at Memorization in Deep Networks
We examine the role of memorization in deep learning, drawing connections to capacity, generalization, and adversarial robustness. While dee… (see more)p networks are capable of memorizing noise data, our results suggest that they tend to prioritize learning simple patterns first. In our experiments, we expose qualitative differences in gradient-based optimization of deep neural networks (DNNs) on noise vs. real data. We also demonstrate that for appropriately tuned explicit regularization (e.g., dropout) we can degrade DNN training performance on noise datasets without compromising generalization on real data. Our analysis suggests that the notions of effective capacity which are dataset independent are unlikely to explain the generalization performance of deep networks when trained with gradient based methods because training data itself plays an important role in determining the degree of memorization.
A Closer Look at Memorization in Deep Networks
We examine the role of memorization in deep learning, drawing connections to capacity, generalization, and adversarial robustness. While dee… (see more)p networks are capable of memorizing noise data, our results suggest that they tend to prioritize learning simple patterns first. In our experiments, we expose qualitative differences in gradient-based optimization of deep neural networks (DNNs) on noise vs. real data. We also demonstrate that for appropriately tuned explicit regularization (e.g., dropout) we can degrade DNN training performance on noise datasets without compromising generalization on real data. Our analysis suggests that the notions of effective capacity which are dataset independent are unlikely to explain the generalization performance of deep networks when trained with gradient based methods because training data itself plays an important role in determining the degree of memorization.
Deep Nets Don't Learn via Memorization
We use empirical methods to argue that deep neural networks (DNNs) do not achieve their performance by memorizing training data in spite of … (see more)overlyexpressive model architectures. Instead, they learn a simple available hypothesis that fits the finite data samples. In support of this view, we establish that there are qualitative differences when learning noise vs. natural datasets, showing: (1) more capacity is needed to fit noise, (2) time to convergence is longer for random labels, but shorter for random inputs, and (3) that DNNs trained on real data examples learn simpler functions than when trained with noise data, as measured by the sharpness of the loss function at convergence. Finally, we demonstrate that for appropriately tuned explicit regularization, e.g. dropout, we can degrade DNN training performance on noise datasets without compromising generalization on real data.
Recurrent Batch Normalization
We propose a reparameterization of LSTM that brings the benefits of batch normalization to recurrent neural networks. Whereas previous works… (see more) only apply batch normalization to the input-to-hidden transformation of RNNs, we demonstrate that it is both possible and beneficial to batch-normalize the hidden-to-hidden transition, thereby reducing internal covariate shift between time steps. We evaluate our proposal on various sequential problems such as sequence classification, language modeling and question answering. Our empirical results show that our batch-normalized LSTM consistently leads to faster convergence and improved generalization.
Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations
We propose zoneout, a novel method for regularizing RNNs. At each timestep, zoneout stochastically forces some hidden units to maintain thei… (see more)r previous values. Like dropout, zoneout uses random noise to train a pseudo-ensemble, improving generalization. But by preserving instead of dropping hidden units, gradient information and state information are more readily propagated through time, as in feedforward stochastic depth networks. We perform an empirical investigation of various RNN regularizers, and find that zoneout gives significant performance improvements across tasks. We achieve competitive results with relatively simple models in character- and word-level language modelling on the Penn Treebank and Text8 datasets, and combining with recurrent batch normalization yields state-of-the-art results on permuted sequential MNIST.