Width of Minima Reached by Stochastic Gradient Descent is Influenced by Learning Rate to Batch Size Ratio
Stanisław Jastrzębski
Zac Kenton
Devansh Arpit
Nicolas Ballas
Asja Fischer
Amos Storkey
Exploring Uncertainty Measures in Deep Networks for Multiple Sclerosis Lesion Detection and Segmentation
Tanya Nair
Douglas Arnold
How can deep learning advance computational modeling of sensory information processing?
Jessica A.F. Thompson
Elia Formisano
Marc Schönwiesner
Deep learning, computational neuroscience, and cognitive science have overlapping goals related to understanding intelligence such that perception and behaviour can be simulated in computational systems. In neuroimaging, machine learning methods have been used to test computational models of sensory information processing. Recently, these model comparison techniques have been used to evaluate deep neural networks (DNNs) as models of sensory information processing. However, the interpretation of such model evaluations is muddied by imprecise statistical conclusions. Here, we make explicit the types of conclusions that can be drawn from these existing model comparison techniques and how these conclusions change when the model in question is a DNN. We discuss how DNNs are amenable to new model comparison techniques that allow for stronger conclusions to be made about the computational mechanisms underlying sensory information processing.
On the Learning Dynamics of Deep Neural Networks
Remi Tachet des Combes
Mohammad Pezeshki
Samira Shabanian
While a lot of progress has been made in recent years, the dynamics of learning in deep nonlinear neural networks remain to this day largely misunderstood. In this work, we study the case of binary classification and prove various properties of learning in such networks under strong assumptions such as linear separability of the data. Extending existing results from the linear case, we confirm empirical observations by proving that the classification error also follows a sigmoidal shape in nonlinear architectures. We show that given proper initialization, learning expounds parallel independent modes and that certain regions of parameter space might lead to failed training. We also demonstrate that input norm and features' frequency in the dataset lead to distinct convergence speeds which might shed some light on the generalization capabilities of deep neural networks. We provide a comparison between the dynamics of learning with cross-entropy and hinge losses, which could prove useful to understand recent progress in the training of generative adversarial networks. Finally, we identify a phenomenon that we baptize "gradient starvation", where the most frequent features in a dataset prevent the learning of other less frequent but equally informative features.
CNN Prediction of Future Disease Activity for Multiple Sclerosis Patients from Baseline MRI and Lesion Labels
Nazanin Mohammadi Sepahvand
Tal Hassner
Douglas Arnold
3D U-Net for Brain Tumour Segmentation
Raghav Mehta
How to Exploit Weaknesses in Biomedical Challenge Design and Organization
Annika Reinke
Matthias Eisenmann
Sinan Onogur
Marko Stankovic
Patrick Scholz
Peter M. Full
Hrvoje Bogunovic
Bennett Landman
Oskar Maier
Bjoern Menze
Gregory C. Sharp
Korsuk Sirinukunwattana
Stefanie Speidel
F. V. D. Sommen
Guoyan Zheng
Henning Müller
Michal Kozubek
Andrew P. Bradley
Pierre Jannin … (see 2 more)
Annette Kopp-Schneider
Lena Maier-Hein
RS-Net: Regression-Segmentation 3D CNN for Synthesis of Full Resolution Missing Brain MRI in the Presence of Tumours
Raghav Mehta
Social-Affiliation Networks: Patterns and the SOAR Model
Dhivya Eswaran
Artur Dubrawski
Christos Faloutsos
Ghost Units Yield Biologically Plausible Backprop in Deep Neural Networks
Thomas Mesnard
Gaëtan Vignoud
João Sacramento
Walter Senn
Quaternion Convolutional Neural Networks for End-to-End Automatic Speech Recognition
Titouan Parcollet
Ying Zhang
Mohamed Morchid
Chiheb Trabelsi
Georges Linarès
Renato De Mori
Recently, the connectionist temporal classification (CTC) model coupled with recurrent (RNN) or convolutional neural networks (CNN) made it easier to train speech recognition systems in an end-to-end fashion. However, in real-valued models, time frame components such as mel-filter-bank energies and the cepstral coefficients obtained from them, together with their first and second order derivatives, are processed as individual elements, while a natural alternative is to process such components as composed entities. We propose to group such elements in the form of quaternions and to process these quaternions using the established quaternion algebra. Quaternion numbers and quaternion neural networks have shown their efficiency to process multidimensional inputs as entities, to encode internal dependencies, and to solve many tasks with fewer learning parameters than real-valued models. This paper proposes to integrate multiple feature views in a quaternion-valued convolutional neural network (QCNN), to be used for sequence-to-sequence mapping with the CTC model. Promising results are reported using simple QCNNs in phoneme recognition experiments with the TIMIT corpus. More precisely, QCNNs obtain a lower phoneme error rate (PER) with fewer learning parameters than a competing model based on real-valued CNNs.
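The quaternion algebra mentioned in this abstract rests on the Hamilton product, which can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `hamilton_product`, the tuple layout (real, i, j, k), and the way one acoustic feature and its derivatives are packed into a quaternion are all assumptions made for the example.

```python
def hamilton_product(p, q):
    """Hamilton product of two quaternions given as (real, i, j, k) tuples."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,  # real part
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,  # i component
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,  # j component
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,  # k component
    )


# Hypothetical packing: one filter-bank energy with its first and second
# derivatives placed in the three imaginary components (an assumption).
energy, delta, delta_delta = 0.5, -0.1, 0.02
feature = (0.0, energy, delta, delta_delta)

# A single quaternion weight mixes all components of the feature at once.
weight = (0.3, 0.1, -0.2, 0.4)
activation = hamilton_product(weight, feature)
```

Because one quaternion weight holds four real numbers yet mixes four input components into four output components, a quaternion layer needs roughly a quarter of the parameters of a real-valued layer of the same shape, which is consistent with the parameter savings the abstract reports.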
Twin Regularization for online speech recognition
Dmitriy Serdyuk
Online speech recognition is crucial for developing natural human-machine interfaces. This modality, however, is significantly more challenging than off-line ASR, since real-time/low-latency constraints inevitably hinder the use of future information, which is known to be very helpful for making robust predictions. A popular solution to mitigate this issue consists of feeding neural acoustic models with context windows that gather some future frames. This introduces a latency which depends on the number of employed look-ahead features. This paper explores a different approach, based on estimating the future rather than waiting for it. Our technique encourages the hidden representations of a unidirectional recurrent network to embed some useful information about the future. Inspired by a recently proposed technique called Twin Networks, we add a regularization term that forces forward hidden states to be as close as possible to cotemporal backward ones, computed by a "twin" neural network running backwards in time. The experiments, conducted on a number of datasets, recurrent architectures, input features, and acoustic conditions, have shown the effectiveness of this approach. One important advantage is that our method does not introduce any additional computation at test time compared to standard unidirectional recurrent networks.
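The regularization term described in this abstract can be sketched as a penalty on the distance between forward and backward hidden states at the same time step. This is a rough illustration under stated assumptions, not the paper's exact formulation: the squared Euclidean distance, the averaging over time steps, and the names `twin_penalty`, `task_loss`, and `lam` are all hypothetical choices made for the example.

```python
def twin_penalty(forward_states, backward_states):
    """Mean squared Euclidean distance between cotemporal hidden states.

    Both arguments are lists of equal length, one hidden vector
    (a list of floats) per time step.
    """
    if len(forward_states) != len(backward_states):
        raise ValueError("state sequences must have the same length")
    total = 0.0
    for h_fwd, h_bwd in zip(forward_states, backward_states):
        total += sum((f - b) ** 2 for f, b in zip(h_fwd, h_bwd))
    return total / len(forward_states)


# Toy hidden states for a two-step sequence (illustrative values only).
forward = [[0.0, 0.0], [1.0, 1.0]]
backward = [[3.0, 4.0], [1.0, 1.0]]

task_loss = 1.2  # hypothetical recognition loss for this utterance
lam = 0.1        # hypothetical regularization weight
total_loss = task_loss + lam * twin_penalty(forward, backward)
```

The backward states are needed only to compute this penalty during training, so the forward network runs alone at test time, which matches the abstract's point about adding no test-time computation.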