Understanding the difficulty of training deep feedforward neural networks
Whereas before 2006 it appears that deep multi-layer neural networks were not successfully trained, since then several algorithms have been shown to successfully train them, with experimental results showing the superiority of deeper vs less deep architectures. All these experimental results were obtained with new initialization or training mechanisms. Our objective here is to understand better why standard gradient descent from random initialization performs so poorly with deep neural networks, to better understand these recent relative successes and help design better algorithms in the future. We first observe the influence of the non-linear activation functions. We find that the logistic sigmoid activation is unsuited for deep networks with random initialization because of its mean value, which can drive the top hidden layer in particular into saturation. Surprisingly, we find that saturated units can move out of saturation by themselves, albeit slowly, which explains the plateaus sometimes seen when training neural networks. We find that a new non-linearity that saturates less can often be beneficial. Finally, we study how activations and gradients vary across layers and during training, with the idea that training may be more difficult when the singular values of the Jacobian associated with each layer are far from 1. Based on these considerations, we propose a new initialization scheme that brings substantially faster convergence.
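The abstract does not state the formula for the proposed initialization. As a rough illustration, the sketch below assumes it is the normalized uniform scheme commonly called Glorot or Xavier initialization, which draws weights from U[-sqrt(6 / (n_in + n_out)), +sqrt(6 / (n_in + n_out))], and compares it with a common heuristic, U[-1/sqrt(n_in), +1/sqrt(n_in)]. The function names, layer width, depth, and the choice of tanh units are all assumptions made for this example, not the experimental setup behind the results.

```python
import numpy as np


def normalized_init(fan_in, fan_out, rng):
    """Assumed form of the normalized scheme: U[-a, a], a = sqrt(6 / (fan_in + fan_out))."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))


def standard_init(fan_in, fan_out, rng):
    """Common heuristic used as a baseline: U[-a, a], a = 1 / sqrt(fan_in)."""
    limit = 1.0 / np.sqrt(fan_in)
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))


def layer_activation_std(init_fn, n_layers=5, width=256, n_samples=512, seed=0):
    """Push Gaussian inputs through a stack of tanh layers (no biases, no training)
    and return the standard deviation of the activations after each layer."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal((n_samples, width))
    stds = []
    for _ in range(n_layers):
        W = init_fn(width, width, rng)
        h = np.tanh(h @ W)
        stds.append(float(h.std()))
    return stds


if __name__ == "__main__":
    for name, fn in [("standard", standard_init), ("normalized", normalized_init)]:
        stds = layer_activation_std(fn)
        print(name, [round(s, 3) for s in stds])
```

In this toy setting the baseline weights have variance 1/(3 n_in), so the activation scale shrinks from layer to layer, while the normalized weights have variance 2/(n_in + n_out) and keep the signal scale much more stable with depth. This is only a forward-pass check at initialization and says nothing by itself about convergence speed.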