Portrait de Mohammad Pezeshki n'est pas disponible

Mohammad Pezeshki

Collaborateur·rice de recherche
Superviseur⋅e principal⋅e
Co-supervisor
Sujets de recherche
Théorie de l'apprentissage automatique

Publications

Discovering environments with XRM
Diane Bouchacourt
Mark Ibrahim
Nicolas Ballas
David Lopez-Paz
Successful out-of-distribution generalization requires environment annotations. Unfortunately, these are resource-intensive to obtain, and t… (voir plus)heir relevance to model performance is limited by the expectations and perceptual biases of human annotators. Therefore, to enable robust AI systems across applications, we must develop algorithms to automatically discover environments inducing broad generalization. Current proposals, which divide examples based on their training error, suffer from one fundamental problem. These methods add hyper-parameters and early-stopping criteria that are impossible to tune without a validation set with human-annotated environments, the very information subject to discovery. In this paper, we propose Cross-Risk-Minimization (XRM) to address this issue. XRM trains two twin networks, each learning from one random half of the training data, while imitating confident held-out mistakes made by its sibling. XRM provides a recipe for hyper-parameter tuning, does not require early-stopping, and can discover environments for all training and validation data. Domain generalization algorithms built on top of XRM environments achieve oracle worst-group-accuracy, solving a long-standing problem in out-of-distribution generalization.
Multi-scale Feature Learning Dynamics: Insights for Double Descent
A key challenge in building theoretical foundations for deep learning is the complex optimization dynamics of neural networks, resulting fro… (voir plus)m the high-dimensional interactions between the large number of network parameters. Such non-trivial interactions lead to intriguing model behaviors such as the phenomenon of "double descent" of the generalization error. The more commonly studied aspect of this phenomenon corresponds to model-wise double descent where the test error exhibits a second descent with increasing model complexity, beyond the classical U-shaped error curve. In this work, we investigate the origins of the less studied epoch-wise double descent in which the test error undergoes two non-monotonous transitions, or descents as the training time increases. We study a linear teacher-student setup exhibiting epoch-wise double descent similar to that in deep neural networks. In this setting, we derive closed-form analytical expressions for the evolution of generalization error over training. We find that double descent can be attributed to distinct features being learned at different scales: as fast-learning features overfit, slower-learning features start to fit, resulting in a second descent in test error. We validate our findings through numerical experiments where our theory accurately predicts empirical findings and remains consistent with observations in deep neural networks.
Gradient Starvation: A Learning Proclivity in Neural Networks
We identify and formalize a fundamental gradient descent phenomenon resulting in a learning proclivity in over-parameterized neural networks… (voir plus). Gradient Starvation arises when cross-entropy loss is minimized by capturing only a subset of features relevant for the task, despite the presence of other predictive features that fail to be discovered. This work provides a theoretical explanation for the emergence of such feature imbalance in neural networks. Using tools from Dynamical Systems theory, we identify simple properties of learning dynamics during gradient descent that lead to this imbalance, and prove that such a situation can be expected given certain statistical structure in training data. Based on our proposed formalism, we develop guarantees for a novel regularization method aimed at decoupling feature learning dynamics, improving accuracy and robustness in cases hindered by gradient starvation. We illustrate our findings with simple and real-world out-of-distribution (OOD) generalization experiments.
Gradient Starvation: A Learning Proclivity in Neural Networks
We identify and formalize a fundamental gradient descent phenomenon resulting in a learning proclivity in over-parameterized neural networks… (voir plus). Gradient Starvation arises when cross-entropy loss is minimized by capturing only a subset of features relevant for the task, despite the presence of other predictive features that fail to be discovered. This work provides a theoretical explanation for the emergence of such feature imbalance in neural networks. Using tools from Dynamical Systems theory, we identify simple properties of learning dynamics during gradient descent that lead to this imbalance, and prove that such a situation can be expected given certain statistical structure in training data. Based on our proposed formalism, we develop guarantees for a novel regularization method aimed at decoupling feature learning dynamics, improving accuracy and robustness in cases hindered by gradient starvation. We illustrate our findings with simple and real-world out-of-distribution (OOD) generalization experiments.
Negative Momentum for Improved Game Dynamics
Games generalize the single-objective optimization paradigm by introducing different objective functions for different players. Differentiab… (voir plus)le games often proceed by simultaneous or alternating gradient updates. In machine learning, games are gaining new importance through formulations like generative adversarial networks (GANs) and actor-critic systems. However, compared to single-objective optimization, game dynamics are more complex and less understood. In this paper, we analyze gradient-based methods with momentum on simple games. We prove that alternating updates are more stable than simultaneous updates. Next, we show both theoretically and empirically that alternating gradient updates with a negative momentum term achieves convergence in a difficult toy adversarial problem, but also on the notoriously difficult to train saturating GANs.
On the Learning Dynamics of Deep Neural Networks
Remi Tachet des Combes
Samira Shabanian
While a lot of progress has been made in recent years, the dynamics of learning in deep nonlinear neural networks remain to this day largely… (voir plus) misunderstood. In this work, we study the case of binary classification and prove various properties of learning in such networks under strong assumptions such as linear separability of the data. Extending existing results from the linear case, we confirm empirical observations by proving that the classification error also follows a sigmoidal shape in nonlinear architectures. We show that given proper initialization, learning expounds parallel independent modes and that certain regions of parameter space might lead to failed training. We also demonstrate that input norm and features' frequency in the dataset lead to distinct convergence speeds which might shed some light on the generalization capabilities of deep neural networks. We provide a comparison between the dynamics of learning with cross-entropy and hinge losses, which could prove useful to understand recent progress in the training of generative adversarial networks. Finally, we identify a phenomenon that we baptize \textit{gradient starvation} where the most frequent features in a dataset prevent the learning of other less frequent but equally informative features.
Negative Momentum for Improved Game Dynamics
Games generalize the single-objective optimization paradigm by introducing different objective functions for different players. Differentiab… (voir plus)le games often proceed by simultaneous or alternating gradient updates. In machine learning, games are gaining new importance through formulations like generative adversarial networks (GANs) and actor-critic systems. However, compared to single-objective optimization, game dynamics are more complex and less understood. In this paper, we analyze gradient-based methods with momentum on simple games. We prove that alternating updates are more stable than simultaneous updates. Next, we show both theoretically and empirically that alternating gradient updates with a negative momentum term achieves convergence in a difficult toy adversarial problem, but also on the notoriously difficult to train saturating GANs.
Towards End-to-End Speech Recognition with Deep Convolutional Neural Networks
Ying Zhang
Philemon Brakel
Saizheng Zhang
César Laurent
Convolutional Neural Networks (CNNs) are effective models for reducing spectral variations and modeling spectral correlations in acoustic fe… (voir plus)atures for automatic speech recognition (ASR). Hybrid speech recognition systems incorporating CNNs with Hidden Markov Models/Gaussian Mixture Models (HMMs/GMMs) have achieved the state-of-the-art in various benchmarks. Meanwhile, Connectionist Temporal Classification (CTC) with Recurrent Neural Networks (RNNs), which is proposed for labeling unsegmented sequences, makes it feasible to train an end-to-end speech recognition system instead of hybrid settings. However, RNNs are computationally expensive and sometimes difficult to train. In this paper, inspired by the advantages of both CNNs and the CTC approach, we propose an end-to-end speech framework for sequence labeling, by combining hierarchical CNNs with CTC directly without recurrent connections. By evaluating the approach on the TIMIT phoneme recognition task, we show that the proposed model is not only computationally efficient, but also competitive with the existing baseline systems. Moreover, we argue that CNNs have the capability to model temporal correlations with appropriate context information.
Deconstructing the Ladder Network Architecture
Linxi Fan
Philemon Brakel
The Ladder Network is a recent new approach to semi-supervised learning that turned out to be very successful. While showing impressive perf… (voir plus)ormance, the Ladder Network has many components intertwined, whose contributions are not obvious in such a complex architecture. This paper presents an extensive experimental investigation of variants of the Ladder Network in which we replaced or removed individual components to learn about their relative importance. For semi-supervised tasks, we conclude that the most important contribution is made by the lateral connections, followed by the application of noise, and the choice of what we refer to as the 'combinator function'. As the number of labeled training examples increases, the lateral connections and the reconstruction criterion become less important, with most of the generalization improvement coming from the injection of noise in each layer. Finally, we introduce a combinator function that reduces test error rates on Permutation-Invariant MNIST to 0.57% for the supervised setting, and to 0.97% and 1.0% for semi-supervised settings with 1000 and 100 labeled examples, respectively.
Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations
J'anos Kram'ar
Nicolas Ballas
Nan Rosemary Ke
Anirudh Goyal
We propose zoneout, a novel method for regularizing RNNs. At each timestep, zoneout stochastically forces some hidden units to maintain thei… (voir plus)r previous values. Like dropout, zoneout uses random noise to train a pseudo-ensemble, improving generalization. But by preserving instead of dropping hidden units, gradient information and state information are more readily propagated through time, as in feedforward stochastic depth networks. We perform an empirical investigation of various RNN regularizers, and find that zoneout gives significant performance improvements across tasks. We achieve competitive results with relatively simple models in character- and word-level language modelling on the Penn Treebank and Text8 datasets, and combining with recurrent batch normalization yields state-of-the-art results on permuted sequential MNIST.
Theano: A Python framework for fast computation of mathematical expressions
Rami Al-rfou'
Amjad Almahairi
Christof Angermüller
Nicolas Ballas
Frédéric Bastien
Justin S. Bayer
A. Belikov
A. Belopolsky
James Bergstra
Valentin Bisson
Josh Bleecher Snyder
Nicolas Bouchard
Nicolas Boulanger-Lewandowski
Alexandre De Brébisson
Kyunghyun Cho
Jan Chorowski
Paul F. Christiano
Tim Cooijmans
Marc-Alexandre Côté
Myriam Côté
Yann Dauphin
Olivier Delalleau
Julien Demouth
Guillaume Desjardins
Sander Dieleman
Laurent Dinh
M'elanie Ducoffe
Vincent Dumoulin
Dumitru Erhan
Ziye Fan
Orhan Firat
Mathieu Germain
Xavier Glorot
Ian G Goodfellow
Matthew Graham
Caglar Gulcehre
Philippe Hamel
Iban Harlouchet
Jean-philippe Heng
Balázs Hidasi
Sina Honari
Arjun Jain
Sébastien Jean
Kai Jia
Mikhail V. Korobov
Vivek Kulkarni
Alex Lamb
Pascal Lamblin
Eric Larsen
César Laurent
S. Lee
Simon-mark Lefrancois
Simon Lemieux
Nicholas Léonard
Zhouhan Lin
J. Livezey
Cory R. Lorenz
Jeremiah L. Lowin
Qianli M. Ma
Pierre-Antoine Manzagol
Olivier Mastropietro
R. McGibbon
Roland Memisevic
Bart van Merriënboer
Mehdi Mirza
Alberto Orlandi
Colin Raffel
Daniel Renshaw
Matthew David Rocklin
Markus Dr. Roth
Peter Sadowski
John Salvatier
François Savard
Jan Schlüter
John D. Schulman
Gabriel Schwartz
Iulian V. Serban
Dmitriy Serdyuk
Samira Shabanian
Etienne Simon
Sigurd Spieckermann
S. Subramanyam
Jakub Sygnowski
Jérémie Tanguay
Gijs van Tulder
Joseph Turian
Sebastian Urban
Francesco Visin
Harm de Vries
David Warde-Farley
Dustin J. Webb
M. Willson
Kelvin Xu
Lijun Xue
Li Yao
Saizheng Zhang
Ying Zhang
Theano is a Python library that allows to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficie… (voir plus)ntly. Since its introduction, it has been one of the most used CPU and GPU mathematical compilers - especially in the machine learning community - and has shown steady performance improvements. Theano is being actively and continuously developed since 2008, multiple frameworks have been built on top of it and it has been used to produce many state-of-the-art machine learning models. The present article is structured as follows. Section I provides an overview of the Theano software and its community. Section II presents the principal features of Theano and how to use them, and compares them with other similar projects. Section III focuses on recently-introduced functionalities and improvements. Section IV compares the performance of Theano against Torch7 and TensorFlow on several machine learning models. Section V discusses current limitations of Theano and potential ways of improving it.
Theano: A Python framework for fast computation of mathematical expressions
Rami Al-rfou'
Amjad Almahairi
Christof Angermüller
Nicolas Ballas
Frédéric Bastien
Justin S. Bayer
A. Belikov
A. Belopolsky
J. Bergstra
Valentin Bisson
Josh Bleecher Snyder
Nicolas Bouchard
Nicolas Boulanger-Lewandowski
Alexandre De Brébisson
Kyunghyun Cho
Jan Chorowski
Paul F. Christiano
Tim Cooijmans
Marc-Alexandre Côté
Myriam Côté
Yann Dauphin
Olivier Delalleau
Julien Demouth
Guillaume Desjardins
Sander Dieleman
Laurent Dinh
M'elanie Ducoffe
Vincent Dumoulin
Dumitru Erhan
Ziye Fan
Orhan Firat
Mathieu Germain
Xavier Glorot
Ian J. Goodfellow
Matthew Graham
Caglar Gulcehre
Philippe Hamel
Iban Harlouchet
Jean-philippe Heng
Balázs Hidasi
Sina Honari
Arjun Jain
S'ebastien Jean
Kai Jia
Mikhail V. Korobov
Vivek Kulkarni
Alex Lamb
Pascal Lamblin
Eric P. Larsen
César Laurent
S. Lee
Simon-mark Lefrancois
Simon Lemieux
Nicholas Léonard
Zhouhan Lin
J. Livezey
Cory R. Lorenz
Jeremiah L. Lowin
Qianli M. Ma
Pierre-Antoine Manzagol
Olivier Mastropietro
R. McGibbon
Roland Memisevic
Bart van Merriënboer
Mehdi Mirza
Alberto Orlandi
Colin Raffel
Daniel Renshaw
Matthew David Rocklin
Markus Dr. Roth
Peter Sadowski
John Salvatier
François Savard
Jan Schlüter
John D. Schulman
Gabriel Schwartz
Iulian V. Serban
Dmitriy Serdyuk
Samira Shabanian
Etienne Simon
Sigurd Spieckermann
S. Subramanyam
Jakub Sygnowski
Jérémie Tanguay
Gijs van Tulder
Joseph P. Turian
Sebastian Urban
Francesco Visin
Harm de Vries
David Warde-Farley
Dustin J. Webb
M. Willson
Kelvin Xu
Lijun Xue
Li Yao
Saizheng Zhang
Ying Zhang
Theano is a Python library that allows to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficie… (voir plus)ntly. Since its introduction, it has been one of the most used CPU and GPU mathematical compilers - especially in the machine learning community - and has shown steady performance improvements. Theano is being actively and continuously developed since 2008, multiple frameworks have been built on top of it and it has been used to produce many state-of-the-art machine learning models. The present article is structured as follows. Section I provides an overview of the Theano software and its community. Section II presents the principal features of Theano and how to use them, and compares them with other similar projects. Section III focuses on recently-introduced functionalities and improvements. Section IV compares the performance of Theano against Torch7 and TensorFlow on several machine learning models. Section V discusses current limitations of Theano and potential ways of improving it.