Damien Scieur

Understanding Adam Requires Better Rotation Dependent Assumptions

Lucas Maes

Tianyue H. Zhang

Alexia Jolicoeur-Martineau

Ioannis Mitliagkas

Damien Scieur

Simon Lacoste-Julien

Charles Guille-Escuret

Despite its widespread adoption, Adam's advantage over Stochastic Gradient Descent (SGD) lacks a comprehensive theoretical explanation. This paper investigates Adam's sensitivity to rotations of the parameter space. We demonstrate that Adam's performance in training transformers degrades under random rotations of the parameter space, indicating a crucial sensitivity to the choice of basis. This reveals that conventional rotation-invariant assumptions are insufficient to capture Adam's advantages theoretically. To better understand the rotation-dependent properties that benefit Adam, we also identify structured rotations that preserve or even enhance its empirical performance. We then examine the rotation-dependent assumptions in the literature, evaluating their adequacy in explaining Adam's behavior across various rotation types. This work highlights the need for new, rotation-dependent theoretical frameworks to fully understand Adam's empirical success in modern machine learning tasks.

2024-10-25

ArXiv (preprint)

doi.org

arxiv.org
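
The basis sensitivity described in the abstract above can be illustrated on a toy problem. The following is a minimal sketch, not the paper's experiment: it assumes a hypothetical 2D quadratic, a fixed 45-degree rotation, and hand-written gradient descent and Adam loops (all function names and hyperparameters are illustrative). It runs each optimizer once in the original coordinates and once in rotated coordinates, then measures how far the two trajectories drift apart after mapping one onto the other. Plain gradient descent commutes with the rotation, while Adam's coordinate-wise normalization does not.

```python
import math


def rotate(theta, v):
    """Apply a 2D rotation by angle theta to the vector v."""
    c, s = math.cos(theta), math.sin(theta)
    return [c * v[0] - s * v[1], s * v[0] + c * v[1]]


def grad_f(x, scales=(1.0, 100.0)):
    """Gradient of the toy loss f(x) = 0.5 * (scales[0]*x0^2 + scales[1]*x1^2)."""
    return [scales[0] * x[0], scales[1] * x[1]]


def make_rotated_grad(theta):
    """Gradient of g(y) = f(R^T y), which equals R grad_f(R^T y)."""
    def grad_g(y):
        return rotate(theta, grad_f(rotate(-theta, y)))
    return grad_g


def run_gd(grad, x0, lr=0.005, steps=20):
    """Plain gradient descent; returns the list of iterates."""
    x, path = list(x0), []
    for _ in range(steps):
        g = grad(x)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
        path.append(x)
    return path


def run_adam(grad, x0, lr=0.05, b1=0.9, b2=0.999, eps=1e-8, steps=20):
    """Standard Adam with bias correction; returns the list of iterates."""
    x, path = list(x0), []
    m, v = [0.0, 0.0], [0.0, 0.0]
    for t in range(1, steps + 1):
        g = grad(x)
        m = [b1 * mi + (1 - b1) * gi for mi, gi in zip(m, g)]
        v = [b2 * vi + (1 - b2) * gi * gi for vi, gi in zip(v, g)]
        m_hat = [mi / (1 - b1 ** t) for mi in m]
        v_hat = [vi / (1 - b2 ** t) for vi in v]
        x = [xi - lr * mh / (math.sqrt(vh) + eps)
             for xi, mh, vh in zip(x, m_hat, v_hat)]
        path.append(x)
    return path


def equivariance_gap(run, theta=math.pi / 4, x0=(1.0, 1.0)):
    """Largest coordinate gap between the rotated-problem iterates and the
    rotated original-problem iterates, taken over the whole trajectory."""
    original = run(grad_f, x0)
    rotated = run(make_rotated_grad(theta), rotate(theta, x0))
    return max(
        abs(a - b)
        for x, y in zip(original, rotated)
        for a, b in zip(rotate(theta, x), y)
    )


gd_gap = equivariance_gap(run_gd)      # close to floating-point noise
adam_gap = equivariance_gap(run_adam)  # clearly nonzero
print(f"GD gap: {gd_gap:.2e}, Adam gap: {adam_gap:.2e}")
```

A near-zero gap for gradient descent and a visible gap for Adam only shows that Adam's iterates depend on the basis; whether a given rotation helps or hurts on a real model is the empirical question the paper studies.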

Understanding Adam Requires Better Rotation Dependent Assumptions

Lucas Maes

Tianyue H. Zhang

Alexia Jolicoeur-Martineau

Ioannis Mitliagkas

Damien Scieur

Simon Lacoste-Julien

Charles Guille-Escuret

Despite its widespread adoption, Adam's advantage over Stochastic Gradient Descent (SGD) lacks a comprehensive theoretical explanation. This paper investigates Adam's sensitivity to rotations of the parameter space. We demonstrate that Adam's performance in training transformers degrades under random rotations of the parameter space, indicating a crucial sensitivity to the choice of basis. This reveals that conventional rotation-invariant assumptions are insufficient to capture Adam's advantages theoretically. To better understand the rotation-dependent properties that benefit Adam, we also identify structured rotations that preserve or even enhance its empirical performance. We then examine the rotation-dependent assumptions in the literature, evaluating their adequacy in explaining Adam's behavior across various rotation types. This work highlights the need for new, rotation-dependent theoretical frameworks to fully understand Adam's empirical success in modern machine learning tasks.

2024-10-10

NeurIPS.cc/2024/Workshop/OPT (published)

doi.org

openreview.net

Only Tails Matter: Average-Case Universality and Robustness in the Convex Regime

Leonardo Cunha

Gauthier Gidel

Fabian Pedregosa

Damien Scieur

Courtney Paquette

2022-06-28

Proceedings of the 39th International Conference on Machine Learning (published)

doi.org

openreview.net

Only Tails Matter: Average-Case Universality and Robustness in the Convex Regime

Leonardo Cunha

Gauthier Gidel

Fabian Pedregosa

Damien Scieur

Courtney Paquette

2022-06-20

ArXiv (preprint)

doi.org

openreview.net

Accelerating Smooth Games by Manipulating Spectral Shapes

Waiss Azizian

2020-06-03

Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics (published)

proceedings.mlr.press

arxiv.org

Accelerating Smooth Games by Manipulating Spectral Shapes

Waiss Azizian

We use matrix iteration theory to characterize acceleration in smooth games. We define the spectral shape of a family of games as the set containing all eigenvalues of the Jacobians of standard gradient dynamics in the family. Shapes restricted to the real line represent well-understood classes of problems, like minimization. Shapes spanning the complex plane capture the added numerical challenges in solving smooth games. In this framework, we describe gradient-based methods, such as extragradient, as transformations on the spectral shape. Using this perspective, we propose an optimal algorithm for bilinear games. For smooth and strongly monotone operators, we identify a continuum between convex minimization, where acceleration is possible using Polyak's momentum, and the worst case where gradient descent is optimal. Finally, going beyond first-order methods, we propose an accelerated version of consensus optimization.

2020-01-02

ArXiv (preprint)

arxiv.org
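
The contrast the abstract above draws between real and complex spectra can be seen on two tiny examples. This is a minimal sketch under stated assumptions, not the paper's algorithm: it uses a hypothetical 2D quadratic and the scalar bilinear game min_x max_y x*y, and the helper names `eig_2x2`, `gd_factor`, and `eg_factor` are illustrative. For a linear vector field with Jacobian eigenvalue lam, one gradient step multiplies the corresponding component by 1 - eta*lam, while one extragradient step multiplies it by 1 - eta*lam*(1 - eta*lam).

```python
import cmath


def eig_2x2(a, b, c, d):
    """Eigenvalues of the matrix [[a, b], [c, d]] via the quadratic formula."""
    tr = a + d
    det = a * d - b * c
    disc = cmath.sqrt(tr * tr - 4 * det)
    return ((tr + disc) / 2, (tr - disc) / 2)


# Minimization of f(x, y) = 0.5 * (x^2 + 3*y^2): the Jacobian of the
# gradient field is the Hessian diag(1, 3), so the spectrum is real.
min_eigs = eig_2x2(1.0, 0.0, 0.0, 3.0)

# Bilinear game min_x max_y x*y: the gradient dynamics follow the field
# (y, -x), whose Jacobian [[0, 1], [-1, 0]] has purely imaginary eigenvalues.
game_eigs = eig_2x2(0.0, 1.0, -1.0, 0.0)


def gd_factor(lam, eta):
    """Per-step contraction factor of a gradient step for eigenvalue lam."""
    return abs(1 - eta * lam)


def eg_factor(lam, eta):
    """Per-step contraction factor of an extragradient step for eigenvalue lam."""
    return abs(1 - eta * lam * (1 - eta * lam))


eta = 0.5  # illustrative step size
lam = game_eigs[0]
print("minimization spectrum:", min_eigs)
print("bilinear game spectrum:", game_eigs)
print("gradient step factor:", gd_factor(lam, eta))       # above 1: diverges
print("extragradient step factor:", eg_factor(lam, eta))  # below 1: converges
```

With this step size the plain gradient step expands the iterates on the bilinear game while the extragradient step contracts them, which is one concrete way to read "transformations on the spectral shape".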
