Elliot Paquette

Phases of Muon: When Muon Eclipses SignSGD

Lucas Benigni

Atish Agarwala

Recently, Muon and related spectral optimizers have demonstrated strong empirical performance as scalable stochastic methods, often outperforming Adam. Yet their behaviour remains poorly understood. We analyze stochastic spectral optimizers, including Muon, on a high-dimensional matrix-valued least squares problem. We derive explicit deterministic dynamics that provide a tractable framework for studying learning behaviour with a focus on (stochastic) SignSVD, which Muon approximates, and (stochastic) SignSGD, the latter serving as a proxy for Adam. Our analysis shows that for large batch size, SignSVD performs a square-root preconditioning with respect to the data covariance spectrum, while for small batch size smaller eigenmodes behave like SGD, slowing down convergence. We contrast with SignSGD which for generic covariance performs no preconditioning and has no transition, leading to different optimal learning rates and convergence characteristics. The two methods match up to a constant factor with isotropic data, but behave differently with anisotropic data. An analysis of a power law covariance model with data exponent

2026-05-09

arXiv (preprint)

Spectral Lens: Activation and Gradient Spectra as Diagnostics of LLM Optimization

Andy Zeyi Liu

John Sous

Training loss and throughput can hide distinct internal representation in language-model training. To examine these hidden mechanics, we use spectral measurements as practical and operational diagnostics. Using a controlled family of decoder-only models adapted from the modded NanoGPT codebase, we introduce an empirical protocol based on activation covariance and per-sample gradient SVD spectra. This dual-view reveals three empirical findings and one mechanistic explanation. First, batch size acts as a latent determinant of representation geometry: runs that reach equal loss settle into systematically distinct activation spectra. Second, the activation covariance tail measured early in training reliably forecasts downstream token efficiency. Third, movement of the activation spectrum head (leading modes), together with gradient spectra, characterizes underlying learning-dynamics changes, separating learning-side architectural improvements from primarily execution-side gains. These predictive and diagnostic signals persist across the 12-, 36-, and 48-layer model tiers. Finally, a mechanistic model proves the main observations and explains how activation covariance spectra correlate with task-aligned feature learning.

2026-05-06

arXiv (preprint)

Power-Law Spectrum of the Random Feature Model

Ke Liang Xiao

Yizhe Zhu

Scaling laws for neural networks, in which the loss decays as a power-law in the number of parameters, data, and compute, depend fundamentally on the spectral structure of the data covariance, with power-law eigenvalue decay appearing ubiquitously in vision and language tasks. A central question is whether this spectral structure is preserved or destroyed when data passes through the basic building block of a neural network: a random linear projection followed by a nonlinear activation. We study this question for the random feature model: given data

2026-03-14

arXiv (preprint)

Anisotropic local law for non-separable sample covariance matrices

Fan Zhou

Renyuan Ma

Zhichao Wang

Zhou Fan

We establish local laws for sample covariance matrices …

2026-02-19

arXiv (preprint)

Logarithmic-time Schedules for Scaling Language Models with Momentum

In practice, the hyperparameters …

2026-02-04

arXiv (preprint)

High-Dimensional Privacy-Utility Dynamics of Noisy Stochastic Gradient Descent on Least Squares

Shurong Lin

Eric D. Kolaczyk

Adam Smith

2025-10-18

arXiv (preprint)

Dimension-adapted Momentum Outscales SGD

We investigate scaling laws for stochastic momentum algorithms with small batch on the power law random features model, parameterized by data complexity, target complexity, and model size. When trained with a stochastic momentum algorithm, our analysis reveals four distinct loss curve shapes determined by varying data-target complexities. While traditional stochastic gradient descent with momentum (SGD-M) yields identical scaling law exponents to SGD, dimension-adapted Nesterov acceleration (DANA) improves these exponents by scaling momentum hyperparameters based on model size and data complexity. This outscaling phenomenon, which also improves compute-optimal scaling behavior, is achieved by DANA across a broad range of data and target complexities, while traditional methods fall short. Extensive experiments on high-dimensional synthetic quadratics validate our theoretical predictions and large-scale text experiments with LSTMs show DANA's improved loss exponents over SGD hold in a practical setting.

2025-09-17

NeurIPS.cc/2025/Conference (spotlight)

openreview.net

Exact risk curves of signSGD in High-Dimensions: quantifying preconditioning and noise-compression effects

Ke Liang Xiao

Noah Marshall

Atish Agarwala

In recent years, signSGD has garnered interest as both a practical optimizer as well as a simple model to understand adaptive optimizers like Adam. Though there is a general consensus that signSGD acts to precondition optimization and reshapes noise, quantitatively understanding these effects in theoretically solvable settings remains difficult. We present an analysis of signSGD in a high dimensional limit, and derive a limiting SDE and ODE to describe the risk. Using this framework we quantify four effects of signSGD: effective learning rate, noise compression, diagonal preconditioning, and gradient noise reshaping. Our analysis is consistent with experimental observations but moves beyond that by quantifying the dependence of these effects on the data and noise distributions. We conclude with a conjecture on how these results might be extended to Adam.

2025-07-14

International Conference on Machine Learning (poster)

proceedings.mlr.press

4+3 Phases of Compute-Optimal Neural Scaling Laws

Courtney Paquette

Lechao Xiao

Jeffrey Pennington

We consider the solvable neural scaling model with three parameters: data complexity, target complexity, and model-parameter-count. We use this neural scaling model to derive new predictions about the compute-limited, infinite-data scaling law regime. To train the neural scaling model, we run one-pass stochastic gradient descent on a mean-squared loss. We derive a representation of the loss curves which holds over all iteration counts and improves in accuracy as the model parameter count grows. We then analyze the compute-optimal model-parameter-count, and identify 4 phases (+3 subphases) in the data-complexity/target-complexity phase-plane. The phase boundaries are determined by the relative importance of model capacity, optimizer noise, and embedding of the features. We furthermore derive, with mathematical proof and extensive numerical evidence, the scaling-law exponents in all of these phases, in particular computing the optimal model-parameter-count as a function of floating point operation budget.

2024-09-24

NeurIPS.cc/2024/Conference (spotlight)

openreview.net

The High Line: Exact Risk and Learning Rate Curves of Stochastic Adaptive Learning Rate Algorithms

Elizabeth Collins-Woodfin

Inbar Seroussi

Begoña García Malaxechebarría

Andrew W. Mackenzie

Courtney Paquette

We develop a framework for analyzing the training and learning rate dynamics on a large class of high-dimensional optimization problems, which we call the high line, trained using one-pass stochastic gradient descent (SGD) with adaptive learning rates. We give exact expressions for the risk and learning rate curves in terms of a deterministic solution to a system of ODEs. We then investigate in detail two adaptive learning rates -- an idealized exact line search and AdaGrad-Norm -- on the least squares problem. When the data covariance matrix has strictly positive eigenvalues, this idealized exact line search strategy can exhibit arbitrarily slower convergence when compared to the optimal fixed learning rate with SGD. Moreover we exactly characterize the limiting learning rate (as time goes to infinity) for line search in the setting where the data covariance has only two distinct eigenvalues. For noiseless targets, we further demonstrate that the AdaGrad-Norm learning rate converges to a deterministic constant inversely proportional to the average eigenvalue of the data covariance matrix, and identify a phase transition when the covariance density of eigenvalues follows a power law distribution. We provide our code for evaluation at https://github.com/amackenzie1/highline2024.

2024-09-24

NeurIPS.cc/2024/Conference (poster)

openreview.net

Differentially Private Linear Regression With Linked Data

Shurong Lin

Eric D. Kolaczyk

There has been increasing demand for establishing privacy-preserving methodologies for modern statistics and machine learning. Differential privacy, a mathematical notion from computer science, is a rising tool offering robust privacy guarantees. Recent work focuses primarily on developing differentially private versions of individual statistical and machine learning tasks, with nontrivial upstream pre-processing typically not incorporated. An important example is when record linkage is done prior to downstream modeling. Record linkage refers to the statistical task of linking two or more data sets of the same group of entities without a unique identifier. This probabilistic procedure brings additional uncertainty to the subsequent task. In this paper, we present two differentially private algorithms for linear regression with linked data. In particular, we propose a noisy gradient method and a sufficient statistics perturbation approach for the estimation of regression coefficients. We investigate the privacy-accuracy tradeoff by providing finite-sample error bounds for the estimators, which allows us to understand the relative contributions of linkage error, estimation error, and the cost of privacy. The variances of the estimators are also discussed. We demonstrate the performance of the proposed algorithms through simulations and an application to synthetic data.

2024-07-30

Harvard Data Science Review (published)