Elliot Paquette

Anisotropic local law for non-separable sample covariance matrices

Fan Zhou

Renyuan Ma

Zhichao Wang

Zhou Fan

We establish local laws for sample covariance matrices …

2026-02-19

arXiv (prépublication)

Logarithmic-time Schedules for Scaling Language Models with Momentum

In practice, the hyperparameters …

2026-02-04

arXiv (prépublication)

High-Dimensional Privacy-Utility Dynamics of Noisy Stochastic Gradient Descent on Least Squares

Shurong Lin

Eric D. Kolaczyk

Adam Smith

2025-10-18

ArXiv (prépublication)

Dimension-adapted Momentum Outscales SGD

We investigate scaling laws for stochastic momentum algorithms with small batch on the power law random features model, parameterized by dat… (voir plus)a complexity, target complexity, and model size. When trained with a stochastic momentum algorithm, our analysis reveals four distinct loss curve shapes determined by varying data-target complexities. While traditional stochastic gradient descent with momentum (SGD-M) yields identical scaling law exponents to SGD, dimension-adapted Nesterov acceleration (DANA) improves these exponents by scaling momentum hyperparameters based on model size and data complexity. This outscaling phenomenon, which also improves compute-optimal scaling behavior, is achieved by DANA across a broad range of data and target complexities, while traditional methods fall short. Extensive experiments on high-dimensional synthetic quadratics validate our theoretical predictions and large-scale text experiments with LSTMs show DANA's improved loss exponents over SGD hold in a practical setting.

2025-09-17

NeurIPS.cc/2025/Conference (spotlight)

Exact risk curves of signSGD in High-Dimensions: quantifying preconditioning and noise-compression effects

Ke Liang Xiao

Noah Marshall

Atish Agarwala

In recent years, signSGD has garnered interest as both a practical optimizer as well as a simple model to understand adaptive optimizers lik… (voir plus)e Adam. Though there is a general consensus that signSGD acts to precondition optimization and reshapes noise, quantitatively understanding these effects in theoretically solvable settings remains difficult. We present an analysis of signSGD in a high dimensional limit, and derive a limiting SDE and ODE to describe the risk. Using this framework we quantify four effects of signSGD: effective learning rate, noise compression, diagonal preconditioning, and gradient noise reshaping. Our analysis is consistent with experimental observations but moves beyond that by quantifying the dependence of these effects on the data and noise distributions. We conclude with a conjecture on how these results might be extended to Adam.

2025-07-14

International Conference on Machine Learning (Accept (poster))

proceedings.mlr.press

4+3 Phases of Compute-Optimal Neural Scaling Laws

Lechao Xiao

Jeffrey Pennington

We consider the solvable neural scaling model with three parameters: data complexity, target complexity, and model-parameter-count. We use t… (voir plus)his neural scaling model to derive new predictions about the compute-limited, infinite-data scaling law regime. To train the neural scaling model, we run one-pass stochastic gradient descent on a mean-squared loss. We derive a representation of the loss curves which holds over all iteration counts and improves in accuracy as the model parameter count grows. We then analyze the compute-optimal model-parameter-count, and identify 4 phases (+3 subphases) in the data-complexity/target-complexity phase-plane. The phase boundaries are determined by the relative importance of model capacity, optimizer noise, and embedding of the features. We furthermore derive, with mathematical proof and extensive numerical evidence, the scaling-law exponents in all of these phases, in particular computing the optimal model-parameter-count as a function of floating point operation budget.

2024-09-24

NeurIPS.cc/2024/Conference (spotlight)

Elizabeth Collins-Woodfin

The High Line: Exact Risk and Learning Rate Curves of Stochastic Adaptive Learning Rate Algorithms

INBAR SEROUSSI

Begoña García Malaxechebarría

Andrew W. Mackenzie

We develop a framework for analyzing the training and learning rate dynamics on a large class of high-dimensional optimization problems, whi… (voir plus)ch we call the high line, trained using one-pass stochastic gradient descent (SGD) with adaptive learning rates. We give exact expressions for the risk and learning rate curves in terms of a deterministic solution to a system of ODEs. We then investigate in detail two adaptive learning rates -- an idealized exact line search and AdaGrad-Norm -- on the least squares problem. When the data covariance matrix has strictly positive eigenvalues, this idealized exact line search strategy can exhibit arbitrarily slower convergence when compared to the optimal fixed learning rate with SGD. Moreover we exactly characterize the limiting learning rate (as time goes to infinity) for line search in the setting where the data covariance has only two distinct eigenvalues. For noiseless targets, we further demonstrate that the AdaGrad-Norm learning rate converges to a deterministic constant inversely proportional to the average eigenvalue of the data covariance matrix, and identify a phase transition when the covariance density of eigenvalues follows a power law distribution. We provide our code for evaluation at https://github.com/amackenzie1/highline2024.

2024-09-24

NeurIPS.cc/2024/Conference (poster)

Differentially Private Linear Regression With Linked Data

Shurong Lin

Eric D. Kolaczyk

There has been increasing demand for establishing privacy-preserving methodologies for modern statistics and machine learning. Differential … (voir plus)privacy, a mathematical notion from computer science, is a rising tool offering robust privacy guarantees. Recent work focuses primarily on developing differentially private versions of individual statistical and machine learning tasks, with nontrivial upstream pre-processing typically not incorporated. An important example is when record linkage is done prior to downstream modeling. Record linkage refers to the statistical task of linking two or more data sets of the same group of entities without a unique identifier. This probabilistic procedure brings additional uncertainty to the subsequent task. In this paper, we present two differentially private algorithms for linear regression with linked data. In particular, we propose a noisy gradient method and a sufficient statistics perturbation approach for the estimation of regression coefficients. We investigate the privacy-accuracy tradeoff by providing finite-sample error bounds for the estimators, which allows us to understand the relative contributions of linkage error, estimation error, and the cost of privacy. The variances of the estimators are also discussed. We demonstrate the performance of the proposed algorithms through simulations and an application to synthetic data.

2024-07-08

Harvard Data Science Review (publié)

Elizabeth Collins-Woodfin

Hitting the High-dimensional notes: an ODE for SGD learning dynamics on GLMs and multi-index models

INBAR SEROUSSI

We analyze the dynamics of streaming stochastic gradient descent (SGD) in the high-dimensional limit when applied to generalized linear mode… (voir plus)ls and multi-index models (e.g. logistic regression, phase retrieval) with general data-covariance. In particular, we demonstrate a deterministic equivalent of SGD in the form of a system of ordinary differential equations that describes a wide class of statistics, such as the risk and other measures of sub-optimality. This equivalence holds with overwhelming probability when the model parameter count grows proportionally to the number of data. This framework allows us to obtain learning rate thresholds for stability of SGD as well as convergence guarantees. In addition to the deterministic equivalent, we introduce an SDE with a simplified diffusion coefficient (homogenized SGD) which allows us to analyze the dynamics of general statistics of SGD iterates. Finally, we illustrate this theory on some standard examples and show numerical simulations which give an excellent match to the theory.

2023-08-16

ArXiv (prépublication)

Homogenization of SGD in high-dimensions: Exact dynamics and generalization properties

Ben Adlam

Jeffrey Pennington

2022-05-13

ArXiv (prépublication)

Implicit Regularization or Implicit Conditioning? Exact Risk Trajectories of SGD in High Dimensions

Ben Adlam

Jeffrey Pennington

Stochastic gradient descent (SGD) is a pillar of modern machine learning, serving as the go-to optimization algorithm for a diverse array of… (voir plus) problems. While the empirical success of SGD is often attributed to its computational efficiency and favorable generalization behavior, neither effect is well understood and disentangling them remains an open problem. Even in the simple setting of convex quadratic problems, worst-case analyses give an asymptotic convergence rate for SGD that is no better than full-batch gradient descent (GD), and the purported implicit regularization effects of SGD lack a precise explanation. In this work, we study the dynamics of multi-pass SGD on high-dimensional convex quadratics and establish an asymptotic equivalence to a stochastic differential equation, which we call homogenized stochastic gradient descent (HSGD), whose solutions we characterize explicitly in terms of a Volterra integral equation. These results yield precise formulas for the learning and risk trajectories, which reveal a mechanism of implicit conditioning that explains the efficiency of SGD relative to GD. We also prove that the noise from SGD negatively impacts generalization performance, ruling out the possibility of any type of implicit regularization in this context. Finally, we show how to adapt the HSGD formalism to include streaming SGD, which allows us to produce an exact prediction for the excess risk of multi-pass SGD relative to that of streaming SGD (bootstrap risk).

2021-12-31

Advances in Neural Information Processing Systems 35 (NeurIPS 2022) (publié)