
Courtney Paquette

Associate Academic Member
Canada CIFAR AI Chair
Assistant Professor, McGill University, Department of Mathematics and Statistics
Research Scientist, Google Brain
Research Topics
Optimization

Biography

Courtney Paquette is an Assistant Professor at McGill University and a Canada CIFAR AI Chair at Mila – Quebec Artificial Intelligence Institute. Her research focuses on the design and analysis of algorithms for large-scale optimization problems, with applications in data science. She received her PhD in mathematics from the University of Washington (2017), held postdoctoral positions at Lehigh University (2017-2018) and the University of Waterloo (NSF postdoctoral fellowship, 2018-2019), and was a research scientist at Google Research, Brain Montréal (2019-2020).

Current Students

Research Master's - McGill
Research Master's - McGill
PhD - UdeM
Principal supervisor:
Research Intern - McGill
PhD - McGill
Principal supervisor:
PhD - McGill

Publications

Logarithmic-time Schedules for Scaling Language Models with Momentum
In practice, the hyperparameters …
Dimension-adapted Momentum Outscales SGD
We investigate scaling laws for stochastic momentum algorithms with small batch on the power law random features model, parameterized by data complexity, target complexity, and model size. When trained with a stochastic momentum algorithm, our analysis reveals four distinct loss curve shapes determined by varying data-target complexities. While traditional stochastic gradient descent with momentum (SGD-M) yields identical scaling law exponents to SGD, dimension-adapted Nesterov acceleration (DANA) improves these exponents by scaling momentum hyperparameters based on model size and data complexity. This outscaling phenomenon, which also improves compute-optimal scaling behavior, is achieved by DANA across a broad range of data and target complexities, while traditional methods fall short. Extensive experiments on high-dimensional synthetic quadratics validate our theoretical predictions and large-scale text experiments with LSTMs show DANA's improved loss exponents over SGD hold in a practical setting.
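A minimal sketch of the kind of setup this line of work studies, written for illustration only: one-pass SGD with heavy-ball momentum on a least-squares problem whose covariance spectrum decays as a power law. The dimension, exponents, target decay, and hyperparameters below are assumptions, and the update is plain SGD-M rather than the dimension-adapted (DANA) schedule analyzed in the paper.

```python
import numpy as np

# Assumed problem: power-law random-features-style least squares.
rng = np.random.default_rng(0)
d = 500                                    # model size (assumed)
alpha = 1.5                                # data-complexity exponent (assumed)
eigs = np.arange(1, d + 1, dtype=float) ** (-alpha)           # power-law covariance spectrum
w_star = rng.standard_normal(d) / np.arange(1, d + 1) ** 0.5  # target with assumed decay

w = np.zeros(d)
v = np.zeros(d)
lr, beta, batch = 0.5, 0.9, 8              # illustrative hyperparameters

for step in range(2001):
    # One-pass / streaming regime: a fresh mini-batch at every step.
    X = rng.standard_normal((batch, d)) * np.sqrt(eigs)
    y = X @ w_star                         # noiseless target
    grad = X.T @ (X @ w - y) / batch
    v = beta * v + grad                    # heavy-ball momentum buffer
    w = w - lr * (1 - beta) * v            # plain SGD-M update (not DANA)
    if step % 500 == 0:
        risk = 0.5 * np.dot(eigs * (w - w_star), w - w_star)
        print(f"step {step:5d}   population risk {risk:.4e}")
```

Sweeping the model size and the momentum hyperparameters in a script like this is one way to see, empirically, how the loss-curve exponents respond to the parameterization.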
Two-point deterministic equivalence for SGD in random feature models
Alexander Atanasov
Blake Bordelon
Jacob A Zavatone-Veth
Cengiz Pehlevan
Implicit Diffusion: Efficient Optimization through Stochastic Sampling
Pierre Marion
Anna Korba
Peter Bartlett
Mathieu Blondel
Valentin De Bortoli
Arnaud Doucet
Felipe Llinares-López
Quentin Berthet
High Dimensional First Order Mini-Batch Algorithms on Quadratic Problems
Andrew Nicholas Cheng
Kiwon Lee
We analyze the dynamics of general mini-batch first order algorithms on the …
4+3 Phases of Compute-Optimal Neural Scaling Laws
Lechao Xiao
Jeffrey Pennington
We consider the solvable neural scaling model with three parameters: data complexity, target complexity, and model-parameter-count. We use this neural scaling model to derive new predictions about the compute-limited, infinite-data scaling law regime. To train the neural scaling model, we run one-pass stochastic gradient descent on a mean-squared loss. We derive a representation of the loss curves which holds over all iteration counts and improves in accuracy as the model parameter count grows. We then analyze the compute-optimal model-parameter-count, and identify 4 phases (+3 subphases) in the data-complexity/target-complexity phase-plane. The phase boundaries are determined by the relative importance of model capacity, optimizer noise, and embedding of the features. We furthermore derive, with mathematical proof and extensive numerical evidence, the scaling-law exponents in all of these phases, in particular computing the optimal model-parameter-count as a function of floating point operation budget.
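As a hedged illustration of how a compute-optimal model size can be read off numerically (not the paper's analytical derivation), the sketch below runs one-pass SGD on an assumed power-law least-squares model at several model sizes and picks the best size at each compute budget. The spectrum, target, step size, and the FLOPs proxy are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = 1.2                                   # data-complexity exponent (assumed)
budgets = np.geomspace(1e4, 1e6, 5)           # compute budgets (assumed)

def run_sgd(d, n_steps, lr=0.2):
    # One-pass SGD on a d-dimensional power-law least-squares model.
    eigs = np.arange(1, d + 1, dtype=float) ** (-alpha)
    w_star = np.ones(d) / np.arange(1, d + 1)  # assumed target decay
    w = np.zeros(d)
    losses, flops = [], []
    for t in range(1, n_steps + 1):
        x = rng.standard_normal(d) * np.sqrt(eigs)   # one fresh sample per step
        w -= lr * (x @ (w - w_star)) * x
        losses.append(0.5 * np.dot(eigs * (w - w_star), w - w_star))
        flops.append(t * d)                           # crude FLOPs proxy (assumed)
    return np.array(flops), np.array(losses)

curves = {d: run_sgd(d, 4000) for d in (50, 100, 200, 400)}
for budget in budgets:
    # Best model size among those whose run actually reaches this budget.
    best = min(
        (np.interp(budget, f, l), d)
        for d, (f, l) in curves.items()
        if f[-1] >= budget
    )
    print(f"compute budget {budget:9.0f}: best d = {best[1]}, loss {best[0]:.3e}")
```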
The High Line: Exact Risk and Learning Rate Curves of Stochastic Adaptive Learning Rate Algorithms
We develop a framework for analyzing the training and learning rate dynamics on a large class of high-dimensional optimization problems, which we call the high line, trained using one-pass stochastic gradient descent (SGD) with adaptive learning rates. We give exact expressions for the risk and learning rate curves in terms of a deterministic solution to a system of ODEs. We then investigate in detail two adaptive learning rates -- an idealized exact line search and AdaGrad-Norm -- on the least squares problem. When the data covariance matrix has strictly positive eigenvalues, this idealized exact line search strategy can exhibit arbitrarily slower convergence when compared to the optimal fixed learning rate with SGD. Moreover we exactly characterize the limiting learning rate (as time goes to infinity) for line search in the setting where the data covariance has only two distinct eigenvalues. For noiseless targets, we further demonstrate that the AdaGrad-Norm learning rate converges to a deterministic constant inversely proportional to the average eigenvalue of the data covariance matrix, and identify a phase transition when the covariance density of eigenvalues follows a power law distribution. We provide our code for evaluation at https://github.com/amackenzie1/highline2024.
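The repository linked above contains the authors' evaluation code; the sketch below is an independent toy, under stated assumptions, showing the AdaGrad-Norm step-size rule on a noiseless least-squares problem so one can watch the effective learning rate settle as training proceeds. The spectrum, the constants eta0 and b0, and the iteration counts are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 300
eigs = np.linspace(0.5, 2.0, d)                  # data covariance spectrum (assumed)
w_star = rng.standard_normal(d)
w = np.zeros(d)

eta0, b0 = 1.0, 1.0                              # AdaGrad-Norm constants (assumed)
grad_norm_sq_sum = b0 ** 2

for t in range(1, 20001):
    x = rng.standard_normal(d) * np.sqrt(eigs)   # fresh sample: one-pass / streaming SGD
    grad = (x @ (w - w_star)) * x                # noiseless least-squares gradient
    grad_norm_sq_sum += grad @ grad
    lr = eta0 / np.sqrt(grad_norm_sq_sum)        # AdaGrad-Norm: one global step size
    w -= lr * grad
    if t % 5000 == 0:
        risk = 0.5 * np.dot(eigs * (w - w_star), w - w_star)
        print(f"t={t:6d}   lr={lr:.4e}   risk={risk:.4e}")
```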
Mirror Descent Algorithms with Nearly Dimension-Independent Rates for Differentially-Private Stochastic Saddle-Point Problems (extended abstract)
Tomás González
Cristóbal Guzmán
Mirror Descent Algorithms with Nearly Dimension-Independent Rates for Differentially-Private Stochastic Saddle-Point Problems
Tomás González
Cristóbal Guzmán
Hitting the High-dimensional notes: an ODE for SGD learning dynamics on GLMs and multi-index models
We analyze the dynamics of streaming stochastic gradient descent (SGD) in the high-dimensional limit when applied to generalized linear models and multi-index models (e.g. logistic regression, phase retrieval) with general data-covariance. In particular, we demonstrate a deterministic equivalent of SGD in the form of a system of ordinary differential equations that describes a wide class of statistics, such as the risk and other measures of sub-optimality. This equivalence holds with overwhelming probability when the model parameter count grows proportionally to the number of data. This framework allows us to obtain learning rate thresholds for stability of SGD as well as convergence guarantees. In addition to the deterministic equivalent, we introduce an SDE with a simplified diffusion coefficient (homogenized SGD) which allows us to analyze the dynamics of general statistics of SGD iterates. Finally, we illustrate this theory on some standard examples and show numerical simulations which give an excellent match to the theory.
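A rough illustrative sketch of the regime described above, with assumed constants: streaming SGD on logistic regression (a GLM) under a general diagonal covariance, printing a Monte Carlo estimate of the population risk, which is one of the statistics the deterministic ODE description is designed to track. Dimension, covariance, teacher model, and step size are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 400                                              # parameter count (assumed)
cov_eigs = np.geomspace(1.0, 0.05, d)                # general data covariance (assumed)
w_star = rng.standard_normal(d) / np.sqrt(cov_eigs.sum())   # teacher with roughly unit-variance logits

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.zeros(d)
lr = 0.5 / cov_eigs.sum()                            # illustrative step-size scaling

for t in range(1, 8001):
    x = rng.standard_normal(d) * np.sqrt(cov_eigs)   # one fresh sample: streaming SGD
    y = rng.binomial(1, sigmoid(x @ w_star))         # label drawn from the teacher GLM
    grad = (sigmoid(x @ w) - y) * x                  # single-sample logistic-loss gradient
    w -= lr * grad
    if t % 2000 == 0:
        # Monte Carlo estimate of the population logistic risk.
        X = rng.standard_normal((4000, d)) * np.sqrt(cov_eigs)
        p_star, p = sigmoid(X @ w_star), sigmoid(X @ w)
        risk = -np.mean(p_star * np.log(p + 1e-12) + (1 - p_star) * np.log(1 - p + 1e-12))
        print(f"t={t:5d}   population risk ~ {risk:.4f}")
```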
Only Tails Matter: Average-Case Universality and Robustness in the Convex Regime
The recently developed average-case analysis of optimization methods allows a more fine-grained and representative convergence analysis than usual worst-case results. In exchange, this analysis requires a more precise hypothesis over the data generating process, namely assuming knowledge of the expected spectral distribution (ESD) of the random matrix associated with the problem. This work shows that the concentration of eigenvalues near the edges of the ESD determines a problem's asymptotic average complexity. This a priori information on this concentration is a more grounded assumption than complete knowledge of the ESD. This approximate concentration is effectively a middle ground between the coarseness of the worst-case scenario convergence and the restrictive previous average-case analysis. We also introduce the Generalized Chebyshev method, asymptotically optimal under a hypothesis on this concentration and globally optimal when the ESD follows a Beta distribution. We compare its performance to classical optimization algorithms, such as gradient descent or Nesterov's scheme, and we show that, in the average-case context, Nesterov's method is universally nearly optimal asymptotically.
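For context, a minimal sketch (not the paper's Generalized Chebyshev method) comparing gradient descent and Nesterov's accelerated scheme on a random quadratic whose Hessian spectrum follows a Marchenko-Pastur law, the sort of average-case instance where the edges of the spectral distribution drive the complexity. Problem sizes, step sizes, and iteration counts are assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 1200, 800
A = rng.standard_normal((n, d)) / np.sqrt(n)     # Wishart-type design
H = A.T @ A                                      # Hessian with Marchenko-Pastur spectrum
b = H @ rng.standard_normal(d)                   # consistent least-squares target
L = np.linalg.eigvalsh(H).max()                  # smoothness constant

def grad(w):
    return H @ w - b

w_gd = np.zeros(d)                               # gradient descent iterate
w_nes = np.zeros(d)                              # Nesterov iterate
w_prev = np.zeros(d)

for k in range(1, 301):
    # Gradient descent with step 1/L.
    w_gd -= grad(w_gd) / L
    # Nesterov's accelerated method for convex quadratics.
    momentum = (k - 1) / (k + 2)
    y = w_nes + momentum * (w_nes - w_prev)
    w_prev = w_nes
    w_nes = y - grad(y) / L
    if k % 100 == 0:
        print(f"iter {k:3d}   ||grad|| GD {np.linalg.norm(grad(w_gd)):.3e}   "
              f"Nesterov {np.linalg.norm(grad(w_nes)):.3e}")
```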
Homogenization of SGD in high-dimensions: Exact dynamics and generalization properties
Ben Adlam
Jeffrey Pennington