Portrait de Courtney Paquette

Courtney Paquette

Membre académique associé
Chaire en IA Canada-CIFAR
Professeure adjointe, McGill University, Département de mathématiques et statistiques
Chercheuse scientifique, Google Brain
Sujets de recherche
Optimisation

Biographie

Courtney Paquette est professeure adjointe à l'Université McGill et titulaire d'une chaire en IA Canada-CIFAR à Mila – Institut québécois d’intelligence artificielle. Sa recherche se concentre sur la conception et l'analyse d'algorithmes pour les problèmes d'optimisation à grande échelle, et vise des applications en science des données. Courtney Paquette a obtenu un doctorat en mathématiques de l'Université de Washington (2017), a occupé des postes postdoctoraux à l'Université Lehigh (2017-2018) et à l'Université de Waterloo (bourse postdoctorale de la NSF, 2018-2019), et a été chercheuse scientifique chez Google Research, Brain Montréal (2019-2020).

Étudiants actuels

Maîtrise recherche - McGill
Maîtrise recherche - McGill
Doctorat - UdeM
Superviseur⋅e principal⋅e :
Stagiaire de recherche - McGill
Doctorat - McGill
Superviseur⋅e principal⋅e :
Doctorat - McGill

Publications

Hitting the High-dimensional notes: an ODE for SGD learning dynamics on GLMs and multi-index models
We analyze the dynamics of streaming stochastic gradient descent (SGD) in the high-dimensional limit when applied to generalized linear mode… (voir plus)ls and multi-index models (e.g. logistic regression, phase retrieval) with general data-covariance. In particular, we demonstrate a deterministic equivalent of SGD in the form of a system of ordinary differential equations that describes a wide class of statistics, such as the risk and other measures of sub-optimality. This equivalence holds with overwhelming probability when the model parameter count grows proportionally to the number of data. This framework allows us to obtain learning rate thresholds for stability of SGD as well as convergence guarantees. In addition to the deterministic equivalent, we introduce an SDE with a simplified diffusion coefficient (homogenized SGD) which allows us to analyze the dynamics of general statistics of SGD iterates. Finally, we illustrate this theory on some standard examples and show numerical simulations which give an excellent match to the theory.
Only Tails Matter: Average-Case Universality and Robustness in the Convex Regime
The recently developed average-case analysis of optimization methods allows a more fine-grained and representative convergence analysis than… (voir plus) usual worst-case results. In exchange, this analysis requires a more precise hypothesis over the data generating process, namely assuming knowledge of the expected spectral distribution (ESD) of the random matrix associated with the problem. This work shows that the concentration of eigenvalues near the edges of the ESD determines a problem's asymptotic average complexity. This a priori information on this concentration is a more grounded assumption than complete knowledge of the ESD. This approximate concentration is effectively a middle ground between the coarseness of the worst-case scenario convergence and the restrictive previous average-case analysis. We also introduce the Generalized Chebyshev method, asymptotically optimal under a hypothesis on this concentration and globally optimal when the ESD follows a Beta distribution. We compare its performance to classical optimization algorithms, such as gradient descent or Nesterov's scheme, and we show that, in the average-case context, Nesterov's method is universally nearly optimal asymptotically.
Homogenization of SGD in high-dimensions: Exact dynamics and generalization properties
Ben Adlam
Jeffrey Pennington
Halting Time is Predictable for Large Models: A Universality Property and Average-case Analysis
Average-case analysis computes the complexity of an algorithm averaged over all possible inputs. Compared to worst-case analysis, it is more… (voir plus) representative of the typical behavior of an algorithm, but remains largely unexplored in optimization. One difficulty is that the analysis can depend on the probability distribution of the inputs to the model. However, we show that this is not the case for a class of large-scale problems trained with first-order methods including random least squares and one-hidden layer neural networks with random weights. In fact, the halting time exhibits a universality property: it is independent of the probability distribution. With this barrier for average-case analysis removed, we provide the first explicit average-case convergence rates showing a tighter complexity not captured by traditional worst-case analysis. Finally, numerical simulations suggest this universality property holds for a more general class of algorithms and problems.
Implicit Regularization or Implicit Conditioning? Exact Risk Trajectories of SGD in High Dimensions
Ben Adlam
Jeffrey Pennington
Stochastic gradient descent (SGD) is a pillar of modern machine learning, serving as the go-to optimization algorithm for a diverse array of… (voir plus) problems. While the empirical success of SGD is often attributed to its computational efficiency and favorable generalization behavior, neither effect is well understood and disentangling them remains an open problem. Even in the simple setting of convex quadratic problems, worst-case analyses give an asymptotic convergence rate for SGD that is no better than full-batch gradient descent (GD), and the purported implicit regularization effects of SGD lack a precise explanation. In this work, we study the dynamics of multi-pass SGD on high-dimensional convex quadratics and establish an asymptotic equivalence to a stochastic differential equation, which we call homogenized stochastic gradient descent (HSGD), whose solutions we characterize explicitly in terms of a Volterra integral equation. These results yield precise formulas for the learning and risk trajectories, which reveal a mechanism of implicit conditioning that explains the efficiency of SGD relative to GD. We also prove that the noise from SGD negatively impacts generalization performance, ruling out the possibility of any type of implicit regularization in this context. Finally, we show how to adapt the HSGD formalism to include streaming SGD, which allows us to produce an exact prediction for the excess risk of multi-pass SGD relative to that of streaming SGD (bootstrap risk).
Trajectory of Mini-Batch Momentum: Batch Size Saturation and Convergence in High Dimensions
Kiwon Lee
Andrew N. Cheng
We analyze the dynamics of large batch stochastic gradient descent with momentum (SGD+M) on the least squares problem when both the number o… (voir plus)f samples and dimensions are large. In this setting, we show that the dynamics of SGD+M converge to a deterministic discrete Volterra equation as dimension increases, which we analyze. We identify a stability measurement, the implicit conditioning ratio (ICR), which regulates the ability of SGD+M to accelerate the algorithm. When the batch size exceeds this ICR, SGD+M converges linearly at a rate of