Katie Everett

Logarithmic-time Schedules for Scaling Language Models with Momentum

In practice, the hyperparameters …

2026-02-04

arXiv (preprint)

doi.org

arxiv.org

Dimension-adapted Momentum Outscales SGD

We investigate scaling laws for stochastic momentum algorithms with small batch on the power law random features model, parameterized by dat… (see more)a complexity, target complexity, and model size. When trained with a stochastic momentum algorithm, our analysis reveals four distinct loss curve shapes determined by varying data-target complexities. While traditional stochastic gradient descent with momentum (SGD-M) yields identical scaling law exponents to SGD, dimension-adapted Nesterov acceleration (DANA) improves these exponents by scaling momentum hyperparameters based on model size and data complexity. This outscaling phenomenon, which also improves compute-optimal scaling behavior, is achieved by DANA across a broad range of data and target complexities, while traditional methods fall short. Extensive experiments on high-dimensional synthetic quadratics validate our theoretical predictions and large-scale text experiments with LSTMs show DANA's improved loss exponents over SGD hold in a practical setting.

2025-09-17

NeurIPS.cc/2025/Conference (spotlight)

doi.org

openreview.net

Nonparametric Partial Disentanglement via Mechanism Sparsity: Sparse Actions, Interventions and Sparse Temporal Dependencies

Sébastien Lachapelle

Pau Rodríguez

Yash Sharma

Katie Everett

Rémi Le Priol

Alexandre Lacoste

Simon Lacoste-Julien

2024-01-09

ArXiv (preprint)

doi.org

arxiv.org

GFlowNet-EM for Learning Compositional Latent Variable Models

Edward J. Hu

Alexandros Graikos

Latent variable models (LVMs) with discrete compositional latents are an important but challenging setting due to a combinatorially large nu… (see more)mber of possible configurations of the latents. A key tradeoff in modeling the posteriors over latents is between expressivity and tractable optimization. For algorithms based on expectation-maximization (EM), the E-step is often intractable without restrictive approximations to the posterior. We propose the use of GFlowNets, algorithms for sampling from an unnormalized density by learning a stochastic policy for sequential construction of samples, for this intractable E-step. By training GFlowNets to sample from the posterior over latents, we take advantage of their strengths as amortized variational inference algorithms for complex distributions over discrete structures. Our approach, GFlowNet-EM, enables the training of expressive LVMs with discrete compositional latents, as shown by experiments on non-context-free grammar induction and on images using discrete variational autoencoders (VAEs) without conditional independence enforced in the encoder.

2023-07-02

Proceedings of the 40th International Conference on Machine Learning (published)

doi.org

proceedings.mlr.press

GFlowNets and Variational Inference

This paper builds bridges between two families of probabilistic algorithms: (hierarchical) variational inference (VI), which is typically us… (see more)ed to model distributions over continuous spaces, and generative flow networks (GFlowNets), which have been used for distributions over discrete structures such as graphs. We demonstrate that, in certain cases, VI algorithms are equivalent to special cases of GFlowNets in the sense of equality of expected gradients of their learning objectives. We then point out the differences between the two families and show how these differences emerge experimentally. Notably, GFlowNets, which borrow ideas from reinforcement learning, are more amenable than VI to off-policy training without the cost of high gradient variance induced by importance sampling. We argue that this property of GFlowNets can provide advantages for capturing diversity in multimodal target distributions.

2023-01-31

ICLR.cc/2023/Conference (poster)

doi.org

openreview.net

Mila on Udemy

Disinformation 2.0: When AI Blurs the Lines

AI Policy Fellowship Publications

Katie Everett

Blog Posts

Publications

Mila on Udemy

Disinformation 2.0: When AI Blurs the Lines

AI Policy Fellowship Publications

Popular keywords:

Katie Everett

Blog Posts

Publications