Portrait of Ekaterina Lobacheva

Ekaterina Lobacheva

Postdoctorate - Université de Montréal
Supervisor
Research Topics
Deep Learning
Optimization

Publications

Revisiting the Goldilocks Zone in Inhomogeneous Networks
Zacharie Garnier Cuchet
We investigate how architectural inhomogeneities—such as biases, layer normalization, and residual connections—affect the curvature of t… (see more)he loss landscape at initialization and its link to trainability. We focus on the Goldilocks zone, a region in parameter space with excess positive curvature, previously associated with improved optimization in homogeneous networks. To extend this analysis, we compare two scaling strategies: weight scaling and softmax temperature scaling. Our results show that in networks with biases or residual connections, both strategies identify a Goldilocks zone aligned with better training. In contrast, layer normalization leads to lower or negative curvature, yet stable optimization—revealing a disconnect between curvature and trainability. Softmax temperature scaling behaves more consistently across models, making it a more robust probe. Overall, the Goldilocks zone remains relevant in inhomogeneous networks, but its geometry and predictive power depend on architectural choices, particularly normalization.
Training Dynamics Underlying Language Model Scaling Laws: Loss Deceleration and Zero-Sum Learning
Supriyo Chakraborty
Nima Chitsazan
This work aims to understand how scaling improves language models, specifically in terms of training dynamics. We find that language models … (see more)undergo loss deceleration early in training; an abrupt slowdown in the rate of loss improvement, resulting in piecewise linear behaviour of the loss curve in log-log space. Scaling up the model mitigates this transition by (1) decreasing the loss at which deceleration occurs, and (2) improving the log-log rate of loss improvement after deceleration. We attribute loss deceleration to a type of degenerate training dynamics we term zero-sum learning (ZSL). In ZSL, per-example gradients become systematically opposed, leading to destructive interference in per-example changes in loss. As a result, improving loss on one subset of examples degrades it on another, bottlenecking overall progress. Loss deceleration and ZSL provide new insights into the training dynamics underlying language model scaling laws, and could potentially be targeted directly to improve language models independent of scale. We make our code and artefacts available at: https://github.com/mirandrom/zsl
Training Dynamics Underlying Language Model Scaling Laws: Loss Deceleration and Zero-Sum Learning
Supriyo Chakraborty
Nima Chitsazan
This work aims to understand how scaling improves language models, specifically in terms of training dynamics. We find that language models … (see more)undergo loss deceleration early in training; an abrupt slowdown in the rate of loss improvement, resulting in piecewise linear behaviour of the loss curve in log-log space. Scaling up the model mitigates this transition by (1) decreasing the loss at which deceleration occurs, and (2) improving the log-log rate of loss improvement after deceleration. We attribute loss deceleration to a type of degenerate training dynamics we term zero-sum learning (ZSL). In ZSL, per-example gradients become systematically opposed, leading to destructive interference in per-example changes in loss. As a result, improving loss on one subset of examples degrades it on another, bottlenecking overall progress. Loss deceleration and ZSL provide new insights into the training dynamics underlying language model scaling laws, and could potentially be targeted directly to improve language models independent of scale. We make our code and artefacts available at: https://github.com/mirandrom/zsl
How Learning Rates Shape Neural Network Focus: Insights from Example Ranking
The learning rate is a key hyperparameter that affects both the speed of training and the generalization performance of neural networks. Th… (see more)rough a new {\it loss-based example ranking} analysis, we show that networks trained with different learning rates focus their capacity on different parts of the data distribution, leading to solutions with different generalization properties. These findings, which hold across architectures and datasets, provide new insights into how learning rates affect model performance and example-level dynamics in neural networks.
Language model scaling laws and zero-sum learning
Supriyo Chakraborty
Nima Chitsazan
This work aims to understand how, in terms of training dynamics, scaling up language model size yields predictable loss improvements. We fin… (see more)d that these improvements can be tied back to loss deceleration, an abrupt transition in the rate of loss improvement, characterized by piece-wise linear behavior in log-log space. Notably, improvements from increased model size appear to be a result of (1) improving the loss at which this transition occurs; and (2) improving the rate of loss improvement after this transition. As an explanation for the mechanism underlying this transition (and the effect of model size on loss it mediates), we propose the zero-sum learning (ZSL) hypothesis. In ZSL, per-token gradients become systematically opposed, leading to degenerate training dynamics where the model can't improve loss on one token without harming it on another; bottlenecking the overall rate at which loss can improve. We find compelling evidence of ZSL, as well as unexpected results which shed light on other factors contributing to ZSL.
Gradient Dissent in Language Model Training and Saturation
We seek to shed light on language model (LM) saturation from the perspective of learning dynamics. To this end, we define a decomposition o… (see more)f the cross-entropy gradient, which forms a shared low-dimensional basis for analyzing the training dynamics of models across scales. Intuitively, this decomposition consists of attractive and repulsive components that increase the logit of the correct class and decrease the logits of incorrect classes, respectively. Our analysis in this subspace reveals a phenomenon we term \textit{gradient dissent}, characterized by gradient components becoming systematically opposed such that loss cannot be improved along one component without being degraded along the other. Notably, we find that complete opposition, which we term \textit{total dissent}, reliably occurs in tandem with the saturation of smaller LMs. Based on these results, we hypothesize that gradient dissent can provide a useful foundation for better understanding and mitigating saturation.