Ekaterina Lobacheva

Is Depth Heterogeneity a Barrier to Model Merging?

Model merging offers a way to combine the capabilities of several networks at test time without retraining or additional finetuning, but mos… (voir plus)t merging methods assume identical architectures. Depth differences are commonly viewed as a major obstacle because they remove clear layer correspondences. We test this assumption by merging residual networks that differ only in depth, using a simple training-free pipeline based on identity expansion and permutation alignment. Across both same-task and multitask image classification experiments, heterogeneous merges closely match homogeneous ones. The results suggest that, for residual networks, depth mismatch is not the main barrier to effective model merging, and that the main difficulty in model merging comes from aligning independently trained weights in a homogeneous setting.

2026-02-28

TTU_Main_Track @ International Conference on Learning Representations (publié)

openreview.net

Loss Smoothing for Continual Adaptation

Neural networks are often adapted in nonstationary data distributions settings where the objective is to optimize performance on the current… (voir plus) task, and preserving accuracy on previous tasks is not required. As a result, existing methods primarily focus on improving plasticity, while stability is largely studied in the context of continual learning. In this work, we examine whether preserving stability can also be beneficial in model adaptation settings where past-task performance is irrelevant. We propose a simple loss smoothing approach that encourages selective adaptation by preserving task-shared features while modifying task-inconsistent ones. We evaluate our method on continual supervised model adaptation benchmarks and reinforcement learning benchmarks, and show that promoting representational stability during adaptation can improve performance across settings.

2026-02-27

CAO @ International Conference on Learning Representations (poster)

openreview.net

Revisiting the Goldilocks Zone in Inhomogeneous Networks

Zacharie Garnier Cuchet

A. Chandar

Ekaterina Lobacheva

We investigate how architectural inhomogeneities—such as biases, layer normalization, and residual connections—affect the curvature of t… (voir plus)he loss landscape at initialization and its link to trainability. We focus on the Goldilocks zone, a region in parameter space with excess positive curvature, previously associated with improved optimization in homogeneous networks. To extend this analysis, we compare two scaling strategies: weight scaling and softmax temperature scaling. Our results show that in networks with biases or residual connections, both strategies identify a Goldilocks zone aligned with better training. In contrast, layer normalization leads to lower or negative curvature, yet stable optimization—revealing a disconnect between curvature and trainability. Softmax temperature scaling behaves more consistently across models, making it a more robust probe. Overall, the Goldilocks zone remains relevant in inhomogeneous networks, but its geometry and predictive power depend on architectural choices, particularly normalization.

2025-06-08

ICML.cc/2025/Workshop/HiLD (poster)

openreview.net

Training Dynamics Underlying Language Model Scaling Laws: Loss Deceleration and Zero-Sum Learning

Andrei Mircea

Supriyo Chakraborty

Nima Chitsazan

Irina Rish

Ekaterina Lobacheva

This work aims to understand how scaling improves language models, specifically in terms of training dynamics. We find that language models … (voir plus)undergo loss deceleration early in training; an abrupt slowdown in the rate of loss improvement, resulting in piecewise linear behaviour of the loss curve in log-log space. Scaling up the model mitigates this transition by (1) decreasing the loss at which deceleration occurs, and (2) improving the log-log rate of loss improvement after deceleration. We attribute loss deceleration to a type of degenerate training dynamics we term zero-sum learning (ZSL). In ZSL, per-example gradients become systematically opposed, leading to destructive interference in per-example changes in loss. As a result, improving loss on one subset of examples degrades it on another, bottlenecking overall progress. Loss deceleration and ZSL provide new insights into the training dynamics underlying language model scaling laws, and could potentially be targeted directly to improve language models independent of scale. We make our code and artefacts available at: https://github.com/mirandrom/zsl

2024-12-31

ACL (1) (publié)

doi.org

arxiv.org

How Learning Rates Shape Neural Network Focus: Insights from Example Ranking

Ekaterina Lobacheva

Keller Jordan

Aristide Baratin

Nicolas Roux

The learning rate is a key hyperparameter that affects both the speed of training and the generalization performance of neural networks. Th… (voir plus)rough a new {\it loss-based example ranking} analysis, we show that networks trained with different learning rates focus their capacity on different parts of the data distribution, leading to solutions with different generalization properties. These findings, which hold across architectures and datasets, provide new insights into how learning rates affect model performance and example-level dynamics in neural networks.

2024-10-09

NeurIPS.cc/2024/Workshop/SciForDL (poster)

openreview.net

Language model scaling laws and zero-sum learning

Andrei Mircea

Ekaterina Lobacheva

Supriyo Chakraborty

Nima Chitsazan

Irina Rish

This work aims to understand how, in terms of training dynamics, scaling up language model size yields predictable loss improvements. We fin… (voir plus)d that these improvements can be tied back to loss deceleration, an abrupt transition in the rate of loss improvement, characterized by piece-wise linear behavior in log-log space. Notably, improvements from increased model size appear to be a result of (1) improving the loss at which this transition occurs; and (2) improving the rate of loss improvement after this transition. As an explanation for the mechanism underlying this transition (and the effect of model size on loss it mediates), we propose the zero-sum learning (ZSL) hypothesis. In ZSL, per-token gradients become systematically opposed, leading to degenerate training dynamics where the model can't improve loss on one token without harming it on another; bottlenecking the overall rate at which loss can improve. We find compelling evidence of ZSL, as well as unexpected results which shed light on other factors contributing to ZSL.

2024-10-09

NeurIPS.cc/2024/Workshop/SciForDL (poster)

openreview.net

Gradient Dissent in Language Model Training and Saturation

Andrei Mircea

Ekaterina Lobacheva

Irina Rish

We seek to shed light on language model (LM) saturation from the perspective of learning dynamics. To this end, we define a decomposition o… (voir plus)f the cross-entropy gradient, which forms a shared low-dimensional basis for analyzing the training dynamics of models across scales. Intuitively, this decomposition consists of attractive and repulsive components that increase the logit of the correct class and decrease the logits of incorrect classes, respectively. Our analysis in this subspace reveals a phenomenon we term \textit{gradient dissent}, characterized by gradient components becoming systematically opposed such that loss cannot be improved along one component without being degraded along the other. Notably, we find that complete opposition, which we term \textit{total dissent}, reliably occurs in tandem with the saturation of smaller LMs. Based on these results, we hypothesize that gradient dissent can provide a useful foundation for better understanding and mitigating saturation.

2024-06-15

ICML.cc/2024/Workshop/HiLD (poster)

openreview.net

Publications du Fellowship en politiques de l'IA

La plateforme Mila Ventures

Boussole des politiques en IA

Ekaterina Lobacheva

Publications

Publications du Fellowship en politiques de l'IA

La plateforme Mila Ventures

Boussole des politiques en IA

Mots-clés populaires:

Ekaterina Lobacheva

Publications