Andrei Mircea

Training Dynamics Underlying Language Model Scaling Laws: Loss Deceleration and Zero-Sum Learning

Supriyo Chakraborty

Nima Chitsazan

This work aims to understand how scaling improves language models, specifically in terms of training dynamics. We find that language models … (see more)undergo loss deceleration early in training; an abrupt slowdown in the rate of loss improvement, resulting in piecewise linear behaviour of the loss curve in log-log space. Scaling up the model mitigates this transition by (1) decreasing the loss at which deceleration occurs, and (2) improving the log-log rate of loss improvement after deceleration. We attribute loss deceleration to a type of degenerate training dynamics we term zero-sum learning (ZSL). In ZSL, per-example gradients become systematically opposed, leading to destructive interference in per-example changes in loss. As a result, improving loss on one subset of examples degrades it on another, bottlenecking overall progress. Loss deceleration and ZSL provide new insights into the training dynamics underlying language model scaling laws, and could potentially be targeted directly to improve language models independent of scale. We make our code and artefacts available at: https://github.com/mirandrom/zsl

2025-06-05

ArXiv (preprint)

arxiv.org

Training Dynamics Underlying Language Model Scaling Laws: Loss Deceleration and Zero-Sum Learning

Andrei Mircea

Supriyo Chakraborty

Nima Chitsazan

Irina Rish

Ekaterina Lobacheva

This work aims to understand how scaling improves language models, specifically in terms of training dynamics. We find that language models … (see more)undergo loss deceleration early in training; an abrupt slowdown in the rate of loss improvement, resulting in piecewise linear behaviour of the loss curve in log-log space. Scaling up the model mitigates this transition by (1) decreasing the loss at which deceleration occurs, and (2) improving the log-log rate of loss improvement after deceleration. We attribute loss deceleration to a type of degenerate training dynamics we term zero-sum learning (ZSL). In ZSL, per-example gradients become systematically opposed, leading to destructive interference in per-example changes in loss. As a result, improving loss on one subset of examples degrades it on another, bottlenecking overall progress. Loss deceleration and ZSL provide new insights into the training dynamics underlying language model scaling laws, and could potentially be targeted directly to improve language models independent of scale. We make our code and artefacts available at: https://github.com/mirandrom/zsl

2025-01-01

ACL (1) (published)

doi.org

arxiv.org

Language model scaling laws and zero-sum learning

Andrei Mircea

Ekaterina Lobacheva

Supriyo Chakraborty

Nima Chitsazan

Irina Rish

This work aims to understand how, in terms of training dynamics, scaling up language model size yields predictable loss improvements. We fin… (see more)d that these improvements can be tied back to loss deceleration, an abrupt transition in the rate of loss improvement, characterized by piece-wise linear behavior in log-log space. Notably, improvements from increased model size appear to be a result of (1) improving the loss at which this transition occurs; and (2) improving the rate of loss improvement after this transition. As an explanation for the mechanism underlying this transition (and the effect of model size on loss it mediates), we propose the zero-sum learning (ZSL) hypothesis. In ZSL, per-token gradients become systematically opposed, leading to degenerate training dynamics where the model can't improve loss on one token without harming it on another; bottlenecking the overall rate at which loss can improve. We find compelling evidence of ZSL, as well as unexpected results which shed light on other factors contributing to ZSL.

2024-10-10

NeurIPS.cc/2024/Workshop/SciForDL (poster)

openreview.net

Scalable Approaches for a Theory of Many Minds

Maximilian Puelma Touzel

Amin Memarian

Matthew D Riemer

Andrei Mircea

Andrew Robert Williams

Elin Ahlstrand

Lucas Lehnert

Rupali Bhati

Guillaume Dumas

Irina Rish

A major challenge as we move towards building agents for real-world problems, which could involve a massive number of human and/or machine a… (see more)gents, is that we must learn to reason about the behavior of these many other agents. In this paper, we consider the problem of scaling a predictive Theory of Mind (ToM) model to a very large number of interacting agents with a fixed computational budget. Motivated by the limited diversity of agent types, existing approaches to scalable TOM learn versatile single-agent representations for quickly adapting to new agents encountered sequentially. We consider the more general setting that many agents are observed in parallel and formulate the corresponding Theory of Many Minds (ToMM) problem of estimating the joint policy. We frame the scaling behavior of solutions in terms of parameter sharing schemes and in particular propose two parameter-free architectural features that endow models with the ability to exploit action correlations: encoding a multi-agent context, and decoding through an abstracted joint action space. The increased predictive capabilities that have come with foundation models have made it easier to imagine the possibility of using these models to make simulations that imitate the behavior of many agents within complex real-world systems. Being able to perform these simulations in a general-purpose way would not only help make more capable agents, it also would be a very useful capability for applications in social science, political science, and economics.

2024-06-18

ICML.cc/2024/Workshop/Agentic_Markets (poster)

openreview.net

Gradient Dissent in Language Model Training and Saturation

Andrei Mircea

Ekaterina Lobacheva

Irina Rish

We seek to shed light on language model (LM) saturation from the perspective of learning dynamics. To this end, we define a decomposition o… (see more)f the cross-entropy gradient, which forms a shared low-dimensional basis for analyzing the training dynamics of models across scales. Intuitively, this decomposition consists of attractive and repulsive components that increase the logit of the correct class and decrease the logits of incorrect classes, respectively. Our analysis in this subspace reveals a phenomenon we term \textit{gradient dissent}, characterized by gradient components becoming systematically opposed such that loss cannot be improved along one component without being degraded along the other. Notably, we find that complete opposition, which we term \textit{total dissent}, reliably occurs in tandem with the saturation of smaller LMs. Based on these results, we hypothesize that gradient dissent can provide a useful foundation for better understanding and mitigating saturation.

2024-06-16

ICML.cc/2024/Workshop/HiLD (poster)

openreview.net

Balaur: Language Model Pretraining with Lexical Semantic Relations

Andrei Mircea

Jackie Cheung

2023-12-01

Findings of the Association for Computational Linguistics: EMNLP 2023 (published)

doi.org

openreview.net

Discourse-Aware Unsupervised Summarization for Long Scientific Documents

Yue Dong

Andrei Mircea

Jackie Cheung

2021-04-01

Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume (published)

doi.org

HipoRank: Incorporating Hierarchical and Positional Information into Graph-based Unsupervised Long Document Extractive Summarization

Yue Dong

Andrei Mircea

Jackie Cheung

We propose a novel graph-based ranking model for unsupervised extractive summarization of long documents. Graph-based ranking models typical… (see more)ly represent documents as undirected fully-connected graphs, where a node is a sentence, an edge is weighted based on sentence-pair similarity, and sentence importance is measured via node centrality. Our method leverages positional and hierarchical information grounded in discourse structure to augment a document's graph representation with hierarchy and directionality. Experimental results on PubMed and arXiv datasets show that our approach outperforms strong unsupervised baselines by wide margins and performs comparably to some of the state-of-the-art supervised models that are trained on hundreds of thousands of examples. In addition, we find that our method provides comparable improvements with various distributional sentence representations; including BERT and RoBERTa models fine-tuned on sentence similarity.

2020-05-01

ArXiv (preprint)

arxiv.org

Mila AI Policy Conference

Leading in a New Era

TRAIL: Responsible AI for Professionals and Leaders

Andrei Mircea

Publications

Mila AI Policy Conference

Leading in a New Era

TRAIL: Responsible AI for Professionals and Leaders

Popular keywords:

Andrei Mircea

Publications