Gintare Karolina Dziugaite

The Journey Matters: Average Parameter Count over Pre-training Unifies Sparse and Dense Scaling Laws

Tian Jin

Ahmed Imtiaz Humayun

Utku Evci

Suvinay Subramanian

Amir Yazdanbakhsh

Dan Alistarh

Pruning eliminates unnecessary parameters in neural networks; it offers a promising solution to the growing computational demands of large language models (LLMs). While many focus on post-training pruning, sparse pre-training, which combines pruning and pre-training into a single phase, provides a simpler alternative. In this work, we present the first systematic exploration of optimal sparse pre-training configurations for LLMs through an examination of 80 unique pruning schedules across different sparsity levels and training durations. We find that initiating pruning at 25% of total training compute and concluding at 75% achieves near-optimal final evaluation loss. These findings provide valuable insights for efficient and effective sparse pre-training of LLMs. Furthermore, we propose a new scaling law that modifies the Chinchilla scaling law to use the average parameter count over pre-training. Through empirical and theoretical validation, we demonstrate that this modified scaling law accurately models evaluation loss for both sparsely and densely pre-trained LLMs, unifying scaling laws across pre-training paradigms. Our findings indicate that while sparse pre-training achieves the same final model quality as dense pre-training for equivalent compute budgets, it provides substantial benefits through reduced model size, enabling significant potential computational savings during inference.

2025-01-22

ICLR.cc/2025/Conference (poster)
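
The average parameter count mentioned in the abstract above can be illustrated with a small sketch. It assumes sparsity ramps linearly between the 25% and 75% marks and treats training progress as a uniform axis; the abstract states only the start and end points, so the ramp shape, the function names, and the numerical integration are illustrative assumptions, not the authors' implementation. The loss function uses the standard Chinchilla form with the parameter count replaced by its average; no fitted coefficients are given in the abstract, so none are supplied here.

```python
def density_at(t, final_sparsity, start=0.25, end=0.75):
    """Fraction of parameters still active at training progress t in [0, 1].

    Assumption: sparsity ramps linearly from 0 to final_sparsity between
    `start` and `end`; only the endpoints come from the abstract.
    """
    if t <= start:
        return 1.0
    if t >= end:
        return 1.0 - final_sparsity
    ramp = (t - start) / (end - start)
    return 1.0 - final_sparsity * ramp


def average_param_count(dense_params, final_sparsity, steps=1000):
    """Mean parameter count over training, approximated by the midpoint rule."""
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) / steps
        total += dense_params * density_at(t, final_sparsity)
    return total / steps


def chinchilla_style_loss(n_avg, tokens, E, A, B, alpha, beta):
    """Chinchilla functional form, with N replaced by the average count."""
    return E + A / n_avg**alpha + B / tokens**beta


if __name__ == "__main__":
    # Hypothetical model: 1e9 dense parameters pruned to 50% sparsity.
    print(average_param_count(1e9, 0.5))
```

Under these assumptions, a model pruned to 50% sparsity has an average parameter count of roughly three quarters of its dense count, even though the final model keeps only half of its parameters.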

The Size of Teachers as a Measure of Data Complexity: PAC-Bayes Excess Risk Bounds and Scaling Laws

Daniel M. Roy

We study the generalization properties of randomly initialized neural networks, under the assumption that the network is larger than some unknown "teacher" network that achieves low risk. We extend the analysis of Buzaglo et al. (2024) to allow for student networks of arbitrary width and depth, and to the setting where no (small) teacher network perfectly interpolates the data. We obtain an oracle inequality, relating the risk of Gibbs posterior sampling to that of narrow teacher networks. As a result, the sample complexity is once again bounded in terms of the size of narrow teacher networks that themselves achieve small risk. We then introduce a new notion of data complexity, based on the minimal size of a teacher network required to achieve a certain level of excess risk. By comparing the scaling laws resulting from our bounds to those observed in empirical studies, we are able to estimate the data complexity of standard benchmarks according to our measure.

2025-01-22

aistats.org/AISTATS/2025/Conference (poster)

The Journey Matters: Average Parameter Count over Pre-training Unifies Sparse and Dense Scaling Laws

Tian Jin

Ahmed Imtiaz Humayun

Utku Evci

Suvinay Subramanian

Amir Yazdanbakhsh

Dan Alistarh

Pruning eliminates unnecessary parameters in neural networks; it offers a promising solution to the growing computational demands of large language models (LLMs). While many focus on post-training pruning, sparse pre-training, which combines pruning and pre-training into a single phase, provides a simpler alternative. In this work, we present the first systematic exploration of optimal sparse pre-training configurations for LLMs through an examination of 80 unique pruning schedules across different sparsity levels and training durations. We find that initiating pruning at 25% of total training compute and concluding at 75% achieves near-optimal final evaluation loss. These findings provide valuable insights for efficient and effective sparse pre-training of LLMs. Furthermore, we propose a new scaling law that modifies the Chinchilla scaling law to use the average parameter count over pre-training. Through empirical and theoretical validation, we demonstrate that this modified scaling law accurately models evaluation loss for both sparsely and densely pre-trained LLMs, unifying scaling laws across pre-training paradigms. Our findings indicate that while sparse pre-training achieves the same final model quality as dense pre-training for equivalent compute budgets, it provides substantial benefits through reduced model size, enabling significant potential computational savings during inference.

2025-01-21

ArXiv (preprint)

Soup to go: mitigating forgetting during continual learning with model averaging

Anat Kleiman

Jonathan Frankle

Sham M. Kakade

Mansheej Paul

In continual learning, where task data arrives in a sequence, fine-tuning on later tasks will often lead to performance degradation on earlier tasks. This is especially pronounced when these tasks come from diverse domains. In this setting, how can we mitigate catastrophic forgetting of earlier tasks and retain what the model has learned with minimal computational expenses? Inspired by other merging methods, and L2-regression, we propose Sequential Fine-tuning with Averaging (SFA), a method that merges currently training models with earlier checkpoints during the course of training. SOTA approaches typically maintain a data buffer of past tasks or impose a penalty at each gradient step. In contrast, our method achieves comparable results without the need to store past data, or multiple copies of parameters for each gradient step. Furthermore, our method outperforms common merging techniques such as Task Arithmetic, TIES Merging, and WiSE-FT, as well as other penalty methods like L2 and Elastic Weight Consolidation. In turn, our method offers insight into the benefits of merging partially-trained models during training across both image and language domains.

2025-01-09

ArXiv (preprint)
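
The idea of merging a model with an earlier checkpoint during training, described in the abstract above, can be sketched on a toy problem. This is a minimal illustration, not the authors' SFA implementation: the fixed start-of-task checkpoint, the merge interval `merge_every`, the mixing weight `beta`, and plain gradient descent on a list of floats are all assumptions made for the example.

```python
def average_weights(current, checkpoint, beta):
    """Convex combination: beta * checkpoint + (1 - beta) * current."""
    return [beta * c + (1.0 - beta) * w for w, c in zip(current, checkpoint)]


def train_with_averaging(weights, grad_fn, steps, lr=0.1, merge_every=10, beta=0.5):
    """Gradient descent that periodically pulls weights back toward a checkpoint.

    Assumption: the checkpoint is the weights at the start of the new task,
    standing in for the model learned on earlier tasks.
    """
    checkpoint = list(weights)
    for step in range(1, steps + 1):
        grads = grad_fn(weights)
        weights = [w - lr * g for w, g in zip(weights, grads)]
        if step % merge_every == 0:
            weights = average_weights(weights, checkpoint, beta)
    return weights


if __name__ == "__main__":
    # Toy "new task": quadratic loss 0.5 * (w - 1)^2, whose gradient is w - 1.
    new_task_grad = lambda ws: [w - 1.0 for w in ws]
    merged = train_with_averaging([0.0], new_task_grad, steps=20)
    plain = train_with_averaging([0.0], new_task_grad, steps=20, merge_every=10**9)
    print(merged, plain)
```

In this toy setting the merged weights end up between the earlier checkpoint and the plainly fine-tuned weights, which is the sense in which averaging limits drift away from what was learned before. No past data is stored, only one extra copy of the parameters.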

Torque-Aware Momentum

Pranshu Malviya

Goncalo Mordido

Aristide Baratin

Reza Babanezhad Harikandeh

Razvan Pascanu

Sarath Chandar

Efficiently exploring complex loss landscapes is key to the performance of deep neural networks. While momentum-based optimizers are widely used in state-of-the-art setups, classical momentum can still struggle with large, misaligned gradients, leading to oscillations. To address this, we propose Torque-Aware Momentum (TAM), which introduces a damping factor based on the angle between the new gradients and previous momentum, stabilizing the update direction during training. Empirical results show that TAM, which can be combined with both SGD and Adam, enhances exploration, handles distribution shifts more effectively, and improves generalization performance across various tasks, including image classification and large language model fine-tuning, when compared to classical momentum-based optimizers.

2024-12-25

ArXiv (preprint)
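
A damping factor based on the angle between the new gradient and the previous momentum, as described in the abstract above, might look like the following sketch. The specific mapping from cosine similarity to a damping value in [0, 1], the function names, and the SGD-style update are assumptions chosen for illustration; the abstract does not give the exact formula used by TAM.

```python
import math


def cosine(u, v, eps=1e-12):
    """Cosine of the angle between two vectors (0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def damped_momentum_step(weights, momentum, grads, lr=0.1, beta=0.9):
    """One momentum update in which misaligned gradients are scaled down.

    Assumed damping: 0.5 * (1 + cos), so a gradient aligned with the momentum
    passes through at full strength and an opposing one is suppressed.
    """
    damping = 0.5 * (1.0 + cosine(grads, momentum))
    momentum = [beta * m + damping * g for m, g in zip(momentum, grads)]
    weights = [w - lr * m for w, m in zip(weights, momentum)]
    return weights, momentum


if __name__ == "__main__":
    w, m = [0.0, 0.0], [1.0, 0.0]
    print(damped_momentum_step(w, m, [1.0, 0.0]))   # gradient aligned with momentum
    print(damped_momentum_step(w, m, [-1.0, 0.0]))  # gradient opposing momentum
```

With this choice, a gradient pointing against the accumulated momentum barely changes the update direction, which is one way to damp the oscillations that classical momentum can exhibit.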

Improved Localized Machine Unlearning Through the Lens of Memorization

Reihaneh Torkzadehmahani

Reza Nasirigerdeh

Georgios Kaissis

Daniel Rueckert

Eleni Triantafillou

Machine unlearning refers to removing the influence of a specified subset of training data from a machine learning model, efficiently, after it has already been trained. This is important for key applications, including making the model more accurate by removing outdated, mislabeled, or poisoned data. In this work, we study localized unlearning, where the unlearning algorithm operates on a (small) identified subset of parameters. Drawing inspiration from the memorization literature, we propose an improved localization strategy that yields strong results when paired with existing unlearning algorithms. We also propose a new unlearning algorithm, Deletion by Example Localization (DEL), that resets the parameters deemed-to-be most critical according to our localization strategy, and then finetunes them. Our extensive experiments on different datasets, forget sets and metrics reveal that DEL sets a new state-of-the-art for unlearning metrics, against both localized and full-parameter methods, while modifying a small subset of parameters, and outperforms the state-of-the-art localized unlearning in terms of test accuracy too.

2024-12-03

ArXiv (preprint)
