
Gintare Karolina Dziugaite

Associate Industry Member
Adjunct Professor, McGill University, School of Computer Science
Senior Research Scientist, Google DeepMind
Research Topics
Deep Learning
Information Theory
Machine Learning Theory

Biography

Gintare Karolina Dziugaite is a senior research scientist at Google DeepMind in Toronto, and an adjunct professor at the McGill University School of Computer Science. Prior to joining Google, she led the Trustworthy AI program at Element AI (ServiceNow). Her research combines theoretical and empirical approaches to understanding deep learning.

Dziugaite is well known for her work on network and data sparsity, developing pruning algorithms and characterizing their effects on generalization and other properties of trained models. She pioneered the study of linear mode connectivity, first connecting it to the existence of lottery tickets, then to loss landscapes and the mechanism of iterative magnitude pruning. Another major focus of her research is understanding generalization in deep learning, and more generally the development of information-theoretic methods for studying generalization. Her most recent work focuses on machine unlearning: removing the influence of specific training data from a trained model.

Dziugaite obtained her PhD in machine learning from the University of Cambridge under the supervision of Zoubin Ghahramani. Prior to that, she studied mathematics at the University of Warwick and read Part III in Mathematics at the University of Cambridge, receiving a Master of Advanced Study (MASt) in mathematics. She has participated in a number of long-term programs at the Institute for Advanced Study in Princeton, NJ, and at the Simons Institute for the Theory of Computing at the University of California, Berkeley.

Publications

On the Dichotomy Between Privacy and Traceability in ℓp Stochastic Convex Optimization
Sasha Voitovych
Mahdi Haghifam
Idan Attias
Roi Livni
Daniel M. Roy
In this paper, we investigate the necessity of memorization in stochastic convex optimization (SCO) under …
Selective Unlearning via Representation Erasure Using Domain Adversarial Training
Nazanin Mohammadi Sepahvand
Eleni Triantafillou
James J. Clark
Daniel M. Roy
When deploying machine learning models in the real world, we often face the challenge of “unlearning” specific data points or subsets after training. Inspired by Domain-Adversarial Training of Neural Networks (DANN), we propose a novel algorithm, SURE, for targeted unlearning. SURE treats the process as a domain adaptation problem, where the “forget set” (data to be removed) and a validation set from the same distribution form two distinct domains. We train a domain classifier to discriminate between representations from the forget and validation sets. Using a gradient reversal strategy similar to DANN, we perform gradient updates to the representations to “fool” the domain classifier and thus obfuscate representations belonging to the forget set. Simultaneously, gradient descent is applied to the retain set (original training data minus the forget set) to preserve its classification performance. Unlike other unlearning approaches whose training objectives are built on model outputs, SURE directly manipulates the representations. This is key to ensuring robustness against a set of attacks more powerful than those currently considered in the literature, which aim to detect which examples were unlearned through access to learned embeddings. Our thorough experiments reveal that SURE achieves a better trade-off between unlearning quality and utility than other standard unlearning techniques for deep neural networks.
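
A minimal sketch of the DANN-style gradient-reversal mechanism described above, written in PyTorch. The module names (encoder, task_head, domain_head), the loss weighting lam, and the single combined update are illustrative assumptions, not the authors' reference implementation.

```python
import torch
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; negates (and scales) gradients on the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def sure_style_step(encoder, task_head, domain_head, optimizer,
                    x_forget, x_valid, x_retain, y_retain, lam=1.0):
    """One hypothetical SURE-style update: an adversarial domain loss on the
    forget/validation representations plus a standard task loss on the retain set."""
    optimizer.zero_grad()

    # Domain classification: forget set = domain 0, validation set = domain 1.
    z = encoder(torch.cat([x_forget, x_valid]))
    domain_labels = torch.cat(
        [torch.zeros(len(x_forget)), torch.ones(len(x_valid))]
    ).long()
    # The reversal makes the encoder maximize domain confusion (obfuscating the
    # forget-set representations) while the domain head still learns to discriminate.
    domain_logits = domain_head(GradReverse.apply(z, lam))
    domain_loss = F.cross_entropy(domain_logits, domain_labels)

    # Preserve utility on the retain set with the usual classification loss.
    retain_loss = F.cross_entropy(task_head(encoder(x_retain)), y_retain)

    (domain_loss + retain_loss).backward()
    optimizer.step()
```
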
The Journey Matters: Average Parameter Count over Pre-training Unifies Sparse and Dense Scaling Laws
Tian Jin
Ahmed Imtiaz Humayun
Utku Evci
Suvinay Subramanian
Amir Yazdanbakhsh
Dan Alistarh
Pruning eliminates unnecessary parameters in neural networks; it offers a promising solution to the growing computational demands of large language models (LLMs). While much prior work focuses on post-training pruning, sparse pre-training, which combines pruning and pre-training into a single phase, provides a simpler alternative. In this work, we present the first systematic exploration of optimal sparse pre-training configurations for LLMs through an examination of 80 unique pruning schedules across different sparsity levels and training durations. We find that initiating pruning at 25% of total training compute and concluding at 75% achieves near-optimal final evaluation loss. These findings provide valuable insights for efficient and effective sparse pre-training of LLMs. Furthermore, we propose a new scaling law that modifies the Chinchilla scaling law to use the average parameter count over pre-training. Through empirical and theoretical validation, we demonstrate that this modified scaling law accurately models evaluation loss for both sparsely and densely pre-trained LLMs, unifying scaling laws across pre-training paradigms. Our findings indicate that while sparse pre-training achieves the same final model quality as dense pre-training for equivalent compute budgets, it provides substantial benefits through reduced model size, enabling significant computational savings during inference.
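
As a rough numerical illustration of the modified law, the sketch below computes the parameter count averaged over pre-training for a pruning schedule that ramps linearly between 25% and 75% of training, and plugs it into a Chinchilla-style form L = E + A/N^α + B/D^β. The schedule shape, function names, and all coefficient values are illustrative placeholders, not the paper's fitted constants.

```python
import numpy as np

def average_param_count(n_dense, target_sparsity, start=0.25, end=0.75, steps=1000):
    """Average parameter count over training for a schedule that prunes linearly
    from 0% sparsity at `start` to `target_sparsity` at `end` (fractions of total
    training compute); outside that window the count is constant."""
    t = np.linspace(0.0, 1.0, steps)
    progress = np.clip((t - start) / (end - start), 0.0, 1.0)  # pruning progress in [0, 1]
    sparsity = target_sparsity * progress
    return float(np.mean(n_dense * (1.0 - sparsity)))

def chinchilla_style_loss(n_avg, d_tokens, E=1.7, A=400.0, B=410.0, alpha=0.34, beta=0.28):
    """Chinchilla-style loss with the *average* parameter count in place of N.
    Coefficients here are illustrative placeholders, not fitted values."""
    return E + A / n_avg**alpha + B / d_tokens**beta

# Example: a 1B-parameter dense model pruned to 50% sparsity, trained on 20B tokens.
n_avg = average_param_count(n_dense=1e9, target_sparsity=0.5)
print(chinchilla_style_loss(n_avg, d_tokens=2e10))
```
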
The Size of Teachers as a Measure of Data Complexity: PAC-Bayes Excess Risk Bounds and Scaling Laws
We study the generalization properties of randomly initialized neural networks, under the assumption that the network is larger than some unknown "teacher" network that achieves low risk. We extend the analysis of Buzaglo et al. (2024) to allow for student networks of arbitrary width and depth, and to the setting where no (small) teacher network perfectly interpolates the data. We obtain an oracle inequality relating the risk of Gibbs posterior sampling to that of narrow teacher networks. As a result, the sample complexity is once again bounded in terms of the size of narrow teacher networks that themselves achieve small risk. We then introduce a new notion of data complexity, based on the minimal size of a teacher network required to achieve a certain level of excess risk. By comparing the scaling laws resulting from our bounds to those observed in empirical studies, we are able to estimate the data complexity of standard benchmarks according to our measure.
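
For context, this line of work builds on the classical PAC-Bayes framework; the paper's own oracle inequality is not reproduced here. A standard background bound, stated under the usual assumptions (loss bounded in [0, 1], sample S of size n, data-independent prior P, any posterior Q), together with the Gibbs posterior that the abstract refers to, reads:

```latex
% Gibbs posterior at inverse temperature \beta: reweight the prior by the empirical risk \hat{L}_S.
Q_\beta(\mathrm{d}h) \;\propto\; \exp\!\big(-\beta n\, \hat{L}_S(h)\big)\, P(\mathrm{d}h)

% Standard PAC-Bayes generalization bound: with probability at least 1-\delta over S,
% simultaneously for all posteriors Q,
\mathbb{E}_{h\sim Q}\big[L(h)\big]
\;\le\; \mathbb{E}_{h\sim Q}\big[\hat{L}_S(h)\big]
\;+\; \sqrt{\frac{\mathrm{KL}(Q\,\|\,P) + \ln\!\big(2\sqrt{n}/\delta\big)}{2n}}
```
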
Soup to go: mitigating forgetting during continual learning with model averaging
Anat Kleiman
Jonathan Frankle
Sham M. Kakade
Mansheej Paul
In continual learning, where task data arrives in a sequence, fine-tuning on later tasks will often lead to performance degradation on earlier tasks. This is especially pronounced when these tasks come from diverse domains. In this setting, how can we mitigate catastrophic forgetting of earlier tasks and retain what the model has learned with minimal computational expense? Inspired by other merging methods and L2-regression, we propose Sequential Fine-tuning with Averaging (SFA), a method that merges currently training models with earlier checkpoints during the course of training. SOTA approaches typically maintain a data buffer of past tasks or impose a penalty at each gradient step. In contrast, our method achieves comparable results without the need to store past data or multiple copies of parameters for each gradient step. Furthermore, our method outperforms common merging techniques such as Task Arithmetic, TIES Merging, and WiSE-FT, as well as penalty methods like L2 and Elastic Weight Consolidation. In turn, our method offers insight into the benefits of merging partially trained models during training across both image and language domains.
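
A minimal sketch of the checkpoint-averaging idea, assuming the currently training model is periodically merged with a checkpoint taken at the start of the current task. The merge frequency, the choice of checkpoint, the 50/50 weighting, and the helper names (make_optimizer, train_one_epoch) are illustrative guesses rather than the paper's exact recipe.

```python
import copy
import torch

@torch.no_grad()
def average_with_checkpoint(model, checkpoint_state, weight=0.5):
    """In-place convex combination of the current weights with a stored checkpoint."""
    for name, param in model.named_parameters():
        param.mul_(1.0 - weight).add_(checkpoint_state[name], alpha=weight)

def sequential_finetune_with_averaging(model, tasks, make_optimizer, train_one_epoch,
                                       epochs_per_task=3, merge_every=1, weight=0.5):
    """Fine-tune on tasks sequentially, periodically merging back toward the
    checkpoint saved at the start of the current task to limit forgetting."""
    for task_loader in tasks:
        checkpoint = copy.deepcopy(model.state_dict())  # snapshot before this task
        optimizer = make_optimizer(model)
        for epoch in range(epochs_per_task):
            train_one_epoch(model, optimizer, task_loader)
            if (epoch + 1) % merge_every == 0:
                average_with_checkpoint(model, checkpoint, weight)
```
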
Torque-Aware Momentum
Pranshu Malviya
Goncalo Mordido
Aristide Baratin
Reza Babanezhad Harikandeh
Efficiently exploring complex loss landscapes is key to the performance of deep neural networks. While momentum-based optimizers are widely used in state-of-the-art setups, classical momentum can still struggle with large, misaligned gradients, leading to oscillations. To address this, we propose Torque-Aware Momentum (TAM), which introduces a damping factor based on the angle between the new gradients and the previous momentum, stabilizing the update direction during training. Empirical results show that TAM, which can be combined with both SGD and Adam, enhances exploration, handles distribution shifts more effectively, and improves generalization performance across various tasks, including image classification and large language model fine-tuning, when compared to classical momentum-based optimizers.
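
One plausible reading of the damping mechanism, applied to SGD with momentum: the freshly accumulated gradient is scaled by a factor derived from the cosine of the angle between the gradient and the current momentum buffer. The mapping (1 + cos)/2 and the hyperparameters below are assumptions for illustration; the paper's exact damping function and official optimizer may differ.

```python
import torch

@torch.no_grad()
def tam_style_sgd_step(params, momenta, lr=0.1, beta=0.9, eps=1e-12):
    """One TAM-flavoured SGD-with-momentum step (illustrative, not the official TAM).
    Gradients misaligned with the momentum (negative cosine) are damped toward zero."""
    for p, m in zip(params, momenta):
        if p.grad is None:
            continue
        g = p.grad
        cos = torch.sum(m * g) / (m.norm() * g.norm() + eps)  # alignment of gradient and momentum
        damp = (1.0 + cos) / 2.0                               # maps cosine in [-1, 1] to [0, 1]
        m.mul_(beta).add_(damp * g)                            # damped momentum accumulation
        p.add_(m, alpha=-lr)                                   # parameter update
```

Here `momenta` would be a list of zero-initialized tensors shaped like the parameters; with zero momentum the cosine term is zero and the damping factor starts at 1/2 under this toy definition.
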