Gintare Karolina Dziugaite

Associate Industry Member
Adjunct Professor, McGill University, School of Computer Science
Senior Research Scientist, Google DeepMind
Research Topics
Deep Learning
Machine Learning Theory
Information Theory

Biography

Gintare Karolina Dziugaite is a senior research scientist at Google DeepMind in Toronto and an adjunct professor at the McGill University School of Computer Science. Before joining Google, she led the Trustworthy AI program at Element AI / ServiceNow. Her research combines theoretical and empirical approaches to understanding deep learning.

Dziugaite is well known for her work on network and data sparsity, where she has developed algorithms and uncovered the effects of sparsity on generalization and other metrics. She pioneered the study of linear mode connectivity, connecting it first to the existence of lottery tickets, then to loss landscapes and the mechanism of iterative magnitude pruning. Her research also addresses generalization in deep learning and, more broadly, the development of information-theoretic methods for studying generalization. Her most recent work focuses on removing the influence of data from a trained model (unlearning).

Dziugaite obtained her PhD in machine learning from the University of Cambridge under the supervision of Zoubin Ghahramani. She studied mathematics at the University of Warwick and read Part III in Mathematics at the University of Cambridge, where she received a Master of Advanced Study (MASt) in mathematics. She has participated in several long-term programs at the Institute for Advanced Study in Princeton, New Jersey, and at the Simons Institute for the Theory of Computing at the University of California, Berkeley.

Publications

Soup to go: mitigating forgetting during continual learning with model averaging
Anat Kleiman
Jonathan Frankle
Sham M. Kakade
Mansheej Paul
In continual learning, where task data arrives in a sequence, fine-tuning on later tasks will often lead to performance degradation on earlier tasks. This is especially pronounced when these tasks come from diverse domains. In this setting, how can we mitigate catastrophic forgetting of earlier tasks and retain what the model has learned with minimal computational expenses? Inspired by other merging methods, and L2-regression, we propose Sequential Fine-tuning with Averaging (SFA), a method that merges currently training models with earlier checkpoints during the course of training. SOTA approaches typically maintain a data buffer of past tasks or impose a penalty at each gradient step. In contrast, our method achieves comparable results without the need to store past data, or multiple copies of parameters for each gradient step. Furthermore, our method outperforms common merging techniques such as Task Arithmetic, TIES Merging, and WiSE-FT, as well as other penalty methods like L2 and Elastic Weight Consolidation. In turn, our method offers insight into the benefits of merging partially-trained models during training across both image and language domains.
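The idea of merging a model with an earlier checkpoint during training can be illustrated with a toy sketch. This is not the authors' SFA implementation: the function names, the linear interpolation rule, the fixed merge interval `merge_every`, and the weight `beta` are all assumptions chosen for illustration.

```python
def average_with_checkpoint(current, checkpoint, beta):
    """Interpolate parameter-wise: beta * checkpoint + (1 - beta) * current."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    return [beta * c + (1.0 - beta) * w for w, c in zip(current, checkpoint)]


def fine_tune_with_averaging(weights, grad_fn, checkpoint, steps,
                             lr=0.1, merge_every=5, beta=0.5):
    """Plain gradient descent that merges with a fixed earlier checkpoint
    every `merge_every` steps. All hyperparameters are hypothetical."""
    weights = list(weights)
    for step in range(1, steps + 1):
        grads = grad_fn(weights)
        weights = [w - lr * g for w, g in zip(weights, grads)]
        if step % merge_every == 0:
            weights = average_with_checkpoint(weights, checkpoint, beta)
    return weights


def new_task_grad(weights):
    """Gradient of a toy 'new task' loss sum((w - 1)^2), pulling weights to 1."""
    return [2.0 * (w - 1.0) for w in weights]


# The checkpoint from the toy 'earlier task' sits at 0.0.
earlier = [0.0, 0.0]
merged = fine_tune_with_averaging(earlier, new_task_grad, earlier, steps=20)
plain = fine_tune_with_averaging(earlier, new_task_grad, earlier, steps=20, beta=0.0)
```

In this toy setting, the periodically merged weights stay closer to the earlier checkpoint than the weights from plain fine-tuning (`beta=0.0` disables the merge), and no past data is stored.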
Torque-Aware Momentum
Pranshu Malviya
Goncalo Mordido
Aristide Baratin
Reza Babanezhad Harikandeh
Efficiently exploring complex loss landscapes is key to the performance of deep neural networks. While momentum-based optimizers are widely used in state-of-the-art setups, classical momentum can still struggle with large, misaligned gradients, leading to oscillations. To address this, we propose Torque-Aware Momentum (TAM), which introduces a damping factor based on the angle between the new gradients and previous momentum, stabilizing the update direction during training. Empirical results show that TAM, which can be combined with both SGD and Adam, enhances exploration, handles distribution shifts more effectively, and improves generalization performance across various tasks, including image classification and large language model fine-tuning, when compared to classical momentum-based optimizers.
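A damping factor based on the angle between the new gradient and the previous momentum could look roughly like the sketch below. The abstract does not give the formula, so the specific choice `(1 + cos) / 2`, the function names, and the default hyperparameters are assumptions for illustration, not the TAM update itself.

```python
import math


def cosine(u, v, eps=1e-12):
    """Cosine of the angle between two vectors (0.0 if either is near zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > eps else 0.0


def damped_momentum_step(weights, momentum, grads, lr=0.1, beta=0.9):
    """One SGD-with-momentum step in which the gradient's contribution is
    scaled by a damping factor in [0, 1] (hypothetical rule): 1 when the
    gradient is aligned with the momentum, 0 when it points the opposite way."""
    damping = (1.0 + cosine(grads, momentum)) / 2.0
    momentum = [beta * m + damping * g for m, g in zip(momentum, grads)]
    weights = [w - lr * m for w, m in zip(weights, momentum)]
    return weights, momentum, damping


# An aligned gradient is applied in full; an opposed one is damped away.
_, m_aligned, d_aligned = damped_momentum_step([0.0, 0.0], [2.0, 0.0], [1.0, 0.0])
_, m_opposed, d_opposed = damped_momentum_step([0.0, 0.0], [2.0, 0.0], [-1.0, 0.0])
```

Under this rule, a gradient that opposes the accumulated momentum changes the update direction less than it would with classical momentum, which is one simple way to reduce oscillations.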
Improved Localized Machine Unlearning Through the Lens of Memorization
Reihaneh Torkzadehmahani
Reza Nasirigerdeh
Georgios Kaissis
Daniel Rueckert
Eleni Triantafillou
Machine unlearning refers to removing the influence of a specified subset of training data from a machine learning model, efficiently, after it has already been trained. This is important for key applications, including making the model more accurate by removing outdated, mislabeled, or poisoned data. In this work, we study localized unlearning, where the unlearning algorithm operates on a (small) identified subset of parameters. Drawing inspiration from the memorization literature, we propose an improved localization strategy that yields strong results when paired with existing unlearning algorithms. We also propose a new unlearning algorithm, Deletion by Example Localization (DEL), that resets the parameters deemed-to-be most critical according to our localization strategy, and then finetunes them. Our extensive experiments on different datasets, forget sets and metrics reveal that DEL sets a new state-of-the-art for unlearning metrics, against both localized and full-parameter methods, while modifying a small subset of parameters, and outperforms the state-of-the-art localized unlearning in terms of test accuracy too.
Unlearning in- vs. out-of-distribution data in LLMs under gradient-based method
Teodora Baluta
Pascal Lamblin
Daniel Tarlow
Fabian Pedregosa
Machine unlearning aims to solve the problem of removing the influence of selected training examples from a learned model. Despite the increasing attention to this problem, it remains an open research question how to evaluate unlearning in large language models (LLMs), and what are the critical properties of the data to be unlearned that affect the quality and efficiency of unlearning. This work formalizes a metric to evaluate unlearning quality in generative models, and uses it to assess the trade-offs between unlearning quality and performance. We demonstrate that unlearning out-of-distribution examples requires more unlearning steps but presents a better trade-off overall. For in-distribution examples, however, we observe a rapid decay in performance as unlearning progresses. We further evaluate how an example's memorization and difficulty affect unlearning under a classical gradient ascent-based approach.
Mechanistic Unlearning: Robust Knowledge Unlearning and Editing via Mechanistic Localization
Phillip Huang Guo
Aaquib Syed
Abhay Sheshadri
Aidan Ewart
The Non-Local Model Merging Problem: Permutation Symmetries and Variance Collapse
Ekansh Sharma
Daniel M. Roy
Model merging aims to efficiently combine the weights of multiple expert models, each trained on a specific task, into a single multi-task model, with strong performance across all tasks. When applied to all but the last layer of weights, existing methods -- such as Task Arithmetic, TIES-merging, and TALL mask merging -- work well to combine expert models obtained by fine-tuning a common foundation model, operating within a "local" neighborhood of the foundation model. This work explores the more challenging scenario of "non-local" merging, which we find arises when an expert model changes significantly during pretraining or where the expert models do not even share a common foundation model. We observe that standard merging techniques often fail to generalize effectively in this non-local setting, even when accounting for permutation symmetries using standard techniques. We identify that this failure is, in part, due to "variance collapse", a phenomenon identified also in the setting of linear mode connectivity by Jordan et al. (2023). To address this, we propose a multi-task technique to re-scale and shift the output activations of the merged model for each task, aligning its output statistics with those of the corresponding task-specific expert models. Our experiments demonstrate that this correction significantly improves the performance of various model merging approaches in non-local settings, providing a strong baseline for future research on this problem.
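The re-scale and shift correction can be pictured, for a single output unit and a single task, as an affine map that matches the merged model's output mean and standard deviation to the expert's. The sketch below is a minimal illustration under that assumption; the function names and the use of population statistics over one batch are hypothetical and are not taken from the paper.

```python
import statistics


def fit_rescale_shift(merged_outputs, expert_outputs, eps=1e-8):
    """Return (scale, shift) such that scale * merged + shift has roughly the
    same mean and standard deviation as the expert outputs."""
    m_mean = statistics.fmean(merged_outputs)
    e_mean = statistics.fmean(expert_outputs)
    m_std = statistics.pstdev(merged_outputs)
    e_std = statistics.pstdev(expert_outputs)
    scale = e_std / (m_std + eps)  # eps guards against a fully collapsed output
    shift = e_mean - scale * m_mean
    return scale, shift


def apply_correction(outputs, scale, shift):
    """Apply the fitted affine correction to the merged model's outputs."""
    return [scale * y + shift for y in outputs]


# Toy example of variance collapse: the merged outputs barely vary,
# while the task expert's outputs are spread out.
merged_out = [0.9, 1.0, 1.1]
expert_out = [-2.0, 0.0, 2.0]
scale, shift = fit_rescale_shift(merged_out, expert_out)
corrected = apply_correction(merged_out, scale, shift)
```

After the correction, the toy merged outputs recover the expert's spread and mean, which is the sense in which the output statistics are aligned.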