Gintare Karolina Dziugaite

Associate Industry Member
Adjunct Professor, McGill University, School of Computer Science
Senior Research Scientist, Google DeepMind
Research Topics
Deep Learning
Information Theory
Machine Learning Theory

Biography

Gintare Karolina Dziugaite is a senior research scientist at Google DeepMind in Toronto, and an adjunct professor at the McGill University School of Computer Science. Prior to joining Google, she led the Trustworthy AI program at Element AI (ServiceNow). Her research combines theoretical and empirical approaches to understanding deep learning.

Dziugaite is well known for her work on network and data sparsity, developing algorithms and uncovering the effects of sparsity on generalization and other metrics. She pioneered the study of linear mode connectivity, first connecting it to the existence of lottery tickets, then to loss landscapes and the mechanism of iterative magnitude pruning. Another major focus of her research is understanding generalization in deep learning and, more generally, the development of information-theoretic methods for studying generalization. Her most recent work looks at removing the influence of data on a model (unlearning).

Dziugaite obtained her PhD in machine learning from the University of Cambridge under the supervision of Zoubin Ghahramani. Prior to that, she studied mathematics at the University of Warwick and read Part III in Mathematics at the University of Cambridge, receiving a Master of Advanced Study (MASt) in mathematics. She has participated in a number of long-term programs at the Institute for Advanced Study in Princeton, NJ, and at the Simons Institute for the Theory of Computing at the University of California, Berkeley.

Publications

The Cost of Scaling Down Large Language Models: Reducing Model Size Affects Memory before In-context Learning.
Tian Jin
Nolan Clement
Xin Dong
Vaishnavh Nagarajan
Michael Carbin
Jonathan Ragan-Kelley
The Cost of Down-Scaling Language Models: Fact Recall Deteriorates before In-Context Learning
Tian Jin
Nolan Clement
Xin Dong
Vaishnavh Nagarajan
Michael Carbin
Jonathan Ragan-Kelley
How does scaling the number of parameters in large language models (LLMs) affect their core capabilities? We study two natural scaling techniques -- weight pruning and simply training a smaller or larger model, which we refer to as dense scaling -- and their effects on two core capabilities of LLMs: (a) recalling facts presented during pre-training and (b) processing information presented in-context during inference. By curating a suite of tasks that help disentangle these two capabilities, we find a striking difference in how these two abilities evolve due to scaling. Reducing the model size by more than 30% (via either scaling approach) significantly decreases the ability to recall facts seen in pre-training. Yet, a 60-70% reduction largely preserves the various ways the model can process in-context information, ranging from retrieving answers from a long context to learning parameterized functions from in-context exemplars. The fact that both dense scaling and weight pruning exhibit this behavior suggests that scaling model size has an inherently disparate effect on fact recall and in-context learning.
Unmasking the Lottery Ticket Hypothesis: What's Encoded in a Winning Ticket's Mask?
Mansheej Paul
Feng Chen
Brett W. Larsen
Jonathan Frankle
Surya Ganguli
Modern deep learning involves training costly, highly overparameterized networks, thus motivating the search for sparser networks that can still be trained to the same accuracy as the full network (i.e. matching). Iterative magnitude pruning (IMP) is a state-of-the-art algorithm that can find such highly sparse matching subnetworks, known as winning tickets. IMP operates by iterative cycles of training, masking smallest magnitude weights, rewinding back to an early training point, and repeating. Despite its simplicity, the underlying principles for when and how IMP finds winning tickets remain elusive. In particular, what useful information does an IMP mask found at the end of training convey to a rewound network near the beginning of training? How does SGD allow the network to extract this information? And why is iterative pruning needed? We develop answers in terms of the geometry of the error landscape. First, we find that
When Majorities Prevent Learning: Eliminating Bias to Improve Worst-group and Out-of-distribution Generalization
Yu Yang
Baharan Mirzasoleiman
Modern neural networks trained on large datasets have achieved state-of-the-art (in-distribution) generalization performance on various tasks. However, their good generalization performance has been shown to be attributed largely to overfitting spurious biases in large datasets. This is evident from the poor generalization performance of such models on minorities and out-of-distribution data. To alleviate this issue, subsampling the majority groups has been shown to be very effective. However, it is not clear how to find the subgroups (e.g. within a class) in large real-world datasets. Besides, naively subsampling the majority groups can entirely deplete some of their smaller sub-populations and drastically harm the in-distribution performance. Here, we show that tracking gradient trajectories of examples in initial epochs allows for finding large subpopulations of data points. We leverage this observation and propose an importance sampling method that is biased towards selecting smaller subpopulations, and eliminates bias in the large subpopulations. Our experiments confirm the effectiveness of our approach in eliminating spurious biases and learning higher-quality models with superior in- and out-of-distribution performance on various datasets.
Stochastic Neural Network with Kronecker Flow
Chin-Wei Huang
Ahmed Touati
Alexandre Lacoste
Recent advances in variational inference enable the modelling of highly structured joint distributions, but are limited in their capacity to scale to the high-dimensional setting of stochastic neural networks. This limitation motivates a need for scalable parameterizations of the noise generation process, in a manner that adequately captures the dependencies among the various parameters. In this work, we address this need and present the Kronecker Flow, a generalization of the Kronecker product to invertible mappings designed for stochastic neural networks. We apply our method to variational Bayesian neural networks on predictive tasks, PAC-Bayes generalization bound estimation, and approximate Thompson sampling in contextual bandits. In all setups, our methods prove to be competitive with existing methods and better than the baselines.
In Search of Robust Measures of Generalization
Brady Neal
Nitarshan Rajkumar
Ethan Caballero
Linbo Wang
Daniel M. Roy
One of the principal scientific challenges in deep learning is explaining generalization, i.e., why the particular way the community now trains networks to achieve small training error also leads to small error on held-out data from the same population. It is widely appreciated that some worst-case theories -- such as those based on the VC dimension of the class of predictors induced by modern neural network architectures -- are unable to explain empirical performance. A large volume of work aims to close this gap, primarily by developing bounds on generalization error, optimization error, and excess risk. When evaluated empirically, however, most of these bounds are numerically vacuous. Focusing on generalization bounds, this work addresses the question of how to evaluate such bounds empirically. Jiang et al. (2020) recently described a large-scale empirical study aimed at uncovering potential causal relationships between bounds/measures and generalization. Building on their study, we highlight where their proposed methods can obscure failures and successes of generalization measures in explaining generalization. We argue that generalization measures should instead be evaluated within the framework of distributional robustness.