Gintare Karolina Dziugaite

2024-01-01

ECML/PKDD (7) (published)

The Cost of Scaling Down Large Language Models: Reducing Model Size Affects Memory before In-context Learning.

Tian Jin

Nolan Clement

Xin Dong

Vaishnavh Nagarajan

Michael Carbin

Jonathan Ragan-Kelley

2024-01-01

International Conference on Learning Representations (published)

Leveraging Function Space Aggregation for Federated Learning at Scale

Nikita Dhawan

Nicole Elyse Mitchell

Zachary Charles

Zachary Garrett

The federated learning paradigm has motivated the development of methods for aggregating multiple client updates into a global server model, without sharing client data. Many federated learning algorithms, including the canonical Federated Averaging (FedAvg), take a direct (possibly weighted) average of the client parameter updates, motivated by results in distributed optimization. In this work, we adopt a function space perspective and propose a new algorithm, FedFish, that aggregates local approximations to the functions learned by clients, using an estimate based on their Fisher information. We evaluate FedFish on realistic, large-scale cross-device benchmarks. While the performance of FedAvg can suffer as client models drift further apart, we demonstrate that FedFish is more robust to longer local training. Our evaluation across several settings in image and language benchmarks shows that FedFish outperforms FedAvg as local training epochs increase. Further, FedFish results in global networks that are more amenable to efficient personalization via local fine-tuning on the same or shifted data distributions. For instance, federated pretraining on the C4 dataset, followed by few-shot personalization on Stack Overflow, results in a 7% improvement in next-token prediction by FedFish over FedAvg.
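
To make the baseline concrete, the "direct (possibly weighted) average of the client parameter updates" that the abstract attributes to FedAvg can be sketched as below. This is only an illustrative sketch: the function name `fedavg_aggregate`, the flat-vector representation of updates, and the weighting by example count are assumptions, not code from the paper, and it does not implement FedFish.

```python
import numpy as np

def fedavg_aggregate(client_updates, client_weights=None):
    """Weighted average of client parameter updates (FedAvg-style baseline).

    client_updates: list of equally shaped arrays, one per client (assumed layout).
    client_weights: optional per-client weights, e.g. local example counts.
    """
    updates = np.stack([np.asarray(u, dtype=float) for u in client_updates])
    if client_weights is None:
        client_weights = np.ones(len(updates))  # plain unweighted mean
    w = np.asarray(client_weights, dtype=float)
    w = w / w.sum()  # normalize so the weights sum to one
    # Contract the client axis: sum_k w[k] * updates[k]
    return np.tensordot(w, updates, axes=1)

# Two hypothetical clients; the second holds three times as much data.
global_update = fedavg_aggregate([[1.0, 2.0], [3.0, 4.0]], client_weights=[1, 3])
```

In this toy call the second client dominates the average in proportion to its weight, which is the behavior the abstract contrasts with function-space aggregation.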

2023-11-17

ArXiv (preprint)

The Cost of Down-Scaling Language Models: Fact Recall Deteriorates before In-Context Learning

Tian Jin

Nolan Clement

Xin Dong

Vaishnavh Nagarajan

Michael Carbin

Jonathan Ragan-Kelley

How does scaling the number of parameters in large language models (LLMs) affect their core capabilities? We study two natural scaling techniques (weight pruning, and simply training a smaller or larger model, which we refer to as dense scaling) and their effects on two core capabilities of LLMs: (a) recalling facts presented during pre-training and (b) processing information presented in-context during inference. By curating a suite of tasks that help disentangle these two capabilities, we find a striking difference in how these two abilities evolve due to scaling. Reducing the model size by more than 30% (via either scaling approach) significantly decreases the ability to recall facts seen in pre-training. Yet, a 60-70% reduction largely preserves the various ways the model can process in-context information, ranging from retrieving answers from a long context to learning parameterized functions from in-context exemplars. The fact that both dense scaling and weight pruning exhibit this behavior suggests that scaling model size has an inherently disparate effect on fact recall and in-context learning.

2023-10-07

ArXiv (preprint)

Limitations of Information-Theoretic Generalization Bounds for Gradient Descent Methods in Stochastic Convex Optimization

Mahdi Haghifam

Borja Rodríguez-Gálvez

Ragnar Thobaben

Mikael Skoglund

Daniel M. Roy

2023-02-13

Proceedings of The 34th International Conference on Algorithmic Learning Theory (published)

Unmasking the Lottery Ticket Hypothesis: What's Encoded in a Winning Ticket's Mask?

Mansheej Paul

Feng Chen

Brett W. Larsen

Jonathan Frankle

Surya Ganguli

Modern deep learning involves training costly, highly overparameterized networks, thus motivating the search for sparser networks that can still be trained to the same accuracy as the full network (i.e. matching). Iterative magnitude pruning (IMP) is a state-of-the-art algorithm that can find such highly sparse matching subnetworks, known as winning tickets. IMP operates by iterative cycles of training, masking smallest magnitude weights, rewinding back to an early training point, and repeating. Despite its simplicity, the underlying principles for when and how IMP finds winning tickets remain elusive. In particular, what useful information does an IMP mask found at the end of training convey to a rewound network near the beginning of training? How does SGD allow the network to extract this information? And why is iterative pruning needed? We develop answers in terms of the geometry of the error landscape. First, we find that

2023-02-01

ICLR.cc/2023/Conference (notable)

When Majorities Prevent Learning: Eliminating Bias to Improve Worst-group and Out-of-distribution Generalization

Yu Yang

Baharan Mirzasoleiman

Modern neural networks trained on large datasets have achieved state-of-the-art (in-distribution) generalization performance on various tasks. However, their good generalization performance has been shown to be attributable largely to overfitting spurious biases in large datasets. This is evident from the poor generalization performance of such models on minorities and out-of-distribution data. To alleviate this issue, subsampling the majority groups has been shown to be very effective. However, it is not clear how to find the subgroups (e.g. within a class) in large real-world datasets. Besides, naively subsampling the majority groups can entirely deplete some of their smaller sub-populations and drastically harm the in-distribution performance. Here, we show that tracking gradient trajectories of examples in initial epochs allows for finding large subpopulations of data points. We leverage this observation and propose an importance sampling method that is biased towards selecting smaller subpopulations, and eliminates bias in the large subpopulations. Our experiments confirm the effectiveness of our approach in eliminating spurious biases and learning higher-quality models with superior in- and out-of-distribution performance on various datasets.

2023-02-01

ICLR.cc/2023/Conference (rejected)

Unmasking the Lottery Ticket Hypothesis: Efficient Adaptive Pruning for Finding Winning Tickets

Mansheej Paul

Feng Chen

Brett W. Larsen

Jonathan Frankle

Surya Ganguli

Modern deep learning involves training costly, highly overparameterized networks, thus motivating the search for sparser networks that require less compute and memory but can still be trained to the same accuracy as the full network (i.e. matching). Iterative magnitude pruning (IMP) is a state-of-the-art algorithm that can find such highly sparse matching subnetworks, known as winning tickets, that can be retrained from initialization or an early training stage. IMP operates by iterative cycles of training, masking a fraction of smallest magnitude weights, rewinding unmasked weights back to an early training point, and repeating. Despite its simplicity, the underlying principles for when and how IMP finds winning tickets remain elusive. In particular, what useful information does an IMP mask found at the end of training convey to a rewound network near the beginning of training? We find that, at higher sparsities, pairs of pruned networks at successive pruning iterations are connected by a linear path with zero error barrier if and only if they are matching. This indicates that masks found at the end of training encode information about the identity of an axial subspace that intersects a desired linearly connected mode of a matching sublevel set. We leverage this observation to design a simple adaptive pruning heuristic for speeding up the discovery of winning tickets and achieve a 30% reduction in computation time on CIFAR-100. These results make progress toward demystifying the existence of winning tickets with an eye towards enabling the development of more efficient pruning algorithms.
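
The IMP cycle described in this abstract (train, mask a fraction of the smallest-magnitude weights, rewind the surviving weights, repeat) can be sketched on a toy problem as follows. This is a hedged illustration only: the masked least-squares `train` routine stands in for SGD on a real network, the names `train`, `iterative_magnitude_pruning`, `w_rewind`, and `prune_frac` are hypothetical, and the sketch shows generic IMP with rewinding rather than the adaptive pruning heuristic the authors propose.

```python
import numpy as np

def train(weights, mask, X, y, steps=200, lr=0.1):
    """Toy stand-in for SGD: masked gradient descent on least squares."""
    w = weights * mask
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)
        w = (w - lr * grad) * mask  # pruned weights stay at zero
    return w

def iterative_magnitude_pruning(w_rewind, X, y, rounds=3, prune_frac=0.5):
    """Return (mask, weights retrained from the rewind point under that mask)."""
    mask = np.ones_like(w_rewind)
    for _ in range(rounds):
        # 1. Train from the rewind point under the current mask.
        w_trained = train(w_rewind, mask, X, y)
        # 2. Mask a fraction of the smallest-magnitude surviving weights.
        alive = np.flatnonzero(mask)
        n_prune = int(prune_frac * alive.size)
        order = alive[np.argsort(np.abs(w_trained[alive]))]
        mask[order[:n_prune]] = 0.0
        # 3. Rewind: the next round restarts from w_rewind, not w_trained.
    return mask, train(w_rewind, mask, X, y)

# Synthetic data where only the first two of eight weights matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
y = X @ np.array([3.0, -2.0, 0, 0, 0, 0, 0, 0])
w0 = 0.1 * rng.normal(size=8)  # plays the role of the early-training checkpoint
mask, w_sparse = iterative_magnitude_pruning(w0, X, y, rounds=2, prune_frac=0.5)
```

In this toy setting, two rounds at 50% pruning leave two of the eight weights unmasked, and the mask found at the end of training is what gets carried back to the rewound weights.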

2022-10-20

NeurIPS.cc/2022/Workshop/HITY (accepted)

Understanding Generalization via Leave-One-Out Conditional Mutual Information

Mahdi Haghifam

Shay Moran

Daniel M. Roy

2022-01-01

ISIT (published)

Probabilistic fine-tuning of pruning masks and PAC-Bayes self-bounded learning

Soufiane Hayou

Bo He

2021-10-22

ArXiv (preprint)

Stochastic Neural Network with Kronecker Flow

Chin-wei Huang

Ahmed Touati

Pascal Vincent

Alexandre Lacoste

Aaron Courville

Recent advances in variational inference enable the modelling of highly structured joint distributions, but are limited in their capacity to scale to the high-dimensional setting of stochastic neural networks. This limitation motivates a need for scalable parameterizations of the noise generation process, in a manner that adequately captures the dependencies among the various parameters. In this work, we address this need and present the Kronecker Flow, a generalization of the Kronecker product to invertible mappings designed for stochastic neural networks. We apply our method to variational Bayesian neural networks on predictive tasks, PAC-Bayes generalization bound estimation, and approximate Thompson sampling in contextual bandits. In all setups, our methods prove to be competitive with existing methods and better than the baselines.

2020-06-03

Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics (published)

In Search of Robust Measures of Generalization

Brady Neal

Linbo Wang

Daniel M. Roy

One of the principal scientific challenges in deep learning is explaining generalization, i.e., why the particular way the community now trains networks to achieve small training error also leads to small error on held-out data from the same population. It is widely appreciated that some worst-case theories (such as those based on the VC dimension of the class of predictors induced by modern neural network architectures) are unable to explain empirical performance. A large volume of work aims to close this gap, primarily by developing bounds on generalization error, optimization error, and excess risk. When evaluated empirically, however, most of these bounds are numerically vacuous. Focusing on generalization bounds, this work addresses the question of how to evaluate such bounds empirically. Jiang et al. (2020) recently described a large-scale empirical study aimed at uncovering potential causal relationships between bounds/measures and generalization. Building on their study, we highlight where their proposed methods can obscure failures and successes of generalization measures in explaining generalization. We argue that generalization measures should instead be evaluated within the framework of distributional robustness.