Gintare Karolina Dziugaite

2024-04-09

ArXiv (preprint)

Evaluating Interventional Reasoning Capabilities of Large Language Models

Tejas Kasetty

Divyat Mahajan

Alexandre Drouin

Dhanya Sridhar

Numerous decision-making tasks require estimating causal effects under interventions on different parts of a system. As practitioners consid… (see more)er using large language models (LLMs) to automate decisions, studying their causal reasoning capabilities becomes crucial. A recent line of work evaluates LLMs ability to retrieve commonsense causal facts, but these evaluations do not sufficiently assess how LLMs reason about interventions. Motivated by the role that interventions play in causal inference, in this paper, we conduct empirical analyses to evaluate whether LLMs can accurately update their knowledge of a data-generating process in response to an intervention. We create benchmarks that span diverse causal graphs (e.g., confounding, mediation) and variable types, and enable a study of intervention-based reasoning. These benchmarks allow us to isolate the ability of LLMs to accurately predict changes resulting from their ability to memorize facts or find other shortcuts. Our analysis on four LLMs highlights that while GPT- 4 models show promising accuracy at predicting the intervention effects, they remain sensitive to distracting factors in the prompts.

2024-04-08

ArXiv (preprint)

Information Complexity of Stochastic Convex Optimization: Applications to Generalization and Memorization

Idan Attias

MAHDI HAGHIFAM

Roi Livni

Daniel M. Roy

In this work, we investigate the interplay between memorization and learning in the context of \emph{stochastic convex optimization} (SCO). … (see more)We define memorization via the information a learning algorithm reveals about its training data points. We then quantify this information using the framework of conditional mutual information (CMI) proposed by Steinke and Zakynthinou (2020). Our main result is a precise characterization of the tradeoff between the accuracy of a learning algorithm and its CMI, answering an open question posed by Livni (2023). We show that, in the

2024-02-14

ArXiv (preprint)

Mixtures of Experts Unlock Parameter Scaling for Deep RL

Johan Samir Obando Ceron

Ghada Sokar

Timon Willi

Clare Lyle

Jesse Farebrother

Jakob Nicolaus Foerster

Doina Precup

Pablo Samuel Castro

2024-02-13

ArXiv (preprint)

Leveraging Function Space Aggregation for Federated Learning at Scale

Nikita Dhawan

Nicole Elyse Mitchell

Zachary Charles

Zachary Garrett

The federated learning paradigm has motivated the development of methods for aggregating multiple client updates into a global server model,… (see more) without sharing client data. Many federated learning algorithms, including the canonical Federated Averaging (FedAvg), take a direct (possibly weighted) average of the client parameter updates, motivated by results in distributed optimization. In this work, we adopt a function space perspective and propose a new algorithm, FedFish, that aggregates local approximations to the functions learned by clients, using an estimate based on their Fisher information. We evaluate FedFish on realistic, large-scale cross-device benchmarks. While the performance of FedAvg can suffer as client models drift further apart, we demonstrate that FedFish is more robust to longer local training. Our evaluation across several settings in image and language benchmarks shows that FedFish outperforms FedAvg as local training epochs increase. Further, FedFish results in global networks that are more amenable to efficient personalization via local fine-tuning on the same or shifted data distributions. For instance, federated pretraining on the C4 dataset, followed by few-shot personalization on Stack Overflow, results in a 7% improvement in next-token prediction by FedFish over FedAvg.

2024-02-12

TMLR (accepted)

The Cost of Scaling Down Large Language Models: Reducing Model Size Affects Memory before In-context Learning

Tian Jin

Nolan Clement

Xin Dong

Vaishnavh Nagarajan

Michael Carbin

Jonathan Ragan-Kelley

We study how down-scaling large language model (LLM) size impacts LLM capabilities. We begin by measuring the effects of weight pruning – … (see more)a popular technique for reducing model size – on the two abilities of LLMs: (a) recalling facts presented during pre-training and (b) processing information presented in context. Surprisingly, we find that existing pruning techniques affect these two abilities of LLMs differently. For example, pruning more than 30% of weights significantly decreases an LLM’s ability to recall facts presented during pre-training. Yet pruning 60-70% of weights largely preserves an LLM’s ability to process information in-context, ranging from retrieving answers based on information presented in context to learning parameterized functions such as a linear classifier based on a few examples. Moderate pruning impairs LLM’s ability to recall facts learnt from pre-training. However, its effect on model’s ability to process information presented in context is much less pronounced. The said disparate effects similarly arise when replacing the original model with a smaller dense one with reduced width and depth. This similarity suggests that model size reduction in general underpins the said disparity.

2024-01-16

ICLR.cc/2024/Conference (poster)

Dataset Difficulty and the Role of Inductive Bias

Devin Kwok

Nikhil Anand

Jonathan Frankle

David Rolnick

Motivated by the goals of dataset pruning and defect identification, a growing body of methods have been developed to score individual examp… (see more)les within a dataset. These methods, which we call"example difficulty scores", are typically used to rank or categorize examples, but the consistency of rankings between different training runs, scoring methods, and model architectures is generally unknown. To determine how example rankings vary due to these random and controlled effects, we systematically compare different formulations of scores over a range of runs and model architectures. We find that scores largely share the following traits: they are noisy over individual runs of a model, strongly correlated with a single notion of difficulty, and reveal examples that range from being highly sensitive to insensitive to the inductive biases of certain model architectures. Drawing from statistical genetics, we develop a simple method for fingerprinting model architectures using a few sensitive examples. These findings guide practitioners in maximizing the consistency of their scores (e.g. by choosing appropriate scoring methods, number of runs, and subsets of examples), and establishes comprehensive baselines for evaluating scores in the future.

2024-01-03

ArXiv (preprint)

JaxPruner: A concise library for sparsity research

Joo Hyung Lee

Wonpyo Park

Nicole Elyse Mitchell

Jonathan Pilault

Johan Samir Obando Ceron

Han-Byul Kim

Namhoon Lee

Elias Frantar

Yun Long

Amir Yazdanbakhsh

Shivani Agrawal

Suvinay Subramanian

Xin Wang

Sheng-Chun Kao

Xingyao Zhang

Trevor Gale

Aart J.C. Bik

Woohyun Han

Milen Ferev

Zhonglin Han … (see 5 more)

Hong-Seok Kim

Yann Dauphin

Pablo Samuel Castro

Utku Evci

This paper introduces JaxPruner, an open-source JAX-based pruning and sparse training library for machine learning research. JaxPruner aims … (see more)to accelerate research on sparse neural networks by providing concise implementations of popular pruning and sparse training algorithms with minimal memory and latency overhead. Algorithms implemented in JaxPruner use a common API and work seamlessly with the popular optimization library Optax, which, in turn, enables easy integration with existing JAX based libraries. We demonstrate this ease of integration by providing examples in four different codebases: Scenic, t5x, Dopamine and FedJAX and provide baseline experiments on popular benchmarks.

2024-01-01

CPAL.cc/2024/Conference (oral)

The Cost of Down-Scaling Language Models: Fact Recall Deteriorates before In-Context Learning

Tian Jin

Nolan Clement

Xin Dong

Vaishnavh Nagarajan

Michael Carbin

Jonathan Ragan-Kelley

How does scaling the number of parameters in large language models (LLMs) affect their core capabilities? We study two natural scaling techn… (see more)iques -- weight pruning and simply training a smaller or larger model, which we refer to as dense scaling -- and their effects on two core capabilities of LLMs: (a) recalling facts presented during pre-training and (b) processing information presented in-context during inference. By curating a suite of tasks that help disentangle these two capabilities, we find a striking difference in how these two abilities evolve due to scaling. Reducing the model size by more than 30\% (via either scaling approach) significantly decreases the ability to recall facts seen in pre-training. Yet, a 60--70\% reduction largely preserves the various ways the model can process in-context information, ranging from retrieving answers from a long context to learning parameterized functions from in-context exemplars. The fact that both dense scaling and weight pruning exhibit this behavior suggests that scaling model size has an inherently disparate effect on fact recall and in-context learning.

2023-10-07

ArXiv (preprint)

Identifying Spurious Biases Early in Training through the Lens of Simplicity Bias

Yu Yang

Eric Gan

Baharan Mirzasoleiman

Neural networks trained with (stochastic) gradient descent have an inductive bias towards learning simpler solutions. This makes them highly… (see more) prone to learning spurious correlations in the training data, that may not hold at test time. In this work, we provide the first theoretical analysis of the effect of simplicity bias on learning spurious correlations. Notably, we show that examples with spurious features are provably separable based on the model's output early in training. We further illustrate that if spurious features have a small enough noise-to-signal ratio, the network's output on the majority of examples is almost exclusively determined by the spurious features, leading to poor worst-group test accuracy. Finally, we propose SPARE, which identifies spurious correlations early in training and utilizes importance sampling to alleviate their effect. Empirically, we demonstrate that SPARE outperforms state-of-the-art methods by up to 21.1% in worst-group accuracy, while being up to 12x faster. We also show that SPARE is a highly effective but lightweight method to discover spurious correlations.

2023-05-30

ArXiv (preprint)

Unmasking the Lottery Ticket Hypothesis: What's Encoded in a Winning Ticket's Mask?

Mansheej Paul

Feng Chen

Brett W. Larsen

Jonathan Frankle

Surya Ganguli

Modern deep learning involves training costly, highly overparameterized networks, thus motivating the search for sparser networks that can s… (see more)till be trained to the same accuracy as the full network (i.e. matching). Iterative magnitude pruning (IMP) is a state of the art algorithm that can find such highly sparse matching subnetworks, known as winning tickets. IMP operates by iterative cycles of training, masking smallest magnitude weights, rewinding back to an early training point, and repeating. Despite its simplicity, the underlying principles for when and how IMP finds winning tickets remain elusive. In particular, what useful information does an IMP mask found at the end of training convey to a rewound network near the beginning of training? How does SGD allow the network to extract this information? And why is iterative pruning needed? We develop answers in terms of the geometry of the error landscape. First, we find that

2023-02-01

ICLR.cc/2023/Conference (notable)

When Majorities Prevent Learning: Eliminating Bias to Improve Worst-group and Out-of-distribution Generalization

Yu Yang

Baharan Mirzasoleiman

Modern neural networks trained on large datasets have achieved state-of-the-art (in-distribution) generalization performance on various task… (see more)s. However, their good generalization performance has been shown to be contributed largely to overfitting spurious biases in large datasets. This is evident by the poor generalization performance of such models on minorities and out-of-distribution data. To alleviate this issue, subsampling the majority groups has been shown to be very effective. However, it is not clear how to find the subgroups (e.g. within a class) in large real-world datasets. Besides, naively subsampling the majority groups can entirely deplete some of their smaller sub-populations and drastically harm the in-distribution performance. Here, we show that tracking gradient trajectories of examples in initial epochs allows for finding large subpopulations of data points. We leverage this observation and propose an importance sampling method that is biased towards selecting smaller subpopulations, and eliminates bias in the large subpopulations. Our experiments confirm the effectiveness of our approach in eliminating spurious biases and learning higher-quality models with superior in- and out-of-distribution performance on various datasets.

2023-02-01

ICLR.cc/2023/Conference (rejected)