Devin Kwok

The Butterfly Effect: Neural Network Training Trajectories Are Highly Sensitive to Initial Conditions

Colin Raffel

Neural network training is inherently sensitive to initialization and the randomness induced by stochastic gradient descent. However, it is unclear to what extent such effects lead to meaningfully different networks, either in terms of the models' weights or the underlying functions that were learned. In this work, we show that during the initial "chaotic" phase of training, even extremely small perturbations reliably cause otherwise identical training trajectories to diverge, an effect that diminishes rapidly over training time. We quantify this divergence through (i)

2025-10-06

Proceedings of the 42nd International Conference on Machine Learning (published)

doi.org

openreview.net

The Butterfly Effect: Neural Network Training Trajectories Are Highly Sensitive to Initial Conditions

Gül Sena Altıntaş

Devin Kwok

Colin Raffel

David Rolnick

Neural network training is inherently sensitive to initialization and the randomness induced by stochastic gradient descent. However, it is unclear to what extent such effects lead to meaningfully different networks, either in terms of the models' weights or the underlying functions that were learned. In this work, we show that during the initial "chaotic" phase of training, even extremely small perturbations reliably cause otherwise identical training trajectories to diverge, an effect that diminishes rapidly over training time. We quantify this divergence through (i)

2025-10-06

Proceedings of the 42nd International Conference on Machine Learning (published)

proceedings.mlr.press

The Butterfly Effect: Tiny Perturbations Cause Neural Network Training to Diverge

Gül Sena Altıntaş

Devin Kwok

David Rolnick

Neural network training begins with a chaotic phase in which the network is sensitive to small perturbations, such as those caused by stochastic gradient descent (SGD). This sensitivity can cause identically initialized networks to diverge both in parameter space and functional similarity. However, the exact degree to which networks are sensitive to perturbation, and the sensitivity of networks as they transition out of the chaotic phase, are unclear. To address this uncertainty, we apply a controlled perturbation at a single point in training time and measure its effect on otherwise identical training trajectories. We find that both the

2024-06-16

ICML.cc/2024/Workshop/HiLD (poster)

openreview.net

Dataset Difficulty and the Role of Inductive Bias

Devin Kwok

Nikhil Anand

Jonathan Frankle

Gintare Karolina Dziugaite

David Rolnick

Motivated by the goals of dataset pruning and defect identification, a growing body of methods has been developed to score individual examples within a dataset. These methods, which we call "example difficulty scores", are typically used to rank or categorize examples, but the consistency of rankings between different training runs, scoring methods, and model architectures is generally unknown. To determine how example rankings vary due to these random and controlled effects, we systematically compare different formulations of scores over a range of runs and model architectures. We find that scores largely share the following traits: they are noisy over individual runs of a model, strongly correlated with a single notion of difficulty, and reveal examples that range from being highly sensitive to insensitive to the inductive biases of certain model architectures. Drawing from statistical genetics, we develop a simple method for fingerprinting model architectures using a few sensitive examples. These findings guide practitioners in maximizing the consistency of their scores (e.g. by choosing appropriate scoring methods, number of runs, and subsets of examples), and establish comprehensive baselines for evaluating scores in the future.

2024-01-03

ArXiv (preprint)

doi.org

arxiv.org

Simultaneous linear connectivity of neural networks modulo permutation

Ekansh Sharma

Devin Kwok

Tom Denton

Daniel M. Roy

David Rolnick

Gintare Karolina Dziugaite

2024-01-01