
Hiroki Naganuma

PhD - UdeM
Principal supervisor
Research topics
Representation learning
Deep learning
Out-of-distribution (OOD) generalization
Foundation models
Optimization
Loss landscapes
Robustness
Distributed systems

Publications

The Geometry of Spectral Gradient Descent: Layerwise Criteria for SignSGD vs SpecSGD
Optimization in deep learning has expanded beyond Euclidean methods to include entrywise sign updates (SignSGD) and spectral sign updates (SpecGD/Muon). While both can be viewed as steepest descent under non-Euclidean geometries …
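As background for the steepest-descent framing in this (truncated) abstract, the norm-constrained steepest-descent step and its two familiar special cases can be written as below. The notation is illustrative and my own, not necessarily the paper's.

```latex
% Steepest descent under a norm ball of radius \eta (illustrative notation):
\Delta W \;=\; \arg\min_{\|\Delta\| \,\le\, \eta} \langle \nabla L(W),\, \Delta \rangle
% Entrywise max norm \|\cdot\|_\infty: the sign update used by SignSGD
\Delta W \;=\; -\eta\,\operatorname{sign}\!\big(\nabla L(W)\big)
% Spectral norm, with SVD \nabla L(W) = U \Sigma V^\top: the spectral sign update (SpecGD/Muon)
\Delta W \;=\; -\eta\, U V^\top
```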
DisTaC: Conditioning Task Vectors via Distillation for Robust Model Merging
Kotaro Yoshida
Yuji Naraki
Takafumi Horie
Ryotaro Shimizu
Model merging has emerged as an efficient and flexible paradigm for multi-task learning, with numerous methods being proposed in recent years. However, these state-of-the-art techniques are typically evaluated on benchmark suites that are highly favorable to model merging, and their robustness in more realistic settings remains largely unexplored. In this work, we first investigate the vulnerabilities of model-merging methods and pinpoint the source-model characteristics that critically underlie them. Specifically, we identify two factors that are particularly harmful to the merging process: (1) disparities in task vector norms, and (2) the low confidence of the source models. To address this issue, we propose **DisTaC** (**Dis**tillation for **Ta**sk vector **C**onditioning), a novel method that pre-conditions these problematic task vectors before the merge. DisTaC leverages knowledge distillation to adjust a task vector's norm and increase source-model confidence while preserving its essential task-specific knowledge. Our extensive experiments demonstrate that by pre-conditioning task vectors with DisTaC, state-of-the-art merging techniques can successfully integrate models that exhibit these harmful traits, where they would otherwise fail, and achieve significant performance gains.
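As context for the task-vector framing in this abstract, here is a minimal, hypothetical sketch of task-vector arithmetic with a naive norm-equalization step standing in for DisTaC's distillation-based conditioning. The function names and the simple rescaling are illustrative assumptions, not the paper's method.

```python
import torch

def task_vector(base_state, finetuned_state):
    """Task vector = fine-tuned weights minus pre-trained weights (per tensor)."""
    return {k: finetuned_state[k] - base_state[k] for k in base_state}

def global_norm(tv):
    """Norm of the task vector viewed as one flattened vector."""
    return torch.sqrt(sum((v.float() ** 2).sum() for v in tv.values()))

def rescale_to(tv, target_norm):
    """Naive norm equalization; DisTaC instead conditions the vector via distillation."""
    scale = target_norm / (global_norm(tv) + 1e-12)
    return {k: v * scale for k, v in tv.items()}

def merge(base_state, task_vectors, alpha=0.3):
    """Task-arithmetic merge: add the scaled sum of task vectors to the base weights."""
    merged = {k: v.clone() for k, v in base_state.items()}
    for tv in task_vectors:
        for k, v in tv.items():
            merged[k] += alpha * v
    return merged
```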
Pseudo-Asynchronous Local SGD: Robust and Efficient Data-Parallel Training
Xinzhi Zhang
Man-Chung Yue
Russell J. Hewett
Philipp Andre Witte
Yin Tat Lee
An Empirical Study of Pre-trained Model Selection for Out-of-Distribution Generalization and Calibration
Ryuichiro Hataya
Kotaro Yoshida
No Wrong Turns: The Simple Geometry Of Neural Networks Optimization Paths
Understanding the optimization dynamics of neural networks is necessary for closing the gap between theory and practice. Stochastic first-order optimization algorithms are known to efficiently locate favorable minima in deep neural networks. This efficiency, however, contrasts with the non-convex and seemingly complex structure of neural loss landscapes. In this study, we delve into the fundamental geometric properties of sampled gradients along optimization paths. We focus on two key quantities, which appear in the restricted secant inequality and error bound. Both hold high significance for first-order optimization. Our analysis reveals that these quantities exhibit predictable, consistent behavior throughout training, despite the stochasticity induced by sampling minibatches. Our findings suggest that not only do optimization trajectories never encounter significant obstacles, but they also maintain stable dynamics during the majority of training. These observed properties are sufficiently expressive to theoretically guarantee linear convergence and prescribe learning rate schedules mirroring empirical practices. We conduct our experiments on image classification, semantic segmentation and language modeling across different batch sizes, network architectures, datasets, optimizers, and initialization seeds. We discuss the impact of each factor. Our work provides novel insights into the properties of neural network loss functions, and opens the door to theoretical frameworks more relevant to prevalent practice.
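For readers unfamiliar with the two conditions named in the abstract, the standard forms of the restricted secant inequality (RSI) and the error bound (EB), stated with respect to a reference minimizer w*, are sketched below. The constants and the choice of reference point follow common usage and are my assumptions about notation, not the paper's exact definitions.

```latex
% Restricted secant inequality (RSI), with constant \mu > 0:
\langle \nabla L(w),\; w - w^{*} \rangle \;\ge\; \mu\, \|w - w^{*}\|^{2}
% Error bound (EB), with constant \theta > 0:
\|\nabla L(w)\| \;\ge\; \theta\, \|w - w^{*}\|
```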
Smoothness-Adaptive Sharpness-Aware Minimization for Finding Flatter Minima
Junhyung Lyle Kim
Anastasios Kyrillidis
The sharpness-aware minimization (SAM) procedure recently gained increasing attention due to its favorable generalization ability to unseen data. SAM aims to find flatter (local) minima, utilizing a minimax objective. An immediate challenge in the application of SAM is the adjustment of two pivotal step sizes, which significantly influence its effectiveness. We introduce a novel, straightforward approach for adjusting step sizes that adapts to the smoothness of the objective function, thereby reducing the necessity for manual tuning. This method, termed Smoothness-Adaptive SAM (SA-SAM), not only simplifies the optimization process but also promotes the method's inherent tendency to converge towards flatter minima, enhancing performance in specific models.
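For context, the SAM minimax objective and its usual one-step approximation make the two step sizes explicit: the perturbation radius \rho of the inner maximization and the learning rate \eta of the outer update. The notation below is the standard one from the SAM literature, not necessarily SA-SAM's.

```latex
% SAM objective: seek weights whose entire \rho-neighborhood has low loss
\min_{w}\; \max_{\|\epsilon\|_{2} \le \rho} L(w + \epsilon)
% One-step approximation used in practice (two step sizes: \rho and \eta):
\epsilon_{t} = \rho\, \frac{\nabla L(w_{t})}{\|\nabla L(w_{t})\|}, \qquad
w_{t+1} = w_{t} - \eta\, \nabla L\!\big(w_{t} + \epsilon_{t}\big)
```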
Empirical Study on Optimizer Selection for Out-of-Distribution Generalization
Shiro Takagi
Tetsuya Motokawa
Rio Yokota
Kohta Ishikawa
Ikuro Sato
Modern deep learning systems do not generalize well when the test data distribution is slightly different from the training data distribution. While much promising work has been accomplished to address this fragility, a systematic study of the role of optimizers and their out-of-distribution generalization performance has not been undertaken. In this study, we examine the performance of popular first-order optimizers for different classes of distributional shift under empirical risk minimization and invariant risk minimization. We address this question for image and text classification using DomainBed, WILDS, and Backgrounds Challenge as testbeds for studying different types of shifts---namely correlation and diversity shift. We search over a wide range of hyperparameters and examine classification accuracy (in-distribution and out-of-distribution) for over 20,000 models. We arrive at the following findings, which we expect to be helpful for practitioners: i) adaptive optimizers (e.g., Adam) perform worse than non-adaptive optimizers (e.g., SGD, momentum SGD) on out-of-distribution performance. In particular, even though there is no significant difference in in-distribution performance, we show a measurable difference in out-of-distribution performance. ii) in-distribution performance and out-of-distribution performance exhibit three types of behavior depending on the dataset---linear returns, increasing returns, and diminishing returns. For example, when training on natural language data with Adam, further improving in-distribution performance does not significantly contribute to out-of-distribution generalization performance.
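For reference, the two training objectives compared in this study are empirical risk minimization over the pooled training environments and invariant risk minimization; the IRMv1 surrogate below follows Arjovsky et al.'s formulation and is included only as background, with R_e denoting the risk in environment e, not necessarily the study's exact notation.

```latex
% ERM over training environments e \in \mathcal{E}:
\min_{\theta}\; \sum_{e \in \mathcal{E}} R_{e}(\theta),
\qquad R_{e}(\theta) = \mathbb{E}_{(x,y)\sim D_{e}}\big[\ell(f_{\theta}(x),\, y)\big]
% IRMv1 surrogate (Arjovsky et al.), with penalty weight \lambda:
\min_{\theta}\; \sum_{e \in \mathcal{E}} \Big( R_{e}(\theta)
  + \lambda\, \big\| \nabla_{w \,\mid\, w=1.0}\, R_{e}(w \cdot f_{\theta}) \big\|^{2} \Big)
```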