Pranshu Malviya

Torque-Aware Momentum

Efficiently exploring complex loss landscapes is key to the performance of deep neural networks. While momentum-based optimizers are widely … (voir plus)used in state-of-the-art setups, classical momentum can still struggle with large, misaligned gradients, leading to oscillations. To address this, we propose Torque-Aware Momentum (TAM), which introduces a damping factor based on the angle between the new gradients and previous momentum, stabilizing the update direction during training. Empirical results show that TAM, which can be combined with both SGD and Adam, enhances exploration, handles distribution shifts more effectively, and improves generalization performance across various tasks, including image classification and large language model fine-tuning, when compared to classical momentum-based optimizers.

2024-12-25

ArXiv (prépublication)

doi.org

arxiv.org

Torque-Aware Momentum

Pranshu Malviya

Goncalo Mordido

Aristide Baratin

Reza Babanezhad Harikandeh

Gintare Karolina Dziugaite

Razvan Pascanu

Sarath Chandar

2024-12-25

ArXiv (prépublication)

doi.org

arxiv.org

Torque-Aware Momentum

Pranshu Malviya

Goncalo Mordido

Aristide Baratin

Reza Babanezhad Harikandeh

Gintare Karolina Dziugaite

Razvan Pascanu

Sarath Chandar

Efficiently exploring complex loss landscapes is key to the performance of deep neural networks. While momentum-based optimizers are widely … (voir plus)used in state-of-the-art setups, classical momentum can still struggle with large, misaligned gradients, leading to oscillations. To address this, we propose Torque-Aware Momentum (TAM), which introduces a damping factor based on the angle between the new gradients and previous momentum, stabilizing the update direction during training. Empirical results show that TAM, which can be combined with both SGD and Adam, enhances exploration, handles distribution shifts more effectively, and improves generalization performance across various tasks, including image classification and large language model fine-tuning, when compared to classical momentum-based optimizers.

2024-12-25

ArXiv (prépublication)

doi.org

arxiv.org

Lookbehind-SAM: k steps back, 1 step forward

2024-07-08

Proceedings of the 41st International Conference on Machine Learning (publié)

proceedings.mlr.press

openreview.net

Promoting Exploration in Memory-Augmented Adam using Critical Momenta

Pranshu Malviya

Goncalo Mordido

Aristide Baratin

Reza Babanezhad Harikandeh

Adaptive gradient-based optimizers, particularly Adam, have left their mark in training large-scale deep learning models. The strength of su… (voir plus)ch optimizers is that they exhibit fast convergence while being more robust to hyperparameter choice. However, they often generalize worse than non-adaptive methods. Recent studies have tied this performance gap to flat minima selection: adaptive methods tend to find solutions in sharper basins of the loss landscape, which in turn hurts generalization. To overcome this issue, we propose a new memory-augmented version of Adam that promotes exploration towards flatter minima by using a buffer of critical momentum terms during training. Intuitively, the use of the buffer makes the optimizer overshoot outside the basin of attraction if it is not wide enough. We empirically show that our method improves the performance of several variants of Adam on standard supervised language modelling and image classification tasks.

2024-06-09

TMLR (accepté)

doi.org

openreview.net

Manifold Metric: A Loss Landscape Approach for Predicting Model Performance

Determining the optimal model for a given task often requires training multiple models from scratch, which becomes impractical as dataset an… (voir plus)d model sizes grow. A more efficient alternative is to expand smaller pre-trained models, but this approach is underutilized due to a limited understanding of its impact on the training dynamics. Existing methods for quantifying this impact have notable limitations, including computation cost. To address this, we introduce a new perspective based on the loss landscape, which has been shown to contain a manifold of linearly connected minima. Specifically, we propose a metric that estimates the size of this manifold to study the impact of model expansion. Our experiments reveal a strong correlation between performance gains and our manifold metric, enabling more informed model comparison and offering a first step toward a geometry-driven approach for reliable model expansion. Notably, our metric outperforms other baselines, even when different types of expansion with equivalent number of parameters are applied to a model.

2024-05-24

ArXiv (prépublication)

arxiv.org

Predicting the Impact of Model Expansion through the Minima Manifold: A Loss Landscape Perspective

The optimal model for a given task is often challenging to determine, requiring training multiple models from scratch which becomes prohibit… (voir plus)ive as dataset and model sizes grow. A more efficient alternative is to reuse smaller pre-trained models by expanding them, however, this is not widely adopted as how this impacts training dynamics remains poorly understood. While prior works have introduced statistics to measure these effects, they remain flawed. To rectify this, we offer a new approach for understanding and quantifying the impact of expansion through the lens of the loss landscape, which has been shown to contain a manifold of linearly connected minima. Building on this new perspective, we propose a metric to study the impact of expansion by estimating the size of the manifold. Experimental results show a clear relationship between gains in performance and manifold size, enabling the comparison of candidate models and presenting a first step towards expanding models more reliably based on geometric properties of the loss landscape.

2024-05-24

ArXiv (prépublication)

doi.org

arxiv.org

Science éclair

À l’avant-garde d’une nouvelle ère

Demandes de supervision

Pranshu Malviya

Publications

Science éclair

À l’avant-garde d’une nouvelle ère

Demandes de supervision

Mots-clés populaires:

Pranshu Malviya

Publications