Aristide Baratin

arxiv.org

Molecule property prediction with molecular orbitals

Yan Zhang

Khang Ngo

Sékou-Oumar Kaba

Daniel T. Levy

Siamak Ravanbakhsh

Kisoo Kwon

MiYoung Jang

Eun Hyun Cho

Sangha Park

Sanghyun Yoo

Young-Seok Kim

Hasup Lee

Molecular orbitals describe the distribution of electrons in a molecule and are frequently used by chemists to understand properties of mole… (see more)cules, yet machine learning has neglected them so far. If atom coordinates are obtained through DFT anyway, they can be obtained for free at the same time and are thus a useful source of additional data, particularly when data is scarce We give an introduction to molecular orbitals for a machine learning audience and propose models to process three different representations of them. Experiments on a dataset with experimental properties show that including MOs significantly improves performance and sample efficiency over a pretrained molecular foundation model on this real-world task.

2026-03-01

AI4Mat @ International Conference on Learning Representations (poster)

Maxwell's Demon at Work: Efficient Pruning by Leveraging Saturation of Neurons

When training neural networks, dying neurons -- units becoming inactive or saturated -- are traditionally seen as harmful. This paper sheds … (see more)new light on this phenomenon. By exploring the impact of various hyperparameter configurations on dying neurons during training, we gather insights on how to improve upon sparse training approaches to pruning. We introduce Demon Pruning (DemP), a method that controls the proliferation of dead neurons through a combination of noise injection on active units and a one-cycle schedule regularization strategy, dynamically leading to network sparsity. Experiments on CIFAR-10 and ImageNet datasets demonstrate that DemP outperforms existing dense-to-sparse structured pruning methods, achieving better accuracy-sparsity tradeoffs and accelerating training by up to 3.56

2025-02-12

TMLR (accepted)

Torque-Aware Momentum

Pranshu Malviya

Goncalo Mordido

Reza Babanezhad Harikandeh

Gintare Karolina Dziugaite

Razvan Pascanu

Sarath Chandar

Efficiently exploring complex loss landscapes is key to the performance of deep neural networks. While momentum-based optimizers are widely … (see more)used in state-of-the-art setups, classical momentum can still struggle with large, misaligned gradients, leading to oscillations. To address this, we propose Torque-Aware Momentum (TAM), which introduces a damping factor based on the angle between the new gradients and previous momentum, stabilizing the update direction during training. Empirical results show that TAM, which can be combined with both SGD and Adam, enhances exploration, handles distribution shifts more effectively, and improves generalization performance across various tasks, including image classification and large language model fine-tuning, when compared to classical momentum-based optimizers.

2024-12-24

ArXiv (preprint)

arxiv.org

How Learning Rates Shape Neural Network Focus: Insights from Example Ranking

Ekaterina Lobacheva

Keller Jordan

Nicolas Roux

The learning rate is a key hyperparameter that affects both the speed of training and the generalization performance of neural networks. Th… (see more)rough a new {\it loss-based example ranking} analysis, we show that networks trained with different learning rates focus their capacity on different parts of the data distribution, leading to solutions with different generalization properties. These findings, which hold across architectures and datasets, provide new insights into how learning rates affect model performance and example-level dynamics in neural networks.

2024-10-09

NeurIPS.cc/2024/Workshop/SciForDL (poster)

Lookbehind-SAM: k Steps Back, 1 Step Forward

Sharpness-aware minimization (SAM) methods have gained increasing popularity by formulating the problem of minimizing both loss value and lo… (see more)ss sharpness as a minimax objective. In this work, we increase the efficiency of the maximization and minimization parts of SAM's objective to achieve a better loss-sharpness trade-off. By taking inspiration from the Lookahead optimizer, which uses multiple descent steps ahead, we propose Lookbehind, which performs multiple ascent steps behind to enhance the maximization step of SAM and find a worst-case perturbation with higher loss. Then, to mitigate the variance in the descent step arising from the gathered gradients across the multiple ascent steps, we employ linear interpolation to refine the minimization step. Lookbehind leads to a myriad of benefits across a variety of tasks. Particularly, we show increased generalization performance, greater robustness against noisy weights, as well as improved learning and less catastrophic forgetting in lifelong learning settings. Our code is available at https://github.com/chandar-lab/Lookbehind-SAM.

2024-07-07

Proceedings of the 41st International Conference on Machine Learning (published)

proceedings.mlr.press

Unsupervised Concept Discovery Mitigates Spurious Correlations

Md Rifat Arefin

Francesco Locatello

Dianbo Liu

Models prone to spurious correlations in training data often produce brittle predictions and introduce unintended biases. Addressing this ch… (see more)allenge typically involves methods relying on prior knowledge and group annotation to remove spurious correlations, which may not be readily available in many applications. In this paper, we establish a novel connection between unsupervised object-centric learning and mitigation of spurious correlations. Instead of directly inferring subgroups with varying correlations with labels, our approach focuses on discovering concepts: discrete ideas that are shared across input samples. Leveraging existing object-centric representation learning, we introduce CoBalT: a concept balancing technique that effectively mitigates spurious correlations without requiring human labeling of subgroups. Evaluation across the benchmark datasets for sub-population shifts demonstrate superior or competitive performance compared state-of-the-art baselines, without the need for group annotation. Code is available at https://github.com/rarefin/CoBalT.

2024-07-07

Proceedings of the 41st International Conference on Machine Learning (published)

proceedings.mlr.press

Promoting Exploration in Memory-Augmented Adam using Critical Momenta

Pranshu Malviya

Goncalo Mordido

Reza Babanezhad Harikandeh

Adaptive gradient-based optimizers, notably Adam, have left their mark in training large-scale deep learning models, offering fast convergen… (see more)ce and robustness to hyperparameter settings. However, they often struggle with generalization, attributed to their tendency to converge to sharp minima in the loss landscape. To address this, we propose a new memory-augmented version of Adam that encourages exploration towards flatter minima by incorporating a buffer of critical momentum terms during training. This buffer prompts the optimizer to overshoot beyond narrow minima, promoting exploration. Through comprehensive analysis in simple settings, we illustrate the efficacy of our approach in increasing exploration and bias towards flatter minima. We empirically demonstrate that it can improve model performance for image classification on ImageNet and CIFAR10/100, language modelling on Penn Treebank, and online learning tasks on TinyImageNet and 5-dataset. Our code is available at https://github.com/chandar-lab/CMOptimizer.

2024-06-08

Transactions on Machine Learning Research (accepted)

Predicting the Impact of Model Expansion through the Minima Manifold: A Loss Landscape Perspective

The optimal model for a given task is often challenging to determine, requiring training multiple models from scratch which becomes prohibit… (see more)ive as dataset and model sizes grow. A more efficient alternative is to reuse smaller pre-trained models by expanding them, however, this is not widely adopted as how this impacts training dynamics remains poorly understood. While prior works have introduced statistics to measure these effects, they remain flawed. To rectify this, we offer a new approach for understanding and quantifying the impact of expansion through the lens of the loss landscape, which has been shown to contain a manifold of linearly connected minima. Building on this new perspective, we propose a metric to study the impact of expansion by estimating the size of the manifold. Experimental results show a clear relationship between gains in performance and manifold size, enabling the comparison of candidate models and presenting a first step towards expanding models more reliably based on geometric properties of the loss landscape.

2024-05-23

arXiv (preprint)

arxiv.org

How Connectivity Structure Shapes Rich and Lazy Learning in Neural Circuits

Yuhan Helena Liu

Jonathan Cornford

Stefan Mihalas

Eric Shea-Brown

Guillaume Lajoie

In theoretical neuroscience, recent work leverages deep learning tools to explore how some network attributes critically influence its learn… (see more)ing dynamics. Notably, initial weight distributions with small (resp. large) variance may yield a rich (resp. lazy) regime, where significant (resp. minor) changes to network states and representation are observed over the course of learning. However, in biology, neural circuit connectivity could exhibit a low-rank structure and therefore differs markedly from the random initializations generally used for these studies. As such, here we investigate how the structure of the initial weights -- in particular their effective rank -- influences the network learning regime. Through both empirical and theoretical analyses, we discover that high-rank initializations typically yield smaller network changes indicative of lazier learning, a finding we also confirm with experimentally-driven initial connectivity in recurrent neural networks. Conversely, low-rank initialization biases learning towards richer learning. Importantly, however, as an exception to this rule, we find lazier learning can still occur with a low-rank initialization that aligns with task and data statistics. Our research highlights the pivotal role of initial weight structures in shaping learning regimes, with implications for metabolic costs of plasticity and risks of catastrophic forgetting.

2024-01-15

ICLR.cc/2024/Conference (poster)

CrossSplit: Mitigating Label Noise Memorization through Data Splitting

Jihye Kim

Yan Zhang

Simon Lacoste-Julien

We approach the problem of improving robustness of deep learning algorithms in the presence of label noise. Building upon existing label cor… (see more)rection and co-teaching methods, we propose a novel training procedure to mitigate the memorization of noisy labels, called CrossSplit, which uses a pair of neural networks trained on two disjoint parts of the labelled dataset. CrossSplit combines two main ingredients: (i) Cross-split label correction. The idea is that, since the model trained on one part of the data cannot memorize example-label pairs from the other part, the training labels presented to each network can be smoothly adjusted by using the predictions of its peer network; (ii) Cross-split semi-supervised training. A network trained on one part of the data also uses the unlabeled inputs of the other part. Extensive experiments on CIFAR-10, CIFAR-100, Tiny-ImageNet and mini-WebVision datasets demonstrate that our method can outperform the current state-of-the-art in a wide range of noise ratios.

2023-07-02

Proceedings of the 40th International Conference on Machine Learning (published)