Aristide Baratin

Promoting Exploration in Memory-Augmented Adam using Critical Momenta

Pranshu Malviya

Goncalo Mordido

Aristide Baratin

Reza Babanezhad Harikandeh

Jerry Huang

Simon Lacoste-Julien

Razvan Pascanu

Sarath Chandar Anbil Parthipan

Adaptive gradient-based optimizers, particularly Adam, have left their mark in training large-scale deep learning models. The strength of such optimizers is that they exhibit fast convergence while being more robust to hyperparameter choice. However, they often generalize worse than non-adaptive methods. Recent studies have tied this performance gap to flat minima selection: adaptive methods tend to find solutions in sharper basins of the loss landscape, which in turn hurts generalization. To overcome this issue, we propose a new memory-augmented version of Adam that promotes exploration towards flatter minima by using a buffer of critical momentum terms during training. Intuitively, the use of the buffer makes the optimizer overshoot outside the basin of attraction if it is not wide enough. We empirically show that our method improves the performance of several variants of Adam on standard supervised language modelling and image classification tasks.

2024-06-09

TMLR (accepted)

doi.org

openreview.net

Lookbehind-SAM: k steps back, 1 step forward

Goncalo Mordido

Pranshu Malviya

Aristide Baratin

Sarath Chandar Anbil Parthipan

2024-05-01

ICML 2024 Conference (poster)

openreview.net

Unsupervised Concept Discovery Mitigates Spurious Correlations

Md Rifat Arefin

Yan Zhang

Aristide Baratin

Francesco Locatello

Irina Rish

Dianbo Liu

Kenji Kawaguchi

2024-05-01

ICML 2024 Conference (poster)

doi.org

openreview.net

Maxwell's Demon at Work: Efficient Pruning by Leveraging Saturation of Neurons

Simon Dufort-Labbé

Pierluca D'Oro

Evgenii Nikishin

Razvan Pascanu

Pierre-Luc Bacon

Aristide Baratin

2024-03-12

ArXiv (preprint)

doi.org

arxiv.org

How connectivity structure shapes rich and lazy learning in neural circuits

Yuhan Helena Liu

Aristide Baratin

Jonathan Cornford

Stefan Mihalas

Eric Todd Shea-Brown

Guillaume Lajoie

In theoretical neuroscience, recent work leverages deep learning tools to explore how some network attributes critically influence its learning dynamics. Notably, initial weight distributions with small (resp. large) variance may yield a rich (resp. lazy) regime, where significant (resp. minor) changes to network states and representation are observed over the course of learning. However, in biology, neural circuit connectivity generally has a low-rank structure and therefore differs markedly from the random initializations generally used for these studies. As such, here we investigate how the structure of the initial weights, in particular their effective rank, influences the network learning regime. Through both empirical and theoretical analyses, we discover that high-rank initializations typically yield smaller network changes indicative of lazier learning, a finding we also confirm with experimentally-driven initial connectivity in recurrent neural networks. Conversely, low-rank initialization biases learning towards richer learning. Importantly, however, as an exception to this rule, we find lazier learning can still occur with a low-rank initialization that aligns with task and data statistics. Our research highlights the pivotal role of initial weight structures in shaping learning regimes, with implications for metabolic costs of plasticity and risks of catastrophic forgetting.

2024-01-16

ICLR 2024 Conference (poster)

doi.org

openreview.net

Using Representation Expressiveness and Learnability to Evaluate Self-Supervised Learning Methods

Yuchen Lu

Zhen Liu

Aristide Baratin

Romain Laroche

Aaron Courville

Alessandro Sordoni

2023-11-14

TMLR (accepted)

openreview.net

CrossSplit: Mitigating Label Noise Memorization through Data Splitting

Jihye Kim

Aristide Baratin

Yan Zhang

Simon Lacoste-Julien

We approach the problem of improving robustness of deep learning algorithms in the presence of label noise. Building upon existing label correction and co-teaching methods, we propose a novel training procedure to mitigate the memorization of noisy labels, called CrossSplit, which uses a pair of neural networks trained on two disjoint parts of the labeled dataset. CrossSplit combines two main ingredients: (i) Cross-split label correction. The idea is that, since the model trained on one part of the data cannot memorize example-label pairs from the other part, the training labels presented to each network can be smoothly adjusted by using the predictions of its peer network; (ii) Cross-split semi-supervised training. A network trained on one part of the data also uses the unlabeled inputs of the other part. Extensive experiments on CIFAR-10, CIFAR-100, Tiny-ImageNet and mini-WebVision datasets demonstrate that our method can outperform the current state-of-the-art in a wide range of noise ratios. The project page is at https://rlawlgul.github.io/.

2023-07-03

Proceedings of the 40th International Conference on Machine Learning (published)

doi.org

openreview.net

Lazy vs hasty: linearization in deep networks impacts learning schedule based on example difficulty

Thomas George

Guillaume Lajoie

Aristide Baratin

Among attempts at giving a theoretical account of the success of deep neural networks, a recent line of work has identified a so-called 'lazy' training regime in which the network can be well approximated by its linearization around initialization. Here we investigate the comparative effect of the lazy (linear) and feature learning (non-linear) regimes on subgroups of examples based on their difficulty. Specifically, we show that easier examples are given more weight in feature learning mode, resulting in faster training compared to more difficult ones. In other words, the non-linear dynamics tends to sequentialize the learning of examples of increasing difficulty. We illustrate this phenomenon across different ways to quantify example difficulty, including c-score, label noise, and in the presence of easy-to-learn spurious correlations. Our results reveal a new understanding of how deep networks prioritize resources across example difficulty.

2022-01-01

Trans. Mach. Learn. Res. (published)

doi.org

openreview.net

Implicit Regularization in Deep Learning: A View from Function Space

Aristide Baratin

Thomas George

César Laurent

We approach the problem of implicit regularization in deep learning from a geometrical viewpoint. We highlight a possible regularization effect induced by a dynamical alignment of the neural tangent features introduced by Jacot et al., along a small number of task-relevant directions. By extrapolating a new analysis of Rademacher complexity bounds in linear models, we propose and study a new heuristic complexity measure for neural networks which captures this phenomenon, in terms of sequences of tangent kernel classes along the learning trajectories.

2020-08-03

ArXiv (preprint)

arxiv.org