Benjamin Therien

PhD - UdeM

Principal supervisor

Irina Rish

Co-supervisor

Eugene Belilovsky

Research topics

Meta-learning

Deep learning

Large language models (LLMs)

Optimization

Website

Google Scholar

GitHub

Publications

Continual Pre-Training of Large Language Models: How to (re)warm your model?

Quentin Gregory Anthony

Eugene Belilovsky

Irina Rish

Timothee LESORT

Large language models (LLMs) are routinely pre-trained on billions of tokens, only to restart the process over again once new data becomes available. A much cheaper and more efficient solution would be to enable the continual pre-training of these models, i.e. updating pre-trained models with new data instead of re-training them from scratch. However, the distribution shift induced by novel data typically results in degraded performance on past data. Taking a step towards efficient continual pre-training, in this work, we examine the effect of different warm-up strategies. Our hypothesis is that the learning rate must be re-increased to improve compute efficiency when training on a new dataset. We study the warmup phase of models pre-trained on the Pile (upstream data, 300B tokens) as we continue to pre-train on SlimPajama (downstream data, 297B tokens), following a linear warmup and cosine decay schedule. We conduct all experiments on the Pythia 410M language model architecture and evaluate performance through validation perplexity. We experiment with different pre-training checkpoints, various maximum learning rates, and various warmup lengths. Our results show that while rewarming models first increases the loss on upstream and downstream data, in the longer run it improves the downstream performance, outperforming models trained from scratch.

2023-06-19

ICML.cc/2023/Workshop/ES-FoMO (poster)

doi.org

openreview.net

Parametric Scattering Networks

Shanel Gauthier

Benjamin Therien

Laurent Alsène-Racicot

Michael Eickenberg

The wavelet scattering transform creates geometric invariants and deformation stability. In multiple signal domains, it has been shown to yield more discriminative representations compared to other non-learned representations and to outperform learned representations in certain tasks, particularly on limited labeled data and highly structured signals. The wavelet filters used in the scattering transform are typically selected to create a tight frame via a parameterized mother wavelet. In this work, we investigate whether this standard wavelet filterbank construction is optimal. Focusing on Morlet wavelets, we propose to learn the scales, orientations, and aspect ratios of the filters to produce problem-specific parameterizations of the scattering transform. We show that our learned versions of the scattering transform yield significant performance gains in small-sample classification settings over the standard scattering transform. Moreover, our empirical results suggest that traditional filterbank constructions may not always be necessary for scattering transforms to extract effective representations.

2022-05-31

2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (published)

doi.org

arxiv.org
