Benjamin Therien

PhD - Université de Montréal
Supervisor
Co-supervisor
Research Topics
Deep Learning
Large Language Models (LLM)
Meta-Learning
Optimization

Publications

μLO: Compute-Efficient Meta-Generalization of Learned Optimizers
Simple and Scalable Strategies to Continually Pre-train Large Language Models
Kshitij Gupta
Mats Leon Richter
Quentin Gregory Anthony
Timothée Lesort
Large language models (LLMs) are routinely pre-trained on billions of tokens, only to start the process over again once new data becomes available. A much more efficient solution is to continually pre-train these models, saving significant compute compared to re-training. However, the distribution shift induced by new data typically results in degraded performance on previous data or poor adaptation to the new data. In this work, we show that a simple and scalable combination of learning rate (LR) re-warming, LR re-decaying, and replay of previous data is sufficient to match the performance of fully re-training from scratch on all available data, as measured by the final loss and the average score on several language model (LM) evaluation benchmarks. Specifically, we show this for a weak but realistic distribution shift between two commonly used LLM pre-training datasets (English…
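
The replay component mentioned in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `mix_with_replay`, the `replay_fraction` parameter, and the value 0.25 used in the example are all hypothetical, chosen only to show how a fixed share of each batch might be drawn from previous data while the rest comes from new data.

```python
import random


def mix_with_replay(new_data, old_data, replay_fraction, batch_size, seed=0):
    """Yield batches drawn mostly from new_data, with a fixed share replayed from old_data.

    Hypothetical helper for illustration; replay_fraction is an assumed knob,
    not a value taken from the paper.
    """
    rng = random.Random(seed)
    n_replay = round(batch_size * replay_fraction)  # examples replayed from previous data
    n_new = batch_size - n_replay                   # examples taken from the new data
    while True:
        batch = rng.sample(old_data, n_replay) + rng.sample(new_data, n_new)
        rng.shuffle(batch)  # avoid a fixed old/new ordering inside the batch
        yield batch


# Toy usage: strings stand in for tokenized training sequences.
old_corpus = [f"old-{i}" for i in range(100)]
new_corpus = [f"new-{i}" for i in range(100)]
first_batch = next(mix_with_replay(new_corpus, old_corpus, replay_fraction=0.25, batch_size=8))
```

In practice the mixing would typically happen at the level of a data loader over tokenized shards; the sketch only shows the proportion logic.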
Can We Learn Communication-Efficient Optimizers?
Learning Optimizers for Local SGD
Comparison of Radiologists and Deep Learning for US Grading of Hepatic Steatosis.
Sara-Ivana Calce
Pamela Boustros
Cassandra Larocque-Rigney
Laurent Patry-Beaudoin
Yi Hui Luo
Emre Aslan
John Marinos
Talal M. Alamri
Kim-Nhien Vu
Jessica Murphy-Lavallée
Jean-Sébastien Billiard
Emmanuel Montagnon
Hongliang Li
Samuel Kadoury
Bich Nguyen
Shanel Gauthier
Michael Chassé
Guy Cloutier
An Tang
Background: Screening for nonalcoholic fatty liver disease (NAFLD) is suboptimal due to the subjective interpretation of US images.
Purpose: To evaluate the agreement and diagnostic performance of radiologists and a deep learning model in grading hepatic steatosis in NAFLD at US, with biopsy as the reference standard.
Materials and Methods: This retrospective study included patients with NAFLD and control patients without hepatic steatosis who underwent abdominal US and contemporaneous liver biopsy from September 2010 to October 2019. Six readers visually graded steatosis on US images twice, 2 weeks apart. Reader agreement was assessed with use of κ statistics. Three deep learning techniques applied to B-mode US images were used to classify dichotomized steatosis grades. Classification performance of human radiologists and the deep learning model for dichotomized steatosis grades (S0, S1, S2, and S3) was assessed with area under the receiver operating characteristic curve (AUC) on a separate test set.
Results: The study included 199 patients (mean age, 53 years ± 13 [SD]; 101 men). On the test set (n = 52), radiologists had fair interreader agreement (0.34 [95% CI: 0.31, 0.37]) for classifying steatosis grades S0 versus S1 or higher, while AUCs were between 0.49 and 0.84 for radiologists and 0.85 (95% CI: 0.83, 0.87) for the deep learning model. For S0 or S1 versus S2 or S3, radiologists had fair interreader agreement (0.30 [95% CI: 0.27, 0.33]), while AUCs were between 0.57 and 0.76 for radiologists and 0.73 (95% CI: 0.71, 0.75) for the deep learning model. For S2 or lower versus S3, radiologists had fair interreader agreement (0.37 [95% CI: 0.33, 0.40]), while AUCs were between 0.52 and 0.81 for radiologists and 0.67 (95% CI: 0.64, 0.69) for the deep learning model.
Conclusion: Deep learning approaches applied to B-mode US images provided comparable performance with human readers for detection and grading of hepatic steatosis.
Published under a CC BY 4.0 license. Supplemental material is available for this article. See also the editorial by Tuthill in this issue.
Continual Pre-Training of Large Language Models: How to (re)warm your model?
Kshitij Gupta
Mats Leon Richter
Quentin Gregory Anthony
Timothée Lesort
Large language models (LLMs) are routinely pre-trained on billions of tokens, only to restart the process over again once new data becomes available. A much cheaper and more efficient solution would be to enable the continual pre-training of these models, i.e. updating pre-trained models with new data instead of re-training them from scratch. However, the distribution shift induced by novel data typically results in degraded performance on past data. Taking a step towards efficient continual pre-training, in this work, we examine the effect of different warm-up strategies. Our hypothesis is that the learning rate must be re-increased to improve compute efficiency when training on a new dataset. We study the warmup phase of models pre-trained on the Pile (upstream data, 300B tokens) as we continue to pre-train on SlimPajama (downstream data, 297B tokens), following a linear warmup and cosine decay schedule. We conduct all experiments on the Pythia 410M language model architecture and evaluate performance through validation perplexity. We experiment with different pre-training checkpoints, various maximum learning rates, and various warmup lengths. Our results show that while rewarming models first increases the loss on upstream and downstream data, in the longer run it improves the downstream performance, outperforming models trained from scratch
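
The "linear warmup and cosine decay schedule" named in the abstract above can be sketched as a small function. This is an illustrative sketch, not the configuration used in the paper: the function name `lr_at_step`, the choice to warm up from zero, and the example values for `max_lr`, `min_lr`, `warmup_steps`, and `total_steps` are all assumptions.

```python
import math


def lr_at_step(step, max_lr, min_lr, warmup_steps, total_steps):
    """Learning rate under linear warmup followed by cosine decay.

    Assumes the warmup starts at 0 and the decay ends at min_lr when
    step == total_steps; both choices are illustrative.
    """
    if step < warmup_steps:
        # Linear ramp from 0 up to max_lr over the warmup phase.
        return max_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    # Cosine curve from max_lr (progress = 0) down to min_lr (progress = 1).
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical values, chosen only to exercise the function.
schedule = [lr_at_step(s, max_lr=3e-4, min_lr=3e-5, warmup_steps=100, total_steps=1000)
            for s in range(1001)]
```

Re-warming in the sense of the abstract would correspond to running such a schedule again from step 0 when training resumes on the new dataset, rather than continuing at the small learning rate reached at the end of the previous run.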