Mila’s AI for Climate Studio aims to bridge the gap between technology and impact to unlock the potential of AI in tackling the climate crisis rapidly and on a massive scale.
The program recently published its first policy brief, titled "Policy Considerations at the Intersection of Quantum Technologies and Artificial Intelligence," authored by Padmapriya Mohan.
Hugo Larochelle appointed Scientific Director of Mila
An adjunct professor at the Université de Montréal and former head of Google's AI lab in Montréal, Hugo Larochelle is a pioneer in deep learning and one of Canada’s most respected researchers.
Publications
Multimodal foundation world models for generalist embodied agents
State-of-the-art semi-supervised learning (SSL) approaches rely on highly confident predictions to serve as pseudo-labels that guide the training on unlabeled samples. An inherent drawback of this strategy stems from the quality of the uncertainty estimates, as pseudo-labels are filtered only based on their degree of uncertainty, regardless of the correctness of their predictions. Thus, assessing and enhancing the uncertainty of network predictions is of paramount importance in the pseudo-labeling process. In this work, we empirically demonstrate that SSL methods based on pseudo-labels are significantly miscalibrated, and formally demonstrate that minimization of the min-entropy, a lower bound of the Shannon entropy, is a potential cause of miscalibration. To alleviate this issue, we integrate a simple penalty term that enforces the logit distances of the predictions on unlabeled samples to remain low, preventing the network predictions from becoming overconfident. Comprehensive experiments on a variety of SSL image classification benchmarks demonstrate that the proposed solution systematically improves the calibration performance of relevant SSL models while also enhancing their discriminative power, making it an appealing addition for tackling SSL tasks.
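As a rough illustration of the penalty described above, the sketch below adds a logit-distance term to a pseudo-label SSL objective. It assumes a PyTorch setup, uses the gap between the two largest logits as the "logit distance," and picks an arbitrary weight `lam`; the exact definition and weighting used in the paper may differ.

```python
import torch
import torch.nn.functional as F

def logit_distance_penalty(logits_u: torch.Tensor) -> torch.Tensor:
    # Gap between the largest and second-largest logit per unlabeled sample;
    # keeping it small discourages overconfident (miscalibrated) predictions.
    top2 = logits_u.topk(k=2, dim=-1).values
    return (top2[:, 0] - top2[:, 1]).mean()

def ssl_objective(logits_l, targets, logits_u, pseudo_labels, mask, lam=0.1):
    # Supervised loss on labeled data + masked pseudo-label loss + calibration penalty.
    sup = F.cross_entropy(logits_l, targets)
    unsup = (F.cross_entropy(logits_u, pseudo_labels, reduction="none") * mask).mean()
    return sup + unsup + lam * logit_distance_penalty(logits_u)
```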
Dioxin (DXN) is a persistent organic pollutant produced by municipal solid waste incineration (MSWI) processes. It is a crucial environmental indicator whose emission concentration should be minimized through optimization control, but it is difficult to monitor in real time. Aiming at online soft-sensing of DXN emission, a novel fuzzy tree broad learning system (FTBLS) is proposed, which includes offline training and online measurement. In the offline training part, weighted k-means is used to construct a typical sample pool that reduces the learning cost of both the offline and online phases. Moreover, the novel FTBLS, which comprises a feature-mapping layer, an enhancement layer, and an increment layer in which fuzzy decision trees take the place of conventional neurons, is applied to construct the offline model. In the online measurement part, recursive principal component analysis is used to monitor the time-varying characteristics of the MSWI process. To measure DXN emission, the offline FTBLS is reused for normal samples; for drift samples, fast incremental learning is used for online updates. DXN data from an actual MSWI process are employed to demonstrate the usefulness of FTBLS, with RMSE values of 0.0099 and 0.0216 on the training and testing data, respectively. These results show that FTBLS can effectively realize online DXN prediction.
2024-01-01
IEEE Transactions on Industrial Informatics (published)
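The online measurement workflow in the dioxin soft-sensing abstract above (recursive PCA flags drift; normal samples reuse the offline FTBLS, drift samples trigger fast incremental learning) can be sketched as a simple routing step. The object names below (`pca_monitor`, `offline_model`, `incremental_update`) are hypothetical placeholders, not interfaces from the paper.

```python
import numpy as np

def measure_dxn(x, offline_model, pca_monitor, incremental_update):
    # Recursive PCA monitors the time-varying MSWI process; a sample flagged
    # as drift triggers a fast incremental update of the model before prediction.
    x = np.atleast_2d(x)
    if pca_monitor.is_drift(x):
        incremental_update(x)
    return offline_model.predict(x)
```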
Throughout its history, Operational Research has evolved to include a variety of methods, models and algorithms that have been applied to a diverse and wide range of contexts. This encyclopedic article consists of two main sections: methods and applications. The first aims to summarise the up-to-date knowledge and provide an overview of the state-of-the-art methods and key developments in the various subdomains of the field. The second offers a wide-ranging list of areas where Operational Research has been applied. The article is meant to be read in a nonlinear fashion. It should be used as a point of reference or first port of call for a diverse pool of readers: academics, researchers, students, and practitioners. The entries within the methods and applications sections are presented in alphabetical order. The authors dedicate this paper to the 2023 Turkey/Syria earthquake victims. We sincerely hope that advances in OR will play a role towards minimising the pain and suffering caused by this and future catastrophes.
This paper explores a scenario in which a malicious actor employs a multi-armed attack strategy to manipulate data samples, offering them various avenues to introduce noise into the dataset. Our central objective is to protect the data by detecting any alterations to the input. We approach this defensive strategy with utmost caution, operating in an environment where the defender possesses significantly less information compared to the attacker. Specifically, the defender is unable to utilize any data samples for training a defense model or verifying the integrity of the channel. Instead, the defender relies exclusively on a set of pre-existing detectors readily available "off the shelf". To tackle this challenge, we derive an innovative information-theoretic defense approach that optimally aggregates the decisions made by these detectors, eliminating the need for any training data. We further explore a practical use-case scenario for empirical evaluation, where the attacker possesses a pre-trained classifier and launches well-known adversarial attacks against it. Our experiments highlight the effectiveness of our proposed solution, even in scenarios that deviate from the optimal setup.
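For intuition only, the sketch below combines the scores of several pre-existing detectors into a single decision without any training data. A plain average of log-odds is used here as a simple stand-in for the paper's information-theoretic aggregation rule; the score format and threshold are assumptions.

```python
import numpy as np

def aggregate_detectors(scores: np.ndarray) -> np.ndarray:
    # scores: shape (n_detectors, n_samples), each entry in (0, 1) and read as
    # a detector's probability that the corresponding input was tampered with.
    eps = 1e-6
    log_odds = np.log(scores + eps) - np.log(1.0 - scores + eps)
    # Average log-odds across detectors; a positive total means "flag as attacked".
    return log_odds.mean(axis=0) > 0.0
```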
The common modus operandi of fine-tuning large pre-trained Transformer models entails the adaptation of all their parameters (i.e., full fine-tuning). While achieving striking results on multiple tasks, this approach becomes unfeasible as the model size and the number of downstream tasks increase. In natural language processing and computer vision, parameter-efficient approaches like prompt-tuning and adapters have emerged as solid alternatives by fine-tuning only a small number of extra parameters, without sacrificing performance accuracy. For audio classification tasks, the Audio Spectrogram Transformer model shows impressive results. However, surprisingly, how to efficiently adapt it to several downstream tasks has not been tackled before. In this paper, we bridge this gap and present a detailed investigation of common parameter-efficient methods, revealing that adapters and LoRA consistently outperform the other methods across four benchmarks. Whereas adapters prove to be more efficient in few-shot learning settings, LoRA turns out to scale better as we increase the number of learnable parameters. We finally carry out ablation studies to find the best configuration for adapters and LoRA.
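As a minimal sketch of one of the methods compared above (LoRA), the snippet below wraps a frozen linear layer of a pre-trained Transformer, such as the Audio Spectrogram Transformer, with a trainable low-rank update. The rank and scaling values are illustrative and not the configurations studied in the paper.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Low-rank adaptation of a frozen linear layer: only the small matrices
    A and B are trained, so the learnable-parameter count is set by the rank r."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # freeze pre-trained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen projection plus the scaled low-rank update B @ A applied to x.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```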