Elvis Dohmatob

Yunzhen Feng

Arjun Subramonian

Julia Kempe

2024-10-07

ArXiv (preprint)

Strong Model Collapse

Yunzhen Feng

Arjun Subramonian

Julia Kempe

Within the scaling laws paradigm, which underpins the training of large neural networks like ChatGPT and Llama, we consider a supervised reg… (see more)ression setting and establish the existance of a strong form of the model collapse phenomenon, a critical performance degradation due to synthetic data in the training corpus. Our results show that even the smallest fraction of synthetic data (e.g., as little as 1\% of the total training dataset) can still lead to model collapse: larger and larger training sets do not enhance performance. We further investigate whether increasing model size, an approach aligned with current trends in training large language models, exacerbates or mitigates model collapse. In a simplified regime where neural networks are approximated via random projections of tunable size, we both theoretically and empirically show that larger models can amplify model collapse. Interestingly, our theory also indicates that, beyond the interpolation threshold (which can be extremely high for very large datasets), larger models may mitigate the collapse, although they do not entirely prevent it. Our theoretical findings are empirically verified through experiments on language models and feed-forward neural networks for images.

2024-10-07

ArXiv (preprint)

Strong Model Collapse

Yunzhen Feng

Arjun Subramonian

Julia Kempe

2024-10-07

ArXiv (preprint)

Strong Model Collapse

Yunzhen Feng

Arjun Subramonian

Julia Kempe

Within the scaling laws paradigm, which underpins the training of large neural networks like ChatGPT and Llama, we consider a supervised reg… (see more)ression setting and establish the existance of a strong form of the model collapse phenomenon, a critical performance degradation due to synthetic data in the training corpus. Our results show that even the smallest fraction of synthetic data (e.g., as little as 1\% of the total training dataset) can still lead to model collapse: larger and larger training sets do not enhance performance. We further investigate whether increasing model size, an approach aligned with current trends in training large language models, exacerbates or mitigates model collapse. In a simplified regime where neural networks are approximated via random projections of tunable size, we both theoretically and empirically show that larger models can amplify model collapse. Interestingly, our theory also indicates that, beyond the interpolation threshold (which can be extremely high for very large datasets), larger models may mitigate the collapse, although they do not entirely prevent it. Our theoretical findings are empirically verified through experiments on language models and feed-forward neural networks for images.

2024-10-07

ArXiv (preprint)

Strong Model Collapse

Yunzhen Feng

Arjun Subramonian

Julia Kempe

2024-10-07

ArXiv (preprint)

Strong Model Collapse

Yunzhen Feng

Arjun Subramonian

Julia Kempe

Within the scaling laws paradigm, which underpins the training of large neural networks like ChatGPT and Llama, we consider a supervised reg… (see more)ression setting and establish the existance of a strong form of the model collapse phenomenon, a critical performance degradation due to synthetic data in the training corpus. Our results show that even the smallest fraction of synthetic data (e.g., as little as 1\% of the total training dataset) can still lead to model collapse: larger and larger training sets do not enhance performance. We further investigate whether increasing model size, an approach aligned with current trends in training large language models, exacerbates or mitigates model collapse. In a simplified regime where neural networks are approximated via random projections of tunable size, we both theoretically and empirically show that larger models can amplify model collapse. Interestingly, our theory also indicates that, beyond the interpolation threshold (which can be extremely high for very large datasets), larger models may mitigate the collapse, although they do not entirely prevent it. Our theoretical findings are empirically verified through experiments on language models and feed-forward neural networks for images.

2024-10-07

ArXiv (preprint)

Strong Model Collapse

Yunzhen Feng

Arjun Subramonian

Julia Kempe

2024-10-07

ArXiv (preprint)

Strong Model Collapse

Yunzhen Feng

Arjun Subramonian

Julia Kempe

Within the scaling laws paradigm, which underpins the training of large neural networks like ChatGPT and Llama, we consider a supervised reg… (see more)ression setting and establish the existance of a strong form of the model collapse phenomenon, a critical performance degradation due to synthetic data in the training corpus. Our results show that even the smallest fraction of synthetic data (e.g., as little as 1\% of the total training dataset) can still lead to model collapse: larger and larger training sets do not enhance performance. We further investigate whether increasing model size, an approach aligned with current trends in training large language models, exacerbates or mitigates model collapse. In a simplified regime where neural networks are approximated via random projections of tunable size, we both theoretically and empirically show that larger models can amplify model collapse. Interestingly, our theory also indicates that, beyond the interpolation threshold (which can be extremely high for very large datasets), larger models may mitigate the collapse, although they do not entirely prevent it. Our theoretical findings are empirically verified through experiments on language models and feed-forward neural networks for images.

2024-10-07

ArXiv (preprint)

Strong Model Collapse

Yunzhen Feng

Arjun Subramonian

Julia Kempe

2024-10-07

ArXiv (preprint)

Consistent Adversarially Robust Linear Classification: Non-Parametric Setting

For binary classification in …

2024-07-08

Proceedings of the 41st International Conference on Machine Learning (published)

proceedings.mlr.press

A Tale of Tails: Model Collapse as a Change of Scaling Laws

Yunzhen Feng

Pu Yang

Francois Charton

Julia Kempe

As AI model size grows, neural scaling laws have become a crucial tool to predict the improvements of large models when increasing … (see more)capacity and the size of original (human or natural) training data. Yet, the widespread use of popular models means that the ecosystem of online data and text will co-evolve to progressively contain increased amounts of synthesized data. In this paper we ask: How will the scaling laws change in the inevitable regime where synthetic data makes its way into the training corpus? Will future models, still improve, or be doomed to degenerate up to total (model) collapse? We develop a theoretical framework of model collapse through the lens of scaling laws. We discover a wide range of decay phenomena, analyzing loss of scaling, shifted scaling with number of generations, the ”un-learning" of skills, and grokking when mixing human and synthesized data. Our theory is validated by large-scale experiments with a transformer on an arithmetic task and text generation using the large language model Llama2.

2024-07-08

Proceedings of the 41st International Conference on Machine Learning (published)

Individual Brain Charting dataset extension, third release for movie watching and retinotopy data

Ana Lúısa Pinho

Hugo Richard

Ana Fernanda Ponce

Michael Eickenberg

Alexis Amadon

Isabelle Denghien

Juan Jesús Torre

Swetha Shankar

Himanshu Aggarwal

Alexis Thual

Thomas Chapalain

Chantal Ginisty

Séverine Becuwe-Desmidt

Séverine Roger

Yann Lecomte

Valérie Berland

Laurence Laurier

Véronique Joly-Testault

Gaëlle Médiouni-Cloarec … (see 6 more)

Christine Doublé

Bernadette Martins

Gael Varoquaux

Stanislas Dehaene

Lucie Hertz-Pannier

Bertrand Thirion

2024-06-05

Scientific Data (published)