Elvis Dohmatob

Florian Bordes

Pietro Astolfi

Melissa Hall

Jakob Verbeek

Michal Drozdzal

Adriana Romero Soriano

2025-05-01

ICML.cc/2025/Conference (présentation orale)

auto-fpt: Automating Free Probability Theory Calculations for Machine Learning Theory

Arjun Subramonian

2025-04-14

ArXiv (prépublication)

auto-fpt: Automating Free Probability Theory Calculations for Machine Learning Theory

Arjun Subramonian

2025-04-01

arXiv (publié)

Improving the Scaling Laws of Synthetic Data with Deliberate Practice

Reyhane Askari Hemmat

Mohammad Pezeshki

Florian Bordes

Pietro Astolfi

Melissa Hall

Jakob Verbeek

Michal Drozdzal

Adriana Romero Soriano

Inspired by the principle of deliberate practice in human learning, we propose Deliberate Practice for Synthetic Data Generation (DP), a nov… (voir plus)el framework that improves sample efficiency through dynamic synthetic data generation. Prior work has shown that scaling synthetic data is inherently challenging, as naively adding new data leads to diminishing returns. To address this, pruning has been identified as a key mechanism for improving scaling, enabling models to focus on the most informative synthetic samples. Rather than generating a large dataset and pruning it afterward, DP efficiently approximates the direct generation of informative samples. We theoretically show how training on challenging, informative examples improves scaling laws and empirically validate that DP achieves better scaling performance with significantly fewer training samples and iterations. On ImageNet-100, DP generates 3.4x fewer samples and requires six times fewer iterations, while on ImageNet-1k, it generates 8x fewer samples with a 30 percent reduction in iterations, all while achieving superior performance compared to prior work.

2025-02-21

ArXiv (prépublication)

The Pitfalls of Memorization: When Memorization Hurts Generalization

Reza Bayat

Mohammad Pezeshki

David Lopez-Paz

Pascal Vincent

Neural networks often learn simple explanations that fit the majority of the data while memorizing exceptions that deviate from these explan… (voir plus)ations.This behavior leads to poor generalization when the learned explanations rely on spurious correlations. In this work, we formalize the interplay between memorization and generalization, showing that spurious correlations would particularly lead to poor generalization when are combined with memorization. Memorization can reduce training loss to zero, leaving no incentive to learn robust, generalizable patterns. To address this, we propose memorization-aware training (MAT), which uses held-out predictions as a signal of memorization to shift a model's logits. MAT encourages learning robust patterns invariant across distributions, improving generalization under distribution shifts.

2025-01-22

ICLR.cc/2025/Conference (poster)

Beyond Model Collapse: Scaling Up with Synthesized Data Requires Verification

Yunzhen Feng

Pu Yang

Francois Charton

Julia Kempe

Large Language Models (LLM) are increasingly trained on data generated by other LLM, either because generated text and images become part of… (voir plus) the pre-training corpus, or because synthetized data is used as a replacement for expensive human-annotation. This raises concerns about \emph{model collapse}, a drop in model performance when their training sets include generated data. Considering that it is easier for both humans and machines to tell between good and bad examples than to generate high-quality samples, we investigate the use of verification on synthesized data to prevent model collapse. We provide a theoretical characterization using Gaussian mixtures, linear classifiers, and linear verifiers to derive conditions with measurable proxies to assess whether the verifier can effectively select synthesized data that leads to optimal performance. We experiment with two practical tasks -- computing matrix eigenvalues with transformers and news summarization with LLMs -- which both exhibit model collapse when trained on generated data, and show that verifiers, even imperfect ones, can indeed be harnessed to prevent model collapse and that our proposed proxy measure strongly correlates with performance.

2025-01-01

ICLR (publié)

An Effective Theory of Bias Amplification

Arjun Subramonian

Samuel J. Bell

Levent Sagun

Machine learning models may capture and amplify biases present in data, leading to disparate test performance across social groups. To bette… (voir plus)r understand, evaluate, and mitigate these possible biases, a deeper theoretical understanding of how model design choices and data distribution properties could contribute to bias is needed. In this work, we contribute a precise analytical theory in the context of ridge regression, both with and without random projections, where the former models neural networks in a simplified regime. Our theory offers a unified and rigorous explanation of machine learning bias, providing insights into phenomena such as bias amplification and minority-group bias in various feature and parameter regimes. For example, we demonstrate that there may be an optimal regularization penalty or training time to avoid bias amplification, and there can be fundamental differences in test error between groups that do not vanish with increased parameterization. Importantly, our theoretical predictions align with several empirical observations reported in the literature. We extensively empirically validate our theory on diverse synthetic and semi-synthetic datasets.

2025-01-01

ICLR (publié)

Strong Model Collapse.

Yunzhen Feng

Arjun Subramonian

Julia Kempe

2025-01-01

ICLR (publié)

dblp.uni-trier.de

The Pitfalls of Memorization: When Memorization Hinders Generalization

Reza Bayat

Mohammad Pezeshki

David Lopez-Paz

Pascal Vincent

Neural networks often learn simple explanations that fit the majority of the data while memorizing exceptions that deviate from these explan… (voir plus)ations. This leads to poor generalization when the learned explanations are spurious. In this work, we formalize

2024-10-10

NeurIPS.cc/2024/Workshop/SciForDL (poster)

The Pitfalls of Memorization: When Memorization Hinders Generalization

Reza Bayat

Mohammad Pezeshki

David Lopez-Paz

Pascal Vincent

Neural networks often learn simple explanations that fit the majority of the data while memorizing exceptions that deviate from these explan… (voir plus)ations. This leads to poor generalization when the learned explanations are spurious. In this work, we formalize

2024-10-10

NeurIPS.cc/2024/Workshop/SciForDL (poster)

An Effective Theory of Bias Amplification

Arjun Subramonian

Samuel J. Bell

Levent Sagun

Machine learning models may capture and amplify biases present in data, leading to disparate test performance across social groups. To bette… (voir plus)r understand, evaluate, and mitigate these possible biases, a deeper theoretical understanding of how model design choices and data distribution properties could contribute to bias is needed. In this work, we contribute a precise analytical theory in the context of ridge regression, both with and without random projections, where the former models neural networks in a simplified regime. Our theory offers a unified and rigorous explanation of machine learning bias, providing insights into phenomena such as bias amplification and minority-group bias in various feature and parameter regimes. For example, we demonstrate that there may be an optimal regularization penalty or training time to avoid bias amplification, and there can be fundamental differences in test error between groups that do not vanish with increased parameterization. Importantly, our theoretical predictions align with several empirical observations reported in the literature. We extensively empirically validate our theory on diverse synthetic and semi-synthetic datasets.

2024-10-07

ArXiv (prépublication)

Strong Model Collapse

Yunzhen Feng

Arjun Subramonian

Julia Kempe

2024-10-07

ArXiv (prépublication)