Mohammad Pezeshki

Beyond Multi-Token Prediction: Pretraining LLMs with Future Summaries

Sachin Goyal

Badr Youbi Idrissi

David Lopez-Paz

Next-token prediction (NTP) has driven the success of large language models (LLMs), but it struggles with long-horizon reasoning, planning, and creative writing, with these limitations largely attributed to teacher-forced training. Multi-token prediction (MTP) partially mitigates these issues by predicting several future tokens at once, but it mostly captures short-range dependencies and offers limited improvement. We propose future summary prediction (FSP), which trains an auxiliary head to predict a compact representation of the long-term future, preserving information relevant for long-form generations. We explore two variants of FSP: handcrafted summaries, for example, a bag-of-words summary of the future of the sequence, and learned summaries, which use embeddings produced by a reverse language model trained from right to left. Large-scale pretraining experiments (3B and 8B-parameter models) demonstrate that FSP provides improvements over both NTP and MTP across math, reasoning, and coding benchmarks.

2025-12-31

International Conference on Learning Representations (Accept (Poster))

doi.org

openreview.net

Why Less is More (Sometimes): A Theory of Data Curation

Elvis Dopgima Dohmatob

Mohammad Pezeshki

Reyhane Askari Hemmat

2025-11-04

ArXiv (preprint)

doi.org

arxiv.org

Iterative Amortized Inference: Unifying In-Context Learning and Learned Optimizers

2025-10-12

ArXiv (preprint)

doi.org

arxiv.org

Compositional Risk Minimization

Charles Arnal

Compositional generalization is a crucial step towards developing data-efficient intelligent machines that generalize in human-like ways. In this work, we tackle a challenging form of distribution shift, termed compositional shift, where some attribute combinations are completely absent at training but present in the test distribution. This shift tests the model's ability to generalize compositionally to novel attribute combinations in discriminative tasks. We model the data with flexible additive energy distributions, where each energy term represents an attribute, and derive a simple alternative to empirical risk minimization termed compositional risk minimization (CRM). We first train an additive energy classifier to predict the multiple attributes and then adjust this classifier to tackle compositional shifts. We provide an extensive theoretical analysis of CRM, where we show that our proposal extrapolates to special affine hulls of seen attribute combinations. Empirical evaluations on benchmark datasets confirm the improved robustness of CRM compared to other methods from the literature designed to tackle various forms of subpopulation shifts.

2025-10-05

Proceedings of the 42nd International Conference on Machine Learning (published)

doi.org

proceedings.mlr.press

Steering Large Language Model Activations in Sparse Spaces

Reza Bayat

Ali Rahimi-Kalahroudi

Mohammad Pezeshki

A. Chandar

P Vincent

2025-07-07

Conference on Language Modeling (accepted)

doi.org

openreview.net

Improving the Scaling Laws of Synthetic Data with Deliberate Practice

Reyhane Askari-Hemmat

Mohammad Pezeshki

Elvis Dohmatob

Florian Bordes

Pietro Astolfi

Melissa Hall

Jakob Verbeek

Michal Drozdzal

Adriana Romero-Soriano

Inspired by the principle of deliberate practice in human learning, we propose Deliberate Practice for Synthetic Data Generation (DP), a novel framework that improves sample efficiency through dynamic synthetic data generation. Prior work has shown that scaling synthetic data is inherently challenging, as naively adding new data leads to diminishing returns. To address this, pruning has been identified as a key mechanism for improving scaling, enabling models to focus on the most informative synthetic samples. Rather than generating a large dataset and pruning it afterward, DP efficiently approximates the direct generation of informative samples. We theoretically show how training on challenging, informative examples improves scaling laws and empirically validate that DP achieves better scaling performance with significantly fewer training samples and iterations. On ImageNet-100, DP generates 3.4x fewer samples and requires six times fewer iterations, while on ImageNet-1k, it generates 8x fewer samples with a 30 percent reduction in iterations, all while achieving superior performance compared to prior work.

2025-04-30

ICML.cc/2025/Conference (oral presentation)

doi.org

proceedings.mlr.press

The Pitfalls of Memorization: When Memorization Hurts Generalization

Reza Bayat

Mohammad Pezeshki

Elvis Dohmatob

David Lopez-Paz

Pascal Vincent

Neural networks often learn simple explanations that fit the majority of the data while memorizing exceptions that deviate from these explanations. This behavior leads to poor generalization when the learned explanations rely on spurious correlations. In this work, we formalize the interplay between memorization and generalization, showing that spurious correlations lead to particularly poor generalization when they are combined with memorization. Memorization can reduce training loss to zero, leaving no incentive to learn robust, generalizable patterns. To address this, we propose memorization-aware training (MAT), which uses held-out predictions as a signal of memorization to shift a model's logits. MAT encourages learning robust patterns invariant across distributions, improving generalization under distribution shifts.

2025-01-21

ICLR.cc/2025/Conference (poster)

doi.org

openreview.net

Deliberate Practice with Synthetic Data

Reyhane Askari Hemmat

Mohammad Pezeshki

Pietro Astolfi

Melissa Hall

Florian Bordes

Jakob Verbeek

Michal Drozdzal

Adriana Romero

2024-10-09

NeurIPS.cc/2024/Workshop/AFM (poster)

openreview.net

The Pitfalls of Memorization: When Memorization Hinders Generalization

Reza Bayat

Mohammad Pezeshki

Elvis Dopgima Dohmatob

David Lopez-Paz

P Vincent

Neural networks often learn simple explanations that fit the majority of the data while memorizing exceptions that deviate from these explanations. This leads to poor generalization when the learned explanations are spurious. In this work, we formalize

2024-10-09

NeurIPS.cc/2024/Workshop/SciForDL (poster)

openreview.net

Feedback-guided Data Synthesis for Imbalanced Classification

Reyhane Askari Hemmat

Mohammad Pezeshki

Florian Bordes

Michal Drozdzal

Adriana Romero

The current status quo in machine learning is to use static datasets of real images for training, which often come from long-tailed distributions. With the recent advances in generative models, researchers have started augmenting these static datasets with synthetic data, reporting moderate performance improvements on classification tasks. We hypothesize that these performance gains are limited by the lack of feedback from the classifier to the generative model, which would promote the usefulness of the generated samples to improve the classifier's performance. In this work, we introduce a framework for augmenting static datasets with useful synthetic samples, which leverages one-shot feedback from the classifier to drive the sampling of the generative model. In order for the framework to be effective, we find that the samples must be close to the support of the real data of the task at hand, and be sufficiently diverse. We validate three feedback criteria on long-tailed datasets (ImageNet-LT, Places-LT) as well as a group-imbalanced dataset (NICO++). On ImageNet-LT, we achieve state-of-the-art results, with over

2024-09-10

TMLR (accepted)

doi.org

openreview.net

Discovering Environments with XRM

Mohammad Pezeshki

Diane Bouchacourt

Mark Ibrahim

Nicolas Ballas

P Vincent

David Lopez-Paz

Environment annotations are essential for the success of many out-of-distribution (OOD) generalization methods. Unfortunately, these are costly to obtain and often limited by human annotators’ biases. To achieve robust generalization, it is essential to develop algorithms for automatic environment discovery within datasets. Current proposals, which divide examples based on their training error, suffer from one fundamental problem. These methods introduce hyper-parameters and early-stopping criteria, which require a validation set with human-annotated environments, the very information subject to discovery. In this paper, we propose Cross-Risk Minimization (XRM) to address this issue. XRM trains twin networks, each learning from one random half of the training data, while imitating confident held-out mistakes made by its sibling. XRM provides a recipe for hyper-parameter tuning, does not require early-stopping, and can discover environments for all training and validation data. Algorithms built on top of XRM environments achieve oracle worst-group-accuracy, addressing a long-standing challenge in OOD generalization.

2024-07-07

Proceedings of the 41st International Conference on Machine Learning (published)

proceedings.mlr.press