Adriana Romero Soriano

Research Collaborator - UdeM

Sumana Basu

PhD - McGill

Principal supervisor:

PhD - McGill

Principal supervisor:

PhD - McGill

Principal supervisor:

Publications

Improving the Scaling Laws of Synthetic Data with Deliberate Practice

Mohammad Pezeshki

Elvis Dohmatob

Florian Bordes

Pietro Astolfi

Melissa Hall

Jakob Verbeek

Inspired by the principle of deliberate practice in human learning, we propose Deliberate Practice for Synthetic Data Generation (DP), a novel framework that improves sample efficiency through dynamic synthetic data generation. Prior work has shown that scaling synthetic data is inherently challenging, as naively adding new data leads to diminishing returns. To address this, pruning has been identified as a key mechanism for improving scaling, enabling models to focus on the most informative synthetic samples. Rather than generating a large dataset and pruning it afterward, DP efficiently approximates the direct generation of informative samples. We theoretically show how training on challenging, informative examples improves scaling laws and empirically validate that DP achieves better scaling performance with significantly fewer training samples and iterations. On ImageNet-100, DP generates 3.4x fewer samples and requires six times fewer iterations, while on ImageNet-1k, it generates 8x fewer samples with a 30% reduction in iterations, all while achieving superior performance compared to prior work.

2025-10-06

Proceedings of the 42nd International Conference on Machine Learning (published)

proceedings.mlr.press

Reward the Reward Designer: Making Reinforcement Learning Useful for Clinical Decision Making

Sumana Basu

Doina Precup

2025-09-22

NeurIPS.cc/2025/Workshop/WiML (published)

openreview.net

Increasing the Utility of Synthetic Images through Chamfer Guidance

Nicola Dall'Asen

Xiaofeng Zhang

Melissa Hall

Jakob Verbeek

Conditional image generative models hold considerable promise to produce infinite amounts of synthetic training data. Yet, recent progress in generation quality has come at the expense of generation diversity, limiting the utility of these models as a source of synthetic training data. Although guidance-based approaches have been introduced to improve the utility of generated data by focusing on quality or diversity, the (implicit or explicit) utility functions oftentimes disregard the potential distribution shift between synthetic and real data. In this work, we introduce Chamfer Guidance: a training-free guidance approach which leverages a handful of real exemplar images to characterize the quality and diversity of synthetic data. We show that by leveraging the proposed Chamfer Guidance, we can boost the diversity of the generations w.r.t. a dataset of real images while maintaining or improving the generation quality on ImageNet-1k and standard geo-diversity benchmarks. Our approach achieves state-of-the-art few-shot performance with as few as 2 exemplar real images, obtaining 96.4% in terms of precision and 86.4% in terms of distributional coverage, which increase to 97.5% and 92.7%, respectively, when using 32 real images. We showcase the benefits of the Chamfer Guidance generation by training downstream image classifiers on synthetic data, achieving an accuracy boost of up to 15% in-distribution over the baselines, and up to 16% out-of-distribution. Furthermore, our approach does not require using the unconditional model, and thus obtains a 31% reduction in FLOPs w.r.t. classifier-free-guidance-based approaches at sampling time.

2025-08-14

arXiv (preprint)

DIMCIM: A Quantitative Evaluation Framework for Default-mode Diversity and Generalization in Text-to-Image Generative Models

Revant Teotia

Candace Ross

Karen Ullrich

Sumit Chopra

Melissa Hall

Matthew J. Muckley

Recent advances in text-to-image (T2I) models have achieved impressive quality and consistency. However, this has come at the cost of representation diversity. While automatic evaluation methods exist for benchmarking model diversity, they either require reference image datasets or lack specificity about the kind of diversity measured, limiting their adaptability and interpretability. To address this gap, we introduce the Does-it/Can-it framework, DIM-CIM, a reference-free measurement of default-mode diversity ("Does" the model generate images with expected attributes?) and generalization capacity ("Can" the model generate diverse attributes for a particular concept?). We construct the COCO-DIMCIM benchmark, which is seeded with COCO concepts and captions and augmented by a large language model. With COCO-DIMCIM, we find that widely-used models improve in generalization at the cost of default-mode diversity when scaling from 1.5B to 8.1B parameters. DIMCIM also identifies fine-grained failure cases, such as attributes that are generated with generic prompts but are rarely generated when explicitly requested. Finally, we use DIMCIM to evaluate the training data of a T2I model and observe a correlation of 0.85 between diversity in training images and default-mode diversity. Our work provides a flexible and interpretable framework for assessing T2I model diversity and generalization, enabling a more comprehensive understanding of model performance.

2025-06-05

arXiv (preprint)

DIMCIM: A Quantitative Evaluation Framework for Default-mode Diversity and Generalization in Text-to-Image Generative Models

Revant Teotia

Candace Ross

Karen Ullrich

Sumit Chopra

Melissa Hall

Matthew J. Muckley

Recent advances in text-to-image (T2I) models have achieved impressive quality and consistency. However, this has come at the cost of representation diversity. While automatic evaluation methods exist for benchmarking model diversity, they either require reference image datasets or lack specificity about the kind of diversity measured, limiting their adaptability and interpretability. To address this gap, we introduce the Does-it/Can-it framework, DIM-CIM, a reference-free measurement of default-mode diversity ("Does" the model generate images with expected attributes?) and generalization capacity ("Can" the model generate diverse attributes for a particular concept?). We construct the COCO-DIMCIM benchmark, which is seeded with COCO concepts and captions and augmented by a large language model. With COCO-DIMCIM, we find that widely-used models improve in generalization at the cost of default-mode diversity when scaling from 1.5B to 8.1B parameters. DIMCIM also identifies fine-grained failure cases, such as attributes that are generated with generic prompts but are rarely generated when explicitly requested. Finally, we use DIMCIM to evaluate the training data of a T2I model and observe a correlation of 0.85 between diversity in training images and default-mode diversity. Our work provides a flexible and interpretable framework for assessing T2I model diversity and generalization, enabling a more comprehensive understanding of model performance.

2025-06-01

arXiv (published)

Improving the Scaling Laws of Synthetic Data with Deliberate Practice

Mohammad Pezeshki

Elvis Dohmatob

Florian Bordes

Pietro Astolfi

Melissa Hall

Jakob Verbeek

2025-05-01

ICML.cc/2025/Conference (oral presentation)

openreview.net

Multi-Modal Language Models as Text-to-Image Model Evaluators

Jiahui Chen

Candace Ross

Koustuv Sinha

Melissa Hall

2025-05-01

arXiv (preprint)

Multi-Modal Language Models as Text-to-Image Model Evaluators

Jiahui Chen

Candace Ross

Koustuv Sinha

Melissa Hall

2025-05-01

arXiv (published)

Entropy Rectifying Guidance for Diffusion and Flow Models

Tariq Berrada

Jakob Verbeek

Karteek Alahari

2025-04-18

arXiv (preprint)

Improving the Scaling Laws of Synthetic Data with Deliberate Practice

Mohammad Pezeshki

Elvis Dohmatob

Florian Bordes

Pietro Astolfi

Melissa Hall

Jakob Verbeek

Inspired by the principle of deliberate practice in human learning, we propose Deliberate Practice for Synthetic Data Generation (DP), a novel framework that improves sample efficiency through dynamic synthetic data generation. Prior work has shown that scaling synthetic data is inherently challenging, as naively adding new data leads to diminishing returns. To address this, pruning has been identified as a key mechanism for improving scaling, enabling models to focus on the most informative synthetic samples. Rather than generating a large dataset and pruning it afterward, DP efficiently approximates the direct generation of informative samples. We theoretically show how training on challenging, informative examples improves scaling laws and empirically validate that DP achieves better scaling performance with significantly fewer training samples and iterations. On ImageNet-100, DP generates 3.4x fewer samples and requires six times fewer iterations, while on ImageNet-1k, it generates 8x fewer samples with a 30 percent reduction in iterations, all while achieving superior performance compared to prior work.

2025-02-21

arXiv (preprint)

PairBench: Are Vision-Language Models Reliable at Comparing What They See?

Aarash Feizi

Sai Rajeswar

Reihaneh Rabbany

Valentina Zantedeschi

Spandana Gella

Joao Monteiro

Understanding how effectively large vision language models (VLMs) compare visual inputs is crucial across numerous applications, yet this fundamental capability remains insufficiently assessed. While VLMs are increasingly deployed for tasks requiring comparative judgment, including automated evaluation, re-ranking, and retrieval-augmented generation, no systematic framework exists to measure their performance in these scenarios. We present PairBench, a simple framework that evaluates VLMs as customizable similarity tools using widely available image datasets. Our approach introduces four key metrics for reliable comparison: alignment with human annotations, consistency across pair ordering, distribution smoothness, and controllability through prompting. Our analysis reveals that no model consistently excels across all metrics, with each demonstrating distinct strengths and weaknesses. Most concerning is the widespread inability of VLMs to maintain symmetric similarity scores. Interestingly, we demonstrate that performance on our benchmark strongly correlates with popular benchmarks used for more complex tasks, while providing additional metrics into controllability, smoothness and ordering. This makes PairBench a unique and comprehensive framework to evaluate the performance of VLMs for automatic evaluation depending on the task.

2025-02-21

arXiv (preprint)