Portrait de Adriana Romero Soriano

Adriana Romero Soriano

Membre industriel principal
Chaire en IA Canada-CIFAR
Professeure adjointe, McGill University, École d'informatique
Chercheuse scientifique, Meta AI Research (FAIR)
Sujets de recherche
Apprentissage profond
Modèles génératifs
Vision par ordinateur

Biographie

Adriana Romero-Soriano est chercheuse à Meta (FAIR, Fundamental AI Research), professeure adjointe à l'Université McGill, membre industriel principal de Mila – Institut québécois d’intelligence artificielle et titulaire d'une chaire en IA Canada-CIFAR. Ses recherches se situent à l'intersection des modèles génératifs, de la vision par ordinateur et de l'IA responsable. Ses travaux les plus récents portent sur l'amélioration de la qualité, de la contrôlabilité, de la cohérence et de la diversité de représentation des systèmes de création de contenu visuel. Elle a obtenu son doctorat à l'Université de Barcelone, où elle a travaillé avec Carlo Gatta, et a été chercheuse postdoctorale pendant deux ans à Mila, où elle a travaillé avec le professeur Yoshua Bengio.

Étudiants actuels

Doctorat - McGill
Superviseur⋅e principal⋅e :
Doctorat - McGill
Superviseur⋅e principal⋅e :

Publications

Improving the Scaling Laws of Synthetic Data with Deliberate Practice
Reyhane Askari Hemmat
Mohammad Pezeshki
Florian Bordes
Pietro Astolfi
Melissa Hall
Jakob Verbeek
Michal Drozdzal
Inspired by the principle of deliberate practice in human learning, we propose Deliberate Practice for Synthetic Data Generation (DP), a nov… (voir plus)el framework that improves sample efficiency through dynamic synthetic data generation. Prior work has shown that scaling synthetic data is inherently challenging, as naively adding new data leads to diminishing returns. To address this, pruning has been identified as a key mechanism for improving scaling, enabling models to focus on the most informative synthetic samples. Rather than generating a large dataset and pruning it afterward, DP efficiently approximates the direct generation of informative samples. We theoretically show how training on challenging, informative examples improves scaling laws and empirically validate that DP achieves better scaling performance with significantly fewer training samples and iterations. On ImageNet-100, DP generates 3.4x fewer samples and requires six times fewer iterations, while on ImageNet-1k, it generates 8x fewer samples with a 30 percent reduction in iterations, all while achieving superior performance compared to prior work.
PairBench: A Systematic Framework for Selecting Reliable Judge VLMs
Aarash Feizi
Sai Rajeswar
Spandana Gella
Valentina Zantedeschi
Joao Monteiro
As large vision language models (VLMs) are increasingly used as automated evaluators, understanding their ability to effectively compare dat… (voir plus)a pairs as instructed in the prompt becomes essential. To address this, we present PairBench, a low-cost framework that systematically evaluates VLMs as customizable similarity tools across various modalities and scenarios. Through PairBench, we introduce four metrics that represent key desiderata of similarity scores: alignment with human annotations, consistency for data pairs irrespective of their order, smoothness of similarity distributions, and controllability through prompting. Our analysis demonstrates that no model, whether closed- or open-source, is superior on all metrics; the optimal choice depends on an auto evaluator's desired behavior (e.g., a smooth vs. a sharp judge), highlighting risks of widespread adoption of VLMs as evaluators without thorough assessment. For instance, the majority of VLMs struggle with maintaining symmetric similarity scores regardless of order. Additionally, our results show that the performance of VLMs on the metrics in PairBench closely correlates with popular benchmarks, showcasing its predictive power in ranking models.
Object-centric Binding in Contrastive Language-Image Pretraining
Rim Assouel
Pietro Astolfi
Florian Bordes
Michal Drozdzal
Object-centric Binding in Contrastive Language-Image Pretraining
Rim Assouel
Pietro Astolfi
Florian Bordes
Michal Drozdzal
Recent advances in vision language models (VLM) have been driven by contrastive models such as CLIP, which learn to associate visual informa… (voir plus)tion with their corresponding text descriptions. However, these models have limitations in understanding complex compositional scenes involving multiple objects and their spatial relationships. To address these challenges, we propose a novel approach that diverges from commonly used strategies, which rely on the design of hard-negative augmentations. Instead, our work focuses on integrating inductive biases into pre-trained CLIP-like models to improve their compositional understanding without using any additional hard-negatives. To that end, we introduce a binding module that connects a scene graph, derived from a text description, with a slot-structured image representation, facilitating a structured similarity assessment between the two modalities. We also leverage relationships as text-conditioned visual constraints, thereby capturing the intricate interactions between objects and their contextual relationships more effectively. Our resulting model not only enhances the performance of CLIP-based models in multi-object compositional understanding but also paves the way towards more accurate and sample-efficient image-text matching of complex scenes.
Improving the Scaling Laws of Synthetic Data with Deliberate Practice
Reyhane Askari Hemmat
Mohammad Pezeshki
Florian Bordes
Pietro Astolfi
Melissa Hall
Jakob Verbeek
Michal Drozdzal
PairBench: A Systematic Framework for Selecting Reliable Judge VLMs
Aarash Feizi
Sai Rajeswar
Spandana Gella
Valentina Zantedeschi
Joao Monteiro
As large vision language models (VLMs) are increasingly used as automated evaluators, understanding their ability to effectively compare dat… (voir plus)a pairs as instructed in the prompt becomes essential. To address this, we present PairBench, a low-cost framework that systematically evaluates VLMs as customizable similarity tools across various modalities and scenarios. Through PairBench, we introduce four metrics that represent key desiderata of similarity scores: alignment with human annotations, consistency for data pairs irrespective of their order, smoothness of similarity distributions, and controllability through prompting. Our analysis demonstrates that no model, whether closed- or open-source, is superior on all metrics; the optimal choice depends on an auto evaluator's desired behavior (e.g., a smooth vs. a sharp judge), highlighting risks of widespread adoption of VLMs as evaluators without thorough assessment. For instance, the majority of VLMs struggle with maintaining symmetric similarity scores regardless of order. Additionally, our results show that the performance of VLMs on the metrics in PairBench closely correlates with popular benchmarks, showcasing its predictive power in ranking models.
Boosting Latent Diffusion with Perceptual Objectives
Tariq Berrada
Pietro Astolfi
Jakob Verbeek
Melissa Hall
Marton Havasi
Michal Drozdzal
Yohann Benchetrit
Karteek Alahari
What makes a good metric? Evaluating automatic metrics for text-to-image consistency
Candace Ross
Melissa Hall
Adina Williams
Language models are increasingly being incorporated as components in larger AI systems for various purposes, from prompt optimization to aut… (voir plus)omatic evaluation. In this work, we analyze the construct validity of four recent, commonly used methods for measuring text-to-image consistency - CLIPScore, TIFA, VPEval, and DSG - which rely on language models and/or VQA models as components. We define construct validity for text-image consistency metrics as a set of desiderata that text-image consistency metrics should have, and find that no tested metric satisfies all of them. We find that metrics lack sufficient sensitivity to language and visual properties. Next, we find that TIFA, VPEval and DSG contribute novel information above and beyond CLIPScore, but also that they correlate highly with each other. We also ablate different aspects of the text-image consistency metrics and find that not all model components are strictly necessary, also a symptom of insufficient sensitivity to visual information. Finally, we show that all three VQA-based metrics likely rely on familiar text shortcuts (such as yes-bias in QA) that call their aptitude as quantitative evaluations of model performance into question.
What makes a good metric? Evaluating automatic metrics for text-to-image consistency
Candace Ross
Melissa Hall
Adina Williams
Language models are increasingly being incorporated as components in larger AI systems for various purposes, from prompt optimization to aut… (voir plus)omatic evaluation. In this work, we analyze the construct validity of four recent, commonly used methods for measuring text-to-image consistency - CLIPScore, TIFA, VPEval, and DSG - which rely on language models and/or VQA models as components. We define construct validity for text-image consistency metrics as a set of desiderata that text-image consistency metrics should have, and find that no tested metric satisfies all of them. We find that metrics lack sufficient sensitivity to language and visual properties. Next, we find that TIFA, VPEval and DSG contribute novel information above and beyond CLIPScore, but also that they correlate highly with each other. We also ablate different aspects of the text-image consistency metrics and find that not all model components are strictly necessary, also a symptom of insufficient sensitivity to visual information. Finally, we show that all three VQA-based metrics likely rely on familiar text shortcuts (such as yes-bias in QA) that call their aptitude as quantitative evaluations of model performance into question.
EvalGIM: A Library for Evaluating Generative Image Models
Melissa Hall
Oscar Mañas
Reyhane Askari
Mark Ibrahim
Candace Ross
Pietro Astolfi
Tariq Berrada
Marton Havasi
Yohann Benchetrit
Karen Ullrich
Carolina Braga
Abhishek Charnalia
Maeve Ryan
Michael Rabbat
Michal Drozdzal
Jakob Verbeek
As the use of text-to-image generative models increases, so does the adoption of automatic benchmarking methods used in their evaluation. Ho… (voir plus)wever, while metrics and datasets abound, there are few unified benchmarking libraries that provide a framework for performing evaluations across many datasets and metrics. Furthermore, the rapid introduction of increasingly robust benchmarking methods requires that evaluation libraries remain flexible to new datasets and metrics. Finally, there remains a gap in synthesizing evaluations in order to deliver actionable takeaways about model performance. To enable unified, flexible, and actionable evaluations, we introduce EvalGIM (pronounced ''EvalGym''), a library for evaluating generative image models. EvalGIM contains broad support for datasets and metrics used to measure quality, diversity, and consistency of text-to-image generative models. In addition, EvalGIM is designed with flexibility for user customization as a top priority and contains a structure that allows plug-and-play additions of new datasets and metrics. To enable actionable evaluation insights, we introduce ''Evaluation Exercises'' that highlight takeaways for specific evaluation questions. The Evaluation Exercises contain easy-to-use and reproducible implementations of two state-of-the-art evaluation methods of text-to-image generative models: consistency-diversity-realism Pareto Fronts and disaggregated measurements of performance disparities across groups. EvalGIM also contains Evaluation Exercises that introduce two new analysis methods for text-to-image generative models: robustness analyses of model rankings and balanced evaluations across different prompt styles. We encourage text-to-image model exploration with EvalGIM and invite contributions at https://github.com/facebookresearch/EvalGIM/.
EvalGIM: A Library for Evaluating Generative Image Models
Melissa Hall
Oscar Mañas
Reyhane Askari Hemmat
Mark Ibrahim
Candace Ross
Pietro Astolfi
Tariq Berrada
Marton Havasi
Yohann Benchetrit
Karen Ullrich
Carolina Braga
Abhishek Charnalia
Maeve Ryan
Michal Drozdzal
Jakob Verbeek
As the use of text-to-image generative models increases, so does the adoption of automatic benchmarking methods used in their evaluation. Ho… (voir plus)wever, while metrics and datasets abound, there are few unified benchmarking libraries that provide a framework for performing evaluations across many datasets and metrics. Furthermore, the rapid introduction of increasingly robust benchmarking methods requires that evaluation libraries remain flexible to new datasets and metrics. Finally, there remains a gap in synthesizing evaluations in order to deliver actionable takeaways about model performance. To enable unified, flexible, and actionable evaluations, we introduce EvalGIM (pronounced ''EvalGym''), a library for evaluating generative image models. EvalGIM contains broad support for datasets and metrics used to measure quality, diversity, and consistency of text-to-image generative models. In addition, EvalGIM is designed with flexibility for user customization as a top priority and contains a structure that allows plug-and-play additions of new datasets and metrics. To enable actionable evaluation insights, we introduce ''Evaluation Exercises'' that highlight takeaways for specific evaluation questions. The Evaluation Exercises contain easy-to-use and reproducible implementations of two state-of-the-art evaluation methods of text-to-image generative models: consistency-diversity-realism Pareto Fronts and disaggregated measurements of performance disparities across groups. EvalGIM also contains Evaluation Exercises that introduce two new analysis methods for text-to-image generative models: robustness analyses of model rankings and balanced evaluations across different prompt styles. We encourage text-to-image model exploration with EvalGIM and invite contributions at https://github.com/facebookresearch/EvalGIM/.
EvalGIM: A Library for Evaluating Generative Image Models
Melissa Hall
Oscar Mañas
Reyhane Askari
Mark Ibrahim
Candace Ross
Pietro Astolfi
Tariq Berrada
Marton Havasi
Yohann Benchetrit
Karen Ullrich
Carolina Braga
Abhishek Charnalia
Maeve Ryan
Michael Rabbat
Michal Drozdzal
Jakob Verbeek
As the use of text-to-image generative models increases, so does the adoption of automatic benchmarking methods used in their evaluation. Ho… (voir plus)wever, while metrics and datasets abound, there are few unified benchmarking libraries that provide a framework for performing evaluations across many datasets and metrics. Furthermore, the rapid introduction of increasingly robust benchmarking methods requires that evaluation libraries remain flexible to new datasets and metrics. Finally, there remains a gap in synthesizing evaluations in order to deliver actionable takeaways about model performance. To enable unified, flexible, and actionable evaluations, we introduce EvalGIM (pronounced ''EvalGym''), a library for evaluating generative image models. EvalGIM contains broad support for datasets and metrics used to measure quality, diversity, and consistency of text-to-image generative models. In addition, EvalGIM is designed with flexibility for user customization as a top priority and contains a structure that allows plug-and-play additions of new datasets and metrics. To enable actionable evaluation insights, we introduce ''Evaluation Exercises'' that highlight takeaways for specific evaluation questions. The Evaluation Exercises contain easy-to-use and reproducible implementations of two state-of-the-art evaluation methods of text-to-image generative models: consistency-diversity-realism Pareto Fronts and disaggregated measurements of performance disparities across groups. EvalGIM also contains Evaluation Exercises that introduce two new analysis methods for text-to-image generative models: robustness analyses of model rankings and balanced evaluations across different prompt styles. We encourage text-to-image model exploration with EvalGIM and invite contributions at https://github.com/facebookresearch/EvalGIM/.