Adriana Romero Soriano

Michal Drozdzal

Building world models that accurately and comprehensively represent the real world is the utmost aspiration for conditional image generative… (voir plus) models as it would enable their use as world simulators. For these models to be successful world models, they should not only excel at image quality and prompt-image consistency but also ensure high representation diversity. However, current research in generative models mostly focuses on creative applications that are predominantly concerned with human preferences of image quality and aesthetics. We note that generative models have inference time mechanisms - or knobs - that allow the control of generation consistency, quality, and diversity. In this paper, we use state-of-the-art text-to-image and image-and-text-to-image models and their knobs to draw consistency-diversity-realism Pareto fronts that provide a holistic view on consistency-diversity-realism multi-objective. Our experiments suggest that realism and consistency can both be improved simultaneously; however there exists a clear tradeoff between realism/consistency and diversity. By looking at Pareto optimal points, we note that earlier models are better at representation diversity and worse in consistency/realism, and more recent models excel in consistency/realism while decreasing significantly the representation diversity. By computing Pareto fronts on a geodiverse dataset, we find that the first version of latent diffusion models tends to perform better than more recent models in all axes of evaluation, and there exist pronounced consistency-diversity-realism disparities between geographical regions. Overall, our analysis clearly shows that there is no best model and the choice of model should be determined by the downstream application. With this analysis, we invite the research community to consider Pareto fronts as an analytical tool to measure progress towards world models.

2024-10-10

NeurIPS.cc/2024/Workshop/RBFM (présentation orale)

Controlling Multimodal LLMs via Reward-guided Decoding

Oscar Mañas

Pierluca D'Oro

Koustuv Sinha

Michal Drozdzal

Aishwarya Agrawal

2024-10-10

NeurIPS.cc/2024/Workshop/AFM (poster)

Deliberate Practice with Synthetic Data

Reyhane Askari Hemmat

Mohammad Pezeshki

Pietro Astolfi

Melissa Hall

Florian Bordes

Jakob Verbeek

Michal Drozdzal

2024-10-10

NeurIPS.cc/2024/Workshop/AFM (poster)

OC-CLIP : Object-centric binding in Contrastive Language-Image Pretraining

Rim Assouel

Pietro Astolfi

Florian Bordes

Michal Drozdzal

Recent advancements in vision-language models (VLMs) have been driven by contrastive models like CLIP which learn to associate visual inform… (voir plus)ation with their corresponding text descriptions. However, these models have limitations in understanding complex compositional scenes involving multiple objects and their spatial relationships. To address these challenges, we propose a novel approach that diverges from traditional data-centric methods of enhancing model performance with hard negatives examples. Our work instead focuses on integrating sufficient inductive biases into pre-trained CLIP-like models to improve their compositional understanding without using additional data annotations. We introduce a binding module that connects a scene graph of the text with an induced graph-like representation of the image, facilitating a structured similarity assessment. We also leverage relationships as text-conditioned visual constraints, thereby capturing the intricate interactions between objects and their contextual relationships more effectively. Our resulting model (OC-CLIP) not only enhances the performance of CLIP in multi-object compositional understanding but also paves the way for more accurate and efficient image-text matching in complex scenes.

2024-10-10

NeurIPS.cc/2024/Workshop/Compositional_Learning (poster)

On improved Conditioning Mechanisms and Pre-training Strategies for Diffusion Models

Tariq Berrada

Pietro Astolfi

Melissa Hall

Reyhane Askari Hemmat

Yohann Benchetrit

Marton Havasi

Matthew J. Muckley

Karteek Alahari

Jakob Verbeek

Michal Drozdzal

Large-scale training of latent diffusion models (LDMs) has enabled unprecedented quality in image generation. However, large-scale end-to-e… (voir plus)nd training of these models is computationally costly, and hence most research focuses either on finetuning pretrained models or experiments at smaller scales. In this work we aim to improve the training efficiency and performance of LDMs with the goal of scaling to larger datasets and higher resolutions. We focus our study on two points that are critical for good performance and efficient training: (i) the mechanisms used for semantic level (\eg a text prompt, or class name) and low-level (crop size, random flip, \etc) conditioning of the model, and (ii) pre-training strategies to transfer representations learned on smaller and lower-resolution datasets to larger ones. The main contributions of our work are the following: we present systematic experimental study of these points, we propose a novel conditioning mechanism that disentangles semantic and low-level conditioning, we obtain state-of-the-art performance on CC12M for text-to-image at 512 resolution.

2024-09-25

NeurIPS.cc/2024/Conference (poster)

What makes a good metric? Evaluating automatic metrics for text-to-image consistency

Candace Ross

Melissa Hall

Adina Williams

2024-07-10

colmweb.org/COLM/2024/Conference (accepté)

Decomposed evaluations of geographic disparities in text-to-image models

Abhishek Sureddy

Dishant Padalia

Nandhinee Periyakaruppan

Oindrila Saha

Adina Williams

Megan Richards

Polina Kirichenko

Melissa Hall

2024-06-28

ICML.cc/2024/Workshop/NextGenAISafety (poster)

A Picture is Worth More Than 77 Text Tokens: Evaluating CLIP-Style Models on Dense Captions

Jack Urbanek

Florian Bordes

Pietro Astolfi

Mary Williamson

Vasu Sharma

2024-06-16

2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (publié)

Improving Geo-diversity of Generated Images with Contextualized Vendi Score Guidance

Reyhane Askari Hemmat

Melissa Hall

Alicia Sun

Candace Ross

Michal Drozdzal

With the growing popularity of text-to-image generative models, there has been increasing focus on understanding their risks and biases. Rec… (voir plus)ent work has found that state-of-the-art models struggle to depict everyday objects with the true diversity of the real world and have notable gaps between geographic regions. In this work, we aim to increase the diversity of generated images of common objects such that per-region variations are representative of the real world. We introduce an inference time intervention, contextualized Vendi Score Guidance (c-VSG), that guides the backwards steps of latent diffusion models to increase the diversity of a sample as compared to a"memory bank"of previously generated images while constraining the amount of variation within that of an exemplar set of real-world contextualizing images. We evaluate c-VSG with two geographically representative datasets and find that it substantially increases the diversity of generated images, both for the worst performing regions and on average, while simultaneously maintaining or improving image quality and consistency. Additionally, qualitative analyses reveal that diversity of generated images is significantly improved, including along the lines of reductive region portrayals present in the original model. We hope that this work is a step towards text-to-image generative models that reflect the true geographic diversity of the world.

2024-06-06

ArXiv (prépublication)

Towards Geographic Inclusion in the Evaluation of Text-to-Image Models

Melissa Hall

Samuel J. Bell

Candace Ross

Adina Williams

Michal Drozdzal

Rapid progress in text-to-image generative models coupled with their deployment for visual content creation has magnified the importance of … (voir plus)thoroughly evaluating their performance and identifying potential biases. In pursuit of models that generate images that are realistic, diverse, visually appealing, and consistent with the given prompt, researchers and practitioners often turn to automated metrics to facilitate scalable and cost-effective performance profiling. However, commonly-used metrics often fail to account for the full diversity of human preference; often even in-depth human evaluations face challenges with subjectivity, especially as interpretations of evaluation criteria vary across regions and cultures. In this work, we conduct a large, cross-cultural study to study how much annotators in Africa, Europe, and Southeast Asia vary in their perception of geographic representation, visual appeal, and consistency in real and generated images from state-of-the art public APIs. We collect over 65,000 image annotations and 20 survey responses. We contrast human annotations with common automated metrics, finding that human preferences vary notably across geographic location and that current metrics do not fully account for this diversity. For example, annotators in different locations often disagree on whether exaggerated, stereotypical depictions of a region are considered geographically representative. In addition, the utility of automatic evaluations is dependent on assumptions about their set-up, such as the alignment of feature extractors with human perception of object similarity or the definition of"appeal"captured in reference datasets used to ground evaluations. We recommend steps for improved automatic and human evaluations.

2024-06-05

The 2024 ACM Conference on Fairness, Accountability, and Transparency (publié)

DP-RDM: Adapting Diffusion Models to Private Domains Without Fine-Tuning

Jonathan Lebensold

Maziar Sanjabi

Pietro Astolfi

Kamalika Chaudhuri

Michael Rabbat

Chuan Guo

2024-03-21

ArXiv (prépublication)

DP-RDM: Adapting Diffusion Models to Private Domains Without Fine-Tuning

Jonathan Lebensold

Maziar Sanjabi

Pietro Astolfi

Kamalika Chaudhuri

Michael Rabbat

Chuan Guo

Text-to-image diffusion models have been shown to suffer from sample-level memorization, possibly reproducing near-perfect replica of images… (voir plus) that they are trained on, which may be undesirable. To remedy this issue, we develop the first differentially private (DP) retrieval-augmented generation algorithm that is capable of generating high-quality image samples while providing provable privacy guarantees. Specifically, we assume access to a text-to-image diffusion model trained on a small amount of public data, and design a DP retrieval mechanism to augment the text prompt with samples retrieved from a private retrieval dataset. Our \emph{differentially private retrieval-augmented diffusion model} (DP-RDM) requires no fine-tuning on the retrieval dataset to adapt to another domain, and can use state-of-the-art generative models to generate high-quality image samples while satisfying rigorous DP guarantees. For instance, when evaluated on MS-COCO, our DP-RDM can generate samples with a privacy budget of

2024-03-21

ArXiv (prépublication)