Florian Bordes

IntPhys 2: Benchmarking Intuitive Physics Understanding In Complex Synthetic Environments

Quentin Garrido

Justine T Kao

Adina Williams

Michael G. Rabbat

Emmanuel Dupoux

We present IntPhys 2, a video benchmark designed to evaluate the intuitive physics understanding of deep learning models. Building on the original IntPhys benchmark, IntPhys 2 focuses on four core principles related to macroscopic objects: Permanence, Immutability, Spatio-Temporal Continuity, and Solidity. These conditions are inspired by research into intuitive physical understanding emerging during early childhood. IntPhys 2 offers a comprehensive suite of tests, based on the violation of expectation framework, that challenge models to differentiate between possible and impossible events within controlled and diverse virtual environments. Alongside the benchmark, we provide performance evaluations of several state-of-the-art models. Our findings indicate that while these models demonstrate basic visual understanding, they face significant challenges in grasping intuitive physics across the four principles in complex scenes, with most models performing at chance levels (50%), in stark contrast to human performance, which achieves near-perfect accuracy. This underscores the gap between current models and human-like intuitive physics understanding, highlighting the need for advancements in model architectures and training methodologies.

2025-06-10

ArXiv (preprint)

Improving the Scaling Laws of Synthetic Data with Deliberate Practice

Reyhane Askari-Hemmat

Mohammad Pezeshki

Elvis Dohmatob

Pietro Astolfi

Melissa Hall

Jakob Verbeek

Adriana Romero-Soriano

Inspired by the principle of deliberate practice in human learning, we propose Deliberate Practice for Synthetic Data Generation (DP), a novel framework that improves sample efficiency through dynamic synthetic data generation. Prior work has shown that scaling synthetic data is inherently challenging, as naively adding new data leads to diminishing returns. To address this, pruning has been identified as a key mechanism for improving scaling, enabling models to focus on the most informative synthetic samples. Rather than generating a large dataset and pruning it afterward, DP efficiently approximates the direct generation of informative samples. We theoretically show how training on challenging, informative examples improves scaling laws and empirically validate that DP achieves better scaling performance with significantly fewer training samples and iterations. On ImageNet-100, DP generates 3.4x fewer samples and requires six times fewer iterations, while on ImageNet-1k, it generates 8x fewer samples with a 30 percent reduction in iterations, all while achieving superior performance compared to prior work.

2025-04-30

ICML.cc/2025/Conference (oral)

proceedings.mlr.press

Deliberate Practice with Synthetic Data

Reyhane Askari Hemmat

Mohammad Pezeshki

Pietro Astolfi

Melissa Hall

Jakob Verbeek

Adriana Romero

2024-10-09

NeurIPS.cc/2024/Workshop/AFM (poster)

OC-CLIP: Object-centric binding in Contrastive Language-Image Pretraining

Rim Assouel

Pietro Astolfi

Adriana Romero

Recent advancements in vision-language models (VLMs) have been driven by contrastive models like CLIP which learn to associate visual information with their corresponding text descriptions. However, these models have limitations in understanding complex compositional scenes involving multiple objects and their spatial relationships. To address these challenges, we propose a novel approach that diverges from traditional data-centric methods of enhancing model performance with hard negative examples. Our work instead focuses on integrating sufficient inductive biases into pre-trained CLIP-like models to improve their compositional understanding without using additional data annotations. We introduce a binding module that connects a scene graph of the text with an induced graph-like representation of the image, facilitating a structured similarity assessment. We also leverage relationships as text-conditioned visual constraints, thereby capturing the intricate interactions between objects and their contextual relationships more effectively. Our resulting model (OC-CLIP) not only enhances the performance of CLIP in multi-object compositional understanding but also paves the way for more accurate and efficient image-text matching in complex scenes.

2024-10-09

NeurIPS.cc/2024/Workshop/Compositional_Learning (poster)

Feedback-guided Data Synthesis for Imbalanced Classification

Reyhane Askari Hemmat

Mohammad Pezeshki

Adriana Romero

The current status quo in machine learning is to use static datasets of real images for training, which often come from long-tailed distributions. With the recent advances in generative models, researchers have started augmenting these static datasets with synthetic data, reporting moderate performance improvements on classification tasks. We hypothesize that these performance gains are limited by the lack of feedback from the classifier to the generative model, which would promote the usefulness of the generated samples to improve the classifier's performance. In this work, we introduce a framework for augmenting static datasets with useful synthetic samples, which leverages one-shot feedback from the classifier to drive the sampling of the generative model. In order for the framework to be effective, we find that the samples must be close to the support of the real data of the task at hand, and be sufficiently diverse. We validate three feedback criteria on long-tailed datasets (ImageNet-LT, Places-LT) as well as a group-imbalanced dataset (NICO++). On ImageNet-LT, we achieve state-of-the-art results, with over

2024-09-10

TMLR (accepted)

Stochastic positional embeddings improve masked image modeling

Amir Bar

Assaf Shocher

Mahmoud Assran

Pascal Vincent

Nicolas Ballas

Trevor Darrell

Amir Globerson

Yann LeCun

2024-07-07

Proceedings of the 41st International Conference on Machine Learning (published)

proceedings.mlr.press

A Picture is Worth More Than 77 Text Tokens: Evaluating CLIP-Style Models on Dense Captions

Jack Urbanek

Pietro Astolfi

Mary Williamson

Vasu Sharma

Adriana Romero

2024-06-15

2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (published)

arxiv.org

An Introduction to Vision-Language Modeling

Richard Yuanzhe Pang

Anurag Ajay

Alexander C. Li

Adrien Bardes

Suzanne Petryk

Oscar Mañas

Zhiqiu Lin

Anas Mahmoud

Bargav Jayaraman

Mark Ibrahim

Melissa Hall

Yunyang Xiong

Jonathan Lebensold

Candace Ross

Srihari Jayakumar

Chuan Guo

Diane Bouchacourt

Haider Al-Tahan

Karthik Padthe

Vasu Sharma

Huijuan Xu

Hu Xu

Xiaoqing Ellen Tan

Megan Richards

Samuel Lavoie

Pietro Astolfi

Reyhane Askari Hemmat

Jun Chen

Kushal Tirumala

Rim Assouel

Mazda Moayeri

Arjang Talattof

Kamalika Chaudhuri

Zechun Liu

Xilun Chen

Quentin Garrido

Karen Ullrich

Aishwarya Agrawal

Kate Saenko

Asli Celikyilmaz

Vikas Chandra

2024-05-26

arXiv (preprint)

arxiv.org

PUG: Photorealistic and Semantically Controllable Synthetic Data for Representation Learning

Shashank Shekhar

Mark Ibrahim

Diane Bouchacourt

Pascal Vincent

Ari S. Morcos

Synthetic image datasets offer unmatched advantages for designing and evaluating deep neural networks: they make it possible to (i) render as many data samples as needed, (ii) precisely control each scene and yield granular ground truth labels (and captions), (iii) precisely control distribution shifts between training and testing to isolate variables of interest for sound experimentation. Despite such promise, the use of synthetic image data is still limited -- and often played down -- mainly due to their lack of realism. Most works therefore rely on datasets of real images, which have often been scraped from public images on the internet, and may have issues with regards to privacy, bias, and copyright, while offering little control over how objects precisely appear. In this work, we present a path to democratize the use of photorealistic synthetic data: we develop a new generation of interactive environments for representation learning research, that offer both controllability and realism. We use the Unreal Engine, a powerful game engine well known in the entertainment industry, to produce PUG (Photorealistic Unreal Graphics) environments and datasets for representation learning. Using PUG for evaluation and fine-tuning, we demonstrate the potential of PUG to both enable more rigorous evaluations and to improve model training.

2023-09-24

NeurIPS.cc/2023/Track/Datasets_and_Benchmarks (poster)

Do SSL Models Have Déjà Vu? A Case of Unintended Memorization in Self-supervised Learning

Casey Meehan

Pascal Vincent

Kamalika Chaudhuri

Chuan Guo

Self-supervised learning (SSL) algorithms can produce useful image representations by learning to associate different parts of natural images with one another. However, when taken to the extreme, SSL models can unintendedly memorize specific parts in individual training samples rather than learning semantically meaningful associations. In this work, we perform a systematic study of the unintended memorization of image-specific information in SSL models -- which we refer to as déjà vu memorization. Concretely, we show that given the trained model and a crop of a training image containing only the background (e.g., water, sky, grass), it is possible to infer the foreground object with high accuracy or even visually reconstruct it. Furthermore, we show that déjà vu memorization is common to different SSL algorithms, is exacerbated by certain design choices, and cannot be detected by conventional techniques for evaluating representation quality. Our study of déjà vu memorization reveals previously unknown privacy risks in SSL models, as well as suggests potential practical mitigation strategies.

2023-09-20

NeurIPS.cc/2023/Conference (poster)

A surprisingly simple technique to control the pretraining bias for better transfer: Expand or Narrow your representation