Publications

GradTune: Last-layer Fine-tuning for Group Robustness Without Group Annotation

Patrik Joslin Kenfack

Ulrich Aivodji

Samira Ebrahimi Kahou

This work addresses the limitations of deep neural networks (DNNs) in generalizing beyond training data due to spurious correlations. Recent research has demonstrated that models trained with empirical risk minimization learn both core and spurious features, often upweighting spurious ones in the final classification, which can frequently lead to poor performance on minority groups. Deep Feature Reweighting alleviates this issue by retraining the model's last classification layer using a group-balanced held-out validation set. However, relying on spurious feature labels during training or validation limits practical application, as spurious features are not always known or costly to annotate. Our preliminary experiments reveal that ERM-trained models exhibit higher gradient norms on minority group samples in the hold-out dataset. Leveraging these insights, we propose an alternative approach called GradTune, which fine-tunes the last classification layer using high-gradient norm samples. Our results on four well-established benchmarks demonstrate that the proposed method can achieve competitive performance compared to existing methods without requiring group labels during training or validation.

2025-03-06

ICLR.cc/2025/Workshop/SCSL (published)

Graph-Jigsaw Conditioned Diffusion Model for Skeleton-based Video Anomaly Detection

Ali Karami

Thi Kieu Khanh Ho

Narges Armanfard

2025-03-06

2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) (published)

A Joint Space-Time Encoder for Geographic Time-Series Data

David Mickisch

Konstantin Klemmer

Mélisande Teng

David Rolnick

Many real-world processes are characterized by complex spatio-temporal dependencies, from climate dynamics to disease spread. Here, we introduce a new neural network architecture to model such dynamics at scale: the Space-Time Encoder. Building on recent advances in location encoders, models that take as inputs geographic coordinates, we develop a method that takes in geographic and temporal information simultaneously and learns smooth, continuous functions in both space and time. The inputs are first transformed using positional encoding functions and then fed into neural networks that allow the learning of complex functions. We implement a prototype of the Space-Time Encoder, discuss the design choices of the novel temporal encoding, and demonstrate its utility in climate model emulation. We discuss the potential of the method across use cases, as well as promising avenues for further methodological innovation.

2025-03-06

ICLR.cc/2025/Workshop/MLMP (poster)

Mitigating Shortcut Learning with Diffusion Counterfactuals and Diverse Ensembles

Luca Scimeca

Alexander Rubinstein

Damien Teney

Seong Joon Oh

Armand Mihai Nicolicioiu

Spurious correlations in the data, where multiple cues are predictive of the target labels, often lead to a phenomenon known as shortcut learning, where a model relies on erroneous, easy-to-learn cues while ignoring reliable ones. In this work, we propose

2025-03-06

ICLR.cc/2025/Workshop/SCSL (published)

Mixed Patch Visible-Infrared Modality Agnostic Object Detection

Heitor Rapela Medeiros

David Latortue

Eric Granger

Marco Pedersoli

In real-world scenarios, using multiple modalities like visible (RGB) and infrared (IR) can greatly improve the performance of a predictive task such as object detection (OD). Multimodal learning is a common way to leverage these modalities, where multiple modality-specific encoders and a fusion module are used to improve performance. In this paper, we tackle a different way to employ RGB and IR modalities, where only one modality or the other is observed by a single shared vision encoder. This realistic setting requires a lower memory footprint and is more suitable for applications such as autonomous driving and surveillance, which commonly rely on RGB and IR data. However, when learning a single encoder on multiple modalities, one modality can dominate the other, producing uneven recognition results. This work investigates how to efficiently leverage RGB and IR modalities to train a common transformer-based OD vision encoder while countering the effects of modality imbalance. For this, we introduce a novel training technique to Mix Patches (MiPa) from the two modalities, in conjunction with a patch-wise modality agnostic module, for learning a common representation of both modalities. Our experiments show that MiPa can learn a representation to reach competitive results on traditional RGB/IR benchmarks while only requiring a single modality during inference. Our code is available at: https://github.com/heitorrapela/MiPa.

2025-03-06

2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) (published)

Outsourced diffusion sampling: Efficient posterior inference in latent spaces of generative models

Siddarth Venkatraman

Mohsin Hasan

Minsu Kim

Luca Scimeca

Marcin Sendera

Glen Berseth

Nikolay Malkin

Any well-behaved generative model over a variable …

2025-03-06

ICLR.cc/2025/Workshop/DeLTa (poster)

A Realistic Protocol for Evaluation of Weakly Supervised Object Localization

Shakeeb Murtaza

Soufiane Belharbi

Marco Pedersoli

Eric Granger

Weakly Supervised Object Localization (WSOL) allows training deep learning models for classification and localization (LOC) using only global class-level labels. The absence of bounding box (bbox) supervision during training raises challenges in the literature for hyper-parameter tuning, model selection, and evaluation. WSOL methods rely on a validation set with bbox annotations for model selection, and a test set with bbox annotations for threshold estimation for producing bboxes from localization maps. This approach, however, is not aligned with the WSOL setting as these annotations are typically unavailable in real-world scenarios. Our initial empirical analysis shows a significant decline in LOC performance when model selection and threshold estimation rely solely on class labels and the image itself, respectively, compared to using manual bbox annotations. This highlights the importance of incorporating bbox labels for optimal model performance. In this paper, a new WSOL evaluation protocol is proposed that provides LOC information without the need for manual bbox annotations. In particular, we generated noisy pseudo-boxes from a pretrained off-the-shelf region proposal method such as Selective Search, CLIP, and RPN for model selection. These bboxes are also employed to estimate the threshold from LOC maps, circumventing the need for test-set bbox annotations. Our experiments with several WSOL methods on ILSVRC and CUB datasets show that using the proposed pseudo-bboxes for validation facilitates the model selection and threshold estimation, with LOC performance comparable to those selected using GT bboxes on the validation set and threshold estimation on the test set. It also outperforms models selected using class-level labels, and then dynamically thresholded based solely on LOC maps.

2025-03-06

2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) (published)

SafeArena: Evaluating the Safety of Autonomous Web Agents

Ada Defne Tur

Nicholas Meade

Xing Han Lu

Alejandra Zambrano

Arkil Patel

Esin Durmus

Spandana Gella

Karolina Stańczak

Siva Reddy

2025-03-06

ArXiv (preprint)

Shaping Inductive Bias in Diffusion Models through Frequency-Based Noise Control

Thomas Jiralerspong

Berton Earnshaw

Jason Hartford

Luca Scimeca

Diffusion Probabilistic Models (DPMs) are powerful generative models that have achieved unparalleled success in a number of generative tasks. In this work, we aim to build inductive biases into the training and sampling of diffusion models to better accommodate the target distribution of the data to model. For topologically structured data, we devise a frequency-based noising operator to purposefully manipulate, and set, these inductive biases. We first show that appropriate manipulations of the noising forward process can lead DPMs to focus on particular aspects of the distribution to learn. We show that different datasets necessitate different inductive biases, and that appropriate frequency-based noise control induces increased generative performance compared to standard diffusion. Finally, we demonstrate the possibility of ignoring information at particular frequencies while learning. We show this in an image corruption and recovery task, where we train a DPM to recover the original target distribution after severe noise corruption.

2025-03-06

ICLR.cc/2025/Workshop/DeLTa (poster)

Laurence Perreault-Levasseur

Solving Bayesian inverse problems with diffusion priors and off-policy RL

Luca Scimeca

Siddarth Venkatraman

Moksh J. Jain

Minsu Kim

Marcin Sendera

Mohsin Hasan

Luke Rowe

Sarthak Mittal

Pablo Lemos

Emmanuel Bengio

Alexandre Adam

Jarrid Rector-Brooks

Yashar Hezaveh

Glen Berseth

Nikolay Malkin

This paper presents a practical application of Relative Trajectory Balance (RTB), a recently introduced off-policy reinforcement learning (RL) objective that can asymptotically solve Bayesian inverse problems optimally. We extend the original work by using RTB to train conditional diffusion model posteriors from pretrained unconditional priors for challenging linear and non-linear inverse problems in vision and science. We use the objective alongside techniques such as off-policy backtracking exploration to improve training. Importantly, our results show that existing training-free diffusion posterior methods struggle to perform effective posterior inference in latent space due to inherent biases.

2025-03-06

ICLR.cc/2025/Workshop/DeLTa (poster)

Towards personalized healthcare without harm via bias modulation

Frank Ngaha

Patrik Joslin Kenfack

Ulrich Aivodji

Samira Ebrahimi Kahou

Personalized machine learning models have gained significant importance in various domains, including healthcare. However, designing efficient personalized models remains a challenge. Traditional approaches often involve training multiple sub-models for different population sub-groups, which can be costly and does not always guarantee improved performance across all sub-groups. This paper presents a novel approach to improving model performance at the sub-group level by leveraging bias and training a joint model. Our method involves a two-step process: first, we train a model to predict group attributes, and then we use this model to learn data-dependent biases to modulate a second model for diagnosis prediction. Our results demonstrate that this joint architecture achieves consistent performance gains across all sub-groups in the Heart dataset. Furthermore, in the mortality dataset, it improves performance in two of the four sub-groups. A comparison of our method with the traditional decoupled personalization method demonstrated a greater performance gain in the sub-groups with less harm. This approach offers a more effective and scalable solution for personalization of models, which could have a positive impact in healthcare and other areas that require predictive models which take sub-group information into account.

2025-03-06

ICLR.cc/2025/Workshop/SCSL (published)