Marco Pedersoli

Ismail Ben Ayed

Fully Test-Time Adaptation (TTA), which aims at adapting models to data drifts, has recently attracted wide interest. Numerous tricks and techniques have been proposed to ensure robust learning on arbitrary streams of unlabeled data. However, assessing the true impact of each individual technique and obtaining a fair comparison still constitutes a significant challenge. To help consolidate the community’s knowledge, we present a categorization of selected orthogonal TTA techniques, including small batch normalization, stream rebalancing, reliable sample selection, and network confidence calibration. We meticulously dissect the effect of each approach on different scenarios of interest. Through our analysis, we shed light on trade-offs induced by those techniques between accuracy, the computational power required, and model complexity. We also uncover the synergy that arises when combining techniques and are able to establish new state-of-the-art results.
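As an illustration of one technique named in the abstract above, reliable sample selection is often implemented in the TTA literature by keeping only predictions whose entropy falls below a threshold. The sketch below assumes that form; the function names, the `fraction` parameter, and its default value are hypothetical and are not taken from the paper.

```python
import math


def prediction_entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_reliable(batch_probs, fraction=0.4):
    """Return indices of samples considered reliable for adaptation.

    Assumption: a sample is "reliable" when its entropy is below
    fraction * ln(num_classes). The value of `fraction` is illustrative.
    """
    kept = []
    for i, probs in enumerate(batch_probs):
        threshold = fraction * math.log(len(probs))
        if prediction_entropy(probs) < threshold:
            kept.append(i)
    return kept


# Three softmax outputs from a hypothetical 3-class model.
batch = [
    [0.98, 0.01, 0.01],  # confident prediction
    [0.34, 0.33, 0.33],  # near-uniform prediction
    [0.90, 0.05, 0.05],  # fairly confident prediction
]
reliable_indices = select_reliable(batch)
```

Only the selected samples would then contribute to the adaptation loss, which trades some adaptation signal for protection against noisy updates.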

2024-01-02

2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) (published)

Domain Generalization by Rejecting Extreme Augmentations

Masih Aminbeidokhti

Fidel A. Guerrero Peña

Heitor Rapela Medeiros

Thomas Dubail

Eric Granger

2024-01-02

2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) (published)

HalluciDet: Hallucinating RGB Modality for Person Detection Through Privileged Information

Heitor Rapela Medeiros

Fidel A. Guerrero Peña

Masih Aminbeidokhti

Thomas Dubail

Eric Granger

A powerful way to adapt a visual recognition model to a new domain is through image translation. However, common image translation approaches only focus on generating data from the same distribution as the target domain. Given a cross-modal application, such as pedestrian detection from aerial images, with a considerable shift in data distribution between infrared (IR) and visible (RGB) images, a translation focused on generation might lead to poor performance as the loss focuses on irrelevant details for the task. In this paper, we propose HalluciDet, an IR-RGB image translation model for object detection. Instead of focusing on reconstructing the original image on the IR modality, it seeks to reduce the detection loss of an RGB detector, and therefore avoids the need to access RGB data. This model produces a new image representation that enhances objects of interest in the scene and greatly improves detection performance. We empirically compare our approach against state-of-the-art methods for image translation and for fine-tuning on IR, and show that our HalluciDet improves detection accuracy in most cases by exploiting the privileged information encoded in a pre-trained RGB detector. Code: https://github.com/heitorrapela/HalluciDet.

2024-01-02

2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) (published)

Multi-Source Domain Adaptation for Object Detection with Prototype-based Mean Teacher

Atif Belal

Akhil Meethal

Francisco Perdigon Romero

Eric Granger

2024-01-02

2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) (published)

Attention-based Class-Conditioned Alignment for Multi-Source Domain Adaptation of Object Detectors

Atif Belal

Akhil Meethal

Francisco Perdigon Romero

Eric Granger

Domain adaptation methods for object detection (OD) strive to mitigate the impact of distribution shifts by promoting feature alignment across source and target domains. Multi-source domain adaptation (MSDA) allows leveraging multiple annotated source datasets and unlabeled target data to improve the accuracy and robustness of the detection model. Most state-of-the-art MSDA methods for OD perform feature alignment in a class-agnostic manner. This is challenging since the objects have unique modality information due to variations in object appearance across domains. A recent prototype-based approach proposed a class-wise alignment, yet it suffers from error accumulation caused by noisy pseudo-labels that can negatively affect adaptation with imbalanced data. To overcome these limitations, we propose an attention-based class-conditioned alignment method for MSDA, designed to align instances of each object category across domains. In particular, an attention module combined with an adversarial domain classifier allows learning domain-invariant and class-specific instance representations. Experimental results on multiple benchmarking MSDA datasets indicate that our method outperforms state-of-the-art methods and exhibits robustness to class imbalance, achieved through a conceptually simple class-conditioning strategy. Our code is available at: https://github.com/imatif17/ACIA.

2023-12-31

arXiv.org (preprint)

Evaluating Supervision Levels Trade-Offs for Infrared-Based People Counting

David Latortue

Moetez Kdayem

Fidel A. Guerrero Peña

Eric Granger

Object detection models are commonly used for people counting (and localization) in many applications but require a dataset with costly bounding box annotations for training. Given the importance of privacy in people counting, these models rely more and more on infrared images, making the task even harder. In this paper, we explore how weaker levels of supervision affect the performance of deep person counting architectures for image classification and point-level localization. Our experiments indicate that counting people using a convolutional neural network with image-level annotation achieves a level of accuracy that is competitive with YOLO detectors and point-level localization models, yet provides a higher frame rate and a similar number of model parameters. Our code is available at: https://github.com/tortueTortue/IRPeopleCounting.

2023-12-31

2024 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW) (published)

Joint Multimodal Transformer for Dimensional Emotional Recognition in the Wild

Paul Waligora

Muhammad Osama Zeeshan

Muhammad Haseeb Aslam

Soufiane Belharbi

Alessandro Lameiras Koerich

Simon Bacon

Eric Granger

Audiovisual emotion recognition (ER) in videos has immense potential over unimodal performance. It effectively leverages the inter- and intra-modal dependencies between visual and auditory modalities. This work proposes a novel audio-visual emotion recognition system utilizing a joint multimodal transformer architecture with key-based cross-attention. This framework aims to exploit the complementary nature of audio and visual cues (facial expressions and vocal patterns) in videos, leading to superior performance compared to relying solely on a single modality. The proposed model leverages separate backbones for capturing intra-modal temporal dependencies within each modality (audio and visual). Subsequently, a joint multimodal transformer architecture integrates the individual modality embeddings, enabling the model to effectively capture inter-modal (between audio and visual) and intra-modal (within each modality) relationships. Extensive evaluations on the challenging Affwild2 dataset demonstrate that the proposed model significantly outperforms baseline and state-of-the-art methods in ER tasks.

2023-12-31

arXiv.org (preprint)

Do not trust what you trust: Miscalibration in Semi-supervised Learning

Shambhavi Mishra

Balamurali Murugesan

Ismail Ben Ayed

Jose Dolz

State-of-the-art semi-supervised learning (SSL) approaches rely on highly confident predictions to serve as pseudo-labels that guide the training on unlabeled samples. An inherent drawback of this strategy stems from the quality of the uncertainty estimates, as pseudo-labels are filtered only based on their degree of uncertainty, regardless of the correctness of their predictions. Thus, assessing and enhancing the uncertainty of network predictions is of paramount importance in the pseudo-labeling process. In this work, we empirically demonstrate that SSL methods based on pseudo-labels are significantly miscalibrated, and formally demonstrate the minimization of the min-entropy, a lower bound of the Shannon entropy, as a potential cause for miscalibration. To alleviate this issue, we integrate a simple penalty term, which enforces the logit distances of the predictions on unlabeled samples to remain low, preventing the network predictions from becoming overconfident. Comprehensive experiments on a variety of SSL image classification benchmarks demonstrate that the proposed solution systematically improves the calibration performance of relevant SSL models, while also enhancing their discriminative power, being an appealing addition to tackle SSL tasks.
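The relation stated in the abstract above, that the min-entropy lower-bounds the Shannon entropy, can be checked numerically, and the idea of keeping logit distances low can be sketched as a hinge-style penalty. The penalty form, the function names, and the `margin` parameter below are assumptions made for illustration; the abstract does not specify the exact term used in the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def shannon_entropy(probs):
    """H(p) = -sum_k p_k * ln(p_k), in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def min_entropy(probs):
    """H_inf(p) = -ln(max_k p_k); never larger than the Shannon entropy."""
    return -math.log(max(probs))


def logit_distance_penalty(logits, margin=0.0):
    """Hypothetical penalty: sum of distances between the largest logit and
    every logit, counting only the part of each distance above `margin`."""
    top = max(logits)
    return sum(max(0.0, (top - z) - margin) for z in logits)


logits = [2.0, 1.0, 0.0]
probs = softmax(logits)
gap = shannon_entropy(probs) - min_entropy(probs)  # non-negative
penalty = logit_distance_penalty(logits, margin=1.5)
```

Under this assumed form, logits that spread far apart (and hence yield near one-hot, overconfident predictions) incur a larger penalty, while logits within `margin` of the maximum are not penalized.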

2023-12-31

Trans. Mach. Learn. Res. (published)

DiPS: Discriminative Pseudo-Label Sampling with Self-Supervised Transformers for Weakly Supervised Object Localization

Shakeeb Murtaza

Soufiane Belharbi

Aydin Sarraf

Eric Granger

2023-11-30

Image and Vision Computing (published)