Marco Pedersoli

Eric Granger

2025-08-01

arXiv (published)

MuSACo: Multimodal Subject-Specific Selection and Adaptation for Expression Recognition with Co-Training

Muhammad Osama Zeeshan

Natacha Gillet

Alessandro Lameiras Koerich

Francois Bremond

Eric Granger

2025-08-01

arXiv (published)

Personalized Feature Translation for Expression Recognition: An Efficient Source-Free Domain Adaptation Method

Masoumeh Sharafi

Soufiane Belharbi

Houssem Ben Salem

Ali Etemad

Alessandro Lameiras Koerich

Simon Bacon

Eric Granger

Facial expression recognition (FER) models are employed in many video-based affective computing applications, such as human-computer interac… (see more)tion and healthcare monitoring. However, deep FER models often struggle with subtle expressions and high inter-subject variability, limiting their performance in real-world applications. To improve their performance, source-free domain adaptation (SFDA) methods have been proposed to personalize a pretrained source model using only unlabeled target domain data, thereby avoiding data privacy, storage, and transmission constraints. This paper addresses a challenging scenario where source data is unavailable for adaptation, and only unlabeled target data consisting solely of neutral expressions is available. SFDA methods are not typically designed to adapt using target data from only a single class. Further, using models to generate facial images with non-neutral expressions can be unstable and computationally intensive. In this paper, personalized feature translation (PFT) is proposed for SFDA. Unlike current image translation methods for SFDA, our lightweight method operates in the latent space. We first pre-train the translator on the source domain data to transform the subject-specific style features from one source subject into another. Expression information is preserved by optimizing a combination of expression consistency and style-aware objectives. Then, the translator is adapted on neutral target data, without using source data or image synthesis. By translating in the latent space, PFT avoids the complexity and noise of face expression generation, producing discriminative embeddings optimized for classification. Using PFT eliminates the need for image synthesis, reduces computational overhead (using a lightweight translator), and only adapts part of the model, making the method efficient compared to image-based translation.

2025-08-01

arXiv (published)

WiSE-OD: Benchmarking Robustness in Infrared Object Detection

Heitor Rapela Medeiros

Atif Belal

Masih Aminbeidokhti

Eric Granger

2025-07-01

arXiv (published)

Advancements in Affective and Behavior Analysis: The 8th ABAW Workshop and Competition

Dimitrios Kollias

Panagiotis Tzirakis

Alan Cowen

Stefanos Zafeiriou

Irene Kotsia

Eric Granger

Simon Bacon

Alice Baird

Chris Gagne

Chunchang Shao

Guanyu Hu

Soufiane Belharbi

Muhammad Haseeb Aslam

The 8th Affective & Behavior Analysis in-the-Wild (ABAW) Workshop at CVPR 2025 focuses on advancing the understanding and modeling of human … (see more)affective and behavioral patterns in real-world scenarios. It serves as a platform for interdisciplinary collaboration, showcasing the latest methodologies and applications in affective computing and behavior analysis. A core feature of the workshop is the ABAW Competition, which tackles critical challenges in human affect and behavior recognition essential for developing human-centered AI technologies. The 8th ABAW Competition features six challenges: (1) estimation of two continuous affect dimensions (valence and arousal), (2) recognition of eight mutually exclusive classes (the 7 basic expressions and a category 'other'), (3) detection of twelve action units, (4) recognition of seven mutually exclusive compound expressions, (5) estimation of emotional mimicry intensity across six dimensions, and (6) recognition of presence and absence of ambivalence/hesitancy. These challenges leverage datasets such as Aff-Wild2, C-EXPR-DB, HUME-Vidmimic2, and BAH, providing a comprehensive benchmark for evaluating affective behavior analysis models. Each challenge is assessed using specialized performance metrics, including Concordance Correlation Coefficient, F1-score, and Pearson's correlation. This paper provides an overview of the competition, detailing the datasets, pre-processing methodologies, evaluation criteria, baseline models and top performing teams' in each Challenge, including their obtained performance. Further details on the competition are available at: https://affective-behavior-analysis-inthe-wild.github.io/8th.

2025-06-11

2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) (published)

Rendering-Aware Reinforcement Learning for Vector Graphics Generation

Juan A. Rodriguez

Haotian Zhang

Abhay Puri

Aarash Feizi

Rishav Pramanik

Pascal Wichmann

Arnab Mondal

Mohammad Reza Samsami

Sai Rajeswar

Scalable Vector Graphics (SVG) offer a powerful format for representing visual designs as interpretable code. Recent advances in vision-lang… (see more)uage models (VLMs) have enabled high-quality SVG generation by framing the problem as a code generation task and leveraging large-scale pretraining. VLMs are particularly suitable for this task as they capture both global semantics and fine-grained visual patterns, while transferring knowledge across vision, natural language, and code domains. However, existing VLM approaches often struggle to produce faithful and efficient SVGs because they never observe the rendered images during training. Although differentiable rendering for autoregressive SVG code generation remains unavailable, rendered outputs can still be compared to original inputs, enabling evaluative feedback suitable for reinforcement learning (RL). We introduce RLRF(Reinforcement Learning from Rendering Feedback), an RL method that enhances SVG generation in autoregressive VLMs by leveraging feedback from rendered SVG outputs. Given an input image, the model generates SVG roll-outs that are rendered and compared to the original image to compute a reward. This visual fidelity feedback guides the model toward producing more accurate, efficient, and semantically coherent SVGs. RLRF significantly outperforms supervised fine-tuning, addressing common failure modes and enabling precise, high-quality SVG generation with strong structural understanding and generalization.

2025-05-27

ArXiv (preprint)

Disentangled Source-Free Personalization for Facial Expression Recognition with Neutral Target Data

Masoumeh Sharafi

Emma Ollivier

Muhammad Osama Zeeshan

Soufiane Belharbi

Alessandro Lameiras Koerich

Simon Bacon

Eric Granger

2025-05-26

2025 IEEE 19th International Conference on Automatic Face and Gesture Recognition (FG) (published)

BAH Dataset for Ambivalence/Hesitancy Recognition in Videos for Behavioural Change

Manuela Gonz'alez-Gonz'alez

Soufiane Belharbi

Muhammad Osama Zeeshan

Masoumeh Sharafi

Muhammad Haseeb Aslam

Alessandro Lameiras Koerich

Simon Bacon

Eric Granger

Recognizing complex emotions linked to ambivalence and hesitancy (A/H) can play a critical role in the personalization and effectiveness of … (see more)digital behaviour change interventions. These subtle and conflicting emotions are manifested by a discord between multiple modalities, such as facial and vocal expressions, and body language. Although experts can be trained to identify A/H, integrating them into digital interventions is costly and less effective. Automatic learning systems provide a cost-effective alternative that can adapt to individual users, and operate seamlessly within real-time, and resource-limited environments. However, there are currently no datasets available for the design of ML models to recognize A/H. This paper introduces a first Behavioural Ambivalence/Hesitancy (BAH) dataset collected for subject-based multimodal recognition of A/H in videos. It contains videos from 224 participants captured across 9 provinces in Canada, with different age, and ethnicity. Through our web platform, we recruited participants to answer 7 questions, some of which were designed to elicit A/H while recording themselves via webcam with microphone. BAH amounts to 1,118 videos for a total duration of 8.26 hours with 1.5 hours of A/H. Our behavioural team annotated timestamp segments to indicate where A/H occurs, and provide frame- and video-level annotations with the A/H cues. Video transcripts and their timestamps are also included, along with cropped and aligned faces in each frame, and a variety of participants meta-data. We include results baselines for BAH at frame- and video-level recognition in multi-modal setups, in addition to zero-shot prediction, and for personalization using unsupervised domain adaptation. The limited performance of baseline models highlights the challenges of recognizing A/H in real-world videos. The data, code, and pretrained weights are available.

2025-05-25

ArXiv (preprint)

Source-Free Domain Adaptation for YOLO Object Detection

Simon Varailhon

Masih Aminbeidokhti

Eric Granger

Source-free domain adaptation (SFDA) is a challenging problem in object detection, where a pre-trained source model is adapted to a new targ… (see more)et domain without using any source domain data for privacy and efficiency reasons. Most state-of-the-art SFDA methods for object detection have been proposed for Faster-RCNN, a detector that is known to have high computational complexity. This paper focuses on domain adaptation techniques for real-world vision systems, particularly for the YOLO family of single-shot detectors known for their fast baselines and practical applications. Our proposed SFDA method - Source-Free YOLO (SF-YOLO) - relies on a teacher-student framework in which the student receives images with a learned, target domain-specific augmentation, allowing the model to be trained with only unlabeled target data and without requiring feature alignment. A challenge with self-training using a mean-teacher architecture in the absence of labels is the rapid decline of accuracy due to noisy or drifting pseudo-labels. To address this issue, a teacher-to-student communication mechanism is introduced to help stabilize the training and reduce the reliance on annotated target data for model selection. Despite its simplicity, our approach is competitive with state-of-the-art detectors on several challenging benchmark datasets, even sometimes outperforming methods that use source data for adaptation.

2025-05-12

Lecture Notes in Computer Science (published)

BAH Dataset for Ambivalence/Hesitancy Recognition in Videos for Behavioural Change

Manuela Gonz'alez-Gonz'alez

Soufiane Belharbi

Muhammad Osama Zeeshan

Masoumeh Sharafi

Muhammad Haseeb Aslam

Alessandro Lameiras Koerich

Simon Bacon

Eric Granger

Recognizing complex emotions linked to ambivalence and hesitancy (A/H) can play a critical role in the personalization and effectiveness of … (see more)digital behaviour change interventions. These subtle and conflicting emotions are manifested by a discord between multiple modalities, such as facial and vocal expressions, and body language. Although experts can be trained to identify A/H, integrating them into digital interventions is costly and less effective. Automatic learning systems provide a cost-effective alternative that can adapt to individual users, and operate seamlessly within real-time, and resource-limited environments. However, there are currently no datasets available for the design of ML models to recognize A/H. This paper introduces a first Behavioural Ambivalence/Hesitancy (BAH) dataset collected for subject-based multimodal recognition of A/H in videos. It contains videos from 224 participants captured across 9 provinces in Canada, with different age, and ethnicity. Through our web platform, we recruited participants to answer 7 questions, some of which were designed to elicit A/H while recording themselves via webcam with microphone. BAH amounts to 1,118 videos for a total duration of 8.26 hours with 1.5 hours of A/H. Our behavioural team annotated timestamp segments to indicate where A/H occurs, and provide frame- and video-level annotations with the A/H cues. Video transcripts and their timestamps are also included, along with cropped and aligned faces in each frame, and a variety of participants meta-data. We include results baselines for BAH at frame- and video-level recognition in multi-modal setups, in addition to zero-shot prediction, and for personalization using unsupervised domain adaptation. The limited performance of baseline models highlights the challenges of recognizing A/H in real-world videos. The data, code, and pretrained weights are available.

2025-05-01

arXiv (published)