Marco Pedersoli

Alessandro L. Koerich

Simon L Bacon

Eric Granger

Using behavioural science, health interventions focus on behaviour change by providing a framework to help patients acquire and maintain hea… (see more)lthy habits that improve medical outcomes. In-person interventions are costly and difficult to scale, especially in resource-limited regions. Digital health interventions offer a cost-effective approach, potentially supporting independent living and self-management. Automating such interventions, especially through machine learning, has gained considerable attention recently. Ambivalence and hesitancy (A/H) play a primary role for individuals to delay, avoid, or abandon health interventions. A/H are subtle and conflicting emotions that place a person in a state between positive and negative evaluations of a behaviour, or between acceptance and refusal to engage in it. They manifest as affective inconsistency across modalities or within a modality, such as language, facial, vocal expressions, and body language. While experts can be trained to recognize A/H, integrating them into digital health interventions is costly and less effective. Automatic A/H recognition is therefore critical for the personalization and cost-effectiveness of digital health interventions. Here, we explore the application of deep learning models for A/H recognition in videos, a multi-modal task by nature. In particular, this paper covers three learning setups: supervised learning, unsupervised domain adaptation for personalization, and zero-shot inference via large language models (LLMs). Our experiments are conducted on the unique and recently published BAH video dataset for A/H recognition. Our results show limited performance, suggesting that more adapted multi-modal models are required for accurate A/H recognition. Better methods for modeling spatio-temporal and multimodal fusion are necessary to leverage conflicts within/across modalities.

2026-04-12

arXiv (preprint)

CLIP-AUTT: Test-Time Personalization with Action Unit Prompting for Fine-Grained Video Emotion Recognition

Muhammad Zeeshan

Masoumeh Sharafi

Benoit Savary

Alessandro L. Koerich

Eric Granger

Personalization in emotion recognition (ER) is essential for an accurate interpretation of subtle and subject-specific expressive patterns. … (see more)Recent advances in vision-language models (VLMs) such as CLIP demonstrate strong potential for leveraging joint image-text representations in ER. However, CLIP-based methods either depend on CLIP's contrastive pretraining or on LLMs to generate descriptive text prompts, which are noisy, computationally expensive, and fail to capture fine-grained expressions, leading to degraded performance. In this work, we leverage Action Units (AUs) as structured textual prompts within CLIP to model fine-grained facial expressions. AUs encode the subtle muscle activations underlying expressions, providing localized and interpretable semantic cues for more robust ER. We introduce CLIP-AU, a lightweight AU-guided temporal learning method that integrates interpretable AU semantics into CLIP. It learns generic, subject-agnostic representations by aligning AU prompts with facial dynamics, enabling fine-grained ER without CLIP fine-tuning or LLM-generated text supervision. Although CLIP-AU models fine-grained AU semantics, it does not adapt to subject-specific variability in subtle expressions. To address this limitation, we propose CLIP-AUTT, a video-based test-time personalization method that dynamically adapts AU prompts to videos from unseen subjects. By combining entropy-guided temporal window selection with prompt tuning, CLIP-AUTT enables subject-specific adaptation while preserving temporal consistency. Our extensive experiments on three challenging video-based subtle ER datasets, BioVid, StressID, and BAH, indicate that CLIP-AU and CLIP-AUTT outperform state-of-the-art CLIP-based FER and TTA methods, achieving robust and personalized subtle ER. Our code is publicly available at: https://github.com/osamazeeshan/CLIP-AUTT.

2026-03-29

arXiv (preprint)

Test-Time Adaptation via Cache Personalization for Facial Expression Recognition in Videos

Masoumeh Sharafi

Muhammad Zeeshan

Soufiane Belharbi

Alessandro L. Koerich

Eric Granger

Facial expression recognition (FER) in videos requires model personalization to capture the considerable variations across subjects. Vision-… (see more)language models (VLMs) offer strong transfer to downstream tasks through image-text alignment, but their performance can still degrade under inter-subject distribution shifts. Personalizing models using test-time adaptation (TTA) methods can mitigate this challenge. However, most state-of-the-art TTA methods rely on unsupervised parameter optimization, introducing computational overhead that is impractical in many real-world applications. This paper introduces TTA through Cache Personalization (TTA-CaP), a cache-based TTA method that enables cost-effective (gradient-free) personalization of VLMs for video FER. Prior cache-based TTA methods rely solely on dynamic memories that store test samples, which can accumulate errors and drift due to noisy pseudo-labels. TTA-CaP leverages three coordinated caches: a personalized source cache that stores source-domain prototypes, a positive target cache that accumulates reliable subject-specific samples, and a negative target cache that stores low-confidence cases as negative samples to reduce the impact of noisy pseudo-labels. Cache updates and replacement are controlled by a tri-gate mechanism based on temporal stability, confidence, and consistency with the personalized cache. Finally, TTA-CaP refines predictions through fusion of embeddings, yielding refined representations that support temporally stable video-level predictions. Our experiments on three challenging video FER datasets, BioVid, StressID, and BAH, indicate that TTA-CaP can outperform state-of-the-art TTA methods under subject-specific and environmental shifts, while maintaining low computational and memory overhead for real-world deployment.

2026-03-21

arXiv (preprint)

Semantic Anchor Transport: Robust Test-Time Adaptation for Vision-Language Models

Shambhavi Mishra

Julio Silva-Rodríguez

Ismail Ben Ayed

Jose Dolz

Large pre-trained vision-language models (VLMs) like CLIP exhibit strong zero-shot performance but struggle under distributional shifts. We … (see more)propose Semantic Anchor Transport (SAT), a method that generates pseudo-labels for test samples by aligning visual embeddings with reliable text-based semantic anchors using Optimal Transport for batch-wise label assignment. These pseudo-labels enable efficient test-time adaptation through principled cross-modal alignment. We further incorporate multi-template distillation to leverage diverse textual clues, replicating multi-view contrastive learning without added computational cost. Extensive experiments demonstrate consistent performance gains over state-of-the-art methods across multiple benchmarks while maintaining computational efficiency.

2026-02-28

TTU_Main_Track @ International Conference on Learning Representations (published)

openreview.net

VectorGym: A Multitask Benchmark for SVG Code Generation, Sketching, and Editing

Juan Rodriguez

Haotian Zhang

Abhay Puri

Tianyang Zhang

Rishav Pramanik

Meng Lin

Xiaoqing Xie

Marco Terral

Aly Shariff

Sai Rajeswar

Christopher Pal

We introduce VectorGym, a comprehensive benchmark suite for Scalable Vector Graphics (SVG) that spans generation from text and sketches, com… (see more)plex editing, and visual understanding. VectorGym addresses the lack of realistic, challenging benchmarks aligned with professional design workflows. Our benchmark comprises four tasks with expert human-authored annotations: the novel Sketch2SVG task (VG-Sketch); a new SVG editing dataset (VG-Edit) featuring complex, multi-step edits with higher-order primitives; Text2SVG generation (VG-Text); and SVG captioning (VG-Cap). Unlike prior benchmarks that rely on synthetic edits, VectorGym provides gold-standard human annotations that require semantic understanding and design intent. We also propose a multi-task reinforcement learning approach that jointly optimizes across all four tasks using rendering-based rewards. Our method, built on GRPO with curriculum learning, trains a Qwen3-VL 8B model that achieves state-of-the-art performance among open-source models, surpassing much larger models including Qwen3-VL 235B and matching GPT-4o. We also introduce a VLM-as-a-Judge metric for SVG generation, validated through human correlation studies. Our evaluation of frontier VLMs reveals significant performance gaps, positioning VectorGym as a rigorous framework for advancing visual code generation. VectorGym is publicly available on huggingface.co/datasets/ServiceNow/VectorGym.

2026-02-21

arXiv (preprint)

Continual Test-Time Adaptation: A Comprehensive Survey

Sarthak Kumar Maharana

Shambhavi Mishra

Yunbei Zhang

shuaicheng niu

Taki Hasan Rafi

Jihun Hamm

Jose Dolz

Guo, Yunhui

Deep neural nets achieve remarkable performance when training and test data share the same distribution, but this assumption frequently brea… (see more)ks in real-world deployment, where data undergoes continual distributional shifts. Continual Test-Time Adaptation (CTTA) addresses this challenge by adapting pretrained models to non-stationary target distributions on-the-fly, without access to source data or labeled targets, while mitigating two critical failure modes: catastrophic forgetting of source knowledge and error accumulation from noisy pseudo-labels over extended time horizons. In this comprehensive survey, we formally define the CTTA problem, analyze the diverse continual domain shift patterns that characterize different evaluation protocols, and propose a hierarchical taxonomy that categorizes existing methods into three families: optimization-based strategies (entropy minimization, pseudo-labeling, parameter restoration), parameter-efficient methods (normalization layer adaptation, adaptive parameter selection), and architecture-based approaches (teacher-student frameworks, adapters, visual prompting, masked modeling). We systematically review representative methods within each category and present comparative benchmarks and experimental results across standard evaluation settings. Finally, we discuss limitations of current approaches and highlight emerging research directions, including adaptation of foundation models and black-box systems, providing a roadmap for future research in robust continual test-time adaptation. We encourage visiting our repository at https://github.com/sarthaxxxxx/Awesome-Continual-Test-Time-Adaptation.

2026-02-15

Open MIND (preprint)

Leveraging Diversity for Privileged Multi-Teacher Knowledge Distillation for Facial Expression Recognition

Muhammad Haseeb Aslam

Alessandro L. Koerich

Eric Granger

2025-12-31

SSRN Electronic Journal (accepted)

Iterative Monte Carlo Tree Search for Neural Architecture Search

Mehraveh Javan

Matthew Toews

2025-11-10

Proceedings of the Fourth International Conference on Automated Machine Learning (published)

proceedings.mlr.press

LT-Soups: Bridging Head and Tail Classes via Subsampled Model Soups

Masih Aminbeidokhti

Subhankar Roy

Eric Granger

Elisa Ricci

Real-world datasets typically exhibit long-tailed (LT) distributions, where a few head classes dominate and many tail classes are severely u… (see more)nderrepresented. While recent work shows that parameter-efficient fine-tuning (PEFT) methods like LoRA and AdaptFormer preserve tail-class performance on foundation models such as CLIP, we find that they do so at the cost of head-class accuracy. We identify the head-tail ratio, the proportion of head to tail classes, as a crucial but overlooked factor influencing this trade-off. Through controlled experiments on CIFAR100 with varying imbalance ratio (

2025-11-10

ArXiv (preprint)

High-Rate Mixout: Revisiting Mixout for Robust Domain Generalization

Masih Aminbeidokhti

Heitor Rapela Medeiros

Srikanth Muralidharan

Eric Granger

2025-10-07

ArXiv (preprint)

Revisiting Mixout: An Overlooked Path to Robust Finetuning

Masih Aminbeidokhti

Heitor Rapela Medeiros

Eric Granger

Finetuning vision foundation models often improves in-domain accuracy but comes at the cost of robustness under distribution shift. We revis… (see more)it Mixout, a stochastic regularizer that intermittently replaces finetuned weights with their pretrained reference, through the lens of a single-run, weight-sharing implicit ensemble. This perspective reveals three key levers that govern robustness: the \emph{masking anchor}, \emph{resampling frequency}, and \emph{mask sparsity}. Guided by this analysis, we introduce GMixout, which (i) replaces the fixed anchor with an exponential moving-average snapshot that adapts during training, and (ii) regulates masking period via an explicit resampling-frequency hyperparameter. Our sparse-kernel implementation updates only a small fraction of parameters with no inference-time overhead, enabling training on consumer-grade GPUs. Experiments on benchmarks covering covariate shift, corruption, and class imbalance, ImageNet / ImageNet-LT, DomainNet, iWildCam, and CIFAR100-C, GMixout consistently improves in-domain accuracy beyond zero-shot performance while surpassing both Model Soups and strong parameter-efficient finetuning baselines under distribution shift.

2025-10-07

ArXiv (preprint)

VLOD-TTA: Test-Time Adaptation of Vision-Language Object Detectors

Atif Belal

Heitor Rapela Medeiros

Eric Granger

2025-09-30

ArXiv (preprint)