Marco Pedersoli

Alessandro Lameiras Koerich

Ali Etemad

Eric Granger

2025-04-19

arXiv (preprint)

StarVector: Generating Scalable Vector Graphics Code from Images and Text

Juan A. Rodriguez

Abhay Puri

Shubham Agarwal

Issam Hadj Laradji

Pau Rodriguez

Sai Rajeswar

David Vazquez

Chris Pal

2025-04-11

Proceedings of the AAAI Conference on Artificial Intelligence (published)

Progressive Multi-Source Domain Adaptation for Personalized Facial Expression Recognition

Muhammad Osama Zeeshan

Alessandro Lameiras Koerich

Eric Granger

2025-04-05

arXiv (preprint)

Learning from Stochastic Teacher Representations Using Student-Guided Knowledge Distillation

Muhammad Haseeb Aslam

Clara Martinez

Alessandro Lameiras Koerich

Ali Etemad

Eric Granger

Advances in self-distillation have shown that when knowledge is distilled from a teacher to a student using the same deep learning (DL) architecture, the student performance can surpass the teacher, particularly when the network is overparameterized and the teacher is trained with early stopping. Alternatively, ensemble learning also improves performance, although training, storing, and deploying multiple models becomes impractical as the number of models grows. Even distilling an ensemble to a single student model or weight averaging methods first requires training of multiple teacher models and does not fully leverage the inherent stochasticity for generating and distilling diversity in DL models. These constraints are particularly prohibitive in resource-constrained or latency-sensitive applications such as wearable devices. This paper proposes to train only one model and generate multiple diverse teacher representations using distillation-time dropout. However, generating these representations stochastically leads to noisy representations that are misaligned with the learned task. To overcome this problem, a novel stochastic self-distillation (SSD) training strategy is introduced for filtering and weighting teacher representations to distill from task-relevant representations only, using student-guided knowledge distillation (SGKD). The student representation at each distillation step is used as an authority to guide the distillation process. Experimental results on real-world affective computing, wearable/biosignal datasets from the UCR Archive, the HAR dataset, and image classification datasets show that the proposed SSD method can outperform state-of-the-art methods without increasing the model size at both training and testing time, and incurs negligible computational complexity compared to state-of-the-art ensemble learning and weight averaging methods.

2025-04-01

arXiv (published)

Progressive Multi-Source Domain Adaptation for Personalized Facial Expression Recognition

Muhammad Osama Zeeshan

Alessandro Lameiras Koerich

Eric Granger

2025-04-01

arXiv (published)

Disentangled Source-Free Personalization for Facial Expression Recognition with Neutral Target Data

Masoumeh Sharafi

Emma Ollivier

Muhammad Osama Zeeshan

Soufiane Belharbi

Alessandro Lameiras Koerich

Simon Bacon

Eric Granger

2025-03-26

arXiv (preprint)

Attention-based Class-Conditioned Alignment for Multi-Source Domain Adaptation of Object Detectors

Atif Belal

Akhil Meethal

Francisco Perdigon Romero

Eric Granger

Domain adaptation methods for object detection (OD) strive to mitigate the impact of distribution shifts by promoting feature alignment across source and target domains. Multi-source domain adaptation (MSDA) allows leveraging multiple annotated source datasets and unlabeled target data to improve the accuracy and robustness of the detection model. Most state-of-the-art MSDA methods for OD perform feature alignment in a class-agnostic manner. This is challenging since the objects have unique modality information due to variations in object appearance across domains. A recent prototype-based approach proposed a class-wise alignment, yet it suffers from error accumulation caused by noisy pseudo-labels that can negatively affect adaptation with imbalanced data. To overcome these limitations, we propose an attention-based class-conditioned alignment method for MSDA, designed to align instances of each object category across domains. In particular, an attention module combined with an adversarial domain classifier allows learning domain-invariant and class-specific instance representations. Experimental results on multiple benchmarking MSDA datasets indicate that our method outperforms state-of-the-art methods and exhibits robustness to class imbalance, achieved through a conceptually simple class-conditioning strategy. Our code is available at: https://github.com/imatif17/ACIA.

2025-03-06

2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) (published)

Mixed Patch Visible-Infrared Modality Agnostic Object Detection

Heitor Rapela Medeiros

David Latortue

Eric Granger

In real-world scenarios, using multiple modalities like visible (RGB) and infrared (IR) can greatly improve the performance of a predictive task such as object detection (OD). Multimodal learning is a common way to leverage these modalities, where multiple modality-specific encoders and a fusion module are used to improve performance. In this paper, we tackle a different way to employ RGB and IR modalities, where only one modality or the other is observed by a single shared vision encoder. This realistic setting requires a lower memory footprint and is more suitable for applications such as autonomous driving and surveillance, which commonly rely on RGB and IR data. However, when learning a single encoder on multiple modalities, one modality can dominate the other, producing uneven recognition results. This work investigates how to efficiently leverage RGB and IR modalities to train a common transformer-based OD vision encoder while countering the effects of modality imbalance. For this, we introduce a novel training technique to Mix Patches (MiPa) from the two modalities, in conjunction with a patch-wise modality agnostic module, for learning a common representation of both modalities. Our experiments show that MiPa can learn a representation to reach competitive results on traditional RGB/IR benchmarks while only requiring a single modality during inference. Our code is available at: https://github.com/heitorrapela/MiPa.

2025-03-06

2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) (published)
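
The patch mixing idea in the MiPa abstract can be illustrated with a minimal sketch. It assumes that the RGB and IR images are already split into aligned patch sequences of equal length and that each patch position is drawn independently from one modality. The function name `mix_patches`, the `ir_ratio` parameter, and the per-patch Bernoulli sampling are hypothetical choices made for illustration, not taken from the paper or its repository. The patch-wise modality agnostic module mentioned in the abstract is not covered here.

```python
import random


def mix_patches(rgb_patches, ir_patches, ir_ratio=0.5, rng=None):
    """Build one mixed patch sequence from two aligned patch sequences.

    rgb_patches, ir_patches: lists of patches (any objects), same length.
    ir_ratio: assumed probability of taking the IR patch at each position.
    Returns the mixed sequence and a 0/1 mask (1 means the IR patch was used).
    """
    if len(rgb_patches) != len(ir_patches):
        raise ValueError("both modalities must provide the same number of patches")
    rng = rng or random.Random()
    mixed, mask = [], []
    for rgb, ir in zip(rgb_patches, ir_patches):
        take_ir = rng.random() < ir_ratio  # independent draw per patch position
        mixed.append(ir if take_ir else rgb)
        mask.append(1 if take_ir else 0)
    return mixed, mask


# Toy usage: strings stand in for patch tensors.
rgb = ["rgb0", "rgb1", "rgb2", "rgb3"]
ir = ["ir0", "ir1", "ir2", "ir3"]
mixed, mask = mix_patches(rgb, ir, ir_ratio=0.5, rng=random.Random(0))
```

In this sketch the mask records which modality each position came from, which is the kind of signal a modality classifier could be trained against. How MiPa actually schedules the mixing ratio is not stated in the abstract.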

A Realistic Protocol for Evaluation of Weakly Supervised Object Localization

Shakeeb Murtaza

Soufiane Belharbi

Eric Granger

Weakly Supervised Object Localization (WSOL) allows training deep learning models for classification and localization (LOC) using only global class-level labels. The absence of bounding box (bbox) supervision during training raises challenges in the literature for hyper-parameter tuning, model selection, and evaluation. WSOL methods rely on a validation set with bbox annotations for model selection, and a test set with bbox annotations for threshold estimation for producing bboxes from localization maps. This approach, however, is not aligned with the WSOL setting as these annotations are typically unavailable in real-world scenarios. Our initial empirical analysis shows a significant decline in LOC performance when model selection and threshold estimation rely solely on class labels and the image itself, respectively, compared to using manual bbox annotations. This highlights the importance of incorporating bbox labels for optimal model performance. In this paper, a new WSOL evaluation protocol is proposed that provides LOC information without the need for manual bbox annotations. In particular, we generated noisy pseudo-boxes from a pretrained off-the-shelf region proposal method such as Selective Search, CLIP, and RPN for model selection. These bboxes are also employed to estimate the threshold from LOC maps, circumventing the need for test-set bbox annotations. Our experiments with several WSOL methods on ILSVRC and CUB datasets show that using the proposed pseudo-bboxes for validation facilitates the model selection and threshold estimation, with LOC performance comparable to those selected using GT bboxes on the validation set and threshold estimation on the test set. It also outperforms models selected using class-level labels, and then dynamically thresholded based solely on LOC maps.

2025-03-06

2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) (published)

AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Understanding

Juan A. Rodriguez

Chao Wang

Akshay Kalkunte Suresh

Abhay Puri

Xiangru Jian

Pierre-Andre Noel

Sathwik Tejaswi Madhusudhan

Enamul Hoque

Issam Hadj Laradji

David Vazquez

Perouz Taslakian

Spandana Gella

Sai Rajeswar

Aligning visual features with language embeddings is a key challenge in vision-language models (VLMs). The performance of such models hinges on having a good connector that maps visual features generated by a vision encoder to a shared embedding space with the LLM while preserving semantic similarity. Existing connectors, such as multilayer perceptrons (MLPs), often produce out-of-distribution or noisy inputs, leading to misalignment between the modalities. In this work, we propose a novel vision-text alignment method, AlignVLM, that maps visual features to a weighted average of LLM text embeddings. Our approach leverages the linguistic priors encoded by the LLM to ensure that visual features are mapped to regions of the space that the LLM can effectively interpret. AlignVLM is particularly effective for document understanding tasks, where scanned document images must be accurately mapped to their textual content. Our extensive experiments show that AlignVLM achieves state-of-the-art performance compared to prior alignment methods. We provide further analysis demonstrating improved vision-text feature alignment and robustness to noise.

2025-03-05

ICLR.cc/2025/Workshop/Re-Align (poster)
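
The AlignVLM abstract describes mapping each visual feature to a weighted average of LLM text embeddings. A minimal sketch of that idea follows, assuming the weights come from a softmax over a linear projection of the visual feature onto the vocabulary. The abstract does not say how the weights are computed, so the softmax, the `projection` matrix, and the function name `align_to_text_space` are assumptions made for illustration only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def align_to_text_space(visual_feature, projection, text_embeddings):
    """Map one visual feature to a convex combination of text embeddings.

    visual_feature: list of floats from a vision encoder.
    projection: one row per vocabulary entry (hypothetical learned weights).
    text_embeddings: one embedding per vocabulary entry.
    """
    # Score each vocabulary entry against the visual feature (assumed linear layer).
    logits = [sum(w * v for w, v in zip(row, visual_feature)) for row in projection]
    weights = softmax(logits)
    dim = len(text_embeddings[0])
    # The weighted average stays inside the convex hull of the text embeddings.
    return [
        sum(weights[t] * text_embeddings[t][d] for t in range(len(text_embeddings)))
        for d in range(dim)
    ]


# Toy usage: a vocabulary of three entries with 2-dimensional embeddings.
text_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
projection = [[0.5, -0.2, 0.1], [0.1, 0.3, -0.4], [0.0, 0.2, 0.2]]
aligned = align_to_text_space([0.2, 1.0, -0.5], projection, text_embeddings)
```

Because the weights are non-negative and sum to one, the output cannot leave the region spanned by the text embeddings, which is one way to read the abstract's claim that visual features land where the LLM can interpret them.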