Marco Pedersoli

Affiliate Member
Associate Professor, École de technologie supérieure
Research Topics
Building Energy Management Systems
Computer Vision
Deep Learning
Generalization
Generative Models
Multimodal Learning
Representation Learning
Robustness
Satellite Imagery
Vision and Language
Weak Supervision

Biography

I am an Associate Professor at ÉTS Montreal, a member of LIVIA (le Laboratoire d'Imagerie, Vision et Intelligence Artificielle), and part of the International Laboratory of Learning Systems (ILLS). I am also a member of ELLIS, the European network of excellence in AI. Since 2021, I have co-held the Distech Industrial Research Chair on Embedded Neural Networks for Connected Building Control.

My research centers on Deep Learning methods and algorithms, with a focus on visual recognition, and the automatic interpretation and understanding of images and videos. A key objective of my work is to advance machine intelligence by minimizing two critical factors: computational load and the need for human supervision. These reductions are essential for scalable AI, enabling more efficient, adaptive, and embedded systems. In my recent work, I have contributed to developing neural networks for smart buildings, integrating AI-driven solutions to enhance energy efficiency and comfort in intelligent environments.

Publications

A Realistic Protocol for Evaluation of Weakly Supervised Object Localization
Shakeeb Murtaza
Soufiane Belharbi
Eric Granger
Weakly Supervised Object Localization (WSOL) allows training deep learning models for classification and localization (LOC) using only global class-level labels. The absence of bounding box (bbox) supervision during training raises challenges in the literature for hyper-parameter tuning, model selection, and evaluation. WSOL methods rely on a validation set with bbox annotations for model selection, and a test set with bbox annotations for threshold estimation for producing bboxes from localization maps. This approach, however, is not aligned with the WSOL setting as these annotations are typically unavailable in real-world scenarios. Our initial empirical analysis shows a significant decline in LOC performance when model selection and threshold estimation rely solely on class labels and the image itself, respectively, compared to using manual bbox annotations. This highlights the importance of incorporating bbox labels for optimal model performance. In this paper, a new WSOL evaluation protocol is proposed that provides LOC information without the need for manual bbox annotations. In particular, we generated noisy pseudo-boxes from a pretrained off-the-shelf region proposal method such as Selective Search, CLIP, and RPN for model selection. These bboxes are also employed to estimate the threshold from LOC maps, circumventing the need for test-set bbox annotations. Our experiments with several WSOL methods on ILSVRC and CUB datasets show that using the proposed pseudo-bboxes for validation facilitates the model selection and threshold estimation, with LOC performance comparable to those selected using GT bboxes on the validation set and threshold estimation on the test set. It also outperforms models selected using class-level labels, and then dynamically thresholded based solely on LOC maps.
IntentGPT: Few-shot Intent Discovery with Large Language Models
Juan A. Rodriguez
Nicholas Botzer
David Vazquez
Issam Hadj Laradji
Bag of Tricks for Fully Test-Time Adaptation
Saypraseuth Mounsaveng
Florent Chiaroni
Malik Boudiaf
Ismail Ben Ayed
Fully Test-Time Adaptation (TTA), which aims at adapting models to data drifts, has recently attracted wide interest. Numerous tricks and techniques have been proposed to ensure robust learning on arbitrary streams of unlabeled data. However, assessing the true impact of each individual technique and obtaining a fair comparison still constitutes a significant challenge. To help consolidate the community’s knowledge, we present a categorization of selected orthogonal TTA techniques, including small batch normalization, stream rebalancing, reliable sample selection, and network confidence calibration. We meticulously dissect the effect of each approach on different scenarios of interest. Through our analysis, we shed light on trade-offs induced by those techniques between accuracy, the computational power required, and model complexity. We also uncover the synergy that arises when combining techniques and are able to establish new state-of-the-art results.
Domain Generalization by Rejecting Extreme Augmentations
Masih Aminbeidokhti
Fidel A. Guerrero Peña
Heitor Rapela Medeiros
Thomas Dubail
Eric Granger
HalluciDet: Hallucinating RGB Modality for Person Detection Through Privileged Information
Heitor Rapela Medeiros
Fidel A. Guerrero Peña
Masih Aminbeidokhti
Thomas Dubail
Eric Granger
A powerful way to adapt a visual recognition model to a new domain is through image translation. However, common image translation approaches only focus on generating data from the same distribution as the target domain. Given a cross-modal application, such as pedestrian detection from aerial images, with a considerable shift in data distribution between infrared (IR) and visible (RGB) images, a translation focused on generation might lead to poor performance as the loss focuses on irrelevant details for the task. In this paper, we propose HalluciDet, an IR-RGB image translation model for object detection. Instead of focusing on reconstructing the original image on the IR modality, it seeks to reduce the detection loss of an RGB detector, and therefore avoids the need to access RGB data. This model produces a new image representation that enhances objects of interest in the scene and greatly improves detection performance. We empirically compare our approach against state-of-the-art methods for image translation and for fine-tuning on IR, and show that our HalluciDet improves detection accuracy in most cases by exploiting the privileged information encoded in a pre-trained RGB detector. Code: https://github.com/heitorrapela/HalluciDet.
Multi-Source Domain Adaptation for Object Detection with Prototype-based Mean Teacher
Atif Belal
Akhil Meethal
Francisco Perdigon Romero
Eric Granger
Attention-based Class-Conditioned Alignment for Multi-Source Domain Adaptive Object Detection
Atif Belal
Akhil Meethal
Francisco Perdigon Romero
Eric Granger
Evaluating Supervision Levels Trade-Offs for Infrared-Based People Counting
David Latortue
Moetez Kdayem
Fidel A. Guerrero Peña
Eric Granger
Object detection models are commonly used for people counting (and localization) in many applications but require a dataset with costly bounding box annotations for training. Given the importance of privacy in people counting, these models rely more and more on infrared images, making the task even harder. In this paper, we explore how weaker levels of supervision affect the performance of deep person counting architectures for image classification and point-level localization. Our experiments indicate that counting people using a convolutional neural network with image-level annotation achieves a level of accuracy that is competitive with YOLO detectors and point-level localization models yet provides a higher frame rate and a similar amount of model parameters. Our code is available at: https://github.com/tortueTortue/IRPeopleCounting.
Joint Multimodal Transformer for Dimensional Emotional Recognition in the Wild
Paul Waligora
Muhammad Osama Zeeshan
Muhammad Haseeb Aslam
Soufiane Belharbi
Alessandro Lameiras Koerich
Simon Bacon
Eric Granger
Audiovisual emotion recognition (ER) in videos has immense potential over unimodal performance. It effectively leverages the inter- and intra-modal dependencies between visual and auditory modalities. This work proposes a novel audio-visual emotion recognition system utilizing a joint multimodal transformer architecture with key-based cross-attention. This framework aims to exploit the complementary nature of audio and visual cues (facial expressions and vocal patterns) in videos, leading to superior performance compared to solely relying on a single modality. The proposed model leverages separate backbones for capturing intra-modal temporal dependencies within each modality (audio and visual). Subsequently, a joint multimodal transformer architecture integrates the individual modality embeddings, enabling the model to effectively capture inter-modal (between audio and visual) and intra-modal (within each modality) relationships. Extensive evaluations on the challenging Affwild2 dataset demonstrate that the proposed model significantly outperforms baseline and state-of-the-art methods in ER tasks.
Do not trust what you trust: Miscalibration in Semi-supervised Learning
Shambhavi Mishra
Balamurali Murugesan
Ismail Ben Ayed
Jose Dolz
State-of-the-art semi-supervised learning (SSL) approaches rely on highly confident predictions to serve as pseudo-labels that guide the training on unlabeled samples. An inherent drawback of this strategy stems from the quality of the uncertainty estimates, as pseudo-labels are filtered only based on their degree of uncertainty, regardless of the correctness of their predictions. Thus, assessing and enhancing the uncertainty of network predictions is of paramount importance in the pseudo-labeling process. In this work, we empirically demonstrate that SSL methods based on pseudo-labels are significantly miscalibrated, and formally demonstrate the minimization of the min-entropy, a lower bound of the Shannon entropy, as a potential cause for miscalibration. To alleviate this issue, we integrate a simple penalty term, which enforces the logit distances of the predictions on unlabeled samples to remain low, preventing the network predictions from becoming overconfident. Comprehensive experiments on a variety of SSL image classification benchmarks demonstrate that the proposed solution systematically improves the calibration performance of relevant SSL models, while also enhancing their discriminative power, being an appealing addition to tackle SSL tasks.