Marco Pedersoli

Affiliate Member
Associate Professor, École de technologie supérieure
Research Topics
Representation Learning
Multimodal Learning
Deep Learning
Generalization
Satellite Imagery
Generative Models
Robustness
Weak Supervision
Building Energy Management Systems
Vision and Language
Computer Vision

Biography

I am an Associate Professor at ÉTS Montréal, a member of LIVIA (the Laboratoire d'imagerie, de vision et d'intelligence artificielle), and a member of the International Laboratory on Learning Systems (ILLS). I am also a member of ELLIS, the European network of excellence in AI. Since 2021, I have been co-holder of the Distech industrial research chair on embedded neural networks for connected building control.

My research centers on deep learning methods and algorithms, with a focus on visual recognition and on the automatic interpretation and understanding of images and videos. A main objective of my work is to advance artificial intelligence by minimizing two critical factors: computational load and the need for human supervision. These reductions are essential for scalable AI, enabling more efficient, adaptive, and embedded systems. In recent work, I have contributed to the development of neural networks for smart buildings, integrating AI-based solutions to improve energy efficiency and comfort in intelligent environments.

Current Students

Research Master's - École de technologie supérieure
Principal supervisor:

Publications

Iterative Monte Carlo Tree Search for Neural Architecture Search
Mehraveh Javan
Matthew Toews
LT-Soups: Bridging Head and Tail Classes via Subsampled Model Soups
Masih Aminbeidokhti
Subhankar Roy
Eric Granger
Elisa Ricci
Real-world datasets typically exhibit long-tailed (LT) distributions, where a few head classes dominate and many tail classes are severely underrepresented. While recent work shows that parameter-efficient fine-tuning (PEFT) methods like LoRA and AdaptFormer preserve tail-class performance on foundation models such as CLIP, we find that they do so at the cost of head-class accuracy. We identify the head-tail ratio, the proportion of head to tail classes, as a crucial but overlooked factor influencing this trade-off. Through controlled experiments on CIFAR100 with varying imbalance ratio (
High-Rate Mixout: Revisiting Mixout for Robust Domain Generalization
Masih Aminbeidokhti
Heitor Rapela Medeiros
Srikanth Muralidharan
Eric Granger
Ensembling fine-tuned models initialized from powerful pre-trained weights is a common strategy to improve robustness under distribution shifts, but it comes with substantial computational costs due to the need to train and store multiple models. Dropout offers a lightweight alternative by simulating ensembles through random neuron deactivation; however, when applied to pre-trained models, it tends to over-regularize and disrupt critical representations necessary for generalization. In this work, we investigate Mixout, a stochastic regularization technique that provides an alternative to Dropout for domain generalization. Rather than deactivating neurons, Mixout mitigates overfitting by probabilistically swapping a subset of fine-tuned weights with their pre-trained counterparts during training, thereby maintaining a balance between adaptation and retention of prior knowledge. Our study reveals that achieving strong performance with Mixout on domain generalization benchmarks requires a notably high masking probability of 0.9 for ViTs and 0.8 for ResNets. While this may seem like a simple adjustment, it yields two key advantages for domain generalization: (1) higher masking rates more strongly penalize deviations from the pre-trained parameters, promoting better generalization to unseen domains; and (2) high-rate masking substantially reduces computational overhead, cutting gradient computation by up to 45% and gradient memory usage by up to 90%. Experiments across five domain generalization benchmarks, PACS, VLCS, OfficeHome, TerraIncognita, and DomainNet, using ResNet and ViT architectures, show that our approach, High-rate Mixout, achieves out-of-domain accuracy comparable to ensemble-based methods while significantly reducing training costs.
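The weight-swapping step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `mixout`, the flat list representation of weights, and the rescaling term (borrowed from the commonly cited Mixout formulation so that the expected result equals the fine-tuned weight) are assumptions made for illustration.

```python
import random


def mixout(finetuned, pretrained, p, rng):
    """Swap each fine-tuned weight for its pre-trained value with probability p.

    Assumption: the result is rescaled as (mixed - p * w0) / (1 - p) so that
    its expectation over the random mask equals the fine-tuned weight.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    mixed = []
    for w, w0 in zip(finetuned, pretrained):
        swapped = rng.random() < p        # True: use the pre-trained value
        value = w0 if swapped else w
        mixed.append((value - p * w0) / (1.0 - p))
    return mixed


# Hypothetical toy weights; p=0.9 mirrors the high rate reported for ViTs.
rng = random.Random(0)
pretrained = [1.0, -2.0, 0.5, 3.0]
finetuned = [1.5, -1.0, 0.0, 2.0]
mixed = mixout(finetuned, pretrained, p=0.9, rng=rng)
```

With a swapped weight the rescaled value reduces to the pre-trained weight itself, so at a high rate most entries sit exactly at their pre-trained values on any given step.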
Revisiting Mixout: An Overlooked Path to Robust Finetuning
Masih Aminbeidokhti
Heitor Rapela Medeiros
Eric Granger
Finetuning vision foundation models often improves in-domain accuracy but comes at the cost of robustness under distribution shift. We revisit Mixout, a stochastic regularizer that intermittently replaces finetuned weights with their pretrained reference, through the lens of a single-run, weight-sharing implicit ensemble. This perspective reveals three key levers that govern robustness: the masking anchor, resampling frequency, and mask sparsity. Guided by this analysis, we introduce GMixout, which (i) replaces the fixed anchor with an exponential moving-average snapshot that adapts during training, and (ii) regulates masking period via an explicit resampling-frequency hyperparameter. Our sparse-kernel implementation updates only a small fraction of parameters with no inference-time overhead, enabling training on consumer-grade GPUs. In experiments on benchmarks covering covariate shift, corruption, and class imbalance (ImageNet / ImageNet-LT, DomainNet, iWildCam, and CIFAR100-C), GMixout consistently improves in-domain accuracy beyond zero-shot performance while surpassing both Model Soups and strong parameter-efficient finetuning baselines under distribution shift.
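Two of the levers named in this abstract, an exponential moving-average anchor and a fixed mask-resampling period, can be sketched as follows. This is only an illustration under stated assumptions: the helper names `ema_update` and `mask_schedule`, the `decay` and `resample_every` parameters, and the list-based weights are hypothetical and do not come from the paper.

```python
import random


def ema_update(anchor, weights, decay):
    """Move the anchor a small step toward the current weights (EMA)."""
    return [decay * a + (1.0 - decay) * w for a, w in zip(anchor, weights)]


def mask_schedule(num_steps, num_params, p, resample_every, rng):
    """Return one swap mask per step, redrawn every `resample_every` steps."""
    masks = []
    mask = None
    for step in range(num_steps):
        if step % resample_every == 0:
            # True marks a parameter that is replaced by the anchor value.
            mask = [rng.random() < p for _ in range(num_params)]
        masks.append(mask)
    return masks


# Hypothetical settings: the mask is held for 3 steps before being redrawn.
rng = random.Random(0)
masks = mask_schedule(num_steps=6, num_params=4, p=0.9, resample_every=3, rng=rng)
anchor = ema_update(anchor=[0.0, 0.0], weights=[1.0, -1.0], decay=0.9)
```

A longer resampling period means each sampled subnetwork is trained for more consecutive steps before the mask changes, which is one way to read the "resampling frequency" lever.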
VLOD-TTA: Test-Time Adaptation of Vision-Language Object Detectors
Atif Belal
Heitor Rapela Medeiros
Eric Granger
Vision-language object detectors (VLODs) such as YOLO-World and Grounding DINO achieve impressive zero-shot recognition by aligning region proposals with text representations. However, their performance often degrades under domain shift. We introduce VLOD-TTA, a test-time adaptation (TTA) framework for VLODs that leverages dense proposal overlap and image-conditioned prompt scores. First, an IoU-weighted entropy objective is proposed that concentrates adaptation on spatially coherent proposal clusters and reduces confirmation bias from isolated boxes. Second, image-conditioned prompt selection is introduced, which ranks prompts by image-level compatibility and fuses the most informative prompts with the detector logits. Our benchmarking across diverse distribution shifts (including stylized domains, driving scenes, low-light conditions, and common corruptions) shows the effectiveness of our method on two state-of-the-art VLODs, YOLO-World and Grounding DINO, with consistent improvements over the zero-shot and TTA baselines. Code: https://github.com/imatif17/VLOD-TTA
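One way to picture an IoU-weighted entropy objective is sketched below. The abstract does not give the exact weighting, so the choice here (each proposal's entropy weighted by its mean IoU with the other proposals) is an assumption, as are the function names and the `(x1, y1, x2, y2)` box format; the linked repository is the reference for the actual method.

```python
import math


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def entropy(probs):
    """Shannon entropy (natural log) of a class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def iou_weighted_entropy(boxes, probs):
    """Average proposal entropy, weighted by mean IoU with other proposals.

    Assumption: isolated boxes (no overlap with any other proposal) get
    weight 0 and therefore do not contribute to the objective.
    """
    n = len(boxes)
    weights = []
    for i in range(n):
        others = [iou(boxes[i], boxes[j]) for j in range(n) if j != i]
        weights.append(sum(others) / len(others) if others else 0.0)
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(w * entropy(p) for w, p in zip(weights, probs)) / total


# Two overlapping proposals and one isolated box (hypothetical values).
boxes = [(0, 0, 2, 2), (1, 0, 3, 2), (10, 10, 12, 12)]
probs = [[0.5, 0.5], [0.5, 0.5], [0.9, 0.1]]
loss = iou_weighted_entropy(boxes, probs)
```

In this toy case only the two overlapping proposals carry weight, which matches the stated goal of concentrating adaptation on spatially coherent clusters rather than isolated boxes.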
AInstein: Can AI Rediscover Scientific Concepts from First Principles?
Large language models have demonstrated remarkable capabilities across diverse tasks, yet a fundamental question remains: can these models genuinely rediscover complex scientific insights, or do they merely recite memorized information? We present AInstein, a novel framework for evaluating whether language models can derive established scientific concepts from first principles when stripped of domain-specific terminology. Rather than testing the recall of scientific facts, we reformulate landmark discoveries as conceptual puzzles, challenging models to reconstruct the underlying technical solutions independently.
Rendering-Aware Reinforcement Learning for Vector Graphics Generation
Juan A. Rodriguez
Haotian Zhang
Rishav Pramanik
Pascal Wichmann
Arnab Mondal
Mohammad Reza Samsami
Sai Rajeswar
Scalable Vector Graphics (SVG) offer a powerful format for representing visual designs as interpretable code. Recent advances in vision-language models (VLMs) have enabled high-quality SVG generation by framing the problem as a code generation task and leveraging large-scale pretraining. VLMs are particularly suitable for this task as they capture both global semantics and fine-grained visual patterns, while transferring knowledge across vision, natural language, and code domains. However, existing VLM approaches often struggle to produce faithful and efficient SVGs because they never observe the rendered images during training. Although differentiable rendering for autoregressive SVG code generation remains unavailable, rendered outputs can still be compared to original inputs, enabling evaluative feedback suitable for reinforcement learning (RL). We introduce RLRF (Reinforcement Learning from Rendering Feedback), an RL method that enhances SVG generation in autoregressive VLMs by leveraging feedback from rendered SVG outputs. Given an input image, the model generates SVG roll-outs that are rendered and compared to the original image to compute a reward. This visual fidelity feedback guides the model toward producing more accurate, efficient, and semantically coherent SVGs. RLRF significantly outperforms supervised fine-tuning, addressing common failure modes and enabling precise, high-quality SVG generation with strong structural understanding and generalization.
MuSACo: Multimodal Subject-Specific Selection and Adaptation for Expression Recognition with Co-Training
Muhammad Osama Zeeshan
Natacha Gillet
Alessandro Lameiras Koerich
Francois Bremond
Eric Granger
Personalized expression recognition (ER) involves adapting a machine learning model to subject-specific data for improved recognition of expressions with considerable interpersonal variability. Subject-specific ER can benefit significantly from multi-source domain adaptation (MSDA) methods, where each domain corresponds to a specific subject, to improve model accuracy and robustness. Despite promising results, state-of-the-art MSDA approaches often overlook multimodal information or blend sources into a single domain, limiting subject diversity and failing to explicitly capture unique subject-specific characteristics. To address these limitations, we introduce MuSACo, a multimodal subject-specific selection and adaptation method for ER based on co-training. It leverages complementary information across multiple modalities and multiple source domains for subject-specific adaptation. This makes MuSACo particularly relevant for affective computing applications in digital health, such as patient-specific assessment for stress or pain, where subject-level nuances are crucial. MuSACo selects source subjects relevant to the target and generates pseudo-labels using the dominant modality for class-aware learning, in conjunction with a class-agnostic loss to learn from less confident target samples. Finally, source features from each modality are aligned, while only confident target features are combined. Our experimental results on two challenging multimodal ER datasets, BioVid and StressID, show that MuSACo can outperform UDA (blending) and state-of-the-art MSDA methods.