Eugene Belilovsky

Paul Janson

PhD - Concordia University

Co-supervisor :

Master's Research - Concordia University

Co-supervisor :

Irina Rish

Website

Gwen Legate

PhD - Concordia University

Co-supervisor :

Master's Research - Concordia University

Co-supervisor :

Guy Wolf

Abhinav Moudgil

PhD - Concordia University

Website

Google Scholar

Adel Nabli

PhD - Concordia University

Google Scholar

Geraldin Nanfack

Postdoctorate - Concordia University

Co-supervisor :

Albert Orozco Camacho

PhD - Concordia University

Co-supervisor :

PhD - Concordia University

Co-supervisor :

Irina Rish

Benjamin Therien

PhD - Université de Montréal

Principal supervisor :

Collaborating researcher - Université de Montréal

Principal supervisor :

PhD - Concordia University

Co-supervisor :

Publications

Model Parallelism With Subnetwork Data Parallelism

Distributed pre-training of large models at scale often imposes heavy memory demands on individual nodes and incurs significant intra-node communication costs. We propose a novel alternative approach that reduces the memory requirements by training small, structured subnetworks of the model on separate workers. Unlike pipelining, our method avoids inter-node activation communication and maintains bandwidth requirements that are comparable to or lower than standard data parallel communication schemes based on all-reduce. We evaluate two subnetwork construction strategies guided by the principle of ensuring uniform representation of each parameter across the distributed training setup. Our results show that the stochastic block dropping technique consistently outperforms the width-wise subnetwork construction previously explored in federated learning. We empirically attribute this superior performance to stronger gradient alignment in subnetworks that retain blocks having skip connections. Preliminary experiments highlight the promise of our approach, achieving a

2025-06-11

ICML.cc/2025/Workshop/ES-FoMo-III (published)

MuLoCo: Muon is a practical inner optimizer for DiLoCo

2025-06-11

ICML.cc/2025/Workshop/ES-FoMo-III (published)

Geometry-Aware Preference Learning for 3D Texture Generation

AmirHossein Zamani

Tianhao Xie

Amir Aghdam

Tiberiu Popa

Recent advances in 3D generative models have achieved impressive results, but 3D content generated by these models may not align with subjective human preferences or task-specific criteria. Moreover, a core challenge in the 3D texture generation domain remains: most existing approaches rely on repeated calls to 2D text-to-image generative models, which lack an inherent understanding of the 3D structure of the input 3D mesh object. To address this, we propose an end-to-end differentiable preference learning framework that back-propagates human preferences, represented by differentiable reward functions, through the entire 3D generative pipeline, making the process inherently geometry-aware. We demonstrate the effectiveness of our framework using four proposed novel geometry-aware reward functions, offering a more controllable and interpretable pathway for high-quality 3D content creation from natural language.

2025-06-10

ICML.cc/2025/Workshop/MoFA (poster)

Test Time Adaptation Using Adaptive Quantile Recalibration

2025-06-10

ICML.cc/2025/Workshop/PUT (poster)

PyLO: Towards Accessible Learned Optimizers in PyTorch

Paul Janson

Benjamin Therien

Quentin Gregory Anthony

Xiaolong Huang

Abhinav Moudgil

Learned optimizers have been an active research topic over the past decade, with increasing progress toward practical, general-purpose optimizers that can serve as drop-in replacements for widely used methods like Adam. However, recent advances -- such as VeLO, which was meta-trained for 4000 TPU-months -- remain largely inaccessible to the broader community, in part due to their reliance on JAX and the absence of user-friendly packages for applying the optimizers after meta-training. To address this gap, we introduce PyLO, a PyTorch-based library that brings learned optimizers to the broader machine learning community through familiar, widely adopted workflows. Unlike prior work focused on synthetic or convex tasks, our emphasis is on applying learned optimization to real-world large-scale pre-training tasks. Our release includes a CUDA-accelerated version of the small_fc_lopt learned optimizer architecture from (Metz et al., 2022a), delivering substantial speedups -- from 39.36 to 205.59 samples/sec throughput for training ViT B/16 with batch size 32. PyLO also allows us to easily combine learned optimizers with existing optimization tools such as learning rate schedules and weight decay. When doing so, we find that learned optimizers can substantially benefit. Our code is available at https://github.com/Belilovsky-Lab/pylo

2025-06-09

ICML.cc/2025/Workshop/CODEML (published)

Unsupervised Test-Time Adaptation for Hepatic Steatosis Grading Using Ultrasound B-Mode Images.

Michael Eickenberg

An Tang

Guy Cloutier

Ultrasound is considered a key modality for the clinical assessment of hepatic steatosis (i.e., fatty liver) due to its non-invasiveness and availability. Deep learning methods have attracted considerable interest in this field, as they are capable of learning patterns in a collection of images and achieve clinically comparable levels of accuracy in steatosis grading. However, variations in patient populations, acquisition protocols, equipment, and operator expertise across clinical sites can introduce domain shifts that reduce model performance when applied outside the original training setting. In response, unsupervised domain adaptation techniques are being investigated to address these shifts, allowing models to generalize more effectively across diverse clinical environments. In this work, we propose a test-time batch normalization technique designed to handle domain shift, especially for changes in label distribution, by adapting selected features of batch normalization layers in a trained convolutional neural network model. This approach operates in an unsupervised manner, allowing robust adaptation to new distributions without access to label data. The method was evaluated on two abdominal ultrasound datasets collected at different institutions, assessing its capability in mitigating domain shift for hepatic steatosis classification. The proposed method reduced the mean absolute error in steatosis grading by 37% and improved the area under the receiver operating characteristic curve for steatosis detection from 0.78 to 0.97, compared to non-adapted models. These findings demonstrate the potential of the proposed method to address domain shift in ultrasound-based hepatic steatosis diagnosis, minimizing risks associated with deploying trained models in various clinical settings.

2025-03-26

IEEE Transactions on Ultrasonics, Ferroelectrics and Frequency Control (published)

Continual Pre-training of MoEs: How robust is your router?

Benjamin Therien

Charles-Etienne Joseph

Zain Sarwar

Ashwinee Panda

Anirban Das

Shi-Xiong Zhang

Stephen Rawls

Sambit Sahu

Irina Rish

2025-03-06

ArXiv (preprint)

Adaptive Local Training in Federated Learning

Donald Shenaj

Pietro Zanuttigh

Federated Learning is a machine learning paradigm where multiple clients collaboratively train a global model by exchanging their locally trained model weights instead of raw data. In the standard setting, every client trains the local model for the same number of epochs. We introduce ALT (Adaptive Local Training), a simple yet effective feedback mechanism that could be introduced at the client side to limit unnecessary and degrading computations. ALT dynamically adjusts the number of training epochs for each client based on the similarity between their local representations and the global one, ensuring that well-aligned clients can train longer without experiencing client drift. We evaluated ALT on federated partitions of the CIFAR-10 and TinyImageNet datasets, demonstrating its effectiveness in improving model convergence and stability.

2025-03-05

ICLR.cc/2025/Workshop/MCDC (accepted)

Beyond Cosine Decay: On the effectiveness of Infinite Learning Rate Schedule for Continual Pre-training

The ever-growing availability of unlabeled data presents both opportunities and challenges for training artificial intelligence systems. While self-supervised learning (SSL) has emerged as a powerful paradigm for extracting meaningful representations from vast amounts of unlabeled data, existing methods still struggle to adapt to the non-stationary, non-IID nature of real-world data streams without forgetting previously learned knowledge. Recent works have adopted a repeated cosine annealing schedule for large-scale continual pre-training; however, these schedules (1) inherently cause forgetting during the re-warming phase and (2) have not been systematically compared to existing continual SSL methods. In this work, we systematically compare the widely used cosine schedule with the recently proposed infinite learning rate schedule and empirically find the latter to be a more effective alternative. Our extensive empirical evaluation across diverse image and language datasets demonstrates that the infinite learning rate schedule consistently enhances continual pre-training performance compared to a repeated cosine decay without being restricted to a fixed iteration budget. For instance, in a small-scale MAE pre-training setup, it outperforms several strong baselines from the literature. We then scale up our experiments to larger MAE pre-training and autoregressive language model pre-training. Our results show that the infinite learning rate schedule remains effective at scale, surpassing repeated cosine decay for both MAE pre-training and zero-shot LM benchmarks.

2025-03-04

ArXiv (preprint)

Large language models deconstruct the clinical intuition behind diagnosing autism

Emmett Rabot

L. Mottron

2025-03-01

Cell (published)

Channel-Selective Normalization for Label-Shift Robust Test-Time Adaptation

Pedro Vianna

Muawiz Chaudhary

Paria Mehrbod

An Tang

Guy Cloutier

Guy Wolf

Michael Eickenberg

Deep neural networks have useful applications in many different tasks; however, their performance can be severely affected by changes in the data distribution. For example, in the biomedical field, their performance can be affected by changes in the data (different machines, populations) between training and test datasets. To ensure robustness and generalization to real-world scenarios, test-time adaptation has been recently studied as an approach to adjust models to a new data distribution during inference. Test-time batch normalization is a simple and popular method that achieved compelling performance on domain shift benchmarks. It is implemented by recalculating batch normalization statistics on test batches. Prior work has focused on analysis with test data that has the same label distribution as the training data. However, in many practical applications this technique is vulnerable to label distribution shifts, sometimes producing catastrophic failure. This presents a risk in applying test-time adaptation methods in deployment. We propose to tackle this challenge by only selectively adapting channels in a deep network, minimizing drastic adaptation that is sensitive to label shifts. Our selection scheme is based on two principles that we empirically motivate: (1) later layers of networks are more sensitive to label shift, and (2) individual features can be sensitive to specific classes. We apply the proposed technique to three classification tasks, including CIFAR10-C, Imagenet-C, and diagnosis of fatty liver, where we explore both covariate and label distribution shifts. We find that our method delivers the benefits of TTA while significantly reducing the risk of failure common in other methods and remaining robust to the choice of hyperparameters.

2025-02-17

Proceedings of The 3rd Conference on Lifelong Learning Agents (published)