Portrait of Eugene Belilovsky is unavailable

Eugene Belilovsky

Associate Academic Member
Assistant Professor, Concordia University, Department of Computer Science and Software Engineering
Adjunct Professor, Université de Montréal, Department of Computer Science and Operations Research
Research Topics
Deep Learning
Distributed Systems
Optimization

Biography

Eugene Belilovsky is an assistant professor in the Department of Computer Science and Software Engineering at Concordia University.

He is also an associate academic member of Mila – Quebec Artificial Intelligence Institute and an adjunct professor at Université de Montréal.

Belilovsky’s research specialties lie in computer vision and deep learning. His current interests include continual learning and few-shot learning, along with applications of these aspects at the intersection of computer vision and language processing.

Current Students

PhD - Concordia University
Master's Research - Concordia University
Co-supervisor :
PhD - Concordia University
Co-supervisor :
Master's Research - Université de Montréal
Co-supervisor :
Master's Research - Concordia University
Co-supervisor :
PhD - Concordia University
Co-supervisor :
Master's Research - Concordia University
Co-supervisor :
PhD - Concordia University
PhD - Concordia University
Postdoctorate - Concordia University
Co-supervisor :
PhD - Concordia University
Co-supervisor :
PhD - Concordia University
Co-supervisor :
PhD - Université de Montréal
Principal supervisor :
Collaborating researcher - Université de Montréal
Principal supervisor :
PhD - Concordia University
Co-supervisor :
Master's Research - Concordia University

Publications

Circuit Discovery Helps To Detect LLM Jailbreaking
Paria Mehrbod
Boris Knyazev
geraldin nanfack
Despite extensive safety alignment, large language models (LLMs) remain vulnerable to jailbreak attacks that bypass safeguards to elicit har… (see more)mful content. While prior work attributes this vulnerability to safety training limitations, the internal mechanisms by which LLMs process adversarial prompts remain poorly understood. We present a mechanistic analysis of the jailbreaking behavior in a large-scale, safety-aligned LLM, focusing on LLaMA-2-7B-chat-hf. Leveraging edge attribution patching and subnetwork probing, we systematically identify computational circuits responsible for generating affirmative responses to jailbreak prompts. Ablating these circuits during the first token prediction can reduce attack success rates by up to 80\%, demonstrating its critical role in safety bypass. Our analysis uncovers key attention heads and MLP pathways that mediate adversarial prompt exploitation, revealing how important tokens propagate through these components to override safety constraints. These findings advance the understanding of adversarial vulnerabilities in aligned LLMs and pave the way for targeted, interpretable defenses mechanisms based on mechanistic interpretability.
PyLO: Towards Accessible Learned Optimizers in PyTorch
Paul Janson
Benjamin Thérien
Quentin Anthony
Xiaolong Huang
Abhinav Moudgil
Learned optimizers have been an active research topic over the past decade, with increasing progress toward practical, general-purpose optim… (see more)izers that can serve as drop-in replacements for widely used methods like Adam. However, recent advances -- such as VeLO, which was meta-trained for 4000 TPU-months -- remain largely inaccessible to the broader community, in part due to their reliance on JAX and the absence of user-friendly packages for applying the optimizers after meta-training. To address this gap, we introduce PyLO, a PyTorch-based library that brings learned optimizers to the broader machine learning community through familiar, widely adopted workflows. Unlike prior work focused on synthetic or convex tasks, our emphasis is on applying learned optimization to real-world large-scale pre-training tasks. Our release includes a CUDA-accelerated version of the small_fc_lopt learned optimizer architecture from (Metz et al., 2022a), delivering substantial speedups -- from 39.36 to 205.59 samples/sec throughput for training ViT B/16 with batch size 32. PyLO also allows us to easily combine learned optimizers with existing optimization tools such as learning rate schedules and weight decay. When doing so, we find that learned optimizers can substantially benefit. Our code is available at https://github.com/Belilovsky-Lab/pylo
Geometry-Aware Preference Learning for 3D Texture Generation
AmirHossein Zamani
Tianhao Xie
Amir Aghdam
Tiberiu Popa
Recent advances in 3D generative models have achieved impressive results but 3D contents generated by these models may not align with subjec… (see more)tive human preferences or task-specific criteria. Moreover, a core challenge in the 3D texture generation domain remains: most existing approaches rely on repeated calls to 2D text-to-image generative models, which lack an inherent understanding of the 3D structure of the input 3D mesh object. To address this, we propose an end-to-end differentiable preference learning framework that back-propagates human preferences, represented by differentiable reward functions, through the entire 3D generative pipeline, making the process inherently geometry-aware. We demonstrate the effectiveness of our framework using four proposed novel geometry-aware reward functions, offering a more controllable and interpretable pathway for high-quality 3D content creation from natural language.
MuLoCo: Muon is a practical inner optimizer for DiLoCo
Benjamin Thérien
Xiaolong Huang
Unsupervised Test-Time Adaptation for Hepatic Steatosis Grading Using Ultrasound B-Mode Images.
Pedro Vianna
Paria Mehrbod
Muawiz Chaudhary
Michael Eickenberg
An Tang
Guy Cloutier
Ultrasound is considered a key modality for the clinical assessment of hepatic steatosis (i.e., fatty liver) due to its non-invasiveness and… (see more) availability. Deep learning methods have attracted considerable interest in this field, as they are capable of learning patterns in a collection of images and achieve clinically comparable levels of accuracy in steatosis grading. However, variations in patient populations, acquisition protocols, equipment, and operator expertise across clinical sites can introduce domain shifts that reduce model performance when applied outside the original training setting. In response, unsupervised domain adaptation techniques are being investigated to address these shifts, allowing models to generalize more effectively across diverse clinical environments. In this work, we propose a test-time batch normalization technique designed to handle domain shift, especially for changes in label distribution, by adapting selected features of batch normalization layers in a trained convolutional neural network model. This approach operates in an unsupervised manner, allowing robust adaptation to new distributions without access to label data. The method was evaluated on two abdominal ultrasound datasets collected at different institutions, assessing its capability in mitigating domain shift for hepatic steatosis classification. The proposed method reduced the mean absolute error in steatosis grading by 37% and improved the area under the receiver operating characteristic curve for steatosis detection from 0.78 to 0.97, compared to non-adapted models. These findings demonstrate the potential of the proposed method to address domain shift in ultrasound-based hepatic steatosis diagnosis, minimizing risks associated with deploying trained models in various clinical settings.
Continual Pre-training of MoEs: How robust is your router?
Benjamin Thérien
Charles-Étienne Joseph
Zain Sarwar
Ashwinee Panda
Anirban Das
Shi-Xiong Zhang
Stephen Rawls
Sambit Sahu
Adaptive Local Training in Federated Learning
Donald Shenaj
Pietro Zanuttigh
Federated Learning is a machine learning paradigm where multiple clients collaboratively train a global model by exchanging their locally tr… (see more)ained model weights instead of raw data. In the standard setting, every client trains the local model for the same number of epochs. We introduce ALT (Adaptive Local Training), a simple yet effective feedback mechanism that could be introduced at the client side to limit unnecessary and degrading computations. ALT dynamically adjusts the number of training epochs for each client based on the similarity between their local representations and the global one, ensuring that well-aligned clients can train longer without experiencing client drift. We evaluated ALT on federated partitions of the CIFAR-10 and TinyImageNet datasets, demonstrating its effectiveness in improving model convergence and stability.
Adaptive Local Training in Federated Learning
Donald Shenaj
Pietro Zanuttigh
Federated learning is a machine learning paradigm where multiple clients collaboratively train a global model by exchanging their locally tr… (see more)ained model weights instead of raw data. In the standard setting, every client trains the local model for the same number of epochs. We introduce ALT (Adaptive Local Training), a simple yet effective feedback mechanism that can be exploited at the client side to limit unnecessary and degrading computations. ALT dynamically adjusts the number of training epochs for each client based on the similarity between their local representations and the global one, ensuring that well-aligned clients can train longer without experiencing client drift. We evaluated ALT on federated partitions of the CIFAR-10 and Tiny-ImageNet datasets, demonstrating its effectiveness in improving model convergence and stability.
Beyond Cosine Decay: On the effectiveness of Infinite Learning Rate Schedule for Continual Pre-training
Vaibhav Singh
Paul Janson
Paria Mehrbod
Adam Ibrahim
Benjamin Thérien
The ever-growing availability of unlabeled data presents both opportunities and challenges for training artificial intelligence systems. Whi… (see more)le self-supervised learning (SSL) has emerged as a powerful paradigm for extracting meaningful representations from vast amounts of unlabeled data, existing methods still struggle to adapt to the non-stationary, non-IID nature of real-world data streams without forgetting previously learned knowledge. Recent works have adopted a repeated cosine annealing schedule for large-scale continual pre-training; however, these schedules (1) inherently cause forgetting during the re-warming phase and (2) have not been systematically compared to existing continual SSL methods. In this work, we systematically compare the widely used cosine schedule with the recently proposed infinite learning rate schedule and empirically find the latter to be a more effective alternative. Our extensive empirical evaluation across diverse image and language datasets demonstrates that the infinite learning rate schedule consistently enhances continual pre-training performance compared to a repeated cosine decay without being restricted to a fixed iteration budget. For instance, in a small-scale MAE pre-training setup, it outperforms several strong baselines from the literature. We then scale up our experiments to larger MAE pre-training and autoregressive language model pre-training. Our results show that the infinite learning rate schedule remains effective at scale, surpassing repeated cosine decay for both MAE pre-training and zero-shot LM benchmarks.
Beyond Cosine Decay: On the effectiveness of Infinite Learning Rate Schedule for Continual Pre-training
Vaibhav Singh
Paul Janson
Paria Mehrbod
Adam Ibrahim
Benjamin Thérien
The ever-growing availability of unlabeled data presents both opportunities and challenges for training artificial intelligence systems. Whi… (see more)le self-supervised learning (SSL) has emerged as a powerful paradigm for extracting meaningful representations from vast amounts of unlabeled data, existing methods still struggle to adapt to the non-stationary, non-IID nature of real-world data streams without forgetting previously learned knowledge. Recent works have adopted a repeated cosine annealing schedule for large-scale continual pre-training; however, these schedules (1) inherently cause forgetting during the re-warming phase and (2) have not been systematically compared to existing continual SSL methods. In this work, we systematically compare the widely used cosine schedule with the recently proposed infinite learning rate schedule and empirically find the latter to be a more effective alternative. Our extensive empirical evaluation across diverse image and language datasets demonstrates that the infinite learning rate schedule consistently enhances continual pre-training performance compared to a repeated cosine decay without being restricted to a fixed iteration budget. For instance, in a small-scale MAE pre-training setup, it outperforms several strong baselines from the literature. We then scale up our experiments to larger MAE pre-training and autoregressive language model pre-training. Our results show that the infinite learning rate schedule remains effective at scale, surpassing repeated cosine decay for both MAE pre-training and zero-shot LM benchmarks.
Continual Pre-training of MoEs: How robust is your router?
Benjamin Thérien
Charles-Étienne Joseph
Zain Sarwar
Ashwinee Panda
Anirban Das
Shi-Xiong Zhang
Stephen Rawls
Sambit Sahu
Large language models deconstruct the clinical intuition behind diagnosing autism
Jack Stanley
Emmett Rabot
L. Mottron