Eugene Belilovsky

Website

Xiao Huang

Master's Research - Concordia University

Co-supervisor :

Paul Janson

PhD - Concordia University

Co-supervisor :

Master's Research - Concordia University

Co-supervisor :

Website

Gwen Legate

PhD - Concordia University

Co-supervisor :

Master's Research - Concordia University

Co-supervisor :

Guy Wolf

Abhinav Moudgil

PhD - Concordia University

Website

Google Scholar

Adel Nabli

PhD - Concordia University

Google Scholar

Geraldin Nanfack

Postdoctorate - Concordia University

Co-supervisor :

Albert Orozco Camacho

PhD - Concordia University

Co-supervisor :

PhD - Concordia University

Co-supervisor :

PhD - Université de Montréal

Principal supervisor :

Postdoctorate - Université de Montréal

Principal supervisor :

PhD - Concordia University

Co-supervisor :

Publications

Efficient Refusal Ablation in LLM through Optimal Transport

Geraldin Nanfack

Elvis Dohmatob

Safety-aligned language models refuse harmful requests through learned refusal behaviors encoded in their internal representations. Recent activation-based jailbreaking methods circumvent these safety mechanisms by applying orthogonal projections to remove refusal directions, but these approaches treat refusal as a one-dimensional phenomenon and ignore the rich distributional structure of model activations. We introduce a principled framework based on optimal transport theory that transforms the entire distribution of harmful activations to match harmless ones. By combining PCA with closed-form Gaussian optimal transport, we achieve efficient computation in high-dimensional representation spaces while preserving essential geometric structure. Across six models (Llama-2, Llama-3.1, Qwen-2.5; 7B-32B parameters), our method achieves up to 11% higher attack success rates than state-of-the-art baselines while maintaining comparable perplexity, demonstrating superior preservation of model capabilities. Critically, we discover that layer-selective intervention (applying optimal transport to 1-2 carefully chosen layers at approximately 40-60% network depth) substantially outperforms full-network interventions, revealing that refusal mechanisms may be localized rather than distributed. Our analysis provides new insights into the geometric structure of safety representations and suggests that current alignment methods may be vulnerable to distributional attacks beyond simple direction removal.

2026-03-03

arXiv (preprint)
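The abstract above mentions combining PCA with closed-form Gaussian optimal transport. As a rough illustration only, the sketch below applies the standard closed-form transport map between two Gaussians inside a PCA subspace. It is not the authors' implementation: the function names, the pooled-data PCA, the regularizer `eps`, and the synthetic "harmful" and "harmless" arrays are all assumptions made for this example.

```python
import numpy as np

def sqrtm_psd(m):
    """Square root of a symmetric PSD matrix via eigendecomposition."""
    w, v = np.linalg.eigh(m)
    return (v * np.sqrt(np.clip(w, 0.0, None))) @ v.T

def pca_basis(x, k):
    """Top-k principal directions of x as a (d, k) orthonormal matrix."""
    xc = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(xc, full_matrices=False)
    return vt[:k].T

def gaussian_ot_matrix(cov_s, cov_t):
    """Linear part A of the Gaussian OT map: A @ cov_s @ A == cov_t."""
    s_half = sqrtm_psd(cov_s)
    s_inv_half = np.linalg.inv(s_half)
    return s_inv_half @ sqrtm_psd(s_half @ cov_t @ s_half) @ s_inv_half

def transport(source, target, k, eps=1e-6):
    """Move `source` toward `target` inside a k-dim PCA subspace (assumed design)."""
    basis = pca_basis(np.vstack([source, target]), k)
    zs, zt = source @ basis, target @ basis
    reg = eps * np.eye(k)  # small ridge keeps the covariances invertible
    a = gaussian_ot_matrix(np.cov(zs, rowvar=False) + reg,
                           np.cov(zt, rowvar=False) + reg)
    # A is symmetric, so right-multiplying rows by A applies the map.
    moved_z = zt.mean(axis=0) + (zs - zs.mean(axis=0)) @ a
    # Only the component inside the subspace changes.
    return source + (moved_z - zs) @ basis.T, basis

# Synthetic stand-ins for activation matrices (rows are samples).
rng = np.random.default_rng(0)
harmless = rng.normal(size=(400, 16))
harmful = 1.5 * rng.normal(size=(400, 16)) + 2.0
moved, basis = transport(harmful, harmless, k=4)
```

After the call, the projection of `moved` onto the subspace has the same empirical mean as the projected `harmless` data and, up to the small regularizer, the same covariance, while the component orthogonal to the subspace is left untouched.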

DiffuMamba: High-Throughput Diffusion LMs with Mamba Backbone

Vaibhav Singh

Oleksiy Ostapenko

Pierre-Andre Noel

Torsten Scholak

Diffusion language models (DLMs) have emerged as a promising alternative to autoregressive (AR) generation, yet their reliance on Transformer backbones limits inference efficiency due to quadratic attention or KV-cache overhead. We introduce DiffuMamba, a masked diffusion language model built on a bidirectional Mamba backbone that combines the diffusion objective with linear-time sequence modeling, and DiffuMamba-H, a hybrid variant with interleaved attention. Across scales up to 1.3B parameters, our models match Transformer-based diffusion in downstream performance while achieving up to 8.2× and 4.3× higher inference throughput, respectively, on long sequences. We further present a systematic analysis of inference efficiency across modern DLM variants, combining asymptotic complexity with empirical measurements. Notably, cache-efficient block diffusion with Mamba mixers emerges as the only strategy that scales linearly with sequence length and achieves the strongest performance across all baselines, suggesting a promising direction for future diffusion-based generation systems.

2026-03-01

MM_Intelligence @ International Conference on Learning Representations (poster)

Stabilizing Native Low-Rank LLM Pretraining

Paul Janson

Edouard Oyallon

Foundation models have achieved remarkable success, yet their growing parameter counts pose significant computational and memory challenges. Low-rank factorization offers a promising route to reduce training and inference costs, but the community lacks a stable recipe for training models from scratch using exclusively low-rank weights while matching the performance of the dense model. We demonstrate that Large Language Models (LLMs) can be trained from scratch using exclusively low-rank factorized weights for all non-embedding matrices without the auxiliary "full-rank" guidance required by prior methods. While native low-rank training often suffers from instability and loss spikes, we identify uncontrolled growth in the spectral norm (largest singular value) of the weight matrix update as the dominant factor. To address this, we introduce Spectron: Spectral renormalization with orthogonalization, which dynamically bounds the resultant weight updates based on the current spectral norms of the factors. Our method enables stable, end-to-end factorized training with negligible overhead. Finally, we establish compute-optimal scaling laws for natively low-rank transformers, demonstrating predictable power-law behavior and improved inference efficiency relative to dense models.

2026-02-11

ArXiv (preprint)

arxiv.org
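The abstract above identifies the spectral norm (largest singular value) of the weight update as the quantity to control. The sketch below is not Spectron; it only illustrates, under assumptions, how the spectral norm of a low-rank product `A @ B` can be estimated by power iteration without forming the full matrix, and how a simple rescaling factor could be derived from it. The function name `spectral_norm_lowrank`, the bound of 1.0, and the test matrices are hypothetical.

```python
import numpy as np

def spectral_norm_lowrank(a, b, iters=50, seed=0):
    """Estimate the largest singular value of A @ B by power iteration.

    Only matrix-vector products with the factors are used, so the
    (m x n) product is never materialized.
    """
    v = np.random.default_rng(seed).standard_normal(b.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        u = a @ (b @ v)        # apply W = A @ B
        v = b.T @ (a.T @ u)    # apply W^T
        v /= np.linalg.norm(v)
    return float(np.linalg.norm(a @ (b @ v)))

# Factors built so the singular values of A @ B are known: 5, 2, 1, 0.5.
rng = np.random.default_rng(0)
u, _ = np.linalg.qr(rng.normal(size=(64, 4)))
v, _ = np.linalg.qr(rng.normal(size=(32, 4)))
a = u * np.array([5.0, 2.0, 1.0, 0.5])  # A = U diag(s)
b = v.T                                  # B = V^T

sigma = spectral_norm_lowrank(a, b)
# Hypothetical bound: shrink the product so its spectral norm is at most 1.
scale = min(1.0, 1.0 / sigma)
```

Power iteration converges quickly here because the gap between the two largest singular values is wide; matrices with nearly equal leading singular values would need more iterations.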

Dual-Phase Continual Learning: Supervised Adaptation Meets Unsupervised Retention

Vaibhav Singh

Rahaf Aljundi

Foundational vision-language models (VLMs) excel across diverse tasks, but adapting them to new domains without forgetting prior knowledge remains a critical challenge. Continual Learning (CL) addresses this challenge by enabling models to learn sequentially from new data while mitigating the forgetting of prior information, typically under supervised settings involving label shift. Nonetheless, abrupt distribution shifts can still cause substantial forgetting, potentially nullifying the benefits of supervised updates, especially when storing or replaying past data is infeasible. In this work, we propose leveraging unlabeled test-time data in an unsupervised manner to reinforce prior task performance without requiring replay or stored examples. Unlike traditional Test-Time Adaptation (TTA), which primarily focuses on domain shift or corruption, our method improves performance on earlier tasks by exploiting representative test samples encountered during deployment. We introduce a simple teacher-student framework with gradient-based sparse parameter updates, and show that it effectively mitigates forgetting in class-incremental CL for VLMs, offering a memory-free alternative to episodic replay with strong empirical results.

2026-01-29

Transactions on Machine Learning Research (accepted)

$\mu$LO: Compute-Efficient Meta-Generalization of Learned Optimizers

Charles-Etienne Joseph

Learned optimizers (LOs) have the potential to significantly reduce the wall-clock training time of neural networks. However, they can struggle to optimize unseen tasks (*meta-generalize*), especially when training networks wider than those seen during meta-training. To address this, we derive the Maximal Update Parametrization (

2025-12-31

International Conference on Learning Representations (Accept (Poster))

Towards Learned Optimization Free Lunch

Abhinav Moudgil

Boris Knyazev

Learned optimizers are powerful alternatives to hand-designed rules like Adam, yet they have seen limited practical adoption since they often fail to meta-generalize beyond their training distribution and incur high meta-training cost. For instance, prior work, VeLO, scaled meta-training to 4,000 TPU months (

2025-12-31

International Conference on Learning Representations (Accept (Poster))

Continual Pre-training of MoEs: How robust is your router?

Charles-Etienne Joseph

Zain Sarwar

Ashwinee Panda

Anirban Das

Shi-Xiong Zhang

Stephen Rawls

Sambit Sahu

2025-09-25

TMLR (accepted)

When Data Falls Short: Grokking Below the Critical Threshold

Vaibhav Singh

Rahaf Aljundi

2025-09-22

NeurIPS.cc/2025/Workshop/CCFM (poster)

ACCO: Accumulate While You Communicate for Communication-Overlapped Sharded LLM Training

Adel Nabli

Louis Fournier

Pierre Erbacher

Louis Serrano

Edouard Oyallon

Training LLMs relies on distributed implementations using multiple GPUs to compute gradients in parallel with sharded optimizers. However, synchronizing gradients in data-parallel setups introduces communication overhead that grows with the number of workers, limiting parallelization efficiency. Local optimization algorithms reduce communications but incur high memory costs as they prevent optimizer state sharding, hindering scalability. To address this, we propose ACcumulate while COmmunicate (ACCO), a memory-efficient optimization algorithm for distributed LLM training. By synchronizing delayed gradients while computing new ones, ACCO reduces GPU idle time and supports heterogeneous hardware. To mitigate the convergence issues caused by delayed updates, we introduce a novel technique ensuring training dynamics align with standard distributed optimization. Compared to ZeRO-1, our approach is significantly faster and scales effectively across heterogeneous hardware.

2025-09-17

NeurIPS.cc/2025/Conference (poster)

Warming Up for Zeroth-Order Federated Pre-Training with Low Resource Clients

Gwen Legate

Federated learning enables collaborative model training across numerous edge devices without requiring participants to share data; however, memory and communication constraints on these edge devices may preclude their participation in training. We consider a setting in which a subset of edge devices are below a critical memory or communication threshold required to conduct model updates. Under typical federated optimization algorithms, these devices are excluded from training, which renders their data inaccessible and increases system-induced bias. We are inspired by MeZO, a zeroth-order method used for memory-efficient fine-tuning. The increased variance inherent to zeroth-order gradient approximations has relegated previous zeroth-order optimizers exclusively to the domain of fine-tuning, a limitation we seek to correct. We devise a federated, memory-efficient zeroth-order optimizer, ZOWarmUp, that permits zeroth-order training from a random initialization. ZOWarmUp leverages differing client capabilities and careful variance reduction techniques to facilitate participation of under-represented, low-resource clients in model training. Like other federated zeroth-order methods, ZOWarmUp eliminates the need for edge devices to transmit their full gradients to the server and instead relies on only a small set of random seeds, rendering the up-link communication cost negligible. We present experiments using various datasets and model architectures to show that ZOWarmUp is a robust algorithm that can be applied under a wide variety of circumstances. For systems with a high proportion of edge devices that would otherwise be excluded from training, this algorithm provides access to a greater volume and diversity of data, thus improving training outcomes.

2025-09-02

ArXiv (preprint)

arxiv.org
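The abstract above notes that federated zeroth-order methods can replace full gradient uploads with a small set of random seeds. The sketch below illustrates that general idea with a MeZO-style two-point estimate on a toy linear regression. It is not ZOWarmUp: the message format (a seed plus one scalar), the function names, the learning rate, and the synthetic client data are assumptions chosen for this example, and no variance reduction or capability-aware scheduling is included.

```python
import numpy as np

def loss(theta, x, y):
    """Mean squared error of a linear model."""
    return float(np.mean((x @ theta - y) ** 2))

def perturbation(seed, dim):
    """Direction z regenerated deterministically from a shared seed."""
    return np.random.default_rng(seed).standard_normal(dim)

def client_message(theta, x, y, seed, eps=1e-3):
    """Two forward passes give a scalar directional derivative estimate."""
    z = perturbation(seed, theta.size)
    g = (loss(theta + eps * z, x, y) - loss(theta - eps * z, x, y)) / (2 * eps)
    return seed, g  # the uplink carries a seed and one float, not a gradient

def server_update(theta, messages, lr):
    """Rebuild each direction from its seed and average the updates."""
    step = np.zeros_like(theta)
    for seed, g in messages:
        step += g * perturbation(seed, theta.size)
    return theta - lr * step / len(messages)

# Four synthetic clients sharing the same underlying linear model.
rng = np.random.default_rng(0)
true_theta = rng.normal(size=5)
shards = []
for _ in range(4):
    x = rng.normal(size=(64, 5))
    shards.append((x, x @ true_theta))

theta = np.zeros(5)
initial = np.mean([loss(theta, x, y) for x, y in shards])
for rnd in range(500):
    msgs = [client_message(theta, x, y, seed=rnd * len(shards) + i)
            for i, (x, y) in enumerate(shards)]
    theta = server_update(theta, msgs, lr=0.02)
final = np.mean([loss(theta, x, y) for x, y in shards])
```

Each client needs only forward passes and sends a constant-size message regardless of model dimension, which is the property the abstract describes as making the up-link cost negligible.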

Communication Efficient LLM Pre-training with SparseLoCo

Amir M. Sarfi

Joel Lidin

2025-08-20

ArXiv (preprint)