Vaibhav Singh

DiffuMamba: High-Throughput Diffusion LMs with Mamba Backbone

Pierre-Andre Noel

Torsten Scholak

Diffusion language models (DLMs) have emerged as a promising alternative to autoregressive (AR) generation, yet their reliance on Transformer backbones limits inference efficiency due to quadratic attention or KV-cache overhead. We introduce DiffuMamba, a masked diffusion language model built on a bidirectional Mamba backbone that combines the diffusion objective with linear-time sequence modeling, and DiffuMamba-H, a hybrid variant with interleaved attention. Across scales up to 1.3B parameters, our models match Transformer-based diffusion in downstream performance while achieving up to 8.2× and 4.3× higher inference throughput, respectively, on long sequences. We further present a systematic analysis of inference efficiency across modern DLM variants, combining asymptotic complexity with empirical measurements. Notably, cache-efficient block diffusion with Mamba mixers emerges as the only strategy that scales linearly with sequence length and achieves the strongest performance across all baselines, suggesting a promising direction for future diffusion-based generation systems.

2026-03-01

MM_Intelligence @ International Conference on Learning Representations (poster)

doi.org

openreview.net

Dual-Phase Continual Learning: Supervised Adaptation Meets Unsupervised Retention

Vaibhav Singh

Rahaf Aljundi

Eugene Belilovsky

Foundational vision-language models (VLMs) excel across diverse tasks, but adapting them to new domains without forgetting prior knowledge remains a critical challenge. Continual Learning (CL) addresses this challenge by enabling models to learn sequentially from new data while mitigating the forgetting of prior information, typically under supervised settings involving label shift. Nonetheless, abrupt distribution shifts can still cause substantial forgetting, potentially nullifying the benefits of supervised updates, especially when storing or replaying past data is infeasible. In this work, we propose leveraging unlabeled test-time data in an unsupervised manner to reinforce prior task performance without requiring replay or stored examples. Unlike traditional Test-Time Adaptation (TTA), which primarily focuses on domain shift or corruption, our method improves performance on earlier tasks by exploiting representative test samples encountered during deployment. We introduce a simple teacher-student framework with gradient-based sparse parameter updates, and show that it effectively mitigates forgetting in class-incremental CL for VLMs, offering a memory-free alternative to episodic replay with strong empirical results.

2026-01-29

Transactions on Machine Learning Research (accepted)

doi.org

openreview.net

When Data Falls Short: Grokking Below the Critical Threshold

Vaibhav Singh

Eugene Belilovsky

Rahaf Aljundi

2025-09-22

NeurIPS.cc/2025/Workshop/CCFM (poster)

openreview.net

Beyond Cosine Decay: On the effectiveness of Infinite Learning Rate Schedule for Continual Pre-training

The ever-growing availability of unlabeled data presents both opportunities and challenges for training artificial intelligence systems. While self-supervised learning (SSL) has emerged as a powerful paradigm for extracting meaningful representations from vast amounts of unlabeled data, existing methods still struggle to adapt to the non-stationary, non-IID nature of real-world data streams without forgetting previously learned knowledge. Recent works have adopted a repeated cosine annealing schedule for large-scale continual pre-training; however, these schedules (1) inherently cause forgetting during the re-warming phase and (2) have not been systematically compared to existing continual SSL methods. In this work, we systematically compare the widely used cosine schedule with the recently proposed infinite learning rate schedule and empirically find the latter to be a more effective alternative. Our extensive empirical evaluation across diverse image and language datasets demonstrates that the infinite learning rate schedule consistently enhances continual pre-training performance compared to a repeated cosine decay without being restricted to a fixed iteration budget. For instance, in a small-scale MAE pre-training setup, it outperforms several strong baselines from the literature. We then scale up our experiments to larger MAE pre-training and autoregressive language model pre-training. Our results show that the infinite learning rate schedule remains effective at scale, surpassing repeated cosine decay for both MAE pre-training and zero-shot LM benchmarks.

2025-06-10

ICML.cc/2025/Workshop/ES-FoMo-III (published)

doi.org

openreview.net

Model Parallelism With Subnetwork Data Parallelism

Distributed pre-training of large models at scale often imposes heavy memory demands on individual nodes and incurs significant intra-node communication costs. We propose a novel alternative approach that reduces the memory requirements by training small, structured subnetworks of the model on separate workers. Unlike pipelining, our method avoids inter-node activation communication and maintains bandwidth requirements that are comparable to or lower than standard data parallel communication schemes based on all-reduce. We evaluate two subnetwork construction strategies guided by the principle of ensuring uniform representation of each parameter across the distributed training setup. Our results show that the stochastic block dropping technique consistently outperforms the width-wise subnetwork construction previously explored in federated learning. We empirically attribute this superior performance to stronger gradient alignment in subnetworks that retain blocks having skip connections. Preliminary experiments highlight the promise of our approach, achieving a

2025-06-10

ICML.cc/2025/Workshop/ES-FoMo-III (published)

doi.org

openreview.net

Controlling Forgetting with Test-Time Data in Continual Learning

Vaibhav Singh

Rahaf Aljundi

Eugene Belilovsky

Foundational vision-language models excel in various tasks but require updates as new tasks or domains emerge. Current Continual Learning (CL) methods, which focus on supervised training, often suffer from significant forgetting, performing worse than the original models in zero-shot scenarios. This work proposes leveraging test-time, unsupervised data in a self-supervised manner to refresh the model’s memory of previously learned tasks, minimizing forgetting without additional labeling. By introducing a student-teacher framework with gradient-based sparse parameter updates, the approach enhances performance on prior tasks and reduces reliance on offline memory buffers, effectively improving continual learning outcomes.

2024-10-09

NeurIPS.cc/2024/Workshop/AFM (poster)

openreview.net
