TRAIL : IA responsable pour les professionnels et les leaders
Apprenez à intégrer des pratique d'IA responsable dans votre organisation avec le programme TRAIL. Inscrivez-vous à la prochaine cohorte qui débutera le 15 avril.
Avantage IA : productivité dans la fonction publique
Apprenez à tirer parti de l’IA générative pour soutenir et améliorer votre productivité au travail. La prochaine cohorte se déroulera en ligne les 28 et 30 avril 2026.
Nous utilisons des témoins pour analyser le trafic et l’utilisation de notre site web, afin de personnaliser votre expérience. Vous pouvez désactiver ces technologies à tout moment, mais cela peut restreindre certaines fonctionnalités du site. Consultez notre Politique de protection de la vie privée pour en savoir plus.
Paramètre des cookies
Vous pouvez activer et désactiver les types de cookies que vous souhaitez accepter. Cependant certains choix que vous ferez pourraient affecter les services proposés sur nos sites (ex : suggestions, annonces personnalisées, etc.).
Cookies essentiels
Ces cookies sont nécessaires au fonctionnement du site et ne peuvent être désactivés. (Toujours actif)
Cookies analyse
Acceptez-vous l'utilisation de cookies pour mesurer l'audience de nos sites ?
Lecteur Multimédia
Acceptez-vous l'utilisation de cookies pour afficher et vous permettre de regarder les contenus vidéo hébergés par nos partenaires (YouTube, etc.) ?
Diffusion language models (DLMs) have emerged as a promising alternative to autoregressive (AR) generation, yet their reliance on Transforme… (voir plus)r backbones limits inference efficiency due to quadratic attention or KV-cache overhead. We introduce DiffuMamba, a masked diffusion language model built on a bidirectional Mamba backbone that combines the diffusion objective with linear-time sequence modeling, and DiffuMamba-H, a hybrid variant with interleaved attention. Across scales up to 1.3B parameters, our models match Transformer-based diffusion in downstream performance while achieving up to 8.2× and 4.3× higher inference throughput, respectively, on long sequences. We further present a systematic analysis of inference efficiency across modern DLM variants, combining asymptotic complexity with empirical measurements. Notably, cache-efficient block diffusion with Mamba mixers emerges as the only strategy that scales linearly with sequence length and achieves the strongest performance across all baselines, suggesting a promising direction for future diffusion-based generation systems.
2026-03-01
MM_Intelligence @ International Conference on Learning Representations (poster)
Foundational vision-language models (VLMs) excel across diverse tasks, but adapting them to new domains without forgetting prior knowledge r… (voir plus)emains a critical challenge. Continual Learning (CL) addresses this challenge by enabling models to learn sequentially from new data while mitigating the forgetting of prior information, typically under supervised settings involving label shift. Nonetheless, abrupt distribution shifts can still cause substantial forgetting, potentially nullifying the benefits of supervised updates, especially when storing or replaying past data is infeasible. In this work, we propose leveraging unlabeled test-time data in an unsupervised manner to reinforce prior task performance without requiring replay or stored examples. Unlike traditional Test-Time Adaptation (TTA), which primarily focuses on domain shift or corruption, our method improves performance on earlier tasks by exploiting representative test samples encountered during deployment. We introduce a simple teacher-student framework with gradient-based sparse parameter updates, and show that it effectively mitigates forgetting in class-incremental CL for VLMs, offering a memory-free alternative to episodic replay with strong empirical results.
2026-01-29
Transactions on Machine Learning Research (accepté)
The ever-growing availability of unlabeled data presents both opportunities and challenges for training artificial intelligence systems. Whi… (voir plus)le self-supervised learning (SSL) has emerged as a powerful paradigm for extracting meaningful representations from vast amounts of unlabeled data, existing methods still struggle to adapt to the non-stationary, non-IID nature of real-world data streams without forgetting previously learned knowledge. Recent works have adopted a repeated cosine annealing schedule for large-scale continual pre-training; however, these schedules (1) inherently cause forgetting during the re-warming phase and (2) have not been systematically compared to existing continual SSL methods. In this work, we systematically compare the widely used cosine schedule with the recently proposed infinite learning rate schedule and empirically find the latter to be a more effective alternative. Our extensive empirical evaluation across diverse image and language datasets demonstrates that the infinite learning rate schedule consistently enhances continual pre-training performance compared to a repeated cosine decay without being restricted to a fixed iteration budget. For instance, in a small-scale MAE pre-training setup, it outperforms several strong baselines from the literature. We then scale up our experiments to larger MAE pre-training and autoregressive language model pre-training. Our results show that the infinite learning rate schedule remains effective at scale, surpassing repeated cosine decay for both MAE pre-training and zero-shot LM benchmarks.
Distributed pre-training of large models at scale often imposes heavy memory demands on individual nodes and incurs significant intra-node c… (voir plus)ommunication costs. We propose a novel alternative approach that reduces the memory requirements by training small, structured subnetworks of the model on separate workers. Unlike pipelining, our method avoids inter-node activation communication and maintains bandwidth requirements that are comparable to or lower than standard data parallel communication schemes based on all-reduce. We evaluate two subnetwork construction strategies guided by the principle of ensuring uniform representation of each parameter across the distributed training setup. Our results show that the stochastic block dropping technique consistently outperforms the width-wise subnetwork construction previously explored in federated learning. We empirically attribute this superior performance to stronger gradient alignment in subnetworks that retain blocks having skip connections. Preliminary experiments highlight the promise of our approach, achieving a
Foundational vision-language models excel in various tasks but require updates as new tasks or domains emerge. Current Continual Learning (C… (voir plus)L) methods, which focus on supervised training, often suffer from significant forgetting, performing worse than the original models in zero-shot scenarios. This work proposes leveraging test-time, unsupervised data in a self-supervised manner to refresh the model’s memory of previously learned tasks, minimizing forgetting without additional labeling. By introducing a student-teacher framework with gradient-based sparse parameter updates, the approach enhances performance on prior tasks and reduces reliance on offline memory buffers, effectively improving continual learning outcomes.