Build foundational skills in responsible artificial intelligence (AI) through self-paced courses led by internationally recognized Mila experts.
The Mila AI Policy Fellowship turns deep AI expertise into rigorous public-interest policy. Read the latest publication, Bridging the Expertise Gap: Knowledge Transfer Mechanisms for AI Regulation, by Moritz von Knebel.
This program supports AI startups at any time of the year. Benefit from cutting-edge resources and tailored support to accelerate the development of your technology.
Publications
Spatial pattern regression for meteorological fields interpolation
In multiple sclerosis, different types of lesions and their localization can have varying effects on clinical disability and disease progression. Ultra-high field 7-Tesla MRI improves the visualization of cortical, especially subpial, lesions and of white matter lesions with a paramagnetic rim that are associated with smoldering inflammation. Spinal cord atrophy is also a critical determinant of clinical disability in multiple sclerosis, but its importance relative to paramagnetic rim and cortical lesions in predicting neurological disability and its progression remains unclear. In this longitudinal study, we aimed to identify the most relevant predictors of both the baseline Expanded Disability Status Scale status and 4-year progression independent of relapse activity in a heterogeneous multiple sclerosis cohort. One hundred and twelve patients (83 relapsing-remitting and 29 secondary progressive; mean age 42.3 years, mean disease duration 9.8 years) underwent 7-Tesla T2* susceptibility-weighted imaging to segment paramagnetic rim lesions, non-rim white matter lesions and cortical lesions; 3-Tesla T1-weighted brain MRI images extended to the C2-C3 spinal cord were employed to obtain brain volumes and the spinal cord C2-C3 cross-sectional area using FreeSurfer and Spinal Cord Toolbox. Clinical disability was assessed through the Expanded Disability Status Scale at baseline and, in 97/112 patients (86.6%), after a mean follow-up of 4.0 years. The association between imaging metrics and clinical outcome was evaluated using correlations and regression models, corrected for age, sex, treatment class and clinical follow-up time. The main predictors of baseline Expanded Disability Status Scale were cortical lesion (β = 2.9 × 10⁻⁴, P = 0.001), non-rim white matter lesion (β = 1.2 × 10⁻⁴, P < 0.001) volumes, brain white matter volume (β = −15.68, P = 0.017) and C2-C3 cross-sectional area (β = −0.68, P = 0.003).
At follow-up, 23/97 patients (24%) experienced progression independent of relapse activity. Progression independent of relapse activity was associated with paramagnetic rim lesion volume (odds ratio = 1.0006 per mm³ increase, P = 0.030), cortical lesion volume (odds ratio = 1.0005 per mm³ increase, P = 0.011) and brain white matter volume (odds ratio = 0.97 × 10⁻²⁰, P < 0.001). However, a stepwise logistic regression model assessing clinical, lesion and atrophy variables identified cortical lesion volume as the strongest independent predictor of progression independent of relapse activity (odds ratio = 1.0006 per mm³ increase, P = 0.005). In multiple sclerosis, different imaging biomarkers contribute differently to current disability and progression independent of relapse activity. Spinal cord atrophy mainly explains the current Expanded Disability Status Scale, while brain white matter atrophy and paramagnetic rim lesions provide additional insights into future disability trajectory. Among all markers, cortical lesions emerged as the main driver for progression independent of relapse activity.
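The per-mm³ odds ratios reported above look close to 1 only because the unit is so small. As a rough illustration that is not part of the study, the sketch below shows how a per-unit odds ratio compounds over a larger volume change, assuming the usual logistic-regression reading in which log-odds are linear in the predictor. The helper name `scaled_odds_ratio` and the 100 and 1000 mm³ increments are hypothetical choices made for this example.

```python
import math


def scaled_odds_ratio(or_per_unit: float, delta: float) -> float:
    """Odds ratio for a `delta`-unit increase, given the per-unit odds ratio.

    Under a logistic model the log-odds are linear in the predictor, so
    per-unit odds ratios multiply: OR(delta) = OR(1) ** delta.
    """
    return math.exp(delta * math.log(or_per_unit))


# Per-mm^3 odds ratio for cortical lesion volume (stepwise model in the abstract).
OR_CORTICAL = 1.0006

# Hypothetical volume increases in mm^3, chosen only for illustration.
for delta_mm3 in (1, 100, 1000):
    print(delta_mm3, round(scaled_odds_ratio(OR_CORTICAL, delta_mm3), 3))
```

Read this way, an odds ratio of 1.0006 per mm³ corresponds to roughly 1.8 times the odds for a 1000 mm³ larger cortical lesion volume, provided the linear log-odds assumption holds over that range.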
In streaming Reinforcement Learning (RL), transitions are observed and discarded immediately after a single update. While this minimizes resource usage for on-device applications, it makes agents notoriously sample-inefficient, since value-based losses alone struggle to extract meaningful representations from transient data. We propose extending Self-Predictive Representations (SPR) to the streaming pipeline to maximize the utility of every observed frame. However, due to the highly correlated samples induced by the streaming regime, naively applying this auxiliary loss results in training instabilities. Thus, we introduce orthogonal gradient updates relative to the momentum target and resolve gradient conflicts arising from streaming-specific optimizers. Validated across the Atari, MinAtar, and Octax suites, our approach systematically outperforms existing streaming baselines. Latent-space analysis, including t-SNE visualizations and effective-rank measurements, confirms that our method learns significantly richer representations, bridging the performance gap caused by the absence of a replay buffer, while remaining efficient enough to train on just a few CPU cores.
2025-12-31
International Conference on Machine Learning (Accept (regular))
In this work, we propose Salient Sparse Federated Learning (SSFL), a streamlined approach for sparse federated learning with efficient communication. SSFL identifies a sparse subnetwork prior to training, leveraging parameter saliency scores computed separately on local client data in non-IID scenarios, and then aggregated, to determine a global mask. Only the sparse model weights are trained and communicated each round between the clients and the server. On standard benchmarks including CIFAR-10, CIFAR-100, and Tiny-ImageNet, SSFL consistently improves the accuracy-sparsity trade-off, achieving more than 20% relative error reduction on CIFAR-10 compared to the strongest sparse baseline, while reducing communication costs by
Tequila: Deadzone-free Ternary Quantization for Large Language Models
Hong Huang
Decheng Wu
Rui Cen
Guanghua Yu
Zonghang Li
Kai Liu
Jianchen Zhu
Peng Chen
Xue Liu
Dapeng Wu
Quantization techniques are essential for the deployment of Large Language Models (LLMs) on edge devices. However, prevailing methods often rely on mixed-precision multiplication that lacks efficient hardware support, making them infeasible in practice. Ternary weight quantization addresses this by constraining weights to {-1, 0, 1}, replacing expensive multiplications with hardware-efficient additions. However, such aggressive compression leads to significant accuracy degradation, even after costly quantization-aware training with massive data. We identify the core issue as _**deadzone trapping**: a large number of weights are trapped at the deadzone boundary._ This occurs because these weights receive only noisy, less informative gradients, preventing stable escape from the deadzone and severely impeding model capacity and optimization. To address this issue, we propose **Tequila**, a trapping-free quantization optimization method that reactivates deadzone-trapped weights by repurposing them as dynamic biases. This allows the repurposed weights to provide a continuous signal in the forward pass and, critically, receive direct, meaningful gradient signals during backpropagation, thereby enhancing model capacity and optimization with nearly _zero_ inference overhead. Extensive evaluations demonstrate that Tequila outperforms state-of-the-art (SOTA) ternary quantization methods across five benchmarks. Specifically, on the ARC benchmark, it achieves
2025-12-31
International Conference on Learning Representations (Accept (Poster))
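To make the terms in the Tequila abstract concrete, here is a minimal sketch of plain threshold-based ternary quantization. It is a generic baseline, not the Tequila method, and it omits the per-tensor scaling factor that practical schemes usually add. The function names, the example weights, and the threshold value are all hypothetical. Weights whose magnitude is at most the threshold fall into the deadzone and are mapped to 0, and the dot product then needs only additions and subtractions.

```python
from typing import List


def ternarize(weights: List[float], threshold: float) -> List[int]:
    """Map each weight to -1, 0 or +1.

    Weights with |w| <= threshold fall into the deadzone and become 0.
    """
    quantized = []
    for w in weights:
        if w > threshold:
            quantized.append(1)
        elif w < -threshold:
            quantized.append(-1)
        else:
            quantized.append(0)
    return quantized


def ternary_dot(q: List[int], x: List[float]) -> float:
    """Dot product with ternary weights using only additions and subtractions."""
    acc = 0.0
    for qi, xi in zip(q, x):
        if qi == 1:
            acc += xi
        elif qi == -1:
            acc -= xi
        # qi == 0: the deadzone weight contributes nothing to the output
    return acc


# Hypothetical weights, activations and threshold, for illustration only.
weights = [0.8, -0.05, 0.02, -0.6, 0.3]
activations = [1.0, 2.0, 3.0, 4.0, 5.0]
q = ternarize(weights, threshold=0.1)
print(q)                            # [1, 0, 0, -1, 1]
print(ternary_dot(q, activations))  # 2.0
```

In this sketch the two small weights contribute nothing to the output, which is the kind of lost capacity the abstract attributes to weights stuck in the deadzone.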
Scientific foundation models should be built for science, not for generic AI tastes or leaderboard prestige. This workshop centers problem-driven design: models that measurably advance real scientific inquiries, e.g., forecasting extreme climate events, accelerating materials discovery, understanding biological mechanisms, co-developed with domain experts and validated against field data, experiments, and downstream impact. We argue that foundation models for science must be built differently from language and vision. Scientific data are physical, causal, spatiotemporal, and often scarce or biased; objectives must reflect mechanistic fidelity, not just predictive accuracy. This calls for scientific priors and constraints, robust uncertainty quantification (UQ), and architectures that natively handle multi-modality (e.g., grids, meshes, spectra, time series, point clouds, text, images, code). It also demands tight integration with classical scientific tools (simulators, PDE solvers, optimization and inference engines, and HPC workflows) to yield hybrid systems that are faster, more accurate, and more trustworthy. We will highlight opportunities and hard problems unique to science: enforcing conservation laws and symmetries; learning across vast spatial and temporal scales; representing extreme events and tipping points; calibrating and validating UQ; and developing evaluation protocols that reward mechanistic insight and actionable reliability. The goal is a roadmap for building, training, and deploying scientific foundation models that accelerate discovery while respecting the structure of the natural world.
2025-12-31
Workshop Proposals @ International Conference on Learning Representations (published)
Discrete diffusion models have emerged as strong alternatives to autoregressive language models, with recent work initializing and fine-tuning a base unimodal model for bimodal generation. Diverging from previous approaches, we introduce the first tri-modal masked diffusion model pretrained from scratch on text, image-text, and audio-text data. We systematically analyze multimodal scaling laws, modality mixing ratios, noise schedules, and batch-size effects, and we provide optimized inference sampling defaults. Our batch-size analysis yields a novel stochastic differential equation (SDE)-based reparameterization that eliminates the need for tuning the optimal batch size as reported in recent work. This reparameterization decouples the physical batch size, often chosen based on compute constraints (GPU saturation, FLOP efficiency, wall-clock time), from the logical batch size, chosen to balance gradient variance during stochastic optimization. Finally, we pretrain a preliminary 3B-parameter tri-modal model on 6.4T tokens, demonstrating the capabilities of a unified design and achieving strong results in text generation, text-to-image tasks, and text-to-speech tasks. Our work represents the largest-scale systematic open study of multimodal discrete diffusion models conducted to date, providing insights into scaling behaviors across multiple modalities.
Pre-trained transformers have demonstrated remarkable generalization abilities, at times extending beyond the scope of their training data. Yet, real-world deployments often face unexpected or adversarial data that diverges from training data distributions. Without explicit mechanisms for handling such shifts, model reliability and safety degrade, urging more disciplined study of out-of-distribution (OOD) settings for transformers. Through systematic experiments, we present a mechanistic framework for delineating the precise contours of transformer model robustness. We find that OOD inputs, including subtle typos and jailbreak prompts, drive language models to operate on an increased number of fallacious concepts in their internals. We leverage this device to quantify and understand the degree of distributional shift in prompts, enabling a mechanistically grounded fine-tuning strategy to robustify LLMs. Expanding the very notion of OOD from input data to a model’s private computational processes (a new transformer diagnostic at inference time) is a critical step toward making AI systems safe for deployment across science, business, and government.
2025-12-31
International Conference on Machine Learning (Accept (regular))
State-Space Models (SSMs) have recently been shown to achieve strong empirical performance on a variety of long-range sequence modeling tasks while remaining efficient and highly parallelizable. However, the theoretical understanding of their expressive power remains limited. In this work, we study the expressivity of input-Dependent Complex-valued Diagonal (DCD) State-Space Models (SSMs) on sequential state-tracking tasks for abstract groups. It is easy to show that a single DCD SSM layer with a universal decoder can track any Abelian group at finite precision by decomposing it into a product of cyclic groups. We show that this is tight by proving that such a model cannot track any non-Abelian group at finite precision. We further establish the expressivity of multi-layer DCD SSMs. We show that a
2025-12-31
International Conference on Learning Representations (Accept (Poster))
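The cyclic-group case mentioned in the DCD SSM abstract can be sketched in a few lines. The code below is an illustrative construction under stated assumptions, not the paper's proof. It tracks the running sum in Z_n with a one-dimensional complex recurrence h_t = a(x_t) · h_{t-1}, where the input-dependent transition a(x) = exp(2πi·x/n) is a unit-modulus rotation. The simple phase-rounding decoder stands in for the "universal decoder" of the abstract, and the function name is hypothetical.

```python
import cmath
import math
from typing import Iterable, List


def track_cyclic_group(inputs: Iterable[int], n: int) -> List[int]:
    """Track running sums in Z_n with a 1-D complex diagonal recurrence.

    State update: h_t = exp(2*pi*i * x_t / n) * h_{t-1}, with h_0 = 1.
    Decoder (an assumption of this sketch): round the phase of h_t to the
    nearest multiple of 2*pi/n and return its index.
    """
    h = complex(1.0, 0.0)
    decoded = []
    for x in inputs:
        # Input-dependent, complex-valued, diagonal (here scalar) transition.
        h = cmath.exp(2j * math.pi * x / n) * h
        # Map the phase from [-pi, pi] to [0, 2*pi) before rounding.
        angle = cmath.phase(h) % (2 * math.pi)
        decoded.append(round(angle * n / (2 * math.pi)) % n)
    return decoded


# Running sums of 1, 2, 3, 4 modulo 5 are 1, 3, 1, 0.
print(track_cyclic_group([1, 2, 3, 4], n=5))  # [1, 3, 1, 0]
```

A finite Abelian group written as a product of cyclic groups could be handled the same way with one complex coordinate per factor, which is why a diagonal state suffices in this sketch. Because rotations commute, the same trick does not carry over to non-Abelian groups.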
The Clock and Pizza interpretations, associated with architectures differing in either uniform or learnable attention, were introduced to argue that different architectural designs can yield distinct circuits for modular addition. In this work, we show that this is not the case, and that both the uniform and trainable attention architectures implement the same algorithm via topologically and geometrically equivalent representations. Our methodology goes beyond the interpretation of individual neurons and weights. Instead, we identify all of the neurons corresponding to each learned representation and then study the collective group of neurons as one entity. This method reveals that each learned representation is a manifold that we can study utilizing tools from topology. Based on this insight, we can statistically analyze the learned representations across hundreds of circuits to demonstrate the similarity between learned modular addition circuits that arise naturally from common deep learning paradigms.
2025-12-31
International Conference on Learning Representations (Accept (Poster))