Publications
Spatial pattern regression for meteorological fields interpolation
Abstract: In multiple sclerosis, different types of lesions and their localization can have varying effects on clinical disability and disease progression. Ultra-high field 7-Tesla MRI improves the visualization of cortical, especially subpial, lesions and of white matter lesions with a paramagnetic rim that are associated with smoldering inflammation. Spinal cord atrophy is also a critical determinant of clinical disability in multiple sclerosis, but its importance relative to paramagnetic rim and cortical lesions in predicting neurological disability and its progression remains unclear. In this longitudinal study, we aimed to identify the most relevant predictors of both the baseline Expanded Disability Status Scale status and 4-year progression independent of relapse activity in a heterogeneous multiple sclerosis cohort. One-hundred-twelve patients (83 relapsing-remitting and 29 secondary progressive; mean age 42.3 years, mean disease duration 9.8 years) underwent 7-Tesla T2* susceptibility-weighted images to segment paramagnetic rim lesions, non-rim white matter lesions and cortical lesions; 3-Tesla T1-weighted brain MRI images extended to the C2-C3 spinal cord were employed to obtain brain volumes and the spinal cord C2-C3 cross-sectional area using FreeSurfer and Spinal Cord Toolbox. Clinical disability was assessed through the Expanded Disability Status Scale at baseline and, in 97/112 patients (86.6%), after a mean follow-up of 4.0 years. The association between imaging metrics and clinical outcome was evaluated using correlations and regression models, corrected for age, sex, treatment class and clinical follow-up time. The main predictors of baseline Expanded Disability Status Scale were cortical lesion (β = 2.9 × 10⁻⁴, P = 0.001), non-rim white matter lesion (β = 1.2 × 10⁻⁴, P < 0.001) volumes, brain white matter volume (β = −15.68, P = 0.017) and C2-C3 cross-sectional area (β = −0.68, P = 0.003).
At follow-up, 23/97 patients (24%) experienced progression independent of relapse activity. Progression independent of relapse activity was associated with paramagnetic rim lesion volume (odds ratio = 1.0006 per mm³ increase, P = 0.030), cortical lesion volume (odds ratio = 1.0005 per mm³ increase, P = 0.011) and brain white matter volume (odds ratio = 0.97 × 10⁻²⁰, P < 0.001). However, a stepwise logistic regression model assessing clinical, lesion and atrophy variables identified cortical lesion volume as the strongest independent predictor of progression independent of relapse activity (odds ratio = 1.0006 per mm³ increase, P = 0.005). In multiple sclerosis, different imaging biomarkers contribute differently to current disability and progression independent of relapse activity. Spinal cord atrophy mainly explains the current Expanded Disability Status Scale, while brain white matter atrophy and paramagnetic rim lesions provide additional insights into future disability trajectory. Among all markers, cortical lesions emerged as the main driver for progression independent of relapse activity.
In streaming Reinforcement Learning (RL), transitions are observed and discarded immediately after a single update. While this minimizes resource usage for on-device applications, it makes agents notoriously sample-inefficient, since value-based losses alone struggle to extract meaningful representations from transient data. We propose extending Self-Predictive Representations (SPR) to the streaming pipeline to maximize the utility of every observed frame. However, due to the highly correlated samples induced by the streaming regime, naively applying this auxiliary loss results in training instabilities. Thus, we introduce orthogonal gradient updates relative to the momentum target and resolve gradient conflicts arising from streaming-specific optimizers. Validated across the Atari, MinAtar, and Octax suites, our approach systematically outperforms existing streaming baselines. Latent-space analysis, including t-SNE visualizations and effective-rank measurements, confirms that our method learns significantly richer representations, bridging the performance gap caused by the absence of a replay buffer, while remaining efficient enough to train on just a few CPU cores.
2025-12-31
International Conference on Machine Learning (Accept (regular))
In this work, we propose Salient Sparse Federated Learning (SSFL), a streamlined approach for sparse federated learning with efficient communication. SSFL identifies a sparse subnetwork prior to training, leveraging parameter saliency scores computed separately on local client data in non-IID scenarios, and then aggregated, to determine a global mask. Only the sparse model weights are trained and communicated each round between the clients and the server. On standard benchmarks including CIFAR-10, CIFAR-100, and Tiny-ImageNet, SSFL consistently improves the accuracy-sparsity trade-off, achieving more than 20% relative error reduction on CIFAR-10 compared to the strongest sparse baseline, while reducing communication costs by
Tequila: Deadzone-free Ternary Quantization for Large Language Models
Hong Huang
Decheng Wu
Rui Cen
Guanghua Yu
Zonghang Li
Kai Liu
Jianchen Zhu
Peng Chen
Xue Liu
Dapeng Wu
Quantization techniques are essential for the deployment of Large Language Models (LLMs) on edge devices. However, prevailing methods often rely on mixed-precision multiplication that lacks efficient hardware support, making them impractical. Ternary weight quantization addresses this by constraining weights to {-1, 0, 1}, replacing expensive multiplications with hardware-efficient additions. However, such aggressive compression leads to significant accuracy degradation, even after costly quantization-aware training with massive data. We identify the core issue as _**deadzone trapping**: a large number of weights are trapped at the deadzone boundary._ This occurs because these weights receive only noisy, less informative gradients, preventing stable escape from the deadzone and severely impeding model capacity and optimization. To address this issue, we propose **Tequila**, a trapping-free quantization optimization method that reactivates deadzone-trapped weights by repurposing them as dynamic biases. This allows the repurposed weights to provide a continuous signal in the forward pass and, critically, receive direct, meaningful gradient signals during backpropagation, thereby enhancing model capacity and optimization with nearly _zero_ inference overhead. Extensive evaluations demonstrate that Tequila outperforms state-of-the-art (SOTA) ternary quantization methods across five benchmarks. Specifically, on the ARC benchmark, it achieves
2025-12-31
International Conference on Learning Representations (Accept (Poster))
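As a rough illustration of the ternary setting described in the Tequila abstract above, the following sketch maps weights to {-1, 0, 1} with a symmetric deadzone and evaluates a dot product using additions only. It is not the Tequila method: the threshold rule (0.7 times the mean absolute weight) is borrowed from the earlier Ternary Weight Networks heuristic as an assumption, and the names `ternary_quantize`, `ternary_dot`, and `threshold_ratio` are hypothetical.

```python
def ternary_quantize(weights, threshold_ratio=0.7):
    """Map real weights to codes in {-1, 0, 1} plus one shared scale.

    Assumption: the deadzone half-width is threshold_ratio * mean(|w|),
    a common heuristic, not something stated in the abstract.
    """
    mean_abs = sum(abs(w) for w in weights) / len(weights)
    delta = threshold_ratio * mean_abs
    # Weights inside [-delta, delta] fall in the deadzone and become 0.
    codes = [0 if abs(w) <= delta else (1 if w > 0 else -1) for w in weights]
    # The scale is the mean magnitude of the weights that survived.
    kept = [abs(w) for w, c in zip(weights, codes) if c != 0]
    scale = sum(kept) / len(kept) if kept else 0.0
    return codes, scale, delta


def ternary_dot(codes, scale, inputs):
    """Dot product with ternary codes: only additions and subtractions,
    followed by a single multiplication by the shared scale."""
    acc = 0.0
    for c, x in zip(codes, inputs):
        if c == 1:
            acc += x
        elif c == -1:
            acc -= x
    return scale * acc


if __name__ == "__main__":
    codes, scale, delta = ternary_quantize([0.9, -0.8, 0.05, -0.02, 0.4])
    print(codes, scale, delta)
    print(ternary_dot(codes, scale, [1.0, 2.0, 3.0, 4.0, 5.0]))
```

In this sketch the weights mapped to 0 contribute nothing to the forward pass, which is the region the abstract calls the deadzone.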
Scientific foundation models should be built for science, not for generic AI tastes or leaderboard prestige. This workshop centers problem-driven design: models that measurably advance real scientific inquiries, e.g., forecasting extreme climate events, accelerating materials discovery, understanding biological mechanisms, co-developed with domain experts and validated against field data, experiments, and downstream impact. We argue that foundation models for science must be built differently from language and vision. Scientific data are physical, causal, spatiotemporal, and often scarce or biased; objectives must reflect mechanistic fidelity, not just predictive accuracy. This calls for scientific priors and constraints, robust uncertainty quantification (UQ), and architectures that natively handle multi-modality (e.g., grids, meshes, spectra, time series, point clouds, text, images, code). It also demands tight integration with classical scientific tools (simulators, PDE solvers, optimization and inference engines, and HPC workflows) to yield hybrid systems that are faster, more accurate, and more trustworthy. We will highlight opportunities and hard problems unique to science: enforcing conservation laws and symmetries; learning across vast spatial and temporal scales; representing extreme events and tipping points; calibrating and validating UQ; and developing evaluation protocols that reward mechanistic insight and actionable reliability. The goal is a roadmap for building, training, and deploying scientific foundation models that accelerate discovery while respecting the structure of the natural world.
2025-12-31
Workshop Proposals @ International Conference on Learning Representations (published)
Discrete diffusion models have emerged as strong alternatives to autoregressive language models, with recent work initializing and fine-tuning a base unimodal model for bimodal generation. Diverging from previous approaches, we introduce the first tri-modal masked diffusion model pretrained from scratch on text, image-text, and audio-text data. We systematically analyze multimodal scaling laws, modality mixing ratios, noise schedules, and batch-size effects, and we provide optimized inference sampling defaults. Our batch-size analysis yields a novel stochastic differential equation (SDE)-based reparameterization that eliminates the need for tuning the optimal batch size as reported in recent work. This reparameterization decouples the physical batch size, often chosen based on compute constraints (GPU saturation, FLOP efficiency, wall-clock time), from the logical batch size, chosen to balance gradient variance during stochastic optimization. Finally, we pretrain a preliminary 3B-parameter tri-modal model on 6.4T tokens, demonstrating the capabilities of a unified design and achieving strong results in text generation, text-to-image tasks, and text-to-speech tasks. Our work represents the largest-scale systematic open study of multimodal discrete diffusion models conducted to date, providing insights into scaling behaviors across multiple modalities.
Pre-trained transformers have demonstrated remarkable generalization abilities, at times extending beyond the scope of their training data. Yet, real-world deployments often face unexpected or adversarial data that diverges from training data distributions. Without explicit mechanisms for handling such shifts, model reliability and safety degrade, urging more disciplined study of out-of-distribution (OOD) settings for transformers. Through systematic experiments, we present a mechanistic framework for delineating the precise contours of transformer model robustness. We find that OOD inputs, including subtle typos and jailbreak prompts, drive language models to operate on an increased number of fallacious concepts in their internals. We leverage this device to quantify and understand the degree of distributional shift in prompts, enabling a mechanistically grounded fine-tuning strategy to robustify LLMs. Expanding the very notion of OOD from input data to a model’s private computational processes (a new transformer diagnostic at inference time) is a critical step toward making AI systems safe for deployment across science, business, and government.
2025-12-31
International Conference on Machine Learning (Accept (regular))
State-Space Models (SSMs) have recently been shown to achieve strong empirical performance on a variety of long-range sequence modeling tasks while remaining efficient and highly parallelizable. However, the theoretical understanding of their expressive power remains limited. In this work, we study the expressivity of input-Dependent Complex-valued Diagonal (DCD) SSMs on sequential state-tracking tasks for abstract groups. It is easy to show that a single DCD SSM layer with a universal decoder can track any Abelian group at finite precision by decomposing it into a product of cyclic groups. We show that this is tight by proving that such a model cannot track any non-Abelian group at finite precision. We further establish the expressivity of multi-layer DCD SSMs. We show that a
2025-12-31
International Conference on Learning Representations (Accept (Poster))
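The claim in the abstract above that a single input-dependent, complex-valued diagonal recurrence can track a cyclic group can be illustrated with a one-channel sketch. The code below is an assumption-laden toy, not the paper's construction: the function name `track_cyclic` is hypothetical, the state is a single complex number, and the "decoder" is simply rounding the phase angle. Each input x multiplies the state by the rotation exp(2πi·x/n), so the phase encodes the running sum modulo n.

```python
import cmath
import math


def track_cyclic(inputs, n):
    """Track the running sum of `inputs` modulo n with the scalar
    recurrence h_t = a(x_t) * h_{t-1}, where a(x) = exp(2*pi*i*x/n)."""
    h = 1 + 0j  # identity element of Z_n, encoded as angle 0
    states = []
    for x in inputs:
        a = cmath.exp(2j * math.pi * x / n)  # input-dependent, |a| = 1
        h = a * h
        # Decode: map the phase back to the nearest of the n group elements.
        angle = cmath.phase(h) % (2 * math.pi)
        states.append(round(angle * n / (2 * math.pi)) % n)
    return states


if __name__ == "__main__":
    print(track_cyclic([3, 4, 2, 6], 7))
```

A product of cyclic groups would use one such channel per factor, which matches the diagonal (channel-wise independent) structure the abstract refers to. Because channel-wise complex multiplication commutes, a sketch of this form cannot distinguish the order of its inputs.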
The Clock and Pizza interpretations, associated with architectures differing in either uniform or learnable attention, were introduced to argue that different architectural designs can yield distinct circuits for modular addition. In this work, we show that this is not the case, and that both the uniform and trainable attention architectures implement the same algorithm via topologically and geometrically equivalent representations. Our methodology goes beyond the interpretation of individual neurons and weights. Instead, we identify all of the neurons corresponding to each learned representation and then study the collective group of neurons as one entity. This method reveals that each learned representation is a manifold that we can study utilizing tools from topology. Based on this insight, we can statistically analyze the learned representations across hundreds of circuits to demonstrate the similarity between learned modular addition circuits that arise naturally from common deep learning paradigms.
2025-12-31
International Conference on Learning Representations (Accept (Poster))