Publications
Modular Memory is the Key to Continual Learning Agents
Foundation models have transformed machine learning through large-scale pretraining and increased test-time compute. Despite surpassing human performance in several domains, these models remain fundamentally limited in continuous operation, experience accumulation, and personalization, capabilities that are central to adaptive intelligence. While continual learning research has long targeted these goals, its historical focus on in-weight learning (IWL), i.e., updating a single model's parameters to absorb new knowledge, has rendered catastrophic forgetting a persistent challenge. Our position is that combining the strengths of In-Weight Learning (IWL) and the newly emerged capabilities of In-Context Learning (ICL) through the design of modular memory is the missing piece for continual adaptation at scale. We outline a conceptual framework for modular memory-centric architectures that leverage ICL for rapid adaptation and knowledge accumulation, and IWL for stable updates to model capabilities, charting a practical roadmap toward continually learning agents.
Molecular orbitals (MOs) describe the distribution of electrons in a molecule and are frequently used by chemists to understand properties of molecules, yet machine learning has neglected them so far. If atom coordinates are obtained through DFT anyway, molecular orbitals can be obtained for free at the same time and are thus a useful source of additional data, particularly when data is scarce. We give an introduction to molecular orbitals for a machine learning audience and propose models to process three different representations of them. Experiments on a dataset with experimental properties show that including MOs significantly improves performance and sample efficiency over a pretrained molecular foundation model on this real-world task.
2026-03-01
AI4Mat @ International Conference on Learning Representations (poster)
Goal-conditioned reinforcement learning (GCRL) requires agents to learn effective state and goal representations, which represents a challenging problem, especially in high-dimensional vision-based environments, as differences in the observations can be uncorrelated with dynamical distances. Classical deep reinforcement learning techniques often fail to capture the alignment between state and goal spaces, requiring additional representation learning techniques. To address this, we propose
2026-03-01
World Models @ International Conference on Learning Representations (published)
Vision-language models (VLMs) benefit from multiple vision encoders, but naively stacking them yields diminishing returns while multiplying inference costs. We propose SCOPE, a Mixture-of-Encoders (MoEnc) framework that dynamically selects one specialized encoder per image-text pair via instance-level routing, unlike token-level routing in traditional MoE. SCOPE maintains a shared encoder and a pool of routed encoders. A lightweight router uses cross-attention between text prompts and shared visual features to select the optimal encoder from the routed encoders. To train this router, we introduce dual entropy regularization with auxiliary losses to balance dataset-level load distribution with instance-level routing confidence. Remarkably, SCOPE with one shared plus one routed encoder outperforms models using all four extra encoders simultaneously, while reducing compute by 24-49%. This demonstrates that intelligent encoder selection beats brute-force aggregation, challenging the prevailing paradigm in multi-encoder VLMs.
2026-03-01
MM_Intelligence @ International Conference on Learning Representations (poster)
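The abstract above describes dual entropy regularization only at a high level. As a minimal sketch, assuming one common way to express such a trade-off (not necessarily the authors' exact loss), the instance-level term can be the mean entropy of each routing distribution and the dataset-level term the entropy of the averaged routing distribution. The function names and the weights `alpha` and `beta` are hypothetical.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a discrete probability vector."""
    return -sum(x * math.log(x) for x in p if x > 0.0)


def dual_entropy_terms(routing_probs):
    """routing_probs holds one probability vector over routed encoders per instance."""
    n = len(routing_probs)
    k = len(routing_probs[0])
    # Instance-level term: mean per-instance entropy (low means confident routing).
    instance_entropy = sum(entropy(p) for p in routing_probs) / n
    # Dataset-level term: entropy of the average routing distribution
    # (high means the load is spread evenly across encoders).
    mean_probs = [sum(p[j] for p in routing_probs) / n for j in range(k)]
    load_entropy = entropy(mean_probs)
    return instance_entropy, load_entropy


def dual_entropy_loss(routing_probs, alpha=1.0, beta=1.0):
    """Assumed combination: penalize uncertain routing, reward balanced load."""
    instance_entropy, load_entropy = dual_entropy_terms(routing_probs)
    return alpha * instance_entropy - beta * load_entropy


# Confident and balanced routing gives the lowest loss of the three cases.
balanced = [[1.0, 0.0], [0.0, 1.0]]
collapsed = [[1.0, 0.0], [1.0, 0.0]]
uncertain = [[0.5, 0.5], [0.5, 0.5]]
for name, probs in [("balanced", balanced), ("collapsed", collapsed), ("uncertain", uncertain)]:
    print(name, round(dual_entropy_loss(probs), 4))
```

Under this assumed form, a router that sends every input to the same encoder and a router that is maximally unsure about every input both score worse than one that is confident per instance while still using all encoders.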
Sensory organization at the spinal segment level is commonly inferred from dermatomal maps that assume a fixed correspondence between cutaneous regions and spinal segments. However, based on the complexities of spinal neuroanatomy and neurophysiology, the distribution of sensory signals within the cord may be broader and less segment-specific than dermatomal maps suggest, leaving the segment-level localization of sensory-evoked activity in humans uncertain. Spinal cord functional magnetic resonance imaging (fMRI) is currently the only technique capable of noninvasively mapping sensory activity with high spatial resolution in the human spinal cord. However, its application remains technically challenging and is limited by the uncertainty in segmental localization. In this study, we leveraged recent advancements in spinal cord fMRI, including spinal nerve rootlet-based spatial normalization, to investigate how sensory information is represented and distributed within the human spinal cord during electrocutaneous stimulation of the third digit of the right hand (i.e., C7 dermatome). Forty healthy adults were scanned with electrocutaneous stimulation at four individualized intensities across multiple runs to quantify (i) the rostrocaudal distribution of sensory-evoked activity, (ii) intensity-dependent changes in detectability and localization, and (iii) the effect of normalization strategy on segmental localization. Across participants, stimulation produced activation localized in the lower cervical cord (e.g., C6-C8), with the most consistent segmental localization near C7. Stronger stimulation increased detectability and produced more consistent segmental localization across participants. Importantly, normalization that incorporated nerve rootlet landmarks sharpened localization and improved sensitivity relative to conventional intervertebral disc-based alignment.
This highlights the value of functionally relevant anatomical landmarks for group inference in the spinal cord. Responses were strongest in the initial run and attenuated with repetition, suggesting habituation or adaptation that can bias multi-run paradigms if unmodeled. Together, our results define practical acquisition and analysis conditions (e.g., stimulation strength, anatomical alignment strategy, and run structure) under which segment-level spinal sensory responses can be detected, thereby supporting more reliable future basic and translational studies of the human spinal cord, including studies of pain mechanisms, sensory function, and spinal injury.
Effective exploration in reinforcement learning requires not only tracking where an agent has been, but also understanding how the agent perceives and represents the world. To learn powerful representations, an agent should actively explore states that contribute to its knowledge of the environment. Temporal representations can capture the information necessary to solve a wide range of potential tasks while avoiding the computational cost associated with full state reconstruction. In this paper, we propose an exploration method that leverages temporal contrastive representations to guide exploration, prioritizing states with unpredictable future outcomes. We demonstrate that such representations can enable the learning of complex exploratory behaviors in locomotion, manipulation, and embodied-AI tasks, revealing capabilities and behaviors that traditionally require extrinsic rewards. Unlike approaches that rely on explicit distance learning or episodic memory mechanisms (e.g., quasimetric-based methods), our method builds directly on temporal similarities, yielding a simpler yet effective strategy for exploration.
Optimization in deep learning has expanded beyond Euclidean methods to include entrywise sign updates (SignSGD) and spectral sign updates (SpecGD/Muon). While both can be viewed as steepest descent under non-Euclidean geometries (
2026-03-01
GRaM @ International Conference on Learning Representations (poster)
Titanium nanotube arrays promote the activity of anastomotic healing-related cells by increasing fibronectin adsorption and activating the RGD–integrin pathway
The smooth titanium staples of stapling devices cannot reduce the incidence of gastrointestinal anastomotic leakage due to their bioinert nature and lack of active wound-healing promotion capability. This study aims to investigate whether titanium nanotube arrays (TNTs) can enhance the activity of cells involved in gastrointestinal anastomotic healing and further explore the potential mechanisms. TNTs were fabricated on pure titanium sheets via anodic oxidation, and characterized using scanning electron microscopy, roughness analysis, contact angle measurement, and X-ray photoelectron spectroscopy. Cell adhesion, proliferation, spreading, collagen secretion, and integrin expression were evaluated using methods such as CCK-8, immunofluorescence, qPCR, enzyme-linked immunosorbent assay (ELISA), and Western blot. Fibronectin (FN) adsorption and Arg-Gly-Asp tripeptide sequence (RGD domain) exposure were detected via bicinchoninic acid assay, fluorescent staining, and ELISA. The role of the RGD-integrin pathway was further investigated by supplementing serum-reduced medium with exogenous FN and using RGD-specific antagonists. The results showed that TNTs increased the roughness, hydrophilicity, and surface free energy of titanium surfaces. Compared with smooth pure titanium, TNTs promoted the adhesion, proliferation, spreading, and integrin expression of gastric mucosal epithelial cells and fibroblasts, while enhancing the collagen secretion capacity of fibroblasts. Moreover, TNTs adsorbed more FN and exposed more RGD domains, thereby upregulating integrin α5β1 expression. The RGD antagonist could reverse these enhanced cellular responses, confirming the pivotal role of the FN–RGD–integrin pathway.
In conclusion, TNTs enhance the adhesion, proliferation, and functional activity of gastrointestinal anastomosis-related cells by promoting FN adsorption and activating the RGD–integrin pathway, demonstrating that TNT-modified titanium materials hold significant potential for developing bioactive anastomotic devices and promoting tissue healing.
Vision-Language-Action (VLA) models show strong generalization for robotic control, but finetuning them with reinforcement learning (RL) is constrained by the high cost and safety risks of real-world interaction. Training VLA models in interactive world models avoids these issues but introduces several challenges, including pixel-level world modeling, multi-view consistency, and compounding errors under sparse rewards. Building on recent advances across multimodal models and model-based RL, we propose **VLA-MBPO**, a practical world model-based RL framework to tackle these problems in VLA finetuning. Our approach is guided by three key design choices: (i) adapting *unified multimodal models (UMMs)* to VLA settings, leveraging rich multimodal priors to enable world modeling with limited data; (ii) introducing an *interleaved view decoding* mechanism to enforce consistency across views; and (iii) employing *chunk-level branched rollout* to limit rollout horizons and mitigate error compounding during policy optimization. Our theoretical analysis shows a reduction in value gap of VLA-MBPO, and experiments in both simulated and real-world tasks demonstrate that our method effectively improves policy performance and sample efficiency for VLA finetuning.
2026-03-01
World Models @ International Conference on Learning Representations (published)
Large language models (LLMs) have enabled web agents that follow natural language goals through multi-step browser interactions. However, agents fine-tuned on specific trajectories and domains often struggle to generalize out of domain, and offline training can be compute-inefficient due to noisy, redundant trajectories and long accessibility-tree (AXTree) states. To address both issues, we propose Weasel, a trajectory selection method for offline training of web agents. Weasel selects a fixed-budget subset of trajectory steps by optimizing an objective that balances unary importance with pairwise diversity over states, websites, and interaction patterns, which is solved efficiently with a greedy algorithm. We further improve efficiency with action-centered AXTree pruning that keeps only content around the ground-truth action target, and we mitigate style mismatch for reasoning-native models by replacing expert traces with model-generated, style-consistent rationales. Across AgentTrek and NNetNav training datasets, evaluations in WebArena, WorkArena, and MiniWob, and experiments with Qwen2.5-7B, Gemma3-4B, and Qwen3-8B, Weasel improves out-of-domain performance while reducing training cost, producing roughly 9.7-12.5
2026-03-01
LLA @ International Conference on Learning Representations (poster)
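The abstract above mentions a greedy algorithm that selects a fixed-budget subset by trading off unary importance against pairwise diversity. The following is a generic sketch of that kind of selection, not Weasel's actual objective: the scoring rule (importance minus a weighted redundancy penalty), the `same_site` similarity, the weight `lam`, and all numbers are hypothetical choices made for illustration.

```python
def greedy_select(importance, similarity, budget, lam=1.0):
    """Greedily pick up to `budget` indices by importance minus redundancy."""
    selected = []
    remaining = set(range(len(importance)))
    while remaining and len(selected) < budget:

        def gain(i):
            # Marginal gain: unary importance, penalized by similarity
            # to everything that has already been selected.
            redundancy = sum(similarity(i, j) for j in selected)
            return importance[i] - lam * redundancy

        # Iterating in sorted order breaks ties toward the lowest index.
        best = max(sorted(remaining), key=gain)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical trajectory steps: an importance score and a source website each.
importance = [0.9, 0.8, 0.5, 0.4]
sites = ["shop", "shop", "maps", "wiki"]


def same_site(i, j):
    """Toy similarity: 1.0 when two steps come from the same website."""
    return 1.0 if sites[i] == sites[j] else 0.0


print(greedy_select(importance, same_site, budget=3, lam=0.5))  # [0, 2, 3]
print(greedy_select(importance, same_site, budget=3, lam=0.0))  # [0, 1, 2]
```

With the penalty switched on, the second "shop" step is skipped in favor of steps from other sites, even though its importance alone is higher. With `lam=0.0` the selection reduces to picking the highest-scoring steps.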
Which Heads Matter for Reasoning? RL-Guided KV Cache Compression
Wenjie Du
Li Jiang
Keda Tao
Xue Liu
Huan Wang
Reasoning large language models exhibit complex reasoning behaviors via extended chain-of-thought generation that are highly fragile to information loss during decoding, creating critical challenges for KV cache compression.
Existing token-dropping methods directly disrupt reasoning chains by removing intermediate steps, while head-reallocation methods, designed for retrieval tasks, fail to preserve the heads essential for generative reasoning.
However, no existing method can identify which attention heads genuinely maintain reasoning consistency and control generation termination.
To address this, we propose RLKV, which uses reinforcement learning as a probe to discover which heads contribute to reasoning quality by directly optimizing their cache usage against actual generation outcomes.
This discovery naturally leads to an efficient compression strategy: we allocate full KV cache to reasoning-critical heads while aggressively compressing others with constant-size KV cache.
Experiments reveal that a fraction of heads proves essential for reasoning, enabling 20-50% cache reduction with near-lossless performance across diverse tasks and models.
2026-03-01
LIT @ International Conference on Learning Representations (accepted)
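The compression strategy in the abstract above (full KV cache for reasoning-critical heads, constant-size cache for the rest) implies a simple memory calculation. The sketch below is back-of-the-envelope arithmetic only, assuming every head stores one cache entry per retained token. The function name and the head counts, sequence length, and compressed size in the example are hypothetical and not taken from the paper.

```python
def kv_cache_reduction(num_heads, num_full_heads, seq_len, compressed_len):
    """Fraction of KV-cache entries saved versus a full cache for every head."""
    if not 0 <= num_full_heads <= num_heads:
        raise ValueError("num_full_heads must be between 0 and num_heads")
    # A compressed head never stores more than the sequence itself.
    per_compressed_head = min(compressed_len, seq_len)
    kept = (num_full_heads * seq_len
            + (num_heads - num_full_heads) * per_compressed_head)
    return 1.0 - kept / (num_heads * seq_len)


# Hypothetical setting: 32 heads, half kept at full length, the rest capped
# at 128 entries, for a sequence of 8192 tokens.
print(kv_cache_reduction(32, 16, 8192, 128))  # 0.4921875
```

Because the compressed heads hold a constant number of entries, the saving approaches the fraction of compressed heads as the sequence grows, so the achievable reduction is governed mainly by how many heads must keep a full cache.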