Step-by-step verifiers, also known as process reward models (PRMs), are a key ingredient for test-time scaling, but training them requires expensive step-level supervision. This work aims to build data-efficient PRMs as verbalized step-wise reward models that verify every step in the solution by generating a verification chain-of-thought (CoT). We propose ThinkPRM, a long CoT verifier fine-tuned on orders of magnitude fewer process labels than those required by discriminative PRMs. Our approach capitalizes on the inherent reasoning abilities of long CoT models, and outperforms LLM-as-a-Judge and discriminative verifiers across several challenging benchmarks while using only 1% of the process labels in PRM800K. Specifically, ThinkPRM beats the baselines on ProcessBench, MATH-500, and AIME ’24 under best-of-N selection and reward-guided search. In an out-of-domain evaluation over subsets of GPQA-Diamond and LiveCodeBench, our PRM surpasses discriminative verifiers trained with the full PRM800K by 8% and 4.5%, respectively. Lastly, under the same token budget, ThinkPRM scales up verification compute more effectively than LLM-as-a-Judge, outperforming it by 7.2% on a subset of ProcessBench. This work highlights the value of generative, long CoT PRMs that can scale test-time compute for verification while requiring minimal supervision for training.
2026-03-08
Transactions on Machine Learning Research (accepted)
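The best-of-N selection mentioned in the abstract above can be illustrated with a minimal sketch. Everything here is an assumption for illustration: `toy_scorer` is a hypothetical stand-in for a step-level verifier, and aggregating step scores by their minimum is one common choice, not necessarily the one used by ThinkPRM.

```python
from typing import Callable, List, Sequence

# Hypothetical signature: (all steps of a solution, index of step) -> score in [0, 1].
StepScorer = Callable[[List[str], int], float]


def best_of_n(candidates: Sequence[List[str]], step_scorer: StepScorer) -> int:
    """Return the index of the candidate whose weakest step scores highest."""

    def solution_score(steps: List[str]) -> float:
        if not steps:
            return float("-inf")
        # Assumption: a solution is only as good as its worst step.
        return min(step_scorer(steps, i) for i in range(len(steps)))

    return max(range(len(candidates)), key=lambda k: solution_score(candidates[k]))


def toy_scorer(steps: List[str], i: int) -> float:
    # Stand-in for a verifier: penalize any step flagged as an error.
    return 0.1 if "error" in steps[i] else 0.9


candidates = [
    ["x = 2 + 2", "error: x = 5"],
    ["x = 2 + 2", "x = 4"],
]
best_index = best_of_n(candidates, toy_scorer)
```

In practice the scorer would be a trained verifier and the candidates would be N sampled solutions to the same problem.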
Graphical user interface (GUI)-based mobile agents automate digital tasks on mobile devices by interpreting natural-language instructions and interacting with the screen. While recent methods apply reinforcement learning (RL) to train vision-language-model (VLM) agents in interactive environments with a primary focus on performance, generalization remains underexplored due to the lack of standardized benchmarks and open-source RL systems. In this work, we formalize the problem as a Contextual Markov Decision Process (CMDP) and introduce AndroidWorld-Generalization, a benchmark with three increasingly challenging regimes for evaluating zero-shot generalization to unseen task instances, templates, and applications. We further propose an RL training system that integrates Group Relative Policy Optimization (GRPO) with a scalable rollout collection system, consisting of containerized infrastructure and asynchronous execution to support reliable and efficient training. Experiments on AndroidWorld-Generalization show that RL enables a 7B-parameter VLM agent to surpass supervised fine-tuning baselines, yielding a 26.1% improvement on unseen instances but only limited gains on unseen templates (15.7%) and apps (8.3%), underscoring the challenges of generalization. As a preliminary step, we demonstrate that few-shot adaptation at test-time improves performance on unseen apps, motivating future research in this direction. To support reproducibility and fair comparison, we open-source the full RL training system, including the environment, task suite, models, prompt configurations, and the underlying infrastructure (https://github.com/zihuanjiang/AndroidWorld-Generalization).
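The group-relative idea behind GRPO can be sketched in a few lines. This is a minimal illustration of the commonly described formulation, in which each rollout's reward is normalized by the mean and standard deviation of its group. The function name, the use of the population standard deviation, and the `eps` constant are assumptions, not details taken from the authors' training system.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of rollouts for the same task."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    # eps keeps the division finite when every rollout gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts of the same task: two succeed (reward 1), two fail (reward 0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Successful rollouts receive positive advantages and failed ones negative, without a learned value function. A group with identical rewards yields zero advantages and therefore no learning signal.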
On evolutionary timescales, brain circuits adapt to support survival in each species' ecological niche. While some anatomical aspects of neural circuitry are conserved across species with distant evolutionary origins, each species also exhibits specific circuit adaptations that enable its behavioral repertoire. It remains unclear whether homologous brain regions leverage analogous neural computations as different species perform common behaviors such as reaching and manipulating objects. Here, we directly assessed conservation of neural computations using intracortical recordings from mouse, monkey, and human motor cortex, a homologous region across many mammals, during motor behaviors crucial for survival. We hypothesized that, despite their phylogenetic distance, rodents and primates produce movements through conserved neural computations implemented by motor cortical population dynamics. Remarkably, we found that movement-related neural dynamics were highly conserved across species, while variations in behavioral output were uniquely captured in neural trajectory geometries. Strikingly, neural dynamics during movement across species were more conserved than those across brain regions in the same human and between motor preparation and execution in the same monkeys. Lastly, through manipulation of neural network models trained to perform reaching movements, we reinforce that conservation of neural dynamics across species likely stems from shared circuit constraints. We thus assert that evolution maintains neural computations across phylogeny even as behavioral repertoires expand.
Unmanned aerial vehicle (UAV) communications demand accurate yet interpretable air-to-ground (A2G) channel models that can adapt to non-stationary propagation environments. Deterministic models offer interpretability but are rigid, while deep learning (DL) models provide accuracy but lack explainability. To bridge this gap, we propose the Physics-Inspired Kolmogorov-Arnold Network (PIKAN), which embeds physical principles (e.g., free-space path loss, two-ray reflections) into the learning process. Unlike physics-informed neural networks (PINNs), PIKAN introduces these principles as adaptable inductive biases, which allows physical information to be applied more flexibly during training. Experiments on UAV A2G measurement data show that PIKAN achieves comparable accuracy to DL models while providing symbolic and explainable expressions aligned with propagation laws. Remarkably, PIKAN achieves this performance with only 232 parameters, making it up to 37 times lighter than multilayer perceptron (MLP) baselines with thousands of parameters, without sacrificing correlation with measurements. These results highlight PIKAN as an efficient, interpretable, and scalable solution for UAV channel modelling in beyond-5G and 6G networks.
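As a concrete instance of the kind of physical prior named above, free-space path loss follows the standard Friis-derived expression FSPL(dB) = 20 log10(4πdf/c). The sketch below only evaluates that textbook formula; the function name and the example link parameters are assumptions, and nothing here reflects how PIKAN itself encodes the prior.

```python
import math

SPEED_OF_LIGHT_M_S = 299_792_458.0


def fspl_db(distance_m: float, frequency_hz: float) -> float:
    """Free-space path loss in dB for isotropic antennas."""
    if distance_m <= 0 or frequency_hz <= 0:
        raise ValueError("distance and frequency must be positive")
    return 20.0 * math.log10(
        4.0 * math.pi * distance_m * frequency_hz / SPEED_OF_LIGHT_M_S
    )


# Hypothetical link: 1 km separation at 2.4 GHz.
loss_1km = fspl_db(1_000.0, 2.4e9)
loss_2km = fspl_db(2_000.0, 2.4e9)
```

Doubling the distance adds 20 log10(2), about 6 dB, which is the inverse-square behavior a learned channel model would be expected to stay consistent with in line-of-sight conditions.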
While recent 3D generative models can produce high-quality texture images, they often fail to capture human preferences or meet task-specific requirements. Moreover, a core challenge in the 3D texture generation domain is that most existing approaches rely on repeated calls to 2D text-to-image generative models, which lack an inherent understanding of the 3D structure of the input 3D mesh object. To alleviate these issues, we propose an end-to-end differentiable, reinforcement-learning-free framework that embeds human feedback, expressed as differentiable reward functions, directly into the 3D texture synthesis pipeline. By back-propagating preference signals through both geometric and appearance modules of the proposed framework, our method generates textures that respect the 3D geometry structure and align with desired criteria. To demonstrate its versatility, we introduce three novel geometry-aware reward functions, which offer a more controllable and interpretable pathway for creating high-quality 3D content from natural language. By conducting qualitative, quantitative, and user-preference evaluations against state-of-the-art methods, we demonstrate that our proposed strategy consistently outperforms existing approaches. Our implementation code is publicly available at: https://github.com/AHHHZ975/Differentiable-Texture-Learning
2026-03-05
IEEE/CVF Winter Conference on Applications of Computer Vision (published)
Inferring the governing dynamics of differentiation that capture cell state evolution remains a central challenge in single-cell biology. We present Latent Space Dynamics (LSD), a thermodynamics-inspired framework that models cell differentiation as evolution on a learned Waddington landscape in latent space. LSD jointly infers a low-dimensional cell state, a differentiable potential function governing developmental flow, and a local entropy term that quantifies cellular plasticity. Using a neural ordinary differential equation, LSD reconstructs continuous differentiation trajectories from time-ordered single-cell data. Across diverse developmental systems, LSD accurately recovers lineage hierarchies, predicts fate commitment for unseen cell types, and outperforms existing trajectory inference approaches in directional accuracy. Moreover, in silico gene perturbations reveal how individual regulators reshape the landscape, and entropy provides a quantitative measure of plasticity in development and cancer.
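The idea of a cell state flowing downhill on a potential landscape can be illustrated with a toy gradient flow, dz/dt = -dU/dz. The sketch below is an assumption-laden stand-in: it uses a hand-written one-dimensional double-well potential U(z) = (z² - 1)² and fixed-step Euler integration, whereas the framework described above learns the potential and integrates it with a neural ODE.

```python
def potential_gradient(z: float) -> float:
    """Derivative of the toy double-well potential U(z) = (z**2 - 1)**2."""
    return 4.0 * z * (z * z - 1.0)


def integrate_flow(z0: float, dt: float = 0.01, steps: int = 2000) -> float:
    """Euler integration of dz/dt = -dU/dz starting from z0."""
    z = z0
    for _ in range(steps):
        z -= dt * potential_gradient(z)
    return z


# Two nearby starting states on opposite sides of the barrier at z = 0
# settle into different wells, a cartoon of fate commitment.
fate_right = integrate_flow(0.5)
fate_left = integrate_flow(-0.5)
```

The two wells at z = ±1 play the role of committed fates, and the barrier at z = 0 plays the role of an uncommitted progenitor state.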
Populus tremuloides as a natural fire barrier in Canada’s boreal forest under a changing climate
Flavie Pelletier
Jeffrey A. Cardille
Joanne C. White
Aspen (Populus tremuloides) stands have historically been considered a barrier to wildfire progression across Canada. However, as the climate changes and negatively impacts fire weather conditions, the established relationship between aspen, weather, and wildfires may also be changing. We explored this relationship using annual maps of dominant tree species extent and wildfire occurrence for three recent active fire years (2021–2023) within four Canadian forested ecozones (275 Mha), where most interactions between aspen stands and wildfires take place. We compared the proportion of aspen at burned perimeters with the proportion of aspen within the burned perimeters and found that aspen was more than twice as common at fire perimeters (ratio of 2.42). Increasing aspen cover also decreased daily burned area, from a median of 717 ha/day to 646 ha/day when aspen cover increased from less than 10% to more than 25%. Our analysis indicated that the increase in daily burned area following a rise in the fire weather index was reduced when greater aspen cover was present. Additionally, comparison with burn severity in spruce- and pine-dominated stands showed that aspen burned at a significantly lower severity than spruce in the two ecozones where aspen presence is greater. Our results indicate that despite a warming climate and an increase in the number of days conducive to severe fires, aspen continues to function as a barrier to the progression of wildfire and mitigates increases in daily burned area under extreme weather conditions.
• Aspen acts as a fire barrier: it is more than twice as common at fire perimeters as inside them.
• Increasing aspen cover reduces daily burned area.
• Greater aspen cover moderates increased burned area caused by extreme fire weather.
• Aspen burn severity was lower than spruce and pine where aspen presence was greater.
• The difference in fire activity between leafed and leafless aspen is mixed.
Plateaus, where an agent's performance stagnates at a suboptimal level, are a common problem in deep on-policy RL. Focusing on PPO due to its widespread adoption, we show that plateaus in certain regimes arise not because of known exploration, capacity, or optimization challenges, but because sample-based estimates of the loss eventually become poor proxies for the true objective over the course of training. As a recap, PPO switches between sampling rollouts from several parallel environments online using the current policy (which we call the outer loop) and performing repeated minibatch SGD steps against this offline dataset (the inner loop). In our work we consider only the outer loop, and conceptually model it as stochastic optimization. The step size is then controlled by the regularization strength towards the previous policy and the gradient noise by the number of samples collected between policy update steps. This model predicts that performance will plateau at a suboptimal level if the outer step size is too large relative to the noise. Recasting PPO in this light makes it clear that there are two ways to address this particular type of learning stagnation: either reduce the step size or increase the number of samples collected between updates. We first validate the predictions of our model and investigate how hyperparameter choices influence the step size and update noise, concluding that increasing the number of parallel environments is a simple and robust way to reduce both factors. Next, we propose a recipe for how to co-scale the other hyperparameters when increasing parallelization, and show that incorrectly doing so can lead to severe performance degradation. Finally, we vastly outperform prior baselines in a complex open-ended domain by scaling PPO to more than 1M parallel environments, thereby enabling monotonic performance improvement up to one trillion transitions.
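The qualitative claim above, that a stochastic optimizer stalls at a level set by step size and gradient noise, can be checked on a toy problem. The sketch below is not PPO: it assumes a one-dimensional quadratic loss f(x) = x²/2 and a gradient estimate averaged over `num_samples` noisy draws, for which the stationary expected loss has the closed form ησ² / (2n(2 - η)). All names and constants are illustrative assumptions.

```python
import random


def plateau_loss(step_size: float, noise_var: float, num_samples: int) -> float:
    """Closed-form stationary E[x^2 / 2] for x <- x - step_size * (x + noise)."""
    return step_size * noise_var / (2.0 * num_samples * (2.0 - step_size))


def simulated_plateau(step_size: float, noise_var: float, num_samples: int,
                      steps: int = 20_000, burn_in: int = 1_000, seed: int = 0) -> float:
    """Average loss after burn-in for noisy gradient descent on f(x) = x^2 / 2."""
    rng = random.Random(seed)
    noise_std = (noise_var / num_samples) ** 0.5  # averaging n samples cuts variance by n
    x, total = 1.0, 0.0
    for t in range(steps + burn_in):
        x -= step_size * (x + rng.gauss(0.0, noise_std))
        if t >= burn_in:
            total += 0.5 * x * x
    return total / steps


analytic = plateau_loss(0.5, 1.0, 1)
empirical = simulated_plateau(0.5, 1.0, 1)
```

In this toy model the plateau level drops either by shrinking the step size or by collecting more samples per update, which mirrors the two remedies named in the abstract.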
Large language models (LLMs) often rely on explicit chain-of-thought (CoT) traces to solve multi-step reasoning problems, but these traces increase inference cost, expose brittle prompt dependence, and complicate training objectives. We study an alternative: latent deliberation implemented as a small recurrent refinement module that performs multiple internal "thinking" steps while keeping the external sequence length fixed. We introduce Recursive Latent Reinforcement Pretraining (RLRP), a training recipe that augments a base causal LLM with a shared latent head executed for
2026-03-04
LLM_Reasoning @ International Conference on Learning Representations (published)
We introduce Generative Recursive reAsoning Models (GRAM), a recursion-based generative model that is effective for complex planning and reasoning problems. GRAM reformulates recent latent recursive architectures as a stochastic generative process with probabilistic latent transitions, enabling efficient and stable computation entirely in latent space without relying on token-level sequences as in chain-of-thought (CoT) prompting. We optimize this generative recursion via amortized variational inference, allowing the model to represent and explore multiple plausible latent trajectories conditioned on the input. This formulation supports both conditional reasoning through
2026-03-04
RSI @ International Conference on Learning Representations (poster)
Psychedelics profoundly alter conscious experience, yet how they reshape the relationship between brain anatomy and function remains unclear. In particular, it is unknown whether psychedelic states reflect a global disruption of structure–function organization or a frequency- and network-specific reconfiguration of neural dynamics relative to the structural connectome. Here we address this question using source-localized magnetoencephalography mapped onto connectome harmonics to quantify structure–function coupling in humans under lysergic acid diethylamide (LSD) and placebo. LSD induces a robust decoupling of low-frequency (theta, alpha and beta) activity from anatomical constraints, indicating a global loosening of structure-aligned large-scale dynamics. In contrast, high-frequency gamma activity shows selective reorganization rather than uniform disruption. Greater gamma-band decoupling within core default-mode network regions predicts the intensity of ego dissolution across individuals, demonstrating that while LSD broadly alters large-scale dynamics, subjective loss of self is specifically linked to frequency-selective reorganization of the default-mode network. Functional decoding reveals that LSD does not produce indiscriminate disintegration but instead drives system-specific rebalancing, with preferential decoupling of visual and attentional systems and strengthened coupling within auditory networks. Together, these findings provide electrophysiological evidence that psychedelic states emerge from a frequency-dependent relaxation of structural constraints on brain activity and identify default-mode reorganization as a neural correlate of ego dissolution. These results offer a mechanistic framework for understanding how LSD may exert therapeutic effects by transiently relaxing rigid structural constraints and enhancing dynamical flexibility within networks involved in self-related processing.