Publications
Understanding Performance Gap Between Parallel and Sequential Sampling in Large Reasoning Models
Large Reasoning Models (LRMs) have shown remarkable performance on challenging questions, such as math and coding. However, to obtain a high-quality solution, one may need to sample more than once. In principle, there are two sampling strategies that can be composed to form more complex processes: sequential sampling and parallel sampling. In this paper, we first compare these two approaches with rigor, and observe, in line with previous work, that parallel sampling seems to outperform sequential sampling even though the latter should have more representation power. To understand the underlying reasons, we propose three hypotheses for this behavior: (i) parallel sampling outperforms due to the aggregator operator; (ii) sequential sampling is harmed by needing to use longer contexts; (iii) sequential sampling leads to less exploration due to conditioning on previous answers. The empirical evidence on various model families and sizes (Qwen3, DeepSeek-R1 distilled models, Gemini 2.5) and question domains (math and coding) suggests that aggregation and context length do not seem to be the main culprits behind the performance gap. In contrast, the lack of exploration seems to play a considerably larger role, and we argue that this is one main cause of the performance gap.
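As a rough illustration of the two strategies compared in this abstract, the sketch below contrasts independent draws combined by an aggregator with draws that each condition on the previous answers. The abstract does not specify the aggregator, so majority voting is an assumption here, and `sample_fn`, `parallel_sample`, and `sequential_sample` are hypothetical names rather than anything taken from the paper.

```python
from collections import Counter

def parallel_sample(sample_fn, question, n):
    """Draw n answers independently, then aggregate them.

    Assumption: the aggregator is a majority vote; the paper may use another.
    """
    answers = [sample_fn(question, []) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

def sequential_sample(sample_fn, question, n):
    """Draw n answers one after another, each conditioned on the earlier ones."""
    history = []
    for _ in range(n):
        # The model sees a copy of every previous answer before producing the next.
        history.append(sample_fn(question, list(history)))
    return history[-1]
```

Here `sample_fn(question, history)` stands in for a call to a reasoning model. In the sequential case the growing `history` is what lengthens the context and may narrow exploration, which are the second and third hypotheses listed above.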
Named Entity Recognition (NER) is a foundational NLP task, yet research in Yorùbá has been constrained by limited and domain-specific resources. Existing resources, such as MasakhaNER (a manually annotated news-domain corpus) and WikiAnn (automatically created from Wikipedia), are valuable but restricted in domain coverage. To address this gap, we present YoNER, a new multidomain Yorùbá NER dataset that extends entity coverage beyond news and Wikipedia. The dataset comprises about 5,000 sentences and 100,000 tokens collected from five domains including Bible, Blogs, Movies, Radio broadcast and Wikipedia, and annotated with three entity types: Person (PER), Organization (ORG) and Location (LOC), following CoNLL-style guidelines. Annotation was conducted manually by three native Yorùbá speakers, with an inter-annotator agreement of over 0.70, ensuring high quality and consistency. We benchmark several transformer encoder models using cross-domain experiments with MasakhaNER 2.0, and we also assess the effect of few-shot in-domain data using YoNER and cross-lingual setups with English datasets. Our results show that African-centric models outperform general multilingual models for Yorùbá, but cross-domain performance drops substantially, particularly for blogs and movie domains. Furthermore, we observed that closely related formal domains, such as news and Wikipedia, transfer more effectively. In addition, we introduce a new Yorùbá-specific language model (OyoBERT) that outperforms multilingual models in in-domain evaluation. We publicly release the YoNER dataset and pretrained OyoBERT models to support future research on Yorùbá natural language processing.
Vision-language Models (VLMs), despite achieving strong performance on multimodal benchmarks, often misinterpret straightforward visual concepts that humans identify effortlessly, such as counting, spatial reasoning, and viewpoint understanding. Previous studies manually identified these weaknesses and found that they often stem from deficits in specific skills. However, such manual efforts are costly, unscalable, and subject to human bias, which often overlooks subtle details in favor of salient objects, resulting in an incomplete understanding of a model's vulnerabilities. To address these limitations, we propose a Reinforcement Learning (RL)-based framework to automatically discover the failure modes or blind spots of any candidate VLM on a given data distribution without human intervention. Our framework trains a questioner agent that adaptively generates queries based on the candidate VLM's responses to elicit incorrect answers. Our approach increases question complexity by focusing on fine-grained visual details and distinct skill compositions as training progresses, consequently identifying 36 novel failure modes in which VLMs struggle. We demonstrate the broad applicability of our framework by showcasing its generalizability across various model combinations.
AI-based writing assistants are ubiquitous, yet little is known about how users' mental models shape their use. We examine two types of mental models -- functional or related to what the system does, and structural or related to how the system works -- and how they affect control behavior -- how users request, accept, or edit AI suggestions as they write -- and writing outcomes. We primed participants (
Evolution is an extraordinary engine for enzymatic diversity, yet the chemistry it has explored remains a narrow slice of what DNA can encode. Deep generative models can design new proteins that bind ligands, but none have created enzymes without pre-specifying catalytic residues. We introduce DISCO (DIffusion for Sequence-structure CO-design), a multimodal model that co-designs protein sequence and 3D structure around arbitrary biomolecules, as well as inference-time scaling methods that optimize objectives across both modalities. Conditioned solely on reactive intermediates, DISCO designs diverse heme enzymes with novel active-site geometries. These enzymes catalyze new-to-nature carbene-transfer reactions, including alkene cyclopropanation, spirocyclopropanation, B-H, and C(sp
Psychedelic drugs are re-emerging as promising scientific and clinical tools. However, despite a rapidly expanding literature on their therapeutic value, the neural mechanisms underlying psychedelic effects remain unclear. Resting-state functional magnetic resonance imaging studies of acute psychedelic effects, conducted independently by several research groups, have so far yielded fragmented and sometimes inconsistent findings. Here, to help facilitate greater convergence, we conducted a 'mega-analysis' integrating 11 independent resting-state functional magnetic resonance imaging datasets across five psychedelic drugs (psilocybin, lysergic acid diethylamide, mescaline, N,N-dimethyltryptamine and ayahuasca) from research groups spanning three continents and five countries. By applying a uniform preprocessing pipeline and a Bayesian hierarchical modeling framework, we discovered several common features in the induced alterations to brain function across drugs and sites. Most prominently, we identified a core signature of increased functional connectivity between transmodal (default, frontoparietal and limbic) and unimodal networks (visual and somatomotor), with subnetwork specificity. Furthermore, key subcortical regions (thalamus, caudate and putamen) and the cerebellum exhibited altered coupling with sensorimotor networks. In contrast to several single-site reports, Bayesian modeling revealed weak-to-moderate and selective reductions in within-network functional connectivity, with substantial variability across drugs and networks. Together, these findings extend past work by demonstrating that psychedelics reconfigure large-scale cortical organization while selectively engaging subcortical circuitry. This study provides the most comprehensive synthesis of psychedelic brain action to date, helping resolve inconsistencies and offering a probabilistic map of how psychedelics alter large-scale brain organization.
We hereby provide a cornerstone to benchmark and shepherd future psychedelic neuroimaging research.
Decoder-Transformers have achieved remarkable success and have laid the groundwork for the development of Large Language Models (LLMs). At the core of these models is the self-attention matrix, which allows different tokens to interact with each other. This process is remarkably similar to the message-passing mechanism used in Graph Neural Networks (GNNs), and as such decoder-Transformers suffer many of the optimization difficulties studied extensively in the GNN literature. In this paper, we present a unified graph perspective that bridges the theoretical understanding of decoder-Transformers and GNNs. We systematically examine how well-known phenomena in GNNs, such as over-smoothing and over-squashing, directly manifest as analogous issues like rank collapse and representational collapse in deep Transformer architectures. By interpreting Transformers' self-attention as a learned adjacency operator, we reveal shared underlying principles governing signal propagation and demonstrate how insights from one field can illuminate challenges and solutions in the other. We analyze the role of architectural components like residual connections, normalization, and causal masking in these issues. We aim to provide a framework for understanding how information flows through deep learning models that perform sequence mixing through an adjacency operator, and to highlight areas for cross-pollination of research, as well as to provide a comprehensive reference for researchers interested in the underpinnings of these architectures.
2026-04-03
Transactions on Machine Learning Research (accepted)
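The adjacency-operator reading of self-attention mentioned in the abstract above can be made concrete with a small sketch. It assumes a single head, takes the raw attention scores as given (no learned projections), and uses scalar token values; `causal_attention_matrix` and `propagate` are hypothetical names, not the paper's code.

```python
import math

def causal_attention_matrix(scores):
    """Row-wise softmax over a causally masked square score matrix.

    The result is lower-triangular and row-stochastic, so it can be read as a
    weighted adjacency matrix of a directed graph in which token i receives
    messages only from tokens 0..i.
    """
    n = len(scores)
    attn = []
    for i in range(n):
        visible = scores[i][: i + 1]          # causal mask: drop future tokens
        m = max(visible)                      # subtract the max for stability
        exps = [math.exp(s - m) for s in visible]
        z = sum(exps)
        attn.append([e / z for e in exps] + [0.0] * (n - i - 1))
    return attn

def propagate(attn, values):
    """One message-passing step: each token becomes a weighted mean of its inputs."""
    return [sum(a * v for a, v in zip(row, values)) for row in attn]
```

Applying `propagate` repeatedly with no residual connection averages the token values over and over, which is the kind of setting in which the smoothing effects named in the abstract are usually discussed.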
Multivariate count models are often justified by their ability to capture latent dependence, but researchers receive little guidance on when this added structure improves on simpler penalized marginal Poisson regression. We study this question using real microbiome data under a unified held-out evaluation framework. For count prediction, we compare PLN and GLMNet(Poisson) on 20 datasets spanning 32 to 18,270 samples and 24 to 257 taxa, using held-out Poisson deviance under leave-one-taxon-out prediction with 3-fold sample cross-validation rather than synthetic or in-sample criteria. For network inference, we compare PLNNetwork and GLMNet(Poisson) neighborhood selection on five publicly available datasets with experimentally validated microbial interaction truth. PLN outperforms GLMNet(Poisson) on most count-prediction datasets, with gains up to 38 percent. The primary predictor of the winner is the sample-to-taxon ratio, with mean absolute correlation as the strongest secondary signal and overdispersion as an additional predictor. PLNNetwork performs best on broad undirected interaction benchmarks, whereas GLMNet(Poisson) is better aligned with local or directional effects. Taken together, these results provide guidance for choosing between latent multivariate count models and penalized Poisson regression in biological count prediction and interaction recovery.
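For readers unfamiliar with the evaluation criterion named in this abstract, the sketch below computes the standard Poisson deviance between observed counts and predicted means. Averaging over observations is an assumption made here for illustration (the abstract does not say whether the deviance is summed or averaged), and `poisson_deviance` is a hypothetical helper name.

```python
import math

def poisson_deviance(y_true, mu_pred):
    """Mean Poisson deviance: 2 * (y * log(y / mu) - (y - mu)), averaged.

    Lower is better; the value is 0 when every prediction equals its count.
    """
    if len(y_true) != len(mu_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for y, mu in zip(y_true, mu_pred):
        if mu <= 0:
            raise ValueError("predicted means must be strictly positive")
        # By convention y * log(y / mu) is taken as 0 when y == 0.
        log_term = y * math.log(y / mu) if y > 0 else 0.0
        total += 2.0 * (log_term - (y - mu))
    return total / len(y_true)
```

In a held-out setting, `y_true` would hold counts from samples excluded from fitting and `mu_pred` the fitted model's predicted means for those same samples.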
Large language model (LLM) agents learn by interacting with environments, but long-horizon training remains fundamentally bottlenecked by sparse and delayed rewards. Existing methods typically address this challenge through post-hoc credit assignment or external reward models, which provide limited guidance at inference time and often separate reward improvement from policy improvement. We propose Self-Guide, a self-generated internal reward for language agents that supports both inference-time guidance and training-time supervision. Specifically, the agent uses Self-Guide as a short self-guidance signal to steer the next action during inference, and converts the same signal into step-level internal reward for denser policy optimization during training. This creates a co-evolving loop: better policy produces better guidance, and better guidance further improves policy as internal reward. Across three agent benchmarks, inference-time self-guidance already yields clear gains, while jointly evolving policy and internal reward with GRPO brings further improvements (8%) over baselines trained solely with environment reward. Overall, our results suggest that language agents can improve not only by collecting more experience, but also by learning to generate and refine their own internal reward during acting and learning.
Many real-world scenarios involve solving bilevel optimization problems in which there is an outer discrete optimization problem and an inner problem involving expensive or black box computation. This arises in space-time–dependent variants of the traveling salesman problem, such as when planning space missions that visit multiple astronomical objects. Planning these missions presents significant challenges due to the constant relative motion of the objects involved. There is an outer combinatorial problem of finding the optimal order to visit the objects and an inner optimization problem that requires finding the optimal departure time and trajectory to travel between each pair of objects. The constant motion of the objects complicates the inner problem, making it computationally expensive. This paper introduces a novel framework utilizing decision diagrams (DDs) and a DD-based branch-and-bound technique, peel-and-bound, to achieve exact solutions for such bilevel optimization problems, assuming sufficient inner problem optimizer quality. The framework leverages problem-specific knowledge to expedite search processes and minimize the number of expensive evaluations required. As a case study, we apply this framework to the asteroid routing problem, a benchmark problem in global trajectory optimization. Experimental results demonstrate the framework’s scalability and ability to generate robust heuristic solutions for tested instances. Many of these solutions are exact, contingent on the assumed quality of the inner problem’s optimizer. History: Accepted by Andrea Lodi, Area Editor for Design & Analysis of Algorithms–Discrete. Supplemental Material: The software that supports the findings of this study is available within the paper and its Supplemental Information ( https://pubsonline.informs.org/doi/suppl/10.1287/ijoc.2024.0866 ) as well as from the IJOC GitHub software repository ( https://github.com/INFORMSJoC/2024.0866 ).
The complete IJOC Software and Data Repository is available at https://informsjoc.github.io/ .
Rationale: Heart failure with preserved ejection fraction (HFpEF) is a heterogeneous syndrome with substantial unmet diagnostic and therapeutic needs. Circulating lipid metabolism is increasingly implicated in HFpEF pathophysiology but has not been systematically leveraged for molecular stratification. Objective: To determine whether plasma lipidomics can identify molecular phenogroups of HFpEF associated with distinct clinical characteristics and outcomes. Methods and Results: Untargeted plasma lipidomics was performed in non-HF subjects and HFpEF patients from a primary Belgian cohort and an independent Canadian cohort (n=177 in each cohort). In the Belgian cohort, 235 unique lipids spanning 19 subclasses were annotated, including 96 significantly associated with HFpEF (q<0.02). Unsupervised analyses revealed marked lipidomic heterogeneity, with a distinct HFpEF subgroup separable from non-HF subjects. Hierarchical clustering identified three phenogroups with divergent lipid profiles and clinical features. One phenogroup exhibited severe atrial dysfunction, congestion-related biomarkers, elevated indices of cardiac and liver fibrosis, and markedly reduced survival, a second was characterized by prominent metabolic syndrome features, and a third by preserved renal function. Cross-cohort comparison using a supervised classifier trained on 158 shared lipids confirmed analogous lower-risk phenogroups in the Canadian cohort, while the high-risk phenotype was underrepresented. A signature of 10 lipids across six subclasses, including long-chain acylcarnitines, ether phosphatidylcholines, and oxidized sphingomyelins, discriminated the high-risk group and correlated with markers of disease severity. Conclusion: Our findings demonstrate that HFpEF comprises metabolically distinct patient subgroups across cohorts, revealing specific lipidomic dysfunctions that deepen our understanding of HFpEF heterogeneity and underlying pathophysiology.