Publications
Uncertainty Assessment in Deep Learning-based Plant Trait Retrievals from Hyperspectral data
Abstract. Large-scale mapping of plant biophysical and biochemical traits is essential for ecological and environmental applications. Given their finer spectral resolution and unprecedented data availability, hyperspectral data, in concert with machine and particularly deep learning models, have emerged as a promising, non-destructive tool for accurately retrieving these traits. However, when deploying these methods on a large scale, reliably quantifying the associated uncertainty remains a critical challenge, especially when models encounter out-of-domain (OOD) data, i.e., samples that differ substantially from those of the training data, such as unseen geographical regions, species, biomes, data acquisition modalities, or scene components (e.g., clouds and water bodies). Traditional uncertainty quantification methods for deep learning models, including deep ensembles (deterministic and probabilistic) and Monte Carlo dropout, rely on the variance of predictions but often fail to capture uncertainty in OOD scenarios, leading to overly optimistic and possibly misleading uncertainty estimates. To address this limitation, we propose a distance-based uncertainty estimation method (Dis_UN) that quantifies prediction uncertainty by measuring the dissimilarity in the predictor space (spectral inputs) and embedding space (features learned by the deep model) between the training and test data. Dis_UN leverages residuals as a proxy for uncertainty and employs dissimilarity indices in data manifolds to estimate worst-case errors via 95-quantile regression. We evaluate Dis_UN using a pretrained deep learning model to predict multiple plant traits from hyperspectral images, analyzing its performance across OOD data, such as pixels containing spectral variations from urban surfaces, bare ground, water, clouds, or open surface waters.
In this study, we target six leaf and canopy traits: leaf mass per area, chlorophylls, carotenoids, nitrogen content, equivalent water thickness, and leaf area index. Compared to scaled variance-based methods, Dis_UN provides (1) a superior estimation of uncertainty in OOD scenarios, achieving 36 % higher contrast (KS distances: 0.648 vs. 0.475) between non-vegetation pixels, particularly under mixed-pixel conditions at medium resolution (30 m); (2) uncertainty quantification without requiring normality or symmetry assumptions, accommodating asymmetric error patterns; (3) enhanced interpretability of uncertainty sources, as uncertainty is directly linked to sample dissimilarity from the training data; and (4) computational efficiency at inference (2.6–7.7× faster), requiring only a single forward pass compared to multiple passes for ensemble-based methods. Challenges remain for traits that are affected by spectral saturation. These findings highlight the advantages of distance-aware uncertainty quantification methods and underscore the necessity of diverse training datasets to minimize sampling biases and enhance model robustness. The proposed framework improves the reliability of uncertainty estimation in vegetation monitoring and offers a promising approach for broader applications.
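The idea of tying uncertainty to distance from the training data can be illustrated with a minimal sketch. This is not the Dis_UN implementation: it assumes Euclidean nearest-neighbour distance as the dissimilarity index and swaps the 95-quantile regression for a simpler binned empirical quantile of absolute residuals. All function names, the bin edges, and the toy data are hypothetical.

```python
import bisect
import math


def nearest_distance(x, train):
    """Euclidean distance from x to its nearest training sample (assumed dissimilarity index)."""
    return min(math.dist(x, t) for t in train)


def quantile(values, q):
    """Empirical quantile with linear interpolation, q in [0, 1]."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def fit_binned_quantiles(distances, abs_residuals, edges, q=0.95):
    """Per distance bin, store the q-quantile of absolute residuals.

    A binned stand-in for quantile regression: bin k collects the samples
    whose distance falls between edges[k-1] and edges[k].
    """
    bins = [[] for _ in range(len(edges) + 1)]
    for d, r in zip(distances, abs_residuals):
        bins[bisect.bisect_right(edges, d)].append(r)
    table, last = [], 0.0
    for b in bins:
        if b:
            last = quantile(b, q)
        table.append(last)  # an empty bin inherits the previous bin's value
    return table


def predict_uncertainty(x, train, edges, table):
    """Look up the worst-case error estimate for a new sample from its distance."""
    return table[bisect.bisect_right(edges, nearest_distance(x, train))]


# Toy calibration data (hypothetical): residuals grow with distance.
train = [(0.0, 0.0), (1.0, 0.0)]
edges = [1.0]
table = fit_binned_quantiles([0.1, 0.2, 1.5, 2.0], [0.1, 0.2, 1.0, 2.0], edges)

near = predict_uncertainty((0.5, 0.0), train, edges, table)
far = predict_uncertainty((5.0, 0.0), train, edges, table)
```

In this sketch a sample far from the training set receives a larger error bound than a nearby one, and a single lookup replaces the repeated forward passes that ensemble-based methods need.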
Latent reasoning via continuous chain-of-thoughts (Latent CoT) has emerged as a promising alternative to discrete CoT reasoning. Operating in continuous space increases expressivity and has been hypothesized to enable superposition: the ability to maintain multiple candidate solutions simultaneously within a single representation. Despite theoretical arguments, it remains unclear whether language models actually leverage superposition when reasoning using latent CoTs. We investigate this question across three regimes: a training-free regime that constructs latent thoughts as convex combinations of token embeddings, a fine-tuned regime where a base model is adapted to produce latent thoughts, and a from-scratch regime where a model is trained entirely with latent thoughts to solve a given task. Using Logit Lens and entity-level probing to analyze internal representations, we find that only models trained from scratch exhibit signs of using superposition. In the training-free and fine-tuned regimes, we find that the superposition either collapses or is not used at all, with models discovering shortcut solutions instead. We argue that this is due to two complementary phenomena: (i) pretraining on natural language data biases models to commit to a token in the last layers, and (ii) capacity has a huge effect on which solutions a model favors. Together, our results offer a unified explanation for when and why superposition arises in continuous chain-of-thought reasoning, and identify the conditions under which it collapses.
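A latent thought built as a convex combination of token embeddings can be sketched as follows. The abstract does not say how the weights are chosen; this sketch assumes they come from a softmax over next-token logits, and the function names, the temperature parameter, and the two-token vocabulary are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Non-negative weights that sum to 1 (numerically stabilised)."""
    m = max(logits)
    exps = [math.exp((z - m) / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def latent_thought(logits, embeddings, temperature=1.0):
    """Convex combination of token embeddings, weighted by softmax(logits)."""
    weights = softmax(logits, temperature)
    dim = len(embeddings[0])
    return [sum(w * e[i] for w, e in zip(weights, embeddings)) for i in range(dim)]


# Toy vocabulary of two tokens with orthogonal embeddings (hypothetical).
embeddings = [[1.0, 0.0], [0.0, 1.0]]

# Equal logits: both candidates are kept in superposition.
mixed = latent_thought([0.0, 0.0], embeddings)

# A very low temperature makes the mixture collapse onto a single token.
collapsed = latent_thought([5.0, 0.0], embeddings, temperature=0.01)
```

With equal logits the vector sits halfway between the two embeddings, which is the superposition case; as the temperature drops, the mixture approaches a single discrete token embedding.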
Named Entity Recognition (NER) is a foundational NLP task, yet research in Yorùbá has been constrained by limited and domain-specific resources. Existing resources, such as MasakhaNER (a manually annotated news-domain corpus) and WikiAnn (automatically created from Wikipedia), are valuable but restricted in domain coverage. To address this gap, we present YoNER, a new multidomain Yorùbá NER dataset that extends entity coverage beyond news and Wikipedia. The dataset comprises about 5,000 sentences and 100,000 tokens collected from five domains including Bible, Blogs, Movies, Radio broadcast and Wikipedia, and annotated with three entity types: Person (PER), Organization (ORG) and Location (LOC), following CoNLL-style guidelines. Annotation was conducted manually by three native Yorùbá speakers, with an inter-annotator agreement of over 0.70, ensuring high quality and consistency. We benchmark several transformer encoder models using cross-domain experiments with MasakhaNER 2.0, and we also assess the effect of few-shot in-domain data using YoNER and cross-lingual setups with English datasets. Our results show that African-centric models outperform general multilingual models for Yorùbá, but cross-domain performance drops substantially, particularly for blogs and movie domains. Furthermore, we observed that closely related formal domains, such as news and Wikipedia, transfer more effectively. In addition, we introduce a new Yorùbá-specific language model (OyoBERT) that outperforms multilingual models in in-domain evaluation. We publicly release the YoNER dataset and pretrained OyoBERT models to support future research on Yorùbá natural language processing.
Vision-language Models (VLMs), despite achieving strong performance on multimodal benchmarks, often misinterpret straightforward visual concepts that humans identify effortlessly, such as counting, spatial reasoning, and viewpoint understanding. Previous studies manually identified these weaknesses and found that they often stem from deficits in specific skills. However, such manual efforts are costly, unscalable, and subject to human bias, which often overlooks subtle details in favor of salient objects, resulting in an incomplete understanding of a model's vulnerabilities. To address these limitations, we propose a Reinforcement Learning (RL)-based framework to automatically discover the failure modes or blind spots of any candidate VLM on a given data distribution without human intervention. Our framework trains a questioner agent that adaptively generates queries based on the candidate VLM's responses to elicit incorrect answers. Our approach increases question complexity by focusing on fine-grained visual details and distinct skill compositions as training progresses, consequently identifying 36 novel failure modes in which VLMs struggle. We demonstrate the broad applicability of our framework by showcasing its generalizability across various model combinations.
AI-based writing assistants are ubiquitous, yet little is known about how users' mental models shape their use. We examine two types of mental models -- functional or related to what the system does, and structural or related to how the system works -- and how they affect control behavior -- how users request, accept, or edit AI suggestions as they write -- and writing outcomes. We primed participants (
Evolution is an extraordinary engine for enzymatic diversity, yet the chemistry it has explored remains a narrow slice of what DNA can encode. Deep generative models can design new proteins that bind ligands, but none have created enzymes without pre-specifying catalytic residues. We introduce DISCO (DIffusion for Sequence-structure CO-design), a multimodal model that co-designs protein sequence and 3D structure around arbitrary biomolecules, as well as inference-time scaling methods that optimize objectives across both modalities. Conditioned solely on reactive intermediates, DISCO designs diverse heme enzymes with novel active-site geometries. These enzymes catalyze new-to-nature carbene-transfer reactions, including alkene cyclopropanation, spirocyclopropanation, B-H, and C(sp
Psychedelic drugs are re-emerging as promising scientific and clinical tools. However, despite a rapidly expanding literature on their therapeutic value, the neural mechanisms underlying psychedelic effects remain unclear. Resting-state functional magnetic resonance imaging studies of acute psychedelic effects, conducted independently by several research groups, have so far yielded fragmented and sometimes inconsistent findings. Here, to help facilitate greater convergence, we conducted a 'mega-analysis' integrating 11 independent resting-state functional magnetic resonance imaging datasets across five psychedelic drugs (psilocybin, lysergic acid diethylamide, mescaline, N,N-dimethyltryptamine and ayahuasca) from research groups spanning three continents and five countries. By applying a uniform preprocessing pipeline and a Bayesian hierarchical modeling framework, we discovered several common features in the induced alterations to brain function across drugs and sites. Most prominently, we identified a core signature of increased functional connectivity between transmodal (default, frontoparietal and limbic) and unimodal networks (visual and somatomotor), with subnetwork specificity. Furthermore, key subcortical regions (thalamus, caudate and putamen) and the cerebellum exhibited altered coupling with sensorimotor networks. In contrast to several single-site reports, Bayesian modeling revealed weak-to-moderate and selective reductions in within-network functional connectivity, with substantial variability across drugs and networks. Together, these findings extend past work by demonstrating that psychedelics reconfigure large-scale cortical organization while selectively engaging subcortical circuitry. This study provides the most comprehensive synthesis of psychedelic brain action to date, helping resolve inconsistencies and offering a probabilistic map of how psychedelics alter large-scale brain organization.
We hereby provide a cornerstone to benchmark and shepherd future psychedelic neuroimaging research.
Large language model (LLM) agents learn by interacting with environments, but long-horizon training remains fundamentally bottlenecked by sparse and delayed rewards. Existing methods typically address this challenge through post-hoc credit assignment or external reward models, which provide limited guidance at inference time and often separate reward improvement from policy improvement. We propose Self-Guide, a self-generated internal reward for language agents that supports both inference-time guidance and training-time supervision. Specifically, the agent uses Self-Guide as a short self-guidance signal to steer the next action during inference, and converts the same signal into step-level internal reward for denser policy optimization during training. This creates a co-evolving loop: better policy produces better guidance, and better guidance further improves policy as internal reward. Across three agent benchmarks, inference-time self-guidance already yields clear gains, while jointly evolving policy and internal reward with GRPO brings further improvements (8%) over baselines trained solely with environment reward. Overall, our results suggest that language agents can improve not only by collecting more experience, but also by learning to generate and refine their own internal reward during acting and learning.
Rationale: Heart failure with preserved ejection fraction (HFpEF) is a heterogeneous syndrome with substantial unmet diagnostic and therapeutic needs. Circulating lipid metabolism is increasingly implicated in HFpEF pathophysiology but has not been systematically leveraged for molecular stratification. Objective: To determine whether plasma lipidomics can identify molecular phenogroups of HFpEF associated with distinct clinical characteristics and outcomes. Methods and Results: Untargeted plasma lipidomics was performed in non-HF subjects and HFpEF patients from a primary Belgian cohort and an independent Canadian cohort (n=177 in each cohort). In the Belgian cohort, 235 unique lipids spanning 19 subclasses were annotated, including 96 significantly associated with HFpEF (q<0.02). Unsupervised analyses revealed marked lipidomic heterogeneity, with a distinct HFpEF subgroup separable from non-HF subjects. Hierarchical clustering identified three phenogroups with divergent lipid profiles and clinical features. One phenogroup exhibited severe atrial dysfunction, congestion-related biomarkers, elevated indices of cardiac and liver fibrosis, and markedly reduced survival, a second was characterized by prominent metabolic syndrome features, and a third by preserved renal function. Cross-cohort comparison using a supervised classifier trained on 158 shared lipids confirmed analogous lower-risk phenogroups in the Canadian cohort, while the high-risk phenotype was underrepresented. A signature of 10 lipids across six subclasses, including long-chain acylcarnitines, ether phosphatidylcholines, and oxidized sphingomyelins, discriminated the high-risk group and correlated with markers of disease severity. Conclusion: Our findings demonstrate that HFpEF comprises metabolically distinct patient subgroups across cohorts, revealing specific lipidomic dysfunctions that deepen our understanding of HFpEF heterogeneity and underlying pathophysiology.
High-performance GPU kernels are critical to modern machine learning systems, yet developing efficient implementations remains a challenging, expert-driven process due to the tight coupling between algorithmic structure, memory hierarchy usage, and hardware-specific optimizations. Recent work has explored using large language models (LLMs) to generate GPU kernels automatically, but generated implementations often struggle to maintain correctness and achieve competitive performance across iterative refinements. We present CuTeGen, an agentic framework for automated generation and optimization of GPU kernels that treats kernel development as a structured generate–test–refine workflow. Unlike approaches that rely on one-shot generation or large-scale search over candidate implementations, CuTeGen focuses on progressive refinement of a single evolving kernel through execution-based validation, structured debugging, and staged optimization. A key design choice is to generate kernels using the CuTe abstraction layer, which exposes performance-critical structures such as tiling and data movement while providing a more stable representation for iterative modification. To guide performance improvement, CuTeGen incorporates workload-aware optimization prompts and delayed integration of profiling feedback. Experimental results on matrix multiplication and activation workloads demonstrate that the framework produces functionally correct kernels and achieves competitive performance relative to optimized library implementations.