Publications

To Select or not to Select, that is the Question: Distilling Robot Skill Prediction into a Small Ensemble

Simon Roy

Euhid Aman

As robot fleets become more heterogeneous, including humanoids, rovers, quadrupeds, and drones, selecting the right robot for a task becomes a core systems problem. We study robot skill prediction: mapping a natural-language task description to the physical capabilities required to execute it, such as fly, wheels, legs, surface water, under water, and hands. Since no labelled data mapping natural-language task descriptions to robots' physical capabilities exists, we construct a synthetic task-to-skill dataset using LLM-assisted generation and targeted label auditing. Trained on this data, a ~133M-parameter ensemble of two fine-tuned sentence encoders (mpnet + MiniLM) reaches 83.5% task-to-skill matching on a stratified 200-task dataset, outperforming Kimi K2 (1T MoE) at 72.0%, GPT-OSS-120B at 71.5%, and Llama-4-Scout-17B at 69.0% under the same zero-shot prompt. These results suggest that, for fixed robot skill taxonomies, small specialized models trained on synthetic data can outperform much larger general-purpose LLMs for fleet-level task routing.

2026-05-19

arXiv (preprint)

doi.org

arxiv.org
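
A minimal sketch of ensemble-based skill prediction followed by fleet routing. It assumes each encoder outputs a per-skill score in [0, 1] and that the two scores are averaged with equal weight and compared against a 0.5 threshold; the abstract does not state the combination rule, so the scores, threshold, label strings, the `route` helper, and the fleet below are all hypothetical.

```python
from typing import Dict, List, Set

# Skill labels taken from the abstract; the exact label strings are assumptions.
SKILLS: List[str] = ["fly", "wheels", "legs", "surface water", "under water", "hands"]


def ensemble_skills(
    scores_a: Dict[str, float],
    scores_b: Dict[str, float],
    threshold: float = 0.5,
) -> List[str]:
    """Average per-skill scores from two encoders; keep skills at or above threshold.

    Equal-weight averaging and the 0.5 threshold are illustrative assumptions.
    """
    required = []
    for skill in SKILLS:
        avg = (scores_a.get(skill, 0.0) + scores_b.get(skill, 0.0)) / 2.0
        if avg >= threshold:
            required.append(skill)
    return required


def route(required: List[str], fleet: Dict[str, Set[str]]) -> List[str]:
    """Return the (hypothetical) robots whose skill sets cover every required skill."""
    return sorted(name for name, skills in fleet.items() if set(required) <= skills)


# Hypothetical per-skill scores from the two encoders for a single task description.
mpnet_scores = {"under water": 0.9, "hands": 0.4, "fly": 0.1}
minilm_scores = {"under water": 0.8, "hands": 0.7, "fly": 0.2}
needed = ensemble_skills(mpnet_scores, minilm_scores)
fleet = {
    "drone": {"fly"},
    "rov": {"under water", "hands"},
    "humanoid": {"legs", "hands"},
}
print(needed, route(needed, fleet))  # ['under water', 'hands'] ['rov']
```

Averaging lets one encoder's confident score rescue a borderline score from the other (here "hands"), which is one common reason small ensembles help; other rules, such as weighted averaging or max-pooling, would be equally plausible.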

Widespread use of invalid statistical tests in biomedical machine learning

Tianchu Zeng

Hui Li

Shaoshi Zhang

Yan Quan Tan

Fang Tian

Csaba Orbán

Lijun An

Wanyu Che

Jingwen Cheng

Joanna Su Xian Chong

Niousha Dehestani

Zijian Dong

Xin Li

Zhizhou Li

Mervyn Jun Rui Lim

Yi Lin

Qinrui Ling

Zijie Ling

Xi Zhi Low

Sina Mansour L.

Kwun Kei Ng

Thuan Tinh Nguyen

Leon Qi Rong Ooi

Shreya Pande

Xing Qian

Jingxuan Ruan

Z WANG

Yapei Xie

Chen Zhang

Yichi Zhang

K Patil

Linden Parkes

Elvisha Dhamala

Sidhant Chopra

Andrew Zalesky

Avram Holmes

S Eickhoff

Juan Helen Zhou

Olivier Renaud

Nico Dosenbach

Konrad P. Kording

Danilo Bzdok

Thomas Nichols

B T Thomas Yeo

Machine learning is accelerating biomedical research. Cross-validation is widely used to compare predictive performance, not only to benchmark algorithms but also to inform scientific applications such as ranking biomarkers. However, prediction performance estimates across cross-validation folds are not independent. Standard tests for comparing prediction performance (e.g., the paired t-test) assume independence and can therefore inflate false positive rates. In a PRISMA-guided meta-analysis of 210 studies (impact factor ≥15, 1 June 2020 to 1 June 2025), we find that 97% ignored fold dependence when comparing prediction performance. This problem is ubiquitous across scientific fields and unaffected by impact factor, rigor-promoting policies, or open science practices. Simulations across 420 scenarios spanning four diverse datasets show that ignoring fold dependence leads to invalid false positive control in most settings. Repeated cross-validation further compounds this problem, with false positive rates rising toward 100% as the number of repetitions grows. Existing fold-dependence-aware tests rely on strong assumptions because the variance of fold-level statistics and the between-fold correlation cannot be disentangled under standard cross-validation. We therefore propose the SHARP (Split-HAlf RePeated) test, a simple modification to standard cross-validation that enables direct estimation of variance and correlation. Benchmarked against 12 tests, SHARP provides the best overall balance of false positive control, statistical power, and confidence interval calibration across simulation schemes. We conclude by providing best practices and reporting guidelines for valid model comparison inference in biomedical machine learning and beyond.

2026-05-19

bioRxiv (preprint)

doi.org
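
To illustrate why fold dependence matters, the sketch below compares the naive paired t statistic with the Nadeau-Bengio corrected resampled t statistic, a well-known fold-dependence-aware test that relies on a heuristic variance correction of the kind the abstract calls a strong assumption. It is not the SHARP test, and the per-fold differences are made-up numbers.

```python
import math
import statistics
from typing import Sequence


def naive_paired_t(diffs: Sequence[float]) -> float:
    """Paired t statistic that treats the J fold-level differences as independent."""
    j = len(diffs)
    return statistics.mean(diffs) / math.sqrt(statistics.variance(diffs) / j)


def corrected_resampled_t(diffs: Sequence[float], n_train: int, n_test: int) -> float:
    """Nadeau-Bengio corrected resampled t statistic (not SHARP).

    The 1/J variance factor is inflated by n_test/n_train to account for the
    overlap between training sets; this is a heuristic correction.
    """
    j = len(diffs)
    var = statistics.variance(diffs)
    return statistics.mean(diffs) / math.sqrt((1.0 / j + n_test / n_train) * var)


# Hypothetical per-fold accuracy differences (model A minus model B) from 5-fold
# CV on 1000 samples, so each fold trains on 800 and tests on 200.
fold_diffs = [0.02, 0.01, 0.03, 0.00, 0.015]
print(naive_paired_t(fold_diffs))                   # about 3.0
print(corrected_resampled_t(fold_diffs, 800, 200))  # about 2.0

# Mimic 10 repetitions that reproduce the same fold differences: the naive
# statistic grows with J, while the corrected one levels off.
repeated = fold_diffs * 10
print(naive_paired_t(repeated), corrected_resampled_t(repeated, 800, 200))
```

With the repeated list, the naive statistic rises to about 10.5 while the corrected one stays near 2.9, mirroring the abstract's point that repetition alone drives naive false positive rates upward.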

Characterization of limb representation in the pig’s motor cortex

David Bergeron

Hugo Delivet-Mongrain

Marco Bonizzato

Marina Martinez

Due to its large gyrencephalic brain, the pig is increasingly used in neuroscience research, especially for the preclinical testing of novel neuroprostheses. However, our understanding of the pig's motor system remains limited compared with the species commonly used in neuroscience research. Here, we aimed to characterize the forelimb and hindlimb representations in the pig motor cortex using intracortical microstimulation (ICMS). Three domestic pigs (Sus scrofa) were placed in a modified stereotactic frame and maintained under intravenous propofol sedation. We mapped the motor cortex using ICMS applied at varying cortical coordinates and depths. For each site, we recorded the electrode depth eliciting the maximal limb response and determined the motor threshold. Responses were assessed visually and via electromyographic recordings. ICMS uncovered a large forelimb representation with stereotypical contralateral responses. Conversely, the hindlimb representation was smaller and located within the interhemispheric fissure. The mean threshold of the five most responsive forelimb sites was 75 ± 25 μA, compared with 280 ± 45 μA for hindlimb sites (p < 0.01). Summated stimulation delivered unilaterally to the hindlimb representation of the motor cortex triggered bilateral alternating hindlimb movements. These results suggest that while the porcine cortex can directly command forelimb movements via the corticospinal pathway, cortical control of the hindlimbs likely relies on polysynaptic pathways through the brainstem, such as the cortico-reticulospinal pathway.

2026-05-18

Journal of Neurophysiology (published)

doi.org

Improved Ising Model Formulation for Polar Codes

Ryan Seah

Warren J. Gross

This paper presents an improved Ising model framework for polar codes, termed POLARIS, which reduces the number of binary variables by incorporating rate-1 node structures and embedding elements of successive-cancellation decoding into the Ising formulation. The decoder scales efficiently to block lengths up to N = 64, doubling prior Ising-based limits. POLARIS performs within 0.4 dB of successive-cancellation list decoding while reducing QUBO dimensionality from 192 to 126 variables. These advancements bring Ising-based polar decoding closer to practical realization, offering improved efficiency for implementation on both quantum and hybrid CMOS-classical annealing hardware.

2026-05-18

2026 IEEE 56th International Symposium on Multiple-Valued Logic (ISMVL) (published)

doi.org

LLM Pretraining Shapes a Generalizable Manifold: Insights into Cross-Modal Transfer to Time Series

Zhenghan Tai

Vasilii Feofanov

Can language-pretrained transformers become effective time-series forecasters, and why? In this paper, we show that cross-modal transfer arises because language pretraining preconditions time series training with a reusable manifold. A linear probe on frozen LLM states decodes realistic time-series trajectories without paired supervision, and retrieval in this projected space yields competitive forecasts, showing that structure and dynamics exist before finetuning. Pretrained initialization also improves optimization, producing coherent gradients and a highly anisotropic loss landscape unlike random initialization. Finetuning then acts as low-dimensional alignment, reusing existing directions rather than learning temporal primitives from scratch, as evidenced by low-rank updates, subspace alignment, and shared features for periodicity, trend, and repetition. Together, these results support a geometric account of LLM-to-time-series transfer: language pretraining builds the manifold, and finetuning projects numerical dynamics onto task-relevant directions.

2026-05-18

arXiv (preprint)

doi.org

arxiv.org

RFGWRK: a hybrid downscaling framework for high-resolution precipitation mapping in geohazard-prone mountainous regions

Simin Zhang

Zeshuang Zheng

Jun Ding

Shengbing Yang

Yuan Zeng

2026-05-18

International Conference on Remote Sensing, Surveying, and Mapping (published)

doi.org

A Universal Source-Free Class Unlearning Framework via Synthetic Embeddings

Zahra Dehghani Tafti

Pablo Piantanida

Mohammadhadi Shateri

Class unlearning in neural classifiers refers to selectively removing the model’s ability to recognize a target (forget) class by reshaping the decision boundaries. This is essential when taxonomies change, labels are corrected, or legal or ethical requirements mandate class removal. The objective is to preserve performance on the remaining (retain) classes while avoiding costly full retraining. Existing methods generally require access to the source, i.e., forget/retain data or a relevant surrogate dataset. This dependency limits their applicability in scenarios where access to source data is restricted or unavailable. Even recent source-free class unlearning methods rely on generating samples in the data space, which is computationally expensive and not essential for class unlearning. In this work, we propose a novel source-free class unlearning framework that enables existing unlearning methods to operate using only the deployed model. We show that, under assumptions on the forget loss with respect to logits, class unlearning can be performed source-free for any given neural classifier by utilizing randomly generated samples within the classifier’s intermediate space. Specifically, randomly generated embeddings pseudo-labeled by the model as belonging to the forget or retain classes can support effective source-free unlearning. Our analysis further shows that, under conditions on the forget loss and synthetic forget embeddings, minimizing the forget loss induces expected logit shifts consistent with class unlearning, without requiring a specific parametric form of the embedding distribution. We validate our framework on four backbone architectures, ResNet-18, ResNet-50, ViT-B/16, and Swin-T, across three benchmark datasets, CIFAR-10, CIFAR-100, and TinyImageNet.
Our experimental results show that existing class unlearning methods can operate within our source-free framework, with minimal impact on their forgetting efficacy and retain class accuracy. The code is available at https://github.com/Yasaman-dt/Source_Free_Class_Unlearning.

2026-05-18

Transactions on Machine Learning Research (accepted)

openreview.net
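
A minimal sketch of the synthetic-embedding step, assuming a hypothetical linear classifier head on top of a d-dimensional intermediate space and standard Gaussian sampling. The abstract states that no specific parametric form is required, so the Gaussian is just one convenient choice. Only the pseudo-labelling and the forget/retain split are shown; the unlearning objective applied afterwards is not.

```python
import random
from typing import List, Sequence, Tuple

Embedding = List[float]
Sample = Tuple[Embedding, int]


def head_logits(weights: Sequence[Sequence[float]], bias: Sequence[float], z: Embedding) -> List[float]:
    """Hypothetical linear head mapping an intermediate embedding to one logit per class."""
    return [sum(w * x for w, x in zip(row, z)) + b for row, b in zip(weights, bias)]


def synthesize_pseudo_labeled(
    weights: Sequence[Sequence[float]],
    bias: Sequence[float],
    dim: int,
    n: int,
    forget_class: int,
    seed: int = 0,
) -> Tuple[List[Sample], List[Sample]]:
    """Draw random embeddings, pseudo-label them with the deployed head, and split them."""
    rng = random.Random(seed)
    forget: List[Sample] = []
    retain: List[Sample] = []
    for _ in range(n):
        # Standard Gaussian sampling is an assumption, not the paper's distribution.
        z = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        logits = head_logits(weights, bias, z)
        # The model's own argmax prediction serves as the pseudo-label.
        label = max(range(len(logits)), key=logits.__getitem__)
        (forget if label == forget_class else retain).append((z, label))
    return forget, retain


# Hypothetical 3-class head over a 4-dimensional intermediate space.
W = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
b = [0.0, 0.0, 0.0]
forget_set, retain_set = synthesize_pseudo_labeled(W, b, dim=4, n=1000, forget_class=2)
print(len(forget_set), len(retain_set))
```

No training data is touched: the two lists are built from the deployed head alone, which is the property that makes the setting source-free.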

Factorized and Vectorized Execution: Optimizing Analytical and Semantic Queries over Relations

Sunny Yasser

Anas Dorbani

Amine Mhedhbi

Many-to-many joins are central to analytical and semantic workloads such as fraud detection, network analysis, and recommendation, where insights arise from relationships between entities. These workloads often suffer from an explosion of intermediate results, sometimes orders of magnitude larger than the inputs. Factorized representations address this problem by exploiting conditional independence among attributes to encode intermediates more compactly. In some cases, they can reduce the output size asymptotically below the worst-case output size. However, adopting factorization in modern vectorized query processors remains challenging: factorized representations are hierarchical, whereas vectorized execution is built around flat, block-oriented processing. Prior approaches either rely on full materialization or support only restricted factorization layouts, sacrificing much of the benefit of both factorization and vectorization. We present FFX, a novel engine for Fast Factorized eXecution. FFX is the first pipelined engine to support arbitrary factorization schemes while preserving full vectorization. The engine introduces packed factorized vectors and operators that maintain cache-friendly, contiguous layouts. Beyond analytics, FFX also co-optimizes semantic operators by serializing factorized intermediates into compact prompts for large language models (LLMs), substantially reducing token usage and inference cost while maintaining, and in some cases improving, output quality. Together, these contributions enable efficient execution of join-heavy analytical queries, including queries augmented with semantic operators.

2026-05-17

Proceedings of the ACM on Management of Data (published)

doi.org
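
To make the factorization idea concrete, here is a minimal sketch (not FFX's packed vector layout) of a factorized many-to-many join R(a, b) ⋈ S(b, c). For each join key b, the matching a-values and c-values are conditionally independent, so storing them as two lists represents their full Cartesian product. The relations and helper names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterator, List, Tuple

# Join key -> (a-values, c-values); each entry stands for their Cartesian product.
Factorized = Dict[Hashable, Tuple[List[Hashable], List[Hashable]]]


def factorized_join(r: List[Tuple[Hashable, Hashable]], s: List[Tuple[Hashable, Hashable]]) -> Factorized:
    """Join R(a, b) with S(b, c) on b without materializing the product."""
    left: Dict[Hashable, List[Hashable]] = defaultdict(list)
    right: Dict[Hashable, List[Hashable]] = defaultdict(list)
    for a, key in r:
        left[key].append(a)
    for key, c in s:
        right[key].append(c)
    return {key: (left[key], right[key]) for key in left if key in right}


def flat_size(f: Factorized) -> int:
    """Number of (a, b, c) tuples a flat, fully materialized result would hold."""
    return sum(len(a_vals) * len(c_vals) for a_vals, c_vals in f.values())


def factorized_size(f: Factorized) -> int:
    """Number of stored values: one key plus both value lists per group."""
    return sum(1 + len(a_vals) + len(c_vals) for a_vals, c_vals in f.values())


def enumerate_tuples(f: Factorized) -> Iterator[Tuple[Hashable, Hashable, Hashable]]:
    """Lazily expand the factorized result into flat tuples when they are needed."""
    for key, (a_vals, c_vals) in f.items():
        for a in a_vals:
            for c in c_vals:
                yield (a, key, c)


# Hypothetical many-to-many case: 100 customers and 100 products share one store key.
R = [(f"customer{i}", "store1") for i in range(100)]
S = [("store1", f"product{j}") for j in range(100)]
result = factorized_join(R, S)
print(flat_size(result), factorized_size(result))  # 10000 201
```

The flat result grows multiplicatively per key while the factorized one grows additively; the engineering challenge the abstract describes is keeping such nested groups in contiguous, vectorizable memory rather than Python lists.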

Modelling Customer Trajectories with Reinforcement Learning for Practical Retail Insights

Ken Ming Lee

Paul Barde

Maxime C. Cohen

Derek Nowrouzezahrai

Understanding customer movement within retail spaces is essential for optimizing store layouts. Real-world trajectory data can provide highly accurate insights, but collecting it is costly and often infeasible for many retailers. Heuristics such as the Travelling Salesman Problem (TSP) and Probabilistic Nearest Neighbours (PNN) are commonly used as inexpensive approximations, but actual customer trajectories deviate by an average of 28% from shortest paths, highlighting a tradeoff between accuracy and practicality. We propose an agent-based modelling framework that casts customer trajectory prediction as a maximum entropy reinforcement learning (RL) problem, balancing reward maximization with stochasticity to better reflect customers with bounded rationality. Using real-world trajectory data from a convenience store, we show that RL-generated trajectories align more closely with customer behaviour than TSP and PNN, providing more accurate estimates of impulse purchase rates and shelf traffic densities. Furthermore, only RL-based predictions yield repositioning decisions for impulse products that align with those derived from actual trajectory data, resulting in comparable estimated profit gains. Our work demonstrates that RL provides a practical, behaviourally grounded alternative that bridges the gap between oversimplified heuristics and data-intensive approaches, making accurate layout optimization more accessible. To encourage further research, the source code is available on GitHub.

2026-05-17

arXiv (preprint)

doi.org

arxiv.org
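
The maximum entropy formulation replaces the hard max in the Bellman backup with a soft maximum, V(s) = α log Σ_a exp(Q(s, a)/α), which yields a stochastic Boltzmann policy. The sketch below runs soft value iteration on a hypothetical four-node store graph with made-up rewards (each move costs 1, entering aisle_b earns a net 0.5). It illustrates the general technique only, not the authors' model, reward design, or training setup.

```python
import math
from typing import Dict, List, Tuple

Graph = Dict[str, List[str]]
Rewards = Dict[Tuple[str, str], float]


def soft_value_iteration(graph: Graph, rewards: Rewards, alpha: float = 1.0,
                         gamma: float = 0.95, iters: int = 300) -> Dict[str, float]:
    """Soft Bellman backups: V(s) = alpha * log sum_n exp(Q(s, n) / alpha)."""
    v = {s: 0.0 for s in graph}
    for _ in range(iters):
        new_v = {}
        for s, nexts in graph.items():
            if not nexts:  # terminal node (the checkout)
                new_v[s] = 0.0
                continue
            q = [rewards[(s, n)] + gamma * v[n] for n in nexts]
            m = max(q)  # subtract the max for numerical stability
            new_v[s] = m + alpha * math.log(sum(math.exp((qi - m) / alpha) for qi in q))
        v = new_v
    return v


def soft_policy(s: str, graph: Graph, rewards: Rewards, v: Dict[str, float],
                alpha: float = 1.0, gamma: float = 0.95) -> Dict[str, float]:
    """Boltzmann policy: pi(n | s) proportional to exp(Q(s, n) / alpha)."""
    q = {n: rewards[(s, n)] + gamma * v[n] for n in graph[s]}
    m = max(q.values())
    weights = {n: math.exp((qi - m) / alpha) for n, qi in q.items()}
    z = sum(weights.values())
    return {n: w / z for n, w in weights.items()}


# Hypothetical store layout and rewards, for illustration only.
graph: Graph = {
    "entrance": ["aisle_a", "aisle_b"],
    "aisle_a": ["checkout", "aisle_b"],
    "aisle_b": ["checkout", "aisle_a"],
    "checkout": [],
}
rewards: Rewards = {
    ("entrance", "aisle_a"): -1.0, ("entrance", "aisle_b"): 0.5,
    ("aisle_a", "checkout"): -1.0, ("aisle_a", "aisle_b"): 0.5,
    ("aisle_b", "checkout"): -1.0, ("aisle_b", "aisle_a"): -1.0,
}
for alpha in (1.0, 0.1):
    v = soft_value_iteration(graph, rewards, alpha=alpha)
    print(alpha, soft_policy("entrance", graph, rewards, v, alpha=alpha))
```

With alpha = 1.0 the simulated customer still takes the less rewarding aisle a substantial fraction of the time, while alpha = 0.1 makes the choice nearly deterministic. In this sketch alpha is the knob that trades reward maximization against stochasticity.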

PRISM: High-Resolution & Precise Counterfactual Medical Image Generation using Language-guided Stable Diffusion

Developing reliable and generalizable deep learning systems for medical imaging faces significant obstacles due to spurious correlations, data imbalances, and limited text annotations in datasets. Addressing these challenges requires architectures robust to the unique complexities posed by medical imaging data. The rapid advancements in vision-language foundation models within the natural image domain prompt the question of how they can be adapted for medical imaging tasks. In this work, we present PRISM, a framework that leverages foundation models to generate high-resolution, language-guided medical image counterfactuals using Stable Diffusion. Our approach demonstrates unprecedented precision in selectively modifying spurious correlations (the medical devices) and disease features, enabling the removal and addition of specific attributes while preserving other image characteristics. Through extensive evaluation, we show how PRISM advances counterfactual generation and enables the development of more robust downstream classifiers for clinically deployable solutions. To facilitate broader adoption and research, we make our code publicly available at https://github.com/Amarkr1/PRISM.

2026-05-17

Medical Imaging with Deep Learning (published)

doi.org

proceedings.mlr.press

Revisiting Age of Acquisition in Curriculum Learning: Disentangling Lexical Features and Semantic Structure

Ian Gifford

Aaron Shah

Catherine Chen

Taimaa Kassab Bachi

Eva Portelance

Previous work has found that ordering training data by children’s Age of Acquisition (AoA) for words increases the stability of distributional word embeddings, suggesting that early-learned words play a privileged role in shaping semantic structure. In this study, we determine whether AoA itself drives these effects, or whether they emerge from correlated lexical factors such as frequency, concreteness, and phonological complexity. Using incremental Word2Vec training, we construct curricula ordered by AoA and by individual lexical features, while systematically controlling for vocabulary growth and deterministic ordering effects. We show that AoA-ordered curricula produce greater early-phase stability than shuffled baselines, even under controlled exposure conditions. We find that the advantage observed with AoA can be largely explained by correlated factors like overall word frequency. Despite limited gains on general similarity benchmarks, AoA-ordered embeddings outperform shuffled embeddings on a proxy domain-specific task: predicting human AoA norms. This advantage persists after debiasing timestamp effects, implying that AoA curricula induce developmentally meaningful semantic structure.

2026-05-17

CoNLL @ Association for Computational Linguistics (published)

openreview.net

Scalable Environments Drive Generalizable Agents

Jiayi Zhang

Fanqi Kong

Guibin Zhang

Maojia Song

Zhaoyang Yu

Jianhao Ruan

Jinyu Xiang

Bang Liu

Chenglin Wu

Yuyu Luo

Generalizable agents should adapt to diverse tasks and unseen environments beyond their training distribution. This position paper argues that such generalization requires environment scaling: expanding the distribution of executable rule-sets that agents interact with, rather than only increasing trajectories or tasks within fixed benchmarks. Current scaling practices largely focus on collecting more experience or broader task sets under fixed interaction rules, leaving agents brittle when underlying interfaces, dynamics, observations, or feedback signals change. The core challenge is therefore a world-level distribution shift: agents need systematic exposure to environments with meaningfully different executable rule-sets. To clarify this challenge, we propose a unified taxonomy that separates trajectory scaling, task scaling, and environment scaling by their primary deliverables and by what changes in the executable rule-set. Building on this taxonomy, we synthesize construction paradigms for scalable environments, contrasting programmatic generators that prioritize controllability and verifiability with generative world models that offer broader coverage and open-endedness. We further outline how environment scaling can be coupled with stateful learning mechanisms, emphasizing learned update rules for cross-environment adaptation. We conclude by discussing alternative perspectives and argue that scalable environments provide the essential substrate for measurable and controllable progress toward robust general agents.

2026-05-17

arXiv (preprint)

doi.org

arxiv.org

AI Policy Fellowship Publications

Mila Ventures Launchpad

AI Policy Compass
