Publications

To Select or not to Select, that is the Question: Distilling Robot Skill Prediction into a Small Ensemble
Simon Roy
Euhid Aman
As robot fleets become more heterogeneous, including humanoids, rovers, quadrupeds, and drones, selecting the right robot for a task becomes a core systems problem. We study robot skill prediction: mapping a natural-language task description to the physical capabilities required to execute it, such as fly, wheels, legs, surface water, under water and hands. Since labelled data that maps natural-language task descriptions to robots' physical capabilities does not exist, we construct a synthetic task-to-skill dataset using LLM-assisted generation and targeted label auditing. Trained on this data, a ~133M-parameter ensemble of two fine-tuned sentence encoders (mpnet + MiniLM) reaches 83.5% task-to-skill matching on a stratified 200-task dataset, outperforming Kimi K2 (1T MoE) at 72.0%, GPT-OSS-120B at 71.5%, and Llama-4-Scout-17B at 69.0% under the same zero-shot prompt. These results suggest that, for fixed robot skill taxonomies, small specialized models trained on synthetic data can outperform much larger general-purpose LLMs for fleet-level task routing.
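As a rough illustration of the ensembling idea, the sketch below averages per-skill probabilities from two models and keeps the skills that clear a threshold. The abstract does not say how the two encoders are combined, so the averaging rule, the 0.5 threshold, the function name `ensemble_predict`, and the example scores are all assumptions made for illustration.

```python
# Skill labels taken from the abstract's list of physical capabilities.
SKILLS = ["fly", "wheels", "legs", "surface water", "under water", "hands"]


def ensemble_predict(probs_a, probs_b, threshold=0.5):
    """Average two models' per-skill probabilities and keep skills at or above threshold.

    The averaging rule and the threshold are illustrative assumptions,
    not the combination method used in the paper.
    """
    if len(probs_a) != len(SKILLS) or len(probs_b) != len(SKILLS):
        raise ValueError("each model must return one probability per skill")
    averaged = [(a + b) / 2 for a, b in zip(probs_a, probs_b)]
    return [skill for skill, p in zip(SKILLS, averaged) if p >= threshold]


# Hypothetical scores for a task such as "inspect the underside of a bridge".
mpnet_scores = [0.9, 0.1, 0.2, 0.3, 0.05, 0.4]
minilm_scores = [0.7, 0.2, 0.1, 0.1, 0.05, 0.7]
predicted = ensemble_predict(mpnet_scores, minilm_scores)
print(predicted)
```

Averaging lets one model's confident score pull a borderline skill over the threshold, as happens for "hands" in this made-up example.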
Widespread use of invalid statistical tests in biomedical machine learning
Tianchu Zeng
Hui Li
Shaoshi Zhang
Yan Quan Tan
Fang Tian
Csaba Orbán
Lijun An
Wanyu Che
Jingwen Cheng
Joanna Su Xian Chong
Niousha Dehestani
Zijian Dong
Xin Li
Zhizhou Li
Mervyn Jun Rui Lim
Yi Lin
Qinrui Ling
Zijie Ling
Xi Zhi Low
Sina Mansour L.
Kwun Kei Ng
Thuan Tinh Nguyen
Leon Qi Rong Ooi
Shreya Pande
Xing Qian
Jingxuan Ruan
Z WANG
Yapei Xie
Chen Zhang
Yichi Zhang
K Patil
Linden Parkes
Elvisha Dhamala
Sidhant Chopra
Andrew Zalesky
Avram Holmes
S Eickhoff
Juan Helen Zhou
Olivier Renaud
Nico Dosenbach
Konrad P. Kording
Thomas Nichols
B T Thomas Yeo
Machine learning is accelerating biomedical research. Cross-validation is widely used to compare predictive performance – not only to benchmark algorithms, but also to inform scientific applications, such as ranking biomarkers. However, prediction performance estimates across cross-validation folds are not independent. Standard tests for comparing prediction performance (e.g., paired t-test) assume independence and can therefore inflate false positive rates. In a PRISMA-guided meta-analysis of 210 studies (impact factor ≥15, 1 June 2020 – 1 June 2025), we find that 97% ignored fold dependence when comparing prediction performance. This problem is ubiquitous across scientific fields and unaffected by impact factor, rigor-promoting policies, or open science practices. Simulations across 420 scenarios spanning four diverse datasets show that ignoring fold dependence leads to invalid false positive control in most settings. Repeated cross-validation further compounds this problem, with false positive rates rising toward 100% as the number of repetitions grows. Existing fold-dependence-aware tests rely on strong assumptions because the variance of fold-level statistics and the between-fold correlation cannot be disentangled under standard cross-validation. We therefore propose the SHARP (Split-HAlf RePeated) test, a simple modification to standard cross-validation that enables direct estimation of variance and correlation. Benchmarked against 12 tests, SHARP provides the best overall balance of false-positive control, statistical power, and confidence-interval calibration across simulation schemes. We conclude by providing best practices and reporting guidelines for valid model comparison inference in biomedical machine learning and beyond.
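To make the fold-dependence problem concrete, the sketch below compares the naive paired t-statistic with one well-known correction, the corrected resampled t-test of Nadeau and Bengio. This is not the SHARP test, whose details are not given in the abstract. The correction inflates the variance term by the test-to-train size ratio. The per-fold differences and the sample sizes are hypothetical.

```python
import math
import statistics


def naive_paired_t(diffs):
    """Paired t-statistic that treats per-fold differences as independent."""
    k = len(diffs)
    return statistics.mean(diffs) / math.sqrt(statistics.variance(diffs) / k)


def corrected_resampled_t(diffs, n_train, n_test):
    """Nadeau-Bengio corrected t-statistic.

    Replaces the 1/k variance scaling with (1/k + n_test/n_train) to account
    for the overlap between training sets across folds.
    """
    k = len(diffs)
    scale = 1.0 / k + n_test / n_train
    return statistics.mean(diffs) / math.sqrt(scale * statistics.variance(diffs))


# Hypothetical accuracy differences (model A minus model B) over 10 folds
# of a 1000-sample dataset: 900 training and 100 test samples per fold.
fold_diffs = [0.02, 0.01, 0.03, 0.00, 0.02, 0.01, 0.02, 0.03, 0.01, 0.02]
t_naive = naive_paired_t(fold_diffs)
t_corrected = corrected_resampled_t(fold_diffs, n_train=900, n_test=100)
print(t_naive, t_corrected)
```

Both statistics are compared against a t-distribution with k - 1 degrees of freedom. The corrected statistic is always smaller in magnitude, so it rejects the null hypothesis less readily than the naive test.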
Characterization of limb representation in the pig’s motor cortex
David Bergeron
Hugo Delivet-Mongrain
Marina Martinez
Due to its large gyrencephalic brain, the pig is increasingly used for neuroscience research, especially for the preclinical testing of novel neuroprostheses. However, our understanding of the pig’s motor system remains limited compared to that of the species commonly used for neuroscience research. Here, we aimed to characterize the forelimb and hindlimb representation of the pig motor cortex using intracortical microstimulation (ICMS). Three domestic pigs (Sus scrofa) were placed in a modified stereotactic frame and maintained under intravenous propofol sedation. We mapped the motor cortex using ICMS, applied at varying cortical coordinates and depths. For each site, we recorded the electrode depth eliciting the maximal limb response and determined the motor threshold. Responses were assessed visually and via electromyographic recordings. ICMS uncovered a large forelimb representation, with stereotypical contralateral responses. Conversely, the hindlimb representation was smaller and located within the interhemispheric fissure. The mean threshold of the five most responsive forelimb sites was 75 ± 25 μA, compared to 280 ± 45 μA for hindlimb sites (p<0.01). A summation of stimulations in the hindlimb representation of the motor cortex unilaterally triggered bilateral alternating hindlimb movements. These results suggest that while the porcine cortex can directly command forelimb movements via the corticospinal pathway, cortical control of the hindlimb likely relies on polysynaptic pathways through the brainstem, such as the cortico-reticulospinal pathway.
Improved Ising Model Formulation for Polar Codes
Ryan Seah
Warren J. Gross
This paper presents an improved Ising model framework for polar codes, termed POLARIS, which reduces the number of binary variables by incorporating rate-1 node structures and embedding elements of successive-cancellation decoding into the Ising formulation. The decoder scales efficiently to block lengths up to N = 64, doubling prior Ising-based limits. POLARIS achieves near-successive-cancellation list performance within 0.4 dB while reducing QUBO dimensionality from 192 to 126 variables. These advancements bring Ising-based polar decoding closer to practical realization, offering improved efficiency for implementation on both quantum and hybrid CMOS-classical annealing hardware.
LLM Pretraining Shapes a Generalizable Manifold: Insights into Cross-Modal Transfer to Time Series
Can language-pretrained transformers become effective time-series forecasters, and why? In this paper, we show that cross-modal transfer arises because language pretraining preconditions time series training with a reusable manifold. A linear probe on frozen LLM states decodes realistic time-series trajectories without paired supervision, and retrieval in this projected space yields competitive forecasts, showing that structure and dynamics exist before finetuning. Pretrained initialization also improves optimization, producing coherent gradients and a highly anisotropic loss landscape unlike random initialization. Finetuning then acts as low-dimensional alignment, reusing existing directions rather than learning temporal primitives from scratch, as evidenced by low-rank updates, subspace alignment, and shared features for periodicity, trend, and repetition. Together, these results support a geometric account of LLM-to-time-series transfer: language pretraining builds the manifold, and finetuning projects numerical dynamics onto task-relevant directions.
RFGWRK: a hybrid downscaling framework for high-resolution precipitation mapping in geohazard-prone mountainous regions
Simin Zhang
Zeshuang Zheng
Shengbing Yang
Yuan Zeng
A Universal Source-Free Class Unlearning Framework via Synthetic Embeddings
Mohammadhadi Shateri
Class unlearning in neural classifiers refers to selectively removing the model’s ability to recognize a target (forget) class by reshaping the decision boundaries. This is essential when taxonomies change, labels are corrected, or legal or ethical requirements mandate class removal. The objective is to preserve performance on the remaining (retain) classes while avoiding costly full retraining. Existing methods generally require access to the source, i.e., forget/retain data or a relevant surrogate dataset. This dependency limits their applicability in scenarios where access to source data is restricted or unavailable. Even recent source-free class unlearning methods rely on generating samples in the data space, which is computationally expensive and not essential for class unlearning. In this work, we propose a novel source-free class unlearning framework that enables existing unlearning methods to operate using only the deployed model. We show that, under assumptions on the forget loss with respect to logits, class unlearning can be performed source-free for any given neural classifier by utilizing randomly generated samples within the classifier’s intermediate space. Specifically, randomly generated embeddings pseudo-labeled by the model as belonging to the forget or retain classes can support effective source-free unlearning. Our analysis further shows that, under conditions on the forget loss and synthetic forget embeddings, minimizing the forget loss induces expected logit shifts consistent with class unlearning, without requiring a specific parametric form of the embedding distribution. We validate our framework on four backbone architectures, ResNet-18, ResNet-50, ViT-B/16, and Swin-T, across three benchmark datasets, CIFAR-10, CIFAR-100, and TinyImageNet. Our experimental results show that existing class unlearning methods can operate within our source-free framework, with minimal impact on their forgetting efficacy and retain class accuracy. The code is available at https://github.com/Yasaman-dt/Source_Free_Class_Unlearning.
Factorized and Vectorized Execution: Optimizing Analytical and Semantic Queries over Relations
Many-to-many joins are central to analytical and semantic workloads such as fraud detection, network analysis, and recommendation, where insights arise from relationships between entities. These workloads often suffer from an explosion of intermediate results, sometimes orders of magnitude larger than the inputs. Factorized representations address this problem by exploiting conditional independence among attributes to encode intermediates more compactly. In some cases, they can reduce the output size asymptotically below the worst-case output size. However, adopting factorization in modern vectorized query processors remains challenging: factorized representations are hierarchical, whereas vectorized execution is built around flat, block-oriented processing. Prior approaches either rely on full materialization or support only restricted factorization layouts, sacrificing many of the benefits of both factorization and vectorization. We present FFX, a novel engine for Fast Factorized eXecution. FFX is the first pipelined engine to support arbitrary factorization schemes while preserving full vectorization. The engine introduces packed factorized vectors and operators that maintain cache-friendly, contiguous layouts. Beyond analytics, FFX also co-optimizes semantic operators by serializing factorized intermediates into compact prompts for large language models (LLMs), substantially reducing token usage and inference cost while maintaining output quality and, in some cases, improving it. Together, these contributions enable efficient execution of join-heavy analytical queries, including queries augmented with semantic operators.
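The size gap between flat and factorized join results can be seen in a toy example. The sketch below joins R(a, b) with S(b, c) on b and counts stored values in two ways: as explicit (a, b, c) tuples, and as one b value plus separate a and c lists, which is valid because a and c are independent given b. It illustrates the general idea only and does not reflect FFX's packed vector layout. The relation contents and the function name `join_sizes` are hypothetical.

```python
from collections import defaultdict


def join_sizes(r_pairs, s_pairs):
    """Return (flat_values, factorized_values) for R(a, b) JOIN S(b, c) on b."""
    a_by_b = defaultdict(list)
    c_by_b = defaultdict(list)
    for a, b in r_pairs:
        a_by_b[b].append(a)
    for b, c in s_pairs:
        c_by_b[b].append(c)

    flat = 0
    factorized = 0
    # Only b values present in both relations contribute to the join.
    for b in a_by_b.keys() & c_by_b.keys():
        n_a, n_c = len(a_by_b[b]), len(c_by_b[b])
        flat += 3 * n_a * n_c        # every (a, b, c) tuple stored explicitly
        factorized += 1 + n_a + n_c  # b once, then the a-list and the c-list
    return flat, factorized


# Hypothetical many-to-many case: 100 a-values and 100 c-values share one b.
r = [(a, "hub") for a in range(100)]
s = [("hub", c) for c in range(100)]
flat_size, factorized_size = join_sizes(r, s)
print(flat_size, factorized_size)
```

With a single shared join key, the flat result stores 30,000 values while the factorized form stores 201, so the product becomes a sum.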
Modelling Customer Trajectories with Reinforcement Learning for Practical Retail Insights
Understanding customer movement within retail spaces is essential for optimizing store layouts. Real-world trajectory data can provide highly accurate insights, but collecting it is costly and often infeasible for many retailers. Heuristics such as the Travelling Salesman Problem (TSP) and Probabilistic Nearest Neighbours (PNN) are commonly used as inexpensive approximations, but actual customer trajectories deviate by an average of 28% from shortest paths, highlighting a tradeoff between accuracy and practicality. We propose an agent-based modelling framework that casts customer trajectory prediction as a maximum entropy reinforcement learning (RL) problem, balancing reward maximization with stochasticity to better reflect customers with bounded rationality. Using real-world trajectory data from a convenience store, we show that RL-generated trajectories align more closely with customer behaviour than TSP and PNN, providing more accurate estimates of impulse purchase rates and shelf traffic densities. Furthermore, only RL-based predictions yield repositioning decisions for impulse products that align with those derived from actual trajectory data, resulting in comparable estimated profit gains. Our work demonstrates that RL provides a practical, behaviourally grounded alternative that bridges the gap between oversimplified heuristics and data-intensive approaches, making accurate layout optimization more accessible. To encourage further research, the source code is available on GitHub.
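The trade-off between reward maximization and stochasticity in maximum entropy RL can be illustrated with a Boltzmann (softmax) policy, in which action probabilities are proportional to exp(Q / temperature). The sketch below is a generic illustration of that policy form, not the paper's model. The action names, Q-values, and temperatures are hypothetical.

```python
import math


def soft_policy(q_values, temperature):
    """Boltzmann policy: probability of each action is proportional to exp(Q / temperature)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Subtract the maximum Q-value before exponentiating for numerical stability.
    q_max = max(q_values.values())
    weights = {a: math.exp((q - q_max) / temperature) for a, q in q_values.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


# Hypothetical Q-values for a customer's next move in a store.
q = {"shortest_path_aisle": 1.0, "snack_aisle": 0.5, "drinks_aisle": 0.2}
near_greedy = soft_policy(q, temperature=0.05)  # almost always the best action
exploratory = soft_policy(q, temperature=1.0)   # spreads probability across aisles
print(near_greedy, exploratory)
```

A low temperature approaches a shortest-path style heuristic, while a higher temperature leaves room for detours, which is one way to represent bounded rationality.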
PRISM: High-Resolution & Precise Counterfactual Medical Image Generation using Language-guided Stable Diffusion
Developing reliable and generalizable deep learning systems for medical imaging faces significant obstacles due to spurious correlations, data imbalances, and limited text annotations in datasets. Addressing these challenges requires architectures robust to the unique complexities posed by medical imaging data. The rapid advancements in vision-language foundation models within the natural image domain prompt the question of how they can be adapted for medical imaging tasks. In this work, we present PRISM, a framework that leverages foundation models to generate high-resolution, language-guided medical image counterfactuals using Stable Diffusion. Our approach demonstrates unprecedented precision in selectively modifying spurious correlations (the medical devices) and disease features, enabling the removal and addition of specific attributes while preserving other image characteristics. Through extensive evaluation, we show how PRISM advances counterfactual generation and enables the development of more robust downstream classifiers for clinically deployable solutions. To facilitate broader adoption and research, we make our code publicly available at https://github.com/Amarkr1/PRISM.
Revisiting Age of Acquisition in Curriculum Learning: Disentangling Lexical Features and Semantic Structure
Aaron Shah
Taimaa Kassab Bachi
Previous work has found that ordering training data by children’s Age of Acquisition (AoA) for words increases the stability of distributional word embeddings, suggesting that early-learned words play a privileged role in shaping semantic structure. In this study, we determine whether AoA itself drives these effects, or whether they emerge from correlated lexical factors such as frequency, concreteness, and phonological complexity. Using incremental Word2Vec training, we construct curricula ordered by AoA and by individual lexical features, while systematically controlling for vocabulary growth and deterministic ordering effects. We show that AoA-ordered curricula produce greater early-phase stability than shuffled baselines, even under controlled exposure conditions. We find that the advantage observed with AoA can be largely explained by correlated factors like overall word frequency. Despite limited gains on general similarity benchmarks, AoA-ordered embeddings outperform shuffled embeddings on a proxy domain-specific task: predicting human AoA norms. This advantage persists after debiasing timestamp effects, implying that AoA curricula induce developmentally meaningful semantic structure.
Scalable Environments Drive Generalizable Agents
Jiayi Zhang
Fanqi Kong
Guibin Zhang
Maojia Song
Zhaoyang Yu
Jianhao Ruan
Jinyu Xiang
Chenglin Wu
Yuyu Luo
Generalizable agents should adapt to diverse tasks and unseen environments beyond their training distribution. This position paper argues that such generalization requires environment scaling: expanding the distribution of executable rule-sets that agents interact with, rather than only increasing trajectories or tasks within fixed benchmarks. Current scaling practices largely focus on collecting more experience or broader task sets under fixed interaction rules, leaving agents brittle when underlying interfaces, dynamics, observations, or feedback signals change. The core challenge is therefore a world-level distribution shift: agents need systematic exposure to environments with meaningfully different executable rule-sets. To clarify this challenge, we propose a unified taxonomy that separates trajectory scaling, task scaling, and environment scaling by their primary deliverables and by what changes in the executable rule-set. Building on this taxonomy, we synthesize construction paradigms for scalable environments, contrasting programmatic generators that prioritize controllability and verifiability with generative world models that offer broader coverage and open-endedness. We further outline how environment scaling can be coupled with stateful learning mechanisms, emphasizing learned update rules for cross-environment adaptation. We conclude by discussing alternative perspectives and argue that scalable environments provide the essential substrate for measurable and controllable progress toward robust general agents.