Publications

Model Merging via Data-Free Covariance Estimation
Marawan Gamal Abdel Hameed
Derek Tam
Pascal Jr Tikeng Notsawo
Colin Raffel
Model merging provides a way of cheaply combining individual models to produce a model that inherits each individual's capabilities. While some merging methods can approach the performance of multitask training, they are often heuristically motivated and lack theoretical justification. A principled alternative is to pose model merging as a layer-wise optimization problem that directly minimizes interference between tasks. However, this formulation requires estimating per-layer covariance matrices from data, which may not be available when performing merging. In contrast, many of the heuristically motivated methods do not require auxiliary data, making them practically advantageous. In this work, we revisit the interference minimization framework and show that, under certain conditions, covariance matrices can be estimated directly from difference matrices, eliminating the need for data while also reducing computational costs. We validate our approach across vision and language benchmarks on models ranging from 86M parameters to 7B parameters, outperforming previous data-free state-of-the-art merging methods.
Stop Probing, Start Coding: Why Linear Probes and Sparse Autoencoders Fail at Compositional Generalisation
The linear representation hypothesis states that neural network activations encode high-level concepts as linear mixtures. However, under superposition, this encoding is a projection from a higher-dimensional concept space into a lower-dimensional activation space, and a linear decision boundary in the concept space need not remain linear after projection. In this setting, classical sparse coding methods with per-sample iterative inference leverage compressed sensing guarantees to recover latent factors. Sparse autoencoders (SAEs), on the other hand, amortise sparse inference into a fixed encoder, introducing a systematic gap. We show this amortisation gap persists across training set sizes, latent dimensions, and sparsity levels, causing SAEs to fail under out-of-distribution (OOD) compositional shifts. Through controlled experiments that decompose the failure, we identify dictionary learning -- not the inference procedure -- as the binding constraint: SAE-learned dictionaries point in substantially wrong directions, and replacing the encoder with per-sample FISTA on the same dictionary does not close the gap. An oracle baseline proves the problem is solvable with a good dictionary at all scales tested. Our results reframe the SAE failure as a dictionary learning challenge, not an amortisation problem, and point to scalable dictionary learning as the key open problem for sparse inference under superposition.
Communicating about Space: Language-Mediated Spatial Integration Across Partial Views
Sudarshan Nikhil
Ponnurangam Kumaraguru
Humans build shared spatial understanding by communicating partial, viewpoint-dependent observations. We ask whether Multimodal Large Language Models (MLLMs) can do the same, aligning distinct egocentric views through dialogue to form a coherent, allocentric mental model of a shared environment. To study this systematically, we introduce COSMIC, a benchmark for Collaborative Spatial Communication. In this setting, two static MLLM agents observe a 3D indoor environment from different viewpoints and exchange natural-language messages to solve spatial queries. COSMIC contains 899 diverse scenes and 1250 question-answer pairs spanning five tasks. We find a capability hierarchy: MLLMs are most reliable at identifying shared anchor objects across views, perform worse on relational reasoning, and largely fail at building globally consistent maps, where even frontier models perform near chance. Moreover, we find that thinking capability yields gains in anchor grounding but is insufficient for higher-level spatial communication. To contextualize model behavior, we collect 250 human-human dialogues. Humans achieve 95% aggregate accuracy, while the best model, Gemini-3-Pro-Thinking, reaches 72%, leaving substantial room for improvement. Moreover, human conversations grow more precise as partners align on a shared spatial understanding, whereas MLLMs keep exploring without converging, suggesting limited capacity to form and sustain a robust shared mental model throughout the dialogue. Our code and data are available at https://github.com/ankursikarwar/Cosmic.
Efficient CMOS Invertible Logic Using Stochastic Computing
Sean C. Smithson
Naoya Onizawa
Brett H. Meyer
Warren J. Gross
Takahiro Hanyu
Invertible logic can operate in one of two modes: 1) a forward mode, in which inputs are presented and a single, correct output is produced, and 2) a reverse mode, in which the output is fixed and the inputs take on values consistent with the output. It is possible to create invertible logic using various Boltzmann machine configurations. Such systems have been shown to solve certain challenging problems quickly, such as factorization and combinatorial optimization. In this paper, we show that invertible logic can be implemented using simple spiking neural networks based on stochastic computing. We present a design methodology for invertible stochastic gates, which can be implemented using a small amount of CMOS hardware. We demonstrate that our design can not only correctly implement basic gates with invertible capability, but can also be extended to construct invertible stochastic adder and multiplier circuits. Experimental results are presented which demonstrate correct operation of synthesizable invertible circuitry performing both multiplication and factorization, along with fabricated ASIC measurement results for an invertible multiplier circuit.
Impact of WHO AWaRe Antibiotic Handbook training on antibiotics prescribing knowledge among private primary care providers: a vignette-based, pre–post pilot study in Patna, India
Poshan Thapa
Prachi Shukla
Chandrashekhar Joshi
Sena Sayood
P. Sinha
Diwash Timilsina
Mili Dutta
Madhukar Pai
Samira Abbasgholizadeh Rahimi
Sumanth Gandra
Introduction: Inappropriate antibiotic prescribing is a major concern in low- and middle-income countries (LMICs), particularly at the primary care level. The WHO AWaRe Antibiotic Handbook was introduced to promote rational antibiotic use, yet its real-world feasibility and potential impact remain underexplored. Our study evaluated the impact and usefulness of the WHO AWaRe Handbook training among private primary care providers (PCPs) in Patna, India. Methods: We conducted a pre–post pilot study among 145 private PCPs (40 formal PCPs (FPs) and 105 informal PCPs (IPs)) in Patna, India. Participants received training from an infectious disease physician on the WHO AWaRe Antibiotic Handbook. Antibiotic prescribing knowledge was assessed before and after the intervention using clinical vignettes for four conditions: acute diarrhea, urinary tract infection (UTI), cellulitis, and community-acquired pneumonia (CAP). An endline survey evaluated the perceived usefulness of the intervention. Changes in prescribing knowledge were analyzed using McNemar’s test for paired data. Results: The intervention significantly reduced overall antibiotic prescribing for acute diarrhea (p = 0.0003) and UTI (p = 0.0113), with greater reductions among IPs. No significant changes were observed for cellulitis (p = 0.3692) or CAP (p = 0.7150). Watch-category antibiotic prescribing significantly decreased for acute diarrhea (p < 0.0001), with no significant changes for other conditions. IPs showed greater improvements overall compared to FPs. The majority of PCPs (75%; n = 107) rated the training as moderately or very useful. Conclusion: Training private PCPs using the WHO AWaRe Handbook improved antibiotic prescribing knowledge for some common conditions, particularly among IPs. Future research should combine training with strategies that address broader contextual barriers, alongside tailored reinforcement interventions and longer-term follow-up.
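As a rough illustration of the paired analysis named in the abstract above, the sketch below computes a two-sided exact McNemar p-value from the two discordant-pair counts. The abstract does not say which variant of the test was used, so the exact binomial form is an assumption, and the counts in the usage line are hypothetical, not taken from the study.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b: pairs that changed in one direction (e.g. prescribed before, not after)
    c: pairs that changed in the other direction
    Concordant pairs do not enter the test.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of change
    k = min(b, c)
    # Under the null, each discordant pair falls in either cell with probability 1/2.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical discordant counts, for illustration only.
print(f"p = {mcnemar_exact(30, 8):.4f}")
```

Only providers whose vignette response changed between baseline and endline contribute to the statistic, which is why the function takes just the two discordant counts.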
Neighbor-Aware Localized Concept Erasure in Text-to-Image Diffusion Models
Concept erasure in text-to-image diffusion models seeks to remove undesired concepts while preserving overall generative capability. Localized erasure methods aim to restrict edits to the spatial region occupied by the target concept. However, we observe that suppressing a concept can unintentionally weaken semantically related neighbor concepts, reducing fidelity in fine-grained domains. We propose Neighbor-Aware Localized Concept Erasure (NLCE), a training-free framework designed to better preserve neighboring concepts while removing target concepts. It operates in three stages: (1) a spectrally-weighted embedding modulation that attenuates target concept directions while stabilizing neighbor concept representations, (2) an attention-guided spatial gate that identifies regions exhibiting residual concept activation, and (3) a spatially-gated hard erasure that eliminates remaining traces only where necessary. This neighbor-aware pipeline enables localized concept removal while maintaining the surrounding concept neighborhood structure. Experiments on fine-grained datasets (Oxford Flowers, Stanford Dogs) show that our method effectively removes target concepts while better preserving closely related categories. Additional results on celebrity identity, explicit content and artistic style demonstrate robustness and generalization to broader erasure scenarios.
The Cluster Evolutionary Reference Ensemble at Low-z (CEREAL) Sample of Galaxy Clusters. I. X-Ray Morphological Properties and Demographics
L. V. White
Michael McDonald
S. J. Allen
Marshall W. Bautz
Michael S. Calzadilla
G. P. Garmire
Ralph Kraft
Adam B. Mantz
Taweewat Somboonpanyakul
Alexey Vikhlinin
With rapid improvements in the assembly of large samples of galaxy clusters, we are approaching the ability to study clusters at z ≳ 2. Evolutionary studies comparing these distant clusters to the clusters in our local Universe depend heavily on the reliability of low-redshift cluster samples, most of which are subject to X-ray selection effects, biasing them to relaxed, cool-core clusters. Here, we introduce the Cluster Evolutionary Reference Ensemble at Low-z (CEREAL) sample, composed of Chandra X-ray observations of 169 galaxy clusters that have been selected from the Planck Sunyaev–Zel’dovich catalog. CEREAL has a simple and well-understood selection function, spans an order of magnitude in mass at z ∼ 0.15, and has uniform, high-resolution X-ray follow-up. We present the full sample and provide results based on X-ray surface brightness properties, finding significantly more non-cool-core systems than in X-ray-selected samples. We use surface brightness concentration ($c_{\mathrm{SB}}$) as a proxy for cool-core strength and centroid shift (w) to measure dynamical state. Over the full sample, we find a cool-core ($c_{\mathrm{SB}} > 0.075$) fraction of $0.39^{+0.04}_{-0.04}$, a strong cool-core ($c_{\mathrm{SB}} > 0.155$) fraction of $0.13^{+0.03}_{-0.03}$, and a dynamically relaxed (w < 0.01) frac
Adversarial-Robust Multivariate Time-Series Anomaly Detection via Joint Information Retention
Time-series anomaly detection (TSAD) is a critical component in monitoring complex systems, yet modern deep learning-based detectors are often highly sensitive to localized input corruptions and structured noise. We propose ARTA (Adversarially Robust multivariate Time-series Anomaly detection via joint information retention), a joint training framework that improves detector robustness through a principled min-max optimization objective. ARTA comprises an anomaly detector and a sparsity-constrained mask generator that are trained simultaneously. The generator identifies minimal, task-relevant temporal perturbations that maximally increase the detector's anomaly score, while the detector is optimized to remain stable under these structured perturbations. The resulting masks characterize the detector's sensitivity to adversarial temporal corruptions and can serve as explanatory signals for the detector's decisions. This adversarial training strategy exposes brittle decision pathways and encourages the detector to rely on distributed and stable temporal patterns rather than spurious localized artifacts. We conduct extensive experiments on the TSB-AD benchmark, demonstrating that ARTA consistently improves anomaly detection performance across diverse datasets and exhibits significantly more graceful degradation under increasing noise levels compared to state-of-the-art baselines.
Can Vision Foundation Models Navigate? Zero-Shot Real-World Evaluation and Lessons Learned
Visual Navigation Models (VNMs) promise generalizable robot navigation by learning from large-scale visual demonstrations. Despite growing real-world deployment, existing evaluations rely almost exclusively on success rate (whether the robot reaches its goal), which conceals trajectory quality, collision behavior, and robustness to environmental change. We present a real-world evaluation of five state-of-the-art VNMs (GNM, ViNT, NoMaD, NaviBridger, and CrossFormer) across two robot platforms and five environments spanning indoor and outdoor settings. Beyond success rate, we combine path-based metrics with vision-based goal-recognition scores and assess robustness through controlled image perturbations (motion blur, sun flare). Our analysis uncovers three systematic limitations: (a) even architecturally sophisticated diffusion and transformer-based models exhibit frequent collisions, indicating limited geometric understanding; (b) models fail to discriminate between different locations that are perceptually similar even when some semantic differences are present, causing goal prediction errors in repetitive environments; and (c) performance degrades under distribution shift. We will publicly release our evaluation codebase and dataset to facilitate reproducible benchmarking of VNMs.
Collision-Aware Vision-Language Learning for End-to-End Driving with Multimodal Infraction Datasets
Alex Koran
Takuya Nanri
Fangge Chen
High infraction rates remain the primary bottleneck for end-to-end (E2E) autonomous driving, as evidenced by the low driving scores on the CARLA Leaderboard. Despite collision-related infractions being the dominant failure mode in closed-loop evaluations, collision-aware representation learning has received limited attention. To address this gap, we first develop a Video-Language-Augmented Anomaly Detector (VLAAD), leveraging a Multiple Instance Learning (MIL) formulation to obtain stable, temporally localized collision signals for proactive prediction. To transition these capabilities into closed-loop simulations, we must overcome the limitations of existing simulator datasets, which lack multimodality and are frequently restricted to simple intersection scenarios. Therefore, we introduce CARLA-Collide, a large-scale multimodal dataset capturing realistic collision events across highly diverse road networks. Trained on this diverse simulator data, VLAAD serves as a collision-aware plug-in module that can be seamlessly integrated into existing E2E driving models. By integrating our module into a pretrained TransFuser++ agent, we demonstrate a 14.12% relative increase in driving score with minimal fine-tuning. Beyond closed-loop evaluation, we further assess the generalization capability of VLAAD in an open-loop setting using real-world driving data. To support this analysis, we introduce Real-Collide, a multimodal dataset of diverse dashcam videos paired with semantically rich annotations for collision detection and prediction. On this benchmark, despite containing only 0.6B parameters, VLAAD outperforms a multi-billion-parameter vision-language model, achieving a 23.3% improvement in AUC.
A Compression Perspective on Simplicity Bias
Deep neural networks exhibit a simplicity bias, a well-documented tendency to favor simple functions over complex ones. In this work, we cast new light on this phenomenon through the lens of the Minimum Description Length principle, formalizing supervised learning as a problem of optimal two-part lossless compression. Our theory explains how simplicity bias governs feature selection in neural networks through a fundamental trade-off between model complexity (the cost of describing the hypothesis) and predictive power (the cost of describing the data). Our framework predicts that as the amount of available training data increases, learners transition through qualitatively different features -- from simple spurious shortcuts to complex features -- only when the reduction in data encoding cost justifies the increased model complexity. Consequently, we identify distinct data regimes where increasing data promotes robustness by ruling out trivial shortcuts, and conversely, regimes where limiting data can act as a form of complexity-based regularization, preventing the learning of unreliable complex environmental cues. We validate our theory on a semi-synthetic benchmark showing that the feature selection of neural networks follows the same trajectory of solutions as optimal two-part compressors.
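The trade-off described in the abstract above can be sketched with a toy two-part code: the total cost is the bits needed to describe the hypothesis plus the bits needed to describe the labels given its predictions. Everything below is an illustrative assumption rather than the paper's setup: the two features, their model costs in bits, their error rates, and the use of binary entropy as the per-sample label cost are all hypothetical.

```python
from math import log2


def binary_entropy(p: float) -> float:
    """Bits per sample needed to encode which labels a predictor gets wrong."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)


def two_part_cost(model_bits: float, error_rate: float, n: int) -> float:
    """Two-part code length: L(hypothesis) + L(data | hypothesis)."""
    return model_bits + n * binary_entropy(error_rate)


# Hypothetical features: (model cost in bits, error rate on the labels).
FEATURES = {
    "shortcut": (10.0, 0.20),   # cheap to describe, less predictive
    "complex": (500.0, 0.01),   # expensive to describe, more predictive
}


def preferred_feature(n: int) -> str:
    """Feature with the shortest total description length at sample size n."""
    return min(FEATURES, key=lambda name: two_part_cost(*FEATURES[name], n))


for n in (100, 1_000, 10_000):
    print(n, preferred_feature(n))
```

With these made-up numbers the cheaper shortcut wins at small sample sizes, and the complex feature takes over once the per-sample savings in data cost outweigh its extra model cost.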
EngineAD: A Real-World Vehicle Engine Anomaly Detection Dataset
Christopher Roth
Rory Woods
Ken Sills
The progress of Anomaly Detection (AD) in safety-critical domains, such as transportation, is severely constrained by the lack of large-scale, real-world benchmarks. To address this, we introduce EngineAD, a novel, multivariate dataset comprising high-resolution sensor telemetry collected from a fleet of 25 commercial vehicles over a six-month period. Unlike synthetic datasets, EngineAD features authentic operational data labeled with expert annotations, distinguishing normal states from subtle indicators of incipient engine faults. We preprocess the data into