Publications

Probabilistic Calibration Is a Trainable Capability in Language Models
Language models are increasingly used in settings where outputs must satisfy user-specified randomness constraints, yet their generation probabilities are often poorly calibrated to those targets. We study whether this capability can be improved directly through fine-tuning. Concretely, we fine-tune language models on synthetic prompts that require sampling from mathematical distributions, and compare two Calibration Fine-Tuning variants: a soft-target method that converts the desired output distribution into trie-derived next-token targets, and a hard-target method that trains on sampled completions from the same target distribution. Across 12 models spanning four families, both methods substantially improve structured-sampling fidelity on held-out distribution families and unseen parameter settings, showing that probabilistic calibration is a trainable capability. Under our selected training configurations, the two methods exhibit different empirical profiles: hard-target fine-tuning is often strongest on structured numeric sampling, while soft-target fine-tuning performs better on broader stochastic generation benchmarks, including open-ended random generation, multiple-choice answer-position balancing, and NoveltyBench. The gains sometimes reduce downstream capability, especially arithmetic reasoning, with costs varying by model. Overall, our results show that probabilistic calibration can be improved through fine-tuning, with our hard-target configuration favoring exact numeric fidelity and our soft-target configuration favoring broader stochastic transfer. Code is available at https://github.com/chandar-lab/calibration-finetuning.
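As an illustration of what "trie-derived next-token targets" could mean, the sketch below turns a distribution over complete outputs into the conditional next-token distribution at a given prefix. It is not taken from the paper or its repository: the function name `next_token_targets`, the `<end>` marker, and the character-level tokenization are all assumptions made for this example.

```python
from collections import defaultdict

END = "<end>"  # hypothetical marker for "the output stops here"


def next_token_targets(dist, prefix):
    """Return P(next token | prefix) implied by a distribution over outputs.

    dist maps each complete output (a tuple of tokens) to its probability.
    This is equivalent to walking a trie of the outputs and normalizing the
    probability mass that passes through each child of the prefix node.
    """
    prefix = tuple(prefix)
    n = len(prefix)
    mass = defaultdict(float)
    for tokens, p in dist.items():
        if tokens[:n] != prefix:
            continue  # this output does not pass through the prefix node
        nxt = tokens[n] if len(tokens) > n else END
        mass[nxt] += p
    total = sum(mass.values())
    if total == 0:
        raise ValueError("prefix has zero probability under dist")
    return {tok: m / total for tok, m in mass.items()}


# Assumed example: uniform over the integers 1..20, one token per character.
dist = {tuple(str(k)): 1 / 20 for k in range(1, 21)}
first = next_token_targets(dist, ())        # targets for the first token
after_one = next_token_targets(dist, ("1",))  # targets after emitting "1"
```

In this toy case the first-token target puts more mass on "1" than on any other digit, because "1" and "10" through "19" all share that prefix. A hard-target variant would instead draw complete outputs from `dist` and train on those samples.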
scShapeBench: Discovering geometry from high dimensional scRNAseq data
Andrew J. Steindl
João Felipe Rocha
Brian Tshilengi Di Bassinga
Zachary Warren
Shabarni Gupta
Leire Torices
Daniel Neumann
Timothy J. Mann
Ihuan Gunawan
Dhananjay Bhaskar
John G. Lock
Christine L. Chaffer
High-dimensional point cloud data arise across many scientific domains, especially single-cell biology. The shapes or topologies of these datasets determine the types of information that can be extracted. For example, clustered data supports cell-type identification, trajectory structures support transition analysis, and archetypal structures capture continua of cellular behaviors. Existing analysis pipelines often assume a specific shape. The standard Seurat pipeline combines UMAP visualization with Louvain clustering and therefore assumes clustered data, while tools such as Monocle and SPADE assume tree-like structures, and flow-based models such as MIOFlow and Conditional Flow Matching target trajectories. Choosing which pipeline to apply is therefore often left to bioinformaticians who visually inspect datasets before selecting an analysis strategy. With the rise of agentic AI scientists, automating shape detection is increasingly important for selecting downstream analysis pipelines. To address this problem, we introduce scShapeBench, a benchmark dataset for shape detection containing both synthetic and expert-annotated single-cell datasets. Synthetic datasets are sampled from ground-truth skeleton graphs with controlled variance. Real single-cell datasets are curated from diverse sources and annotated by experts into four categories: clusters, single trajectory, multi-branching, and archetypal. We additionally introduce scReebTower, a baseline method that uses diffusion geometry to extract Reeb graphs and connect visualization with pipeline selection. We provide topology-aware evaluation metrics and compare scReebTower against PAGA and Mapper on synthetic and real data. Our results indicate that scReebTower outperforms existing baselines. Overall, our contributions span benchmarks, evaluation metrics, and a baseline for automated shape detection in single-cell data.
Sleep Spindle-Locked Targeted Memory Reactivation Enhances Declarative Memory Consolidation
Vaishali Mutreja
Prakriti Gupta
Ovidiu Lungu
Latifa Lazzouni
Ella Gabitov
Habib Benali
Hugo Jourde
Emily BJ Coffey
Jean-Marc Lina
Geneviève Albouy
Bradley King
Arnaud Boutin
Julie Carrier
Julien Doyon
Study Objectives: Sleep spindles are implicated in memory consolidation. Yet direct evidence linking spindle dynamics to declarative memory outcomes remains limited. We thus tested whether targeted memory reactivation (TMR) time-locked to sleep spindles enhances declarative memory, and whether the temporal organization of stimulated spindles (trains versus isolated events) is selectively associated with distinct memory outcomes. Methods: Twenty-eight healthy young adults learned image locations from two categories (animals, clothing) in a grid, each paired with a distinct auditory cue. During overnight NREM sleep, one cue was replayed time-locked to spindles detected in real time using a closed-loop system (TMR condition); the other served as the non-reactivated control (No-TMR condition). Category-cue assignment was counterbalanced. Post-sleep recall, recognition accuracy, and movement time were assessed. Results: Recall accuracy was significantly higher in the TMR than the No-TMR condition (93.96% vs. 90.61%, p = .024), whereas recognition accuracy (p = .139) and movement time (p = .651) did not differ. Stimulation intensity within spindle trains correlated with the TMR effect on recall (Spearman ρ = .531, p = .004), whereas the proportion of isolated spindle stimulations correlated with the TMR effect on recognition (ρ = .563, p = .002). Cross-associations were not significant. Conclusions: Spindle-locked TMR enhances recall-based declarative memory retention. The selective association between spindle temporal clustering and memory outcomes suggests that train-embedded and isolated spindles support different aspects of memory consolidation, highlighting spindle temporal context as a functionally relevant dimension of sleep-dependent memory processing.
A systematic review of human-LLM interactions in computational thinking empirical studies
Augmenting LLM Reasoning with Dynamic Notes Writing for Complex MultiHop QA
Rishabh Maheshwary
Masoud Hashemi
Khyati Mahajan
Shiva Krishna Reddy Malay
Sathwik Tejaswi Madhusudhan
Vikas Yadav
Exploring Token-Space Manipulation in Latent Audio Tokenizers
Neural audio codecs provide compact discrete representations for speech generation and manipulation. However, most codecs organize tokens as frame-level sequences, making it difficult to study or intervene on global factors of variation. In this work, we propose the Latent Audio Tokenizer for Token-space Editing (LATTE) that appends a fixed set of learnable latent tokens to the audio feature sequence and retains only these tokens for quantization and decoding. This design produces a compact, non-temporally aligned bottleneck in which each token can aggregate global information across the full utterance. We show that the resulting tokenizer preserves competitive reconstruction quality in low-bitrate speech coding settings while enabling simple token-space interventions. In particular, we find that swapping selected latent token positions between utterances can modify global attributes, such as speaker identity and background noise, and we evaluate these interventions on voice conversion and denoising tasks. Our results suggest that compact latent audio tokenizers can support controllable audio manipulation without supervision in task-specific editing models.
FRASE: Frame-based Structured Representations for Generalizable SPARQL Query Generation
Papa Abdou Karim Karou Diallo
Safe Aerial 3D Path Planning for Autonomous UAVs using Magnetic Potential Fields
Safe autonomous Uncrewed Aerial Vehicle (UAV) navigation in urban environments requires real-time path planning that avoids obstacles. MaxConvNet is a potential-field planner that leverages properties of Maxwell's equations to generate a path to the goal without local minima. We extend the 2D MaxConvNet magnetic field planner to 3D, using a convolutional autoencoder to predict obstacle-aware potential fields from LiDAR-derived 101^3 voxel grids. Evaluation across 100 randomized closed-loop trials in two distinct Cosys-AirSim urban environments, a dense night-time cityscape and a suburban district, shows a 100% path planning success rate on both maps without retraining. In offline path planning, 3DMaxConvNet produces path lengths comparable to A* on unseen maps while reducing runtime from 0.155–0.17 s to 0.087–0.089 s, or about 1.7–1.95 times faster than A*. Against RRT*(3k), 3DMaxConvNet achieves similar path quality while reducing planning runtime from 17.2–17.5 s to about 0.09 s, which is roughly 193–201 times faster than RRT*(3k).
Is the representation of fear distributed across the whole brain?
Vincent Taschereau‐Dumouchel
Marjorie Côté
Darius Valevicius
Lisa‐Marie Davignon
Marie-France Marin
Unlearning with Asymmetric Sources: Improved Unlearning-Utility Trade-off with Public Data
Ahmed Mehdi Inane
Gintare Karolina Dziugaite
Noise-based certified machine unlearning currently faces a hard ceiling: the noise magnitude required to certify unlearning typically destroys model utility, particularly for large-scale deletion requests. While leveraging public data is a standard technique in differential privacy to relax this tension, its role in unlearning remains unexplored. We address this gap by introducing Asymmetric Langevin Unlearning (ALU), a framework that uses public data to mitigate privacy costs. We prove that public data injection suppresses the unlearning cost by a factor of
Phases of Muon: When Muon Eclipses SignSGD
Recently, Muon and related spectral optimizers have demonstrated strong empirical performance as scalable stochastic methods, often outperforming Adam. Yet their behaviour remains poorly understood. We analyze stochastic spectral optimizers, including Muon, on a high-dimensional matrix-valued least squares problem. We derive explicit deterministic dynamics that provide a tractable framework for studying learning behaviour with a focus on (stochastic) SignSVD, which Muon approximates, and (stochastic) SignSGD, the latter serving as a proxy for Adam. Our analysis shows that for large batch size, SignSVD performs a square-root preconditioning with respect to the data covariance spectrum, while for small batch size smaller eigenmodes behave like SGD, slowing down convergence. We contrast with SignSGD which for generic covariance performs no preconditioning and has no transition, leading to different optimal learning rates and convergence characteristics. The two methods match up to a constant factor with isotropic data, but behave differently with anisotropic data. An analysis of a power law covariance model with data exponent
Bootstrap Sampling Improves Model Soup Performance via Increased Model Diversity for Pneumonia Classification
Sara Early
Omata I. Ehizokhale
Nils D. Forkert
Model soups combine multiple trained neural network checkpoints through weight averaging, often outperforming individual models and achieving performance comparable to deep ensembles without increasing inference cost. However, their effectiveness depends critically on checkpoint diversity, and when models are trained on the same dataset, optimization trajectories may converge toward similar regions of parameter space, limiting this diversity. In this work, we investigate bootstrap resampling as a simple data-level mechanism for increasing checkpoint diversity. Using a binary pneumonia classification task and 644 radiographs from the National Institutes of Health (NIH) ChestXray14 dataset, we train pools of convolutional neural networks under varying bootstrap ratios and construct greedy model soups. While checkpoint models trained on the full dataset achieve the highest mean individual accuracy, they are highly similar and offer little complementary signal, limiting the effectiveness of greedy selection. Bootstrap sampling introduces variability in the training data, producing more diverse checkpoints that, although individually weaker, enable greedy soup construction to combine complementary representations and achieve superior overall performance. The strongest model soup, obtained with 70% bootstrap sampling, achieves a test accuracy of 0.650, representing a 9.8 percentage point improvement over the mean individual checkpoint accuracy (0.551) under the same condition. While absolute performance is limited by the small cohort size and training-from-scratch setting, this result highlights the substantial gains achievable through diversity-driven weight averaging.
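The combination of bootstrap resampling and greedy soup construction described above can be sketched roughly as follows. This is a toy illustration, not the authors' code: checkpoints are flat NumPy vectors rather than network weights, `score` stands in for held-out accuracy, and reading a "70% bootstrap ratio" as drawing 70% of the dataset size with replacement is an assumption.

```python
import numpy as np


def bootstrap_indices(n, ratio, rng):
    """Draw round(ratio * n) sample indices with replacement (assumed reading)."""
    k = int(round(ratio * n))
    return rng.integers(0, n, size=k)


def average(weights):
    """Uniform weight average of a list of same-shaped checkpoints."""
    return np.mean(np.stack(weights), axis=0)


def greedy_soup(checkpoints, score):
    """Greedy soup: add a checkpoint only if the averaged score does not drop."""
    ranked = sorted(checkpoints, key=score, reverse=True)
    soup = [ranked[0]]
    best = score(ranked[0])
    for w in ranked[1:]:
        candidate = score(average(soup + [w]))
        if candidate >= best:
            soup.append(w)
            best = candidate
    return average(soup), best


# Toy setup: higher score means closer to a hypothetical ideal weight vector.
target = np.array([1.0, 2.0])
score = lambda w: -float(np.linalg.norm(w - target))
checkpoints = [
    target + np.array([1.0, 0.0]),
    target + np.array([-1.0, 0.0]),  # errs in the opposite direction
    target + np.array([0.0, 5.0]),   # poor checkpoint, should be rejected
]
soup_weights, soup_score = greedy_soup(checkpoints, score)

# One bootstrap draw for a 644-image dataset at a 70% ratio.
idx = bootstrap_indices(644, 0.7, np.random.default_rng(0))
```

In the toy run the two checkpoints that err in opposite directions average to a better solution than either alone, while the third is rejected. That is the kind of complementarity the abstract attributes to more diverse, bootstrap-trained checkpoints.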