Publications

Probabilistic Calibration Is a Trainable Capability in Language Models

Sruthi Kuriakose

Language models are increasingly used in settings where outputs must satisfy user-specified randomness constraints, yet their generation probabilities are often poorly calibrated to those targets. We study whether this capability can be improved directly through fine-tuning. Concretely, we fine-tune language models on synthetic prompts that require sampling from mathematical distributions, and compare two Calibration Fine-Tuning variants: a soft-target method that converts the desired output distribution into trie-derived next-token targets, and a hard-target method that trains on sampled completions from the same target distribution. Across 12 models spanning four families, both methods substantially improve structured-sampling fidelity on held-out distribution families and unseen parameter settings, showing that probabilistic calibration is a trainable capability. Under our selected training configurations, the two methods exhibit different empirical profiles: hard-target fine-tuning is often strongest on structured numeric sampling, while soft-target fine-tuning performs better on broader stochastic generation benchmarks, including open-ended random generation, multiple-choice answer-position balancing, and NoveltyBench. The gains sometimes reduce downstream capability, especially arithmetic reasoning, with costs varying by model. Overall, our results show that probabilistic calibration can be improved through fine-tuning, with our hard-target configuration favoring exact numeric fidelity and our soft-target configuration favoring broader stochastic transfer. Code is available at https://github.com/chandar-lab/calibration-finetuning.
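The contrast between soft and hard targets described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: character-level tokenization, the `<eos>` marker, and the function names `next_token_targets` and `sample_completions` are assumptions made for illustration only. The first function marginalizes a distribution over complete outputs into a next-token distribution at every trie prefix; the second simply draws completions from the same distribution.

```python
import random
from collections import defaultdict


def next_token_targets(target, tokenize=list, eos="<eos>"):
    """Soft targets: map each trie prefix to its next-token distribution.

    `target` maps complete output strings to probabilities. Tokenization
    is character-level here (an assumption, not the paper's tokenizer).
    """
    prefix_mass = defaultdict(float)
    edge_mass = defaultdict(lambda: defaultdict(float))
    for text, prob in target.items():
        tokens = tuple(tokenize(text)) + (eos,)
        for i, tok in enumerate(tokens):
            prefix = tokens[:i]
            prefix_mass[prefix] += prob      # total mass passing through prefix
            edge_mass[prefix][tok] += prob   # mass continuing with `tok`
    return {
        prefix: {tok: m / prefix_mass[prefix] for tok, m in edges.items()}
        for prefix, edges in edge_mass.items()
    }


def sample_completions(target, k, seed=0):
    """Hard targets: draw k complete outputs from the same distribution."""
    rng = random.Random(seed)
    texts = list(target)
    return rng.choices(texts, weights=[target[t] for t in texts], k=k)


# Toy target: uniform over the integers 8, 9, 10, 11.
target = {str(n): 0.25 for n in range(8, 12)}
soft = next_token_targets(target)
hard = sample_completions(target, k=5)
```

In this toy case the first token "1" receives half the probability mass because both "10" and "11" pass through it, which is the kind of prefix-level information a sampled completion carries only implicitly.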

2026-05-11

arXiv (preprint)

doi.org

arxiv.org

scShapeBench: Discovering geometry from high dimensional scRNAseq data

Andrew J. Steindl

João Felipe Rocha

Brian Tshilengi Di Bassinga

Zachary Warren

Matthew Scicluna

César Miguel Valdez Cordova

Shabarni Gupta

Leire Torices

Daniel Neumann

Timothy J. Mann

Ihuan Gunawan

Dhananjay Bhaskar

John G. Lock

Christine L. Chaffer

Guy Wolf

Smita Krishnaswamy

High-dimensional point cloud data arise across many scientific domains, especially single-cell biology. The shapes or topologies of these datasets determine the types of information that can be extracted. For example, clustered data supports cell-type identification, trajectory structures support transition analysis, and archetypal structures capture continua of cellular behaviors. Existing analysis pipelines often assume a specific shape. The standard Seurat pipeline combines UMAP visualization with Louvain clustering and therefore assumes clustered data, while tools such as Monocle and SPADE assume tree-like structures, and flow-based models such as MIOFlow and Conditional Flow Matching target trajectories. Choosing which pipeline to apply is therefore often left to bioinformaticians who visually inspect datasets before selecting an analysis strategy. With the rise of agentic AI scientists, automating shape detection is increasingly important for selecting downstream analysis pipelines. To address this problem, we introduce scShapeBench, a benchmark dataset for shape detection containing both synthetic and expert-annotated single-cell datasets. Synthetic datasets are sampled from ground-truth skeleton graphs with controlled variance. Real single-cell datasets are curated from diverse sources and annotated by experts into four categories: clusters, single trajectory, multi-branching, and archetypal. We additionally introduce scReebTower, a baseline method that uses diffusion geometry to extract Reeb graphs and connect visualization with pipeline selection. We provide topology-aware evaluation metrics and compare scReebTower against PAGA and Mapper on synthetic and real data. Our results indicate that scReebTower outperforms existing baselines. Overall, our contributions span benchmarks, evaluation metrics, and a baseline for automated shape detection in single-cell data.
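One way to picture "sampled from ground-truth skeleton graphs with controlled variance" is the sketch below. It is a guess at the general idea, not the benchmark's generator: the function name `sample_from_skeleton`, uniform sampling along edges, isotropic Gaussian noise, and the three-dimensional toy coordinates are all assumptions.

```python
import random


def sample_from_skeleton(nodes, edges, n_points, sigma, seed=0):
    """Sample noisy points along the edges of a skeleton graph.

    nodes: dict mapping node name to a coordinate list.
    edges: list of (node_a, node_b) pairs.
    sigma: standard deviation of the Gaussian noise added per coordinate.
    """
    rng = random.Random(seed)
    points = []
    for _ in range(n_points):
        a, b = rng.choice(edges)   # pick an edge uniformly at random
        t = rng.random()           # position along that edge in [0, 1)
        points.append([
            (1 - t) * xa + t * xb + rng.gauss(0.0, sigma)
            for xa, xb in zip(nodes[a], nodes[b])
        ])
    return points


# Toy bifurcating skeleton: one root with two branches.
nodes = {"root": [0.0, 0.0, 0.0], "left": [1.0, 1.0, 0.0], "right": [1.0, -1.0, 0.0]}
edges = [("root", "left"), ("root", "right")]
noiseless = sample_from_skeleton(nodes, edges, n_points=50, sigma=0.0)
noisy = sample_from_skeleton(nodes, edges, n_points=50, sigma=0.1)
```

Raising `sigma` blurs the branches into each other, which is presumably what makes the shape harder to recover from the point cloud alone.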

2026-05-11

arXiv (preprint)

doi.org

arxiv.org

Sleep Spindle-Locked Targeted Memory Reactivation Enhances Declarative Memory Consolidation

Vaishali Mutreja

Prakriti Gupta

Ovidiu Lungu

Latifa Lazzouni

Ella Gabitov

Habib Benali

Hugo Jourde

Giovanni Beltrame

Emily BJ Coffey

Jean-Marc Lina

Geneviève Albouy

Bradley King

Arnaud Boutin

Julie Carrier

Julien Doyon

Study Objectives: Sleep spindles are implicated in memory consolidation. Yet direct evidence linking spindle dynamics to declarative memory outcomes remains limited. We thus tested whether targeted memory reactivation (TMR) time-locked to sleep spindles enhances declarative memory, and whether the temporal organization of stimulated spindles (trains versus isolated events) is selectively associated with distinct memory outcomes. Methods: Twenty-eight healthy young adults learned image locations from two categories (animals, clothing) in a grid, each paired with a distinct auditory cue. During overnight NREM sleep, one cue was replayed time-locked to spindles detected in real time using a closed-loop system (TMR condition); the other served as the non-reactivated control (No-TMR condition). Category-cue assignment was counterbalanced. Post-sleep recall, recognition accuracy, and movement time were assessed. Results: Recall accuracy was significantly higher in the TMR than the No-TMR condition (93.96% vs. 90.61%, p = .024), whereas recognition accuracy (p = .139) and movement time (p = .651) did not differ. Stimulation intensity within spindle trains correlated with the TMR effect on recall (Spearman ρ = .531, p = .004), whereas the proportion of isolated spindle stimulations correlated with the TMR effect on recognition (ρ = .563, p = .002). Cross-associations were not significant. Conclusions: Spindle-locked TMR enhances recall-based declarative memory retention. The selective association between spindle temporal clustering and memory outcomes suggests that train-embedded and isolated spindles support different aspects of memory consolidation, highlighting spindle temporal context as a functionally relevant dimension of sleep-dependent memory processing.

2026-05-11

bioRxiv (preprint)

doi.org

A systematic review of human-LLM interactions in computational thinking empirical studies

Yimei Zhang

You Song

Doina Precup

Reihaneh Rabbany

Maria Cutumisu

2026-05-11

Computer Science Education (published)

doi.org

Augmenting LLM Reasoning with Dynamic Notes Writing for Complex MultiHop QA

Rishabh Maheshwary

Masoud Hashemi

Khyati Mahajan

Shiva Krishna Reddy Malay

Sai Rajeswar Mudumba

Sathwik Tejaswi Madhusudhan

Spandana Gella

Vikas Yadav

2026-05-10

Language Resources and Evaluation Conference (published)

doi.org

Exploring Token-Space Manipulation in Latent Audio Tokenizers

Neural audio codecs provide compact discrete representations for speech generation and manipulation. However, most codecs organize tokens as frame-level sequences, making it difficult to study or intervene on global factors of variation. In this work, we propose the Latent Audio Tokenizer for Token-space Editing (LATTE) that appends a fixed set of learnable latent tokens to the audio feature sequence and retains only these tokens for quantization and decoding. This design produces a compact, non-temporally aligned bottleneck in which each token can aggregate global information across the full utterance. We show that the resulting tokenizer preserves competitive reconstruction quality in low-bitrate speech coding settings while enabling simple token-space interventions. In particular, we find that swapping selected latent token positions between utterances can modify global attributes, such as speaker identity and background noise, and we evaluate these interventions on voice conversion and denoising tasks. Our results suggest that compact latent audio tokenizers can support controllable audio manipulation without supervision in task-specific editing models.

2026-05-10

arXiv (preprint)

doi.org

arxiv.org

FRASE: Frame-based Structured Representations for Generalizable SPARQL Query Generation

Papa Abdou Karim Karou Diallo

Amal Zouaq

2026-05-10

Language Resources and Evaluation Conference (published)

doi.org

Safe Aerial 3D Path Planning for Autonomous UAVs using Magnetic Potential Fields

Haechan Mark Bong

Giovanni Beltrame

Safe autonomous Uncrewed Aerial Vehicle (UAV) navigation in urban environments requires real-time path planning that avoids obstacles. MaxConvNet is a potential-field planner that leverages properties of Maxwell's equations to generate a path to the goal without local minima. We extend the 2D MaxConvNet magnetic field planner to 3D, using a convolutional autoencoder to predict obstacle-aware potential fields from LiDAR-derived 101^3 voxel grids. Evaluation across 100 randomized closed-loop trials in two distinct Cosys-AirSim urban environments, a dense night-time cityscape and a suburban district, shows a 100% path planning success rate on both maps without retraining. In offline path planning, 3DMaxConvNet produces path lengths comparable to A* on unseen maps while reducing runtime from 0.155–0.17 s to 0.087–0.089 s, or about 1.7–1.95 times faster than A*. Against RRT*(3k), 3DMaxConvNet achieves similar path quality while reducing planning runtime from 17.2–17.5 s to about 0.09 s, which is roughly 193–201 times faster than RRT*(3k).
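The reported speedup ranges can be cross-checked from the runtimes quoted in the abstract. The short sketch below assumes the widest possible range (slowest planner run against fastest baseline run, and the reverse) and assumes that "about 0.09 s" in the RRT* comparison refers to the same 0.087 to 0.089 s interval; the helper name `speedup_range` is hypothetical.

```python
def speedup_range(baseline, method):
    """Return (min, max) speedup given (low, high) runtimes in seconds."""
    lowest = baseline[0] / method[1]    # fastest baseline vs. slowest method
    highest = baseline[1] / method[0]   # slowest baseline vs. fastest method
    return lowest, highest


planner = (0.087, 0.089)                         # 3DMaxConvNet runtime range
astar = speedup_range((0.155, 0.17), planner)    # roughly 1.74 to 1.95
rrt_star = speedup_range((17.2, 17.5), planner)  # roughly 193 to 201
```

Under these assumptions the computed intervals agree with the 1.7 to 1.95 and 193 to 201 figures stated above.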

2026-05-10

arXiv (preprint)

doi.org

arxiv.org

Is the representation of fear distributed across the whole brain?

Vincent Taschereau‐Dumouchel

Marjorie Côté

Darius Valevicius

Lisa‐Marie Davignon

Marie-France Marin

2026-05-10

Research Square (accepted)

doi.org

Unlearning with Asymmetric Sources: Improved Unlearning-Utility Trade-off with Public Data

Ahmed Mehdi Inane

Vincent Quirion

Gintare Karolina Dzugaite

Ioannis Mitliagkas

Noise-based certified machine unlearning currently faces a hard ceiling: the noise magnitude required to certify unlearning typically destroys model utility, particularly for large-scale deletion requests. While leveraging public data is a standard technique in differential privacy to relax this tension, its role in unlearning remains unexplored. We address this gap by introducing Asymmetric Langevin Unlearning (ALU), a framework that uses public data to mitigate privacy costs. We prove that public data injection suppresses the unlearning cost by a factor of

2026-05-10

arXiv (preprint)

doi.org

arxiv.org

Phases of Muon: When Muon Eclipses SignSGD

Lucas Benigni

Atish Agarwala

Recently, Muon and related spectral optimizers have demonstrated strong empirical performance as scalable stochastic methods, often outperforming Adam. Yet their behaviour remains poorly understood. We analyze stochastic spectral optimizers, including Muon, on a high-dimensional matrix-valued least squares problem. We derive explicit deterministic dynamics that provide a tractable framework for studying learning behaviour with a focus on (stochastic) SignSVD, which Muon approximates, and (stochastic) SignSGD, the latter serving as a proxy for Adam. Our analysis shows that for large batch size, SignSVD performs a square-root preconditioning with respect to the data covariance spectrum, while for small batch size smaller eigenmodes behave like SGD, slowing down convergence. We contrast with SignSGD which for generic covariance performs no preconditioning and has no transition, leading to different optimal learning rates and convergence characteristics. The two methods match up to a constant factor with isotropic data, but behave differently with anisotropic data. An analysis of a power law covariance model with data exponent

2026-05-09

arXiv (preprint)

doi.org

arxiv.org

Bootstrap Sampling Improves Model Soup Performance via Increased Model Diversity for Pneumonia Classification

Sara Early

Omata I. Ehizokhale

Nils D. Forkert

Samira Ebrahimi Kahou

Model soups combine multiple trained neural network checkpoints through weight averaging, often outperforming individual models and achieving performance comparable to deep ensembles without increasing inference cost. However, their effectiveness depends critically on checkpoint diversity, and when models are trained on the same dataset, optimization trajectories may converge toward similar regions of parameter space, limiting this diversity. In this work, we investigate bootstrap resampling as a simple data-level mechanism for increasing checkpoint diversity. Using a binary pneumonia classification task and 644 radiographs from the National Institutes of Health (NIH) ChestXray14 dataset, we train pools of convolutional neural networks under varying bootstrap ratios and construct greedy model soups. While checkpoint models trained on the full dataset achieve the highest mean individual accuracy, they are highly similar and offer little complementary signal, limiting the effectiveness of greedy selection. Bootstrap sampling introduces variability in the training data, producing more diverse checkpoints that, although individually weaker, enable greedy soup construction to combine complementary representations and achieve superior overall performance. The strongest model soup, obtained with 70% bootstrap sampling, achieves a test accuracy of 0.650, representing a 9.8 percentage point improvement over the mean individual checkpoint accuracy (0.551) under the same condition. While absolute performance is limited by the small cohort size and training-from-scratch setting, this result highlights the substantial gains achievable through diversity-driven weight averaging.
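The combination of bootstrap resampling and greedy soup construction described above can be sketched in a few lines. This is an illustrative toy, not the study's pipeline: checkpoints are flat lists of floats, `val_score` is a caller-supplied scoring function, the bootstrap ratio is read as "draw ratio × n examples with replacement" (an assumption), and all function names are hypothetical. The greedy rule shown is the usual one: rank checkpoints by validation score and keep each candidate only if adding it to the running average does not lower that score.

```python
import random


def bootstrap_sample(dataset, ratio, rng):
    """Draw ratio * len(dataset) examples with replacement."""
    k = max(1, int(ratio * len(dataset)))
    return [rng.choice(dataset) for _ in range(k)]


def average(weight_sets):
    """Element-wise mean of several flat weight vectors."""
    n = len(weight_sets)
    return [sum(ws) / n for ws in zip(*weight_sets)]


def greedy_soup(checkpoints, val_score):
    """Greedily average checkpoints while the validation score holds up."""
    ranked = sorted(checkpoints, key=val_score, reverse=True)
    soup = [ranked[0]]
    best = val_score(ranked[0])
    for ckpt in ranked[1:]:
        candidate = val_score(average(soup + [ckpt]))
        if candidate >= best:   # keep only ingredients that do not hurt
            soup.append(ckpt)
            best = candidate
    return average(soup), best


# Toy validation score: negative squared distance to an optimum at (1, 1).
def toy_score(w):
    return -((w[0] - 1.0) ** 2 + (w[1] - 1.0) ** 2)


checkpoints = [[0.0, 1.0], [2.0, 1.0], [5.0, 5.0]]
soup_weights, soup_score = greedy_soup(checkpoints, toy_score)
subset = bootstrap_sample(list(range(10)), ratio=0.7, rng=random.Random(0))
```

In the toy run the two mediocre but complementary checkpoints average to the optimum while the outlier is rejected, mirroring the abstract's point that individually weaker but diverse checkpoints can yield a better soup.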

2026-05-08

Medical Imaging with Deep Learning (poster)

openreview.net
