Publications

Prognostic data extraction harnessing a privacy-preserving large language model: a clinician-AI collaborative retrospective evaluation in head and neck oncology
George Shenouda
Marie Duclos
Tomás Yokoo Teodoro de Souza
Khalil Sultanem
Farhad Maleki
Privacy regulations and limited expert validation constrain the deployment of large language models (LLMs) for electronic health record structuring. We evaluated locally deployed LLMs to extract 30 prognostic variables from 1,360 head and neck cancer reports (882 patients) using zero-shot prompting. A stratified 50-case subset was reviewed by three radiation oncologists (50 cases, 30 fields, 3 reviewers; 4,500 decisions) to form a majority-vote reference for Llama3.3-70B, which achieved 98.6% F1 with high clinician agreement and processed reports in 53 s/report. Among seven additional models (2.6B-70B) benchmarked against this reference, GPT-OSS-20.9B (F1 89.4%) and MedGemma-27B (F1 88.5%) performed best. Integrating LLM-extracted HPV status, smoking history, and Charlson Comorbidity Score into a multivariate Cox Proportional Hazards model (age, sex, T/N stage) improved disease-free survival (likelihood ratio test p = 0.014; ΔC-index +0.071) and locoregional failure-free survival (p = 0.026; ΔC-index +0.108) with 1,000-bootstrap internal validation. This clinician-AI collaborative evaluation shows that on-premises LLMs enable privacy-preserving and efficient tumour board support, longitudinal data curation, and outcome prediction.
Rules of the game: Legislative exits in four Westminster systems
Alex B. Rivard
Marc André Bodet
By leveraging over 150 years of electoral and biographical data in the Canadian provinces of Ontario, Quebec, New Brunswick, and Nova Scotia, we argue that voluntary exit is best understood as a cost-benefit calculation shaped by positional and institutional incentives in the legislative arena. We show that institutional changes that make seeking re-election costlier are associated with an increased likelihood of a legislator voluntarily exiting the legislative arena. We also find that the determinants of exit vary across age cohorts: younger legislators are more sensitive to institutional and positional cost-benefit incentives, reflecting greater professional mobility and outside career opportunities. Overall, our results indicate that positional and institutional incentives in part explain a legislator’s decision to not seek re-election, but that the impact of these incentives is mediated by life-cycle and retirement-horizon considerations.
Shared Latent Structures Enable Unified Backdoor Detection and Mitigation in LLMs
Omar Mahmoud
Aly M. Kassem
Thommen George Karimpanal
Buddhika Laknath Semage
Santu Rana
Backdoor attacks in large language models (LLMs) are often treated as isolated trigger-response failures, motivating defenses tailored to specific triggers or behaviors. We show this view is incomplete. Across diverse backdoor behaviors, we identify a shared latent mechanism that can be detected, causally controlled, and suppressed. Using sparse autoencoders (SAEs) on residual-stream activations, we find a small set of latent features consistently activated across jailbreaking, refusal manipulation, password-locking, bias induction, sentiment misclassification, and country-conditioned harmful advice. These features generalize across Qwen3, Gemma 3, and Llama 3.1 models from 4B to 32B parameters, and across both fine-tuning and weight-editing attacks. Through bidirectional activation steering, we show these features are causal: suppressing them reduces attack success, while amplifying them induces target behaviors on clean prompts. We further train lightweight SAE-feature classifiers that generalize zero-shot to unseen backdoors and outperform residual-stream and weight-diffing baselines. Finally, we introduce Concept Ablation Fine-Tuning (CAFT), which suppresses backdoor formation by ablating the shared latent subspace during training. Together, our results suggest that many backdoors rely on a transferable latent mechanism, enabling unified detection and mitigation.
SKILL.nb: Selective Formalization and Gated Execution for Durable Agent Workflows
Amine El hattami
Christopher Pal
AI agents increasingly turn past experience into reusable artifacts such as code, workflows, and procedural memories. Reuse can improve efficiency, but it also creates a lifecycle reliability problem: artifacts that succeed once may fail under environment drift, underspecified tasks, or changing task distributions, especially in web automation. We introduce SKILL.nb, a framework for governing reusable agent workflows with evidence-calibrated lifecycle policies. SKILL.nb uses selective formalization: execution evidence decides which workflow steps should become executable code, which should remain natural-language guided, and when those choices should be revised. Workflows are stored as auditable, versioned notebooks that interleave natural-language guidance, multi-language executable cells, validation gates, fallback paths, and multimodal evidence such as outputs, screenshots, and error traces. At runtime, gate-conditioned execution lets each step run code when its gates validate, or fall back locally when drift invalidates the executable realization. On WebArena-Verified, SKILL.nb achieves 53.7% single-round success, improving over the strongest baseline by 3.9 percentage points. Across three re-executions, it retains 91.7% of initially successful tasks, 15.5 points above the next best method. Under bounded repair, it recovers 72.9% of subsequent failures while limiting post-repair regressions to 4.2%, compared with 15.0% to 17.0% for persistent baselines. It also leads on Mind2Web cross-website and cross-domain splits. In a GitLab migration test, SKILL.nb preserves performance when reusing frozen state learned on GitLab 15.7, with frozen-versus-fresh target-version gaps of -1.7 points on GitLab 16.11 and +0.6 points on GitLab 18.9. These results identify lifecycle governance and gate-conditioned execution as reliability axes beyond one-shot task success.
Unsupervised Continual Clustering via Forward-Backward Knowledge Distillation
Unsupervised Continual Learning (UCL) aims to enable neural networks to learn sequential tasks without labels or access to past data. A major challenge in this setting is Catastrophic Forgetting, where models forget previously learned tasks upon learning new ones. This challenge is amplified in UCL due to the absence of labels to guide learning and memory retention. Existing mitigation strategies, such as knowledge distillation and replay buffers, often raise memory and privacy concerns. Moreover, current UCL methods largely overlook clustering-specific objectives. To fill this gap, we introduce Unsupervised Continual Clustering (UCC) and propose Forward-Backward Knowledge Distillation for Continual Clustering (FBCC). FBCC employs a continual teacher network with a clustering projector and lightweight task-specific students. Through a dual-phase forward-backward distillation process, the teacher learns new clusters while preserving previously discovered cluster structure without storing past data. FBCC represents a pioneering approach to UCC, demonstrating improved clustering performance across sequential tasks. Experiments on four benchmark datasets demonstrate that FBCC consistently outperforms existing continual learning baselines in clustering accuracy while significantly reducing catastrophic forgetting.
Video-Based Prediction of In-Flight Particle Characteristics in Atmospheric Plasma Spraying
Sareh Soleimani
Kintak Raymond Yu
Cristian Cojocaru
Atmospheric plasma spraying (APS) is a widely used coating process in which in-flight particle temperature and velocity strongly influence coating quality. However, these particle characteristics are difficult to monitor continuously during operation, motivating the development of non-invasive data-driven diagnostic methods. In this work, we investigate the predictive potential of high-speed video observations of the plasma plume for estimating in-flight particle characteristics in APS. We introduce three different video-derived feature representations and evaluate them using Tabular Prior-Data Fitted Networks (TabPFN), convolutional neural networks (CNN), and classical regression baselines including Random Forest, Gradient Boosting, Support Vector Regression, and XGBoost. Experiments are conducted using grouped leave-one-out cross-validation on 126 labeled pre- and post-spray video recordings from 63 APS spray runs. Across the engineered feature experiments, TabPFN achieves the most consistent performance for temperature prediction, reaching R² = 0.86 using the combined feature representation. CNN models perform more strongly for velocity prediction in particular, achieving an R² of 0.81. In addition, we evaluate models operating directly on raw video frames using pretrained CNNs and find that the highest performance is achieved by a pretrained CNN with a regression head, with R² of 0.90 and 0.82 for temperature and velocity, respectively. The results demonstrate that video-derived plume information provides a promising and scalable foundation for non-invasive APS diagnostics and real-time process monitoring.
Representation Learning Enables Scalable Multitask Deep Reinforcement Learning
Scaling reinforcement learning (RL) to diverse multitask settings remains a central challenge. While recent advances in model-based RL achieve strong performance, they rely on planning and complex training pipelines, making it unclear which components are essential for scalability. We revisit this question and argue that the primary driver of scalable multitask RL is not model-based control, but representation learning. In particular, we show that combining predictive, model-based representations with high-capacity value function approximation is sufficient to achieve strong performance, even without planning. We evaluate a simple model-free algorithm, MR.Q, coupled with auxiliary predictive objectives into a scalable actor-critic architecture. This approach outperforms a recent world-model-based method and a range of deep RL baselines across a diverse suite of multitask continuous control tasks, while significantly reducing computational overhead and improving wall-clock efficiency. We observe consistent improvements with increased model capacity and show through ablations that predictive representation learning is critical for performance.
SpeechJBB: Probing Safety Alignment and Comprehension in Large Audio Language Models under Code-Switched Speech
Large audio language models (LALMs) are increasingly deployed in real-world applications, yet their safety alignment is still primarily evaluated on monolingual, text-based harmful prompts. This leaves their generalizability under multilingual and spoken settings, particularly code-switched speech, largely underexplored. To address this gap, we introduce SpeechJBB, an audio jailbreak dataset for benchmarking across multiple state-of-the-art LALMs. The extent of safety weaknesses is further probed by introducing an augmented setting where phonologically plausible pseudo-words are inserted around safety-critical terms to simulate localized obfuscation. Across models, code-switched harmful audio yields substantially high jailbreak success rates (JSR), with non-English monolingual and non-English code-switched pairs exhibiting the highest attack success. Pseudo-word insertion further reduces refusal rates, which demonstrates that natural-sounding obfuscation can effectively bypass safety policies.
Brain states recur across diverse narrative contexts during longitudinal viewing
Yibei Chen
Matin Ghavami
Marie St‐Laurent
Satrajit S. Ghosh
What does the brain do during the continuous, varied experience of watching a story unfold? One account holds that the brain traverses a finite repertoire of recurring states, but whether that repertoire is a stable property of the individual or is reshaped by each new experience has not been tested across diverse naturalistic content within the same person. We characterized the dynamic brain-state repertoire in six individuals who watched the television series Friends across its six seasons during fMRI (up to ∼146 episodes, ∼54 hours per person). For each individual we fit a sticky hierarchical Dirichlet process hidden Markov model across all episodes, discovering brain states (recurring whole-brain activity patterns with characteristic coupling) without pre-specifying their number. Each individual’s brain visited roughly forty-five states arrayed along a continuous recurrence gradient, from states active in nearly every episode to episode-specific ones, with no sharp division between them. The repertoire was heterogeneous in why its states recurred: a minority locked to scan-run structure, the majority remaining eligible for content. Transitions were organized by the functional-connectivity similarity between states (per-individual Spearman ρ = 0.33–0.55) and, in most individuals, respected resting-state network boundaries. Episode content was associated with which states the brain occupied moment to moment. The recurrence ordering discovered in Friends transferred to state occupancy during other social-narrative films (five of six individuals) and attenuated as stimuli departed from that class, weakening for visual-only reading and audio-only listening. Across diverse narrative experience, the dynamic repertoire is a property of the individual: content varies which states are visited and when, not which states exist.
Failed Reasoning Traces Tell You What Is Fixable (But Not by Reading Them)
When post-trained language models fail on reasoning problems, the common test-time-scaling response is to spend more compute on additional attempts, and the failed traces play no further role. We argue this discards a crucial signal; some failures come from unlucky sampling, where more rollouts help, while others are structural and resist resampling regardless of budget. We propose that failed traces encode recoverability structure: the inference-time signature of which test-time interventions can rescue a given failure. Three problem-level trajectory features, derived from the structure of available interventions, recover this structure from the distributional signature of failed rollouts, not their text. They cluster failures into stable regimes, characterize the failure topography of different post-training methods (
Learned Subspace Compression for Communication-Efficient Pipeline Parallelism
Pipeline parallelism enables training of large language models that exceed single-device memory, yet inter-stage activation communication becomes the dominant bottleneck when training on low-bandwidth networks. Recent work in this area has proposed using fixed orthogonal projections to compress activations. However, this still results in significant performance degradation and requires a number of non-standard adaptations to constrain the optimization. A natural alternative is to learn a low-rank projection for each pipeline stage; however, maintaining the necessary orthogonality of these projectors during training remains a challenge. We present Manifold Aware Projection Learning (MAPL), a method that treats inter-stage compression as a learnable orthogonal projection under explicit Stiefel manifold (orthogonal matrices) constraints. Rather than prescribing a fixed global subspace, MAPL lets each pipeline stage discover and continuously adapt its own task-optimal compression subspace via manifold-constrained steepest descent. To recover token-specific signals at stage boundaries, we introduce per-stage factorized anchor embeddings that allow for full-rank activation reconstruction with negligible communication overhead. We further show that we can incorporate residual vector quantization after projection with a streaming codebook synchronization protocol that amortizes dictionary communication. Across LLaMA models from 150M to 1B parameters, we show that MAPL can be easily applied to the existing pipeline and can achieve high compression with negligible performance degradation, with drastically improved tradeoffs in performance vs. compression compared to Subspace Networks.
Learning Admissible Heuristics via Cost Partitioning
Hugo Barral
Marie-José Huguet
Sylvie Thiébaux
Admissible heuristics are essential for optimal planning, yet learning them remains challenging due to the risk of overestimation. Cost partitioning combines multiple abstraction heuristics while preserving admissibility, but computing optimal partitions online is expensive. We propose a framework that learns to infer admissible cost partitions by leveraging the Lagrangian dual equivalence between cost partitioning and multiplier prediction. Planning states and patterns are encoded as labelled graphs, and an action-centric variant of the Weisfeiler-Leman algorithm extracts structural feature vectors. A deep architecture with axial self-attention and a softmax output layer maps these features to cost weights that satisfy the partition constraints by construction, ensuring admissibility. Experiments demonstrate reduced node expansions compared to suboptimal partitioning baselines while maintaining strict admissibility. To our knowledge, this is the first machine-learned heuristic guaranteed to be admissible.