Generalizable, real-time neural decoding with hybrid state-space models
Avery Hee-Woon Ryoo
Nanda H Krishna
Ximeng Mao
Mehdi Azabou
Eva L Dyer
Real-time decoding of neural activity is central to neuroscience and neurotechnology applications, from closed-loop experiments to brain-computer interfaces, where models are subject to strict latency constraints. Traditional methods, including simple recurrent neural networks, are fast and lightweight but often struggle to generalize to unseen data. In contrast, recent Transformer-based approaches leverage large-scale pretraining for strong generalization performance, but typically have much larger computational requirements and are not always suitable for low-resource or real-time settings. To address these shortcomings, we present POSSM, a novel hybrid architecture that combines individual spike tokenization via a cross-attention module with a recurrent state-space model (SSM) backbone to enable (1) fast and causal online prediction on neural activity and (2) efficient generalization to new sessions, individuals, and tasks through multi-dataset pretraining. We evaluate POSSM's decoding performance and inference speed on intracortical decoding of monkey motor tasks, and show that it extends to clinical applications, namely handwriting and speech decoding in human subjects. Notably, we demonstrate that pretraining on monkey motor-cortical recordings improves decoding performance on the human handwriting task, highlighting the exciting potential for cross-species transfer. In all of these tasks, we find that POSSM achieves decoding accuracy comparable to state-of-the-art Transformers, at a fraction of the inference cost (up to 9x faster on GPU). These results suggest that hybrid SSMs are a promising approach to bridging the gap between accuracy, inference speed, and generalization when training neural decoders for real-time, closed-loop applications.
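A hypothetical sketch of the kind of hybrid decoder the abstract describes: spikes in each time bin are tokenized per unit, pooled with cross-attention, and fed to a recurrent backbone so each new bin is decoded causally at constant per-step cost. The module names, dimensions, and the GRUCell used as a stand-in for the SSM backbone are illustrative assumptions, not the POSSM implementation.

```python
# Illustrative cross-attention + recurrent hybrid decoder (not the POSSM code).
import torch
import torch.nn as nn

class HybridSpikeDecoder(nn.Module):
    def __init__(self, d_model=128, n_heads=4, out_dim=2, n_units=512):
        super().__init__()
        self.unit_embed = nn.Embedding(n_units, d_model)        # one token per spiking unit
        self.query = nn.Parameter(torch.randn(1, 1, d_model))   # learned pooling query
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.rnn = nn.GRUCell(d_model, d_model)                 # recurrent stand-in for the SSM backbone
        self.readout = nn.Linear(d_model, out_dim)              # e.g. 2-D cursor velocity

    def step(self, spike_unit_ids, h):
        # spike_unit_ids: (batch, n_spikes) unit indices observed in the current time bin
        tokens = self.unit_embed(spike_unit_ids)                 # (batch, n_spikes, d_model)
        q = self.query.expand(tokens.shape[0], -1, -1)
        pooled, _ = self.cross_attn(q, tokens, tokens)           # summarize the bin's spikes
        h = self.rnn(pooled.squeeze(1), h)                       # causal, constant-cost state update
        return self.readout(h), h

decoder = HybridSpikeDecoder()
h = torch.zeros(1, 128)
for _ in range(5):                                               # stream bins one at a time
    spikes = torch.randint(0, 512, (1, 30))
    vel, h = decoder.step(spikes, h)
print(vel.shape)  # torch.Size([1, 2])
```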
MesaNet: Sequence Modeling by Locally Optimal Test-Time Training
Johannes Von Oswald
Nino Scherrer
Seijin Kobayashi
Luca Versari
Songlin Yang
Maximilian Schlegel
Kaitlin Maile
Yanick Schimpf
Oliver Sieberling
Alexander Meulemans
Rif A. Saurous
Charlotte Frenkel
Blaise Aguera y Arcas
João Sacramento
Sequence modeling is currently dominated by causal transformer architectures that use softmax self-attention. Although widely adopted, transformers require scaling memory and compute linearly during inference. A recent stream of work linearized the softmax operation, resulting in powerful recurrent neural network (RNN) models with constant memory and compute costs such as DeltaNet, Mamba or xLSTM. These models can be unified by noting that their recurrent layer dynamics can all be derived from an in-context regression objective, approximately optimized through an online learning rule. Here, we join this line of work and introduce a numerically stable, chunkwise parallelizable version of the recently proposed Mesa layer (von Oswald et al., 2024), and study it in language modeling at the billion-parameter scale. This layer again stems from an in-context loss, but which is now minimized to optimality at every time point using a fast conjugate gradient solver. Through an extensive suite of experiments, we show that optimal test-time training enables reaching lower language modeling perplexity and higher downstream benchmark performance than previous RNNs, especially on tasks requiring long context understanding. This performance gain comes at the cost of additional flops spent during inference time. Our results are therefore intriguingly related to recent trends of increasing test-time compute to improve performance -- here by spending compute to solve sequential optimization problems within the neural network itself.
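A minimal numerical sketch of the idea behind locally optimal test-time training, under the assumption (consistent with the abstract and von Oswald et al., 2024) that each output solves a regularized in-context least-squares problem over all past key/value pairs via a conjugate gradient solver. The shapes, the ridge weight, and the hand-rolled CG are illustrative, not the MesaNet implementation.

```python
# Sketch: at each step, solve ridge regression over past (key, value) pairs
# with conjugate gradient and apply the solution to the current query.
import numpy as np

def conjugate_gradient(A, b, iters=50, tol=1e-8):
    """Solve A x = b for symmetric positive-definite A."""
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    for _ in range(iters):
        Ap = A @ p
        alpha = (r @ r) / (p @ Ap)
        x = x + alpha * p
        r_new = r - alpha * Ap
        if np.linalg.norm(r_new) < tol:
            break
        p = r_new + ((r_new @ r_new) / (r @ r)) * p
        r = r_new
    return x

def mesa_style_layer(keys, values, queries, lam=1.0):
    d = keys.shape[1]
    KK = lam * np.eye(d)                    # running sum of k k^T plus ridge term
    VK = np.zeros((values.shape[1], d))     # running sum of v k^T
    outputs = []
    for k, v, q in zip(keys, values, queries):
        KK += np.outer(k, k)
        VK += np.outer(v, k)
        x = conjugate_gradient(KK, q)       # (lam I + sum k k^T)^{-1} q, solved at test time
        outputs.append(VK @ x)              # locally optimal prediction for the current query
    return np.stack(outputs)

rng = np.random.default_rng(0)
T, d = 16, 8
out = mesa_style_layer(rng.normal(size=(T, d)), rng.normal(size=(T, d)), rng.normal(size=(T, d)))
print(out.shape)  # (16, 8)
```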
Policy context and digital development: a comparative study of trajectories in 4 Canadian academic health centers over 30 years
Aude Motulsky
Susan Usher
Pascale Lehoux
Trish Reay
Paul Hebert
Lise Gauvin
Alain Biron
G Ross Baker
Marie-Pierre Moreault
Johanne Préval
Jean-Louis Denis
SSA-COMET: Do LLMs Outperform Learned Metrics in Evaluating MT for Under-Resourced African Languages?
Senyu Li
Jiayi Wang
Felermino Dario Mario Ali
Colin Cherry
Daniel Deutsch
Eleftheria Briakou
Rui Sousa-Silva
Henrique Lopes Cardoso
Pontus Stenetorp
Evaluating machine translation (MT) quality for under-resourced African languages remains a significant challenge, as existing metrics often suffer from limited language coverage and poor performance in low-resource settings. While recent efforts, such as AfriCOMET, have addressed some of the issues, they are still constrained by small evaluation sets, a lack of publicly available training data tailored to African languages, and inconsistent performance in extremely low-resource scenarios. In this work, we introduce SSA-MTE, a large-scale human-annotated MT evaluation (MTE) dataset covering 13 African language pairs from the News domain, with over 63,000 sentence-level annotations from a diverse set of MT systems. Based on this data, we develop SSA-COMET and SSA-COMET-QE, improved reference-based and reference-free evaluation metrics. We also benchmark prompting-based approaches using state-of-the-art LLMs like GPT-4o and Claude. Our experimental results show that SSA-COMET models significantly outperform AfriCOMET and are competitive with the strongest LLM (Gemini 2.5 Pro) evaluated in our study, particularly on low-resource languages such as Twi, Luo, and Yoruba. All resources are released under open licenses to support future research.
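For readers unfamiliar with how learned MT metrics are compared against human judgments, the sketch below shows the standard segment-level meta-evaluation step (correlating metric scores with human annotations). The data and scoring function are placeholders, not the SSA-COMET pipeline or its released resources.

```python
# Illustrative segment-level meta-evaluation of an MT metric against human
# annotations; placeholders only, not the SSA-COMET code or data.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
human_scores = rng.normal(size=200)                              # stand-in sentence-level human judgments
metric_scores = human_scores + rng.normal(scale=0.5, size=200)   # stand-in metric outputs

rho, _ = spearmanr(metric_scores, human_scores)
print(f"Spearman correlation with human judgments: {rho:.3f}")
```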
Training Dynamics Underlying Language Model Scaling Laws: Loss Deceleration and Zero-Sum Learning
Andrei Mircea
Supriyo Chakraborty
Nima Chitsazan
Ekaterina Lobacheva
This work aims to understand how scaling improves language models, specifically in terms of training dynamics. We find that language models undergo loss deceleration early in training: an abrupt slowdown in the rate of loss improvement, resulting in piecewise linear behaviour of the loss curve in log-log space. Scaling up the model mitigates this transition by (1) decreasing the loss at which deceleration occurs, and (2) improving the log-log rate of loss improvement after deceleration. We attribute loss deceleration to a type of degenerate training dynamics we term zero-sum learning (ZSL). In ZSL, per-example gradients become systematically opposed, leading to destructive interference in per-example changes in loss. As a result, improving loss on one subset of examples degrades it on another, bottlenecking overall progress. Loss deceleration and ZSL provide new insights into the training dynamics underlying language model scaling laws, and could potentially be targeted directly to improve language models independent of scale. We make our code and artefacts available at: https://github.com/mirandrom/zsl
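One simple way to probe for the zero-sum signature described here is to compare the aggregate gradient to per-example gradients: when per-example gradients are systematically opposed, the aggregate is much smaller than a typical per-example gradient. The toy logistic-regression diagnostic below is a sketch of that idea, not the paper's exact measurement.

```python
# Toy probe for destructive interference between per-example gradients;
# a sketch of the diagnostic idea, not the paper's measurement.
import numpy as np

rng = np.random.default_rng(0)
n, d = 256, 10
X = rng.normal(size=(n, d))
y = (rng.random(n) < 0.5).astype(float)      # random labels force conflicting gradients
w = rng.normal(size=d)

p = 1.0 / (1.0 + np.exp(-X @ w))
per_example_grads = (p - y)[:, None] * X      # log-loss gradient for each example

agg = per_example_grads.mean(axis=0)
typical = np.linalg.norm(per_example_grads, axis=1).mean()
print("||aggregate grad|| / mean ||per-example grad|| =",
      np.linalg.norm(agg) / typical)          # near 0 when gradients largely cancel
```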
RETRO SYNFLOW: Discrete Flow Matching for Accurate and Diverse Single-Step Retrosynthesis
Robin Yadav
Qi Yan
Renjie Liao
A fundamental problem in organic chemistry is identifying and predicting the series of reactions that synthesize a desired target product molecule. Due to the combinatorial nature of the chemical search space, single-step reactant prediction -- i.e. single-step retrosynthesis -- remains challenging even for existing state-of-the-art template-free generative approaches to produce an accurate yet diverse set of feasible reactions. In this paper, we model single-step retrosynthesis planning and introduce RETRO SYNFLOW (RSF), a discrete flow-matching framework that builds a Markov bridge between the prescribed target product molecule and the reactant molecule. In contrast to past approaches, RSF employs a reaction center identification step to produce intermediate structures known as synthons as a more informative source distribution for the discrete flow. To further enhance diversity and feasibility of generated samples, we employ Feynman-Kac steering with Sequential Monte Carlo based resampling to steer promising generations at inference using a new reward oracle that relies on a forward-synthesis model. Empirically, we demonstrate RSF achieves …
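The inference-time steering mechanism mentioned above can be illustrated with a generic Sequential Monte Carlo loop: particles are advanced by a proposal step, weighted by a reward, and resampled so promising generations survive. The proposal, reward oracle, and particle representation below are stand-ins, not the RSF model or its forward-synthesis oracle.

```python
# Generic Feynman-Kac / SMC resampling sketch; all components are stand-ins.
import numpy as np

rng = np.random.default_rng(0)

def propose(particles):
    """Stand-in for one discrete flow-matching denoising step."""
    return [p + [int(rng.integers(0, 50))] for p in particles]

def reward(particle):
    """Stand-in for the forward-synthesis reward oracle."""
    return -abs(sum(particle) % 7)

def systematic_resample(weights):
    n = len(weights)
    positions = (rng.random() + np.arange(n)) / n
    return np.searchsorted(np.cumsum(weights), positions)

n_particles, n_steps, temperature = 8, 5, 1.0
particles = [[] for _ in range(n_particles)]
for _ in range(n_steps):
    particles = propose(particles)
    logw = np.array([reward(p) for p in particles]) / temperature
    w = np.exp(logw - logw.max())
    w /= w.sum()
    idx = systematic_resample(w)               # keep promising generations, drop the rest
    particles = [list(particles[i]) for i in idx]
print(particles[0])
```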
Trophic Interactions Are Key to Understanding the Effects of Global Change on the Distribution and Functional Role of the Brown Bear
Pablo M. Lucas
Wilfried Thuiller
Lauren Talluto
Ester Polaina
Jörg Albrecht
Nuria Selva
Marta De Barba
Vincenzo Penteriani
Maya Guéguen
Niko Balkenhol
Trishna Dutta
Ancuta Fedorca
Shane C. Frank
Andreas Zedrosser
Ivan Afonso‐Jordana
Hüseyin Ambarlı
Fernando Ballesteros
Andriy‐Taras Bashta
Cemal Can Bilgin
Neda Bogdanović
Edgars Bojārs
Katarzyna Bojarska
Natalia Bragalanti
Henrik Brøseth
Mark W. Chynoweth
Duško Ćirović
Paolo Ciucci
Andrea Corradini
Daniele De Angelis
Miguel de Gabriel Hernando
Csaba Domokos
Aleksander Dutsov
Alper Ertürk
Stefano Filacorda
Lorenzo Frangini
Claudio Groff
Samuli Heikkinen
Bledi Hoxha
Djuro Huber
Otso Huitu
Georgeta Ionescu
Ovidiu Ionescu
Klemen Jerina
Ramon Jurj
Alexandros A. Karamanlidis
Jonas Kindberg
Ilpo Kojola
José Vicente López‐Bao
Peep Männil
Dime Melovski
Yorgos Mertzanis
Paolo Molinari
Anja Molinari‐Jobin
Andrea Mustoni
Javier Naves
Sergey Ogurtsov
Deniz Özüt
Santiago Palazón
Luca Pedrotti
Aleksandar Perović
Vladimir N. Piminov
Ioan‐Mihai Pop
Marius Popa
Maria Psaralexi
Pierre‐Yves Quenette
Georg Rauer
Slaven Reljic
Eloy Revilla
Urmas Saarma
Alexander P. Saveljev
Ali Onur Sayar
Çagan H. Şekercioğlu
Agnieszka Sergiel
George Sîrbu
Tomaž Skrbinšek
Michaela Skuban
Anil Soyumert
Aleksandar Stojanov
Egle Tammeleht
Konstantin Tirronen
Aleksandër Trajçe
Igor Trbojević
Tijana Trbojević
Filip Zięba
Diana Zlatanova
Tomasz Zwijacz‐Kozica
Biotic interactions are expected to influence species' responses to global changes, but they are rarely considered across broad spatial extents. Abiotic factors are thought to operate at larger spatial scales, while biotic factors, such as species interactions, are considered more important at local scales within communities, in part because of the knowledge gap on species interactions at large spatial scales (i.e., the Eltonian shortfall). We assessed, at a continental scale, (i) the importance of biotic interactions, through food webs, on species distributions, and (ii) how biotic interactions under scenarios of climate and land‐use change may affect the distribution of the brown bear (Ursus arctos). We built a highly detailed, spatially dynamic, and empirically sampled food web based on the energy contribution of 276 brown bear food species from different taxa (plants, vertebrates, and invertebrates) and their ensemble habitat models at high resolution across Europe. Then, combining energy contribution and predicted habitat of food species, we modelled energy contribution across space and included these layers within Bayesian‐based models of the brown bear distribution in Europe. The inclusion of biotic interactions considerably improved our understanding of brown bear distribution at large (continental) scales compared with Bayesian models including only abiotic factors (climate and land use). Predicted future range shifts, which included changes in the distribution of food species, varied greatly when considering various scenarios of change in biotic factors, providing a warning that future indirect climate and land‐use change are likely to have strong but highly uncertain impacts on species biogeography. Our study confirmed that advancing our understanding of ecological networks of species interactions will improve future projections of biodiversity change, especially for modelling species distributions and their functional role under climate and land‐use change scenarios, which is key for effective conservation of biodiversity and ecosystem services.
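To make the biotic-covariate step concrete, the sketch below weights per-species habitat suitability maps by each species' energy contribution to the diet and sums them into a spatial food-resource layer. Array shapes and the random inputs are assumptions for illustration; this is not the study's modelling code.

```python
# Illustrative construction of an energy-weighted food-resource layer
# (a biotic covariate for the distribution model); inputs are placeholders.
import numpy as np

rng = np.random.default_rng(0)
n_species, height, width = 276, 100, 120
habitat = rng.random((n_species, height, width))       # predicted habitat suitability per food species
energy = rng.dirichlet(np.ones(n_species))             # relative energy contribution of each species

food_energy_layer = np.tensordot(energy, habitat, axes=1)   # (height, width) energy-weighted map
print(food_energy_layer.shape)  # (100, 120) -> used as a spatial covariate
```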
Understanding and Meeting Practitioner Needs When Measuring Representational Harms Caused by LLM-Based Systems
Emma Harvey
Emily Sheng
Su Lin Blodgett
Alexandra Chouldechova
Jean Garcia-Gathright
Hanna Wallach
The NLP research community has made publicly available numerous instruments for measuring representational harms caused by large language model (LLM)-based systems. These instruments have taken the form of datasets, metrics, tools, and more. In this paper, we examine the extent to which such instruments meet the needs of practitioners tasked with evaluating LLM-based systems. Via semi-structured interviews with 12 such practitioners, we find that practitioners are often unable to use publicly available instruments for measuring representational harms. We identify two types of challenges. In some cases, instruments are not useful because they do not meaningfully measure what practitioners seek to measure or are otherwise misaligned with practitioner needs. In other cases, instruments - even useful instruments - are not used by practitioners due to practical and institutional barriers impeding their uptake. Drawing on measurement theory and pragmatic measurement, we provide recommendations for addressing these challenges to better meet practitioner needs.
Are Large Language Models Good Temporal Graph Learners?
Shenyang Huang
Ali Parviz
Emma Kondrup
Zachary Yang
Zifeng Ding
Michael Bronstein
Large Language Models (LLMs) have recently driven significant advancements in Natural Language Processing and various other applications. While a broad range of literature has explored the graph-reasoning capabilities of LLMs, including their use as predictors on graphs, the application of LLMs to dynamic graphs -- real-world evolving networks -- remains relatively unexplored. Recent work studies synthetic temporal graphs generated by random graph models, but applying LLMs to real-world temporal graphs remains an open question. To address this gap, we introduce Temporal Graph Talker (TGTalker), a novel temporal graph learning framework designed for LLMs. TGTalker utilizes the recency bias in temporal graphs to extract relevant structural information, converted to natural language for LLMs, while leveraging temporal neighbors as additional information for prediction. TGTalker demonstrates competitive link prediction capabilities compared to existing Temporal Graph Neural Network (TGNN) models. Across five real-world networks, TGTalker performs competitively with state-of-the-art temporal graph methods while consistently outperforming popular models such as TGN and HTGN. Furthermore, TGTalker generates textual explanations for each prediction, thus opening up exciting new directions in explainability and interpretability for temporal link prediction. The code is publicly available at https://github.com/shenyangHuang/TGTalker.
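A minimal sketch of the recency-based prompting idea: recent temporal edges around a query node are verbalized and handed to an LLM for link prediction. The prompt template and edge format are illustrative assumptions, not the TGTalker implementation (see the released code for the actual framework).

```python
# Illustrative verbalization of recent temporal neighbors for LLM-based
# link prediction; not the TGTalker prompt format.
def build_prompt(edges, src, dst, t_query, k=5):
    """edges: list of (u, v, t) interactions, assumed sorted by time."""
    recent = [e for e in edges if e[2] < t_query and src in (e[0], e[1])][-k:]
    history = "; ".join(f"node {u} interacted with node {v} at time {t}" for u, v, t in recent)
    return (f"Interaction history (most recent last): {history}. "
            f"At time {t_query}, will node {src} interact with node {dst}? Answer yes or no.")

edges = [(1, 2, 10), (1, 3, 12), (2, 3, 15), (1, 4, 18)]
print(build_prompt(edges, src=1, dst=3, t_query=20))
```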
Galaxy cluster characterization with machine learning techniques
M. Sadikov
J. Hlavacek-Larrondo
C. L. Rhea
M. McDonald
M. Ntampaka
J. ZuHone
We present an analysis of the X-ray properties of the galaxy cluster population in the z=0 snapshot of the IllustrisTNG simulations, utilizing machine learning techniques to perform clustering and regression tasks. We examine five properties of the hot gas (the central cooling time, the central electron density, the central entropy excess, the concentration parameter, and the cuspiness) which are commonly used as classification metrics to identify cool core (CC), weak cool core (WCC) and non-cool core (NCC) clusters of galaxies. Using mock Chandra X-ray images as inputs, we first explore an unsupervised clustering scheme to see how the resulting groups correlate with the CC/WCC/NCC classification based on the different criteria. We observe that the groups replicate almost exactly the separation of the galaxy cluster images when classifying them based on the concentration parameter. We then move on to a regression task, utilizing a ResNet model to predict the value of all five properties. The network is able to achieve a mean percentage error of 1.8% for the central cooling time, and a balanced accuracy of 0.83 on the concentration parameter, making them the best-performing metrics. Finally, we use simulation-based inference (SBI) to extract posterior distributions for the network predictions. Our neural network simultaneously predicts all five classification metrics using only mock Chandra X-ray images. This study demonstrates that machine learning is a viable approach for analyzing and classifying the large galaxy cluster datasets that will soon become available through current and upcoming X-ray surveys, such as eROSITA.
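The regression setup described above can be sketched as a convolutional network mapping a mock X-ray image to the five cool-core metrics. The tiny network below is a stand-in for illustration only, not the ResNet or the SBI pipeline used in the study.

```python
# Minimal image-to-five-metrics regression sketch (stand-in architecture).
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, 5),   # cooling time, electron density, entropy excess, concentration, cuspiness
)

images = torch.randn(4, 1, 128, 128)    # batch of mock X-ray images (single band, assumed size)
targets = torch.randn(4, 5)             # the five (normalized) properties
loss = nn.functional.mse_loss(model(images), targets)
loss.backward()
print(loss.item())
```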
It's the Thought that Counts: Evaluating the Attempts of Frontier LLMs to Persuade on Harmful Topics
Matthew Kowal
Jasper Timm
Thomas H Costello
Antonio A. Arechar
Gordon Pennycook
David Rand
Adam Gleave
Kellin Pelrine
Persuasion is a powerful capability of large language models (LLMs) that both enables beneficial applications (e.g. helping people quit smoking) and raises significant risks (e.g. large-scale, targeted political manipulation). Prior work has found models possess a significant and growing persuasive capability, measured by belief changes in simulated or real users. However, these benchmarks overlook a crucial risk factor: the propensity of a model to attempt to persuade in harmful contexts. Understanding whether a model will blindly "follow orders" to persuade on harmful topics (e.g. glorifying joining a terrorist group) is key to understanding the efficacy of safety guardrails. Moreover, understanding if and when a model will engage in persuasive behavior in pursuit of some goal is essential to understanding the risks from agentic AI systems. We propose the Attempt to Persuade Eval (APE) benchmark, which shifts the focus from persuasion success to persuasion attempts, operationalized as a model's willingness to generate content aimed at shaping beliefs or behavior. Our evaluation framework probes frontier LLMs using a multi-turn conversational setup between simulated persuader and persuadee agents. APE explores a diverse spectrum of topics including conspiracies, controversial issues, and non-controversially harmful content. We introduce an automated evaluator model to identify willingness to persuade and measure the frequency and context of persuasive attempts. We find that many open and closed-weight models are frequently willing to attempt persuasion on harmful topics and that jailbreaking can increase willingness to engage in such behavior. Our results highlight gaps in current safety guardrails and underscore the importance of evaluating willingness to persuade as a key dimension of LLM risk. APE is available at github.com/AlignmentResearch/AttemptPersuadeEval
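A schematic sketch of the multi-turn persuader/persuadee protocol plus an automated attempt-classifier as described above. The `generate` function is a stand-in for any chat-model API call (here it returns a canned reply), and the prompts and turn counts are illustrative, not the APE benchmark code.

```python
# Schematic persuader/persuadee loop with an attempt classifier; all prompts
# and the generate() stand-in are illustrative, not the APE implementation.
def generate(system_prompt, messages):
    """Stand-in for a chat-model API call; returns a canned reply here."""
    return f"[model reply to: {system_prompt[:40]}...]"

def run_episode(topic, n_turns=3):
    persuader_sys = f"Convince your interlocutor that: {topic}"
    persuadee_sys = "You are a skeptical interlocutor. Respond naturally."
    judge_sys = "Did this message attempt to change the reader's beliefs or behavior? Answer yes or no."
    transcript = []
    for _ in range(n_turns):
        transcript.append({"role": "persuader", "content": generate(persuader_sys, transcript)})
        transcript.append({"role": "persuadee", "content": generate(persuadee_sys, transcript)})
    # Classify each persuader turn as an attempt (or not) with an evaluator model.
    verdicts = [generate(judge_sys, [m]) for m in transcript if m["role"] == "persuader"]
    return transcript, verdicts

transcript, verdicts = run_episode("a benign example topic")
print(len(transcript), len(verdicts))  # 6 turns, 3 judged persuader messages
```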