Publications

Overcoming the Modality Gap in Context-Aided Forecasting

Vincent Zhihao Zheng

Étienne Marcotte

Andrew Robert Williams

Lijun Sun

Valentina Zantedeschi

Context-aided forecasting (CAF) holds promise for integrating domain knowledge and forward-looking information, enabling AI systems to surpass traditional statistical methods. However, recent empirical studies reveal a puzzling gap: multimodal models often fail to outperform their unimodal counterparts. We hypothesize that this underperformance stems from poor context quality in existing datasets, as verification is challenging. To address these limitations, we introduce a semi-synthetic data augmentation method that generates contexts both descriptive of temporal dynamics and verifiably complementary to numerical histories. This approach enables massive-scale dataset creation, resulting in CAF-7M, a corpus of 7 million context-augmented time series windows, including a rigorously verified test set. We demonstrate that semi-synthetic pre-training transfers effectively to real-world evaluation, and show clear evidence of context utilization. Our results suggest that dataset quality, rather than architectural limitations, has been the primary bottleneck in context-aided forecasting.

2026-03-11

arXiv (preprint)

doi.org

openreview.net

Theta Dual-Brain Stimulation of rTPJ Shapes Joint Agency

Yuto Kurihara

Ayaka Tsuchiya

Guillaume Dumas

Rieko Osu

Summary Joint agency, the shared feeling of “we are doing this together”, has been linked to inter-brain synchrony, but its causal role in shaping this experience remains unclear. We applied dual transcranial alternating current stimulation (dual-tACS) over the right temporo-parietal junction (rTPJ) to 13 dyads performing an alternating tapping task (target ITI = 0.5 s; 180 deg. relative phase), manipulating in- and anti-phase coupling at theta (6 Hz), alpha (10 Hz), and beta (20 Hz). As a result, tapping in the theta anti-phase condition was significantly slower than the memorized reference tempo, whereas the other stimulation conditions did not influence the inter-tap interval. Meanwhile, the relative phase remained close to 180 deg. across all conditions. In the theta condition, anti-phase stimulation produced significantly lower joint agency than in-phase stimulation. Furthermore, mediation analysis suggested that the inter-tap interval may partially account for the effect of theta dual-brain stimulation on joint agency, although this indirect pathway did not reach statistical significance. These findings suggest that anti-phase theta stimulation over the rTPJ lowers joint agency, possibly by reducing coordination efficiency while preserving the overall 180 deg. alternation structure.

2026-03-11

bioRxiv (preprint)

doi.org

Tiny Aya: Bridging Scale and Multilingual Depth

Alejandro Salamanca

Diana Abagyan

Daniel D'souza

Ammar Khairi

David Mora

Saurabh Dash

Viraat Aryabumi

Sara Rajaee

Mehrnaz Mofakhami

Ananya Sahu

Thomas Euyang

Brittawnya Prince

Madeline Smith

Hangyu Lin

Acyr Locatelli

Sara Hooker

Tom Kocmi

Aidan Gomez

Ivan Zhang

Phil Blunsom

Nick Frosst

Joelle Pineau

Beyza Ermis

Ahmet Üstün

Julia Kreutzer

Marzieh Fadaee

Tiny Aya redefines what a small multilingual language model can achieve. Trained on 70 languages and refined through region-aware posttraining, it delivers state-of-the-art in translation quality, strong multilingual understanding, and high-quality target-language generation, all with just 3.35B parameters. The release includes a pretrained foundation model, a globally balanced instruction-tuned variant, and three region-specialized models targeting languages from Africa, South Asia, Europe, Asia-Pacific, and West Asia. This report details the training strategy, data composition, and comprehensive evaluation framework behind Tiny Aya, and presents an alternative scaling path for multilingual AI: one centered on efficiency, balanced performance across languages, and practical deployment.

2026-03-11

arXiv (preprint)

doi.org

arxiv.org

TRACE: Temporal Rule-Anchored Chain-of-Evidence on Knowledge Graphs for Interpretable Stock Movement Prediction

Qianggang Ding

Haochen Shi

Luis Castejón Lozano

Miguel Conner

Juan Abia

Luis Gallego-Ledesma

Joshua Fellowes

Gerard Conangla Planes

Adam Elwood

Bang Liu

We present a Temporal Rule-Anchored Chain-of-Evidence (TRACE) on knowledge graphs for interpretable stock movement prediction that unifies symbolic relational priors, dynamic graph exploration, and LLM-guided decision making in a single end-to-end pipeline. The approach performs rule-guided multi-hop exploration restricted to admissible relation sequences, grounds candidate reasoning chains in contemporaneous news, and aggregates fully grounded evidence into auditable UP/DOWN verdicts with human-readable paths connecting text and structure. On an S&P 500 benchmark, the method achieves 55.1% accuracy, 55.7% precision, 71.5% recall, and 60.8% F1, surpassing strong baselines and improving recall and F1 over the best graph baseline under identical evaluation. The gains stem from (i) rule-guided exploration that focuses search on economically meaningful motifs rather than arbitrary walks, and (ii) text-grounded consolidation that selectively aggregates high-confidence, fully grounded hypotheses instead of uniformly pooling weak signals. Together, these choices yield higher sensitivity without sacrificing selectivity, delivering predictive lift with faithful, auditably interpretable explanations.

2026-03-11

arXiv (preprint)

doi.org

arxiv.org

JEDI: Jointly Embedded Inference of Neural Dynamics

Matthew G. Perich

Animal brains flexibly and efficiently achieve many behavioral tasks with a single neural network. A core goal in modern neuroscience is to map the mechanisms of the brain's flexibility onto the dynamics underlying neural populations. However, identifying task-specific dynamical rules from limited, noisy, and high-dimensional experimental neural recordings remains a major challenge, as experimental data often provide only partial access to brain states and dynamical mechanisms. While recurrent neural networks (RNNs) directly constrained by neural data have been effective in inferring underlying dynamical mechanisms, they are typically limited to single-task domains and struggle to generalize across behavioral conditions. Here, we introduce JEDI, a hierarchical model that captures neural dynamics across tasks and contexts by learning a shared embedding space over RNN weights. This model recapitulates individual samples of neural dynamics while scaling to arbitrarily large and complex datasets, uncovering shared structure across conditions in a single, unified model. Using simulated RNN datasets, we demonstrate that JEDI accurately learns robust, generalizable, condition-specific embeddings. By reverse-engineering the weights learned by JEDI, we show that it recovers ground truth fixed point structures and unveils key features of the underlying neural dynamics in the eigenspectra. Finally, we apply JEDI to motor cortex recordings during monkey reaching to extract mechanistic insight into the neural dynamics of motor control. Our work shows that joint learning of contextual embeddings and recurrent weights provides scalable and generalizable inference of brain dynamics from recordings alone.

2026-03-10

arXiv (preprint)

doi.org

arxiv.org

LLM2Vec-Gen: Generative Embeddings from Large Language Models

Parishad BehnamGhader

Fabian David Schmidt

LLM-based text embedders typically encode the semantic content of their input. However, embedding tasks require mapping diverse inputs to similar outputs. Typically, this input-output gap is addressed by training embedding models with paired data using contrastive learning. In this work, we propose a novel self-supervised approach, LLM2Vec-Gen, which adopts a different paradigm: rather than encoding the input, we learn to represent the model's potential response. Specifically, we add trainable special tokens to the LLM's vocabulary, append them to the input, and optimize them to represent the LLM's response in a fixed-length sequence. Training is guided by the LLM's own completion for the query, along with an unsupervised embedding teacher that provides distillation targets. This formulation helps to bridge the input-output gap and transfers LLM capabilities such as safety alignment and reasoning to embedding tasks. Crucially, the LLM backbone remains frozen and training requires only unlabeled queries. LLM2Vec-Gen achieves state-of-the-art self-supervised performance on the Massive Text Embedding Benchmark (MTEB), improving by 9.3% over the best unsupervised embedding teacher. We also observe up to 43.2% reduction in harmful content retrieval and 29.3% improvement in reasoning capabilities for embedding tasks. Finally, the learned embeddings are interpretable and can be decoded into text to reveal their semantic content.

2026-03-10

arXiv (preprint)

doi.org

arxiv.org

SPT-CL J0417–4748: A Deep Chandra Study of a Relaxed Galaxy Cluster without Central Star Formation

Taweewat Somboonpanyakul

A. Mantz

S. W. Allen

Anthony M. Flores

R. Glenn Morris

Haley R. Stueber

L. E. Bleem

B. Floyd

Julie Hlavacek-Larrondo

Keunho Kim

Abstract We present an in-depth Chandra X-ray analysis of the galaxy cluster SPT-CL J0417−4748 (hereafter SPT J0417) at z = 0.58 with a focus on its thermodynamic properties and the apparent absence of central star formation. Utilizing a total Chandra exposure of 103 ks, we find that the large-scale X-ray morphology is consistent with a dynamically relaxed cool-core system. The intracluster medium shows a central density of $0.08 \pm 0.01$ cm$^{-3}$, a central pseudoentropy of $26^{+6}_{-5}$ keV cm$^{2}$, and a central cooling time of $515^{+96}_{-75}$ Myr, values typical of massive cool-core clusters. Despite these conditions, no evidence of recent or ongoing star formation is detected in the brightest cluster galaxy (BCG). Spectral energy di

2026-03-10

Astrophysical Journal (published)

doi.org

arxiv.org

Learning a Spatial Partitioning and its Causal Relations from Temporal Data

Yaniv Gurwicz

Peer Nowack

Jakob Runge

David Rolnick

Scientific research often seeks to understand the causal structure underlying high-level variables in a system. For example, climate scientists study how phenomena, such as El Niño, affect other climate processes at remote locations across the globe. However, scientists typically collect low-level measurements, such as geographically distributed temperature readings. From these, one needs to learn both a mapping to causally-relevant latent variables, such as a high-level representation of the El Niño phenomenon and other processes, as well as the causal model over them. The challenge is that this task, called causal representation learning, is highly underdetermined from observational data alone, requiring other constraints during learning to resolve the indeterminacies. In this work, we consider the task of partitioning observed variables into disentangled factors, such as extracting regions from geographically gridded measurement data in climate research or capturing brain regions from neural activity data. We demonstrate the identifiability of the resulting model and propose a differentiable method, Causal Discovery with Single-parent Decoding (CDSD), that simultaneously learns, from temporal data, the underlying latents and a causal graph over them. We assess the validity of our theoretical results using simulated data and showcase the practical validity of our method in an application to real-world data from the climate science field.

2026-03-09

Conference on Causal Learning and Reasoning (poster)

openreview.net

LL-SDR: Low-Latency Speech enhancement through Discrete Representations

Jingyi Li

Luca Della Libera

Mirco Ravanelli

Cem Subakan

Many speech enhancement (SE) methods rely on continuous representations. Recently, discrete audio tokens have been explored to enable autoregressive generation for SE. However, it remains unclear whether discretization itself consistently improves SE performance. In this paper, we introduce LL-SDR, a token-based speech enhancement framework that explicitly leverages discretization to better separate speech and noise. Our first contribution is a Variance-Ordered Residual Vector Quantizer (VO-RVQ), designed to disentangle speech and noise distributions during tokenization. Second, we propose a latent-space discriminator to better align enhanced embeddings with semantic embeddings. Experiments show that LL-SDR outperforms continuous baselines and matches the performance of autoregressive token-based approaches, while enabling lightweight, low-latency speech enhancement in both reverberant and non-reverberant noisy environments. Demos and source code are available at our project websites.

2026-03-09

arXiv (preprint)

doi.org

arxiv.org

Summarize Before You Speak with ARACH: A Training-Free Inference-Time Plug-In for Enhancing LLMs via Global Attention Reallocation

Jingtao Wang

Yucong Wang

Jun Ding

Rui Cai

Xun Wang

Large language models (LLMs) achieve remarkable performance, yet further gains often require costly training. This has motivated growing interest in post-training techniques, especially training-free approaches that improve models at inference time without updating weights. Most training-free methods treat the model as a black box and improve outputs via input/output-level interventions, such as prompt design and test-time scaling through repeated sampling, reranking/verification, or search. In contrast, they rarely offer a plug-and-play mechanism to intervene in a model's internal computation. We propose ARACH (Attention Reallocation via an Adaptive Context Hub), a training-free inference-time plug-in that augments LLMs with an adaptive context hub to aggregate context and reallocate attention. Extensive experiments across multiple language modeling tasks show consistent improvements with modest inference overhead and no parameter updates. Attention analyses further suggest that ARACH mitigates the attention sink phenomenon. These results indicate that engineering a model's internal computation offers a distinct inference-time strategy, fundamentally different from both prompt-based test-time methods and training-based post-training approaches.

2026-03-09

arXiv (preprint)

doi.org

arxiv.org

Covenant-72B: Pre-Training a 72B LLM with Trustless Peers Over-the-Internet

Joel Lidin

Amir Sarfi

Erfan Miahi

Quentin Anthony

Shivam Chauhan

Evangelos Pappas

Benjamin Therien

Eugene Belilovsky

Samuel Dare

Recently, there has been increased interest in globally distributed training, which has the promise to both reduce training costs and democratize participation in building large-scale foundation models. However, existing models trained in a globally distributed manner are relatively small in scale and have only been trained with whitelisted participants. Therefore, they do not yet realize the full promise of democratized participation. In this report, we describe Covenant-72B, an LLM produced by the largest collaborative globally distributed pre-training run (in terms of both compute and model scale), which simultaneously allowed open, permissionless participation supported by a live blockchain protocol. We utilized a state-of-the-art communication-efficient optimizer, SparseLoCo, supporting dynamic participation with peers joining and leaving freely. Our model, pre-trained on approximately 1.1T tokens, performs competitively with fully centralized models pre-trained on similar or higher compute budgets, demonstrating that fully democratized, non-whitelisted participation is not only feasible, but can be achieved at unprecedented scale for a globally distributed pre-training run.

2026-03-08

arXiv (preprint)

doi.org

arxiv.org

GPT-based self-supervised anomaly detection in command lines

Miles Q. Li

Julien Keutchayan

François Charest

Benjamin C. M. Fung

2026-03-08

Journal of Computer Virology and Hacking Techniques (published)

doi.org
