Prateek Humane

Feature Geometry of Language Models Transfer Across Modalities to Time Series

Zhenghan Tai

Vasilii Feofanov

Language models transfer to time-series forecasting, but it is unclear whether this reflects reusable internal structure or rapid relearning under a familiar architecture. We study this transfer directly by comparing pretrained and randomly initialized versions of the same model on a forecasting objective whose inputs have little semantic overlap with text but still require autoregressive sequential structure. Across Qwen3-0.6B finetuning experiments, language initialization gives coherent per-example gradients from the first update, while random initialization first passes through a low-alignment warmup phase. Effective-rank and hidden-state analyses show that finetuning selectively reshapes an existing representation geometry rather than constructing the simpler temporal geometry found by models trained from scratch. Cross-domain sparse features and causal ablations then expose candidate transferred primitives, including a Layer 1 head-MLP circuit whose ablation selectively increases loss on periodic forecasting and repetitive language passages. These results support an account of cross-modal transfer in which autoregressive pretraining creates temporal feature geometry that can be selected and specialized outside language.
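For readers unfamiliar with the effective-rank analysis mentioned in the abstract, the following is a minimal sketch of one common definition (the exponential of the Shannon entropy of the normalized singular values). The abstract does not state which definition the authors use, so the function name `effective_rank`, the centering step, and the `eps` threshold are all assumptions made for illustration.

```python
import numpy as np


def effective_rank(hidden_states: np.ndarray, eps: float = 1e-12) -> float:
    """Entropy-based effective rank of a (num_tokens, hidden_dim) matrix.

    Assumption: this is one common definition, not necessarily the
    one used in the paper.
    """
    # Center each feature so the spectrum reflects variance, not the mean offset.
    centered = hidden_states - hidden_states.mean(axis=0, keepdims=True)
    singular_values = np.linalg.svd(centered, compute_uv=False)

    total = singular_values.sum()
    if total <= eps:
        return 0.0  # all-constant input carries no directions at all

    # Treat the normalized spectrum as a probability distribution.
    p = singular_values / total
    p = p[p > eps]  # drop numerically-zero directions before taking logs
    entropy = -np.sum(p * np.log(p))
    return float(np.exp(entropy))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    rank_one = np.outer(rng.normal(size=32), rng.normal(size=16))
    print(effective_rank(rank_one))                   # close to 1
    print(effective_rank(rng.normal(size=(256, 16))))  # close to, but below, 16
```

Under this definition a representation that spreads variance evenly over many directions scores high, while one dominated by a few directions scores low, which is the sense in which a geometry can be called "simpler".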

2026-06-10

ICML.cc/2026/Workshop/Mech_Interp (poster)

openreview.net

Forecasting Emerges from Auto-Regressive Pretraining: Latent Predictive Structure in Language Models

Zhenghan Tai

Vasilii Feofanov

Predicting how a sequence will continue is a basic problem for intelligent systems. We show that large language models contain usable forecasting structure before any explicit time-series supervision. A single linear readout from frozen Qwen3-0.6B hidden states maps ordinary text sequences to numerical trajectories that resemble real time series, and those trajectories can be used for straightforward forecasts. The distribution over output tokens also gives coherent, non-crossing probabilistic forecasts in a single forward pass. After time-series specialization, pretrained models show aligned gradients and improve immediately, whereas randomly initialized models spend early training in a destructive-interference regime. These findings suggest that auto-regressive pretraining already shapes representations around temporal continuation, and finetuning adapts that structure to numerical forecasting rather than creating it from scratch.
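The non-crossing property follows naturally when every quantile is read from the same categorical distribution: the cumulative distribution is monotone, so a higher quantile level can never map to a lower value. The sketch below illustrates this under the assumption that each output token corresponds to an ordered numeric bin; the function name `quantiles_from_token_probs`, the bin values, and the probabilities are hypothetical and are not taken from the paper.

```python
import numpy as np


def quantiles_from_token_probs(probs, bin_values, levels):
    """Read several quantiles from one categorical distribution over bins.

    Assumption: each token maps to a numeric bin value; this mapping is
    illustrative, not the paper's tokenization.
    """
    probs = np.asarray(probs, dtype=float)
    bin_values = np.asarray(bin_values, dtype=float)

    # Sort bins by value so the cumulative sum is a proper CDF.
    order = np.argsort(bin_values)
    sorted_values = bin_values[order]
    cdf = np.cumsum(probs[order])
    cdf /= cdf[-1]  # normalize so the last entry is exactly 1.0

    # For each level, pick the first bin whose CDF reaches that level.
    idx = np.searchsorted(cdf, np.asarray(levels, dtype=float), side="left")
    idx = np.clip(idx, 0, len(cdf) - 1)
    return sorted_values[idx]


if __name__ == "__main__":
    probs = [0.1, 0.2, 0.4, 0.2, 0.1]   # hypothetical next-token probabilities
    bins = [-2.0, -1.0, 0.0, 1.0, 2.0]  # hypothetical bin centers
    q = quantiles_from_token_probs(probs, bins, [0.05, 0.5, 0.95])
    print(q)
    # Quantiles are non-decreasing in the level, so intervals cannot cross.
    print(bool(np.all(np.diff(q) >= 0)))
```

Because all levels share one forward pass and one distribution, no extra constraint or post-hoc sorting is needed to keep the quantiles ordered.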

2026-06-10

ICML.cc/2026/Workshop/Forecast (oral)

openreview.net

LLM Pretraining Shapes a Generalizable Manifold: Insights into Cross-Modal Transfer to Time Series

Zhenghan Tai

Vasilii Feofanov

Can language-pretrained transformers become effective time-series forecasters, and why? In this paper, we show that cross-modal transfer arises because language pretraining preconditions time series training with a reusable manifold. A linear probe on frozen LLM states decodes realistic time-series trajectories without paired supervision, and retrieval in this projected space yields competitive forecasts, showing that structure and dynamics exist before finetuning. Pretrained initialization also improves optimization, producing coherent gradients and a highly anisotropic loss landscape unlike random initialization. Finetuning then acts as low-dimensional alignment, reusing existing directions rather than learning temporal primitives from scratch, as evidenced by low-rank updates, subspace alignment, and shared features for periodicity, trend, and repetition. Together, these results support a geometric account of LLM-to-time-series transfer: language pretraining builds the manifold, and finetuning projects numerical dynamics onto task-relevant directions.
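One simple way to quantify how "coherent" a batch of gradients is, offered here only as an illustrative sketch, is the mean pairwise cosine similarity between per-example gradient vectors. The abstract does not specify the authors' metric, so the function name `gradient_coherence` and this particular choice of statistic are assumptions.

```python
import numpy as np


def gradient_coherence(per_example_grads) -> float:
    """Mean pairwise cosine similarity between per-example gradients.

    Input has shape (num_examples, num_params), one flattened gradient
    per row. Assumption: an illustrative metric, not the paper's.
    """
    g = np.asarray(per_example_grads, dtype=float)
    n = g.shape[0]
    if n < 2:
        raise ValueError("need at least two per-example gradients")

    # Normalize rows so the Gram matrix holds cosine similarities.
    norms = np.linalg.norm(g, axis=1, keepdims=True)
    unit = g / np.clip(norms, 1e-12, None)
    cos = unit @ unit.T

    # Average the off-diagonal entries only (exclude self-similarity).
    off_diag_sum = cos.sum() - np.trace(cos)
    return float(off_diag_sum / (n * (n - 1)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    shared = rng.normal(size=100)
    aligned = shared + 0.1 * rng.normal(size=(8, 100))  # common direction plus noise
    unrelated = rng.normal(size=(8, 100))                # no common direction
    print(gradient_coherence(aligned))    # near 1
    print(gradient_coherence(unrelated))  # near 0
```

Values near 1 mean the examples agree on an update direction, values near 0 mean the gradients are roughly orthogonal, and negative values mean examples pull against each other, which matches the intuition of destructive interference.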

2026-05-18

arXiv (preprint)

doi.org

arxiv.org

CHIRP: A Fine-Grained Benchmark for Open-Ended Response Evaluation in Vision-Language Models

Daniel Z Kaplan

Qirui Sun

Jonathan Siu Chi Lim

Quentin Gregory Anthony

Edwin Fennell

Irina Rish

The proliferation of Vision-Language Models (VLMs) in the past several years calls for rigorous and comprehensive evaluation methods and benchmarks. This work analyzes existing VLM evaluation techniques, including automated metrics, AI-based assessments, and human evaluations across diverse tasks. We first introduce Robin, a novel suite of VLMs that we built by combining Large Language Models (LLMs) and Vision Encoders (VEs) at multiple scales, and use Robin to identify shortcomings of current evaluation approaches across scales. Next, to overcome the identified limitations, we introduce CHIRP, a new long-form response benchmark we developed for more robust and complete VLM evaluation. We provide open access to the Robin training code, model suite, and CHIRP benchmark to promote reproducibility and advance VLM research.

2025-01-15

arXiv (preprint)

doi.org

arxiv.org
