Alexis Roger

Feature Geometry of Language Models Transfer Across Modalities to Time Series

Zhenghan Tai

Vasilii Feofanov

Language models transfer to time-series forecasting, but it is unclear whether this reflects reusable internal structure or rapid relearning under a familiar architecture. We study this transfer directly by comparing pretrained and randomly initialized versions of the same model on a forecasting objective whose inputs have little semantic overlap with text but still require autoregressive sequential structure. Across Qwen3-0.6B finetuning experiments, language initialization gives coherent per-example gradients from the first update, while random initialization first passes through a low-alignment warmup phase. Effective-rank and hidden-state analyses show that finetuning selectively reshapes an existing representation geometry rather than constructing the simpler temporal geometry found by models trained from scratch. Cross-domain sparse features and causal ablations then expose candidate transferred primitives, including a Layer 1 head-MLP circuit whose ablation selectively increases loss on periodic forecasting and repetitive language passages. These results support an account of cross-modal transfer in which autoregressive pretraining creates temporal feature geometry that can be selected and specialized outside language.

2026-06-10

ICML.cc/2026/Workshop/Mech_Interp (poster)

openreview.net

Forecasting Emerges from Auto-Regressive Pretraining: Latent Predictive Structure in Language Models

Zhenghan Tai

Vasilii Feofanov

Predicting how a sequence will continue is a basic problem for intelligent systems. We show that large language models contain usable forecasting structure before any explicit time-series supervision. A single linear readout from frozen Qwen3-0.6B hidden states maps ordinary text sequences to numerical trajectories that resemble real time series, and those trajectories can be used for straightforward forecasts. The distribution over output tokens also gives coherent, non-crossing probabilistic forecasts in a single forward pass. After time-series specialization, pretrained models show aligned gradients and improve immediately, whereas randomly initialized models spend early training in a destructive-interference regime. These findings suggest that auto-regressive pretraining already shapes representations around temporal continuation, and that finetuning adapts that structure to numerical forecasting rather than creating it from scratch.

2026-06-10

ICML.cc/2026/Workshop/Forecast (oral)

openreview.net

LLM Pretraining Shapes a Generalizable Manifold: Insights into Cross-Modal Transfer to Time Series

Zhenghan Tai

Vasilii Feofanov

Can language-pretrained transformers become effective time-series forecasters, and why? In this paper, we show that cross-modal transfer arises because language pretraining preconditions time series training with a reusable manifold. A linear probe on frozen LLM states decodes realistic time-series trajectories without paired supervision, and retrieval in this projected space yields competitive forecasts, showing that structure and dynamics exist before finetuning. Pretrained initialization also improves optimization, producing coherent gradients and a highly anisotropic loss landscape unlike random initialization. Finetuning then acts as low-dimensional alignment, reusing existing directions rather than learning temporal primitives from scratch, as evidenced by low-rank updates, subspace alignment, and shared features for periodicity, trend, and repetition. Together, these results support a geometric account of LLM-to-time-series transfer: language pretraining builds the manifold, and finetuning projects numerical dynamics onto task-relevant directions.

2026-05-18

arXiv (preprint)

doi.org

arxiv.org

CHIRP: A Fine-Grained Benchmark for Open-Ended Response Evaluation in Vision-Language Models

Daniel Z Kaplan

Qirui Sun

Jonathan Siu Chi Lim

Quentin Gregory Anthony

Edwin Fennell

Irina Rish

The proliferation of Vision-Language Models (VLMs) in the past several years calls for rigorous and comprehensive evaluation methods and benchmarks. This work analyzes existing VLM evaluation techniques, including automated metrics, AI-based assessments, and human evaluations across diverse tasks. We first introduce Robin, a novel suite of VLMs that we built by combining Large Language Models (LLMs) and Vision Encoders (VEs) at multiple scales, and use Robin to identify shortcomings of current evaluation approaches across scales. Next, to overcome the identified limitations, we introduce CHIRP, a new long-form response benchmark we developed for more robust and complete VLM evaluation. We provide open access to the Robin training code, model suite, and CHIRP benchmark to promote reproducibility and advance VLM research.

2025-01-15

arXiv (preprint)

doi.org

arxiv.org

The Effect of Data Corruption on Multimodal Long Form Responses

Daniel Z Kaplan

Alexis Roger

Mohamed Osman

Irina Rish

Despite significant progress, Vision-Language Models (VLMs) still struggle with hallucinations, especially in long-form responses. Existing strategies have had limited success in specific cases, and long-form generation remains problematic. In this work we attempt to establish the link between the data used to train the model and the hallucinations in the model's output. To this end, we examine hallucinations through data corruption. We develop a method to corrupt training data and then train models with this data to see the effect on performance. We show that corrupting only a small portion of the long-form training data significantly impairs the performance of the model on long-form tasks, while leaving simpler tasks like visual question-answering and multiple choice relatively intact. All training code and models are released for reproducibility and future research.

2024-07-02

ICML.cc/2024/Workshop/FM-Wild (poster)

openreview.net

Towards Adversarially Robust Vision-Language Models: Insights from Design Choices and Prompt Formatting Techniques

Daniel Z Kaplan

Vision-Language Models (VLMs) have witnessed a surge in both research and real-world applications. However, as they become increasingly prevalent, ensuring their robustness against adversarial attacks is paramount. This work systematically investigates the impact of model design choices on the adversarial robustness of VLMs against image-based attacks. Additionally, we introduce novel, cost-effective approaches to enhance robustness through prompt formatting. By rephrasing questions and suggesting potential adversarial perturbations, we demonstrate substantial improvements in model robustness against strong image-based attacks such as Auto-PGD. Our findings provide important guidelines for developing more robust VLMs, particularly for deployment in safety-critical environments.

2024-06-27

ICML.cc/2024/Workshop/NextGenAISafety (poster)

doi.org

openreview.net

Towards ethical multimodal systems

Alexis Roger

Esma Aimeur

Irina Rish

2023-04-25

arXiv (preprint)

doi.org

arxiv.org