Language models transfer to time-series forecasting, but it is unclear whether this reflects reusable internal structure or rapid relearning under a familiar architecture.
We study this transfer directly by comparing pretrained and randomly initialized versions of the same model on a forecasting objective whose inputs have little semantic overlap with text but still require autoregressive sequential structure.
Across Qwen3-0.6B finetuning experiments, language initialization gives coherent per-example gradients from the first update, while random initialization first passes through a low-alignment warmup phase.
Effective-rank and hidden-state analyses show that finetuning selectively reshapes an existing representation geometry rather than constructing the simpler temporal geometry found by models trained from scratch.
Cross-domain sparse features and causal ablations then expose candidate transferred primitives, including a Layer 1 head-MLP circuit whose ablation selectively increases loss on periodic forecasting and repetitive language passages.
These results support an account of cross-modal transfer in which autoregressive pretraining creates temporal feature geometry that can be selected and specialized outside language.
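To make the notion of "coherent per-example gradients" concrete, the following is a minimal sketch of one common alignment measure: the mean pairwise cosine similarity between per-example gradient vectors. The abstract does not state which metric the authors use, so the function names and the toy gradients below are illustrative assumptions, not the paper's method.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two flat gradient vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def gradient_alignment(per_example_grads):
    """Mean pairwise cosine similarity across per-example gradients.

    Values near 1 mean the examples agree on the update direction;
    values near or below 0 mean their gradients largely cancel.
    """
    pairs = list(combinations(per_example_grads, 2))
    if not pairs:
        raise ValueError("need at least two gradients")
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Hypothetical toy gradients (not data from the paper).
coherent = [[1.0, 0.9], [0.9, 1.1], [1.1, 1.0]]
conflicting = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]

print(gradient_alignment(coherent))     # close to 1: gradients point the same way
print(gradient_alignment(conflicting))  # negative: gradients interfere
```

Under this measure, a low-alignment warmup phase would appear as values near zero early in training that rise as the model settles on shared directions.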
Predicting how a sequence will continue is a basic problem for intelligent systems. We show that large language models contain usable
forecasting structure before any explicit time-series supervision. A
single linear readout from frozen Qwen3-0.6B hidden states maps ordinary text
sequences to numerical trajectories that resemble real time series, and those
trajectories can be used for straightforward forecasts. The distribution over output tokens also gives coherent, non-crossing probabilistic forecasts in a single forward pass. After time-series
specialization, pretrained models show aligned gradients and improve
immediately, whereas randomly initialized models spend early training in a
destructive-interference regime. These findings suggest that autoregressive
pretraining already shapes representations around temporal continuation, and
finetuning adapts that structure to numerical forecasting rather than
creating it from scratch.
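One way to see why forecasts read from a single output distribution cannot cross: if every quantile is taken from the same cumulative distribution, higher levels can never map to lower values. The sketch below assumes, hypothetically, that output tokens correspond to ordered numeric bins; the abstract does not describe the actual token-to-value mapping, so `bin_values` and the probabilities are illustrative only.

```python
from itertools import accumulate


def quantiles_from_bins(probs, bin_values, levels):
    """Read quantiles off the CDF of a categorical distribution over ordered bins.

    probs      : probability mass per bin (normalized internally)
    bin_values : numeric value of each bin, in ascending order
    levels     : quantile levels in (0, 1]
    """
    if len(probs) != len(bin_values):
        raise ValueError("probs and bin_values must have the same length")
    total = sum(probs)
    cdf = [c / total for c in accumulate(probs)]
    result = []
    for q in levels:
        # First bin whose cumulative mass reaches the requested level.
        idx = next((i for i, c in enumerate(cdf) if c >= q), len(cdf) - 1)
        result.append(bin_values[idx])
    return result


# Hypothetical distribution over five ordered bins (not from the paper).
probs = [0.125, 0.25, 0.25, 0.25, 0.125]
bin_values = [-2.0, -1.0, 0.0, 1.0, 2.0]

forecast = quantiles_from_bins(probs, bin_values, [0.1, 0.5, 0.9])
print(forecast)
```

Because the CDF is non-decreasing, the returned quantiles are non-decreasing in the level, with no separate model or post-hoc sorting per quantile.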
Can language-pretrained transformers become effective time-series forecasters, and why? In this paper, we show that cross-modal transfer arises because language pretraining preconditions time series training with a reusable manifold. A linear probe on frozen LLM states decodes realistic time-series trajectories without paired supervision, and retrieval in this projected space yields competitive forecasts, showing that structure and dynamics exist before finetuning. Pretrained initialization also improves optimization, producing coherent gradients and a highly anisotropic loss landscape unlike random initialization. Finetuning then acts as low-dimensional alignment, reusing existing directions rather than learning temporal primitives from scratch, as evidenced by low-rank updates, subspace alignment, and shared features for periodicity, trend, and repetition. Together, these results support a geometric account of LLM-to-time-series transfer: language pretraining builds the manifold, and finetuning projects numerical dynamics onto task-relevant directions.
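The claim of "low-rank updates" can be quantified in several ways. A minimal sketch of one common choice, the entropy-based effective rank (the exponential of the Shannon entropy of the normalized singular values), is given below using NumPy. Whether the authors use this exact definition is an assumption, and the matrices are synthetic stand-ins for a weight update, not results from the paper.

```python
import numpy as np


def effective_rank(matrix, eps=1e-12):
    """Entropy-based effective rank: exp(H(p)), with p the normalized singular values."""
    s = np.linalg.svd(np.asarray(matrix, dtype=float), compute_uv=False)
    total = s.sum()
    if total <= eps:
        return 0.0
    p = s / total
    p = p[p > eps]  # drop numerically zero singular values before taking logs
    return float(np.exp(-(p * np.log(p)).sum()))


rng = np.random.default_rng(0)

# Hypothetical weight update confined to a 2-dimensional subspace.
low_rank_update = rng.normal(size=(64, 2)) @ rng.normal(size=(2, 64))
# Hypothetical unstructured update of the same shape, for comparison.
dense_update = rng.normal(size=(64, 64))

print(effective_rank(low_rank_update))  # at most 2
print(effective_rank(dense_update))     # much larger
```

Applied to the difference between finetuned and pretrained weights, a small value relative to the matrix dimension would be consistent with finetuning reusing a few existing directions.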
The proliferation of Vision-Language Models (VLMs) in the past several years calls for rigorous and comprehensive evaluation methods and benchmarks. This work analyzes existing VLM evaluation techniques, including automated metrics, AI-based assessments, and human evaluations across diverse tasks. We first introduce Robin - a novel suite of VLMs that we built by combining Large Language Models (LLMs) and Vision Encoders (VEs) at multiple scales, and use Robin to identify shortcomings of current evaluation approaches across scales. Next, to overcome the identified limitations, we introduce CHIRP - a new long form response benchmark we developed for more robust and complete VLM evaluation. We provide open access to the Robin training code, model suite, and CHIRP benchmark to promote reproducibility and advance VLM research.