Portrait de Irina Rish

Irina Rish

Membre académique principal
Chaire en IA Canada-CIFAR
Professeure titulaire, Université de Montréal, Département d'informatique et de recherche opérationnelle
Sujets de recherche
Apprentissage en ligne
Apprentissage multimodal
Apprentissage par renforcement
Apprentissage profond
Modèles génératifs
Neurosciences computationnelles
Traitement du langage naturel

Biographie

Irina Rish est professeure titulaire à l'Université de Montréal (UdeM), où elle dirige le Laboratoire d'IA autonome. Membre du corps professoral de Mila – Institut québécois d’intelligence artificielle, elle est titulaire d'une chaire d'excellence en recherche du Canada (CERC) et d'une chaire en IA Canada-CIFAR. Irina dirige le projet INCITE du ministère américain de l'Environnement au sujet des modèles de fondation évolutifs sur les superordinateurs Summit et Frontier à l'Oak Ridge Leadership Computing Facility (OLCF). Elle est cofondatrice et directrice scientifique de Nolano.ai.

Ses recherches actuelles portent sur les lois de mise à l'échelle neuronale et les comportements émergents (capacités et alignement) dans les modèles de fondation, ainsi que sur l'apprentissage continu, la généralisation hors distribution et la robustesse. Avant de se joindre à l'UdeM en 2019, Irina était chercheuse au Centre de recherche IBM Thomas J. Watson, où elle a travaillé sur divers projets à l'intersection des neurosciences et de l'IA, et dirigé le défi NeuroAI. Elle a reçu plusieurs prix IBM : ceux de l’excellence et de l’innovation exceptionnelle (2018), celui de la réalisation technique exceptionnelle (2017), et celui de l’accomplissement en recherche (2009). Elle détient 64 brevets et a écrit plus de 120 articles de recherche, plusieurs chapitres de livres, trois livres publiés et une monographie sur la modélisation éparse.

Étudiants actuels

Doctorat - UdeM
Co-superviseur⋅e :
Doctorat - UdeM
Co-superviseur⋅e :
Maîtrise recherche - UdeM
Doctorat - Concordia
Superviseur⋅e principal⋅e :
Doctorat - UdeM
Collaborateur·rice de recherche - UdeM
Maîtrise recherche - Concordia
Superviseur⋅e principal⋅e :
Doctorat - UdeM
Collaborateur·rice de recherche
Co-superviseur⋅e :
Visiteur de recherche indépendant - -
Collaborateur·rice de recherche - UdeM
Stagiaire de recherche - UdeM
Co-superviseur⋅e :
Collaborateur·rice alumni - UdeM
Superviseur⋅e principal⋅e :
Doctorat - UdeM
Superviseur⋅e principal⋅e :
Doctorat - UdeM
Collaborateur·rice de recherche - UdeM
Collaborateur·rice alumni - UdeM
Doctorat - Concordia
Superviseur⋅e principal⋅e :
Maîtrise recherche - UdeM
Collaborateur·rice alumni - UdeM
Collaborateur·rice de recherche
Collaborateur·rice de recherche - UdeM
Collaborateur·rice de recherche - McGill
Superviseur⋅e principal⋅e :
Doctorat - UdeM
Co-superviseur⋅e :
Collaborateur·rice de recherche - Polytechnique
Doctorat - McGill
Superviseur⋅e principal⋅e :
Maîtrise recherche - UdeM
Co-superviseur⋅e :
Doctorat - UdeM
Doctorat - McGill
Superviseur⋅e principal⋅e :
Collaborateur·rice de recherche
Doctorat - Concordia
Superviseur⋅e principal⋅e :
Doctorat - UdeM
Co-superviseur⋅e :
Baccalauréat - McGill
Doctorat - UdeM
Co-superviseur⋅e :
Maîtrise recherche - UdeM
Doctorat - McGill

Publications

Feature Geometry of Language Models Transfer Across Modalities to Time Series
Language models transfer to time-series forecasting, but it is unclear whether this reflects reusable internal structure or rapid relearning… (voir plus) under a familiar architecture. We study this transfer directly by comparing pretrained and randomly initialized versions of the same model on a forecasting objective whose inputs have little semantic overlap with text but still require autoregressive sequential structure. Across Qwen3-0.6B finetuning experiments, language initialization gives coherent per-example gradients from the first update, while random initialization first passes through a low-alignment warmup phase. Effective-rank and hidden-state analyses show that finetuning selectively reshapes an existing representation geometry rather than constructing the simpler temporal geometry found by models trained from scratch. Cross-domain sparse features and causal ablations then expose candidate transferred primitives, including a Layer~1 head--MLP circuit whose ablation selectively increases loss on periodic forecasting and repetitive language passages. These results support an account of cross-modal transfer in which autoregressive pretraining creates temporal feature geometry that can be selected and specialized outside language.
Forecasting Emerges from Auto-Regressive Pretraining: Latent Predictive Structure in Language Models
Predicting how a sequence will continue is a basic problem for intelligent systems. We show that large language models contain usable foreca… (voir plus)sting structure before any explicit time-series supervision. A single linear readout from frozen Qwen3-0.6B hidden states maps ordinary text sequences to numerical trajectories that resemble real time series, and those trajectories can be used for straightforward forecasts. The distribution over output tokens also gives coherent, non-crossing probabilistic forecasts in a single forward pass. After time-series specialization, pretrained models show aligned gradients and improve immediately, whereas randomly initialized models spend early training in a destructive-interference regime. These findings suggest that auto-regressive pretraining already shapes representations around temporal continuation; and finetuning adapts that structure to numerical forecasting rather than creating it from scratch.
Representing Time Series as Structured Programs for LLM Reasoning
Jaeho Kim
Changhun Oh
Seokhyun Lee
Changhee Lee
Large language models (LLMs) have demonstrated strong reasoning and instruction-following capabilities, making them potentially powerful too… (voir plus)ls for time-series analysis. However, time series lie outside their native textual modality, raising a fundamental question: how should time series be represented so that LLMs can reason about them effectively? Existing work typically serializes raw numerical sequences or fine-tunes pre-trained LLMs on time-series data. These approaches place the burden of extracting temporal structure directly on the LLM, creating a modality mismatch that often degrades performance on long sequences and introduces substantial computational overhead. In this work, we introduce Time-Series-to-Structured-Program representation (T2SP), a deterministic, training-free method that represents a time series as a structured symbolic program. T2SP decomposes time series into trends, periods, and salient events, expressing them in a program-friendly format aligned with the textual and code-like modalities on which LLMs are natively trained. By shifting temporal-structure extraction from the model to the representation itself, T2SP enables off-the-shelf LLMs to leverage their existing reasoning capabilities for time-series understanding. We evaluate T2SP on three reasoning tasks -- editing, captioning, and question answering -- where it consistently improves performance, reduces reasoning time, and lowers failure rates compared with raw-string representations. Our results demonstrate that T2SP provides an effective interface between time series and LLMs.
Rank Collapse, Fixed Points, and the Renormalization Group Structure of MLP Residual Networks
Parviz Haggi-Mani
The analogy between deep neural network forward passes and renormalization group (RG) flows has been repeatedly noted in the literature, but… (voir plus) existing treatments remain qualitative: depth is described as a coarse-graining scale, attention is likened to a partition function, and representations are said to flow toward fixed points. No existing work has defined a measurable RG order parameter, tested it under controlled variation of the input distribution, or made quantitative predictions that are empirically verified. We study the simplest architecture for which the analogy is tractable: a pure MLP residual stack trained on masked token prediction over synthetic Markov chain sequences with known spectral properties. We report three findings. (i) The effective rank of the residual stream decreases monotonically with depth after training, consistent with progressive integration of irrelevant degrees of freedom. (ii) This rank collapse is selective: it occurs for chains with short correlation length approximately 1 but is absent for chains with long correlation length approximately 7, measured at the position level to control for mean-pooling artifacts. The network preserves exactly the degrees of freedom relevant to the prediction task, the content of the RG relevance criterion. (iii) Inter-layer kernel drift is concentrated at one or two specific transitions, with the remainder of the network near a fixed point, consistent with a discrete fixed-point plateau. Together these findings constitute the first quantitative, position-level evidence that MLP residual networks implement a selective coarse-graining procedure governed by the spectral structure of the input distribution.
Failed Reasoning Traces Tell You What Is Fixable (But Not by Reading Them)
When post-trained language models fail on reasoning problems, the common test-time-scaling response is to spend more compute on additional a… (voir plus)ttempts, and the failed traces play no further role. We argue this discards a crucial signal; some failures come from unlucky sampling, where more rollouts help, while others are structural and resist resampling regardless of budget. We propose that failed traces encode recoverability structure: the inference-time signature of which test-time interventions can rescue a given failure. Three problem-level trajectory features, derived from the structure of available interventions, recover this structure from the distributional signature of failed rollouts, not their text. They cluster failures into stable regimes, characterize the failure topography of different post-training methods (
Unified Neural Scaling Laws
Priyank Jaini
David Krueger
We present a functional form (that we refer to as a Unified Neural Scaling Law (UNSL)) that accurately models and extrapolates the scaling b… (voir plus)ehaviors of deep neural networks as multiple dimensions all vary simultaneously (i.e. how the evaluation metric of interest varies as one simultaneously varies the number of model parameters, training dataset size, number of training steps, number of inference steps, amount of compute, and various hyperparameters) for various architectures and for each of various tasks within a varied set of upstream and downstream tasks. This set includes large-scale vision, language, math, and reinforcement learning. When compared to other functional forms for neural scaling, this functional form yields extrapolations of scaling behavior that are considerably more accurate on this set.
LLM Pretraining Shapes a Generalizable Manifold: Insights into Cross-Modal Transfer to Time Series
Can language-pretrained transformers become effective time-series forecasters, and why? In this paper, we show that cross-modal transfer ari… (voir plus)ses because language pretraining preconditions time series training with a reusable manifold. A linear probe on frozen LLM states decodes realistic time-series trajectories without paired supervision, and retrieval in this projected space yields competitive forecasts, showing that structure and dynamics exist before finetuning. Pretrained initialization also improves optimization, producing coherent gradients and a highly anisotropic loss landscape unlike random initialization. Finetuning then acts as low-dimensional alignment, reusing existing directions rather than learning temporal primitives from scratch, as evidenced by low-rank updates, subspace alignment, and shared features for periodicity, trend, and repetition. Together, these results support a geometric account of LLM-to-time-series transfer: language pretraining builds the manifold, and finetuning projects numerical dynamics onto task-relevant directions.
World models, artificial general intelligence and the hard problems of life–mind continuity: toward a unified understanding of natural and artificial intelligence
Adam Safron
Michael Levin
Victoria Klimaj
Dalton Sakthivadivel
Adeel Razi
David Ha
Nick Hay
Kevin Schmidt
David Krakauer
Melanie Mitchell
Samuel J. Gershman
Joshua B. Tenenbaum
Abstract This special issue examines how natural and artificial intelligences (AIs) model the world, and what this modelling reveals about c… (voir plus)ognition and relationships between life and mind. Rather than adopting a single definition, the collection considers how world models function and emerge in biological and artificial systems, exploring a diverse range of world modelling including causal, self-referential, individual goal-directed, collective and narrative forms. A recurring theme is the extent to which current AI systems trained on vast quantities of data learn the context-sensitive, temporally embedded, value-laden dimensions of world modelling that characterize diverse biological intelligences, or whether their impressive capabilities arise primarily from statistical surface regularities. The contributions also raise broader issues concerning embodiment, complexity, learning architectures and the social and scientific contexts in which world models operate. With this collection, we hope to clarify the conceptual landscape, identify key points of similarity and divergence between natural and artificial minds, and outline questions that may guide future research on the forms of world modelling that support grounded understanding, robust agency and potentially human-like general intelligence. This article is part of the theme issue ‘World models in natural and artificial intelligence’.
Emergent Reasoning via Recursive Latent Reinforcement Pretraining
Large language models (LLMs) often rely on explicit chain-of-thought (CoT) traces to solve multi-step reasoning problems, but these traces i… (voir plus)ncrease inference cost, expose brittle prompt dependence, and complicate training objectives. We study an alternative: \emph{latent deliberation} implemented as a small recurrent refinement module that performs multiple internal ``thinking`` steps while keeping the external sequence length fixed. We introduce \textbf{Recursive Latent Reinforcement Pretraining (RLRP)}, a training recipe that augments a base causal LLM with a shared latent head executed for
On the Adversarial Robustness of Discrete Image Tokenizers
Nicolas Flammarion
Francesco Croce
Discrete image tokenizers encode visual inputs as sequences of tokens from a finite vocabulary and are gaining popularity in multimodal syst… (voir plus)ems, including encoder-only, encoder-decoder, and decoder-only models. However, unlike CLIP encoders, their vulnerability to adversarial attacks has not been explored. Ours being the first work studying this topic, we first formulate attacks that aim to perturb the features extracted by discrete tokenizers, and thus change the extracted tokens. These attacks are computationally efficient, application-agnostic, and effective across classification, multimodal retrieval, and captioning tasks. Second, to defend against this vulnerability, inspired by recent work on robust CLIP encoders, we fine-tune popular tokenizers with unsupervised adversarial training, keeping all other components frozen. While unsupervised and task-agnostic, our approach significantly improves robustness to both unsupervised and end-to-end supervised attacks and generalizes well to unseen tasks and data. Unlike supervised adversarial training, our approach can leverage unlabeled images, making it more versatile. Overall, our work highlights the critical role of tokenizer robustness in downstream tasks and presents an important step in the development of safe multimodal foundation models.
Adaptive Batch Sizes Using Non-Euclidean Gradient Noise Scales for Stochastic Sign and Spectral Descent
Shagun Gupta
Youssef Briki
Parameswaran Raman
Hao-Jun Michael Shi
To maximize hardware utilization, modern machine learning systems typically employ large constant or manually tuned batch size schedules, re… (voir plus)lying on heuristics that are brittle and costly to tune. Existing adaptive strategies based on gradient noise scale (GNS) offer a principled alternative. However, their assumption of SGD's Euclidean geometry creates a fundamental mismatch with popular optimizers based on generalized norms, such as signSGD / Signum (
$\mu$LO: Compute-Efficient Meta-Generalization of Learned Optimizers
Learned optimizers (LOs) have the potential to significantly reduce the wall-clock training time of neural networks. However, they can strug… (voir plus)gle to optimize unseen tasks (*meta-generalize*), especially when training networks wider than those seen during meta-training. To address this, we derive the Maximal Update Parametrization (