Portrait of Irina Rish

Irina Rish

Core Academic Member
Canada CIFAR AI Chair
Full Professor, Université de Montréal, Department of Computer Science and Operations Research Department

Biography

Irina Rish is a full professor at the Université de Montréal (UdeM), where she leads the Autonomous AI Lab, and a core academic member of Mila – Quebec Artificial Intelligence Institute.

In addition to holding a Canada Excellence Research Chair (CERC) and a CIFAR Chair, she leads the U.S. Department of Energy’s INCITE project on Scalable Foundation Models on Summit & Frontier supercomputers at the Oak Ridge Leadership Computing Facility. She co-founded and serves as CSO of Nolano.ai.

Rish’s current research interests include neural scaling laws and emergent behaviors (capabilities and alignment) in foundation models, as well as continual learning, out-of-distribution generalization and robustness.

Before joining UdeM in 2019, she was a research scientist at the IBM T.J. Watson Research Center, where she worked on various projects at the intersection of neuroscience and AI, and led the Neuro-AI challenge. She was awarded the IBM Eminence & Excellence Award and IBM Outstanding Innovation Award (2018), IBM Outstanding Technical Achievement Award (2017) and IBM Research Accomplishment Award (2009).

She holds 64 patents and has published 120 research papers, several book chapters, three edited books and a monograph on sparse modeling.

Current Students

PhD - Université de Montréal
Principal supervisor :
Master's Research - Université de Montréal
PhD - Université de Montréal
Independent visiting researcher
Master's Research - Université de Montréal
Master's Research - Université de Montréal
PhD - Université de Montréal
Co-supervisor :
Collaborating researcher
Collaborating researcher - Université de Montréal
Research Intern - Technical University of Munich
Master's Research - Université de Montréal
Master's Research - Université de Montréal
PhD - McGill University
Principal supervisor :
Independent visiting researcher - Université de Montréal
Co-supervisor :
PhD - Concordia University
Principal supervisor :
PhD - Université de Montréal
Co-supervisor :
Collaborating Alumni - Université de Montréal
Master's Research - Université de Montréal
Co-supervisor :
PhD - Université de Montréal
PhD - Université de Montréal
PhD - Université de Montréal
PhD - McGill University
Principal supervisor :
Research Intern - Université de Montréal
Professional Master's - Université de Montréal
PhD - Université de Montréal
Principal supervisor :
Research Intern - Université de Montréal
Collaborating researcher - Politecnico di Milano
Postdoctorate - Université de Montréal
Principal supervisor :
Master's Research - Université de Montréal
Master's Research - Université de Montréal
Co-supervisor :
Master's Research - Université de Montréal
Collaborating researcher - Université de Montréal
PhD - Université de Montréal
Master's Research - Université de Montréal
Master's Research - Université de Montréal
PhD - Université de Montréal
Co-supervisor :
PhD - Concordia University
Principal supervisor :
Postdoctorate - Université de Montréal
Principal supervisor :

Publications

Unsupervised Concept Discovery Mitigates Spurious Correlations
Md Rifat Arefin
Yan Zhang
Aristide Baratin
Francesco Locatello
Dianbo Liu
Kenji Kawaguchi
Deep Generative Sampling in the Dual Divergence Space: A Data-efficient&Interpretative Approach for Generative AI
Sahil Garg
Anderson Schneider
Anant Raj
Kashif Rasul
Yuriy Nevmyvaka
S. Gopal
Amit Dhurandhar
Guillermo A. Cecchi
Building on the remarkable achievements in generative sampling of natural images, we propose an innovative challenge, potentially overly amb… (see more)itious, which involves generating samples of entire multivariate time series that resemble images. However, the statistical challenge lies in the small sample size, sometimes consisting of a few hundred subjects. This issue is especially problematic for deep generative models that follow the conventional approach of generating samples from a canonical distribution and then decoding or denoising them to match the true data distribution. In contrast, our method is grounded in information theory and aims to implicitly characterize the distribution of images, particularly the (global and local) dependency structure between pixels. We achieve this by empirically estimating its KL-divergence in the dual form with respect to the respective marginal distribution. This enables us to perform generative sampling directly in the optimized 1-D dual divergence space. Specifically, in the dual space, training samples representing the data distribution are embedded in the form of various clusters between two end points. In theory, any sample embedded between those two end points is in-distribution w.r.t. the data distribution. Our key idea for generating novel samples of images is to interpolate between the clusters via a walk as per gradients of the dual function w.r.t. the data dimensions. In addition to the data efficiency gained from direct sampling, we propose an algorithm that offers a significant reduction in sample complexity for estimating the divergence of the data distribution with respect to the marginal distribution. We provide strong theoretical guarantees along with an extensive empirical evaluation using many real-world datasets from diverse domains, establishing the superiority of our approach w.r.t. state-of-the-art deep learning methods.
Simple and Scalable Strategies to Continually Pre-train Large Language Models
Adam Ibrahim
Benjamin Thérien
Kshitij Gupta
Mats Leon Richter
Quentin Anthony
Timothee LESORT
Large language models (LLMs) are routinely pre-trained on billions of tokens, only to start the process over again once new data becomes ava… (see more)ilable. A much more efficient solution is to continually pre-train these models, saving significant compute compared to re-training. However, the distribution shift induced by new data typically results in degraded performance on previous data or poor adaptation to the new data. In this work, we show that a simple and scalable combination of learning rate (LR) re-warming, LR re-decaying, and replay of previous data is sufficient to match the performance of fully re-training from scratch on all available data, as measured by the final loss and the average score on several language model (LM) evaluation benchmarks. Specifically, we show this for a weak but realistic distribution shift between two commonly used LLM pre-training datasets (English
Predicting Grokking Long Before it Happens: A look into the loss landscape of models which grok
Tikeng Notsawo Pascal Junior
Pascal Notsawo
Hattie Zhou
Mohammad Pezeshki
Effective Latent Differential Equation Models via Attention and Multiple Shooting
Germán Abrevaya
Mahta Ramezanian-Panahi
Jean-Christophe Gagnon-Audet
Pablo Polosecki
Silvina Ponce Dawson
Guillermo Cecchi
Amplifying Pathological Detection in EEG Signaling Pathways through Cross-Dataset Transfer Learning
Mohammad-Javad Darvishi-Bayazi
Mohammad S. Ghaemi
Timothee LESORT
Md Rifat Arefin
Jocelyn Faubert
Towards Machines that Trust: AI Agents Learn to Trust in the Trust Game
Ardavan S. Nobandegani
Thomas Shultz
Widely considered a cornerstone of human morality, trust shapes many aspects of human social interactions. In this work, we present a theore… (see more)tical analysis of the
Challenging Common Assumptions about Catastrophic Forgetting and Knowledge Accumulation
Timothee LESORT
Oleksiy Ostapenko
Pau Rodriguez
Diganta Misra
Md Rifat Arefin
Lag-Llama: Towards Foundation Models for Time Series Forecasting
Kashif Rasul
Arjun Ashok
Andrew Robert Williams
Arian Khorasani
George Adamopoulos
Rishika Bhagwatkar
Marin Biloš
Hena Ghonia
Nadhir Hassen
Anderson Schneider
Sahil Garg
Yuriy Nevmyvaka
Aiming to build foundation models for time-series forecasting and study their scaling behavior, we present here our work-in-progress on Lag-… (see more)Llama, a general-purpose univariate probabilistic time-series forecasting model trained on a large collection of time-series data. The model shows good zero-shot prediction capabilities on unseen "out-of-distribution" time-series datasets, outperforming supervised baselines. We use smoothly broken power-laws to fit and predict model scaling behavior. The open source code is made available at https://github.com/kashif/pytorch-transformer-ts.
Gradient Masked Averaging for Federated Learning
Irene Tenison
Sai Aravind Sreeramadas
Vaikkunth Mugunthan
Edouard Oyallon
Federated learning (FL) is an emerging paradigm that permits a large number of clients with heterogeneous data to coordinate learning of a u… (see more)nified global model without the need to share data amongst each other. A major challenge in federated learning is the heterogeneity of data across client, which can degrade the performance of standard FL algorithms. Standard FL algorithms involve averaging of model parameters or gradient updates to approximate the global model at the server. However, we argue that in heterogeneous settings, averaging can result in information loss and lead to poor generalization due to the bias induced by dominant client gradients. We hypothesize that to generalize better across non-i.i.d datasets, the algorithms should focus on learning the invariant mechanism that is constant while ignoring spurious mechanisms that differ across clients. Inspired from recent works in Out-of-Distribution generalization, we propose a gradient masked averaging approach for FL as an alternative to the standard averaging of client updates. This aggregation technique for client updates can be adapted as a drop-in replacement in most existing federated algorithms. We perform extensive experiments on multiple FL algorithms with in-distribution, real-world, feature-skewed out-of-distribution, and quantity imbalanced datasets and show that it provides consistent improvements, particularly in the case of heterogeneous clients.
Lag-Llama: Towards Foundation Models for Probabilistic Time Series Forecasting
Kashif Rasul
Arjun Ashok
Andrew Robert Williams
Arian Khorasani
George Adamopoulos
Rishika Bhagwatkar
Marin Bilovs
Hena Ghonia
N. Hassen
Anderson Schneider
Sahil Garg
Yuriy Nevmyvaka
Over the past years, foundation models have caused a paradigm shift in machine learning due to their unprecedented capabilities for zero-sho… (see more)t and few-shot generalization. However, despite the success of foundation models in modalities such as natural language processing and computer vision, the development of foundation models for time series forecasting has lagged behind. We present Lag-Llama, a general-purpose foundation model for univariate probabilistic time series forecasting based on a decoder-only transformer architecture that uses lags as covariates. Lag-Llama is pretrained on a large corpus of diverse time series data from several domains, and demonstrates strong zero-shot generalization capabilities compared to a wide range of forecasting models on downstream datasets across domains. Moreover, when fine-tuned on relatively small fractions of such previously unseen datasets, Lag-Llama achieves state-of-the-art performance, outperforming prior deep learning approaches, emerging as the best general-purpose model on average. Lag-Llama serves as a strong contender to the current state-of-art in time series forecasting and paves the way for future advancements in foundation models tailored to time series data.
Lag-Llama: Towards Foundation Models for Probabilistic Time Series Forecasting
Kashif Rasul
Arjun Ashok
Andrew Robert Williams
Arian Khorasani
George Adamopoulos
Rishika Bhagwatkar
Marin Bilovs
Hena Ghonia
Nadhir Hassen
Anderson Schneider
Sahil Garg
Yuriy Nevmyvaka