Irina Rish

Biography

Irina Rish is a full professor at the Université de Montréal (UdeM), where she leads the Autonomous AI Lab, and a core academic member of Mila – Quebec Artificial Intelligence Institute.

In addition to holding a Canada Excellence Research Chair (CERC) and a CIFAR Chair, she leads the U.S. Department of Energy’s INCITE project on Scalable Foundation Models on Summit & Frontier supercomputers at the Oak Ridge Leadership Computing Facility. She co-founded and serves as CSO of Nolano.ai.

Rish’s current research interests include neural scaling laws and emergent behaviors (capabilities and alignment) in foundation models, as well as continual learning, out-of-distribution generalization and robustness.

Before joining UdeM in 2019, she was a research scientist at the IBM T.J. Watson Research Center, where she worked on various projects at the intersection of neuroscience and AI, and led the Neuro-AI challenge. She was awarded the IBM Eminence & Excellence Award and IBM Outstanding Innovation Award (2018), IBM Outstanding Technical Achievement Award (2017) and IBM Research Accomplishment Award (2009).

She holds 64 patents and has published 120 research papers, several book chapters, three edited books and a monograph on sparse modeling.

Current Students

George Adamopoulos

Research Intern

Ivan Anokhin

PhD - Université de Montréal

Co-supervisor :

Samira Ebrahimi Kahou

Rifat Arefin

PhD - Université de Montréal

Arjun Ashok

PhD - Université de Montréal

Co-supervisor :

Master's Research - Université de Montréal

PhD - McGill University

Principal supervisor :

Blake Richards

Mohammad Javad Darvishi Bayazi

Amin Darabi

PhD - Université de Montréal

PhD - Université de Montréal

PhD - Université de Montréal

Co-supervisor :

Karim Jerbi

Wagner Drew

Master's Research - Concordia University

Principal supervisor :

Mirco Ravanelli

Mojtaba Faramarzi

PhD - Université de Montréal

Nadhir Hassen

Collaborating Alumni - Université de Montréal

Master's Research

Collaborating Alumni - Université de Montréal

Principal supervisor :

Ioannis Mitliagkas

Nizar Islah

PhD - Université de Montréal

Principal supervisor :

Eilif Benjamin Muller

PhD - Université de Montréal

Collaborating researcher

Zafir Khalid

Master's Research - Concordia University

Principal supervisor :

Master's Research - Université de Montréal

Neeraj Kumar

Collaborating Alumni - Université de Montréal

Gwen Legate

PhD - Concordia University

Principal supervisor :

Eugene Belilovsky

David Lemay

Master's Research - Université de Montréal

Jonathan Lim

Collaborating researcher

amin.mansouri@mila.quebec

Baihan Lin

Independent visiting researcher - Mt. Sinai

Master's Research - Université de Montréal

Collaborating researcher

Andrei Mircea

PhD - Université de Montréal

Master's Research - Université de Montréal

Diganta Misra

Master's Research - Université de Montréal

Timothy Nest

PhD - Université de Montréal

Co-supervisor :

Eilif Benjamin Muller

Mohammad Pezeshki

Collaborating researcher

Co-supervisor :

PhD - McGill University

Principal supervisor :

Pouya Bashivan

Mahta Ramezanian

Master's Research - Université de Montréal

Co-supervisor :

Guillaume Dumas

Roland Riachi

Collaborating researcher - Université de Montréal

Matthew Riemer

PhD - Université de Montréal

Alexis Roger

PhD - McGill University

Principal supervisor :

Blake Richards

Vaibhav Singh

PhD - Concordia University

Principal supervisor :

Eugene Belilovsky

Gopeshh Subbaraj

PhD - Université de Montréal

PhD - Université de Montréal

Co-supervisor :

Master's Research - Université de Montréal

PhD - Université de Montréal

Co-supervisor :

Master's Research - Université de Montréal

Publications

Lag-Llama: Towards Foundation Models for Time Series Forecasting

Kashif Rasul

Arjun Ashok

Andrew Robert Williams

Arian Khorasani

George Adamopoulos

Rishika Bhagwatkar

Marin Biloš

Hena Ghonia

N. Hassen

Anderson Schneider

Sahil Garg

Alexandre Drouin

Nicolas Chapados

Yuriy Nevmyvaka

Aiming to build foundation models for time-series forecasting and study their scaling behavior, we present here our work-in-progress on Lag-… (see more)Llama , a general-purpose univariate probabilistic time-series forecasting model trained on a large collection of time-series data. The model shows good zero-shot prediction capabilities on unseen “out-of-distribution” time-series datasets, outperforming supervised baselines. We use smoothly broken power-laws [7] to fit and predict model scaling behavior. The open source code is made available at https://github

2023-01-01

arXiv.org (preprint)

Towards Continual Reinforcement Learning: A Review and Perspectives

Khimya Khetarpal

Matthew D Riemer

Doina Precup

2022-12-22

Journal of Artificial Intelligence Research (published)

Continual Learning with Foundation Models: An Empirical Study of Latent Replay

Oleksiy Ostapenko

Timothee LESORT

Pau Rodriguez

Md Rifat Arefin

Arthur Douillard

Laurent Charlin

Rapid development of large-scale pre-training has resulted in foundation models that can act as effective feature extractors on a variety of… (see more) downstream tasks and domains. Motivated by this, we study the efficacy of pre-trained vision models as a foundation for downstream continual learning (CL) scenarios. Our goal is twofold. First, we want to understand the compute-accuracy trade-off between CL in the raw-data space and in the latent space of pre-trained encoders. Second, we investigate how the characteristics of the encoder, the pre-training algorithm and data, as well as of the resulting latent space affect CL performance. For this, we compare the efficacy of various pre-trained models in large-scale benchmarking scenarios with a vanilla replay setting applied in the latent and in the raw-data space. Notably, this study shows how transfer, forgetting, task similarity and learning are dependent on the input data characteristics and not necessarily on the CL algorithms. First, we show that under some circumstances reasonable CL performance can readily be achieved with a non-parametric classifier at negligible compute. We then show how models pre-trained on broader data result in better performance for various replay sizes. We explain this with representational similarity and transfer properties of these representations. Finally, we show the effectiveness of self-supervised pre-training for downstream domains that are out-of-distribution as compared to the pre-training domain. We point out and validate several research directions that can further increase the efficacy of latent CL including representation ensembling. The diverse set of datasets used in this study can serve as a compute-efficient playground for further CL research. We will publish the code.

2022-11-28

Proceedings of The 1st Conference on Lifelong Learning Agents (published)

proceedings.mlr.press

APP: Anytime Progressive Pruning

Diganta Misra

Bharat Runwal

Tianlong Chen

Zhangyang Wang

With the latest advances in deep learning, several methods have been investigated for optimal learning settings in scenarios where the data … (see more)stream is continuous over time. However, training sparse networks in such settings has often been overlooked. In this paper, we explore the problem of training a neural network with a target sparsity in a particular case of online learning: the anytime learning at macroscale paradigm (ALMA). We propose a novel way of progressive pruning, referred to as \textit{Anytime Progressive Pruning} (APP); the proposed approach significantly outperforms the baseline dense and Anytime OSP models across multiple architectures and datasets under short, moderate, and long-sequence training. Our method, for example, shows an improvement in accuracy of

2022-11-17

ACML.org/2022/Workshop/CLL (published)

openreview.net

Knowledge Distillation for Federated Learning: a Practical Guide

Alessio Mora

Irene Tenison

Paolo Bellavista

Federated Learning (FL) enables the training of Deep Learning models without centrally collecting possibly sensitive raw data. This paves th… (see more)e way for stronger privacy guarantees when building predictive models. The most used algorithms for FL are parameter-averaging based schemes (e.g., Federated Averaging) that, however, have well known limits: (i) Clients must implement the same model architecture; (ii) Transmitting model weights and model updates implies high communication cost, which scales up with the number of model parameters; (iii) In presence of non-IID data distributions, parameter-averaging aggregation schemes perform poorly due to client model drifts. Federated adaptations of regular Knowledge Distillation (KD) can solve and/or mitigate the weaknesses of parameter-averaging FL algorithms while possibly introducing other trade-offs. In this article, we provide a review of KD-based algorithms tailored for specific FL issues.

2022-11-09

ArXiv (preprint)

Aligning MAGMA by Few-Shot Learning and Finetuning

Jean-Charles Layoun

Alexis Roger

2022-10-18

ArXiv (preprint)

Generative Models of Brain Dynamics

Mahta Ramezanian-Panahi

Germán Abrevaya

Jean-Christophe Gagnon-Audet

Vikram Voleti

Guillaume Dumas

2022-07-15

Frontiers in Artificial Intelligence (published)

Challenging Common Assumptions about Catastrophic Forgetting

Timothee LESORT

Oleksiy Ostapenko

Pau Rodriguez

Md Rifat Arefin

Diganta Misra

Laurent Charlin

Building learning agents that can progressively learn and accumulate knowledge is the core goal of the continual learning (CL) research fiel… (see more)d. Unfortunately, training a model on new data usually compromises the performance on past data. In the CL literature, this effect is referred to as catastrophic forgetting (CF). CF has been largely studied, and a plethora of methods have been proposed to address it on short sequences of non-overlapping tasks. In such setups, CF always leads to a quick and significant drop in performance in past tasks. Nevertheless, despite CF, recent work showed that SGD training on linear models accumulates knowledge in a CL regression setup. This phenomenon becomes especially visible when tasks reoccur. We might then wonder if DNNs trained with SGD or any standard gradient-based optimization accumulate knowledge in such a way. Such phenomena would have interesting consequences for applying DNNs to real continual scenarios. Indeed, standard gradient-based optimization methods are significantly less computationally expensive than existing CL algorithms. In this paper, we study the progressive knowledge accumulation (KA) in DNNs trained with gradient-based algorithms in long sequences of tasks with data re-occurrence. We propose a new framework, SCoLe (Scaling Continual Learning), to investigate KA and discover that catastrophic forgetting has a limited effect on DNNs trained with SGD. When trained on long sequences with data sparsely re-occurring, the overall accuracy improves, which might be counter-intuitive given the CF phenomenon. We empirically investigate KA in DNNs under various data occurrence frequencies and propose simple and scalable strategies to increase knowledge accumulation in DNNs.

2022-07-10

ArXiv (preprint)

openreview.net

Towards Scaling Difference Target Propagation by Learning Backprop Targets

Maxence Ernoult

Fabrice Normandin

Abhinav Moudgil

Sean Spinney

The development of biologically-plausible learning algorithms is important for understanding learning in the brain, but most of them fail to… (see more) scale-up to real-world tasks, limiting their potential as explanations for learning by real brains. As such, it is important to explore learning algorithms that come with strong theoretical guarantees and can match the performance of backpropagation (BP) on complex tasks. One such algorithm is Difference Target Propagation (DTP), a biologically-plausible learning algorithm whose close relation with Gauss-Newton (GN) optimization has been recently established. However, the conditions under which this connection rigorously holds preclude layer-wise training of the feedback pathway synaptic weights (which is more biologically plausible). Moreover, good alignment between DTP weight updates and loss gradients is only loosely guaranteed and under very specific conditions for the architecture being trained. In this paper, we propose a novel feedback weight training scheme that ensures both that DTP approximates BP and that layer-wise feedback weight training can be restored without sacrificing any theoretical guarantees. Our theory is corroborated by experimental results and we report the best performance ever achieved by DTP on CIFAR-10 and ImageNet 32

2022-06-28

Proceedings of the 39th International Conference on Machine Learning (published)

proceedings.mlr.press

Parametric Scattering Networks

Shanel Gauthier

Benjamin Thérien

Laurent Alséne-Racicot

Muawiz Chaudhary

Eugene Belilovsky

Michael Eickenberg

Guy Wolf

The wavelet scattering transform creates geometric in-variants and deformation stability. In multiple signal do-mains, it has been shown to … (see more)yield more discriminative rep-resentations compared to other non-learned representations and to outperform learned representations in certain tasks, particularly on limited labeled data and highly structured signals. The wavelet filters used in the scattering trans-form are typically selected to create a tight frame via a pa-rameterized mother wavelet. In this work, we investigate whether this standard wavelet filterbank construction is op-timal. Focusing on Morlet wavelets, we propose to learn the scales, orientations, and aspect ratios of the filters to produce problem-specific parameterizations of the scattering transform. We show that our learned versions of the scattering transform yield significant performance gains in small-sample classification settings over the standard scat-tering transform. Moreover, our empirical results suggest that traditional filterbank constructions may not always be necessary for scattering transforms to extract effective rep-resentations.

2022-06-18

2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (published)

A Remedy For Distributional Shifts Through Expected Domain Translation

Jean-Christophe Gagnon-Audet

Soroosh Shahtalebi

Frank Rudzicz

Machine learning models often fail to generalize to unseen domains due to the distributional shifts. A family of such shifts, “correlation… (see more) shifts,” is caused by spurious correlations in the data. It is studied under the overarching topic of “domain generalization.” In this work, we employ multi-modal translation networks to tackle the correlation shifts that appear when data is sampled out-of-distribution. Learning a generative model from training domains enables us to translate each training sample under the special characteristics of other possible domains. We show that by training a predictor solely on the generated samples, the spurious correlations in training domains average out, and the invariant features corresponding to true correlations emerge. Our proposed technique, Expected Domain Translation (EDT), is benchmarked on the Colored MNIST dataset and drastically improves the state-of-the-art classification accuracy by 38% with train-domain validation model selection.

2022-05-23

IEEE International Conference on Acoustics, Speech, and Signal Processing (published)

Summarizing Societies: Agent Abstraction in Multi-Agent Reinforcement Learning

Amin Memarian

Maximilian Puelma Touzel

Matthew D Riemer

Rupali Bhati