Portrait of Irina Rish

Irina Rish

Core Academic Member
Canada CIFAR AI Chair
Full Professor, Université de Montréal, Department of Computer Science and Operations Research Department
Research Topics
Computational Neuroscience
Deep Learning
Generative Models
Multimodal Learning
Natural Language Processing
Online Learning
Reinforcement Learning

Biography

Irina Rish is a full professor at the Université de Montréal (UdeM), where she leads the Autonomous AI Lab, and a core academic member of Mila – Quebec Artificial Intelligence Institute.

In addition to holding a Canada Excellence Research Chair (CERC) and a CIFAR Chair, she leads the U.S. Department of Energy’s INCITE project on Scalable Foundation Models on Summit & Frontier supercomputers at the Oak Ridge Leadership Computing Facility. She co-founded and serves as CSO of Nolano.ai.

Rish’s current research interests include neural scaling laws and emergent behaviors (capabilities and alignment) in foundation models, as well as continual learning, out-of-distribution generalization and robustness.

Before joining UdeM in 2019, she was a research scientist at the IBM T.J. Watson Research Center, where she worked on various projects at the intersection of neuroscience and AI, and led the Neuro-AI challenge. She was awarded the IBM Eminence & Excellence Award and IBM Outstanding Innovation Award (2018), IBM Outstanding Technical Achievement Award (2017) and IBM Research Accomplishment Award (2009).

She holds 64 patents and has published 120 research papers, several book chapters, three edited books and a monograph on sparse modeling.

Current Students

Independent visiting researcher - Université de Montréal
Co-supervisor :
Research Intern
PhD - Université de Montréal
Co-supervisor :
PhD - Université de Montréal
Master's Research - Université de Montréal
Co-supervisor :
Master's Research - Université de Montréal
Collaborating researcher - Université de Montréal
PhD - McGill University
Principal supervisor :
PhD - Université de Montréal
PhD - Université de Montréal
Co-supervisor :
PhD - Université de Montréal
Collaborating Alumni - Université de Montréal
Research Intern - Université de Montréal
Collaborating researcher
PhD - Université de Montréal
Principal supervisor :
PhD - Université de Montréal
Principal supervisor :
PhD - Université de Montréal
Collaborating researcher
Master's Research - Université de Montréal
Professional Master's - Université de Montréal
PhD - Concordia University
Principal supervisor :
Master's Research - Université de Montréal
Postdoctorate
Principal supervisor :
Master's Research - Université de Montréal
Collaborating researcher
Independent visiting researcher
PhD - Université de Montréal
Master's Research - Université de Montréal
Master's Research - Université de Montréal
Master's Research - Université de Montréal
PhD - Université de Montréal
Co-supervisor :
Collaborating researcher
PhD - McGill University
Principal supervisor :
Master's Research - Université de Montréal
Co-supervisor :
Collaborating researcher - Université de Montréal
PhD - Université de Montréal
PhD - McGill University
Principal supervisor :
Postdoctorate - Université de Montréal
Principal supervisor :
PhD - Concordia University
Principal supervisor :
PhD - Université de Montréal
PhD - Université de Montréal
Co-supervisor :
PhD - Université de Montréal
Co-supervisor :
Master's Research - Université de Montréal
Master's Research - Université de Montréal
PhD - Université de Montréal
Research Intern - Technical University of Munich
Research Intern - Université de Montréal

Publications

Simple and Scalable Strategies to Continually Pre-train Large Language Models
Adam Ibrahim
Benjamin Thérien
Kshitij Gupta
Mats Leon Richter
Quentin Gregory Anthony
Timothee LESORT
Unsupervised Concept Discovery Mitigates Spurious Correlations
Md Rifat Arefin
Yan Zhang
Aristide Baratin
Francesco Locatello
Dianbo Liu
Kenji Kawaguchi
LORD: Low Rank Decomposition Of Monolingual Code LLMs For One-Shot Compression
Ayush Kaushal
Tejas Vaidhya
Low Rank Decomposition of matrix - splitting a large matrix into a product of two smaller matrix offers a means for compression that reduces… (see more) the parameters of a model without sparsification, and hence delivering more speedup on modern hardware. Moreover, unlike quantization, the compressed linear layers remain fully differentiable and all the parameters trainable, while being able to leverage the existing highly efficient kernels over floating point matrices. We study the potential to compress Large Language Models (LLMs) for monolingual Code generation via Low Rank Decomposition (LoRD) and observe that ranks for the linear layers in these models can be reduced by upto 39.58% with less than 1% increase in perplexity. We then use Low Rank Decomposition (LoRD) to compress StarCoder 16B to 13.2B parameter with no drop and to 12.3B with minimal drop in HumanEval Pass@1 score, in less than 10 minutes on a single A100. The compressed models speeds up inference by up to 22.35% with just a single line of change in code over huggingface's implementation with pytorch backend. Low Rank Decomposition (LoRD) models remain compatible with state of the art near-lossless quantization method such as SpQR, which allows leveraging further compression gains of quantization. Lastly, QLoRA over Low Rank Decomposition (LoRD) model further reduces memory requirements by as much as 21.2% over vanilla QLoRA while offering similar gains from parameter efficient fine tuning. Our work shows Low Rank Decomposition (LoRD) as a promising new paradigm for LLM compression.
The Effect of Data Corruption on Multimodal Long Form Responses
Daniel Z Kaplan
Alexis Roger
Mohamed Osman
Despite significant progress, Vision-Language Models (VLMs) still struggle with hallucinations, especially in long-form responses. Existing … (see more)strategies have had limited successes in specific cases, and long-form generation remains problematic. In this work we attempt to establish the link between the data used to train the model and the hallucinations in the model's output. To this end, we examine hallucinations through data corruption. We develop a method to corrupt training data and then train models with this data to see the effect on performance. We will show that corrupting only a small portion of the long-form training data significantly impairs the performance of the model on long-form tasks, while leaving simpler tasks like visual question-answering and multiple choice relatively intact. All training code and models are released for reproducibility and future research.
TriLM vs FloatLM: Ternary LLMs are more Performant than Quantized FP16 LLMs
Ayush Kaushal
Tejas Vaidhya
Tejas Pandey
Aaryan Bhagat
Ternary LLMs offer significantly better performance for their size (measured in bits) than the models trained and deployed in FP16/BF16. Giv… (see more)en the widespread usage of quantization before deployment and advancements in Post Training Quantization of LLMs, a pivotal question arises: do ternary LLMs indeed provide any discernible benefits? To address this, we first build an open family of pre-trained ternary Large Language Models (TriLM). Additionally, we include their counterparts pre-trained in FP16 (FloatLM) and quantized versions of FloatLM (QuantLM) with parameters across almost two orders of magnitude - from 99M to 3.9B parameters. We demonstrate that TriLMs with 3B+ parameters start to offer competitive performance compared to FloatLMs with the same parameter count, while providing significantly better performance for their size. Specifically, TriLM 3.9B, with less bits than FloatLM 830M, ranks between FloatLM 2.4B and FloatLM 3.9B when averaged across 6 popular commonsense and reasoning benchmarks. TriLMs also outperform quantized models, with TriLM 3.9B surpassing the larger QuantLM-3bit 3.9B. Furthermore, across knowledge-based benchmarks, TriLM maintains a superiority for its size, but lags for its parameter count. TriLM 3.9B falls halfway between FloatLM 1.5B and 2.4B, close to QuantLM-4bit 2.4B. To advance research on Ternary LMs, we open source over 500+ checkpoints across the model families.
VFA: Vision Frequency Analysis of Foundation Models and Human
Mohammad Javad Darvishi Bayazi
Md Rifat Arefin
Jocelyn Faubert
Machine learning models often struggle with distribution shifts in real-world scenarios, whereas humans exhibit robust adaptation. Models th… (see more)at better align with human perception may achieve higher out-of-distribution generalization. In this study, we investigate how various characteristics of large-scale computer vision models influence their alignment with human capabilities and robustness. Our findings indicate that increasing model and data size, along with incorporating rich semantic information and multiple modalities, significantly enhances models' alignment with human perception and their overall robustness. Our empirical analysis demonstrates a strong correlation between out-of-distribution accuracy and human alignment.
Adversarial Training with Synthesized Data: A Path to Robust and Generalizable Neural Networks
Reza Bayat
Adversarial Training (AT) is a well-known framework designed to mitigate adversarial vulnerabilities in neural networks. Recent research ind… (see more)icates that incorporating adversarial examples (AEs) in training can enhance models' generalization capabilities. To understand the impact of AEs on learning dynamics, we study AT through the lens of sample difficulty methodologies. Our findings show that AT leads to more stable learning dynamics compared to Natural Training (NT), resulting in gradual performance improvements and less overconfident predictions. This suggests that AT steers training away from learning easy, perturbable spurious features toward more resilient and generalizable ones. However, a trade-off exists between adversarial robustness and generalization gains, due to robust overfitting, limiting practical deployment. To address this, we propose using synthesized data to bridge this gap. Our results demonstrate that AT benefits significantly from synthesized data, whereas NT does not, enhancing generalization without compromising robustness and offering new avenues for developing robust and generalizable models.
Towards Adversarially Robust Vision-Language Models: Insights from Design Choices and Prompt Formatting Techniques
Rishika Bhagwatkar
Shravan Nayak
Reza Bayat
Alexis Roger
Daniel Z Kaplan
Vision-Language Models (VLMs) have witnessed a surge in both research and real-world applications. However, as they becoming increasingly pr… (see more)evalent, ensuring their robustness against adversarial attacks is paramount. This work systematically investigates the impact of model design choices on the adversarial robustness of VLMs against image-based attacks. Additionally, we introduce novel, cost-effective approaches to enhance robustness through prompt formatting. By rephrasing questions and suggesting potential adversarial perturbations, we demonstrate substantial improvements in model robustness against strong image-based attacks such as Auto-PGD. Our findings provide important guidelines for developing more robust VLMs, particularly for deployment in safety-critical environments.
Interpretability in Action: Exploratory Analysis of VPT, a Minecraft Agent
Karolis Jucys
George Adamopoulos
Mehrab Hamidi
Stephanie Milani
Mohammad Reza Samsami
Artem Zholus
Sonia Joseph
Özgür Şimşek
Understanding the mechanisms behind decisions taken by large foundation models in sequential tasks is critical to ensuring that such systems… (see more) operate transparently and safely. However, interpretability methods have not yet been applied extensively to large-scale agents based on reinforcement learning. In this work, we perform exploratory analysis on the Video PreTraining (VPT) Minecraft playing agent, one of the largest open-source vision-based agents. We try to illuminate its reasoning mechanisms by applying various interpretability techniques. First, we analyze the attention mechanism while the agent solves its training task --- crafting a diamond pickaxe. The agent seems to pay attention to the 4 last frames and several key-frames further back. This provides clues as to how it maintains coherence in the task that takes 3-10 minutes, despite the agent's short memory span of only six seconds. Second, we perform various interventions, which help us uncover a worrying case of goal misgeneralization: VPT mistakenly identifies a villager wearing brown clothes as a tree trunk and punches it to death, when positioned stationary under green tree leaves. We demonstrate similar misbehavior in a related agent (STEVE-1), which motivates the use of VPT as a model organism for large-scale vision-based agent interpretability.
Handling Delay in Reinforcement Learning Caused by Parallel Computations of Neurons
Ivan Anokhin
Rishav
Stephen Chung
Biological neural networks operate in parallel, a feature that sets them apart from artificial neural networks and can significantly enhance… (see more) inference speed. However, this parallelism introduces challenges: when each neuron operates asynchronously with a fixed execution time, an
Realtime Reinforcement Learning: Towards Rapid Asynchronous Deployment of Large Models
Matthew D Riemer
Gopeshh Subbaraj
Realtime environments change even as agents perform action inference and learning, thus requiring high interaction frequencies to effectivel… (see more)y minimize long-term regret. However, recent advances in machine learning involve larger neural networks with longer inference times, raising questions about their applicability in realtime systems where reaction time is crucial. We present an analysis of lower bounds on regret in realtime environments to show that minimizing long-term regret is generally impossible within the typical sequential interaction and learning paradigm, but often becomes possible when sufficient asynchronous compute is available. We propose novel algorithms for staggering asynchronous inference processes to ensure that actions are taken at consistent time intervals, and demonstrate that use of models with high action inference times is only constrained by the environment's effective stochasticity over the inference horizon, and not by action frequency. Our analysis shows that the number of inference processes needed scales linearly with increasing inference times while enabling use of models that are multiple orders of magnitude larger than existing approaches when learning from a realtime simulation of Game Boy games such as Pokemon and Tetris.
Scalable Approaches for a Theory of Many Minds
Maximilian Puelma Touzel
Amin Memarian
Matthew D Riemer
Andrei Mircea
Andrew Robert Williams
Elin Ahlstrand
Lucas Lehnert
Rupali Bhati
A major challenge as we move towards building agents for real-world problems, which could involve a massive number of human and/or machine a… (see more)gents, is that we must learn to reason about the behavior of these many other agents. In this paper, we consider the problem of scaling a predictive Theory of Mind (ToM) model to a very large number of interacting agents with a fixed computational budget. Motivated by the limited diversity of agent types, existing approaches to scalable TOM learn versatile single-agent representations for quickly adapting to new agents encountered sequentially. We consider the more general setting that many agents are observed in parallel and formulate the corresponding Theory of Many Minds (ToMM) problem of estimating the joint policy. We frame the scaling behavior of solutions in terms of parameter sharing schemes and in particular propose two parameter-free architectural features that endow models with the ability to exploit action correlations: encoding a multi-agent context, and decoding through an abstracted joint action space. The increased predictive capabilities that have come with foundation models have made it easier to imagine the possibility of using these models to make simulations that imitate the behavior of many agents within complex real-world systems. Being able to perform these simulations in a general-purpose way would not only help make more capable agents, it also would be a very useful capability for applications in social science, political science, and economics.