Guillaume Rabusseau

Biographie

Depuis septembre 2018, je suis professeur adjoint à Mila – Institut québécois d’intelligence artificielle et au Département d'informatique et de recherche opérationnelle (DIRO) de l'Université de Montréal (UdeM). Je suis titulaire d’une chaire de recherche en IA Canada-CIFAR depuis mars 2019. Avant de me joindre à l’UdeM, j’ai été chercheur postdoctoral au laboratoire de raisonnement et d'apprentissage de l'Université McGill, où j'ai travaillé avec Prakash Panangaden, Joelle Pineau et Doina Precup.

J'ai obtenu mon doctorat en 2016 à l’Université d’Aix-Marseille (AMU), où j'ai travaillé dans l'équipe Qarma (apprentissage automatique et multimédia), sous la supervision de François Denis et Hachem Kadri. Auparavant, j'ai obtenu une maîtrise en informatique fondamentale de l'AMU et une licence en informatique de la même université en formation à distance.

Je m'intéresse aux méthodes de tenseurs pour l'apprentissage automatique et à la conception d'algorithmes d'apprentissage pour les données structurées par l’utilisation de l'algèbre linéaire et multilinéaire (par exemple, les méthodes spectrales).

Étudiants actuels

Jun Dai

Postdoctorat - UdeM

Alireza Dizaji

Collaborateur·rice alumni - UdeM

Doctorat - UdeM

Doctorat - UdeM

Co-superviseur⋅e :

Collaborateur·rice alumni - McGill

Superviseur⋅e principal⋅e :

Doctorat - UdeM

Sitao Luan

Postdoctorat - McGill

Co-superviseur⋅e :

Github

Larry Ma

Stagiaire de recherche - Polytechnique

Ewan Murphy

Collaborateur·rice de recherche - INRIA

Github

Charlotte Noxon

Maîtrise recherche - UdeM

Michael Rizvi-Martel

Doctorat - UdeM

Co-superviseur⋅e :

Pascal Tikeng Notsawo

Doctorat - UdeM

Co-superviseur⋅e :

Collaborateur·rice de recherche - UdeM

Co-superviseur⋅e :

Site web

Beheshteh Toloueirakhshan

Doctorat - UdeM

Site web

Publications

Tensor Cookbook: Mastering Tensors through Diagrams

Beheshteh T. Rakhshan

High-dimensional data arise naturally in many areas of science and engineering, including machine learning, signal processing, computational… (voir plus) physics, and statistics. Such data are often represented as tensors, multi-dimensional generalizations of matrices. While tensors provide a natural representation for multi-modal structure, their direct manipulation quickly becomes challenging as the order grows: the number of parameters increases exponentially, and algebraic expressions involving many indices become difficult to interpret and implement. Tensor networks (TNs) provide an effective framework for addressing these challenges. Originally introduced by Penrose and developed extensively in quantum physics, the graphical language of tensor networks encodes contractions as edges in a graph, reducing notational overhead and revealing structural properties obscured by index notation. Despite the central role of high-dimensional tensors in modern machine learning and numerical analysis, tensor network diagrams remain underutilized outside quantum computing, partly due to the lack of a self-contained mathematical reference accessible to a broad technical audience. This manuscript provides a self-contained guide to tensor networks and their use in tensor algebra. We present the main operations on tensors, contractions, products, and reshaping through, graphical notation, and show how classical tensor decompositions and related computations are naturally expressed in this framework. We also illustrate how tensor networks simplify the derivation of gradients and the manipulation of high-dimensional probability distributions. Throughout, we show that the diagrammatic approach yields genuinely shorter and more transparent proofs of classical identities, rank bounds, and gradient formulas that would otherwise require laborious index manipulation.

2026-05-14

arXiv (prépublication)

Orth-Dion: Eliminating Geometric Mismatch in Distributed Low-Rank Spectral Optimization

Tatsuhiro Nakamori

Laura Gomezjurado Gonzalez

Ganesh Talluri

Ansh Tiwari

Hideyuki Kawashima

Ioannis Mitliagkas

Hiroki Naganuma

Low-rank gradient compression reduces communication in distributed training by representing updates with rank-…

2026-05-06

arXiv (prépublication)

ControBench: An Interaction-Aware Benchmark for Controversial Discourse Analysis on Social Networks

Ta Thanh Thuy

Jiaqi Zhu

Xuan Liu

Lin Shang

Lihui Chen

Zheng Yilun

Sitao Luan

Understanding how people argue across ideological divides online is important for studying political polarization, misinformation, and conte… (voir plus)nt moderation. Existing datasets capture only part of this problem: some preserve text but ignore interaction structure, some model structure without rich semantics, and others represent conversations without stable user-level ideological identity. We introduce ControBench, a benchmark for controversial discourse analysis that combines heterogeneous social interaction graphs with rich textual semantics. Built from Reddit discussions on three topics, Trump, abortion, and religion, ControBench contains 7,370 users, 1,783 posts, and 26,525 interactions. The graph contains user and post nodes connected by semantically enriched edges; in particular, user-comment-user edges encode both a reply and the parent comment that it responds to, preserving local argumentative context. User labels are derived from self-declared Reddit flairs, providing a scalable proxy for ideological identity without manual annotation. The resulting datasets exhibit low or negative adjusted homophily (Trump: -0.77, Abortion: 0.06, Religion: 0.04), reflecting the cross-cutting structure of real-world debate. We evaluate graph neural networks, pretrained language models, and large language models on ControBench and observe distinct performance patterns across topics and model families, especially when ideological boundaries are ambiguous. These results position ControBench as a challenging and realistic benchmark for controversial discourse analysis.

2026-04-30

arXiv (prépublication)

The Illusion of Superposition? A Principled Analysis of Latent Thinking in Language Models

Michael Rizvi-Martel

Marius Mosbach

Latent reasoning via continuous chain-of-thoughts (Latent CoT) has emerged as a promising alternative to discrete CoT reasoning. Operating i… (voir plus)n continuous space increases expressivity and has been hypothesized to enable superposition: the ability to maintain multiple candidate solutions simultaneously within a single representation. Despite theoretical arguments, it remains unclear whether language models actually leverage superposition when reasoning using latent CoTs. We investigate this question across three regimes: a training-free regime that constructs latent thoughts as convex combinations of token embeddings, a fine-tuned regime where a base model is adapted to produce latent thoughts, and a from-scratch regime where a model is trained entirely with latent thoughts to solve a given task. Using Logit Lens and entity-level probing to analyze internal representations, we find that only models trained from scratch exhibit signs of using superposition. In the training-free and fine-tuned regimes, we find that the superposition either collapses or is not used at all, with models discovering shortcut solutions instead. We argue that this is due to two complementary phenomena: i) pretraining on natural language data biases models to commit to a token in the last layers ii) capacity has a huge effect on which solutions a model favors. Together, our results offer a unified explanation for when and why superposition arises in continuous chain-of-thought reasoning, and identify the conditions under which it collapses.

2026-04-06

arXiv (prépublication)

Model Merging via Data-Free Covariance Estimation

Marawan Gamal Abdel Hameed

Derek Tam

Pascal Jr Tikeng Notsawo

Colin Raffel

Model merging provides a way of cheaply combining individual models to produce a model that inherits each individual's capabilities. While s… (voir plus)ome merging methods can approach the performance of multitask training, they are often heuristically motivated and lack theoretical justification. A principled alternative is to pose model merging as a layer-wise optimization problem that directly minimizes interference between tasks. However, this formulation requires estimating per-layer covariance matrices from data, which may not be available when performing merging. In contrast, many of the heuristically-motivated methods do not require auxiliary data, making them practically advantageous. In this work, we revisit the interference minimization framework and show that, under certain conditions, covariance matrices can be estimated directly from difference matrices, eliminating the need for data while also reducing computational costs. We validate our approach across vision and language benchmarks on models ranging from 86M parameters to 7B parameters, outperforming previous data-free state-of-the-art merging methods

2026-03-31

arXiv (prépublication)

Grokking Finite-Dimensional Algebra

Pascal Jr Tikeng Notsawo

Guillaume Dumas

This paper investigates the grokking phenomenon, which refers to the sudden transition from a long memorization to generalization observed d… (voir plus)uring neural networks training, in the context of learning multiplication in finite-dimensional algebras (FDA). While prior work on grokking has focused mainly on group operations, we extend the analysis to more general algebraic structures, including non-associative, non-commutative, and non-unital algebras. We show that learning group operations is a special case of learning FDA, and that learning multiplication in FDA amounts to learning a bilinear product specified by the algebra's structure tensor. For algebras over the reals, we connect the learning problem to matrix factorization with an implicit low-rank bias, and for algebras over finite fields, we show that grokking emerges naturally as models must learn discrete representations of algebraic elements. This leads us to experimentally investigate the following core questions: (i) how do algebraic properties such as commutativity, associativity, and unitality influence both the emergence and timing of grokking, (ii) how structural properties of the structure tensor of the FDA, such as sparsity and rank, influence generalization, and (iii) to what extent generalization correlates with the model learning latent embeddings aligned with the algebra's representation. Our work provides a unified framework for grokking across algebraic structures and new insights into how mathematical structure governs neural network generalization dynamics.

2026-02-22

arXiv (prépublication)

KQ-SVD: Compressing the KV Cache with Provable Guarantees on Attention Fidelity

Damien Lesens

Beheshteh T. Rakhshan

The Key–Value (KV) cache is central to the efficiency of transformer-based large language models (LLMs), storing previously computed vecto… (voir plus)rs to accelerate inference. Yet, as sequence length and batch size grow, the cache becomes a major memory bottleneck. Prior compression methods typically apply low-rank decomposition to keys alone or attempt to jointly embed queries and keys, but both approaches neglect that attention fundamentally depends on their inner products. In this work, we prove that such strategies are sub-optimal for approximating the attention matrix. We introduce KQ-SVD, a simple and computationally efficient method that directly performs an optimal low-rank decomposition of the attention matrix via a closed-form solution. By targeting the true source of redundancy, KQ-SVD preserves attention outputs with higher fidelity under compression. Extensive evaluations on LLaMA and Mistral models demonstrate that our approach consistently delivers superior projection quality.

2026-02-02

Artificial Intelligence and Statistics (poster)

On the Role of Depth in the Expressivity of RNNs

Maude Lizaire

Michael Rizvi-Martel

Éric Dupuis

The benefits of depth in feedforward neural networks (FNNs) are well known: composing multiple layers of linear transformations with nonline… (voir plus)ar activations enables complex computations. While similar effects are expected in recurrent neural networks (RNNs), it remains unclear how depth interacts with recurrence to shape expressive power. Here, we formally show that depth increases RNNs’ memory capacity efficiently with respect to parameters, enhancing expressivity both by enabling more complex input transformations and improving the retention of past information. We extend our analysis to 2RNNs, a generalization of RNNs with multiplicative interactions between inputs and hidden states. Unlike RNNs, which remain linear without nonlinear activations, 2RNNs perform polynomial transformations whose maximal degree grows with depth. We further show that multiplicative interactions cannot, in general, be replaced by layerwise nonlinearities. Finally, we validate these insights empirically on synthetic and real-world tasks.

2026-02-02

International Conference on Artificial Intelligence and Statistics (spotlight)

Tractable Shapley Values and Interactions via Tensor Networks

Farzaneh Heidari

Chao Li

We show how to replace the …

2026-02-02

Artificial Intelligence and Statistics (poster)

TN-SHAP-G: Graph-Structured Tensor Network Surrogates for Shapley Values and Interactions

Farzaneh Heidari

Shapley values are a widely used tool for attributing importance and interactions among input variables in black-box models, but their compu… (voir plus)tation involves a function defined over an exponentially large space of subsets. We propose TN-SHAP-G, a framework that exploits structure in graph-structured inputs to compute Shapley values and higher-order interaction indices efficiently. Given a predictor and a fixed masking scheme, TN-SHAP-G learns a compact, graph-aligned multilinear surrogate that approximates the masked-input behavior, represented as a tensor network whose topology mirrors the input graph. Once trained from a small number of oracle queries, the surrogate enables deterministic recovery of first- and higher-order Shapley indices via the multilinear extension, without additional model queries or Monte Carlo variance. Experiments on molecular benchmarks show that the learned factorization closely matches exact Shapley values on small graphs and scales efficiently to larger graphs where sampling-based methods become infeasible.

2025-12-31

International Conference on Machine Learning (Accept (regular))

Higher Order Transformers: Efficient Attention Mechanism for Tensor Structured Data

Soroush Omranpour

Transformers are now ubiquitous for sequence modeling tasks, but their extension to multi-dimensional data remains a challenge due to the qu… (voir plus)adratic cost of the attention mechanism. In this paper, we propose Higher-Order Transformers (HOT), a novel architecture designed to efficiently process data with more than two axes, i.e. higher-order tensors. To address the computational challenges associated with high-order tensor attention, we introduce a novel Kronecker factorized attention mechanism that reduces the attention cost to quadratic in each axis' dimension, rather than quadratic in the total size of the input tensor. To further enhance efficiency, HOT leverages kernelized attention, reducing the complexity to linear. This strategy maintains the model's expressiveness while enabling scalable attention computation. We validate the effectiveness of HOT on two high-dimensional tasks, including multivariate time series forecasting, and 3D medical image classification. Experimental results demonstrate that HOT achieves competitive performance while significantly improving computational efficiency, showcasing its potential for tackling a wide range of complex, multi-dimensional data.

2025-11-15

TMLR (accepté)

TGM: a Modular and Efficient Library for Machine Learning on Temporal Graphs

Tran Gia Bao Ngo

Jure Leskovec

Michael M. Bronstein

Matthias Fey

Well-designed open-source software drives progress in Machine Learning (ML) research. While static graph ML enjoys mature frameworks like Py… (voir plus)Torch Geometric and DGL, ML for temporal graphs (TG), networks that evolve over time, lacks comparable infrastructure. Existing TG libraries are often tailored to specific architectures, hindering support for diverse models in this rapidly evolving field. Additionally, the divide between continuous- and discrete-time dynamic graph methods (CTDG and DTDG) limits direct comparisons and idea transfer. To address these gaps, we introduce Temporal Graph Modelling (TGM), a research-oriented library for ML on temporal graphs, the first to unify CTDG and DTDG approaches. TGM offers first-class support for dynamic node features, time-granularity conversions, and native handling of link-, node-, and graph-level tasks. Empirically, TGM achieves an average 7.8x speedup across multiple models, datasets, and tasks compared to the widely used DyGLib, and an average 175x speedup on graph discretization relative to available implementations. Beyond efficiency, we show in our experiments how TGM unlocks entirely new research possibilities by enabling dynamic graph property prediction and time-driven training paradigms, opening the door to questions previously impractical to study. TGM is available at https://github.com/tgm-team/tgm

2025-10-07

ArXiv (prépublication)