Guillaume Lajoie

Biographie

Guillaume Lajoie est professeur agrégé au Département de mathématiques et de statistiques (DMS) de l'Université de Montréal et membre académique principal de Mila – Institut québécois d’intelligence artificielle. Il est titulaire d'une chaire CIFAR (CCAI Canada) ainsi que d'une chaire de recherche du Canada (CRC) en calcul et interfaçage neuronaux.

Ses recherches sont positionnées à l'intersection de l'IA et des neurosciences où il développe des outils pour mieux comprendre les mécanismes d'intelligence communs aux systèmes biologiques et artificiels. Les contributions de son groupe de recherche vont des progrès des paradigmes d'apprentissage à plusieurs échelles pour les grands systèmes artificiels aux applications en neurotechnologie. Dr. Lajoie participe activement aux efforts de développement responsables de l'IA, cherchant à identifier les lignes directrices et les meilleures pratiques pour l'utilisation de l'IA dans la recherche et au-delà.

Étudiants actuels

Federico Arangath Joseph

Collaborateur·rice de recherche - ETH Zurich

Stefan Bauer

Visiteur de recherche indépendant

Superviseur⋅e principal⋅e :

Yoshua Bengio

Sangnie Bhardwaj

Doctorat - UdeM

Co-superviseur⋅e :

Hugo Larochelle

Colin Bredenberg

Postdoctorat - UdeM

Co-superviseur⋅e :

Blake Richards

Leo Choiniere

Doctorat - UdeM

Olivier Codol

Postdoctorat - UdeM

Co-superviseur⋅e :

Doctorat - UdeM

Superviseur⋅e principal⋅e :

Doctorat - UdeM

Superviseur⋅e principal⋅e :

Leo Gagnon

Doctorat - UdeM

Skylar Gu

Stagiaire de recherche - McGill

Superviseur⋅e principal⋅e :

Juan Guerra

Maîtrise recherche - Polytechnique

Superviseur⋅e principal⋅e :

Nanda Harishankar Krishna

Doctorat - UdeM

Collaborateur·rice de recherche - Western Washington University (faculty; assistant prof))

Superviseur⋅e principal⋅e :

Doctorat - UdeM

Co-superviseur⋅e :

Maîtrise recherche - UdeM

Co-superviseur⋅e :

tejaskasetty@gmail.com

Stagiaire de recherche - Concordia

Co-superviseur⋅e :

Matt Perich

Site web

Ximeng Mao

Doctorat - UdeM

Co-superviseur⋅e :

Joelle Pineau

Abdel Mfougouon Njupoun

Doctorat - UdeM

Co-superviseur⋅e :

Doctorat - UdeM

Co-superviseur⋅e :

Amine Natik

Doctorat - UdeM

Co-superviseur⋅e :

Guy Wolf

Alexandre Payeur

Collaborateur·rice de recherche - UdeM

Mohammad Pezeshki

Collaborateur·rice de recherche

Superviseur⋅e principal⋅e :

Collaborateur·rice alumni - McGill

Superviseur⋅e principal⋅e :

Julia Price

Maîtrise recherche - UdeM

Avery Ryoo

Maîtrise recherche - UdeM

Superviseur⋅e principal⋅e :

Doctorat - UdeM

Co-superviseur⋅e :

Lune Bellec

Ayesha Vermani

Visiteur de recherche indépendant - Champalimeau Institute for the Unknown

Ryan Vogt

Postdoctorat - UdeM

Apprentissage automatique pour la segmentation des différentes activations des fibres nerveuses à partir des signaux neuronaux du cerveau vers le corps

Vivian White

Stagiaire de recherche - Western Washington University

Co-superviseur⋅e :

Doctorat - UdeM

Billets de blogue

Représentation graphique d'un nerf vague

21 mai 2025

par

Param Raval

Olivier Tessier-Larivière

Pascal Fortier-Poisson

Blake Richards

Guillaume Lajoie

Lire l'article

13 juin 2024

Que nous apprennent les distributions des coefficients synaptiques au sujet de l’apprentissage dans le cerveau ?

par

Roman Pogodin

Jonathan Cornford

Arna Ghosh

Gauthier Gidel

Guillaume Lajoie

Blake Richards

Lire l'article

Publications

Celo: Training Versatile Learned Optimizers on a Compute Diet

Abhinav Moudgil

Boris Knyazev

Eugene Belilovsky

Learned optimization has emerged as a promising alternative to hand-crafted optimizers, with the potential to discover stronger learned upda… (voir plus)te rules that enable faster, hyperparameter-free training of neural networks. A critical element for practically useful learned optimizers, that can be used off-the-shelf after meta-training, is strong meta-generalization: the ability to apply the optimizers to new tasks. Recent state-of-the-art work in learned optimizers, VeLO (Metz et al., 2022), requires a large number of highly diverse meta-training tasks along with massive computational resources, 4000 TPU months, to achieve meta-generalization. This makes further improvements to such learned optimizers impractical. In this work, we identify several key elements in learned optimizer architectures and meta-training procedures that can lead to strong meta-generalization. We also propose evaluation metrics to reliably assess quantitative performance of an optimizer at scale on a set of evaluation tasks. Our proposed approach, Celo, makes a significant leap in improving the meta-generalization performance of learned optimizers and also outperforms tuned state-of-the-art optimizers on a diverse set of out-of-distribution tasks, despite being meta-trained for just 24 GPU hours.

2025-07-08

TMLR (accepté)

Neuromorphic hierarchical modular reservoirs

Filip Milisav

Andrea I Luppi

Laura E Suárez

Bratislav Mišić

2025-06-21

bioRxiv (prépublication)

Next-Token Prediction Should be Ambiguity-Sensitive: A Meta-Learning Perspective

Leo Gagnon

Eric Elmoznino

Sarthak Mittal

Tom Marty

Tejas Kasetty

The rapid adaptation ability of auto-regressive foundation models is often attributed to the diversity of their pre-training data. This is b… (voir plus)ecause, from a Bayesian standpoint, minimizing prediction error in such settings requires integrating over all plausible latent hypotheses consistent with observations. While this behavior is desirable in principle, it often proves too ambitious in practice: under high ambiguity, the number of plausible latent alternatives makes Bayes-optimal prediction computationally intractable. Cognitive science has long recognized this limitation, suggesting that under such conditions, heuristics or information-seeking strategies are preferable to exhaustive inference. Translating this insight to next-token prediction, we hypothesize that low- and high-ambiguity predictions pose different computational demands, making ambiguity-agnostic next-token prediction a detrimental inductive bias. To test this, we introduce MetaHMM, a synthetic sequence meta-learning benchmark with rich compositional structure and a tractable Bayesian oracle. We show that Transformers indeed struggle with high-ambiguity predictions across model sizes. Motivated by cognitive theories, we propose a method to convert pre-trained models into Monte Carlo predictors that decouple task inference from token prediction. Preliminary results show substantial gains in ambiguous contexts through improved capacity allocation and test-time scalable inference, though challenges remain.

2025-06-19

ArXiv (prépublication)

Next-Token Prediction Should be Ambiguity-Sensitive: A Meta-Learning Perspective

Leo Gagnon

Eric Elmoznino

Sarthak Mittal

Tom Marty

Tejas Kasetty

2025-06-19

ArXiv (prépublication)

Next-Token Prediction Should be Ambiguity-Sensitive : A Meta-Learing Perspective

Leo Gagnon

Eric Elmoznino

Sarthak Mittal

Tom Marty

Tejas Kasetty

2025-06-11

ICML.cc/2025/Workshop/ES-FoMo-III (publié)

Tracing the representation geometry of language models from pretraining to post-training

Melody Zixuan Li

Kumar Krishna Agrawal

Arna Ghosh

Komal Kumar Teru

Blake Richards

The geometry of representations in a neural network can significantly impact downstream generalization. It is unknown how representation geo… (voir plus)metry changes in large language models (LLMs) over pretraining and post-training. Here, we characterize the evolving geometry of LLM representations using spectral methods (effective rank and eigenspectrum decay). With the OLMo and Pythia model families we uncover a consistent non-monotonic sequence of three distinct geometric phases in pretraining. An initial \warmup phase sees rapid representational compression. This is followed by an "entropy-seeking" phase, characterized by expansion of the representation manifold's effective dimensionality, which correlates with an increase in memorization. Subsequently, a "compression seeking" phase imposes anisotropic consolidation, selectively preserving variance along dominant eigendirections while contracting others, correlating with improved downstream task performance. We link the emergence of these phases to the fundamental interplay of cross-entropy optimization, information bottleneck, and skewed data distribution. Additionally, we find that in post-training the representation geometry is further transformed: Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) correlate with another "entropy-seeking" dynamic to integrate specific instructional or preferential data, reducing out-of-distribution robustness. Conversely, Reinforcement Learning with Verifiable Rewards (RLVR) often exhibits a "compression seeking" dynamic, consolidating reward-aligned behaviors and reducing the entropy in its output distribution. This work establishes the utility of spectral measures of representation geometry for understanding the multiphase learning dynamics within LLMs.

2025-06-09

ICML.cc/2025/Workshop/HiLD (poster)

Generalizable, real-time neural decoding with hybrid state-space models

Avery Hee-Woon Ryoo

Nanda H Krishna

Ximeng Mao

Mehdi Azabou

Eva L Dyer

Matt Perich

Real-time decoding of neural activity is central to neuroscience and neurotechnology applications, from closed-loop experiments to brain-com… (voir plus)puter interfaces, where models are subject to strict latency constraints. Traditional methods, including simple recurrent neural networks, are fast and lightweight but often struggle to generalize to unseen data. In contrast, recent Transformer-based approaches leverage large-scale pretraining for strong generalization performance, but typically have much larger computational requirements and are not always suitable for low-resource or real-time settings. To address these shortcomings, we present POSSM, a novel hybrid architecture that combines individual spike tokenization via a cross-attention module with a recurrent state-space model (SSM) backbone to enable (1) fast and causal online prediction on neural activity and (2) efficient generalization to new sessions, individuals, and tasks through multi-dataset pretraining. We evaluate POSSM's decoding performance and inference speed on intracortical decoding of monkey motor tasks, and show that it extends to clinical applications, namely handwriting and speech decoding in human subjects. Notably, we demonstrate that pretraining on monkey motor-cortical recordings improves decoding performance on the human handwriting task, highlighting the exciting potential for cross-species transfer. In all of these tasks, we find that POSSM achieves decoding accuracy comparable to state-of-the-art Transformers, at a fraction of the inference cost (up to 9x faster on GPU). These results suggest that hybrid SSMs are a promising approach to bridging the gap between accuracy, inference speed, and generalization when training neural decoders for real-time, closed-loop applications.

2025-06-05

ArXiv (prépublication)

Generalizable, real-time neural decoding with hybrid state-space models

Avery Hee-Woon Ryoo

Nanda H Krishna

Ximeng Mao

Mehdi Azabou

Eva L Dyer

Matt Perich

2025-06-05

ArXiv (prépublication)

MesaNet: Sequence Modeling by Locally Optimal Test-Time Training

Johannes Von Oswald

Nino Scherrer

Seijin Kobayashi

Luca Versari

Songlin Yang

Maximilian Schlegel

Kaitlin Maile

Yanick Schimpf

Oliver Sieberling

Alexander Meulemans

Rif A. Saurous

Charlotte Frenkel

Razvan Pascanu

Blaise Aguera y Arcas

João Sacramento

Sequence modeling is currently dominated by causal transformer architectures that use softmax self-attention. Although widely adopted, trans… (voir plus)formers require scaling memory and compute linearly during inference. A recent stream of work linearized the softmax operation, resulting in powerful recurrent neural network (RNN) models with constant memory and compute costs such as DeltaNet, Mamba or xLSTM. These models can be unified by noting that their recurrent layer dynamics can all be derived from an in-context regression objective, approximately optimized through an online learning rule. Here, we join this line of work and introduce a numerically stable, chunkwise parallelizable version of the recently proposed Mesa layer (von Oswald et al., 2024), and study it in language modeling at the billion-parameter scale. This layer again stems from an in-context loss, but which is now minimized to optimality at every time point using a fast conjugate gradient solver. Through an extensive suite of experiments, we show that optimal test-time training enables reaching lower language modeling perplexity and higher downstream benchmark performance than previous RNNs, especially on tasks requiring long context understanding. This performance gain comes at the cost of additional flops spent during inference time. Our results are therefore intriguingly related to recent trends of increasing test-time compute to improve performance -- here by spending compute to solve sequential optimization problems within the neural network itself.

2025-06-05

ArXiv (prépublication)

MesaNet: Sequence Modeling by Locally Optimal Test-Time Training

Johannes Von Oswald

Nino Scherrer

Seijin Kobayashi

Luca Versari

Songlin Yang

Maximilian Schlegel

Kaitlin Maile

Yanick Schimpf

Oliver Sieberling

Alexander Meulemans

Rif A. Saurous

Charlotte Frenkel

Razvan Pascanu

Blaise Aguera y Arcas

João Sacramento

2025-06-01

arXiv (publié)

Bidirectional Information Flow (BIF) - A Sample Efficient Hierarchical Gaussian Process for Bayesian Optimization

Juan David Guerra

Thomas Garbay

Marco Bonizzato

Hierarchical Gaussian Process (H-GP) models divide problems into different subtasks, allowing for different models to address each part, mak… (voir plus)ing them well-suited for problems with inherent hierarchical structure. However, typical H-GP models do not fully take advantage of this structure, only sending information up or down the hierarchy. This one-way coupling limits sample efficiency and slows convergence. We propose Bidirectional Information Flow (BIF), an efficient H-GP framework that establishes bidirectional information exchange between parent and child models in H-GPs for online training. BIF retains the modular structure of hierarchical models - the parent combines subtask knowledge from children GPs - while introducing top-down feedback to continually refine children models during online learning. This mutual exchange improves sample efficiency, enables robust training, and allows modular reuse of learned subtask models. BIF outperforms conventional H-GP Bayesian Optimization methods, achieving up to 85% and 5x higher

2025-05-01

arXiv (publié)

Does learning the right latent variables necessarily improve in-context learning?

Sarthak Mittal

Eric Elmoznino

Leo Gagnon

Sangnie Bhardwaj

Large autoregressive models like Transformers can solve tasks through in-context learning (ICL) without learning new weights, suggesting ave… (voir plus)nues for efficiently solving new tasks. For many tasks, e.g., linear regression, the data factorizes: examples are independent given a task latent that generates the data, e.g., linear coefficients. While an optimal predictor leverages this factorization by inferring task latents, it is unclear if Transformers implicitly do so or if they instead exploit heuristics and statistical shortcuts enabled by attention layers. Both scenarios have inspired active ongoing work. In this paper, we systematically investigate the effect of explicitly inferring task latents. We minimally modify the Transformer architecture with a bottleneck designed to prevent shortcuts in favor of more structured solutions, and then compare performance against standard Transformers across various ICL tasks. Contrary to intuition and some recent works, we find little discernible difference between the two; biasing towards task-relevant latent variables does not lead to better out-of-distribution performance, in general. Curiously, we find that while the bottleneck effectively learns to extract latent task variables from context, downstream processing struggles to utilize them for robust prediction. Our study highlights the intrinsic limitations of Transformers in achieving structured ICL solutions that generalize, and shows that while inferring the right latents aids interpretability, it is not sufficient to alleviate this problem.

2025-05-01

ICML.cc/2025/Conference (poster)