Simon Lacoste-Julien

Biography

Simon Lacoste-Julien is an associate professor at Mila – Quebec Artificial Intelligence Institute and in the Department of Computer Science and Operations Research (DIRO) at Université de Montréal. He is also a Canada CIFAR AI Chair and heads the SAIT AI Lab Montréal part-time.

Lacoste-Julien’s research interests are machine learning and applied mathematics, along with their applications to computer vision and natural language processing. He completed a BSc in mathematics, physics and computer science at McGill University, a PhD in computer science at UC Berkeley and a postdoc at the University of Cambridge.

After spending several years as a researcher at INRIA and the École normale supérieure in Paris, he returned to his home city of Montréal in 2016 to answer Yoshua Bengio’s call to help grow the Montréal AI ecosystem.

Current Students

Reza Babanezhad Harikandeh

Independent visiting researcher - Samsung SAIT

Aristide Baratin

Independent visiting researcher - Samsung SAIT

PhD - Université de Montréal

PhD - Université de Montréal

Marwa El Halabi

Independent visiting researcher - Samsung SAIT

PhD - Université de Montréal

Yash Goyal

Independent visiting researcher - Samsung SAIT

Meraj Hashemizadeh

Collaborating researcher - Université de Montréal

Fahimeh HosseiniNoohdani

Collaborating researcher - Université de Montréal

Pedram Khorsandi

PhD - Université de Montréal

Boris Knyazev

Independent visiting researcher - Université de Montréal

Independent visiting researcher - Samsung SAIT

Lucas Maes

PhD - Université de Montréal

PhD - Université de Montréal

Co-supervisor :

PhD - Université de Montréal

Co-supervisor :

Collaborating Alumni - Université de Montréal

Juan Ramirez

PhD - Université de Montréal

PhD - Université de Montréal

Independent visiting researcher - University of Tübingen

Theo Saulus

PhD - Université de Montréal

Co-supervisor: Dhanya Sridhar

Damien Scieur

Independent visiting researcher - Samsung SAIT

Motahareh Sohrabi

Collaborating researcher - Université de Montréal

Helen Zhang

PhD - Université de Montréal

Yan Zhang

Independent visiting researcher - Samsung SAIT

Blog Posts

Additive Decoders for Latent Variables Identification and Cartesian-Product Extrapolation

March 18, 2024

Sébastien Lachapelle

Divyat Mahajan

Ioannis Mitliagkas

Simon Lacoste-Julien

Publications

Reparametrizing Shampoo and SOAP for Subspace Basis Updates and BFloat16 Storage

Alan Milligan

Zikun Xu

Felix Dangel

Wu Lin

Shampoo-based methods, such as KL-Shampoo and SOAP, have demonstrated strong performance in training neural networks and rely on QR decomposition. Because existing QR implementations require single-precision (FP32) arithmetic and remain computationally expensive, these methods become time- and memory-intensive when their preconditioning matrices are large. Moreover, using BFloat16 (BFP16) storage to reduce memory usage can degrade the performance of Shampoo-based methods. We propose a reparametrization of the preconditioner that supports BFP16 storage and forms a complete basis by combining updated basis vectors with unchanged ones. By updating only part of the basis through QR decomposition in a subspace, our approach reduces computational overhead while mitigating the performance degradation caused by BFP16 storage. Our approach applies broadly to Shampoo-based methods that employ QR decomposition, including KL-Shampoo, SOAP, and KL-SOAP. In particular, it improves the performance of SOAP and KL-SOAP under BFP16 storage, enabling KL-SOAP to match or exceed KL-Shampoo. Overall, our approach makes Shampoo-based methods more memory- and time-efficient.

2026-05-24

arXiv (preprint)

Stop Probing, Start Coding: Why Linear Probes and Sparse Autoencoders Fail at Compositional Generalisation

Vitoria Barin Pacela

Shruti Joshi

Isabela Camacho

David Klindt

The linear representation hypothesis states that neural network activations encode high-level concepts as linear mixtures. However, under superposition, this encoding is a projection from a higher-dimensional concept space into a lower-dimensional activation space, and a linear decision boundary in the concept space need not remain linear after projection. In this setting, classical sparse coding methods with per-sample iterative inference leverage compressed sensing guarantees to recover latent factors. Sparse autoencoders (SAEs), on the other hand, amortise sparse inference into a fixed encoder, introducing a systematic gap. We show this amortisation gap persists across training set sizes, latent dimensions, and sparsity levels, causing SAEs to fail under out-of-distribution (OOD) compositional shifts. Through controlled experiments that decompose the failure, we identify dictionary learning -- not the inference procedure -- as the binding constraint: SAE-learned dictionaries point in substantially wrong directions, and replacing the encoder with per-sample FISTA on the same dictionary does not close the gap. An oracle baseline proves the problem is solvable with a good dictionary at all scales tested. Our results reframe the SAE failure as a dictionary learning challenge, not an amortisation problem, and point to scalable dictionary learning as the key open problem for sparse inference under superposition.

2026-03-29

arXiv (preprint)

arxiv.org

Lloyd's $K$-Means Clustering Algorithm is Frank-Wolfe in Disguise

Michael Pokojovy

J. Marcus Jobe

Lloyd's …

2026-02-02

International Conference on Artificial Intelligence and Statistics (poster)

The Role of Causal Features in Strategic Classification for Robustness and Alignment

Sophia Gunluk

Antonio Gois

Nir Rosenfeld

Nidhi Hegde

Dhanya Sridhar

In strategic classification, an institution (e.g., a bank) anticipates adaptation from users who change their features to increase utility in a classification task (e.g., loan repayment). Since a key challenge is the distribution shift induced by users, we turn to causal models, which have been shown to bound the worst-case out-of-distribution (OOD) risk, and establish several new results that link causality and strategic classification. First, we show that causal classification leads to optimal classification error after any sufficiently large adaptation, when the noise is bounded in a certain way. Second, when these assumptions do not hold, we show OOD cross-entropy risk of optimal classifiers decomposes into an OOD bias term and a term arising from not using all observable features, allowing us to determine when causal classifiers have an advantage. Finally, we show that causal classifiers can align long-term incentives between institutions and users, contrasting with previous work that highlights social costs of such approaches. We validate our theory empirically on synthetic data, finding that our results predict behavior in practice.

2026-02-02

Artificial Intelligence and Statistics (poster)

Accelerated and Stable Convergence with Anchored Generalized Optimistic Method

Jianxin You

We study first-order methods for solving monotone variational inequalities arising in min-max optimization. Classical approaches such as the extragradient method rely on two gradient queries per iteration, which limits their analysis and applicability in the online and stochastic settings. We propose a family of Generalized Optimistic Methods with Anchoring (GOMA), which combine two time-scale optimistic updates with an anchoring term inspired by Halpern iteration. In particular, we show that for monotone Lipschitz operators, GOMA achieves an accelerated last-iterate convergence rate of

2025-12-31

International Conference on Machine Learning (Accept (regular))

Dual Optimistic Ascent (PI Control) is the Augmented Lagrangian Method in Disguise

Juan Ramirez

Constrained optimization is a powerful framework for enforcing requirements on neural networks. These constrained deep learning problems are typically solved using first-order methods on their min-max Lagrangian formulation, but such approaches often suffer from oscillations and can fail to find all local solutions. While the Augmented Lagrangian method (ALM) addresses these issues, practitioners often favor dual optimistic ascent schemes (PI control) on the standard Lagrangian, which perform well empirically but lack formal guarantees. In this paper, we establish a previously unknown equivalence between these approaches: dual optimistic ascent on the Lagrangian is equivalent to gradient descent-ascent on the Augmented Lagrangian. This finding allows us to transfer the robust theoretical guarantees of the ALM to the dual optimistic setting, proving it converges linearly to all local solutions. Furthermore, the equivalence provides principled guidance for tuning the optimism hyper-parameter. Our work closes a critical gap between the empirical success of dual optimistic methods and their theoretical foundation.

2025-12-31

International Conference on Learning Representations (Accept (Poster))

Operationalizing Quantized Disentanglement

Vitória Barin-Pacela

Kartik Ahuja

P Vincent

2025-11-24

ArXiv (preprint)

arxiv.org

Tight Lower Bounds and Improved Convergence in Performative Prediction

Performative prediction is a framework accounting for the shift in the data distribution induced by the prediction of a model deployed in the real world. Ensuring rapid convergence to a stable solution where the data distribution remains the same after the model deployment is crucial, especially in evolving environments. This paper extends the Repeated Risk Minimization (RRM) framework by utilizing historical datasets from previous retraining snapshots, yielding a class of algorithms that we call Affine Risk Minimizers and enabling convergence to a performatively stable point for a broader class of problems. We introduce a new upper bound for methods that use only the final iteration of the dataset and prove for the first time the tightness of both this new bound and the previous existing bounds within the same regime. We also prove that utilizing historical datasets can surpass the lower bound for last iterate RRM, and empirically observe faster convergence to the stable point on various performative prediction benchmarks. We offer at the same time the first lower bound analysis for RRM within the class of Affine Risk Minimizers, quantifying the potential improvements in convergence speed that could be achieved with other variants in our framework.

2025-09-17

NeurIPS.cc/2025/Conference (poster)

Alexia Jolicoeur-Martineau

Understanding Adam Requires Better Rotation Dependent Assumptions

Lucas Maes

Tianyue H. Zhang

Alan Milligan

Ioannis Mitliagkas

Damien Scieur

Charles Guille-escuret

Despite its widespread adoption, Adam's advantage over Stochastic Gradient Descent (SGD) lacks a comprehensive theoretical explanation. This paper investigates Adam's sensitivity to rotations of the parameter space. We observe that Adam's performance in training transformers degrades under random rotations of the parameter space, indicating a crucial sensitivity to the choice of basis in practice. This reveals that conventional rotation-invariant assumptions are insufficient to capture Adam's advantages theoretically. To better understand the rotation-dependent properties that benefit Adam, we also identify structured rotations that preserve or even enhance its empirical performance. We then examine the rotation-dependent assumptions in the literature and find that they fall short in explaining Adam's behaviour across various rotation types. In contrast, we verify the orthogonality of the update as a promising indicator of Adam's basis sensitivity, suggesting it may be the key quantity for developing rotation-dependent theoretical frameworks that better explain its empirical success.

2025-09-17

NeurIPS.cc/2025/Conference (poster)

Quantized Disentanglement: A Practical Approach

Vitória Barin-Pacela

Kartik Ahuja

P Vincent

2025-06-08

ICML.cc/2025/Workshop/SIM (poster)

Performative Prediction on Games and Mechanism Design

Fernando P. Santos

2025-04-22

Proceedings of The 28th International Conference on Artificial Intelligence and Statistics (published)

proceedings.mlr.press

Accelerating Training with Neuron Interaction and Nowcasting Networks

Neural network training can be accelerated when a learnable update rule is used in lieu of classic adaptive optimizers (e.g. Adam). However, learnable update rules can be costly and unstable to train and use. Recently, Jang et al. (2023) proposed a simpler approach to accelerate training based on weight nowcaster networks (WNNs). In their approach, Adam is used for most of the optimization steps and periodically, only every few steps, a WNN nowcasts (predicts near future) parameters. We improve WNNs by proposing neuron interaction and nowcasting (NiNo) networks. In contrast to WNNs, NiNo leverages neuron connectivity and graph neural networks to more accurately nowcast parameters. We further show that in some networks, such as Transformers, modeling neuron connectivity accurately is challenging. We address this and other limitations, which allows NiNo to accelerate Adam training by up to 50% in vision and language tasks.

2025-01-21

ICLR.cc/2025/Conference (poster)