Simon Lacoste-Julien

Core Academic Member
Canada CIFAR AI Chair
Associate Scientific Director, Mila, Full Professor, Université de Montréal, Department of Computer Science and Operations Research
Vice President and Lab Director, Samsung Advanced Institute of Technology (SAIT) AI Lab, Montréal
Research Topics
Causality
Computer Vision
Deep Learning
Generative Models
Machine Learning Theory
Natural Language Processing
Optimization
Probabilistic Models

Biography

Simon Lacoste-Julien is an associate professor at Mila – Quebec Artificial Intelligence Institute and in the Department of Computer Science and Operations Research (DIRO) at Université de Montréal. He is also a Canada CIFAR AI Chair and heads (part time) the SAIT AI Lab Montréal.

Lacoste-Julien’s research interests are machine learning and applied mathematics, along with their applications to computer vision and natural language processing. He completed a BSc in mathematics, physics and computer science at McGill University, a PhD in computer science at UC Berkeley and a postdoc at the University of Cambridge.

After spending several years as a researcher at INRIA and the École normale supérieure in Paris, he returned to his home city of Montréal in 2016 to answer Yoshua Bengio’s call to help grow the Montréal AI ecosystem.

Current Students

Independent visiting researcher - Samsung SAIT
Independent visiting researcher - Samsung SAIT
PhD - Université de Montréal
Independent visiting researcher - Samsung
PhD - Université de Montréal
Independent visiting researcher - Samsung SAIT
PhD - Université de Montréal
Independent visiting researcher - Samsung SAIT
Collaborating researcher - Université de Montréal
Collaborating researcher - Université de Montréal
PhD - Université de Montréal
Independent visiting researcher - Université de Montréal
Independent visiting researcher - Samsung SAIT
PhD - Université de Montréal
PhD - Université de Montréal
Co-supervisor :
PhD - Université de Montréal
Co-supervisor :
Collaborating Alumni - Université de Montréal
PhD - Université de Montréal
PhD - Université de Montréal
Independent visiting researcher - University of Tübingen
PhD - Université de Montréal
Co-supervisor :
Independent visiting researcher - Samsung SAIT
Collaborating researcher - Université de Montréal
PhD - Université de Montréal
Independent visiting researcher - Samsung SAIT

Publications

Stop Probing, Start Coding: Why Linear Probes and Sparse Autoencoders Fail at Compositional Generalisation
The linear representation hypothesis states that neural network activations encode high-level concepts as linear mixtures. However, under superposition, this encoding is a projection from a higher-dimensional concept space into a lower-dimensional activation space, and a linear decision boundary in the concept space need not remain linear after projection. In this setting, classical sparse coding methods with per-sample iterative inference leverage compressed sensing guarantees to recover latent factors. Sparse autoencoders (SAEs), on the other hand, amortise sparse inference into a fixed encoder, introducing a systematic gap. We show this amortisation gap persists across training set sizes, latent dimensions, and sparsity levels, causing SAEs to fail under out-of-distribution (OOD) compositional shifts. Through controlled experiments that decompose the failure, we identify dictionary learning -- not the inference procedure -- as the binding constraint: SAE-learned dictionaries point in substantially wrong directions, and replacing the encoder with per-sample FISTA on the same dictionary does not close the gap. An oracle baseline proves the problem is solvable with a good dictionary at all scales tested. Our results reframe the SAE failure as a dictionary learning challenge, not an amortisation problem, and point to scalable dictionary learning as the key open problem for sparse inference under superposition.
Lloyd's $K$-Means Clustering Algorithm is Frank-Wolfe in Disguise
Michael Pokojovy
J. Marcus Jobe
Lloyd's …
The Role of Causal Features in Strategic Classification for Robustness and Alignment
In strategic classification, an institution (e.g., a bank) anticipates adaptation from users who change their features to increase utility in a classification task (e.g., loan repayment). Since a key challenge is the distribution shift induced by users, we turn to causal models, which have been shown to bound the worst-case out-of-distribution (OOD) risk, and establish several new results that link causality and strategic classification. First, we show that causal classification leads to optimal classification error after any sufficiently large adaptation, when the noise is bounded in a certain way. Second, when these assumptions do not hold, we show OOD cross-entropy risk of optimal classifiers decomposes into an OOD bias term and a term arising from not using all observable features, allowing us to determine when causal classifiers have an advantage. Finally, we show that causal classifiers can align long-term incentives between institutions and users, contrasting with previous work that highlights social costs of such approaches. We validate our theory empirically on synthetic data, finding that our results predict behavior in practice.
Dual Optimistic Ascent (PI Control) is the Augmented Lagrangian Method in Disguise
Constrained optimization is a powerful framework for enforcing requirements on neural networks. These constrained deep learning problems are typically solved using first-order methods on their min-max Lagrangian formulation, but such approaches often suffer from oscillations and can fail to find all local solutions. While the Augmented Lagrangian method (ALM) addresses these issues, practitioners often favor dual optimistic ascent schemes (PI control) on the standard Lagrangian, which perform well empirically but lack formal guarantees. In this paper, we establish a previously unknown equivalence between these approaches: dual optimistic ascent on the Lagrangian is equivalent to gradient descent-ascent on the Augmented Lagrangian. This finding allows us to transfer the robust theoretical guarantees of the ALM to the dual optimistic setting, proving it converges linearly to all local solutions. Furthermore, the equivalence provides principled guidance for tuning the optimism hyper-parameter. Our work closes a critical gap between the empirical success of dual optimistic methods and their theoretical foundation.
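To make the dual optimistic ascent idea concrete, here is a minimal sketch on a toy problem: minimize x^2 subject to 1 - x <= 0, whose solution is x = 1 with multiplier 2. The toy problem, the step sizes, the function name, and the particular form of the optimistic update (the current constraint value plus `omega` times its change since the previous step) are all assumptions chosen for illustration; they are not taken from the paper.

```python
def optimistic_dual_ascent(steps=2000, lr_x=0.1, lr_dual=0.1, omega=1.0):
    """Toy problem (an assumption): minimize x^2 subject to g(x) = 1 - x <= 0.

    Lagrangian: L(x, lam) = x^2 + lam * (1 - x).
    Primal: plain gradient descent on L with respect to x.
    Dual: projected ascent using g plus an optimism (difference) term.
    """
    x, lam = 0.0, 0.0
    g_prev = 1.0 - x  # constraint value at the previous iterate
    for _ in range(steps):
        g = 1.0 - x                # current constraint value
        grad_x = 2.0 * x - lam     # dL/dx at the current (x, lam)
        x_new = x - lr_x * grad_x  # primal descent step
        # Optimistic dual step: proportional term g, plus omega times the
        # change in g; max(0, .) keeps the multiplier non-negative.
        lam = max(0.0, lam + lr_dual * (g + omega * (g - g_prev)))
        x, g_prev = x_new, g
    return x, lam


x_star, lam_star = optimistic_dual_ascent()
```

Setting `omega=0.0` recovers plain gradient descent-ascent on the Lagrangian, which makes it easy to compare the two dual updates on the same problem.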
Operationalizing Quantized Disentanglement
Vitória Barin-Pacela
P Vincent
Tight Lower Bounds and Improved Convergence in Performative Prediction
Performative prediction is a framework accounting for the shift in the data distribution induced by the prediction of a model deployed in the real world. Ensuring rapid convergence to a stable solution where the data distribution remains the same after the model deployment is crucial, especially in evolving environments. This paper extends the Repeated Risk Minimization (RRM) framework by utilizing historical datasets from previous retraining snapshots, yielding a class of algorithms that we call Affine Risk Minimizers and enabling convergence to a performatively stable point for a broader class of problems. We introduce a new upper bound for methods that use only the final iteration of the dataset and prove for the first time the tightness of both this new bound and the previous existing bounds within the same regime. We also prove that utilizing historical datasets can surpass the lower bound for last iterate RRM, and empirically observe faster convergence to the stable point on various performative prediction benchmarks. We offer at the same time the first lower bound analysis for RRM within the class of Affine Risk Minimizers, quantifying the potential improvements in convergence speed that could be achieved with other variants in our framework.
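As a rough illustration of plain Repeated Risk Minimization (not the Affine Risk Minimizers proposed in the paper), the sketch below uses a hypothetical one-dimensional model: deploying parameter theta shifts the data mean to mu + eps * theta, and the loss is the squared error (z - theta)^2. The model, the parameter names `mu` and `eps`, and the function name are assumptions made for this example. Under squared loss the risk minimizer is the mean of the induced distribution, so each retraining step has a closed form.

```python
def repeated_risk_minimization(mu=1.0, eps=0.5, theta0=0.0, rounds=50):
    """RRM on a toy location-shift model (an assumption for illustration).

    Deploying theta induces data with mean mu + eps * theta.
    Retraining with squared loss returns the mean of the induced data.
    """
    theta = theta0
    history = [theta]
    for _ in range(rounds):
        induced_mean = mu + eps * theta  # mean of D(theta) after deployment
        theta = induced_mean             # argmin over t of E[(z - t)^2]
        history.append(theta)
    return history


history = repeated_risk_minimization()
stable_point = 1.0 / (1.0 - 0.5)  # fixed point mu / (1 - eps)
```

For |eps| < 1 the map theta -> mu + eps * theta is a contraction, so the iterates approach the performatively stable point mu / (1 - eps), where retraining on the induced data no longer changes the model.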
Understanding Adam Requires Better Rotation Dependent Assumptions
Despite its widespread adoption, Adam's advantage over Stochastic Gradient Descent (SGD) lacks a comprehensive theoretical explanation. This paper investigates Adam's sensitivity to rotations of the parameter space. We observe that Adam's performance in training transformers degrades under random rotations of the parameter space, indicating a crucial sensitivity to the choice of basis in practice. This reveals that conventional rotation-invariant assumptions are insufficient to capture Adam's advantages theoretically. To better understand the rotation-dependent properties that benefit Adam, we also identify structured rotations that preserve or even enhance its empirical performance. We then examine the rotation-dependent assumptions in the literature and find that they fall short in explaining Adam's behaviour across various rotation types. In contrast, we verify the orthogonality of the update as a promising indicator of Adam's basis sensitivity, suggesting it may be the key quantity for developing rotation-dependent theoretical frameworks that better explain its empirical success.
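The basis sensitivity discussed above can be seen in a two-dimensional sketch. The quadratic objective, the rotation angle, and all function names here are assumptions for illustration and do not reproduce the paper's experiments. The code takes one optimizer step in a rotated coordinate system, maps it back, and compares it with the step taken in the original coordinates: a gradient descent step is unchanged, while the first Adam step (which, after bias correction, is an elementwise normalized gradient) is not.

```python
import math


def rotate(v, angle):
    """Rotate a 2D vector counter-clockwise by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [c * v[0] - s * v[1], s * v[0] + c * v[1]]


def grad(x):
    """Gradient of the toy objective f(x) = 0.5 * (x0^2 + 10 * x1^2)."""
    return [x[0], 10.0 * x[1]]


def gd_step(g, lr=0.1):
    """Plain gradient descent update for gradient g."""
    return [-lr * gi for gi in g]


def adam_first_step(g, lr=0.1, eps=1e-8):
    """First Adam update: with bias correction, m_hat = g and v_hat = g^2."""
    return [-lr * gi / (math.sqrt(gi * gi) + eps) for gi in g]


def step_in_rotated_basis(step_fn, x, angle):
    """Take the step in coordinates y = R x, then map it back to x-space.

    The gradient of f(R^T y) with respect to y is R times the gradient of f.
    """
    g_rot = rotate(grad(x), angle)
    return rotate(step_fn(g_rot), -angle)


def distance(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])


x = [1.0, 1.0]
angle = math.pi / 4
gd_gap = distance(gd_step(grad(x)), step_in_rotated_basis(gd_step, x, angle))
adam_gap = distance(adam_first_step(grad(x)),
                    step_in_rotated_basis(adam_first_step, x, angle))
```

Here `gd_gap` is zero up to floating-point error, whereas `adam_gap` is clearly nonzero, which is the sense in which Adam depends on the choice of basis.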
Quantized Disentanglement: A Practical Approach
Vitória Barin-Pacela
P Vincent
Performative Prediction on Games and Mechanism Design
Accelerating Training with Neuron Interaction and Nowcasting Networks
Neural network training can be accelerated when a learnable update rule is used in lieu of classic adaptive optimizers (e.g. Adam). However, learnable update rules can be costly and unstable to train and use. Recently, Jang et al. (2023) proposed a simpler approach to accelerate training based on weight nowcaster networks (WNNs). In their approach, Adam is used for most of the optimization steps and periodically, only every few steps, a WNN nowcasts (predicts near future) parameters. We improve WNNs by proposing neuron interaction and nowcasting (NiNo) networks. In contrast to WNNs, NiNo leverages neuron connectivity and graph neural networks to more accurately nowcast parameters. We further show that in some networks, such as Transformers, modeling neuron connectivity accurately is challenging. We address this and other limitations, which allows NiNo to accelerate Adam training by up to 50% in vision and language tasks.
Feasible Learning
Ignacio Hounie
Juan Elenter
Jose Gallego-Posada
Alejandro Ribeiro
We introduce Feasible Learning (FL), a sample-centric learning paradigm where models are trained by solving a feasibility problem that bounds the loss for each training sample. In contrast to the ubiquitous Empirical Risk Minimization (ERM) framework, which optimizes for average performance, FL demands satisfactory performance on every individual data point. Since any model that meets the prescribed performance threshold is a valid FL solution, the choice of optimization algorithm and its dynamics play a crucial role in shaping the properties of the resulting solutions. In particular, we study a primal-dual approach which dynamically re-weights the importance of each sample during training. To address the challenge of setting a meaningful threshold in practice, we introduce a relaxation of FL that incorporates slack variables of minimal norm. Our empirical analysis, spanning image classification, age regression, and preference optimization in large language models, demonstrates that models trained via FL can learn from data while displaying improved tail behavior compared to ERM, with only a marginal impact on average performance.
On PI Controllers for Updating Lagrange Multipliers in Constrained Optimization
Tianyue H. Zhang
Jose Gallego-Posada
Constrained optimization offers a powerful framework to prescribe desired behaviors in neural network models. Typically, constrained problems are solved via their min-max Lagrangian formulations, which exhibit unstable oscillatory dynamics when optimized using gradient descent-ascent. The adoption of constrained optimization techniques in the machine learning community is currently limited by the lack of reliable, general-purpose update schemes for the Lagrange multipliers. This paper proposes the