Portrait de Simon Lacoste-Julien

Simon Lacoste-Julien

Membre académique principal
Chaire en IA Canada-CIFAR
Directeur scientifique adjoint, Mila, Professeur titulaire, Université de Montréal, Département d'informatique et de recherche opérationnelle
Vice-président et directeur de laboratoire, Samsung Advanced Institute of Technology (SAIT) AI Lab, Montréal
Sujets de recherche
Apprentissage profond
Causalité
Modèles génératifs
Modèles probabilistes
Optimisation
Théorie de l'apprentissage automatique
Traitement du langage naturel
Vision par ordinateur

Biographie

Simon Lacoste-Julien est professeur agrégé au Département d'informatique et de recherche opérationnelle (DIRO) de l'Université de Montréal, membre cofondateur de Mila – Institut québécois d’intelligence artificielle et titulaire d'une chaire en IA Canada-CIFAR. Il dirige également à temps partiel le SAIT AI Lab Montréal.

Ses recherches portent sur l'apprentissage automatique et les mathématiques appliquées, et intègrent des applications à la vision artificielle et au traitement du langage naturel. Il a obtenu une licence en mathématiques, physique et informatique à l’Université McGill, un doctorat en informatique à l’Université de Californie à Berkeley et un postdoctorat à l'Université de Cambridge.

Il a passé quelques années à l'Institut national de recherche en sciences et technologies du numérique (INRIA) et à l'École normale supérieure de Paris en tant que professeur de recherche avant de revenir à Montréal, en 2016, pour répondre à l'appel de Yoshua Bengio et contribuer à la croissance de l'écosystème de l'IA à Montréal.

Étudiants actuels

Doctorat - UdeM
Co-superviseur⋅e :
Visiteur de recherche indépendant - Samsung SAIT
Visiteur de recherche indépendant - Samsung SAIT
Visiteur de recherche indépendant - Samsung
Doctorat - UdeM
Visiteur de recherche indépendant - Samsung SAIT
Visiteur de recherche indépendant - Samsung SAIT
Collaborateur·rice de recherche - UdeM
Collaborateur·rice de recherche - UdeM
Visiteur de recherche indépendant - UdeM
Visiteur de recherche indépendant - Samsung - SAIT
Doctorat - UdeM
Co-superviseur⋅e :
Collaborateur·rice alumni - UdeM
Visiteur de recherche indépendant - Univeristy of Tübingen
Doctorat - UdeM
Co-superviseur⋅e :
Visiteur de recherche indépendant - Samsung SAIT
Collaborateur·rice de recherche - UdeM
Doctorat - UdeM
Visiteur de recherche indépendant - Samsung SAIT

Publications

Dual Optimistic Ascent (PI Control) is the Augmented Lagrangian Method in Disguise
Constrained optimization is a powerful framework for enforcing requirements on neural networks. These constrained deep learning problems are… (voir plus) typically solved using first-order methods on their min-max Lagrangian formulation, but such approaches often suffer from oscillations and can fail to find all local solutions. While the Augmented Lagrangian method (ALM) addresses these issues, practitioners often favor dual optimistic ascent schemes (PI control) on the standard Lagrangian, which perform well empirically but lack formal guarantees. In this paper, we establish a previously unknown equivalence between these approaches: dual optimistic ascent on the Lagrangian is equivalent to gradient descent-ascent on the Augmented Lagrangian. This finding allows us to transfer the robust theoretical guarantees of the ALM to the dual optimistic setting, proving it converges linearly to all local solutions. Furthermore, the equivalence provides principled guidance for tuning the optimism hyper-parameter. Our work closes a critical gap between the empirical success of dual optimistic methods and their theoretical foundation.
Operationalizing Quantized Disentanglement
Vitória Barin-Pacela
P Vincent
Tight Lower Bounds and Improved Convergence in Performative Prediction
Performative prediction is a framework accounting for the shift in the data distribution induced by the prediction of a model deployed in th… (voir plus)e real world. Ensuring rapid convergence to a stable solution where the data distribution remains the same after the model deployment is crucial, especially in evolving environments. This paper extends the Repeated Risk Minimization (RRM) framework by utilizing historical datasets from previous retraining snapshots, yielding a class of algorithms that we call Affine Risk Minimizers and enabling convergence to a performatively stable point for a broader class of problems. We introduce a new upper bound for methods that use only the final iteration of the dataset and prove for the first time the tightness of both this new bound and the previous existing bounds within the same regime. We also prove that utilizing historical datasets can surpass the lower bound for last iterate RRM, and empirically observe faster convergence to the stable point on various performative prediction benchmarks. We offer at the same time the first lower bound analysis for RRM within the class of Affine Risk Minimizers, quantifying the potential improvements in convergence speed that could be achieved with other variants in our framework.
Understanding Adam Requires Better Rotation Dependent Assumptions
Despite its widespread adoption, Adam's advantage over Stochastic Gradient Descent (SGD) lacks a comprehensive theoretical explanation. This… (voir plus) paper investigates Adam's sensitivity to rotations of the parameter space. We observe that Adam's performance in training transformers degrades under random rotations of the parameter space, indicating a crucial sensitivity to the choice of basis in practice. This reveals that conventional rotation-invariant assumptions are insufficient to capture Adam's advantages theoretically. To better understand the rotation-dependent properties that benefit Adam, we also identify structured rotations that preserve or even enhance its empirical performance. We then examine the rotation-dependent assumptions in the literature and find that they fall short in explaining Adam's behaviour across various rotation types. In contrast, we verify the orthogonality of the update as a promising indicator of Adam's basis sensitivity, suggesting it may be the key quantity for developing rotation-dependent theoretical frameworks that better explain its empirical success.
Quantized Disentanglement: A Practical Approach
Vitória Barin-Pacela
P Vincent
Performative Prediction on Games and Mechanism Design
Accelerating Training with Neuron Interaction and Nowcasting Networks
Neural network training can be accelerated when a learnable update rule is used in lieu of classic adaptive optimizers (e.g. Adam). However,… (voir plus) learnable update rules can be costly and unstable to train and use. Recently, Jang et al. (2023) proposed a simpler approach to accelerate training based on weight nowcaster networks (WNNs). In their approach, Adam is used for most of the optimization steps and periodically, only every few steps, a WNN nowcasts (predicts near future) parameters. We improve WNNs by proposing neuron interaction and nowcasting (NiNo) networks. In contrast to WNNs, NiNo leverages neuron connectivity and graph neural networks to more accurately nowcast parameters. We further show that in some networks, such as Transformers, modeling neuron connectivity accurately is challenging. We address this and other limitations, which allows NiNo to accelerate Adam training by up to 50% in vision and language tasks.
Feasible Learning
Ignacio Hounie
Juan Elenter
Jose Gallego-Posada
Alejandro Ribeiro
We introduce Feasible Learning (FL), a sample-centric learning paradigm where models are trained by solving a feasibility problem that bound… (voir plus)s the loss for each training sample. In contrast to the ubiquitous Empirical Risk Minimization (ERM) framework, which optimizes for average performance, FL demands satisfactory performance \emph{on every individual data point}. Since any model that meets the prescribed performance threshold is a valid FL solution, the choice of optimization algorithm and its dynamics play a crucial role in shaping the properties of the resulting solutions. In particular, we study a primal-dual approach which dynamically re-weights the importance of each sample during training. To address the challenge of setting a meaningful threshold in practice, we introduce a relaxation of FL that incorporates slack variables of minimal norm. Our empirical analysis, spanning image classification, age regression, and preference optimization in large language models, demonstrates that models trained via FL can learn from data while displaying improved tail behavior compared to ERM, with only a marginal impact on average performance.
On PI Controllers for Updating Lagrange Multipliers in Constrained Optimization
Tianyue H. Zhang
Jose Gallego-Posada
Constrained optimization offers a powerful framework to prescribe desired behaviors in neural network models. Typically, constrained problem… (voir plus)s are solved via their min-max Lagrangian formulations, which exhibit unstable oscillatory dynamics when optimized using gradient descent-ascent. The adoption of constrained optimization techniques in the machine learning community is currently limited by the lack of reliable, general-purpose update schemes for the Lagrange multipliers. This paper proposes the
Promoting Exploration in Memory-Augmented Adam using Critical Momenta
Adaptive gradient-based optimizers, notably Adam, have left their mark in training large-scale deep learning models, offering fast convergen… (voir plus)ce and robustness to hyperparameter settings. However, they often struggle with generalization, attributed to their tendency to converge to sharp minima in the loss landscape. To address this, we propose a new memory-augmented version of Adam that encourages exploration towards flatter minima by incorporating a buffer of critical momentum terms during training. This buffer prompts the optimizer to overshoot beyond narrow minima, promoting exploration. Through comprehensive analysis in simple settings, we illustrate the efficacy of our approach in increasing exploration and bias towards flatter minima. We empirically demonstrate that it can improve model performance for image classification on ImageNet and CIFAR10/100, language modelling on Penn Treebank, and online learning tasks on TinyImageNet and 5-dataset. Our code is available at https://github.com/chandar-lab/CMOptimizer.
Weight-Sharing Regularization
Weight-sharing is ubiquitous in deep learning. Motivated by this, we propose a "weight-sharing regularization" penalty on the weights …
PopulAtion Parameter Averaging (PAPA)