Portrait de Simon Lacoste-Julien

Simon Lacoste-Julien

Membre académique principal
Chaire en IA Canada-CIFAR
Directeur scientifique adjoint, Mila, Professeur titulaire, Université de Montréal, Département d'informatique et de recherche opérationnelle
Vice-président et directeur de laboratoire, Samsung Advanced Institute of Technology (SAIT) AI Lab, Montréal
Sujets de recherche
Apprentissage profond
Causalité
Modèles génératifs
Modèles probabilistes
Optimisation
Théorie de l'apprentissage automatique
Traitement du langage naturel
Vision par ordinateur

Biographie

Simon Lacoste-Julien est professeur agrégé au Département d'informatique et de recherche opérationnelle (DIRO) de l'Université de Montréal, membre cofondateur de Mila – Institut québécois d’intelligence artificielle et titulaire d'une chaire en IA Canada-CIFAR. Il dirige également à temps partiel le SAIT AI Lab Montréal.

Ses recherches portent sur l'apprentissage automatique et les mathématiques appliquées, et intègrent des applications à la vision artificielle et au traitement du langage naturel. Il a obtenu une licence en mathématiques, physique et informatique à l’Université McGill, un doctorat en informatique à l’Université de Californie à Berkeley et un postdoctorat à l'Université de Cambridge.

Il a passé quelques années à l'Institut national de recherche en sciences et technologies du numérique (INRIA) et à l'École normale supérieure de Paris en tant que professeur de recherche avant de revenir à Montréal, en 2016, pour répondre à l'appel de Yoshua Bengio et contribuer à la croissance de l'écosystème de l'IA à Montréal.

Étudiants actuels

Visiteur de recherche indépendant - Samsung SAIT
Visiteur de recherche indépendant - Samsung SAIT
Visiteur de recherche indépendant - Samsung
Doctorat - UdeM
Visiteur de recherche indépendant - Samsung SAIT
Visiteur de recherche indépendant - Samsung SAIT
Collaborateur·rice de recherche - UdeM
Collaborateur·rice de recherche - UdeM
Visiteur de recherche indépendant - UdeM
Visiteur de recherche indépendant - Samsung - SAIT
Doctorat - UdeM
Co-superviseur⋅e :
Doctorat - UdeM
Co-superviseur⋅e :
Collaborateur·rice alumni - UdeM
Visiteur de recherche indépendant - Univeristy of Tübingen
Doctorat - UdeM
Co-superviseur⋅e :
Visiteur de recherche indépendant - Samsung SAIT
Collaborateur·rice de recherche - UdeM
Doctorat - UdeM
Visiteur de recherche indépendant - Samsung SAIT

Publications

SVRG meets AdaGrad: painless variance reduction
Controlled Sparsity via Constrained Optimization or: How I Learned to Stop Tuning Penalties and Love Constraints
The performance of trained neural networks is robust to harsh levels of pruning. Coupled with the ever-growing size of deep learning models,… (voir plus) this observation has motivated extensive research on learning sparse models. In this work, we focus on the task of controlling the level of sparsity when performing sparse learning. Existing methods based on sparsity-inducing penalties involve expensive trial-and-error tuning of the penalty factor, thus lacking direct control of the resulting model sparsity. In response, we adopt a constrained formulation: using the gate mechanism proposed by Louizos et al. (2018), we formulate a constrained optimization problem where sparsification is guided by the training objective and the desired sparsity target in an end-to-end fashion. Experiments on CIFAR-{10, 100}, TinyImageNet, and ImageNet using WideResNet and ResNet{18, 50} models validate the effectiveness of our proposal and demonstrate that we can reliably achieve pre-determined sparsity targets without compromising on predictive performance.
Partial Disentanglement via Mechanism Sparsity
Bayesian Structure Learning with Generative Flow Networks
In Bayesian structure learning, we are interested in inferring a distribution over the directed acyclic graph (DAG) structure of Bayesian ne… (voir plus)tworks, from data. Defining such a distribution is very challenging, due to the combinatorially large sample space, and approximations based on MCMC are often required. Recently, a novel class of probabilistic models, called Generative Flow Networks (GFlowNets), have been introduced as a general framework for generative modeling of discrete and composite objects, such as graphs. In this work, we propose to use a GFlowNet as an alternative to MCMC for approximating the posterior distribution over the structure of Bayesian networks, given a dataset of observations. Generating a sample DAG from this approximate distribution is viewed as a sequential decision problem, where the graph is constructed one edge at a time, based on learned transition probabilities. Through evaluation on both simulated and real data, we show that our approach, called DAG-GFlowNet, provides an accurate approximation of the posterior over DAGs, and it compares favorably against other methods based on MCMC or variational inference.
Data-Efficient Structured Pruning via Submodular Optimization
Structured pruning is an effective approach for compressing large pre-trained neural networks without significantly affecting their performa… (voir plus)nce. However, most current structured pruning methods do not provide any performance guarantees, and often require fine-tuning, which makes them inapplicable in the limited-data regime. We propose a principled data-efficient structured pruning method based on submodular optimization. In particular, for a given layer, we select neurons/channels to prune and corresponding new weights for the next layer, that minimize the change in the next layer's input induced by pruning. We show that this selection problem is a weakly submodular maximization problem, thus it can be provably approximated using an efficient greedy algorithm. Our method is guaranteed to have an exponentially decreasing error between the original model and the pruned model outputs w.r.t the pruned size, under reasonable assumptions. It is also one of the few methods in the literature that uses only a limited-number of training data and no labels. Our experimental results demonstrate that our method outperforms state-of-the-art methods in the limited-data regime.
Disentanglement via Mechanism Sparsity Regularization: A New Principle for Nonlinear ICA
Pau Rodríguez
Yash Sharma
Katie E Everett
Rémi Le Priol
Alexandre Lacoste
This work introduces a novel principle we call disentanglement via mechanism sparsity regularization, which can be applied when the latent f… (voir plus)actors of interest depend sparsely on past latent factors and/or observed auxiliary variables. We propose a representation learning method that induces disentanglement by simultaneously learning the latent factors and the sparse causal graphical model that relates them. We develop a rigorous identifiability theory, building on recent nonlinear independent component analysis (ICA) results, that formalizes this principle and shows how the latent variables can be recovered up to permutation if one regularizes the latent mechanisms to be sparse and if some graph connectivity criterion is satisfied by the data generating process. As a special case of our framework, we show how one can leverage unknown-target interventions on the latent factors to disentangle them, thereby drawing further connections between ICA and causality. We propose a VAE-based method in which the latent mechanisms are learned and regularized via binary masks, and validate our theory by showing it learns disentangled representations in simulations.
Dynamics of SGD with Stochastic Polyak Stepsizes: Truly Adaptive Variants and Convergence to Exact Solution
Recently, Loizou et al. (2021), proposed and analyzed stochastic gradient descent (SGD) with stochastic Polyak stepsize (SPS). The proposed … (voir plus)SPS comes with strong convergence guarantees and competitive performance; however, it has two main drawbacks when it is used in non-over-parameterized regimes: (i) It requires a priori knowledge of the optimal mini-batch losses, which are not available when the interpolation condition is not satisfied (e.g., regularized objectives), and (ii) it guarantees convergence only to a neighborhood of the solution. In this work, we study the dynamics and the convergence properties of SGD equipped with new variants of the stochastic Polyak stepsize and provide solutions to both drawbacks of the original SPS. We first show that a simple modification of the original SPS that uses lower bounds instead of the optimal function values can directly solve issue (i). On the other hand, solving issue (ii) turns out to be more challenging and leads us to valuable insights into the method's behavior. We show that if interpolation is not satisfied, the correlation between SPS and stochastic gradients introduces a bias, which effectively distorts the expectation of the gradient signal near minimizers, leading to non-convergence - even if the stepsize is scaled down during training. To fix this issue, we propose DecSPS, a novel modification of SPS, which guarantees convergence to the exact minimizer - without a priori knowledge of the problem parameters. For strongly-convex optimization problems, DecSPS is the first stochastic adaptive optimization method that converges to the exact solution without restrictive assumptions like bounded iterates/gradients.
Stochastic Gradient Descent-Ascent and Consensus Optimization for Smooth Games: Convergence Analysis under Expected Co-coercivity
Two of the most prominent algorithms for solving unconstrained smooth games are the classical stochastic gradient descent-ascent (SGDA) and … (voir plus)the recently introduced stochastic consensus optimization (SCO) [Mescheder et al., 2017]. SGDA is known to converge to a stationary point for specific classes of games, but current convergence analyses require a bounded variance assumption. SCO is used successfully for solving large-scale adversarial problems, but its convergence guarantees are limited to its deterministic variant. In this work, we introduce the expected co-coercivity condition, explain its benefits, and provide the first last-iterate convergence guarantees of SGDA and SCO under this condition for solving a class of stochastic variational inequality problems that are potentially non-monotone. We prove linear convergence of both methods to a neighborhood of the solution when they use constant step-size, and we propose insightful stepsize-switching rules to guarantee convergence to the exact solution. In addition, our convergence guarantees hold under the arbitrary sampling paradigm, and as such, we give insights into the complexity of minibatching.
Predicting Tactical Solutions to Operational Planning Problems under Imperfect Information
This paper offers a methodological contribution at the intersection of machine learning and operations research. Namely, we propose a method… (voir plus)ology to quickly predict tactical solutions to a given operational problem. In this context, the tactical solution is less detailed than the operational one but it has to be computed in very short time and under imperfect information. The problem is of importance in various applications where tactical and operational planning problems are interrelated and information about the operational problem is revealed over time. This is for instance the case in certain capacity planning and demand management systems. We formulate the problem as a two-stage optimal prediction stochastic program whose solution we predict with a supervised machine learning algorithm. The training data set consists of a large number of deterministic (second stage) problems generated by controlled probabilistic sampling. The labels are computed based on solutions to the deterministic problems (solved independently and offline) employing appropriate aggregation and subselection methods to address uncertainty. Results on our motivating application in load planning for rail transportation show that deep learning algorithms produce highly accurate predictions in very short computing time (milliseconds or less). The prediction accuracy is comparable to solutions computed by sample average approximation of the stochastic program.
Stochastic Polyak Step-size for SGD: An Adaptive Learning Rate for Fast Convergence
We propose a stochastic variant of the classical Polyak step-size (Polyak, 1987) commonly used in the subgradient method. Although computing… (voir plus) the Polyak step-size requires knowledge of the optimal function values, this information is readily available for typical modern machine learning applications. Consequently, the proposed stochastic Polyak step-size (SPS) is an attractive choice for setting the learning rate for stochastic gradient descent (SGD). We provide theoretical convergence guarantees for SGD equipped with SPS in different settings, including strongly convex, convex and non-convex functions. Furthermore, our analysis results in novel convergence guarantees for SGD with a constant step-size. We show that SPS is particularly effective when training over-parameterized models capable of interpolating the training data. In this setting, we prove that SPS enables SGD to converge to the true solution at a fast rate without requiring the knowledge of any problem-dependent constants or additional computational overhead. We experimentally validate our theoretical results via extensive experiments on synthetic and real datasets. We demonstrate the strong performance of SGD with SPS compared to state-of-the-art optimization methods when training over-parameterized models.
An Analysis of the Adaptation Speed of Causal Models
Consider a collection of datasets generated by unknown interventions on an unknown structural causal model …
Stochastic Hamiltonian Gradient Methods for Smooth Games
The success of adversarial formulations in machine learning has brought renewed motivation for smooth games. In this work, we focus on the c… (voir plus)lass of stochastic Hamiltonian methods and provide the first convergence guarantees for certain classes of stochastic smooth games. We propose a novel unbiased estimator for the stochastic Hamiltonian gradient descent (SHGD) and highlight its benefits. Using tools from the optimization literature we show that SHGD converges linearly to the neighbourhood of a stationary point. To guarantee convergence to the exact solution, we analyze SHGD with a decreasing step-size and we also present the first stochastic variance reduced Hamiltonian method. Our results provide the first global non-asymptotic last-iterate convergence guarantees for the class of stochastic unconstrained bilinear games and for the more general class of stochastic games that satisfy a "sufficiently bilinear" condition, notably including some non-convex non-concave problems. We supplement our analysis with experiments on stochastic bilinear and sufficiently bilinear games, where our theory is shown to be tight, and on simple adversarial machine learning formulations.