Portrait of Ioannis Mitliagkas

Ioannis Mitliagkas

Core Academic Member
Canada CIFAR AI Chair
Associate Professor, Université de Montréal, Department of Computer Science and Operations Research
Research Scientist, Google DeepMind
Research Topics
Deep Learning
Distributed Systems
Dynamical Systems
Generative Models
Machine Learning Theory
Optimization
Representation Learning

Biography

Ioannis Mitliagkas (Γιάννης Μητλιάγκας) is an associate professor in the Department of Computer Science and Operations Research (DIRO) at Université de Montréal, as well as a Core Academic member of Mila – Quebec Artificial Intelligence Institute and a Canada CIFAR AI Chair. He holds a part-time position as a staff research scientist at Google DeepMind Montréal.

Previously, he was a postdoctoral scholar in the Departments of statistics and computer science at Stanford University. He obtained his PhD from the Department of Electrical and Computer Engineering at the University of Texas at Austin.

His research includes topics in machine learning, with emphasis on optimization, deep learning theory, statistical learning. His recent work includes methods for efficient and adaptive optimization, studying the interaction between optimization and the dynamics of large-scale learning systems and the dynamics of games.

Current Students

PhD - Université de Montréal
Research Intern - Université de Montréal
Research Intern - Université de Montréal
Co-supervisor :
PhD - Université de Montréal
Postdoctorate - McGill University
Principal supervisor :
PhD - Université de Montréal
PhD - Université de Montréal
Co-supervisor :
PhD - Université de Montréal
Principal supervisor :
Professional Master's - Université de Montréal
PhD - Université de Montréal
PhD - Université de Montréal
Master's Research - Université de Montréal

Publications

Synergies between Disentanglement and Sparsity: Generalization and Identifiability in Multi-Task Learning
Sébastien Lachapelle
Tristan Deleu
Divyat Mahajan
Quentin Bertrand
Although disentangled representations are often said to be beneficial for downstream tasks, current empirical and theoretical understanding … (see more)is limited. In this work, we provide evidence that disentangled representations coupled with sparse task-specific predictors improve generalization. In the context of multi-task learning, we prove a new identifiability result that provides conditions under which maximally sparse predictors yield disentangled representations. Motivated by this theoretical result, we propose a practical approach to learn disentangled representations based on a sparsity-promoting bi-level optimization problem. Finally, we explore a meta-learning version of this algorithm based on group Lasso multiclass SVM predictors, for which we derive a tractable dual formulation. It obtains competitive results on standard few-shot classification benchmarks, while each task is using only a fraction of the learned representations.
Synergies between Disentanglement and Sparsity: Generalization and Identifiability in Multi-Task Learning
Sébastien Lachapelle
Tristan Deleu
Divyat Mahajan
Quentin Bertrand
Although disentangled representations are often said to be beneficial for downstream tasks, current empirical and theoretical understanding … (see more)is limited. In this work, we provide evidence that disentangled representations coupled with sparse base-predictors improve generalization. In the context of multi-task learning, we prove a new identifiability result that provides conditions under which maximally sparse base-predictors yield disentangled representations. Motivated by this theoretical result, we propose a practical approach to learn disentangled representations based on a sparsity-promoting bi-level optimization problem. Finally, we explore a meta-learning version of this algorithm based on group Lasso multiclass SVM base-predictors, for which we derive a tractable dual formulation. It obtains competitive results on standard few-shot classification benchmarks, while each task is using only a fraction of the learned representations.
Synergies between Disentanglement and Sparsity: Generalization and Identifiability in Multi-Task Learning
Sébastien Lachapelle
Tristan Deleu
Divyat Mahajan
Quentin Bertrand
Although disentangled representations are often said to be beneficial for downstream tasks, current empirical and theoretical understanding … (see more)is limited. In this work, we provide evidence that disentangled representations coupled with sparse base-predictors improve generalization. In the context of multi-task learning, we prove a new identifiability result that provides conditions under which maximally sparse base-predictors yield disentangled representations. Motivated by this theoretical result, we propose a practical approach to learn disentangled representations based on a sparsity-promoting bi-level optimization problem. Finally, we explore a meta-learning version of this algorithm based on group Lasso multiclass SVM base-predictors, for which we derive a tractable dual formulation. It obtains competitive results on standard few-shot classification benchmarks, while each task is using only a fraction of the learned representations.
Towards efficient representation identification in supervised learning
Kartik Ahuja
Divyat Mahajan
Vasilis Syrgkanis
Humans have a remarkable ability to disentangle complex sensory inputs (e.g., image, text) into simple factors of variation (e.g., shape, co… (see more)lor) without much supervision. This ability has inspired many works that attempt to solve the following question: how do we invert the data generation process to extract those factors with minimal or no supervision? Several works in the literature on non-linear independent component analysis have established this negative result; without some knowledge of the data generation process or appropriate inductive biases, it is impossible to perform this inversion. In recent years, a lot of progress has been made on disentanglement under structural assumptions, e.g., when we have access to auxiliary information that makes the factors of variation conditionally independent. However, existing work requires a lot of auxiliary information, e.g., in supervised classification, it prescribes that the number of label classes should be at least equal to the total dimension of all factors of variation. In this work, we depart from these assumptions and ask: a) How can we get disentanglement when the auxiliary information does not provide conditional independence over the factors of variation? b) Can we reduce the amount of auxiliary information required for disentanglement? For a class of models where auxiliary information does not ensure conditional independence, we show theoretically and experimentally that disentanglement (to a large extent) is possible even when the auxiliary information dimension is much less than the dimension of the true latent representation.
A Unified Approach to Reinforcement Learning, Quantal Response Equilibria, and Two-Player Zero-Sum Games
Samuel Sokota
Ryan D’orazio
J. Z. Kolter
Nicolas Loizou
Marc Lanctot
Noam Brown
Christian Kroer
This work studies an algorithm, which we call magnetic mirror descent, that is inspired by mirror descent and the non-Euclidean proximal gra… (see more)dient algorithm. Our contribution is demonstrating the virtues of magnetic mirror descent as both an equilibrium solver and as an approach to reinforcement learning in two-player zero-sum games. These virtues include: 1) Being the first quantal response equilibria solver to achieve linear convergence for extensive-form games with first order feedback; 2) Being the first standard reinforcement learning algorithm to achieve empirically competitive results with CFR in tabular settings; 3) Achieving favorable performance in 3x3 Dark Hex and Phantom Tic-Tac-Toe as a self-play deep reinforcement learning algorithm.
Gradient Descent Is Optimal Under Lower Restricted Secant Inequality And Upper Error Bound
Charles Guille-Escuret
Baptiste Goujaud
Adam Ibrahim
The study of first-order optimization is sensitive to the assumptions made on the objective functions. These assumptions induce complexity c… (see more)lasses which play a key role in worst-case analysis, including the fundamental concept of algorithm optimality. Recent work argues that strong convexity and smoothness—popular assumptions in literature—lead to a pathological definition of the condition number. Motivated by this result, we focus on the class of functions satisfying a lower restricted secant inequality and an upper error bound. On top of being robust to the aforementioned pathological behavior and including some non-convex functions, this pair of conditions displays interesting geometrical properties. In particular, the necessary and sufficient conditions to interpolate a set of points and their gradients within the class can be separated into simple conditions on each sampled gradient. This allows the performance estimation problem (PEP) to be solved analytically, leading to a lower bound on the convergence rate that proves gradient descent to be exactly optimal on this class of functions among all first-order algorithms.
Stochastic Gradient Descent-Ascent and Consensus Optimization for Smooth Games: Convergence Analysis under Expected Co-coercivity
Two of the most prominent algorithms for solving unconstrained smooth games are the classical stochastic gradient descent-ascent (SGDA) and … (see more)the recently introduced stochastic consensus optimization (SCO) [Mescheder et al., 2017]. SGDA is known to converge to a stationary point for specific classes of games, but current convergence analyses require a bounded variance assumption. SCO is used successfully for solving large-scale adversarial problems, but its convergence guarantees are limited to its deterministic variant. In this work, we introduce the expected co-coercivity condition, explain its benefits, and provide the first last-iterate convergence guarantees of SGDA and SCO under this condition for solving a class of stochastic variational inequality problems that are potentially non-monotone. We prove linear convergence of both methods to a neighborhood of the solution when they use constant step-size, and we propose insightful stepsize-switching rules to guarantee convergence to the exact solution. In addition, our convergence guarantees hold under the arbitrary sampling paradigm, and as such, we give insights into the complexity of minibatching.
Gotta Go Fast When Generating Data with Score-Based Models
Alexia Jolicoeur-Martineau
Ke Li
Rémi Piché-Taillefer
Tal Kachman
Score-based (denoising diffusion) generative models have recently gained a lot of success in generating realistic and diverse data. These ap… (see more)proaches define a forward diffusion process for transforming data to noise and generate data by reversing it (thereby going from noise to data). Unfortunately, current score-based models generate data very slowly due to the sheer number of score network evaluations required by numerical SDE solvers. In this work, we aim to accelerate this process by devising a more efficient SDE solver. Existing approaches rely on the Euler-Maruyama (EM) solver, which uses a fixed step size. We found that naively replacing it with other SDE solvers fares poorly - they either result in low-quality samples or become slower than EM. To get around this issue, we carefully devise an SDE solver with adaptive step sizes tailored to score-based generative models piece by piece. Our solver requires only two score function evaluations, rarely rejects samples, and leads to high-quality samples. Our approach generates data 2 to 10 times faster than EM while achieving better or equal sample quality. For high-resolution images, our method leads to significantly higher quality samples than all other methods tested. Our SDE solver has the benefit of requiring no step size tuning.
A Study of Condition Numbers for First-Order Optimization
Charles Guille-Escuret
Baptiste Goujaud
Manuela Girotti
Adversarial score matching and improved sampling for image generation
Alexia Jolicoeur-Martineau
Rémi Piché-Taillefer
Remi Tachet des Combes
Denoising Score Matching with Annealed Langevin Sampling (DSM-ALS) has recently found success in generative modeling. The approach works by … (see more)first training a neural network to estimate the score of a distribution, and then using Langevin dynamics to sample from the data distribution assumed by the score network. Despite the convincing visual quality of samples, this method appears to perform worse than Generative Adversarial Networks (GANs) under the Fréchet Inception Distance, a standard metric for generative models. We show that this apparent gap vanishes when denoising the final Langevin samples using the score network. In addition, we propose two improvements to DSM-ALS: 1) Consistent Annealed Sampling as a more stable alternative to Annealed Langevin Sampling, and 2) a hybrid training formulation, composed of both Denoising Score Matching and adversarial objectives. By combining these two techniques and exploring different network architectures, we elevate score matching methods and obtain results competitive with state-of-the-art image generation on CIFAR-10.
Invariance Principle Meets Information Bottleneck for Out-of-Distribution Generalization
Kartik Ahuja
Ethan Caballero
Dinghuai Zhang
Jean-Christophe Gagnon-Audet
The invariance principle from causality is at the heart of notable approaches such as invariant risk minimization (IRM) that seek to address… (see more) out-of-distribution (OOD) generalization failures. Despite the promising theory, invariance principle-based approaches fail in common classification tasks, where invariant (causal) features capture all the information about the label. Are these failures due to the methods failing to capture the invariance? Or is the invariance principle itself insufficient? To answer these questions, we revisit the fundamental assumptions in linear regression tasks, where invariance-based approaches were shown to provably generalize OOD. In contrast to the linear regression tasks, we show that for linear classification tasks we need much stronger restrictions on the distribution shifts, or otherwise OOD generalization is impossible. Furthermore, even with appropriate restrictions on distribution shifts in place, we show that the invariance principle alone is insufficient. We prove that a form of the information bottleneck constraint along with invariance helps address key failures when invariant features capture all the information about the label and also retains the existing success when they do not. We propose an approach that incorporates both of these principles and demonstrate its effectiveness in several experiments.
A Study of Condition Numbers for First-Order Optimization
Charles Guille-Escuret
Baptiste Goujaud
Manuela Girotti
The study of first-order optimization algorithms (FOA) typically starts with assumptions on the objective functions, most commonly smoothnes… (see more)s and strong convexity. These metrics are used to tune the hyperparameters of FOA. We introduce a class of perturbations quantified via a new norm, called *-norm. We show that adding a small perturbation to the objective function has an equivalently small impact on the behavior of any FOA, which suggests that it should have a minor impact on the tuning of the algorithm. However, we show that smoothness and strong convexity can be heavily impacted by arbitrarily small perturbations, leading to excessively conservative tunings and convergence issues. In view of these observations, we propose a notion of continuity of the metrics, which is essential for a robust tuning strategy. Since smoothness and strong convexity are not continuous, we propose a comprehensive study of existing alternative metrics which we prove to be continuous. We describe their mutual relations and provide their guaranteed convergence rates for the Gradient Descent algorithm accordingly tuned. Finally we discuss how our work impacts the theoretical understanding of FOA and their performances.