Ioannis Mitliagkas

Core Academic Member
Canada CIFAR AI Chair
Associate Professor, Université de Montréal, Department of Computer Science and Operations Research

Biography

I am an associate professor in the Department of Computer Science and Operations Research (DIRO) at Université de Montréal, as well as a core academic member of Mila – Quebec Artificial Intelligence Institute and a Canada CIFAR AI Chair. I hold a part-time position as a staff research scientist at Google DeepMind Montréal.

Previously, I was a postdoctoral scholar in the departments of statistics and computer science at Stanford University. I obtained my PhD from the Department of Electrical and Computer Engineering at the University of Texas at Austin.

My research interests lie in statistical learning and inference, with a focus on optimization, efficient large-scale and distributed algorithms, statistical learning theory and MCMC methods. My recent research has focused on methods for efficient and adaptive optimization, understanding the interaction between optimization and the dynamics of large-scale learning systems, and the dynamics of games.

Current Students

PhD - Université de Montréal
Co-supervisor :
Research Intern - Université de Montréal
PhD - Université de Montréal
PhD - Université de Montréal
Postdoctorate - McGill University
Principal supervisor :
Master's Research - Université de Montréal
Principal supervisor :
PhD - Université de Montréal
Co-supervisor :
PhD - Université de Montréal
Principal supervisor :

Publications

Adversarial target-invariant representation learning for domain generalization
Isabela Albuquerque
Joao Monteiro
Tiago Falk
In many applications of machine learning, the training and test set data come from different distributions, or domains. A number of domain generalization strategies have been introduced with the goal of achieving good performance on out-of-distribution data. In this paper, we propose an adversarial approach to the problem. We propose a process that enforces pair-wise domain invariance while training a feature extractor over a diverse set of domains. We show that this process ensures invariance to any distribution that can be expressed as a mixture of the training domains. Following this insight, we then introduce an adversarial approach in which pair-wise divergences are estimated and minimized. Experiments on two domain generalization benchmarks for object recognition (i.e., PACS and VLCS) show that the proposed method yields higher average accuracy on the target domains in comparison to previously introduced adversarial strategies, as well as recently proposed methods based on learning invariant representations.
Generalizing to unseen domains via distribution matching
Isabela Albuquerque
Joao Monteiro
Mohammad-Javad Darvishi-Bayazi
Tiago Falk
Supervised learning results typically rely on assumptions of i.i.d. data. Unfortunately, those assumptions are commonly violated in practice. In this work, we tackle this problem by focusing on domain generalization: a formalization where the data generating process at test time may yield samples from never-before-seen domains (distributions). Our work relies on a simple lemma: by minimizing a notion of discrepancy between all pairs from a set of given domains, we also minimize the discrepancy between any pairs of mixtures of domains. Using this result, we derive a generalization bound for our setting. We then show that low risk over unseen domains can be achieved by representing the data in a space where (i) the training distributions are indistinguishable, and (ii) relevant information for the task at hand is preserved. Minimizing the terms in our bound yields an adversarial formulation which estimates and minimizes pairwise discrepancies. We validate our proposed strategy on standard domain generalization benchmarks, outperforming a number of recently introduced methods. Notably, we tackle a real-world application where the underlying data corresponds to multi-channel electroencephalography time series from different subjects, each considered as a distinct domain.
Negative Momentum for Improved Game Dynamics
Reyhane Askari Hemmat
Mohammad Pezeshki
Gabriel Huang
Rémi Le Priol
Games generalize the single-objective optimization paradigm by introducing different objective functions for different players. Differentiable games often proceed by simultaneous or alternating gradient updates. In machine learning, games are gaining new importance through formulations like generative adversarial networks (GANs) and actor-critic systems. However, compared to single-objective optimization, game dynamics are more complex and less understood. In this paper, we analyze gradient-based methods with momentum on simple games. We prove that alternating updates are more stable than simultaneous updates. Next, we show both theoretically and empirically that alternating gradient updates with a negative momentum term achieve convergence not only in a difficult toy adversarial problem, but also on the notoriously difficult-to-train saturating GANs.
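The stability claims in this abstract can be illustrated on a toy problem. The sketch below is not taken from the paper: it assumes the scalar bilinear game min_x max_y x·y, heavy-ball style momentum, and illustrative values for the step size (`eta = 1.0`) and momentum (`beta = -1/16`). The function name `run_game` and its parameters are hypothetical.

```python
import math

def run_game(eta, beta, steps, alternating=True, x0=1.0, y0=1.0):
    """Gradient play on f(x, y) = x * y, where x minimizes and y maximizes.

    Returns the distance of the final iterate from the equilibrium (0, 0).
    """
    x, y = x0, y0
    x_prev, y_prev = x0, y0
    for _ in range(steps):
        # x takes a descent step on x * y, plus a momentum term.
        x_new = x - eta * y + beta * (x - x_prev)
        # Alternating play lets y react to the fresh x;
        # simultaneous play uses the old x.
        x_seen = x_new if alternating else x
        # y takes an ascent step on x * y, plus a momentum term.
        y_new = y + eta * x_seen + beta * (y - y_prev)
        x_prev, y_prev = x, y
        x, y = x_new, y_new
    return math.hypot(x, y)

# Illustrative settings (assumptions, not values quoted from the paper).
dist_simultaneous = run_game(eta=1.0, beta=0.0, steps=200, alternating=False)
dist_alternating = run_game(eta=1.0, beta=0.0, steps=200)
dist_negative = run_game(eta=1.0, beta=-1.0 / 16.0, steps=200)

print(dist_simultaneous, dist_alternating, dist_negative)
```

Under these assumptions, simultaneous updates without momentum spiral outward, alternating updates without momentum cycle at a bounded distance, and alternating updates with the negative momentum term contract toward the equilibrium.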
Multi-objective training of Generative Adversarial Networks with multiple discriminators
Isabela Albuquerque
Joao Monteiro
Thang Doan
Breandan Considine
Tiago Falk
Recent literature has demonstrated promising results for training Generative Adversarial Networks by employing a set of discriminators, in contrast to the traditional game involving one generator against a single adversary. Such methods perform single-objective optimization on some simple consolidation of the losses, e.g. an arithmetic average. In this work, we revisit the multiple-discriminator setting by framing the simultaneous minimization of losses provided by different models as a multi-objective optimization problem. Specifically, we evaluate the performance of multiple gradient descent and the hypervolume maximization algorithm on a number of different datasets. Moreover, we argue that the previously proposed methods and hypervolume maximization can all be seen as variations of multiple gradient descent in which the update direction can be computed efficiently. Our results indicate that hypervolume maximization presents a better compromise between sample quality and computational cost than previous methods.
Reducing the variance in online optimization by transporting past gradients
Sébastien M. R. Arnold
Pierre-Antoine Manzagol
Reza Babanezhad Harikandeh
Most stochastic optimization methods use gradients once before discarding them. While variance reduction methods have shown that reusing past gradients can be beneficial when there is a finite number of datapoints, they do not easily extend to the online setting. One issue is the staleness due to using past gradients. We propose to correct this staleness using the idea of implicit gradient transport (IGT) which transforms gradients computed at previous iterates into gradients evaluated at the current iterate without using the Hessian explicitly. In addition to reducing the variance and bias of our updates over time, IGT can be used as a drop-in replacement for the gradient estimate in a number of well-understood methods such as heavy ball or Adam. We show experimentally that it achieves state-of-the-art results on a wide range of architectures and benchmarks. Additionally, the IGT gradient estimator yields the optimal asymptotic convergence rate for online stochastic optimization in the restricted setting where the Hessians of all component functions are equal.
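The transport idea can be checked numerically in the equal-Hessian setting mentioned at the end of this abstract. The sketch below assumes one particular form of the update, g_t = t/(t+1) · g_{t-1} + 1/(t+1) · ∇f(θ_t + t(θ_t − θ_{t-1})), applied to hypothetical one-dimensional quadratics f_i(θ) = h/2 · (θ − c_i)² with random centers c_i. The function name `igt_on_quadratic` and all constants are illustrative, not taken from the paper.

```python
import random

def igt_on_quadratic(h=2.0, alpha=0.1, steps=50, seed=0):
    """Run the assumed IGT update on quadratics sharing the same Hessian h.

    Returns a list of (igt_estimate, exact_average) pairs, where
    exact_average is the mean of all sampled gradients re-evaluated
    at the current iterate.
    """
    rng = random.Random(seed)
    theta_prev = theta = 5.0
    g = 0.0
    centers = []
    history = []
    for t in range(steps):
        c = rng.gauss(0.0, 1.0)          # sample one component function
        centers.append(c)
        # Evaluate the new gradient at an extrapolated point, not at theta.
        shifted = theta + t * (theta - theta_prev)
        grad = h * (shifted - c)
        # Running average of transported gradients.
        g = (t / (t + 1)) * g + grad / (t + 1)
        # Reference value: every past gradient recomputed at theta.
        exact = h * (theta - sum(centers) / len(centers))
        history.append((g, exact))
        theta_prev, theta = theta, theta - alpha * g
    return history

history = igt_on_quadratic()
max_gap = max(abs(g - exact) for g, exact in history)
print(max_gap)
```

In this restricted setting the running estimate matches the average of all past gradients evaluated at the current iterate up to floating-point error, even though no Hessian is ever formed and each sample's gradient is computed only once.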
Accelerated Stochastic Power Iteration
Peng Xu
Bryan Dawei He
Christopher De Sa
Christopher Ré
Principal component analysis (PCA) is one of the most powerful tools in machine learning. The simplest method for PCA, the power iteration, requires O(1/Δ) full-data passes to recover the principal component of a matrix with eigen-gap Δ. Lanczos, a significantly more complex method, achieves an accelerated rate of O(1/√Δ) passes. Modern applications, however, motivate methods that only ingest a subset of available data, known as the stochastic setting. In the online stochastic setting, simple algorithms like Oja's iteration achieve the optimal sample complexity O(σ²/Δ²). Unfortunately, they are fully sequential, and also require O(σ²/Δ²) iterations, far from the O(1/√Δ) rate of Lanczos. We propose a simple variant of the power iteration with an added momentum term that achieves both the optimal sample and iteration complexity. In the full-pass setting, standard analysis shows that momentum achieves the accelerated rate, O(1/√Δ). We demonstrate empirically that naively applying momentum to a stochastic method does not result in acceleration. We perform a novel, tight variance analysis that reveals the "breaking-point variance" beyond which this acceleration does not occur. By combining this insight with modern variance reduction techniques, we construct stochastic PCA algorithms, for the online and offline setting, that achieve an accelerated iteration complexity O(1/√Δ). Due to the embarrassingly parallel nature of our methods, this acceleration translates directly to wall-clock time if deployed in a parallel environment. Our approach is very general, and applies to many non-convex optimization problems that can now be accelerated using the same technique.
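The full-pass case described in this abstract can be sketched in a few lines. The example below covers only the deterministic setting, not the stochastic or variance-reduced algorithms. It assumes the recurrence w_{t+1} = A·w_t − β·w_{t-1} with both iterates rescaled by the same factor, a hypothetical 3×3 diagonal test matrix, and the choice β = λ₂²/4, which presumes the second eigenvalue λ₂ is known. All function names are illustrative.

```python
import math

def matvec(A, v):
    """Multiply a dense matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def power_iteration(A, w0, steps, beta=0.0):
    """Power iteration with an optional momentum term (beta=0 is the plain method)."""
    n0 = norm(w0)
    w = [x / n0 for x in w0]
    w_prev = [0.0] * len(w)
    for _ in range(steps):
        Aw = matvec(A, w)
        w_next = [a - beta * p for a, p in zip(Aw, w_prev)]
        # Rescale both iterates by the same factor so the linear
        # recurrence is preserved while the vectors stay bounded.
        scale = norm(w_next)
        w_prev = [x / scale for x in w]
        w = [x / scale for x in w_next]
    return w

def misalignment(w, v):
    """Sine of the angle between unit vectors w and v."""
    cos = sum(a * b for a, b in zip(w, v))
    return math.sqrt(max(0.0, 1.0 - cos * cos))

# Hypothetical test problem: eigenvalues 1.0, 0.9, 0.5, so the eigen-gap is 0.1.
A = [[1.0, 0.0, 0.0], [0.0, 0.9, 0.0], [0.0, 0.0, 0.5]]
top = [1.0, 0.0, 0.0]          # true principal component
w0 = [1.0, 1.0, 1.0]
beta = 0.9 ** 2 / 4            # assumed choice: (second eigenvalue)^2 / 4

err_plain = misalignment(power_iteration(A, w0, 20), top)
err_momentum = misalignment(power_iteration(A, w0, 20, beta), top)
print(err_plain, err_momentum)
```

With the same 20 matrix-vector products, the momentum variant ends noticeably closer to the principal component than the plain iteration on this test matrix.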
Deep Learning @15 Petaflops/second: Semi-supervised pattern detection for 15 Terabytes of climate data
W. Collins
M. Wehner
M. Prabhat
Thorsten Kurth
Nadathur Satish
Jian Zhang
Evan Racah
Md. Mostofa Ali Patwary
Narayanan Sundaram
Pradeep Dubey