Gauthier Gidel

Core Academic Member
Canada CIFAR AI Chair
Assistant Professor, Université de Montréal, Department of Computer Science and Operations Research
Research Topics
Generative Models
Machine Learning Theory
Optimization
Reinforcement Learning

Biography

I am an assistant professor in the Department of Computer Science and Operations Research (DIRO) at Université de Montréal, a core academic member of Mila – Quebec Artificial Intelligence Institute, and a Canada CIFAR AI Chair.

Previously, I was awarded a Borealis AI Graduate Fellowship, worked at DeepMind and Element AI, and was a Long-Term Visitor at the Simons Institute at UC Berkeley.

My research interests lie at the intersection of game theory, optimization and machine learning.

Current Students

Independent visiting researcher - UBC
Master's Research - Polytechnique Montréal
Principal supervisor :
Collaborating Alumni - Université de Montréal
Co-supervisor :
PhD - Université de Montréal
Co-supervisor :
Postdoctorate - McGill University
Principal supervisor :
PhD - McGill University
Principal supervisor :
PhD - Université de Montréal
PhD - Université de Montréal
Co-supervisor :
PhD - Université de Montréal
Co-supervisor :
PhD - Université de Montréal
Principal supervisor :
Independent visiting researcher - N/A
Independent visiting researcher - Technical University of Munich
PhD - Université de Montréal
PhD - Université de Montréal
Co-supervisor :
Independent visiting researcher - N/A

Publications

A General Framework For Proving The Equivariant Strong Lottery Ticket Hypothesis
Damien Ferbach
Christos Tsirigotis
Avishek Joey Bose
The Strong Lottery Ticket Hypothesis (SLTH) stipulates the existence of a subnetwork within a sufficiently overparameterized (dense) neural network that -- when initialized randomly and without any training -- achieves the accuracy of a fully trained target network. Recent works by da Cunha et al. (2022) and Burkholz (2022) demonstrate that the SLTH can be extended to translation equivariant networks -- i.e. CNNs -- with the same level of overparameterization as needed for the SLTs in dense networks. However, modern neural networks are capable of incorporating more than just translation symmetry, and developing general equivariant architectures such as rotation and permutation has been a powerful design principle. In this paper, we generalize the SLTH to functions that preserve the action of the group
Convergence of Proximal Point and Extragradient-Based Methods Beyond Monotonicity: the Case of Negative Comonotonicity
Eduard Gorbunov
Adrien Taylor
Samuel Horváth
Algorithms for min-max optimization and variational inequalities are often studied under monotonicity assumptions. Motivated by non-monotone machine learning applications, we follow the line of works (Diakonikolas et al., 2021; Lee & Kim, 2021; Pethick et al., 2022; Bohm, 2022) aiming at going beyond monotonicity by considering the weaker *negative comonotonicity* assumption. In this work, we provide tight complexity analyses for the Proximal Point (PP), Extragradient (EG), and Optimistic Gradient (OG) methods in this setup, closing several questions on their working guarantees beyond monotonicity. In particular, we derive the first non-asymptotic convergence rates for PP under negative comonotonicity and star-negative comonotonicity and show their tightness via constructing worst-case examples; we also relax the assumptions for the last-iterate convergence guarantees for EG and OG and prove the tightness of the existing best-iterate guarantees for EG and OG via constructing counter-examples.
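As an illustration of the extragradient update mentioned in this abstract, here is a minimal sketch in Python. It uses the classic bilinear game f(x, y) = x * y, which is monotone, so it does not exercise the negative comonotone setting the paper studies; the function names, step size, and starting point are assumptions chosen for the example.

```python
import math

def field(x, y):
    # Vector field F(x, y) = (df/dx, -df/dy) for the toy game f(x, y) = x * y.
    return y, -x

def gda_step(x, y, lr):
    # Simultaneous gradient descent-ascent: one gradient evaluation per step.
    gx, gy = field(x, y)
    return x - lr * gx, y - lr * gy

def extragradient_step(x, y, lr):
    # Extrapolate to a look-ahead point, then update the original point
    # using the field evaluated at the look-ahead point.
    gx, gy = field(x, y)
    x_half, y_half = x - lr * gx, y - lr * gy
    gx, gy = field(x_half, y_half)
    return x - lr * gx, y - lr * gy

def distance_to_solution(step, steps=200, lr=0.1):
    # The unique solution of the toy game is (0, 0).
    x, y = 1.0, 1.0
    for _ in range(steps):
        x, y = step(x, y, lr)
    return math.hypot(x, y)

if __name__ == "__main__":
    print("GDA:", distance_to_solution(gda_step))
    print("EG: ", distance_to_solution(extragradient_step))
```

On this toy problem the distance to the solution grows under plain descent-ascent and shrinks under extragradient, which is the usual motivation for the look-ahead step.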
Feature Likelihood Divergence: Evaluating the Generalization of Generative Models Using Samples
Marco Jiralerspong
Ian Gemp
Chongli Qin
Yoram Bachrach
Feature Likelihood Score: Evaluating Generalization of Generative Models Using Samples
Marco Jiralerspong
Avishek Joey Bose
Deep generative models have demonstrated the ability to generate complex, high-dimensional, and photo-realistic data. However, a unified framework for evaluating different generative modeling families remains a challenge. Indeed, likelihood-based metrics do not apply in many cases while pure sample-based metrics such as FID fail to capture known failure modes such as overfitting on training data. In this work, we introduce the Feature Likelihood Score (FLS), a parametric sample-based score that uses density estimation to quantitatively measure the quality/diversity of generated samples while taking into account overfitting. We empirically demonstrate the ability of FLS to identify specific overfitting problem cases, even when previously proposed metrics fail. We further perform an extensive experimental evaluation on various image datasets and model classes. Our results indicate that FLS matches intuitions of previous metrics, such as FID, while providing a more holistic evaluation of generative models that highlights models whose generalization abilities are under or overappreciated. Code for computing FLS is provided at https://github.com/marcojira/fls.
Nesterov Meets Optimism: Rate-Optimal Separable Minimax Optimization
Chris Junchi Li
Huizhuo Yuan
Angela Yuan
Quanquan Gu
Michael Jordan
We propose a new first-order optimization algorithm, AcceleratedGradient-OptimisticGradient (AG-OG) Descent Ascent, for separable convex-concave minimax optimization. The main idea of our algorithm is to carefully leverage the structure of the minimax problem, performing Nesterov acceleration on the individual component and optimistic gradient on the coupling component. Equipped with proper restarting, we show that AG-OG achieves the optimal convergence rate (up to a constant) for a variety of settings, including bilinearly coupled strongly convex-strongly concave minimax optimization (bi-SC-SC), bilinearly coupled convex-strongly concave minimax optimization (bi-C-SC), and bilinear games. We also extend our algorithm to the stochastic setting and achieve the optimal convergence rate in both bi-SC-SC and bi-C-SC settings. AG-OG is the first single-call algorithm with optimal convergence rates in both deterministic and stochastic settings for bilinearly coupled minimax optimization problems.
Performative Prediction with Neural Networks
Performative prediction is a framework for learning models that influence the data they intend to predict. We focus on finding classifiers that are performatively stable, i.e. optimal for the data distribution they induce. Standard convergence results for finding a performatively stable classifier with the method of repeated risk minimization assume that the data distribution is Lipschitz continuous with respect to the model's parameters. Under this assumption, the loss must be strongly convex and smooth in these parameters; otherwise, the method will diverge for some problems. In this work, we instead assume that the data distribution is Lipschitz continuous with respect to the model's predictions, a more natural assumption for performative systems. As a result, we are able to significantly relax the assumptions on the loss function. In particular, we do not need to assume convexity with respect to the model's parameters. As an illustration, we introduce a resampling procedure that models realistic distribution shifts and show that it satisfies our assumptions. We support our theory by showing that one can learn performatively stable classifiers with neural networks making predictions about real data that shift according to our proposed procedure.
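To make "repeated risk minimization" and "performatively stable" concrete, the following is a minimal Python sketch on a one-dimensional toy problem. Everything in it is an assumption made for illustration, not the paper's setup: the model is a single parameter theta, the induced data distribution D(theta) has mean mu + eps * theta, and the loss is the squared error, whose minimizer is the mean of the data.

```python
def repeated_risk_minimization(mu, eps, theta0=0.0, steps=100):
    """Iterate theta <- argmin_t E_{z ~ D(theta)} (t - z)^2 on a toy problem.

    Hypothetical setup: deploying theta shifts the data mean to mu + eps * theta.
    The squared loss is minimized by the mean of the current distribution.
    """
    theta = theta0
    for _ in range(steps):
        induced_mean = mu + eps * theta  # mean of D(theta) after deployment
        theta = induced_mean             # exact risk minimizer for squared loss
    return theta

def is_performatively_stable(theta, mu, eps, tol=1e-9):
    # Stable point: theta is optimal for the distribution it induces.
    return abs(theta - (mu + eps * theta)) < tol

if __name__ == "__main__":
    theta = repeated_risk_minimization(mu=1.0, eps=0.5)
    print(theta, is_performatively_stable(theta, mu=1.0, eps=0.5))
```

In this toy case the iteration is a contraction whenever |eps| < 1 and settles at mu / (1 - eps); the sensitivity eps plays the role of the Lipschitz constant of the distribution map.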
On the Limitations of Elo: Real-World Games are Transitive, not Additive
Quentin Bertrand
Wojciech M. Czarnecki
Real-world competitive games, such as chess, go, or StarCraft II, rely on Elo models to measure the strength of their players. Since these games are not fully transitive, using Elo implicitly assumes they have a strong transitive component that can correctly be identified and extracted. In this study, we investigate the challenge of identifying the strength of the transitive component in games. First, we show that Elo models can fail to extract this transitive component, even in elementary transitive games. Then, based on this observation, we propose an extension of the Elo score: we end up with a disc ranking system that assigns each player two scores, which we refer to as skill and consistency. Finally, we propose an empirical validation on payoff matrices coming from real-world games played by bots and humans.
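For readers unfamiliar with the Elo model this abstract builds on, here is a minimal Python sketch of the standard Elo expected score and rating update. It shows the classical single-score model only, not the two-score extension proposed in the paper; the K-factor of 32 and the 400-point scale are conventional defaults assumed here.

```python
def expected_score(rating_a, rating_b):
    # Probability that player A beats player B under the standard Elo model.
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def elo_update(rating_a, rating_b, score_a, k=32.0):
    # score_a is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss.
    # The update is zero-sum: what A gains, B loses.
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta

if __name__ == "__main__":
    print(elo_update(1500.0, 1500.0, score_a=1.0))
```

Because each player is summarized by one number, the predicted win probability depends only on the rating difference, which is the additivity assumption the title refers to.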
Variance Reduction is an Antidote to Byzantines: Better Rates, Weaker Assumptions and Communication Compression as a Cherry on the Top
Eduard Gorbunov
Samuel Horváth
Peter Richtárik
Byzantine-robustness has been gaining a lot of attention due to the growth of the interest in collaborative and federated learning. However, many fruitful directions, such as the usage of variance reduction for achieving robustness and communication compression for reducing communication costs, remain weakly explored in the field. This work addresses this gap and proposes Byz-VR-MARINA - a new Byzantine-tolerant method with variance reduction and compression. A key message of our paper is that variance reduction is key to fighting Byzantine workers more effectively. At the same time, communication compression is a bonus that makes the process more communication efficient. We derive theoretical convergence guarantees for Byz-VR-MARINA outperforming previous state-of-the-art for general non-convex and Polyak-Lojasiewicz loss functions. Unlike the concurrent Byzantine-robust methods with variance reduction and/or compression, our complexity results are tight and do not rely on restrictive assumptions such as boundedness of the gradients or limited compression. Moreover, we provide the first analysis of a Byzantine-tolerant method supporting non-uniform sampling of stochastic gradients. Numerical experiments corroborate our theoretical findings.
Momentum Extragradient is Optimal for Games with Cross-Shaped Spectrum
Junhyung Lyle Kim
Anastasios Kyrillidis
Fabian Pedregosa
The extragradient method has recently gained a lot of attention, due to its convergence behavior on smooth games. In games, the eigenvalues of the Jacobian of the vector field are distributed on the complex plane, exhibiting more convoluted dynamics compared to minimization. In this work, we take a polynomial-based analysis of the extragradient with momentum for optimizing games with *cross-shaped* spectrum on the complex plane. We show two results: first, the extragradient with momentum exhibits three different modes of convergence based on the hyperparameter setup: when the eigenvalues are distributed
Dissecting adaptive methods in GANs
Samy Jelassi
David Dobre
Arthur Mensch
Yuanzhi Li
Adaptive methods are a crucial component widely used for training generative adversarial networks (GANs). While there has been some work to pinpoint the “marginal value of adaptive methods” in standard tasks, it remains unclear why they are still critical for GAN training. In this paper, we formally study how adaptive methods help train GANs; inspired by the grafting method proposed in Agarwal et al. (2020), we separate the magnitude and direction components of the Adam updates, and graft them to the direction and magnitude of SGDA updates respectively. By considering an update rule with the magnitude of the Adam update and the normalized direction of SGD, we empirically show that the adaptive magnitude of Adam is key for GAN training. This motivates us to have a closer look at the class of normalized stochastic gradient descent ascent (nSGDA) methods in the context of GAN training. We propose a synthetic theoretical framework to compare the performance of nSGDA and SGDA for GAN training with neural networks. We prove that in that setting, GANs trained with nSGDA recover all the modes of the true distribution, whereas the same networks trained with SGDA (and any learning rate configuration) suffer from mode collapse. The critical insight in our analysis is that normalizing the gradients forces the discriminator and generator to be updated at the same pace. We also experimentally show that for several datasets, Adam’s performance can be recovered with nSGDA methods.
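The remark that normalizing the gradients makes both players move at the same pace can be sketched in a few lines of Python. This is one simple reading of a normalized descent-ascent step, with each player's gradient normalized separately; the function names and the small `eps` safeguard are assumptions for the example, and the paper's exact variants may differ.

```python
import math

def normalize(grad, eps=1e-12):
    # Rescale a gradient vector to (approximately) unit Euclidean norm.
    norm = math.sqrt(sum(g * g for g in grad))
    return [g / (norm + eps) for g in grad]

def nsgda_step(x, y, grad_x, grad_y, lr):
    # x is the minimizing player (descent), y the maximizing player (ascent).
    # Each player moves a distance of about lr, whatever its gradient scale.
    dir_x = normalize(grad_x)
    dir_y = normalize(grad_y)
    new_x = [xi - lr * di for xi, di in zip(x, dir_x)]
    new_y = [yi + lr * di for yi, di in zip(y, dir_y)]
    return new_x, new_y

if __name__ == "__main__":
    # Gradients with very different magnitudes still give equal step lengths.
    new_x, new_y = nsgda_step([0.0, 0.0], [0.0, 0.0],
                              grad_x=[300.0, 400.0], grad_y=[0.003, 0.004],
                              lr=0.1)
    print(new_x, new_y)
```

With plain SGDA the first player in this example would move five orders of magnitude further than the second; after normalization both steps have the same length.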