
Ioannis Mitliagkas

Core Academic Member
Canada CIFAR AI Chair
Associate Professor, Université de Montréal, Department of Computer Science and Operations Research
Research Scientist, Google DeepMind
Research Topics
Representation Learning
Deep Learning
Generative Models
Optimization
Distributed Systems
Dynamical Systems
Machine Learning Theory

Biography

Ioannis Mitliagkas is an Associate Professor at the Department of Computer Science and Operations Research (DIRO) of the Université de Montréal. He is also a member of Mila – Quebec Artificial Intelligence Institute and holds a Canada CIFAR AI Chair. In addition, he holds a part-time position as a Research Scientist at Google DeepMind in Montréal.

Previously, he was a postdoctoral scholar in the departments of Statistics and Computer Science at Stanford University, and he obtained his Ph.D. from the Department of Electrical and Computer Engineering at the University of Texas at Austin. His research focuses on statistical learning and inference, with an emphasis on optimization, efficient large-scale and distributed algorithms, statistical learning theory, and MCMC methods. His recent work includes methods for efficient and adaptive optimization, the study of the interaction between optimization and the dynamics of large-scale learning systems, and the dynamics of games.

Current Students

Research Intern - UdeM
Postdoctorate - McGill
Principal supervisor:
PhD - UdeM
PhD - UdeM
Co-supervisor:
PhD - UdeM
Principal supervisor:
PhD - UdeM
Master's Research - UdeM
Principal supervisor:
Collaborating Researcher - UdeM

Publications

No Wrong Turns: The Simple Geometry Of Neural Networks Optimization Paths
Charles Guille-Escuret
Hiroki Naganuma
Kilian Fatras
Understanding the optimization dynamics of neural networks is necessary for closing the gap between theory and practice. Stochastic first-order optimization algorithms are known to efficiently locate favorable minima in deep neural networks. This efficiency, however, contrasts with the non-convex and seemingly complex structure of neural loss landscapes. In this study, we delve into the fundamental geometric properties of sampled gradients along optimization paths. We focus on two key quantities, which appear in the restricted secant inequality and error bound, both of high significance for first-order optimization. Our analysis reveals that these quantities exhibit predictable, consistent behavior throughout training, despite the stochasticity induced by sampling minibatches. Our findings suggest that not only do optimization trajectories never encounter significant obstacles, but they also maintain stable dynamics during the majority of training. These observed properties are sufficiently expressive to theoretically guarantee linear convergence and prescribe learning rate schedules mirroring empirical practices. We conduct our experiments on image classification, semantic segmentation and language modeling across different batch sizes, network architectures, datasets, optimizers, and initialization seeds. We discuss the impact of each factor. Our work provides novel insights into the properties of neural network loss functions, and opens the door to theoretical frameworks more relevant to prevalent practice.
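
For readers unfamiliar with the two quantities named above: the restricted secant inequality (RSI) and error bound (EB) are standard conditions in the optimization literature. In one common formulation (notation assumed here, not taken from the paper), for a function f with minimizer x*:

    \text{(RSI)}\quad \langle \nabla f(x),\, x - x^\star \rangle \;\ge\; \mu_{\mathrm{RSI}}\, \lVert x - x^\star \rVert^2
    \qquad
    \text{(EB)}\quad \lVert \nabla f(x) \rVert \;\ge\; \mu_{\mathrm{EB}}\, \lVert x - x^\star \rVert

Conditions of this kind are commonly used to establish linear convergence of gradient descent on smooth objectives, which is why their stability along training trajectories matters.
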
Performance Control in Early Exiting to Deploy Large Models at the Same Cost of Smaller Ones
Mehrnaz Mofakhami
Reza Bayat
Joao Monteiro
Valentina Zantedeschi
Gradient descent induces alignment between weights and the pre-activation tangents for deep non-linear networks
Daniel Beaglehole
Atish Agarwala
Understanding the mechanisms through which neural networks extract statistics from input-label pairs is one of the most important unsolved problems in supervised learning. Prior works have identified that the Gram matrices of the weights in trained neural networks of general architectures are proportional to the average gradient outer product of the model, in a statement known as the Neural Feature Ansatz (NFA). However, the reason these quantities become correlated during training is poorly understood. In this work, we clarify the nature of this correlation and explain its emergence at early training times. We identify that the NFA is equivalent to alignment between the left singular structure of the weight matrices and the newly defined pre-activation tangent kernel. We identify a centering of the NFA that isolates this alignment and is robust to initialization scale. We show that, through this centering, the speed of NFA development can be predicted analytically in terms of simple statistics of the inputs and labels.
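
As a rough sketch of the ansatz mentioned above (notation assumed here, not taken from the paper): for a layer ℓ with weight matrix W_ℓ and input h_ℓ(x), the NFA states that over training,

    W_\ell^\top W_\ell \;\propto\; \frac{1}{n} \sum_{i=1}^{n} \nabla_{h_\ell} f(x_i)\, \big(\nabla_{h_\ell} f(x_i)\big)^\top

i.e., the Gram matrix of the weights tracks the average outer product of the network output's gradients with respect to that layer's input.
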
Are we making progress in unlearning? Findings from the first NeurIPS unlearning competition
Eleni Triantafillou
Peter Kairouz
Fabian Pedregosa
Jamie Hayes
Meghdad Kurmanji
Kairan Zhao
Vincent Dumoulin
Julio C. S. Jacques Junior
Jun Wan
Lisheng Sun-Hosoya
Sergio Escalera
Peter Triantafillou
Isabelle Guyon
We present the findings of the first NeurIPS competition on unlearning, which sought to stimulate the development of novel algorithms and initiate discussions on formal and robust evaluation methodologies. The competition was highly successful: nearly 1,200 teams from across the world participated, and a wealth of novel, imaginative solutions with different characteristics were contributed. In this paper, we analyze top solutions and delve into discussions on benchmarking unlearning, which itself is a research problem. The evaluation methodology we developed for the competition measures forgetting quality according to a formal notion of unlearning, while incorporating model utility for a holistic evaluation. We analyze the effectiveness of different instantiations of this evaluation framework vis-à-vis the associated compute cost, and discuss implications for standardizing evaluation. We find that the ranking of leading methods remains stable under several variations of this framework, pointing to avenues for reducing the cost of evaluation. Overall, our findings indicate progress in unlearning, with top-performing competition entries surpassing existing algorithms under our evaluation framework. We analyze trade-offs made by different algorithms and strengths or weaknesses in terms of generalizability to new datasets, paving the way for advancing both benchmarking and algorithm development in this important area.
Smoothness-Adaptive Sharpness-Aware Minimization for Finding Flatter Minima
Hiroki Naganuma
Junhyung Lyle Kim
Anastasios Kyrillidis
The sharpness-aware minimization (SAM) procedure recently gained increasing attention due to its favorable generalization ability to unseen data. SAM aims to find flatter (local) minima, utilizing a minimax objective. An immediate challenge in the application of SAM is the adjustment of two pivotal step sizes, which significantly influence its effectiveness. We introduce a novel, straightforward approach for adjusting step sizes that adapts to the smoothness of the objective function, thereby reducing the necessity for manual tuning. This method, termed Smoothness-Adaptive SAM (SA-SAM), not only simplifies the optimization process but also promotes the method's inherent tendency to converge towards flatter minima, enhancing performance in specific models.
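
For context, a minimal sketch of the vanilla SAM step that SA-SAM builds on is given below; rho and lr are the two step sizes the abstract refers to, fixed here rather than adapted, and all names are illustrative rather than taken from the paper's code.

    import numpy as np

    def sam_step(w, grad_fn, rho=0.05, lr=0.1):
        """One vanilla SAM update: ascend to a worst-case nearby point,
        then descend using the gradient evaluated there."""
        g = grad_fn(w)                               # gradient at current weights
        eps = rho * g / (np.linalg.norm(g) + 1e-12)  # perturbation toward higher loss
        g_sharp = grad_fn(w + eps)                   # sharpness-aware gradient
        return w - lr * g_sharp                      # descent step

SA-SAM's contribution, per the abstract, is to set rho and lr adaptively from estimates of the objective's smoothness instead of tuning them by hand.
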
Gradient descent induces alignment between weights and the empirical NTK for deep non-linear networks
Daniel Beaglehole
Atish Agarwala
Empirical Analysis of Model Selection for Heterogeneous Causal Effect Estimation
Divyat Mahajan
Brady Neal
Vasilis Syrgkanis
We study the problem of model selection in causal inference, specifically for the case of conditional average treatment effect (CATE) estimation under binary treatments. Unlike model selection in machine learning, there is no perfect analogue of cross-validation as we do not observe the counterfactual potential outcome for any data point. Towards this, there have been a variety of proxy metrics proposed in the literature, that depend on auxiliary nuisance models estimated from the observed data (propensity score model, outcome regression model). However, the effectiveness of these metrics has only been studied on synthetic datasets as we can access the counterfactual data for them. We conduct an extensive empirical analysis to judge the performance of these metrics introduced in the literature, and novel ones introduced in this work, where we utilize the latest advances in generative modeling to incorporate multiple realistic datasets. Our analysis suggests novel model selection strategies based on careful hyperparameter tuning of CATE estimators and causal ensembling.
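
As one concrete example of the kind of proxy metric studied in this line of work, a doubly-robust pseudo-outcome score can be computed from the nuisance models mentioned above; this is a sketch under assumed notation, not necessarily one of the exact metrics evaluated in the paper.

    import numpy as np

    def dr_score(tau_hat, x, y, t, mu0, mu1, e):
        """Score a CATE estimator tau_hat against a doubly-robust pseudo-outcome
        built from the outcome regressions (mu0, mu1) and propensity model (e).
        Lower is better; illustrative sketch only."""
        phi = (mu1(x) - mu0(x)
               + t * (y - mu1(x)) / e(x)
               - (1 - t) * (y - mu0(x)) / (1 - e(x)))  # DR pseudo-outcome
        return float(np.mean((phi - tau_hat(x)) ** 2))  # validation-style risk
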
Stochastic Mirror Descent: Convergence Analysis and Adaptive Variants via the Mirror Stochastic Polyak Stepsize
Ryan D'Orazio
Nicolas Loizou
Issam Hadj Laradji
We investigate the convergence of stochastic mirror descent (SMD) under interpolation in relatively smooth and smooth convex optimization. In relatively smooth convex optimization we provide new convergence guarantees for SMD with a constant stepsize. For smooth convex optimization we propose a new adaptive stepsize scheme, the mirror stochastic Polyak stepsize (mSPS). Notably, our convergence results in both settings do not make bounded gradient assumptions or bounded variance assumptions, and we show convergence to a neighborhood that vanishes under interpolation. Consequently, these results correspond to the first convergence guarantees under interpolation for the exponentiated gradient algorithm for fixed or adaptive stepsizes. mSPS generalizes the recently proposed stochastic Polyak stepsize (SPS) (Loizou et al. 2021) to mirror descent and remains both practical and efficient for modern machine learning applications while inheriting the benefits of mirror descent. We complement our results with experiments across various supervised learning tasks and different instances of SMD, demonstrating the effectiveness of mSPS.
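
For reference, the stepsize family involved takes roughly the following form (standard notation assumed; the capped variant follows Loizou et al. 2021 rather than being quoted from this paper, and the mirror version replaces the Euclidean norm with the dual norm of the mirror geometry):

    \gamma_t \;=\; \min\left\{ \frac{f_{i_t}(x_t) - f_{i_t}^\star}{c\, \lVert \nabla f_{i_t}(x_t) \rVert_*^2},\; \gamma_b \right\}

where f_{i_t} is the sampled loss, f_{i_t}^\star its minimum, c > 0 a constant, and \gamma_b an upper bound on the stepsize.
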
Additive Decoders for Latent Variables Identification and Cartesian-Product Extrapolation
Sébastien Lachapelle
Divyat Mahajan
We tackle the problems of latent variables identification and "out-of-support" image generation in representation learning. We show that both are possible for a class of decoders that we call additive, which are reminiscent of decoders used for object-centric representation learning (OCRL) and well suited for images that can be decomposed as a sum of object-specific images. We provide conditions under which exactly solving the reconstruction problem using an additive decoder is guaranteed to identify the blocks of latent variables up to permutation and block-wise invertible transformations. This guarantee relies only on very weak assumptions about the distribution of the latent factors, which might present statistical dependencies and have an almost arbitrarily shaped support. Our result provides a new setting where nonlinear independent component analysis (ICA) is possible and adds to our theoretical understanding of OCRL methods. We also show theoretically that additive decoders can generate novel images by recombining observed factors of variations in novel ways, an ability we refer to as Cartesian-product extrapolation. We show empirically that additivity is crucial for both identifiability and extrapolation on simulated data.
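
In a compact form of the definition above (notation assumed here): a decoder f is additive when the generated image decomposes over a partition \mathcal{B} of the latent dimensions as

    f(z) \;=\; \sum_{B \in \mathcal{B}} f^{(B)}(z_B)

so that each block z_B of latents drives its own object-specific image f^{(B)}(z_B), and the full image is their sum.
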
CADet: Fully Self-Supervised Out-Of-Distribution Detection With Contrastive Learning
Charles Guille-Escuret
Pau Rodriguez
David Vazquez
Joao Monteiro
Expecting The Unexpected: Towards Broad Out-Of-Distribution Detection
Charles Guille-Escuret
Pierre-Andre Noel
David Vazquez
Joao Monteiro
Improving the reliability of deployed machine learning systems often involves developing methods to detect out-of-distribution (OOD) inputs. However, existing research often narrowly focuses on samples from classes that are absent from the training set, neglecting other types of plausible distribution shifts. This limitation reduces the applicability of these methods in real-world scenarios, where systems encounter a wide variety of anomalous inputs. In this study, we categorize five distinct types of distribution shifts and critically evaluate the performance of recent OOD detection methods on each of them. We publicly release our benchmark under the name BROAD (Benchmarking Resilience Over Anomaly Diversity). Our findings reveal that while these methods excel in detecting unknown classes, their performance is inconsistent when encountering other types of distribution shifts. In other words, they only reliably detect unexpected inputs that they have been specifically designed to expect. As a first step toward broad OOD detection, we learn a generative model of existing detection scores with a Gaussian mixture. By doing so, we present an ensemble approach that offers a more consistent and comprehensive solution for broad OOD detection, demonstrating superior performance compared to existing methods. Our code to download BROAD and reproduce our experiments is publicly available.
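
A minimal sketch of the ensemble idea described in the abstract, fitting a Gaussian mixture to the vector of existing detection scores on in-distribution data and flagging low-likelihood points; variable and function names are illustrative, not from the released code.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def fit_score_model(id_scores, n_components=4):
        """id_scores: (n_samples, n_detectors) matrix of existing OOD-detection
        scores computed on in-distribution data."""
        return GaussianMixture(n_components=n_components).fit(id_scores)

    def broad_ood_score(gmm, scores):
        """Low likelihood under the mixture suggests an out-of-distribution input."""
        return -gmm.score_samples(scores)  # negative log-likelihood per sample
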