Ioannis Mitliagkas

Core Academic Member
Canada CIFAR AI Chair
Assistant Professor, Université de Montréal, Department of Computer Science and Operations Research
Research Scientist, Google DeepMind
Research Topics
Representation Learning
Deep Learning
Generative Models
Optimization
Distributed Systems
Dynamical Systems
Machine Learning Theory

Biography

I am an associate professor in the Department of Computer Science and Operations Research (DIRO) at Université de Montréal. I am also a member of Mila (Quebec Artificial Intelligence Institute) and hold a Canada CIFAR AI Chair. In addition, I hold a part-time research scientist position at Google DeepMind in Montréal.

Previously, I was a postdoctoral researcher in the departments of statistics and computer science at Stanford University. I obtained my PhD from the Department of Electrical and Computer Engineering at the University of Texas at Austin. My research focuses on statistical learning and inference, and in particular on optimization, efficient distributed and large-scale algorithms, statistical learning theory, and MCMC methods. My recent work includes efficient and adaptive optimization methods, the study of the interaction between optimization and the dynamics of large-scale learning systems, and the dynamics of games.

Current Students

Research Intern - UdeM
Research Intern - UdeM
Co-supervisor:
Postdoctorate - McGill
Principal supervisor:
PhD - UdeM
PhD - UdeM
Co-supervisor:
PhD - UdeM
Principal supervisor:
PhD - UdeM
Research Master's - UdeM

Publications

Feature learning as alignment: a structural property of gradient descent in non-linear neural networks
Daniel Beaglehole
Atish Agarwala
Understanding the mechanisms through which neural networks extract statistics from input-label pairs through feature learning is one of the most important unsolved problems in supervised learning. Prior works demonstrated that the gram matrices of the weights (the neural feature matrices, NFM) and the average gradient outer products (AGOP) become correlated during training, in a statement known as the neural feature ansatz (NFA). Through the NFA, the authors introduce mapping with the AGOP as a general mechanism for neural feature learning. However, these works do not provide a theoretical explanation for this correlation or its origins. In this work, we further clarify the nature of this correlation, and explain its emergence. We show that this correlation is equivalent to alignment between the left singular structure of the weight matrices and the newly defined pre-activation tangent features at each layer. We further establish that the alignment is driven by the interaction of weight changes induced by SGD with the pre-activation features, and analyze the resulting dynamics analytically at early times in terms of simple statistics of the inputs and labels. We prove the derivative alignment occurs with high probability in specific high dimensional settings. Finally, motivated by the observation that the NFA is driven by this centered correlation, we introduce a simple optimization rule that dramatically increases the NFA correlations at any given layer and improves the quality of features learned.
Solving Hidden Monotone Variational Inequalities with Surrogate Losses
Ryan D'Orazio
Danilo Vucetic
Zichu Liu
Junhyung Lyle Kim
Deep learning has proven to be effective in a wide variety of loss minimization problems. However, many applications of interest, like minimizing projected Bellman error and min-max optimization, cannot be modelled as minimizing a scalar loss function but instead correspond to solving a variational inequality (VI) problem. This difference in setting has caused many practical challenges as naive gradient-based approaches from supervised learning tend to diverge and cycle in the VI case. In this work, we propose a principled surrogate-based approach compatible with deep learning to solve VIs. We show that our surrogate-based approach has three main benefits: (1) under assumptions that are realistic in practice (when hidden monotone structure is present, interpolation, and sufficient optimization of the surrogates), it guarantees convergence, (2) it provides a unifying perspective of existing methods, and (3) is amenable to existing deep learning optimizers like ADAM. Experimentally, we demonstrate our surrogate-based approach is effective in min-max optimization and minimizing projected Bellman error. Furthermore, in the deep reinforcement learning case, we propose a novel variant of TD(0) which is more compute and sample efficient.
Understanding Adam Requires Better Rotation Dependent Assumptions
Lucas Maes
Tianyue H. Zhang
Alexia Jolicoeur-Martineau
Damien Scieur
Charles Guille-Escuret
Despite its widespread adoption, Adam's advantage over Stochastic Gradient Descent (SGD) lacks a comprehensive theoretical explanation. This paper investigates Adam's sensitivity to rotations of the parameter space. We demonstrate that Adam's performance in training transformers degrades under random rotations of the parameter space, indicating a crucial sensitivity to the choice of basis. This reveals that conventional rotation-invariant assumptions are insufficient to capture Adam's advantages theoretically. To better understand the rotation-dependent properties that benefit Adam, we also identify structured rotations that preserve or even enhance its empirical performance. We then examine the rotation-dependent assumptions in the literature, evaluating their adequacy in explaining Adam's behavior across various rotation types. This work highlights the need for new, rotation-dependent theoretical frameworks to fully understand Adam's empirical success in modern machine learning tasks.
Generating Tabular Data Using Heterogeneous Sequential Feature Forest Flow Matching
Ange-Clément Akazan
Alexia Jolicoeur-Martineau
Compositional Risk Minimization
Divyat Mahajan
Mohammad Pezeshki
Kartik Ahuja
Expecting The Unexpected: Towards Broad Out-Of-Distribution Detection
Charles Guille-Escuret
Pierre-Andre Noel
David Vazquez
Joao Monteiro
No Wrong Turns: The Simple Geometry Of Neural Networks Optimization Paths
Charles Guille-Escuret
Hiroki Naganuma
Kilian Fatras
Understanding the optimization dynamics of neural networks is necessary for closing the gap between theory and practice. Stochastic first-order optimization algorithms are known to efficiently locate favorable minima in deep neural networks. This efficiency, however, contrasts with the non-convex and seemingly complex structure of neural loss landscapes. In this study, we delve into the fundamental geometric properties of sampled gradients along optimization paths. We focus on two key quantities, which appear in the restricted secant inequality and error bound. Both hold high significance for first-order optimization. Our analysis reveals that these quantities exhibit predictable, consistent behavior throughout training, despite the stochasticity induced by sampling minibatches. Our findings suggest that not only do optimization trajectories never encounter significant obstacles, but they also maintain stable dynamics during the majority of training. These observed properties are sufficiently expressive to theoretically guarantee linear convergence and prescribe learning rate schedules mirroring empirical practices. We conduct our experiments on image classification, semantic segmentation and language modeling across different batch sizes, network architectures, datasets, optimizers, and initialization seeds. We discuss the impact of each factor. Our work provides novel insights into the properties of neural network loss functions, and opens the door to theoretical frameworks more relevant to prevalent practice.
Performance Control in Early Exiting to Deploy Large Models at the Same Cost of Smaller Ones
Mehrnaz Mofakhami
Reza Bayat
Joao Monteiro
Valentina Zantedeschi
Gradient descent induces alignment between weights and the pre-activation tangents for deep non-linear networks
Daniel Beaglehole
Atish Agarwala
Understanding the mechanisms through which neural networks extract statistics from input-label pairs is one of the most important unsolved problems in supervised learning. Prior works have identified that the gram matrices of the weights in trained neural networks of general architectures are proportional to the average gradient outer product of the model, in a statement known as the Neural Feature Ansatz (NFA). However, the reason these quantities become correlated during training is poorly understood. In this work, we clarify the nature of this correlation and explain its emergence at early training times. We identify that the NFA is equivalent to alignment between the left singular structure of the weight matrices and the newly defined pre-activation tangent kernel. We identify a centering of the NFA that isolates this alignment and is robust to initialization scale. We show that, through this centering, the speed of NFA development can be predicted analytically in terms of simple statistics of the inputs and labels.
Are we making progress in unlearning? Findings from the first NeurIPS unlearning competition
Eleni Triantafillou
Peter Kairouz
Fabian Pedregosa
Jamie Hayes
Meghdad Kurmanji
Kairan Zhao
Vincent Dumoulin
Julio C. S. Jacques Junior
Jun Wan
Lisheng Sun-Hosoya
Sergio Escalera
Peter Triantafillou
Isabelle Guyon
We present the findings of the first NeurIPS competition on unlearning, which sought to stimulate the development of novel algorithms and initiate discussions on formal and robust evaluation methodologies. The competition was highly successful: nearly 1,200 teams from across the world participated, and a wealth of novel, imaginative solutions with different characteristics were contributed. In this paper, we analyze top solutions and delve into discussions on benchmarking unlearning, which itself is a research problem. The evaluation methodology we developed for the competition measures forgetting quality according to a formal notion of unlearning, while incorporating model utility for a holistic evaluation. We analyze the effectiveness of different instantiations of this evaluation framework vis-a-vis the associated compute cost, and discuss implications for standardizing evaluation. We find that the ranking of leading methods remains stable under several variations of this framework, pointing to avenues for reducing the cost of evaluation. Overall, our findings indicate progress in unlearning, with top-performing competition entries surpassing existing algorithms under our evaluation framework. We analyze trade-offs made by different algorithms and strengths or weaknesses in terms of generalizability to new datasets, paving the way for advancing both benchmarking and algorithm development in this important area.
Smoothness-Adaptive Sharpness-Aware Minimization for Finding Flatter Minima
Hiroki Naganuma
Junhyung Lyle Kim
Anastasios Kyrillidis
The sharpness-aware minimization (SAM) procedure recently gained increasing attention due to its favorable generalization ability to unseen data. SAM aims to find flatter (local) minima, utilizing a minimax objective. An immediate challenge in the application of SAM is the adjustment of two pivotal step sizes, which significantly influence its effectiveness. We introduce a novel, straightforward approach for adjusting step sizes that adapts to the smoothness of the objective function, thereby reducing the necessity for manual tuning. This method, termed Smoothness-Adaptive SAM (SA-SAM), not only simplifies the optimization process but also promotes the method's inherent tendency to converge towards flatter minima, enhancing performance in specific models.
Gradient descent induces alignment between weights and the empirical NTK for deep non-linear networks
Daniel Beaglehole
Atish Agarwala