Ioannis Mitliagkas

Core Academic Member

ioannis@mila.quebec

Canada CIFAR AI Chair

Associate Professor, Université de Montréal, Department of Computer Science and Operations Research

Research Scientist, Google DeepMind

Research Topics

Deep Learning

Distributed Systems

Dynamical Systems

Generative Models

Machine Learning Theory

Optimization

Representation Learning

Website

Google Scholar

Biography

Ioannis Mitliagkas (Γιάννης Μητλιάγκας) is an associate professor in the Department of Computer Science and Operations Research (DIRO) at Université de Montréal, as well as a Core Academic member of Mila – Quebec Artificial Intelligence Institute and a Canada CIFAR AI Chair. He holds a part-time position as a staff research scientist at Google DeepMind Montréal.

Previously, he was a postdoctoral scholar in the Departments of Statistics and Computer Science at Stanford University. He obtained his PhD from the Department of Electrical and Computer Engineering at the University of Texas at Austin.

His research covers topics in machine learning, with an emphasis on optimization, deep learning theory, and statistical learning. His recent work includes methods for efficient and adaptive optimization, the study of the interaction between optimization and the dynamics of large-scale learning systems, and the dynamics of games.

Current Students

Mehdi Inane Ahmed

PhD - Université de Montréal

Github

Google Scholar

Ryan D'Orazio

PhD - Université de Montréal

Charles Guille-escuret

Collaborating Alumni - Université de Montréal

Co-supervisor :

Irina Rish

Website

Google Scholar

Zichu Liu

PhD - Université de Montréal

Principal supervisor :

Gauthier Gidel

Youry Macius

Professional Master's - Université de Montréal

PhD - Université de Montréal

Yukti Makhija

PhD - Université de Montréal

Principal supervisor :

Collaborating researcher - Université de Montréal

Principal supervisor :

PhD - Université de Montréal

Master's Research - Université de Montréal

Blog Posts

March 18, 2024

Additive Decoders for Latent Variables Identification and Cartesian-Product Extrapolation

Sébastien Lachapelle

Divyat Mahajan

Ioannis Mitliagkas

Simon Lacoste-Julien

Read the article

Publications

The Geometry of Spectral Gradient Descent: Layerwise Criteria for SignSGD vs SpecSGD

Hiroki Naganuma

Laura Gomezjurado

Mahdi Ghaznavi

Ioannis Mitliagkas

Optimization in deep learning has expanded beyond Euclidean methods to include entrywise sign updates (SignSGD) and spectral sign updates (SpecGD/Muon). While both can be viewed as steepest descent under non-Euclidean geometries (

2026-03-01

GRaM @ International Conference on Learning Representations (poster)

openreview.net

Beyond Multi-Token Prediction: Pretraining LLMs with Future Summaries

Divyat Mahajan

Sachin Goyal

Badr Youbi Idrissi

Mohammad Pezeshki

Ioannis Mitliagkas

David Lopez-Paz

Kartik Ahuja

Next-token prediction (NTP) has driven the success of large language models (LLMs), but it struggles with long-horizon reasoning, planning, and creative writing, with these limitations largely attributed to teacher-forced training. Multi-token prediction (MTP) partially mitigates these issues by predicting several future tokens at once, but it mostly captures short-range dependencies and offers limited improvement. We propose future summary prediction (FSP), which trains an auxiliary head to predict a compact representation of the long-term future, preserving information relevant for long-form generations. We explore two variants of FSP: handcrafted summaries, for example, a bag of words summary of the future of the sequence, and learned summaries, which use embeddings produced by a reverse language model trained from right to left. Large-scale pretraining experiments (3B and 8B-parameter models) demonstrate that FSP provides improvements over both NTP and MTP across math, reasoning, and coding benchmarks.

2025-12-31

International Conference on Learning Representations (Accept (Poster))

doi.org

openreview.net

DisTaC: Conditioning Task Vectors via Distillation for Robust Model Merging

Kotaro Yoshida

Yuji Naraki

Takafumi Horie

Ryotaro Shimizu

Ioannis Mitliagkas

Hiroki Naganuma

Model merging has emerged as an efficient and flexible paradigm for multi-task learning, with numerous methods being proposed in recent years. However, these state-of-the-art techniques are typically evaluated on benchmark suites that are highly favorable to model merging, and their robustness in more realistic settings remains largely unexplored. In this work, we first investigate the vulnerabilities of model-merging methods and pinpoint the source-model characteristics that critically underlie them. Specifically, we identify two factors that are particularly harmful to the merging process: (1) disparities in task vector norms, and (2) the low confidence of the source models. To address this issue, we propose **DisTaC** (**Dis**tillation for **Ta**sk vector **C**onditioning), a novel method that pre-conditions these problematic task vectors before the merge. DisTaC leverages knowledge distillation to adjust a task vector's norm and increase source-model confidence while preserving its essential task-specific knowledge. Our extensive experiments demonstrate that by pre-conditioning task vectors with DisTaC, state-of-the-art merging techniques can successfully integrate models that exhibit these harmful traits, where they would otherwise fail, and achieve significant performance gains.

2025-12-31

International Conference on Learning Representations (Accept (Poster))

doi.org

openreview.net

A Derandomization Framework for Structure Discovery: Applications in Neural Networks and Beyond

Nikos Tsikouras

Yorgos Pantis

Ioannis Mitliagkas

Christos Tzamos

2025-10-21

ArXiv (preprint)

doi.org

arxiv.org

Compositional Risk Minimization

Charles Arnal

Compositional generalization is a crucial step towards developing data-efficient intelligent machines that generalize in human-like ways. In this work, we tackle a challenging form of distribution shift, termed compositional shift, where some attribute combinations are completely absent at training but present in the test distribution. This shift tests the model's ability to generalize compositionally to novel attribute combinations in discriminative tasks. We model the data with flexible additive energy distributions, where each energy term represents an attribute, and derive a simple alternative to empirical risk minimization termed compositional risk minimization (CRM). We first train an additive energy classifier to predict the multiple attributes and then adjust this classifier to tackle compositional shifts. We provide an extensive theoretical analysis of CRM, where we show that our proposal extrapolates to special affine hulls of seen attribute combinations. Empirical evaluations on benchmark datasets confirm the improved robustness of CRM compared to other methods from the literature designed to tackle various forms of subpopulation shifts.

2025-10-05

Proceedings of the 42nd International Conference on Machine Learning (published)

doi.org

proceedings.mlr.press

Understanding Adam Requires Better Rotation Dependent Assumptions

Lucas Maes

Tianyue H. Zhang

Alan Milligan

Alexia Jolicoeur-Martineau

Ioannis Mitliagkas

Damien Scieur

Simon Lacoste-Julien

Charles Guille-escuret

Despite its widespread adoption, Adam's advantage over Stochastic Gradient Descent (SGD) lacks a comprehensive theoretical explanation. This paper investigates Adam's sensitivity to rotations of the parameter space. We observe that Adam's performance in training transformers degrades under random rotations of the parameter space, indicating a crucial sensitivity to the choice of basis in practice. This reveals that conventional rotation-invariant assumptions are insufficient to capture Adam's advantages theoretically. To better understand the rotation-dependent properties that benefit Adam, we also identify structured rotations that preserve or even enhance its empirical performance. We then examine the rotation-dependent assumptions in the literature and find that they fall short in explaining Adam's behaviour across various rotation types. In contrast, we verify the orthogonality of the update as a promising indicator of Adam's basis sensitivity, suggesting it may be the key quantity for developing rotation-dependent theoretical frameworks that better explain its empirical success.

2025-09-17

NeurIPS.cc/2025/Conference (poster)

doi.org

openreview.net

Pseudo-Asynchronous Local SGD: Robust and Efficient Data-Parallel Training

Hiroki Naganuma

Xinzhi Zhang

Man-Chung Yue

Ioannis Mitliagkas

Russell J. Hewett

Philipp Andre Witte

Yin Tat Lee

2025-09-14

TMLR (accepted)

openreview.net

Learning to Defer for Causal Discovery with Imperfect Experts

Sara Magliacane

Valentina Zantedeschi

Alexandre Drouin

Integrating expert knowledge, e.g. from large language models, into causal discovery algorithms can be challenging when the knowledge is not guaranteed to be correct. Expert recommendations may contradict data-driven results, and their reliability can vary significantly depending on the domain or specific query. Existing methods based on soft constraints or inconsistencies in predicted causal relationships fail to account for these variations in expertise. To remedy this, we propose L2D-CD, a method for gauging the correctness of expert recommendations and optimally combining them with data-driven causal discovery results. By adapting learning-to-defer (L2D) algorithms for pairwise causal discovery (CD), we learn a deferral function that selects whether to rely on classical causal discovery methods using numerical data or expert recommendations based on textual meta-data. We evaluate L2D-CD on the canonical Tübingen pairs dataset and demonstrate its superior performance compared to both the causal discovery method and the expert used in isolation. Moreover, our approach identifies domains where the expert's performance is strong or weak. Finally, we outline a strategy for generalizing this approach to causal discovery on graphs with more than two variables, paving the way for further research in this area.

2025-03-04

ICLR.cc/2025/Workshop/LLM_Reason_and_Plan (published)

doi.org

openreview.net

Solving Hidden Monotone Variational Inequalities with Surrogate Losses

Junhyung Lyle Kim

Deep learning has proven to be effective in a wide variety of loss minimization problems. However, many applications of interest, like minimizing projected Bellman error and min-max optimization, cannot be modelled as minimizing a scalar loss function but instead correspond to solving a variational inequality (VI) problem. This difference in setting has caused many practical challenges as naive gradient-based approaches from supervised learning tend to diverge and cycle in the VI case. In this work, we propose a principled surrogate-based approach compatible with deep learning to solve VIs. We show that our surrogate-based approach has three main benefits: (1) under assumptions that are realistic in practice (when hidden monotone structure is present, interpolation, and sufficient optimization of the surrogates), it guarantees convergence, (2) it provides a unifying perspective of existing methods, and (3) it is amenable to existing deep learning optimizers like ADAM. Experimentally, we demonstrate our surrogate-based approach is effective in min-max optimization and minimizing projected Bellman error. Furthermore, in the deep reinforcement learning case, we propose a novel variant of TD(0) which is more compute- and sample-efficient.

2025-01-21

ICLR.cc/2025/Conference (poster)

doi.org

openreview.net

An Empirical Study of Pre-trained Model Selection for Out-of-Distribution Generalization and Calibration

Hiroki Naganuma

Ryuichiro Hataya

Kotaro Yoshida

Ioannis Mitliagkas

2024-12-31

Trans. Mach. Learn. Res. (published)

openreview.net

Generating Tabular Data Using Heterogeneous Sequential Feature Forest Flow Matching

Ange-Clément Akazan

Alexia Jolicoeur-Martineau

Ioannis Mitliagkas

2024-10-19

ArXiv (preprint)

doi.org

arxiv.org

Expecting The Unexpected: Towards Broad Out-Of-Distribution Detection

Charles Guille-escuret

Pierre-Andre Noel

Ioannis Mitliagkas

David Vázquez

Joao Monteiro

Improving the reliability of deployed machine learning systems often involves developing methods to detect out-of-distribution (OOD) inputs. However, existing research often narrowly focuses on samples from classes that are absent from the training set, neglecting other types of plausible distribution shifts. This limitation reduces the applicability of these methods in real-world scenarios, where systems encounter a wide variety of anomalous inputs. In this study, we categorize five distinct types of distribution shifts and critically evaluate the performance of recent OOD detection methods on each of them. We publicly release our benchmark under the name BROAD (Benchmarking Resilience Over Anomaly Diversity). Our findings reveal that while these methods excel in detecting unknown classes, their performance is inconsistent when encountering other types of distribution shifts. In other words, they only reliably detect unexpected inputs that they have been specifically designed to expect. As a first step toward broad OOD detection, we learn a generative model of existing detection scores with a Gaussian mixture. By doing so, we present an ensemble approach that offers a more consistent and comprehensive solution for broad OOD detection, demonstrating superior performance compared to existing methods. Our code to download BROAD and reproduce our experiments is publicly available.

2024-09-25

NeurIPS.cc/2024/Datasets_and_Benchmarks_Track (poster)

doi.org

openreview.net
