Gauthier Gidel

Core Academic Member
Canada CIFAR AI Chair
Assistant Professor, Université de Montréal, Department of Computer Science and Operations Research
Research Topics
Reinforcement Learning
Generative Models
Optimization
Machine Learning Theory

Biography

I am an assistant professor in the Department of Computer Science and Operations Research (DIRO) at the Université de Montréal and a core academic member of Mila – Quebec Artificial Intelligence Institute. I received a Borealis AI Graduate Fellowship and currently hold a Canada CIFAR AI Chair. I have worked at DeepMind and Element AI, and was recently a long-term visitor at the Simons Institute at the University of California, Berkeley. My research interests lie at the intersection of game theory, optimization, and machine learning.

Current Students

Master's research - UdeM
Research collaborator - UdeM
PhD - UdeM
Independent research visitor - N/A
PhD - UdeM
Co-supervisor:
Research intern - UdeM
PhD - UdeM
Co-supervisor:
PhD - UdeM
Co-supervisor:
Research collaborator - UdeM
Co-supervisor:
Research collaborator - UdeM
Independent research visitor - Technical University of Munich
Research intern - UdeM
Postdoctorate - UdeM
PhD - UdeM
Co-supervisor:
Alumni collaborator - N/A

Publications

Stochastic Frank-Wolfe: Unified Analysis and Zoo of Special Cases
Ruslan Nazykov
Aleksandr Shestakov
Vladimir Solodkin
Aleksandr Beznosikov
Alexander Gasnikov
The Conditional Gradient (or Frank-Wolfe) method is one of the most well-known methods for solving constrained optimization problems appearing in various machine learning tasks. The simplicity of its iterations and its applicability to many practical problems have helped the method gain popularity in the community. In recent years, the Frank-Wolfe algorithm has received many extensions, including stochastic modifications with variance reduction and coordinate sampling for training huge models, as well as distributed variants for big-data problems. In this paper, we present a unified convergence analysis of the Stochastic Frank-Wolfe method that covers a large number of practical cases with completely different sources of stochasticity, intuitions, and application areas. Our analysis is based on a key parametric assumption on the variance of the stochastic gradients. Unlike most works on the unified analysis of other methods, such as SGD, we do not assume that the gradient estimator is unbiased. We analyze both convex and non-convex problems, given the popularity of both cases in machine learning. With this general theoretical framework, we not only recover the rates of many known methods but also develop numerous new ones, demonstrating the flexibility of the Conditional Gradient approach as a basis for new algorithms. We also demonstrate the properties of the new methods through numerical experiments.
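As a rough illustration of the family of methods analyzed here, the following is a minimal sketch of a stochastic Frank-Wolfe loop on the probability simplex; the mini-batch estimator, step size, and toy objective are illustrative choices, not the paper's specific variants.

```python
import numpy as np

def lmo_simplex(grad):
    """Linear minimization oracle over the simplex: the vertex e_i with i = argmin_j grad_j."""
    s = np.zeros_like(grad)
    s[np.argmin(grad)] = 1.0
    return s

def stochastic_frank_wolfe(stoch_grad, x0, n_steps=1000):
    x = x0.copy()
    for t in range(n_steps):
        g = stoch_grad(x)                # noisy (possibly biased) gradient estimate
        s = lmo_simplex(g)               # linear subproblem over the constraint set
        gamma = 2.0 / (t + 2.0)          # classical Frank-Wolfe step size
        x = (1 - gamma) * x + gamma * s  # convex combination keeps x feasible
    return x

# Toy usage: least squares restricted to the simplex, with mini-batch gradients.
rng = np.random.default_rng(0)
A, b = rng.normal(size=(200, 10)), rng.normal(size=200)

def stoch_grad(x, batch=32):
    idx = rng.integers(0, 200, size=batch)
    return A[idx].T @ (A[idx] @ x - b[idx]) / batch

x_star = stochastic_frank_wolfe(stoch_grad, np.ones(10) / 10)
```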
Iterated Denoising Energy Matching for Sampling from Boltzmann Densities
Efficiently generating statistically independent samples from an unnormalized probability distribution, such as equilibrium samples of many-body systems, is a foundational problem in science. In this paper, we propose Iterated Denoising Energy Matching (iDEM), an iterative algorithm that uses a novel stochastic score matching objective leveraging solely the energy function and its gradient -- and no data samples -- to train a diffusion-based sampler. Specifically, iDEM alternates between (I) sampling regions of high model density from a diffusion-based sampler and (II) using these samples in our stochastic matching objective to further improve the sampler. iDEM is scalable to high dimensions, as the inner matching objective is simulation-free and requires no MCMC samples. Moreover, by leveraging the fast mode-mixing behavior of diffusion, iDEM smooths out the energy landscape, enabling efficient exploration and the learning of an amortized sampler. We evaluate iDEM on a suite of tasks ranging from standard synthetic energy functions to invariant n-body particle systems.
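The core of the matching objective can be sketched as follows: the score of the noised target is estimated purely from the energy function via self-normalized importance sampling, and the network is regressed onto it. This is a simplified PyTorch sketch under assumed shapes, with placeholder `energy` and `score_net` callables, not the paper's full algorithm.

```python
import torch

def mc_score_estimate(energy, x_t, sigma_t, K=64):
    # K candidates around x_t; the weights follow from differentiating
    # log(1/K * sum_i exp(-E(x_i))) with the reparameterization x_i = x_t + sigma_t * eps_i.
    x0 = x_t.unsqueeze(1) + sigma_t * torch.randn(x_t.shape[0], K, x_t.shape[-1])
    x0.requires_grad_(True)
    E = energy(x0)                               # assumed shape: (batch, K)
    grads = torch.autograd.grad(E.sum(), x0)[0]  # gradient of E at each candidate
    w = torch.softmax(-E, dim=1).unsqueeze(-1)   # self-normalized importance weights
    return -(w * grads).sum(dim=1)               # estimate of the score of the noised target

def dem_loss(score_net, energy, x_t, t, sigma_t):
    target = mc_score_estimate(energy, x_t, sigma_t).detach()
    return ((score_net(x_t, t) - target) ** 2).mean()
```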
Deep Equilibrium Models For Algorithmic Reasoning
In this blog post, we discuss the idea of teaching neural networks to reach fixed points when reasoning. Specifically, on the algorithmic reasoning benchmark CLRS, current neural networks are told the number of reasoning steps they need. While a quick fix is to add a termination network that predicts when to stop, a much more salient inductive bias is that the neural network shouldn't change its answer any further once the answer is correct, i.e., it should reach a fixed point. This is supported by denotational semantics, which tells us that while loops that terminate are the minimum fixed points of a function. We implement this idea with the help of deep equilibrium models and discuss several hurdles one encounters along the way. We show, on several algorithms from the CLRS benchmark, the partial success of this approach and the difficulty of making it work robustly across all algorithms.
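A minimal sketch of the fixed-point idea, with a GRU cell standing in for one reasoning step (the actual CLRS processors are graph networks) and an illustrative convergence test:

```python
import torch
import torch.nn as nn

class FixedPointReasoner(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.step = nn.GRUCell(dim, dim)  # one "reasoning step"

    def forward(self, x, max_iters=100, tol=1e-4):
        h = torch.zeros_like(x)
        for _ in range(max_iters):
            h_next = self.step(x, h)
            # Stop once the answer stops changing, i.e. a fixed point is reached.
            if (h_next - h).norm() < tol * (h.norm() + 1e-8):
                return h_next
            h = h_next
        return h  # fall back after max_iters without convergence
```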
In-Context Learning Can Re-learn Forbidden Tasks
Despite significant investment into safety training, large language models (LLMs) deployed in the real world still suffer from numerous vulnerabilities. One perspective on LLM safety training is that it algorithmically forbids the model from answering toxic or harmful queries. To assess the effectiveness of safety training, in this work, we study forbidden tasks, i.e., tasks the model is designed to refuse to answer. Specifically, we investigate whether in-context learning (ICL) can be used to re-learn forbidden tasks despite the explicit fine-tuning of the model to refuse them. We first examine a toy example of refusing sentiment classification to demonstrate the problem. Then, we use ICL on a model fine-tuned to refuse to summarise made-up news articles. Finally, we investigate whether ICL can undo safety training, which could represent a major security risk. For the safety task, we look at Vicuna-7B, Starling-7B, and Llama2-7B. We show that the attack works out-of-the-box on Starling-7B and Vicuna-7B but fails on Llama2-7B. Finally, we propose an ICL attack that uses the chat template tokens like a prompt injection attack to achieve a better attack success rate on Vicuna-7B and Starling-7B. Trigger Warning: the appendix contains LLM-generated text with violence, suicide, and misinformation.
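For the toy sentiment example, the attack amounts to framing the refused task as a few-shot prompt so the model imitates the demonstrated pattern; the template and demonstrations below are invented placeholders, not the paper's actual prompts.

```python
demonstrations = [
    ("The movie was wonderful.", "positive"),
    ("I hated every minute of it.", "negative"),
]

def build_icl_prompt(query, demos):
    # Few-shot examples re-establish the input-output pattern the model
    # was fine-tuned to refuse.
    shots = "\n".join(f"Review: {x}\nSentiment: {y}" for x, y in demos)
    return f"{shots}\nReview: {query}\nSentiment:"

prompt = build_icl_prompt("A dull, lifeless film.", demonstrations)
```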
When is Momentum Extragradient Optimal? A Polynomial-Based Analysis
Junhyung Lyle Kim
Anastasios Kyrillidis
Fabian Pedregosa
The extragradient method has gained popularity due to its robust convergence properties for differentiable games. Unlike single-objective optimization, game dynamics involve complex interactions reflected by the eigenvalues of the game vector field's Jacobian scattered across the complex plane. This complexity can cause the simple gradient method to diverge, even for bilinear games, while the extragradient method achieves convergence. Building on the recently proven accelerated convergence of the momentum extragradient method for bilinear games \citep{azizian2020accelerating}, we use a polynomial-based analysis to identify three distinct scenarios where this method exhibits further accelerated convergence. These scenarios encompass situations where the eigenvalues reside on the (positive) real line, lie on the real line alongside complex conjugates, or exist solely as complex conjugates. Furthermore, we derive the hyperparameters for each scenario that achieve the fastest convergence rate.
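A compact sketch of the momentum extragradient iteration on a random bilinear game; the step sizes and momentum below are arbitrary illustrative values, whereas the paper derives the optimal hyperparameters for each eigenvalue configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))

def F(z):
    # Vector field of the bilinear game min_x max_y x^T A y.
    x, y = z[:5], z[5:]
    return np.concatenate([A @ y, -A.T @ x])

eta, gamma, m = 0.05, 0.05, 0.1  # step sizes and momentum (illustrative)
z, z_prev = rng.normal(size=10), None
for t in range(2000):
    z_half = z - gamma * F(z)        # extrapolation step
    z_new = z - eta * F(z_half)      # gradient step at the extrapolated point
    if z_prev is not None:
        z_new = z_new + m * (z - z_prev)  # momentum term
    z, z_prev = z_new, z

print(np.linalg.norm(z))  # distance to the unique equilibrium at the origin
```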
Expected Flow Networks in Stochastic Environments and Two-Player Zero-Sum Games
Generative flow networks (GFlowNets) are sequential sampling models trained to match a given distribution. GFlowNets have been successfully applied to various structured object generation tasks, sampling a diverse set of high-reward objects quickly. We propose expected flow networks (EFlowNets), which extend GFlowNets to stochastic environments. We show that EFlowNets outperform other GFlowNet formulations in stochastic tasks such as protein design. We then extend the concept of EFlowNets to adversarial environments, proposing adversarial flow networks (AFlowNets) for two-player zero-sum games. We show that AFlowNets learn to find above 80% of optimal moves in Connect-4 via self-play and outperform AlphaZero in tournaments.
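For orientation, GFlowNet training is often posed via a trajectory-balance-style objective like the sketch below; EFlowNets and AFlowNets modify the flow-matching conditions for stochastic and adversarial transitions, which this deterministic base-case sketch does not capture, and the names are placeholders.

```python
import torch

def trajectory_balance_loss(log_Z, log_pf, log_pb, log_reward):
    # log_pf / log_pb: summed log-probabilities of forward / backward actions
    # along one sampled trajectory; log_reward: log R(terminal state).
    # At the optimum, Z * P_F(trajectory) = R(x) * P_B(trajectory | x).
    return (log_Z + log_pf - log_reward - log_pb) ** 2
```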
Synaptic Weight Distributions Depend on the Geometry of Plasticity
A growing literature in computational neuroscience leverages gradient descent and learning algorithms that approximate it to study synaptic plasticity in the brain. However, the vast majority of this work ignores a critical underlying assumption: the choice of distance for synaptic changes - i.e. the geometry of synaptic plasticity. Gradient descent assumes that the distance is Euclidean, but many other distances are possible, and there is no reason that biology necessarily uses Euclidean geometry. Here, using the theoretical tools provided by mirror descent, we show that the distribution of synaptic weights will depend on the geometry of synaptic plasticity. We use these results to show that experimentally-observed log-normal weight distributions found in several brain areas are not consistent with standard gradient descent (i.e. a Euclidean geometry), but rather with non-Euclidean distances. Finally, we show that it should be possible to experimentally test for different synaptic geometries by comparing synaptic weight distributions before and after learning. Overall, our work shows that the current paradigm in theoretical work on synaptic plasticity that assumes Euclidean synaptic geometry may be misguided and that it should be possible to experimentally determine the true geometry of synaptic plasticity in the brain.
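The core contrast can be illustrated in a few lines: Euclidean gradient descent makes additive weight updates, while a non-Euclidean geometry such as exponentiated gradient (a mirror descent instance) makes multiplicative ones, which preserve positivity and heavy, log-normal-like tails. The toy gradients below are placeholders for actual learning signals.

```python
import numpy as np

rng = np.random.default_rng(0)
w_gd = rng.lognormal(size=100_000)  # start from a log-normal weight distribution
w_md = w_gd.copy()

for _ in range(100):
    g = rng.normal(size=100_000)     # stand-in for per-synapse gradients
    w_gd = w_gd - 0.01 * g           # Euclidean geometry: additive update
    w_md = w_md * np.exp(-0.01 * g)  # exponentiated gradient: multiplicative update

# w_md remains positive and log-normal-like; w_gd drifts toward a symmetric
# distribution and can cross zero, unlike log-normally distributed weights.
```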
On the Stability of Iterative Retraining of Generative Models on Their Own Data
Deep generative models have made tremendous progress in modeling complex data, often exhibiting generation quality that surpasses a typical human's ability to discern the authenticity of samples. Undeniably, a key driver of this success is the massive amount of web-scale data consumed by these models. Due to these models' striking performance and ease of availability, the web will inevitably be increasingly populated with synthetic content. This directly implies that future iterations of generative models will be trained on both clean and artificially generated data from past models. In this paper, we develop a framework to rigorously study the impact of training generative models on mixed datasets -- from classical training on real data to self-consuming generative models trained on purely synthetic data. We first prove the stability of iterative training under the condition that the initial generative models approximate the data distribution well enough and the proportion of clean training data (w.r.t. synthetic data) is large enough. We empirically validate our theory on both synthetic and natural images by iteratively training normalizing flows and state-of-the-art diffusion models on CIFAR10 and FFHQ.
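The self-consuming training loop studied here can be summarized schematically; `train` and `sample` below are placeholders for fitting and sampling any generative model, and the mixing fraction mirrors the paper's proportion of clean data.

```python
def iterative_retraining(real_data, train, sample, n_generations=5, clean_fraction=0.8):
    model = train(real_data)
    for _ in range(n_generations):
        n_real = int(clean_fraction * len(real_data))
        synthetic = sample(model, len(real_data) - n_real)
        mixed = real_data[:n_real] + synthetic  # mixture of clean and synthetic data
        model = train(mixed)                    # next-generation model
    return model
```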
High-Probability Convergence for Composite and Distributed Stochastic Minimization and Variational Inequalities with Heavy-Tailed Noise
Abdurakhmon Sadiev
Marina Danilova
Samuel Horváth
Pavel Dvurechensky
Alexander Gasnikov
Peter Richtárik
Proving Linear Mode Connectivity of Neural Networks via Optimal Transport
The energy landscape of high-dimensional non-convex optimization problems is crucial to understanding the effectiveness of modern deep neural network architectures. Recent works have experimentally shown that two different solutions found after two runs of a stochastic training are often connected by very simple continuous paths (e.g., linear) modulo a permutation of the weights. In this paper, we provide a framework theoretically explaining this empirical observation. Based on convergence rates in Wasserstein distance of empirical measures, we show that, with high probability, two wide enough two-layer neural networks trained with stochastic gradient descent are linearly connected. Additionally, we express upper and lower bounds on the width of each layer of two deep neural networks with independent neuron weights to be linearly connected. Finally, we empirically demonstrate the validity of our approach by showing how the dimension of the support of the weight distribution of neurons, which dictates Wasserstein convergence rates, is correlated with linear mode connectivity.
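A sketch of the experimental recipe: align the hidden neurons of one two-layer network to another with an optimal permutation (a discrete special case of the optimal transport viewpoint), then examine the loss along the linear path between the aligned weights. The alignment cost below is one plausible choice, not necessarily the paper's.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_hidden_units(W1_a, W1_b, W2_a, W2_b):
    # Match hidden unit j of network B to unit i of network A by maximizing
    # the similarity of their input weights (an assignment problem).
    cost = -W1_a @ W1_b.T
    _, perm = linear_sum_assignment(cost)
    return W1_b[perm], W2_b[:, perm]  # permute layer-1 rows and layer-2 columns

def linear_path(Wa, Wb, n_points=11):
    # Weights along the segment between the two (aligned) solutions; linear mode
    # connectivity holds if the loss stays flat along this path.
    return [(1 - a) * Wa + a * Wb for a in np.linspace(0.0, 1.0, n_points)]
```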
Sarah Frank-Wolfe: Methods for Constrained Optimization with Best Rates and Practical Features
Aleksandr Beznosikov
David Dobre
Adversarial Attacks and Defenses in Large Language Models: Old and New Threats
David Dobre
Stephan Günnemann
Over the past decade, there has been extensive research aimed at enhancing the robustness of neural networks, yet this problem remains vastly unsolved. Here, one major impediment has been the overestimation of the robustness of new defense approaches due to faulty defense evaluations. Flawed robustness evaluations necessitate rectifications in subsequent works, dangerously slowing down research and providing a false sense of security. In this context, we will face substantial challenges associated with an impending adversarial arms race in natural language processing, specifically with closed-source Large Language Models (LLMs), such as ChatGPT, Google Bard, or Anthropic's Claude. We provide a first set of prerequisites to improve the robustness assessment of new approaches and reduce the number of faulty evaluations. Additionally, we identify embedding space attacks on LLMs as another viable threat model for the purpose of generating malicious content in open-source models. Finally, we demonstrate on a recently proposed defense that, without LLM-specific best practices in place, it is easy to overestimate the robustness of a new approach.
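Schematically, an embedding-space attack bypasses discrete token search by optimizing a continuous prefix directly, which is only possible with white-box access to an open-weights model; the sketch below assumes a Hugging-Face-style interface and placeholder names, and is a conceptual outline rather than the paper's procedure.

```python
import torch
import torch.nn.functional as F

def embedding_attack(model, target_ids, prefix_len=20, steps=200, lr=0.01):
    # Optimize a continuous embedding prefix so the model assigns high
    # probability to a chosen target continuation (white-box access required).
    embed = model.get_input_embeddings()
    adv = torch.randn(1, prefix_len, embed.embedding_dim, requires_grad=True)
    target_emb = embed(target_ids)
    opt = torch.optim.Adam([adv], lr=lr)
    for _ in range(steps):
        inputs = torch.cat([adv, target_emb], dim=1)
        logits = model(inputs_embeds=inputs).logits
        pred = logits[:, prefix_len - 1 : -1, :]  # positions predicting the target tokens
        loss = F.cross_entropy(pred.reshape(-1, pred.size(-1)), target_ids.reshape(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return adv.detach()
```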