Quentin Bertrand

Self-Play Q-Learners Can Provably Collude in the Iterated Prisoner's Dilemma

Juan Agustin Duque

Emilio Calvano

A growing body of computational studies shows that simple machine learning agents converge to cooperative behaviors in social dilemmas, such… (see more) as collusive price-setting in oligopoly markets, raising questions about what drives this outcome. In this work, we provide theoretical foundations for this phenomenon in the context of self-play multi-agent Q-learners in the iterated prisoner’s dilemma. We characterize broad conditions under which such agents provably learn the cooperative Pavlov (win-stay, lose-shift) policy rather than the Pareto-dominated “always defect” policy. We validate our theoretical results through additional experiments, demonstrating their robustness across a broader class of deep learning algorithms.

2025-04-30

ICML.cc/2025/Conference (poster)

doi.org

proceedings.mlr.press

Self-Consuming Generative Models with Curated Data Provably Optimize Human Preferences

Damien Ferbach

Quentin Bertrand

Avishek Joey Bose

Gauthier Gidel

The rapid progress in generative models has resulted in impressive leaps in generation quality, blurring the lines between synthetic and rea… (see more)l data. Web-scale datasets are now prone to the inevitable contamination by synthetic data, directly impacting the training of future generated models. Already, some theoretical results on self-consuming generative models (a.k.a., iterative retraining) have emerged in the literature, showcasing that either model collapse or stability could be possible depending on the fraction of generated data used at each retraining step. However, in practice, synthetic data is often subject to human feedback and curated by users before being used and uploaded online. For instance, many interfaces of popular text-to-image generative models, such as Stable Diffusion or Midjourney, produce several variations of an image for a given query which can eventually be curated by the users. In this paper, we theoretically study the impact of data curation on iterated retraining of generative models and show that it can be seen as an \emph{implicit preference optimization mechanism}. However, unlike standard preference optimization, the generative model does not have access to the reward function or negative samples needed for pairwise comparisons. Moreover, our study doesn't require access to the density function, only to samples. We prove that, if the data is curated according to a reward model, then the expected reward of the iterative retraining procedure is maximized. We further provide theoretical results on the stability of the retraining loop when using a positive fraction of real data at each step. Finally, we conduct illustrative experiments on both synthetic datasets and on CIFAR10 showing that such a procedure amplifies biases of the reward model.

2024-09-24

NeurIPS.cc/2024/Conference (spotlight)

doi.org

openreview.net

On the Stability of Iterative Retraining of Generative Models on Their Own Data

Avishek (Joey) Bose

Deep generative models have made tremendous progress in modeling complex data, often exhibiting generation quality that surpasses a typical … (see more)human's ability to discern the authenticity of samples. Undeniably, a key driver of this success is enabled by the massive amounts of web-scale data consumed by these models. Due to these models' striking performance and ease of availability, the web will inevitably be increasingly populated with synthetic content. Such a fact directly implies that future iterations of generative models will be trained on both clean and artificially generated data from past models. In this paper, we develop a framework to rigorously study the impact of training generative models on mixed datasets -- from classical training on real data to self-consuming generative models trained on purely synthetic data. We first prove the stability of iterative training under the condition that the initial generative models approximate the data distribution well enough and the proportion of clean training data (w.r.t. synthetic data) is large enough. We empirically validate our theory on both synthetic and natural images by iteratively training normalizing flows and state-of-the-art diffusion models on CIFAR10 and FFHQ.

2024-01-15

ICLR.cc/2024/Conference (spotlight)

doi.org

openreview.net

Synergies between Disentanglement and Sparsity: Generalization and Identifiability in Multi-Task Learning

Although disentangled representations are often said to be beneficial for downstream tasks, current empirical and theoretical understanding … (see more)is limited. In this work, we provide evidence that disentangled representations coupled with sparse base-predictors improve generalization. In the context of multi-task learning, we prove a new identifiability result that provides conditions under which maximally sparse base-predictors yield disentangled representations. Motivated by this theoretical result, we propose a practical approach to learn disentangled representations based on a sparsity-promoting bi-level optimization problem. Finally, we explore a meta-learning version of this algorithm based on group Lasso multiclass SVM base-predictors, for which we derive a tractable dual formulation. It obtains competitive results on standard few-shot classification benchmarks, while each task is using only a fraction of the learned representations.

2023-07-24

International Conference on Machine Learning (Accept (Poster))

doi.org

proceedings.mlr.press

Omega: Optimistic EMA Gradients

Stochastic min-max optimization has gained interest in the machine learning community with the advancements in GANs and adversarial training… (see more). Although game optimization is fairly well understood in the deterministic setting, some issues persist in the stochastic regime. Recent work has shown that stochastic gradient descent-ascent methods such as the optimistic gradient are highly sensitive to noise or can fail to converge. Although alternative strategies exist, they can be prohibitively expensive. We introduce Omega, a method with optimistic-like updates that mitigates the impact of noise by incorporating an EMA of historic gradients in its update rule. We also explore a variation of this algorithm that incorporates momentum. Although we do not provide convergence guarantees, our experiments on stochastic games show that Omega outperforms the optimistic gradient method when applied to linear players.

2023-07-01

ICML.cc/2023/Workshop/LXAI_Regular_Deadline (oral)

doi.org

openreview.net

On the Limitations of Elo: Real-World Games, are Transitive, not Additive

Quentin Bertrand

Wojciech M. Czarnecki

Gauthier Gidel

Real-world competitive games, such as chess, go, or StarCraft II, rely on Elo models to measure the strength of their players. Since these g… (see more)ames are not fully transitive, using Elo implicitly assumes they have a strong transitive component that can correctly be identified and extracted. In this study, we investigate the challenge of identifying the strength of the transitive component in games. First, we show that Elo models can fail to extract this transitive component, even in elementary transitive games. Then, based on this observation, we propose an extension of the Elo score: we end up with a disc ranking system that assigns each player two scores, which we refer to as skill and consistency. Finally, we propose an empirical validation on payoff matrices coming from real-world games played by bots and humans.

2022-12-31

AISTATS (published)

doi.org

arxiv.org

AI Policy Fellowship Publications

Mila Ventures Launchpad

AI Policy Compass

Quentin Bertrand

Publications

AI Policy Fellowship Publications

Mila Ventures Launchpad

AI Policy Compass

Popular keywords:

Quentin Bertrand

Publications