Gauthier Gidel

Self-Play $Q$-Learners Can Provably Collude in the Iterated Prisoner’s Dilemma

Juan Agustin Duque

Emilio Calvano

A growing body of computational studies shows that simple machine learning agents converge to cooperative behaviors in social dilemmas, such as collusive price-setting in oligopoly markets, raising questions about what drives this outcome. In this work, we provide theoretical foundations for this phenomenon in the context of self-play multi-agent Q-learners in the iterated prisoner’s dilemma. We characterize broad conditions under which such agents provably learn the cooperative Pavlov (win-stay, lose-shift) policy rather than the Pareto-dominated “always defect” policy. We validate our theoretical results through additional experiments, demonstrating their robustness across a broader class of deep learning algorithms.

2025-10-06

Proceedings of the 42nd International Conference on Machine Learning (published)

proceedings.mlr.press

A Generative Approach to LLM Harmfulness Mitigation with Red Flag Tokens

David Dobre

Most safety training methods for large language models (LLMs) are based on fine-tuning that forces models to shift from an unsafe answer to a refusal when faced with harmful requests. Unfortunately, these drastic distribution shifts generally compromise model capabilities. To avoid that, we propose to expand the model's vocabulary with a special token we call a *red flag token* (

2025-07-25

colmweb.org/COLM/2025/Workshop/SoLaR (poster)

openreview.net

Discrete Compositional Generation via General Soft Operators and Robust Reinforcement Learning

Bilun Sun

A major bottleneck in scientific discovery consists of narrowing an exponentially large set of objects, such as proteins or molecules, to a small set of promising candidates with desirable properties. While this process can rely on expert knowledge, recent methods leverage reinforcement learning (RL) guided by a proxy reward function to enable this filtering. By employing various forms of entropy regularization, these methods aim to learn samplers that generate diverse candidates that are highly rated by the proxy function. In this work, we make two main contributions. First, we show that these methods are liable to generate overly diverse, suboptimal candidates in large search spaces. To address this issue, we introduce a novel unified operator that combines several regularized RL operators into a general framework that better targets peakier sampling distributions. Secondly, we offer a novel, robust RL perspective of this filtering process. The regularization can be interpreted as robustness to a compositional form of uncertainty in the proxy function (i.e., the true evaluation of a candidate differs from the proxy's evaluation). Our analysis leads us to a novel, easy-to-use algorithm we name trajectory general mellowmax (TGM): we show it identifies higher quality, diverse candidates than baselines in both synthetic and real-world tasks. Code: https://github.com/marcojira/tgm.

2025-06-20

ArXiv (preprint)

arxiv.org
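
For readers unfamiliar with the name, the sketch below shows the classical mellowmax operator from the RL literature, which the name "trajectory general mellowmax" alludes to. It is not the TGM algorithm from the abstract above, whose definition is not given here; the function name `mellowmax` and the parameter name `omega` are illustrative assumptions. The operator interpolates between the mean of the values (small `omega`) and their maximum (large `omega`), which gives some intuition for what targeting a "peakier" distribution can mean.

```python
import math


def mellowmax(values, omega):
    """Classical mellowmax: log(mean(exp(omega * v))) / omega, for omega > 0.

    Illustrative only; this is NOT the TGM operator from the paper.
    """
    # Subtract the largest scaled value before exponentiating for numerical stability.
    shift = max(omega * v for v in values)
    mean_exp = sum(math.exp(omega * v - shift) for v in values) / len(values)
    return (shift + math.log(mean_exp)) / omega


if __name__ == "__main__":
    scores = [1.0, 2.0, 3.0]  # hypothetical proxy rewards
    for omega in (1e-6, 1.0, 10.0, 100.0):
        print(omega, mellowmax(scores, omega))
```

As `omega` grows, the result moves from the average score toward the best score, so a larger `omega` weights high-reward candidates more heavily.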

Robust Reinforcement Learning for Discrete Compositional Generation via General Soft Operators

Bilun Sun

A major bottleneck in scientific discovery involves narrowing a large combinatorial set of objects, such as proteins or molecules, to a small set of promising candidates. While this process largely relies on expert knowledge, recent methods leverage reinforcement learning (RL) to enhance this filtering. They achieve this by estimating proxy reward functions from available datasets and using regularization to generate more diverse candidates. These reward functions are inherently uncertain, raising a particularly salient challenge for scientific discovery. In this work, we show that existing methods, often framed as sampling proportional to a reward function, are inadequate and yield suboptimal candidates, especially in large search spaces. To remedy this issue, we take a robust RL approach and introduce a unified operator that seeks robustness to the uncertainty of the proxy reward function. This general operator targets peakier sampling distributions while encompassing known soft RL operators. It also leads us to a novel algorithm that identifies higher-quality, diverse candidates in both synthetic and real-world tasks. Ultimately, our work offers a new, flexible perspective on discrete compositional generation tasks. Code: https://github.com/marcojira/tgm.

2025-06-20

ArXiv (preprint)

arxiv.org

Robust Reinforcement Learning for Discrete Compositional Generation via General Soft Operators

Bilun Sun

A major bottleneck in scientific discovery involves narrowing a large combinatorial set of objects, such as proteins or molecules, to a small set of promising candidates. While this process largely relies on expert knowledge, recent methods leverage reinforcement learning (RL) to enhance this filtering. They achieve this by estimating proxy reward functions from available datasets and using regularization to generate more diverse candidates. These reward functions are inherently uncertain, raising a particularly salient challenge for scientific discovery. In this work, we show that existing methods, often framed as sampling proportional to a reward function, are inadequate and yield suboptimal candidates, especially in large search spaces. To remedy this issue, we take a robust RL approach and introduce a unified operator that seeks robustness to the uncertainty of the proxy reward function. This general operator targets peakier sampling distributions while encompassing known soft RL operators. It also leads us to a novel algorithm that identifies higher-quality, diverse candidates in both synthetic and real-world tasks. Ultimately, our work offers a new, flexible perspective on discrete compositional generation tasks. Code: https://github.com/marcojira/tgm.

2025-06-01

arXiv (published)

doi.org

arxiv.org

Jailbreak Distillation: Renewable Safety Benchmarking

Jingyu Zhang

Ahmed Elgohary

Xiawei Wang

A S M Iftekhar

Ahmed Magooda

Benjamin Van Durme

Daniel Khashabi

Kyle Jackson

2025-05-28

ArXiv (preprint)

doi.org

arxiv.org

Dimension-adapted Momentum Outscales SGD

2025-05-22

ArXiv (preprint)

arxiv.org

Dimension-adapted Momentum Outscales SGD

We investigate scaling laws for stochastic momentum algorithms with small batch on the power law random features model, parameterized by data complexity, target complexity, and model size. When trained with a stochastic momentum algorithm, our analysis reveals four distinct loss curve shapes determined by varying data-target complexities. While traditional stochastic gradient descent with momentum (SGD-M) yields identical scaling law exponents to SGD, dimension-adapted Nesterov acceleration (DANA) improves these exponents by scaling momentum hyperparameters based on model size and data complexity. This outscaling phenomenon, which also improves compute-optimal scaling behavior, is achieved by DANA across a broad range of data and target complexities, while traditional methods fall short. Extensive experiments on high-dimensional synthetic quadratics validate our theoretical predictions and large-scale text experiments with LSTMs show DANA's improved loss exponents over SGD hold in a practical setting.

2025-05-01

arXiv (published)

doi.org

arxiv.org

Self-Play $Q$-Learners Can Provably Collude in the Iterated Prisoner's Dilemma

Quentin Bertrand

Juan Agustin Duque

Emilio Calvano

Gauthier Gidel

A growing body of computational studies shows that simple machine learning agents converge to cooperative behaviors in social dilemmas, such as collusive price-setting in oligopoly markets, raising questions about what drives this outcome. In this work, we provide theoretical foundations for this phenomenon in the context of self-play multi-agent Q-learners in the iterated prisoner’s dilemma. We characterize broad conditions under which such agents provably learn the cooperative Pavlov (win-stay, lose-shift) policy rather than the Pareto-dominated “always defect” policy. We validate our theoretical results through additional experiments, demonstrating their robustness across a broader class of deep learning algorithms.

2025-05-01

ICML.cc/2025/Conference (poster)

proceedings.mlr.press

openreview.net
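
As a minimal sketch of the two policies named in the abstract above, the code below plays Pavlov (win-stay, lose-shift) and "always defect" against themselves in an iterated prisoner's dilemma. It does not implement the Q-learning dynamics analyzed in the paper. The payoff values in `PAYOFFS` and the helper `play` are hypothetical choices for illustration, assuming the usual ordering temptation > reward > punishment > sucker.

```python
# Actions: "C" (cooperate) or "D" (defect). Policies see the previous joint action.

# Hypothetical payoffs (row player, column player); not taken from the paper.
PAYOFFS = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}


def pavlov(prev_own, prev_other):
    """Win-stay, lose-shift: cooperate iff both players matched last round."""
    return "C" if prev_own == prev_other else "D"


def always_defect(prev_own, prev_other):
    """Defect regardless of the previous round."""
    return "D"


def play(policy_a, policy_b, start=("C", "C"), rounds=10):
    """Run the repeated game from an initial joint action; return history and totals."""
    a, b = start
    total_a = total_b = 0
    history = []
    for _ in range(rounds):
        # Both players act simultaneously on the previous joint action.
        a, b = policy_a(a, b), policy_b(b, a)
        reward_a, reward_b = PAYOFFS[(a, b)]
        total_a += reward_a
        total_b += reward_b
        history.append((a, b))
    return history, (total_a, total_b)


if __name__ == "__main__":
    print(play(pavlov, pavlov, start=("C", "D"), rounds=5))
    print(play(always_defect, always_defect, rounds=5))
```

Under these assumed payoffs, two Pavlov players return to mutual cooperation after a single mismatched round, while two always-defect players stay at the lower mutual-defection payoff, which is the sense in which "always defect" is Pareto-dominated.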

Performative Prediction on Games and Mechanism Design

Fernando P. Santos

2025-04-23

Proceedings of The 28th International Conference on Artificial Intelligence and Statistics (published)

doi.org

openreview.net

A Generative Approach to LLM Harmfulness Detection with Red Flag Tokens

David Dobre

Most safety training methods for large language models (LLMs) that are based on fine-tuning rely on dramatically changing the model's output distribution when faced with a harmful request, shifting it from an unsafe answer to a refusal to respond. These methods inherently compromise model capabilities and might make auto-regressive models vulnerable to attacks that make an initial token of an affirmative response likely. To avoid that, we propose to expand the model's vocabulary with a special token we call a *red flag token* (

2025-03-05

ICLR.cc/2025/Workshop/BuildingTrust (accepted)

openreview.net
