Leo Schwinn

A Coin Flip for Safety: LLM Judges Fail to Reliably Measure Adversarial Robustness

Moritz Ladenburger

Tim Beyer

Stephan Günnemann

Automated "LLM-as-a-Judge" frameworks have become the de facto standard for scalable evaluation across natural language processing. For instance, in safety evaluation, these judges are relied upon to evaluate harmfulness in order to benchmark the robustness of safety against adversarial attacks. However, we show that existing validation protocols fail to account for substantial distribution shifts inherent to red-teaming: diverse victim models exhibit distinct generation styles, attacks distort output patterns, and semantic ambiguity varies significantly across jailbreak scenarios. Through a comprehensive audit using 6642 human-verified labels, we reveal that the unpredictable interaction of these shifts often causes judge performance to degrade to near random chance. This stands in stark contrast to the high human agreement reported in prior work. Crucially, we find that many attacks inflate their success rates by exploiting judge insufficiencies rather than eliciting genuinely harmful content. To enable more reliable evaluation, we propose ReliableBench, a benchmark of behaviors that remain more consistently judgeable, and JudgeStressTest, a dataset designed to expose judge failures. (Data in supplement).

2025-12-31

International Conference on Machine Learning (Accept (regular))

doi.org

openreview.net

Position: LLM-Safety Evaluations Lack Robustness

Tim Beyer

Sophie Xhonneux

Simon Geisler

Gauthier Gidel

Leo Schwinn

Stephan Günnemann

In this position paper, we argue that current safety alignment research efforts for large language models are hindered by many intertwined sources of noise, such as small datasets, methodological inconsistencies, and unreliable evaluation setups. This can, at times, make it impossible to evaluate and compare attacks and defenses fairly, thereby slowing research progress. We systematically analyze the LLM safety evaluation pipeline, covering dataset curation, optimization strategies for automated red-teaming, response generation, and response evaluation using LLM judges. At each stage, we identify key issues and highlight their practical impact. We also propose a set of guidelines for reducing noise and bias in evaluations of future attack and defense papers. Lastly, we offer an opposing perspective, highlighting practical reasons for existing limitations. We believe that addressing the outlined problems in future research will improve the field’s ability to generate easily comparable results and make measurable progress.

2025-12-31

International Conference on Machine Learning (Accept (regular))

openreview.net

When to retrain a machine learning model

Florence Regol

Leo Schwinn

Kyle Sprague

Mark J. Coates

Thomas Markovich

A significant challenge in maintaining real-world machine learning models is responding to the continuous and unpredictable evolution of data. Most practitioners are faced with the difficult question: when should I retrain or update my machine learning model? This seemingly straightforward problem is particularly challenging for three reasons: 1) decisions must be made based on very limited information - we usually have access to only a few examples, 2) the nature, extent, and impact of the distribution shift are unknown, and 3) it involves specifying a cost ratio between retraining and poor performance, which can be hard to characterize. Existing works address certain aspects of this problem, but none offer a comprehensive solution. Distribution shift detection falls short as it cannot account for the cost trade-off; the scarcity of the data, paired with its unusual structure, makes it a poor fit for existing offline reinforcement learning methods, and the online learning formulation overlooks key practical considerations. To address this, we present a principled formulation of the retraining problem and propose an uncertainty-based method that makes decisions by continually forecasting the evolution of model performance evaluated with a bounded metric. Our experiments, addressing classification tasks, show that the method consistently outperforms existing baselines on 7 datasets. We thoroughly assess its robustness to varying cost trade-off values and mis-specified cost trade-offs.

2025-10-05

Proceedings of the 42nd International Conference on Machine Learning (published)

proceedings.mlr.press

Jailbreak Distillation: Renewable Safety Benchmarking

Jingyu Zhang

Ahmed Elgohary

Xiawei Wang

A S M Iftekhar

Ahmed Magooda

Benjamin Van Durme

Daniel Khashabi

Kyle Jackson

2025-05-27

ArXiv (preprint)

doi.org

arxiv.org

A Generative Approach to LLM Harmfulness Detection with Red Flag Tokens

Most safety training methods for large-language models (LLMs) based on fine-tuning rely on dramatically changing the output distribution of the model when faced with a harmful request, shifting it from an unsafe answer to a refusal to respond. These methods inherently compromise model capabilities and might make auto-regressive models vulnerable to attacks that make likely an initial token of affirmative response. To avoid that, we propose to expand the model's vocabulary with a special token we call a *red flag token* (

2025-03-04

ICLR.cc/2025/Workshop/BuildingTrust (accepted)

openreview.net

LLM-Safety Evaluations Lack Robustness

Tim Beyer

Sophie Xhonneux

Simon Geisler

Gauthier Gidel

Leo Schwinn

Stephan Günnemann

In this paper, we argue that current safety alignment research efforts for large language models are hindered by many intertwined sources of noise, such as small datasets, methodological inconsistencies, and unreliable evaluation setups. This can, at times, make it impossible to evaluate and compare attacks and defenses fairly, thereby slowing progress. We systematically analyze the LLM safety evaluation pipeline, covering dataset curation, optimization strategies for automated red-teaming, response generation, and response evaluation using LLM judges. At each stage, we identify key issues and highlight their practical impact. We also propose a set of guidelines for reducing noise and bias in evaluations of future attack and defense papers. Lastly, we offer an opposing perspective, highlighting practical reasons for existing limitations. We believe that addressing the outlined problems in future research will improve the field's ability to generate easily comparable results and make measurable progress.

2025-03-03

ArXiv (preprint)

doi.org

arxiv.org

Adversarial Alignment for LLMs Requires Simpler, Reproducible, and More Measurable Objectives

Leo Schwinn

Yan Scholten

Tom Wollschlager

Sophie Xhonneux

Stephen Casper

Stephan Günnemann

Gauthier Gidel

2025-02-16

ArXiv (preprint)

doi.org

arxiv.org

A Generative Approach to LLM Harmfulness Mitigation with Red Flag Tokens

2024-12-31

arXiv.org (preprint)

doi.org

openreview.net

Efficient Adversarial Training in LLMs with Continuous Attacks

Stephan Günnemann

Large language models (LLMs) are vulnerable to adversarial attacks that can bypass their safety guardrails. In many domains, adversarial training has proven to be one of the most promising methods to reliably improve robustness against such attacks. Yet, in the context of LLMs, current methods for adversarial training are hindered by the high computational costs required to perform discrete adversarial attacks at each training iteration. We address this problem by instead calculating adversarial attacks in the continuous embedding space of the LLM, which is orders of magnitude more efficient. We propose a fast adversarial training algorithm (C-AdvUL) composed of two losses: the first makes the model robust on continuous embedding attacks computed on an adversarial behaviour dataset; the second ensures the usefulness of the final model by fine-tuning on utility data. Moreover, we introduce C-AdvIPO, an adversarial variant of IPO that does not require utility data for adversarially robust alignment. Our empirical evaluation on five models from different families (Gemma, Phi3, Mistral, Zephyr, Llama2) and at different scales (2B, 3.8B, 7B) shows that both algorithms substantially enhance LLM robustness against discrete attacks (GCG, AutoDAN, PAIR), while maintaining utility. Our results demonstrate that robustness to continuous perturbations can extrapolate to discrete threat models. Thereby, we present a path toward scalable adversarial training algorithms for robustly aligning LLMs.

2024-09-24

NeurIPS.cc/2024/Conference (spotlight)

doi.org

openreview.net

Soft Prompt Threats: Attacking Safety Alignment and Unlearning in Open-Source LLMs through the Embedding Space

Stephan Günnemann

Current research in adversarial robustness of LLMs focuses on discrete input manipulations in the natural language space, which can be directly transferred to closed-source models. However, this approach neglects the steady progression of open-source models. As open-source models advance in capability, ensuring their safety also becomes increasingly imperative. Yet, attacks tailored to open-source LLMs that exploit full model access remain largely unexplored. We address this research gap and propose the embedding space attack, which directly attacks the continuous embedding representation of input tokens. We find that embedding space attacks circumvent model alignments and trigger harmful behaviors more efficiently than discrete attacks or model fine-tuning. Furthermore, we present a novel threat model in the context of unlearning and show that embedding space attacks can extract supposedly deleted information from unlearned LLMs across multiple datasets and models. Our findings highlight embedding space attacks as an important threat model in open-source LLMs. Trigger Warning: the appendix contains LLM-generated text with violence and harassment.

2024-09-24

NeurIPS.cc/2024/Conference (poster)

doi.org

openreview.net

On the Scalability of Certified Adversarial Robustness with Generated Data

Thomas Altstidl

David Dobre

Arthur Kosmala

Björn Eskofier

Gauthier Gidel

Leo Schwinn

Certified defenses against adversarial attacks offer formal guarantees on the robustness of a model, making them more reliable than empirical methods such as adversarial training, whose effectiveness is often later reduced by unseen attacks. Still, the limited certified robustness that is currently achievable has been a bottleneck for their practical adoption. Gowal et al. and Wang et al. have shown that generating additional training data using state-of-the-art diffusion models can considerably improve the robustness of adversarial training. In this work, we demonstrate that a similar approach can substantially improve deterministic certified defenses but also reveal notable differences in the scaling behavior between certified and empirical methods. In addition, we provide a list of recommendations to scale the robustness of certified training approaches. Our approach achieves state-of-the-art deterministic robustness certificates on CIFAR-10 for the $\ell_2$ ($\epsilon = 36/255$) and $\ell_\infty$ ($\epsilon = 8/255$) threat models, outperforming the previous results by +3.95 and +1.39 percentage points, respectively. Furthermore, we report similar improvements for CIFAR-100.

2024-09-24

NeurIPS.cc/2024/Conference (poster)

doi.org

openreview.net

Adversarial Attacks and Defenses in Large Language Models: Old and New Threats

Leo Schwinn

David Dobre

Stephan Günnemann

Gauthier Gidel

Over the past decade, there has been extensive research aimed at enhancing the robustness of neural networks, yet this problem remains vastly unsolved. Here, one major impediment has been the overestimation of the robustness of new defense approaches due to faulty defense evaluations. Flawed robustness evaluations necessitate rectifications in subsequent works, dangerously slowing down the research and providing a false sense of security. In this context, we will face substantial challenges associated with an impending adversarial arms race in natural language processing, specifically with closed-source Large Language Models (LLMs), such as ChatGPT, Google Bard, or Anthropic's Claude. We provide a first set of prerequisites to improve the robustness assessment of new approaches and reduce the amount of faulty evaluations. Additionally, we identify embedding space attacks on LLMs as another viable threat model for the purposes of generating malicious content in open-sourced models. Finally, we demonstrate on a recently proposed defense that, without LLM-specific best practices in place, it is easy to overestimate the robustness of a new approach.

2023-10-26

NeurIPS.cc/2023/Workshop/ICBINB (published)

doi.org

proceedings.mlr.press
