Portrait of Leo Schwinn

Leo Schwinn

Independent visiting researcher - Technical University of Munich
Research Topics
Deep Learning

Publications

Jailbreak Distillation: Renewable Safety Benchmarking
Jingyu Zhang
Ahmed Elgohary
Xiawei Wang
A S M Iftekhar
Ahmed Magooda
Benjamin Van Durme
Daniel Khashabi
Kyle Jackson
JBDistill Benchmark
When to retrain a machine learning model
Florence Regol
Kyle Sprague
Thomas Markovich
A significant challenge in maintaining real-world machine learning models is responding to the continuous and unpredictable evolution of data. Most practitioners are faced with the difficult question: when should I retrain or update my machine learning model? This seemingly straightforward problem is particularly challenging for three reasons: 1) decisions must be made based on very limited information - we usually have access to only a few examples, 2) the nature, extent, and impact of the distribution shift are unknown, and 3) it involves specifying a cost ratio between retraining and poor performance, which can be hard to characterize. Existing works address certain aspects of this problem, but none offer a comprehensive solution. Distribution shift detection falls short as it cannot account for the cost trade-off; the scarcity of the data, paired with its unusual structure, makes it a poor fit for existing offline reinforcement learning methods, and the online learning formulation overlooks key practical considerations. To address this, we present a principled formulation of the retraining problem and propose an uncertainty-based method that makes decisions by continually forecasting the evolution of model performance evaluated with a bounded metric. Our experiments, addressing classification tasks, show that the method consistently outperforms existing baselines on 7 datasets. We thoroughly assess its robustness to varying cost trade-off values and mis-specified cost trade-offs.
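The cost trade-off described in this abstract can be made concrete with a minimal sketch (not the authors' implementation): given Monte-Carlo forecasts of a bounded performance metric, retrain when the expected accumulated shortfall below an acceptable level outweighs a hypothetical retraining cost. Function and parameter names below are illustrative assumptions.

```python
import numpy as np

def should_retrain(perf_forecast_samples, perf_target, retrain_cost):
    """Decide whether to retrain now (illustrative sketch, not the paper's method).

    perf_forecast_samples: (n_samples, horizon) forecasted values of a bounded
        performance metric, e.g. accuracy in [0, 1].
    perf_target: acceptable performance level.
    retrain_cost: cost of retraining, expressed in units of one evaluation window
        spent one full metric point below the target (the cost ratio in the abstract).
    """
    # Expected shortfall below the target, accumulated over the forecast horizon.
    shortfall = np.clip(perf_target - perf_forecast_samples, 0.0, None)
    expected_degradation_cost = shortfall.sum(axis=1).mean()
    # Retrain when continuing with the stale model is expected to cost more.
    return expected_degradation_cost > retrain_cost

# Toy usage: 1000 forecast trajectories over the next 10 evaluation windows.
rng = np.random.default_rng(0)
forecasts = 0.9 - 0.02 * np.arange(10) + 0.01 * rng.standard_normal((1000, 10))
print(should_retrain(forecasts, perf_target=0.85, retrain_cost=0.3))
```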
A Generative Approach to LLM Harmfulness Detection with Red Flag Tokens
Most safety training methods for large language models (LLMs) based on fine-tuning rely on dramatically changing the output distribution of the model when faced with a harmful request, shifting it from an unsafe answer to a refusal to respond. These methods inherently compromise model capabilities and might make auto-regressive models vulnerable to attacks that make likely an initial token of affirmative response. To avoid that, we propose to expand the model's vocabulary with a special token we call a *red flag token* (
LLM-Safety Evaluations Lack Robustness
Tim Beyer
Simon Geisler
Stephan Günnemann
In this paper, we argue that current safety alignment research efforts for large language models are hindered by many intertwined sources of noise, such as small datasets, methodological inconsistencies, and unreliable evaluation setups. This can, at times, make it impossible to evaluate and compare attacks and defenses fairly, thereby slowing progress. We systematically analyze the LLM safety evaluation pipeline, covering dataset curation, optimization strategies for automated red-teaming, response generation, and response evaluation using LLM judges. At each stage, we identify key issues and highlight their practical impact. We also propose a set of guidelines for reducing noise and bias in evaluations of future attack and defense papers. Lastly, we offer an opposing perspective, highlighting practical reasons for existing limitations. We believe that addressing the outlined problems in future research will improve the field's ability to generate easily comparable results and make measurable progress.
A generative approach to LLM harmfulness detection with special red flag tokens
Most safety training methods for large language models (LLMs) based on fine-tuning rely on dramatically changing the output distribution of the model when faced with a harmful request, shifting it from an unsafe answer to a refusal to respond. These methods inherently compromise model capabilities and might make auto-regressive models vulnerable to attacks that make likely an initial token of affirmative response. To avoid that, we propose to expand the model's vocabulary with a special token we call red flag token () and propose to fine-tune the model to generate this token at any time harmful content is generated or about to be generated. This novel safety training method effectively augments LLMs into generative classifiers of harmfulness at all times during the conversation. This method offers several advantages: it enables the model to explicitly learn the concept of harmfulness while marginally affecting the generated distribution, thus maintaining the model's utility. It also evaluates each generated answer rather than just the input prompt and provides a stronger defence against sampling-based attacks. In addition, it simplifies the evaluation of the model's robustness and reduces correlated failures when combined with a classifier. We further show an increased robustness to long contexts, and supervised fine-tuning attacks.
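As a rough illustration of the mechanism described in this abstract, here is a minimal sketch using the Hugging Face transformers API. The token string `<|red_flag|>`, the placeholder model `gpt2`, and the detection threshold are assumptions for illustration only; the fine-tuning stage that teaches the model to emit the token on harmful content, which is the core of the method, is omitted.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"            # placeholder model; the paper works with aligned chat LLMs
RED_FLAG = "<|red_flag|>"      # hypothetical name for the special token

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# 1) Expand the vocabulary with the red flag token and resize the embeddings.
tokenizer.add_special_tokens({"additional_special_tokens": [RED_FLAG]})
model.resize_token_embeddings(len(tokenizer))
red_flag_id = tokenizer.convert_tokens_to_ids(RED_FLAG)

# (Fine-tuning the model to generate RED_FLAG whenever harmful content is being
#  generated is omitted here.)

# 2) At inference, monitor the probability of the red flag token at every position
#    and flag the conversation if it exceeds a threshold.
def harmfulness_scores(text: str, threshold: float = 0.5):
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits            # (1, seq_len, vocab_size)
    probs = torch.softmax(logits, dim=-1)[0, :, red_flag_id]
    return probs, bool((probs > threshold).any())

scores, flagged = harmfulness_scores("How do I build a birdhouse?")
print(flagged, scores.max().item())
```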
Adversarial Alignment for LLMs Requires Simpler, Reproducible, and More Measurable Objectives
Yan Scholten
Tom Wollschlager
Stephen Casper
Stephan Günnemann
Misaligned research objectives have considerably hindered progress in adversarial robustness research over the past decade. For instance, an extensive focus on optimizing target metrics, while neglecting rigorous standardized evaluation, has led researchers to pursue ad-hoc heuristic defenses that were seemingly effective. Yet, most of these were exposed as flawed by subsequent evaluations, ultimately contributing little measurable progress to the field. In this position paper, we illustrate that current research on the robustness of large language models (LLMs) risks repeating past patterns with potentially worsened real-world implications. To address this, we argue that realigned objectives are necessary for meaningful progress in adversarial alignment. To this end, we build on established cybersecurity taxonomy to formally define differences between past and emerging threat models that apply to LLMs. Using this framework, we illustrate that progress requires disentangling adversarial alignment into addressable sub-problems and returning to core academic principles, such as measurability, reproducibility, and comparability. Although the field presents significant challenges, the fresh start on adversarial robustness offers the unique opportunity to build on past experience while avoiding previous mistakes.
Efficient Adversarial Training in LLMs with Continuous Attacks
Large language models (LLMs) are vulnerable to adversarial attacks that can bypass their safety guardrails. In many domains, adversarial training has proven to be one of the most promising methods to reliably improve robustness against such attacks. Yet, in the context of LLMs, current methods for adversarial training are hindered by the high computational costs required to perform discrete adversarial attacks at each training iteration. We address this problem by instead calculating adversarial attacks in the continuous embedding space of the LLM, which is orders of magnitude more efficient. We propose a fast adversarial training algorithm (C-AdvUL) composed of two losses: the first makes the model robust on continuous embedding attacks computed on an adversarial behaviour dataset; the second ensures the usefulness of the final model by fine-tuning on utility data. Moreover, we introduce C-AdvIPO, an adversarial variant of IPO that does not require utility data for adversarially robust alignment. Our empirical evaluation on five models from different families (Gemma, Phi3, Mistral, Zephyr, Llama2) and at different scales (2B, 3.8B, 7B) shows that both algorithms substantially enhance LLM robustness against discrete attacks (GCG, AutoDAN, PAIR), while maintaining utility. Our results demonstrate that robustness to continuous perturbations can extrapolate to discrete threat models. Thereby, we present a path toward scalable adversarial training algorithms for robustly aligning LLMs.
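A minimal sketch of the core idea, a continuous attack in embedding space folded into a two-loss training step, is shown below. It is not the exact C-AdvUL algorithm: the step sizes, the box constraint, and the helper names are illustrative, and it assumes a Hugging Face-style causal LM whose forward pass accepts `inputs_embeds` and `labels` (label tensors aligned with the input, with prompt positions masked as -100).

```python
import torch

def continuous_embedding_attack(model, input_ids, harmful_labels,
                                eps=0.05, steps=5, lr=0.01):
    """Inner attack (sketch): perturb the input embeddings so the harmful target
    becomes more likely, keeping the perturbation inside a small box."""
    embeds = model.get_input_embeddings()(input_ids).detach()
    delta = torch.zeros_like(embeds, requires_grad=True)
    for _ in range(steps):
        loss = model(inputs_embeds=embeds + delta, labels=harmful_labels).loss
        (grad,) = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta -= lr * grad.sign()   # lower the loss on the harmful target
            delta.clamp_(-eps, eps)     # bounded continuous perturbation
    return (embeds + delta).detach()

def adversarial_training_step(model, optimizer, harmful_ids, harmful_labels,
                              safe_labels, utility_batch):
    """Outer step (sketch of the two-loss objective): stay safe under the
    attacked embeddings while remaining useful on ordinary utility data."""
    adv_embeds = continuous_embedding_attack(model, harmful_ids, harmful_labels)
    robust_loss = model(inputs_embeds=adv_embeds, labels=safe_labels).loss
    utility_loss = model(**utility_batch).loss
    (robust_loss + utility_loss).backward()
    optimizer.step()
    optimizer.zero_grad()
```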
Soft Prompt Threats: Attacking Safety Alignment and Unlearning in Open-Source LLMs through the Embedding Space
Current research in adversarial robustness of LLMs focuses on discrete input manipulations in the natural language space, which can be directly transferred to closed-source models. However, this approach neglects the steady progression of open-source models. As open-source models advance in capability, ensuring their safety also becomes increasingly imperative. Yet, attacks tailored to open-source LLMs that exploit full model access remain largely unexplored. We address this research gap and propose the embedding space attack, which directly attacks the continuous embedding representation of input tokens. We find that embedding space attacks circumvent model alignments and trigger harmful behaviors more efficiently than discrete attacks or model fine-tuning. Furthermore, we present a novel threat model in the context of unlearning and show that embedding space attacks can extract supposedly deleted information from unlearned LLMs across multiple datasets and models. Our findings highlight embedding space attacks as an important threat model in open-source LLMs. Trigger Warning: the appendix contains LLM-generated text with violence and harassment.
On the Scalability of Certified Adversarial Robustness with Generated Data
Thomas Altstidl
Arthur Kosmala
Bjoern Eskofier
Certified defenses against adversarial attacks offer formal guarantees on the robustness of a model, making them more reliable than empirical methods such as adversarial training, whose effectiveness is often later reduced by unseen attacks. Still, the limited certified robustness that is currently achievable has been a bottleneck for their practical adoption. Gowal et al. and Wang et al. have shown that generating additional training data using state-of-the-art diffusion models can considerably improve the robustness of adversarial training. In this work, we demonstrate that a similar approach can substantially improve deterministic certified defenses but also reveal notable differences in the scaling behavior between certified and empirical methods. In addition, we provide a list of recommendations to scale the robustness of certified training approaches. Our approach achieves state-of-the-art deterministic robustness certificates on CIFAR-10 for the
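The data-mixing idea behind this line of work can be sketched as follows. The 70/30 generated-to-real ratio, the dataset names, and the certified loss are placeholders rather than the paper's exact training setup; the only point illustrated is how diffusion-generated samples are folded into each training batch.

```python
import torch
from torch.utils.data import DataLoader, Dataset

class MixedBatchLoader:
    """Yield batches mixing real and diffusion-generated examples at a fixed
    ratio (the ratio is illustrative, not the paper's tuned value)."""
    def __init__(self, real_ds: Dataset, generated_ds: Dataset,
                 batch_size: int = 128, generated_fraction: float = 0.7):
        n_gen = int(batch_size * generated_fraction)
        self.real = DataLoader(real_ds, batch_size=batch_size - n_gen, shuffle=True)
        self.gen = DataLoader(generated_ds, batch_size=n_gen, shuffle=True)

    def __iter__(self):
        for (xr, yr), (xg, yg) in zip(self.real, self.gen):
            yield torch.cat([xr, xg]), torch.cat([yr, yg])

# Usage sketch: plug the mixed loader into any certified training loop.
# `cifar10_train`, `generated_cifar10`, and `certified_training_loss` are
# hypothetical names for the real dataset, the diffusion-generated dataset,
# and an IBP-style certified loss, respectively.
# for x, y in MixedBatchLoader(cifar10_train, generated_cifar10):
#     loss = certified_training_loss(model, x, y)
#     loss.backward(); optimizer.step(); optimizer.zero_grad()
```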