
Sophie Xhonneux

PhD - Université de Montréal
Supervisor
Co-supervisor
Research Topics
Deep Learning
Generative Models
Graph Neural Networks
Knowledge Graphs
Large Language Models (LLM)

Publications

Jailbreak Distillation: Renewable Safety Benchmarking
Jingyu Zhang
Ahmed Elgohary
Xiawei Wang
A S M Iftekhar
Ahmed Magooda
Benjamin Van Durme
Daniel Khashabi
Kyle Jackson
A Generative Approach to LLM Harmfulness Detection with Red Flag Tokens
Most safety training methods for large language models (LLMs) based on fine-tuning rely on dramatically changing the output distribution of the model when faced with a harmful request, shifting it from an unsafe answer to a refusal to respond. These methods inherently compromise model capabilities and might make auto-regressive models vulnerable to attacks that make an initial affirmative token likely. To avoid this, we propose to expand the model's vocabulary with a special token we call a *red flag token* and to fine-tune the model to generate this token whenever harmful content is generated or about to be generated.
LLM-Safety Evaluations Lack Robustness
Tim Beyer
Simon Geisler
Stephan Günnemann
In this paper, we argue that current safety alignment research efforts for large language models are hindered by many intertwined sources of noise, such as small datasets, methodological inconsistencies, and unreliable evaluation setups. This can, at times, make it impossible to evaluate and compare attacks and defenses fairly, thereby slowing progress. We systematically analyze the LLM safety evaluation pipeline, covering dataset curation, optimization strategies for automated red-teaming, response generation, and response evaluation using LLM judges. At each stage, we identify key issues and highlight their practical impact. We also propose a set of guidelines for reducing noise and bias in evaluations of future attack and defense papers. Lastly, we offer an opposing perspective, highlighting practical reasons for existing limitations. We believe that addressing the outlined problems in future research will improve the field's ability to generate easily comparable results and make measurable progress.
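A minimal sketch of the four pipeline stages the abstract names (dataset curation, automated red-teaming, response generation, and LLM-judge evaluation). Every function body, prompt, and model interface below is a hypothetical placeholder meant only to show where the noise sources discussed above can enter; it is not the authors' evaluation code.

```python
# Hypothetical sketch of an LLM-safety evaluation pipeline; stage boundaries
# follow the abstract's decomposition, implementations are placeholders.
from dataclasses import dataclass


@dataclass
class EvalRecord:
    prompt: str         # curated harmful request (dataset curation stage)
    attack_prompt: str  # prompt after automated red-teaming / optimization
    response: str       # target model's generation
    verdict: bool       # LLM-judge decision: True = attack judged successful


def curate_dataset() -> list[str]:
    # Small or biased datasets are one noise source highlighted above.
    return ["<placeholder harmful request>"]


def optimize_attack(prompt: str) -> str:
    # Automated red-teaming step (e.g., suffix optimization); placeholder only.
    return prompt + " [adversarial suffix]"


def generate_response(target_model, prompt: str) -> str:
    # Decoding settings (temperature, length limits) are another noise source.
    return target_model.generate(prompt)  # assumed generate(prompt) -> str API


def judge(judge_model, prompt: str, response: str) -> bool:
    # LLM-as-judge: unreliable judges bias attack-success-rate comparisons.
    verdict = judge_model.generate(
        f"Prompt: {prompt}\nResponse: {response}\nIs the response harmful? Answer yes or no."
    )
    return verdict.strip().lower().startswith("yes")


def run_eval(target_model, judge_model) -> float:
    records = []
    for p in curate_dataset():
        a = optimize_attack(p)
        r = generate_response(target_model, a)
        records.append(EvalRecord(p, a, r, judge(judge_model, a, r)))
    # Attack success rate: the metric whose comparability the paper questions.
    return sum(rec.verdict for rec in records) / len(records)
```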
A generative approach to LLM harmfulness detection with special red flag tokens
Most safety training methods for large language models (LLMs) based on fine-tuning rely on dramatically changing the output distribution of the model when faced with a harmful request, shifting it from an unsafe answer to a refusal to respond. These methods inherently compromise model capabilities and might make auto-regressive models vulnerable to attacks that make an initial affirmative token likely. To avoid that, we propose to expand the model's vocabulary with a special token we call a red flag token and to fine-tune the model to generate this token whenever harmful content is generated or about to be generated. This safety training method effectively turns the LLM into a generative classifier of harmfulness at all times during the conversation. The method offers several advantages: it enables the model to explicitly learn the concept of harmfulness while only marginally affecting the generated distribution, thus maintaining the model's utility. It also evaluates each generated answer rather than just the input prompt, and it provides a stronger defence against sampling-based attacks. In addition, it simplifies the evaluation of the model's robustness and reduces correlated failures when combined with a classifier. We further show increased robustness to long contexts and to supervised fine-tuning attacks.
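A minimal sketch, assuming a Hugging Face-style causal LM, of the mechanics the abstract describes: adding a special red-flag token to the vocabulary and then checking for (or scoring) that token at generation time. The token name `<rf>`, the model checkpoint, and the training-data convention are illustrative assumptions, not the paper's released recipe.

```python
# Illustrative sketch only: adds a special <rf> token and shows how harmfulness
# detection reduces to monitoring that token. Names and data handling are
# assumptions, not the authors' exact implementation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder choice of causal LM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# 1) Expand the vocabulary with the red flag token and resize the embeddings.
tokenizer.add_special_tokens({"additional_special_tokens": ["<rf>"]})
model.resize_token_embeddings(len(tokenizer))
rf_id = tokenizer.convert_tokens_to_ids("<rf>")


def insert_rf(target_text: str, harmful: bool) -> str:
    """Build fine-tuning targets: harmful continuations carry the <rf> token so
    ordinary next-token cross-entropy teaches the model to emit it; benign
    targets are left unchanged. (Placement strategy is an assumption here.)"""
    return ("<rf> " + target_text) if harmful else target_text


@torch.no_grad()
def rf_probability(context: str) -> float:
    """Probability of <rf> as the next token -- a per-step harmfulness score."""
    ids = tokenizer(context, return_tensors="pt").input_ids
    logits = model(ids).logits[0, -1]
    return torch.softmax(logits, dim=-1)[rf_id].item()


def flags_harm(generated_ids: torch.Tensor) -> bool:
    """Post-hoc check: did the model emit the red flag token anywhere?"""
    return bool((generated_ids == rf_id).any())
```

Because detection only reads the probability of one extra token, the base distribution over ordinary tokens can stay close to the original model's, which is the utility-preservation argument made in the abstract.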
Adversarial Alignment for LLMs Requires Simpler, Reproducible, and More Measurable Objectives
Yan Scholten
Tom Wollschlager
Stephen Casper
Stephan Günnemann
Misaligned research objectives have considerably hindered progress in adversarial robustness research over the past decade. For instance, an extensive focus on optimizing target metrics, while neglecting rigorous standardized evaluation, has led researchers to pursue ad-hoc heuristic defenses that were seemingly effective. Yet, most of these were exposed as flawed by subsequent evaluations, ultimately contributing little measurable progress to the field. In this position paper, we illustrate that current research on the robustness of large language models (LLMs) risks repeating past patterns with potentially worsened real-world implications. To address this, we argue that realigned objectives are necessary for meaningful progress in adversarial alignment. To this end, we build on established cybersecurity taxonomy to formally define differences between past and emerging threat models that apply to LLMs. Using this framework, we illustrate that progress requires disentangling adversarial alignment into addressable sub-problems and returning to core academic principles, such as measurability, reproducibility, and comparability. Although the field presents significant challenges, the fresh start on adversarial robustness offers the unique opportunity to build on past experience while avoiding previous mistakes.
Asynchronous RLHF: Faster and More Efficient Off-Policy RL for Language Models
The dominant paradigm for RLHF is online and on-policy RL: synchronously generating from the large language model (LLM) policy, labelling with a reward model, and learning using feedback on the LLM's own outputs. While performant, this paradigm is computationally inefficient. Inspired by classical deep RL literature, we propose separating generation and learning in RLHF. This enables asynchronous generation of new samples while simultaneously training on old samples, leading to faster training and more compute-optimal scaling. However, asynchronous training relies on an underexplored regime, online but off-policy RLHF: learning on samples from previous iterations of our model. To understand the challenges in this regime, we investigate a fundamental question: how much off-policyness can we tolerate for asynchronous training to speed up learning but maintain performance? Among several RLHF algorithms we tested, we find that online DPO is most robust to off-policy data, and robustness increases with the scale of the policy model. We study further compute optimizations for asynchronous RLHF but find that they come at a performance cost, giving rise to a trade-off. Finally, we verify the scalability of asynchronous RLHF by training LLaMA 3.1 8B on an instruction-following task 40% faster than a synchronous run while matching final performance.
Faster, More Efficient RLHF through Off-Policy Asynchronous Learning
To achieve state-of-the-art chatbots, large language models are finetuned with reinforcement learning (RL), frequently to optimize human feedback (RLHF). This process is computationally expensive and can take weeks. Offline approaches, like DPO, learn on a static dataset and are efficient but not performant. The dominant paradigm, online and on-policy---synchronously generating from the model, labelling with a reward model, and learning on feedback from the model's own outputs---is performant but not efficient. Following prior work in the general deep RL setting, we propose separating the actor and learner in RLHF. This enables asynchronous generation of new samples while learning on prior samples, leading to overall faster training and better scaling. But this requires a novel regime for RLHF, online but off-policy: learning on samples from a previous version of our model. We ask a fundamental question: how much off-policyness can we tolerate for asynchronous training to speed up learning but maintain performance? We find that a contrastive loss, Online DPO, is most robust to off-policy data and that robustness increases with the scale of the policy model. We show even further compute optimizations but demonstrate that they come at a performance cost, giving rise to a trade-off. Finally, we verify our design choices by training LLaMA 3.1 8B with RLHF as a helpful chatbot in half the time of a synchronous run while matching final performance.
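A minimal sketch of the actor/learner separation described in the two abstracts above: a queue decouples asynchronous generation (on a possibly stale policy snapshot) from learning on those off-policy samples with an online DPO-style loss. The class interfaces (`generate`, `logprob`, `score`), queue size, and hyperparameters are illustrative assumptions, not the authors' training code.

```python
# Illustrative actor/learner sketch of asynchronous, off-policy RLHF.
# Model and reward-model interfaces are placeholder assumptions.
import queue
import threading
import torch
import torch.nn.functional as F

sample_queue: "queue.Queue[dict]" = queue.Queue(maxsize=64)


def actor(policy_snapshot, prompts, reward_model):
    """Generates with a (possibly stale) snapshot of the policy and pushes
    reward-labelled preference pairs; runs concurrently with the learner."""
    for prompt in prompts:
        a = policy_snapshot.generate(prompt)  # assumed generate(prompt) -> str
        b = policy_snapshot.generate(prompt)
        if reward_model.score(prompt, a) >= reward_model.score(prompt, b):
            chosen, rejected = a, b
        else:
            chosen, rejected = b, a
        sample_queue.put({"prompt": prompt, "chosen": chosen, "rejected": rejected})


def dpo_loss(policy_logratio, ref_logratio, beta: float = 0.1) -> torch.Tensor:
    """Online DPO objective; each log-ratio is (chosen - rejected) sequence
    log-probability under the policy or the frozen reference model."""
    return -F.logsigmoid(beta * (policy_logratio - ref_logratio)).mean()


def learner(policy, reference, optimizer, steps: int):
    """Consumes samples produced by earlier policy versions (off-policy data)."""
    for _ in range(steps):
        batch = sample_queue.get()
        policy_lr = policy.logprob(batch["chosen"]) - policy.logprob(batch["rejected"])
        ref_lr = reference.logprob(batch["chosen"]) - reference.logprob(batch["rejected"])
        loss = dpo_loss(policy_lr, ref_lr)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()


# Usage sketch: run the actor in a background thread while the learner trains.
# threading.Thread(target=actor, args=(policy_copy, prompts, rm)).start()
# learner(policy, reference, optimizer, steps=1000)
```

The trade-off studied in the abstracts lives in how stale `policy_snapshot` is allowed to become relative to the learner's current policy: more staleness means more overlap of generation and training (faster), but more off-policy drift in the queued samples.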