Mehrnaz Mofakhami

A Generative Approach to LLM Harmfulness Mitigation with Red Flag Tokens

Most safety training methods for large language models (LLMs) are based on fine-tuning that forces models to shift from an unsafe answer to refusal when faced with harmful requests. Unfortunately, these drastic distribution shifts generally compromise model capabilities. To avoid that, we propose to expand the model's vocabulary with a special token we call *red flag token* (

2025-07-25

colmweb.org/COLM/2025/Workshop/SoLaR (poster)

openreview.net

Performative Prediction on Games and Mechanism Design

Fernando P. Santos

2025-04-23

Proceedings of The 28th International Conference on Artificial Intelligence and Statistics (published)

doi.org

openreview.net

A Generative Approach to LLM Harmfulness Detection with Red Flag Tokens

Most safety training methods for large language models (LLMs) based on fine-tuning rely on dramatically changing the output distribution of the model when faced with a harmful request, shifting it from an unsafe answer to a refusal to respond. These methods inherently compromise model capabilities and might make auto-regressive models vulnerable to attacks that make an affirmative initial response token likely. To avoid that, we propose to expand the model's vocabulary with a special token we call a *red flag token* (

2025-03-05

ICLR.cc/2025/Workshop/BuildingTrust (accepted)

openreview.net

A generative approach to LLM harmfulness detection with special red flag tokens

Most safety training methods for large language models (LLMs) based on fine-tuning rely on dramatically changing the output distribution of the model when faced with a harmful request, shifting it from an unsafe answer to a refusal to respond. These methods inherently compromise model capabilities and might make auto-regressive models vulnerable to attacks that make an affirmative initial response token likely. To avoid that, we propose to expand the model's vocabulary with a special token we call red flag token () and propose to fine-tune the model to generate this token whenever harmful content is generated or about to be generated. This novel safety training method effectively augments LLMs into generative classifiers of harmfulness at all times during the conversation. This method offers several advantages: it enables the model to explicitly learn the concept of harmfulness while only marginally affecting the generated distribution, thus maintaining the model's utility. It also evaluates each generated answer rather than just the input prompt and provides a stronger defence against sampling-based attacks. In addition, it simplifies the evaluation of the model's robustness and reduces correlated failures when combined with a classifier. We further show increased robustness to long contexts and to supervised fine-tuning attacks.

2025-02-22

ArXiv (preprint)

arxiv.org

A generative approach to LLM harmfulness detection with special red flag tokens

2025-01-01

arXiv.org (preprint)

doi.org

arxiv.org

Tight Lower Bounds and Improved Convergence in Performative Prediction

Pedram J. Khorsandi

Performative prediction is a framework accounting for the shift in the data distribution induced by the prediction of a model deployed in the real world. Ensuring rapid convergence to a stable solution where the data distribution remains the same after the model deployment is crucial, especially in evolving environments. This paper extends the Repeated Risk Minimization (RRM) framework by utilizing historical datasets from previous retraining snapshots, yielding a class of algorithms that we call Affine Risk Minimizers and enabling convergence to a performatively stable point for a broader class of problems. We introduce a new upper bound for methods that use only the final iteration of the dataset and prove for the first time the tightness of both this new bound and the previous existing bounds within the same regime. We also prove that utilizing historical datasets can surpass the lower bound for last iterate RRM, and empirically observe faster convergence to the stable point on various performative prediction benchmarks. We offer at the same time the first lower bound analysis for RRM within the class of Affine Risk Minimizers, quantifying the potential improvements in convergence speed that could be achieved with other variants in our framework.

2024-12-04

ArXiv (preprint)

arxiv.org

Tight Lower Bounds and Improved Convergence in Performative Prediction

Performative prediction is a framework accounting for the shift in the data distribution induced by the prediction of a model deployed in the real world. Ensuring rapid convergence to a stable solution where the data distribution remains the same after the model deployment is crucial, especially in evolving environments. This paper extends the Repeated Risk Minimization (RRM) framework by utilizing historical datasets from previous retraining snapshots, yielding a class of algorithms that we call Affine Risk Minimizers and enabling convergence to a performatively stable point for a broader class of problems. We introduce a new upper bound for methods that use only the final iteration of the dataset and prove for the first time the tightness of both this new bound and the previous existing bounds within the same regime. We also prove that utilizing historical datasets can surpass the lower bound for last iterate RRM, and empirically observe faster convergence to the stable point on various performative prediction benchmarks. We offer at the same time the first lower bound analysis for RRM within the class of Affine Risk Minimizers, quantifying the potential improvements in convergence speed that could be achieved with other variants in our framework.

2024-10-10

NeurIPS.cc/2024/Workshop/OPT (published)

doi.org

openreview.net

Performance Control in Early Exiting to Deploy Large Models at the Same Cost of Smaller Ones

Mehrnaz Mofakhami

Reza Bayat

Ioannis Mitliagkas

Joao Monteiro

Valentina Zantedeschi

2024-06-20

ICML.cc/2024/Workshop/ES-FoMo-II (poster)

openreview.net

Performative Prediction with Neural Networks

Mehrnaz Mofakhami

Ioannis Mitliagkas

Gauthier Gidel

Performative prediction is a framework for learning models that influence the data they intend to predict. We focus on finding classifiers that are performatively stable, i.e. optimal for the data distribution they induce. Standard convergence results for finding a performatively stable classifier with the method of repeated risk minimization assume that the data distribution is Lipschitz continuous with respect to the model's parameters. Under this assumption, the loss must be strongly convex and smooth in these parameters; otherwise, the method will diverge for some problems. In this work, we instead assume that the data distribution is Lipschitz continuous with respect to the model's predictions, a more natural assumption for performative systems. As a result, we are able to significantly relax the assumptions on the loss function. In particular, we do not need to assume convexity with respect to the model's parameters. As an illustration, we introduce a resampling procedure that models realistic distribution shifts and show that it satisfies our assumptions. We support our theory by showing that one can learn performatively stable classifiers with neural networks making predictions about real data that shift according to our proposed procedure.

2023-01-01

AISTATS (published)

doi.org

openreview.net
