Mohamed Amine Merzouk

Postdoctoral Fellow - McGill

Principal Supervisor

Adam M. Oberman

Research Topics

Federated Learning

Reinforcement Learning

Deep Learning

Cybersecurity

Anomaly Detection

Large Language Models (LLMs)

Trustworthy AI

Diffusion Models

Adversarial Robustness

AI Safety

Website

Google Scholar

GitHub

Publications

Latent Personality Alignment: Improving Harmlessness Without Mentioning Harms

Linh Le

David Williams-King

Mohamed Amine Merzouk

Aton Kamanda

Adam Oberman

Current adversarial robustness methods for large language models require extensive datasets of harmful prompts (thousands to hundreds of thousands of examples), yet remain vulnerable to novel attack vectors and distributional shifts. We propose Latent Personality Alignment (LPA), a sample-efficient defense that achieves robustness by training models on abstract personality traits rather than specific harmful behaviors. Using fewer than 100 trait statements and latent adversarial training, LPA achieves comparable attack success rates to methods trained on 150k+ examples, while maintaining superior utility. Critically, LPA generalizes better to unseen attack distributions, reducing misclassification rates by 2.6x compared to baseline across six harm benchmarks, without ever seeing harmful examples during training. Our results demonstrate that personality-based alignment offers a principled approach to building robust defenses with minimal cost.

2026-03-01

Trustworthy AI @ International Conference on Learning Representations (published)

openreview.net

Efficiency vs. Alignment: Investigating Safety and Fairness Risks in Parameter-Efficient Fine-Tuning of LLMs

Mina Taraghi

Yann Batiste Pequignot

Amin Nikanjam

Mohamed Amine Merzouk

Foutse Khomh

Organizations are increasingly adopting and adapting Large Language Models (LLMs) hosted on public repositories such as HuggingFace. Although these adaptations often improve performance on specialized downstream tasks, recent evidence indicates that they can also degrade a model's safety or fairness. Since different fine-tuning techniques may exert distinct effects on these critical dimensions, this study undertakes a systematic assessment of their trade-offs. Four widely used Parameter-Efficient Fine-Tuning methods, LoRA, IA3, Prompt-Tuning, and P-Tuning, are applied to four instruction-tuned model families (Meta-Llama-3-8B, Qwen2.5-7B, Mistral-7B, and Gemma-7B). In total, 235 fine-tuned variants are evaluated across eleven safety hazard categories and nine demographic fairness dimensions. The results show that adapter-based approaches (LoRA, IA3) tend to improve safety scores and are the least disruptive to fairness, retaining higher accuracy and lower bias scores. In contrast, prompt-based methods (Prompt-Tuning and P-Tuning) generally reduce safety and cause larger fairness regressions, with decreased accuracy and increased bias. Alignment shifts are strongly moderated by base model type: LLaMA remains stable, Qwen records modest gains, Gemma experiences the steepest safety decline, and Mistral, which is released without an internal moderation layer, displays the greatest variance. Improvements in safety do not necessarily translate into improvements in fairness, and no single configuration optimizes all fairness metrics simultaneously, indicating an inherent trade-off between these objectives. These findings suggest a practical guideline for safety-critical deployments: begin with a well-aligned base model, favour adapter-based PEFT, and conduct category-specific audits of both safety and fairness.

2025-10-31

arXiv (preprint)

doi.org

arxiv.org

Diffusion-Based Adversarial Purification for Intrusion Detection

Mohamed Amine Merzouk

Erwan Beurier

Reda Yaich

N. Cuppens-Boulahia

Frédéric Cuppens

Foutse Khomh

2024-12-31