Portrait of Mohamed Amine Merzouk

Mohamed Amine Merzouk

Postdoctorate - McGill University
Supervisor
Research Topics
Adversarial Robustness
AI Safety
Anomaly Detection
Cybersecurity
Deep Learning
Diffusion Models
Federated Learning
Large Language Models (LLM)
Reinforcement Learning
Trustworthy AI

Publications

Latent Personality Alignment: Improving Harmlessness Without Mentioning Harms
David Williams-King
Aton Kamanda
Adam Oberman
Current adversarial robustness methods for large language models require extensive datasets of harmful prompts (thousands to hundreds of tho… (see more)usands of examples), yet remain vulnerable to novel attack vectors and distributional shifts. We propose Latent Personality Alignment (LPA), a sample-efficient defense that achieves robustness by training models on abstract personality traits rather than specific harmful behaviors. Using fewer than 100 trait statements and latent adversarial training, LPA achieves comparable attack success rates to methods trained on 150k+ examples, while maintaining superior utility. Critically, LPA generalizes better to unseen attack distributions, reducing misclassification rates by 2.6x compared to baseline across six harm benchmarks—without ever seeing harmful examples during training. Our results demonstrate that personality-based alignment offers a principled approach to building robust defenses with minimal cost.
Efficiency vs. Alignment: Investigating Safety and Fairness Risks in Parameter-Efficient Fine-Tuning of LLMs
Mina Taraghi
Yann Batiste Pequignot
Amin Nikanjam
Organizations are increasingly adopting and adapting Large Language Models (LLMs) hosted on public repositories such as HuggingFace. Althoug… (see more)h these adaptations often improve performance on specialized downstream tasks, recent evidence indicates that they can also degrade a model's safety or fairness. Since different fine-tuning techniques may exert distinct effects on these critical dimensions, this study undertakes a systematic assessment of their trade-offs. Four widely used Parameter-Efficient Fine-Tuning methods, LoRA, IA3, Prompt-Tuning, and P-Tuning, are applied to four instruction-tuned model families (Meta-Llama-3-8B, Qwen2.5-7B, Mistral-7B, and Gemma-7B). In total, 235 fine-tuned variants are evaluated across eleven safety hazard categories and nine demographic fairness dimensions. The results show that adapter-based approaches (LoRA, IA3) tend to improve safety scores and are the least disruptive to fairness, retaining higher accuracy and lower bias scores. In contrast, prompt-based methods (Prompt-Tuning and P-Tuning) generally reduce safety and cause larger fairness regressions, with decreased accuracy and increased bias. Alignment shifts are strongly moderated by base model type: LLaMA remains stable, Qwen records modest gains, Gemma experiences the steepest safety decline, and Mistral, which is released without an internal moderation layer, displays the greatest variance. Improvements in safety do not necessarily translate into improvements in fairness, and no single configuration optimizes all fairness metrics simultaneously, indicating an inherent trade-off between these objectives. These findings suggest a practical guideline for safety-critical deployments: begin with a well-aligned base model, favour adapter-based PEFT, and conduct category-specific audits of both safety and fairness.
Diffusion-Based Adversarial Purification for Intrusion Detection
Erwan Beurier
Reda Yaich
N. Cuppens-Boulahia
Frédéric Cuppens