Linh Le

Independent Visiting Researcher - University of Technology Sydney

Principal Supervisor

Adam M. Oberman

Research Topics

AI Safety

Google Scholar

GitHub

Publications

Latent Personality Alignment: Improving Harmlessness Without Mentioning Harms

Linh Le

David Williams-King

Mohamed Amine Merzouk

Aton Kamanda

Adam Oberman

Current adversarial robustness methods for large language models require extensive datasets of harmful prompts (thousands to hundreds of thousands of examples), yet remain vulnerable to novel attack vectors and distributional shifts. We propose Latent Personality Alignment (LPA), a sample-efficient defense that achieves robustness by training models on abstract personality traits rather than specific harmful behaviors. Using fewer than 100 trait statements and latent adversarial training, LPA achieves comparable attack success rates to methods trained on 150k+ examples, while maintaining superior utility. Critically, LPA generalizes better to unseen attack distributions, reducing misclassification rates by 2.6x compared to baseline across six harm benchmarks, without ever seeing harmful examples during training. Our results demonstrate that personality-based alignment offers a principled approach to building robust defenses with minimal cost.

2026-03-01

Trustworthy AI @ International Conference on Learning Representations (published)

openreview.net

Can Safety Fine-Tuning Be More Principled? Lessons Learned from Cybersecurity

David Williams-King

Linh Le

Adam Oberman

Yoshua Bengio

As LLMs develop increasingly advanced capabilities, there is an increased need to minimize the harm that could be caused to society by certain model outputs; hence, most LLMs have safety guardrails added, for example via fine-tuning. In this paper, we argue the position that current safety fine-tuning is very similar to a traditional cat-and-mouse game (or arms race) between attackers and defenders in cybersecurity. Model jailbreaks and attacks are patched with bandaids to target the specific attack mechanism, but many similar attack vectors might remain. When defenders are not proactively coming up with principled mechanisms, it becomes very easy for attackers to sidestep any new defenses. We show how current defenses are insufficient to prevent new adversarial jailbreak attacks, reward hacking, and loss of control problems. In order to learn from past mistakes in cybersecurity, we draw analogies with historical examples and develop lessons learned that can be applied to LLM safety. These arguments support the need for new and more principled approaches to designing safe models, which are architected for security from the beginning. We describe several such approaches from the AI literature.

2024-10-11

NeurIPS.cc/2024/Workshop/SafeGenAi (poster)

doi.org

openreview.net
