
Linh Le

Independent visiting researcher - University of Technology Sydney
Research Topics
AI Safety

Publications

Latent Personality Alignment: Improving Harmlessness Without Mentioning Harms
David Williams-King
Aton Kamanda
Adam Oberman
Current adversarial robustness methods for large language models require extensive datasets of harmful prompts (thousands to hundreds of thousands of examples), yet remain vulnerable to novel attack vectors and distributional shifts. We propose Latent Personality Alignment (LPA), a sample-efficient defense that achieves robustness by training models on abstract personality traits rather than specific harmful behaviors. Using fewer than 100 trait statements and latent adversarial training, LPA achieves attack success rates comparable to methods trained on 150k+ examples, while maintaining superior utility. Critically, LPA generalizes better to unseen attack distributions, reducing misclassification rates by 2.6x compared to baselines across six harm benchmarks, without ever seeing harmful examples during training. Our results demonstrate that personality-based alignment offers a principled approach to building robust defenses with minimal cost.
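
The abstract's core recipe is a small set of trait statements plus latent adversarial training. As a rough illustration only (not the paper's code), the sketch below perturbs the activations of one transformer layer to maximize next-token loss on a trait statement, then fine-tunes the model to express the trait under that perturbation. The model name ("gpt2"), layer index, step sizes, and the three example trait statements are placeholder assumptions.

# Minimal illustrative sketch of latent adversarial training on trait statements.
# NOT the authors' implementation: model name, layer index, step sizes, and the
# tiny trait list are placeholder assumptions for illustration only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"      # placeholder; LPA targets larger instruction-tuned models
LATENT_LAYER = 6         # hidden layer whose activations get perturbed (assumed)
INNER_STEPS, EPS, LR = 4, 0.1, 1e-5

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
optimizer = torch.optim.AdamW(model.parameters(), lr=LR)

# A few abstract trait statements stand in for the harmful-prompt datasets
# that conventional safety fine-tuning relies on.
trait_statements = [
    "I am honest and avoid deceiving the people I talk to.",
    "I decline to help with actions that would hurt others.",
    "I stay calm and helpful even when a request is hostile.",
]

def lm_loss(input_ids, delta=None):
    """Next-token loss; optionally add an adversarial perturbation to one layer."""
    handle = None
    if delta is not None:
        layer = model.transformer.h[LATENT_LAYER]
        def add_delta(module, inputs, output):
            return (output[0] + delta,) + output[1:]   # perturb residual stream
        handle = layer.register_forward_hook(add_delta)
    loss = model(input_ids, labels=input_ids).loss
    if handle is not None:
        handle.remove()
    return loss

for statement in trait_statements:
    ids = tokenizer(statement, return_tensors="pt").input_ids
    seq_len, hidden_size = ids.shape[1], model.config.hidden_size

    # Inner loop: find a latent perturbation that most degrades the trait loss.
    delta = torch.zeros(1, seq_len, hidden_size, requires_grad=True)
    for _ in range(INNER_STEPS):
        grad, = torch.autograd.grad(lm_loss(ids, delta), delta)
        with torch.no_grad():
            delta += EPS * grad.sign()                 # gradient ascent on the loss
            delta.clamp_(-EPS * INNER_STEPS, EPS * INNER_STEPS)

    # Outer step: fine-tune the model to express the trait despite the perturbation.
    optimizer.zero_grad()
    lm_loss(ids, delta.detach()).backward()
    optimizer.step()

In practice the inner maximization and the choice of perturbed layer would follow the paper's setup; the point of the sketch is only the shape of the two-level objective, with no harmful examples anywhere in the training data.
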
Can Safety Fine-Tuning Be More Principled? Lessons Learned from Cybersecurity
David Williams-King
Adam Oberman
As LLMs develop increasingly advanced capabilities, there is a growing need to minimize the harm that certain model outputs could cause to society; hence, most LLMs have safety guardrails added, for example via fine-tuning. In this paper, we argue that current safety fine-tuning closely resembles the traditional cat-and-mouse game (or arms race) between attackers and defenders in cybersecurity. Model jailbreaks and attacks are patched with band-aid fixes that target the specific attack mechanism, but many similar attack vectors may remain. When defenders do not proactively develop principled mechanisms, it becomes easy for attackers to sidestep any new defenses. We show how current defenses are insufficient to prevent new adversarial jailbreak attacks, reward hacking, and loss-of-control problems. To learn from past mistakes in cybersecurity, we draw analogies with historical examples and distill lessons that can be applied to LLM safety. These arguments support the need for new, more principled approaches to designing safe models that are architected for security from the beginning. We describe several such approaches from the AI literature.