Nazanin Mohammadi Sepahvand

Detoxifying LLMs via Representation Erasure-Based Preference Optimization

Eleni Triantafillou

Daniel M. Roy

Large language models (LLMs) trained on web-scale data can produce toxic outputs, raising concerns for safe deployment. Prior defenses, based on applications of DPO, NPO, and similar algorithms, reduce the likelihood of harmful continuations, but not robustly so: they are vulnerable to adversarial prompting and easily undone by fine-tuning-based relearning attacks. Indeed, research has shown that these edits to the model are superficial: linear probing reveals that harmful "directions" remain present in representations. To address this, we propose Representation Erasure-based Preference Optimization (REPO), which reformulates detoxification as a token-level preference problem. Using a novel objective with preference data, we force the representations of toxic continuations to converge toward their benign counterparts. Our mechanistic analysis reveals that this granular approach is critical: unlike baselines, REPO induces deep, localized edits to toxicity-encoding neurons while preserving general model utility. Exhaustive evaluations show that REPO achieves state-of-the-art robustness. It stops sophisticated threats (including relearning attacks and enhanced GCG jailbreaks) where existing representation- and output-based methods fail.

2026-02-23

arXiv (preprint)

doi.org

arxiv.org

Leveraging Per-Instance Privacy for Machine Unlearning

Nazanin Mohammadi Sepahvand

Anvith Thudi

Berivan Isik

Ashmita Bhattacharyya

Nicolas Papernot

Eleni Triantafillou

Daniel M. Roy

Gintare Karolina Dziugaite

2025-07-14

International Conference on Machine Learning (poster)

doi.org

proceedings.mlr.press

Selective Unlearning via Representation Erasure Using Domain Adversarial Training

Nazanin Mohammadi Sepahvand

Eleni Triantafillou

Hugo Larochelle

Doina Precup

James J. Clark

Daniel M. Roy

Gintare Karolina Dziugaite

2025-01-21

International Conference on Learning Representations (ICLR) 2025 (poster)

openreview.net

Data Selection for Transfer Unlearning

Nazanin Mohammadi Sepahvand

Vincent Dumoulin

Eleni Triantafillou

Gintare Karolina Dziugaite

2024-05-15

arXiv (preprint)

doi.org

arxiv.org

HAD-Net: A Hierarchical Adversarial Knowledge Distillation Network for Improved Enhanced Tumour Segmentation Without Post-Contrast Images

Saverio Vadacchino

Raghav Mehta

Nazanin Mohammadi Sepahvand

Brennan Nichyporuk

James J. Clark

Tal Arbel

Segmentation of enhancing tumours or lesions from MRI is important for detecting new disease activity in many clinical contexts. However, accurate segmentation requires the inclusion of medical images (e.g., T1 post-contrast MRI) acquired after injecting patients with a contrast agent (e.g., Gadolinium), a process no longer thought to be safe. Although a number of modality-agnostic segmentation networks have been developed over the past few years, they have been met with limited success in the context of enhancing pathology segmentation. In this work, we present HAD-Net, a novel offline adversarial knowledge distillation (KD) technique. In HAD-Net, a pre-trained teacher segmentation network with access to all MRI sequences uses hierarchical adversarial training to teach a student network. The goal is for the student to better overcome the large domain shift presented when crucial images are absent during inference. In particular, we apply HAD-Net to the challenging task of enhancing tumour segmentation when post-contrast imaging is not available. The proposed network is trained and tested on the BraTS 2019 brain tumour segmentation challenge dataset. There, it improves Dice scores for enhancing tumour (ET) by 16% to 26% over (a) recent modality-agnostic segmentation methods (U-HeMIS, U-HVED), (b) KD-Net adapted to this problem, (c) the pre-trained student network and (d) a non-hierarchical version of the network (AD-Net). The network also shows improvements in tumour core (TC) Dice scores. Finally, the network outperforms both the baseline student network and AD-Net in terms of uncertainty quantification for enhancing tumour segmentation, based on the BraTS 2019 uncertainty challenge metrics. Our code is publicly available at: https://github.com/SaverioVad/HAD_Net
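The Dice scores quoted above measure voxel overlap between a predicted segmentation mask and the ground truth. The sketch below implements the standard Dice coefficient, 2|P ∩ T| / (|P| + |T|). It assumes binary masks flattened to 0/1 sequences. It is not taken from the HAD-Net repository, the function name `dice_score` is hypothetical, and treating two empty masks as perfect agreement is an assumed convention.

```python
from typing import Sequence


def dice_score(pred: Sequence[int], target: Sequence[int]) -> float:
    """Standard Dice coefficient 2|P & T| / (|P| + |T|) for flattened binary masks.

    Hypothetical helper for illustration; not the evaluation code used in the paper.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must contain the same number of voxels")
    # Voxels labelled foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Total foreground voxels across both masks.
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Assumed convention: two empty masks count as perfect agreement.
    if total == 0:
        return 1.0
    return 2.0 * intersection / total


# Toy 4-voxel example: one overlapping voxel, two foreground voxels per mask.
print(dice_score([1, 1, 0, 0], [1, 0, 1, 0]))  # 0.5
```

Because Dice ignores true-negative background voxels, it stays informative for small structures such as enhancing tumour. In that setting, plain voxel accuracy would be dominated by background.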

2021-08-24

Proceedings of the Fourth Conference on Medical Imaging with Deep Learning (published)

doi.org

proceedings.mlr.press

CNN Detection of New and Enlarging Multiple Sclerosis Lesions from Longitudinal MRI Using Subtraction Images

Nazanin Mohammadi Sepahvand

Douglas Arnold

Tal Arbel

Accurate detection and segmentation of new lesional activity in longitudinal Magnetic Resonance Images (MRIs) of patients with Multiple Sclerosis (MS) is important for monitoring disease activity, as well as for assessing treatment effects. In this work, we present the first deep learning framework to automatically detect and segment new and enlarging (NE) T2w lesions from longitudinal brain MRIs acquired from relapsing-remitting MS (RRMS) patients. The proposed framework is an adapted 3D U-Net [1] that takes as inputs the reference multi-modal MRI and T2-weighted lesion maps. It also includes an attention mechanism based on the subtraction MRI (between the two timepoints). This mechanism helps the network learn to differentiate between real anatomical change and artifactual change, while constraining the search space for small lesions. Experiments were run on a large, proprietary, multi-center, multi-modal clinical trial dataset of 1677 multi-modal scans. They show that the network achieves high overall detection accuracy (detection AUC = .95), particularly for small lesions. It outperforms (1) a U-Net without an attention mechanism (detection AUC = .93), (2) a framework based on subtracting independent T2-weighted segmentations (detection AUC = .57), and (3) DeepMedic [2] (detection AUC = .84). In addition, the method was able to accurately classify patients as active/inactive (sensitivity of .69 and specificity of .97).
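The patient-level result above is reported as sensitivity and specificity:

* Sensitivity is the fraction of truly active patients that are classified as active.
* Specificity is the fraction of truly inactive patients that are classified as inactive.

The sketch below shows that arithmetic. It assumes one boolean active/inactive label per patient. The function name is hypothetical, and the toy labels are invented for illustration, not data from the study.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    predicted: Sequence[bool], actual: Sequence[bool]
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for binary active/inactive labels.

    Hypothetical helper: True means "active", False means "inactive".
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)          # active, flagged active
    fn = sum(1 for p, a in pairs if not p and a)      # active, missed
    tn = sum(1 for p, a in pairs if not p and not a)  # inactive, flagged inactive
    fp = sum(1 for p, a in pairs if p and not a)      # inactive, falsely flagged
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one active and one inactive patient")
    return tp / (tp + fn), tn / (tn + fp)


# Invented toy labels for five patients (not study data).
sens, spec = sensitivity_specificity(
    predicted=[True, False, True, False, False],
    actual=[True, True, False, False, False],
)
print(sens, round(spec, 3))  # 0.5 0.667
```

A high specificity with a lower sensitivity, as in the reported .97 / .69, means the classifier rarely flags inactive patients but misses some active ones.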

2020-04-02

2020 IEEE 17th International Symposium on Biomedical Imaging (ISBI) (published)

doi.org

Improving Pathological Structure Segmentation via Transfer Learning Across Diseases

Barleen Kaur

Paul Lemaitre

Raghav Mehta

Nazanin Mohammadi Sepahvand

Doina Precup

Douglas Arnold

Tal Arbel

2019-10-12