Sabyasachi Sahoo

Hidden-State Similarity Predicts Re-Elicitation After Inoculation Prompting

Fine-tuning on narrow harmful tasks can cause emergent misalignment, where models generalize harmful behavior beyond the training distribution. Inoculation prompting can reduce this effect by explicitly eliciting the undesired behavior during training, but recent work shows that the behavior can reappear when evaluation prompts contain cues from the training context. We study what makes such prompts effective triggers. We find that textual similarity to the inoculation prompt is an incomplete predictor: prompts are more likely to re-elicit suppressed behavior when they induce activation states similar to those produced by the inoculation context. These findings advance our understanding of how inoculation prompting modulates conditional misalignment, and suggest that activation-space analysis can help identify when suppressed behaviors remain accessible under eval-time prompts.

2026-06-10

ICML.cc/2026/Workshop/Mech_Interp (poster)

openreview.net

When Does Interleaving Prevent Emergent Misalignment?

Chen Sun

Large language models finetuned on narrow harmful tasks are prone to emergent misalignment (EM), where harmful behavior generalizes beyond the training distribution. Interleaving benign data during finetuning has been proposed as a mitigation, but recent work disagrees on whether it prevents EM. In this paper, we investigate this disagreement on Qwen-2.5 7B and 32B, and find that no single property of the interleaved data, taken in isolation, accounts for the gap. Instead, much of it traces to the evaluation itself, as the standard EM benchmark is sensitive to the length of the prompts it uses, and lengthening the evaluation prompts substantially shifts measured misalignment across model sizes. We then identify a region in the model's activations that predicts whether a given interleaved set will prevent EM, and show that reformulating benign data to fall within it substantially reduces EM on both 7B and 32B. This suggests that the standard EM benchmark, which relies on short prompts, may misrepresent the effectiveness of proposed mitigations.

2026-06-10

ICML.cc/2026/Workshop/Mech_Interp (poster)

openreview.net

Reliability-Gated Source Anchoring for Continual Test-Time Adaptation

Vikash Singh

Debargha Ganguly

Weicong Chen

Sabyasachi Sahoo

Sreehari Sankar

Biyao Zhang

Mohsen Hariri

Shouren Wang

Osama Zafar

Christian Gagné

Vipin Chaudhary

Continual test-time adaptation (CTTA) updates a pretrained model online on an unlabeled, non-stationary stream while anchoring it to a frozen source checkpoint. This anchor is useful only when the source remains reliable. On CCC-Hard, however, a ResNet-50 source falls to approximately

2026-05-12

arXiv (preprint)

doi.org

arxiv.org

Robust Fine-Tuning from Non-Robust Pretrained Models: Mitigating Suboptimal Transfer With Epsilon-Scheduling

Ola Ahmad

Frédéric Precioso

Fine-tuning pretrained models is a standard and effective workflow in modern machine learning. However, robust fine-tuning (RFT), which aims to simultaneously achieve adaptation to a downstream task and robustness to adversarial examples, remains challenging. Despite the abundance of non-robust pretrained models in open-source repositories, their potential for RFT is less understood. We address this knowledge gap by systematically examining RFT from such non-robust models. Our experiments reveal that fine-tuning non-robust models with a robust objective, even under small perturbations, can lead to poor performance, a phenomenon that we dub _suboptimal transfer_. In challenging scenarios (e.g., difficult tasks, high perturbation), the resulting performance can be so low that it may be considered a transfer failure. We find that fine-tuning using a robust objective impedes task adaptation at the beginning of training and eventually prevents optimal transfer. To address this, we propose a novel heuristic, _Epsilon-Scheduling_, a schedule over perturbation strength used during training that promotes optimal transfer. Additionally, we introduce _expected robustness_, a metric that captures performance across a range of perturbations, providing a more comprehensive evaluation of the accuracy-robustness trade-off of diverse models at test time. Extensive experiments on a wide range of configurations (six pretrained models and five datasets) show that _Epsilon-Scheduling_ successfully prevents _suboptimal transfer_ and consistently improves expected robustness.

2025-12-31

International Conference on Learning Representations (Accept (Poster))

doi.org

openreview.net

Robust Fine-Tuning from Non-Robust Pretrained Models: Mitigating Suboptimal Transfer With Epsilon-Scheduling

Jonas Ngnawe

Maxime Heuillet

Sabyasachi Sahoo

Yann Batiste Pequignot

Frédéric Precioso

Christian Gagné

Fine-tuning pretrained models is the standard approach in current machine learning practice, but simultaneously achieving robustness to adversarial examples remains a challenge. Despite the abundance of non-robust pretrained models in open-source repositories, their use for Robust Fine-Tuning (RFT) remains understudied. This work aims to bridge this knowledge gap by systematically examining RFT from such models. Our experiments reveal that fine-tuning non-robust models with a robust objective, even under small perturbations, can lead to poor performance, a phenomenon that we dub _suboptimal transfer_. In fact, we find that fine-tuning using a robust objective impedes task alignment at the beginning of training and eventually prevents optimal transfer. To promote optimal transfer, we propose _Epsilon-Scheduling_, a simple heuristic schedule over perturbation strength. Additionally, we introduce _expected robustness_, a metric that measures performance across a range of perturbations. Experiments on six pretrained models and five datasets show that _Epsilon-Scheduling_ prevents _suboptimal transfer_ and consistently improves the expected robustness.

2025-09-28

NeurIPS.cc/2025/Workshop/Reliable_ML (published)

openreview.net

A Layer Selection Approach to Test Time Adaptation

Sabyasachi Sahoo

Mostafa Elaraby

Jonas Ngnawe

Yann Batiste Pequignot

Frédéric Precioso

Christian Gagné

Test Time Adaptation (TTA) addresses the problem of distribution shift by adapting a pretrained model to a new domain during inference. When faced with challenging shifts, most methods collapse and perform worse than the original pretrained model. In this paper, we find that not all layers are equally receptive to the adaptation, and the layers with the most misaligned gradients often cause performance degradation. To address this, we propose GALA, a novel layer selection criterion to identify the most beneficial updates to perform during test time adaptation. This criterion can also filter out unreliable samples with noisy gradients. Its simplicity allows seamless integration with existing TTA loss functions, thereby preventing degradation and focusing adaptation on the most trainable layers. This approach also helps to regularize adaptation to preserve the pretrained features, which are crucial for handling unseen domains. Through extensive experiments, we demonstrate that the proposed layer selection framework improves the performance of existing TTA approaches across multiple datasets, domain shifts, model architectures, and TTA losses.

2025-04-10

Proceedings of the AAAI Conference on Artificial Intelligence (published)

doi.org

openreview.net

Robust Fine-Tuning from Non-Robust Pretrained Models: Mitigating Suboptimal Transfer With Adversarial Scheduling

Ola Ahmad

Frédéric Precioso

Fine-tuning pretrained models is a standard and effective workflow in modern machine learning. However, robust fine-tuning (RFT), which aims to simultaneously achieve adaptation to a downstream task and robustness to adversarial examples, remains challenging. Despite the abundance of non-robust pretrained models in open-source repositories, their potential for RFT is less understood. We address this knowledge gap by systematically examining RFT from such non-robust models. Our experiments reveal that fine-tuning non-robust models with a robust objective, even under small perturbations, can lead to poor performance, a phenomenon that we dub _suboptimal transfer_. In challenging scenarios (e.g., difficult tasks, high perturbation), the resulting performance can be so low that it may be considered a transfer failure. We find that fine-tuning using a robust objective impedes task adaptation at the beginning of training and eventually prevents optimal transfer. To address this, we propose a novel heuristic, _Epsilon-Scheduling_, a schedule over perturbation strength used during training that promotes optimal transfer. Additionally, we introduce _expected robustness_, a metric that captures performance across a range of perturbations, providing a more comprehensive evaluation of the accuracy-robustness trade-off for diverse models at test time. Extensive experiments on a wide range of configurations (six pretrained models and five datasets) show that _Epsilon-Scheduling_ successfully prevents _suboptimal transfer_ and consistently improves expected robustness.

2024-12-31

arXiv (preprint)

doi.org

arxiv.org

Detecting Brittle Decisions for Free: Leveraging Margin Consistency in Deep Robust Classifiers

Frédéric Precioso

Despite extensive research on adversarial training strategies to improve robustness, the decisions of even the most robust deep learning models can still be quite sensitive to imperceptible perturbations, creating serious risks when deploying them for high-stakes real-world applications. While detecting such cases may be critical, evaluating a model's vulnerability at a per-instance level using adversarial attacks is computationally too intensive and unsuitable for real-time deployment scenarios. The input space margin is the exact score for detecting non-robust samples, but it is intractable for deep neural networks. This paper introduces the concept of margin consistency -- a property that links the input space margins and the logit margins in robust models -- for efficient detection of vulnerable samples. First, we establish that margin consistency is a necessary and sufficient condition to use a model's logit margin as a score for identifying non-robust samples. Next, through comprehensive empirical analysis of various robustly trained models on the CIFAR10 and CIFAR100 datasets, we show that they exhibit high margin consistency, with a strong correlation between their input space margins and the logit margins. Then, we show that we can effectively and confidently use the logit margin to detect brittle decisions with such models. Finally, we address cases where the model is not sufficiently margin-consistent by learning a pseudo-margin from the feature representation. Our findings highlight the potential of leveraging deep representations to assess adversarial vulnerability in deployment scenarios efficiently.

2024-09-24

NeurIPS.cc/2024/Conference (poster)

doi.org

openreview.net

Hessian Aware Low-Rank Perturbation for Order-Robust Continual Learning

Jiaqi Li

Rui Wang

Yuanhao Lai

Changjian Shui

Sabyasachi Sahoo

Charles X. Ling

Shichun Yang

Boyu Wang

Christian Gagné

Fan Zhou

Continual learning aims to learn a series of tasks sequentially without forgetting the knowledge acquired from the previous ones. In this work, we propose the Hessian Aware Low-Rank Perturbation algorithm for continual learning. By modeling the parameter transitions along the sequential tasks with the weight matrix transformation, we propose to apply the low-rank approximation on the task-adaptive parameters in each layer of the neural networks. Specifically, we theoretically demonstrate the quantitative relationship between the Hessian and the proposed low-rank approximation. The approximation ranks are then globally determined according to the marginal increment of the empirical loss estimated by the layer-specific gradient and low-rank approximation error. Furthermore, we control the model capacity by pruning less important parameters to diminish the parameter growth. We conduct extensive experiments on various benchmarks, including a dataset with large-scale tasks, and compare our method against some recent state-of-the-art methods to demonstrate the effectiveness and scalability of our proposed method. Empirical results show that our method performs better on different benchmarks, especially in achieving task order robustness and handling the forgetting issue. The source code is at https://github.com/lijiaqi/HALRP.

2023-11-25

arXiv (preprint)

doi.org

arxiv.org

GROOD: Gradient-Aware Out-of-Distribution Detection

Mostafa Elaraby

Sabyasachi Sahoo

Yann Batiste Pequignot

Paul Novello

Liam Paull

2022-12-31

arXiv.org (preprint)