Abstract—Explanation methods analyze the features in backdoored
input data that contribute to model misclassification. However, current path-based attribution techniques struggle to detect
backdoor patterns in adversarial settings. They fail to capture
the hidden associations between backdoor features and other input
features that lead to misclassification. In addition, they suffer
from irrelevant feature attribution, imprecise feature interactions,
baseline dependence, and vulnerability to the "saturation effect".
To address these limitations, we propose Xplain. Our
method aims to uncover hidden backdoor trigger patterns and
the subtle relationships between backdoor features and other
input features, which are the main causes of model misclassification. Our algorithm improves existing path techniques by
integrating an additional baseline into the Integrated Gradients (IG) formulation. This ensures that features selected in
the baseline persist along the integration path, guaranteeing
baseline independence (sketched below). Additionally, we inject quantitative noise into the samples interpolated along the integration path,
which reduces feature dependency and captures non-linear
interactions. This approach effectively identifies the relevant
features that significantly influence model predictions.
Furthermore, Xplain introduces a sensitivity analysis to enhance the resilience of AI systems against backdoor attacks. This
uncovers clear connections between the backdoor trigger and other
input features, thus shedding light on the relevant interactions. We thoroughly evaluate the effectiveness of Xplain on the
ImageNet dataset and on the multimodal Visual Question
Answering (VQA) dataset, showing its superiority over current path
methods such as Integrated Gradients (IG), Left-IG, Guided IG,
and Adversarial Gradient Integration (AGI).
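For reference, the standard Integrated Gradients attribution of feature $i$ along the straight-line path from a baseline $x'$ to the input $x$ is

$$\mathrm{IG}_i(x) = (x_i - x'_i) \int_0^1 \frac{\partial F\big(x' + \alpha\,(x - x')\big)}{\partial x_i}\, d\alpha .$$

The following is a minimal, purely illustrative sketch of how an additional baseline $\tilde{x}$ and interpolation noise $\eta(\alpha)$ could enter this integral; the symbols $\tilde{x}$, $\eta(\alpha)$, $\sigma$, and this particular combination are assumptions for illustration, not the exact Xplain formulation:

$$\widehat{\mathrm{Attr}}_i(x) \approx (x_i - x'_i) \int_0^1 \frac{\partial F\big(x' + \alpha\,(x - x') + (1 - \alpha)\,(\tilde{x} - x') + \eta(\alpha)\big)}{\partial x_i}\, d\alpha , \qquad \eta(\alpha) \sim \mathcal{N}(0, \sigma^2 I),$$

so that features of the additional baseline $\tilde{x}$ are present at the start of the path ($\alpha = 0$) and the noisy interpolants perturb each integration step.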