Detecting adversarial examples is an important defense against adversarial attacks. Existing supervised learning detectors perform well on known attacks but deteriorate when detecting unseen ones. To mitigate this sensitivity to the training instances, we propose a detector that identifies unseen attacks based on the output inconsistency between the protected model and a specially designed dual model. A test image that receives different predicted labels from the protected model and the dual model is taken as adversarial. To detect highly transferable adversarial examples and to defend against adaptive ensemble attacks on the proposed detector, the dual model is trained with orthogonal knowledge distillation. The distillation suppresses transferability between the protected and dual models, forcing them to output different labels for strong adversarial examples. Experimental results on CIFAR-10 and ImageNet show that our method detects various adversarial examples effectively. Compared with state-of-the-art methods, it achieves at least 6.2% higher average detection accuracy in the cross-attack test. Our method is also robust to popular transferability-enhancing methods, with a minor accuracy decrease of at most 4% in the robustness test.
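To make the detection rule concrete, the following is a minimal sketch of the disagreement test described above, assuming two PyTorch classifiers (`protected_model` and `dual_model`) that output class logits; the function name and interface are hypothetical and do not reflect the paper's actual implementation.

```python
import torch


def detect_adversarial(x: torch.Tensor,
                       protected_model: torch.nn.Module,
                       dual_model: torch.nn.Module) -> torch.Tensor:
    """Flag inputs as adversarial when the protected and dual models disagree.

    Args:
        x: batch of test images, shape (N, C, H, W).
        protected_model: the model being protected.
        dual_model: the dual model trained (per the paper) with orthogonal
            knowledge distillation to suppress cross-model transferability.

    Returns:
        Boolean tensor of shape (N,), True where the predicted labels differ,
        i.e. the input is taken as adversarial.
    """
    protected_model.eval()
    dual_model.eval()
    with torch.no_grad():
        labels_protected = protected_model(x).argmax(dim=1)
        labels_dual = dual_model(x).argmax(dim=1)
    # Output inconsistency between the two models is the detection signal.
    return labels_protected != labels_dual
```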