Evaluating the Adversarial Robustness of Arabic Spam Classifiers (Prof. Imtiaz Ahmad)
Computer Engineering Department
Several studies have exposed the vulnerability of Natural Language Processing (NLP) models to adversarial attacks: inputs crafted by attackers to deliberately force NLP models into incorrect predictions. Adversarial robustness measures the performance of these systems under such attacks. In the Arabic literature, very few contributions address NLP adversarial robustness. Hence, this study examines the adversarial robustness of an Arabic NLP model, specifically in a classic black-box spam-evasion scenario. In this work, we introduce eight adversarial attacks against Arabic NLP models at every level of granularity to craft semantically and grammatically correct adversarial examples. Three of them are based on character-level perturbations, with the Diacritics attack achieving the largest accuracy decrease of 95%. In addition, local post-hoc explanations, such as SHapley Additive exPlanations (SHAP), were employed to design two word-level perturbation strategies that reduced the accuracy by 23% on average. Furthermore, sentence-level attacks are tested on the victim model through a paraphrase-based strategy that lowered the accuracy by 77.4%. To further extend the effects of the word-level attacks and provide a diverse adversarial example-generation method, multi-level attacks are explored. Despite the unfortified model's excellent accuracy of 99.4%, the first and second proposed multi-level attacks reduced the accuracy to 8.3% and 0.9%, respectively. At the same time, most of the proposed adversarial example-generation strategies maintained high semantic similarity and a low perturbation distance.
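To make the character-level attack surface concrete, the following minimal Python sketch illustrates how a diacritics-style perturbation might be applied to Arabic text. It is not taken from the thesis: the function name, insertion rate, and random insertion policy are illustrative assumptions, since the abstract does not specify the exact attack procedure, perturbation budget, or victim model.

    import random

    # Arabic diacritical marks (harakat); an illustrative subset.
    DIACRITICS = ["\u064B", "\u064C", "\u064D", "\u064E",
                  "\u064F", "\u0650", "\u0651", "\u0652"]

    def diacritics_perturb(text: str, rate: float = 0.3, seed: int = 0) -> str:
        """Insert a random diacritic after a fraction of Arabic letters.

        The surface form of each word changes for a character-based model
        while the text stays readable to humans, which is the intuition
        behind character-level diacritics attacks.
        """
        rng = random.Random(seed)
        out = []
        for ch in text:
            out.append(ch)
            # Only decorate Arabic letters, each with probability `rate`.
            if "\u0621" <= ch <= "\u064A" and rng.random() < rate:
                out.append(rng.choice(DIACRITICS))
        return "".join(out)

    if __name__ == "__main__":
        sample = "هذا عرض ترويجي مجاني"  # hypothetical spam-like phrase
        print(diacritics_perturb(sample))

In a black-box spam-evasion setting such a perturbation would be applied to candidate spam messages and the classifier queried only for its predictions; the word- and sentence-level strategies summarized above would replace the per-character loop with SHAP-guided word substitutions or paraphrasing, respectively.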
Supervisor: Prof. Imtiaz Ahmad
Convener: Prof. Anwar Al-Yatama
Examination Committee: Prof. Anwar Al-Yatama and Dr. Ameer Mohammed