Arabic Sign Language Machine Translation Based on Transformers
Computer Engineering Department
This thesis presents a deep learning-based framework for Arabic Sign Language (ArSL) recognition using transformer-based video classification architectures. The study explores and compares four models: TimeSformer, Swin-MSTP Light (BiLSTM), Swin-MSTP Full (MS-TCN + BiLSTM), and a newly proposed model, Swin-CTGR, which integrates the Swin Transformer with a lightweight temporal module that combines a 1D convolution, a Temporal Attention Block (TAB), and a Gated Recurrent Unit (GRU). All models were evaluated on the KArSL-100 dataset, which comprises 30-frame RGB video sequences of 100 isolated Arabic signs. A standardized pre-processing pipeline, including pose estimation and normalization, was applied, and each model was evaluated both with and without data augmentation to assess generalizability. Experimental results demonstrate that the Swin-CTGR model achieved the best overall performance, with a test accuracy of 98.91%, an F1-score of 0.9891, and the lowest inference time of 0.0072 seconds per sample, making it suitable for real-time applications. Swin-MSTP Light also performed well, offering a strong balance between accuracy and efficiency. While TimeSformer and Swin-MSTP Full achieved good accuracy, they were comparatively slower and more resource-intensive. This work demonstrates the effectiveness of hierarchical spatial encoding and tailored temporal modeling for ArSL recognition. In addition, it offers insight into the trade-offs between accuracy and latency, supporting the development of practical, real-time communication tools for the deaf and hard-of-hearing community.
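The abstract describes the Swin-CTGR temporal module only at a high level (1D convolution, TAB, and GRU on top of Swin Transformer features). The following is a minimal PyTorch sketch of how such a head might be assembled; it is not the thesis implementation, and all layer sizes, the attention formulation, and the names TemporalAttentionBlock and SwinCTGRHead are illustrative assumptions.

```python
# Hypothetical sketch of a Swin-CTGR-style temporal head (not the thesis code).
# Assumes per-frame features from a Swin Transformer backbone have already been
# extracted, e.g. a tensor of shape (batch, 30 frames, 768 features) for KArSL-100.
import torch
import torch.nn as nn


class TemporalAttentionBlock(nn.Module):
    """Illustrative temporal attention: re-weights each frame before the GRU."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(dim, dim // 2), nn.Tanh(), nn.Linear(dim // 2, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim) -> frame weights that sum to 1 over the time axis
        weights = torch.softmax(self.score(x), dim=1)   # (batch, time, 1)
        return x * weights                              # re-weighted frame features


class SwinCTGRHead(nn.Module):
    """1D conv -> temporal attention -> GRU -> classifier, per the abstract's description."""
    def __init__(self, feat_dim: int = 768, hidden: int = 256, num_classes: int = 100):
        super().__init__()
        self.conv = nn.Conv1d(feat_dim, hidden, kernel_size=3, padding=1)  # local temporal context
        self.tab = TemporalAttentionBlock(hidden)
        self.gru = nn.GRU(hidden, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, time, feat_dim) from the Swin backbone
        x = self.conv(frame_feats.transpose(1, 2)).transpose(1, 2)  # (batch, time, hidden)
        x = self.tab(x)
        _, h_n = self.gru(x)               # final hidden state summarizes the sequence
        return self.fc(h_n.squeeze(0))     # (batch, num_classes) class logits


if __name__ == "__main__":
    head = SwinCTGRHead()
    dummy = torch.randn(2, 30, 768)        # 2 clips, 30 frames, 768-dim Swin features
    print(head(dummy).shape)               # torch.Size([2, 100])
```

The sketch illustrates why such a head can be lightweight: the convolution and attention operate per frame, and the GRU carries only a single hidden state, which is consistent with the low per-sample inference time reported for Swin-CTGR.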
Supervisor: Prof. Imtiaz Ahmad
Convener: Prof. Ayed Salman
Examination Committee: Dr. Mahmoud Ben Naser