||Discrete Cosine Transformed Images Are Easy to Recognize in Vision Transformers
||(Jongho Lee) ; (Hyun Kim)
||Computer vision; Image classification; Deep learning; Discrete cosine transform (DCT); Vision transformer
||Deep learning models for image classification with sufficient parameters achieve excellent classification performance because they can effectively extract features from input images. However, their ability to interpret images using only spatial information is limited, because an image is a signal with high spatial redundancy. Therefore, in this study, the discrete cosine transform (DCT) is applied to the input image in units of N×N blocks so that the deep learning model can exploit both frequency and spatial information. The proposed method was implemented and verified using a vision transformer with 16×16 non-overlapping patches as the baseline, trained from scratch, without pre-trained weights, on the CIFAR-10, CIFAR-100, and Tiny-ImageNet datasets. The experimental results show that top-1 accuracy improves by approximately 3-5% on every dataset with little increase in computational cost.
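The block-wise transform described in the abstract can be illustrated with a minimal NumPy sketch. It assumes an orthonormal type-II DCT applied independently to each non-overlapping N×N block and to each channel; the function names `dct_matrix` and `blockwise_dct`, the normalization, and the per-channel handling are illustrative assumptions, not details confirmed by the paper.

```python
import numpy as np


def dct_matrix(n: int) -> np.ndarray:
    """Return the n x n orthonormal DCT-II basis matrix (assumed normalization)."""
    k = np.arange(n).reshape(-1, 1)  # frequency index (rows)
    i = np.arange(n).reshape(1, -1)  # sample index (columns)
    basis = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    basis[0, :] = np.sqrt(1.0 / n)   # rescale the DC row so the matrix is orthonormal
    return basis


def blockwise_dct(image: np.ndarray, n: int) -> np.ndarray:
    """Apply a 2-D DCT to every non-overlapping n x n block of an image.

    `image` has shape (H, W) or (H, W, C); H and W must be multiples of n.
    The output has the same shape: each block is replaced in place by its
    DCT coefficients, so the block layout (coarse spatial information) is kept.
    """
    h, w = image.shape[:2]
    if h % n or w % n:
        raise ValueError("image height and width must be multiples of the block size")

    basis = dct_matrix(n)
    x = image.astype(np.float64)
    is_gray = x.ndim == 2
    if is_gray:
        x = x[..., None]             # add a channel axis for uniform handling
    channels = x.shape[2]

    # Split into blocks: (block_row, row_in_block, block_col, col_in_block, channel).
    blocks = x.reshape(h // n, n, w // n, n, channels)
    # Compute basis @ block @ basis.T for every block and channel at once.
    coeffs = np.einsum("ui,aibjc,vj->aubvc", basis, blocks, basis)
    out = coeffs.reshape(h, w, channels)
    return out[..., 0] if is_gray else out


# Usage: transform a random 32x32 RGB image (CIFAR-sized) with 8x8 blocks.
rng = np.random.default_rng(0)
example = rng.random((32, 32, 3))
transformed = blockwise_dct(example, 8)
```

Because the assumed basis is orthonormal, the transform preserves the energy of each block, and the output keeps the input shape, so it could be fed to a patch-based model in place of the raw pixels. How the paper chooses N relative to the 16×16 patch size is not stated in the abstract.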