Jongho Lee and Hyun Kim
(Department of Electrical and Information Engineering, Research Center for Electrical
and Information Technology, Seoul National University of Science and Technology /
Seoul, Korea, {jhlees, hyunkim}@seoultech.ac.kr)
Copyright © The Institute of Electronics and Information Engineers (IEIE)
Keywords
Computer vision, Image classification, Deep learning, Discrete cosine transform (DCT), Vision transformer
1. Introduction
Recently, owing to developments in deep learning (DL), there have been remarkable
performance improvements in the field of computer vision [1-4, 26-29]. Until now, most DL-based computer vision studies have focused mainly on model architectures [5,6] and computational methods, such as convolution and self-attention [7,8]. In the 2020s, self-attention-based vision transformers have tended to replace convolutional
neural networks (CNNs) [1,3,5,9]. The transformer model, which has been actively studied in the field of natural language
processing (NLP), allows one image patch to act as a word in a sentence through patch
embedding. This enables self-attention operations in the field of computer vision.
However, unlike a word, which carries concentrated meaning within a sentence, an image is merely a light signal and thus contains a large amount of redundant information [4]. Therefore, if important information is extracted from the image in advance, it may
help improve the accuracy of the DL model in computer vision. Frequency domain transform
methods, such as the Discrete Cosine Transform (DCT), Discrete Wavelet Transform (DWT),
and Fast Fourier Transform (FFT), have been steadily used for extracting meaningful
information from images. In modern DL models, it is also possible to use these frequency-domain transforms for this purpose [10-12], because computational methods based on the DCT, DWT, and FFT can achieve good performance on image tasks, as can convolution and self-attention [13]. Several attempts to use the DCT in DL models have been reported [14-16]. However, these studies were performed only to reduce the communication bandwidth and computational costs of CNN or NLP models.
This paper proposes a method to improve the accuracy of the vision transformer model
using the DCT. Specifically, a 2D-DCT [10] is applied to the input image in units of N${\times}$N blocks before the image enters the vision transformer, allowing the DL model to utilize inputs with both spatial and frequency information. The proposed method improves the top-1 accuracy of the vision transformer by approximately 3-5% on the Cifar-10 [17], Cifar-100 [17], and Tiny-ImageNet [18] datasets, while performing the 2D-DCT only once, immediately before patch embedding.
In addition, as the proposed method can improve the performance of various vision
transformer models [1], including the tiny and small sizes, it has high compatibility and scalability across model sizes and datasets.
2. Background
2.1 Patch Embedding
Before the emergence of vision transformers [1] by Dosovitskiy et al., CNN models [6,19] were widely used in computer vision. Since then, DL models based solely on self-attention operations, without any convolution structure, have become dominant [1,3,5,9]. The concept of self-attention in computer vision is similar to that in NLP: through the patch embedding process [1], an image patch in a vision transformer plays the role of a word in a sentence, as shown in Fig. 1. The idea is simple: the image is cut into non-overlapping 16${\times}$16 patches, each patch is linearly projected, and a class token is added. Through these ideas, transformer models
can achieve state-of-the-art performance not only in NLP but also in computer vision.
However, despite the contributions of patch embedding, most studies on computer vision tasks have focused on improving model architectures and computational methods, such as convolution, multi-layer perceptrons, and self-attention [5-8, 30]. Accordingly, these studies cannot overcome the limitation of using only the spatial
information of the image. As shown in Fig. 1, when patch embedding is performed, the patches cut into 16${\times}$16 pixels are
converted to a 196${\times}$C-dimensional matrix through a 2D-convolution operation.
Owing to the nature of the CNN, the filter weights applied to each patch are shared. Therefore, it may be helpful to provide each patch in a uniform format rather than to use the raw pixels of the original image.
Fig. 1. Diagram showing the patch embedding process for a sample 224${\times}$224 image from the ImageNet-1k dataset. The RGB image is cut into 196 (= 14${\times}$14) patches of size 16${\times}$16, which are converted into a 196${\times}$C matrix through a non-overlapping 2D convolution with a 16${\times}$16 filter.
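For reference, this patch-embedding step can be sketched in PyTorch as follows (a minimal sketch, assuming an embedding dimension of 192 as in the tiny model; class and parameter names are illustrative rather than the exact DeiT implementation):

import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    # Non-overlapping patch embedding: a 2D convolution whose kernel size and
    # stride both equal the patch size, followed by flattening into a token sequence.
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=192):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2     # 14 x 14 = 196
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                         # x: (B, 3, 224, 224)
        x = self.proj(x)                          # (B, C, 14, 14)
        return x.flatten(2).transpose(1, 2)       # (B, 196, C)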
2.2 2D-Discrete Cosine Transform
The 2D-DCT is widely used in signal processing and in various fields, including image compression standards such as JPEG [20]. One advantage of the 2D-DCT is that the image can be viewed from a frequency
perspective. When the 2D-DCT (in units of N${\times}$N block size) is performed on
the image, the upper-left side of the block has low-frequency information, whereas
the lower-right side has high-frequency information. Fig. 2 shows the images obtained by performing the N${\times}$N block 2D-DCT on the 224${\times}$224 RGB image of Fig. 1, so that each block contains frequency information.
An image is a signal that contains spatially redundant information. Therefore, if the 2D-DCT is performed on a block basis, frequency information is expressed locally within each block, while spatial information is preserved globally across blocks. The motivation of this study is that, by exploiting these properties, vision transformer models, which lack an inductive bias, can be trained better when the images fed to them have the block-wise 2D-DCT applied. The 2D-DCT with an N${\times}$N block size can be calculated as follows:
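$$ D(i,j)=\alpha(i)\,\alpha(j)\sum_{x=0}^{N-1}\sum_{y=0}^{N-1} I(x,y)\cos\left[\frac{(2x+1)i\pi}{2N}\right]\cos\left[\frac{(2y+1)j\pi}{2N}\right] \qquad (1) $$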
where D represents one N${\times}$N-sized block. For example, an original image with a resolution of 224${\times}$224 may be divided into 3136 blocks of size 4${\times}$4, 784 blocks of size 8${\times}$8, 196 blocks of size 16${\times}$16, or a single 224${\times}$224 block. In (1), i and j are the pixel indices of the block produced by the 2D-DCT, and x and y are the pixel indices of the original block I. $\alpha$ is a scale factor that makes the transform orthonormal and is given as follows:
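$$ \alpha(k)=\begin{cases}\sqrt{1/N}, & k=0\\ \sqrt{2/N}, & k\neq 0\end{cases} \qquad (2) $$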
In this way, the original image can be converted to the frequency domain on a block basis; within each block, the low-frequency energy is concentrated in the upper-left part, whereas the relatively less important high-frequency energy lies toward the lower right.
Fig. 2. Results of the 2D discrete cosine transform (DCT) for the 224${\times}$224 input image of Fig. 1 in block units of (a) 4${\times}$4; (b) 8${\times}$8; (c) 16${\times}$16; (d) 224${\times}$224.
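As an illustration, the block-wise 2D-DCT of (1) and (2) can be sketched in PyTorch with plain matrix multiplications (the function names here are illustrative; the experiments in Section 4 use the TorchJPEG library [25] instead):

import math
import torch

def dct_matrix(n: int) -> torch.Tensor:
    # Orthonormal N x N DCT-II basis matrix following (1) and (2):
    # row k holds alpha(k) * cos((2x + 1) * k * pi / (2N)) for x = 0..N-1.
    k = torch.arange(n, dtype=torch.float32).unsqueeze(1)   # frequency index
    x = torch.arange(n, dtype=torch.float32).unsqueeze(0)   # spatial index
    basis = torch.cos((2 * x + 1) * k * math.pi / (2 * n))
    alpha = torch.full((n, 1), math.sqrt(2.0 / n))
    alpha[0, 0] = math.sqrt(1.0 / n)
    return alpha * basis

def blockwise_dct2d(img: torch.Tensor, n: int) -> torch.Tensor:
    # Apply the N x N 2D-DCT to every non-overlapping block of img (C, H, W).
    c, h, w = img.shape
    d = dct_matrix(n).to(img.dtype)
    # Split into (C, H/n, W/n, n, n) blocks and transform each block as D B D^T.
    blocks = img.reshape(c, h // n, n, w // n, n).permute(0, 1, 3, 2, 4)
    blocks = d @ blocks @ d.T
    return blocks.permute(0, 1, 3, 2, 4).reshape(c, h, w)

For a 224${\times}$224 image, blockwise_dct2d(img, 4) yields the 3136-block decomposition visualized in Fig. 2(a).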
2.3 Vision Transformer
Fig. 3(a) shows the overall structure of the vision transformer used for image classification.
First, the input image, resized to a resolution of 224${\times}$224, is cut into 196 patches of size 16${\times}$16 through the patch embedding process. Subsequently, position information is added to each patch through position embedding, and class information is added in the form of a class token. Because the multi-head attention [21]-based transformer encoder has the same input and output dimensions, it can pass through
several blocks depending on the size of the model (e.g., tiny and small). Multi-head attention performs self-attention in parallel by splitting the query, key, and value inputs across the heads. Through this mechanism, the vision transformer identifies
the relationships between patches and extracts the image features. Finally, an image
classification task is performed through a multi-layer perceptron head after the previous
processes.
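A rough sketch of this head-splitting is shown below (dimensions correspond to a tiny-sized model and are illustrative; actual ViT/DeiT implementations additionally include biases, dropout, and layer normalization):

import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    # Self-attention computed in parallel over several heads by splitting Q, K, V.
    def __init__(self, dim=192, num_heads=3):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                                   # x: (B, N, dim)
        b, n, d = x.shape
        qkv = self.qkv(x).reshape(b, n, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)                # each: (B, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale       # (B, heads, N, N)
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, d)   # merge the heads back
        return self.proj(out)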
3. Proposed Method
While the vision transformer model normally receives the original image in RGB format, this study proposes adding a new 2D-DCT patch embedding method at the input stage of the vision transformer model. In the proposed method, the original image is cut into N${\times}$N blocks, and the 2D-DCT is performed on each block before patch embedding, making frequency information available locally within each block, as shown in Fig. 3. Location information is more critical in a 2D-DCT block than in the original
image because the upper-left part of the 2D-DCT block has DC information representing
the average pixel value. In contrast, the bottom-right part has high-frequency AC
information. The proposed method improves performance by utilizing this additional
information and has the advantage that it can be implemented with a minimal computational
increase to the baseline (i.e., vision transformer) without changing the number of
parameters. In addition, it can be applied to vision transformer models with various structures (i.e., it offers high compatibility and scalability).
The application of the proposed method is described below through an example with a specific block size. When the 2D-DCT block size is 4${\times}$4, the original image of size 224${\times}$224 is divided into 3136 (= (224${\times}$224) / (4${\times}$4)) blocks,
and a 4${\times}$4 2D-DCT is then performed in parallel for each block. The 3136 blocks
on which the 4${\times}$4 2D-DCT is performed are again combined into images of size
224${\times}$224. In this case, an additional 1.5M floating-point operations (FLOPs) are required for the 2D-DCT, compared to performing the vision transformer’s inference directly on the RGB image with a resolution of 224${\times}$224. However, this is an insignificant increase considering
that the computational amount of DeiT-Tiny and DeiT-Small is 1.3G FLOPs and 4.6G FLOPs
[22], respectively. Even if the 2D-DCT block size increases to 8${\times}$8, only approximately
2.7M FLOPs are required (i.e., when performing the same process with the 784 8${\times}$8
blocks). In addition, because there are no dependencies between blocks, the DCT can be computed quickly in parallel using CUDA [23]. The subsequent processing follows the operation of the existing vision
transformer described in Section 2.3. In other words, the proposed method is a straightforward
but effective method that performs N${\times}$N block DCT on the input image without
changing the structure of the existing vision transformer model and then inputs the
DCT blocks to patch embedding.
Fig. 3. Overall structure of (a) the vision transformer used for image classification and the proposed method, in which the N${\times}$N block 2D-DCT is applied to the input image before patch embedding.
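A minimal sketch of how the proposed preprocessing can be placed in front of an unchanged vision transformer is given below (the timm model name and the blockwise_dct2d helper sketched in Section 2.2 are illustrative assumptions; the experiments use the TorchJPEG library [25] for the DCT):

import torch
import timm  # PyTorch image models library [24]

def dct_patch_embedding_input(images: torch.Tensor, block_size: int = 16) -> torch.Tensor:
    # Apply the N x N block 2D-DCT to each image before the usual patch embedding.
    # blockwise_dct2d is the helper sketched in Section 2.2.
    return torch.stack([blockwise_dct2d(img, block_size) for img in images])

model = timm.create_model("deit_tiny_patch16_224", num_classes=100)  # baseline, unchanged
x = torch.randn(8, 3, 224, 224)                   # a batch of 224 x 224 RGB images
logits = model(dct_patch_embedding_input(x, block_size=16))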
4. Experimental Results
4.1 Experiment Settings
A vision transformer was trained and validated on four Tesla V100 GPUs using the Cifar-10
[17], Cifar-100 [17], and Tiny-ImageNet [18] datasets. The Cifar-10 and Cifar-100 datasets have 50,000 training images and 10,000
validation images with 10 and 100 classes, respectively. The Tiny-ImageNet dataset
contains 100,000 training images and approximately 10,000 validation images for 200
classes. The training strategy mostly follows that of DeiT [3], and the detailed settings are as follows. Adam [31] was used as the optimizer, with the momentum set to 0.9 and the weight decay to 0.05. All models were trained for 300 epochs with a batch size of 1,024 and a learning rate of 0.0005. All source code is based on the PyTorch image models (timm) library [24], and the TorchJPEG library [25] is used for the 2D-DCT operations. It should be noted that the DCT block size in all result tables is marked
as ``-'' for vanilla vision transformers that do not use the DCT patch embedding.
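For illustration only, the optimizer settings described above can be expressed as follows (a minimal sketch assuming the plain torch.optim.Adam interface; the full DeiT training recipe additionally involves data augmentation and learning-rate scheduling that are not shown):

import torch
import timm

model = timm.create_model("deit_tiny_patch16_224", num_classes=100)
optimizer = torch.optim.Adam(model.parameters(),
                             lr=5e-4,             # learning rate of 0.0005
                             betas=(0.9, 0.999),  # first-moment coefficient of 0.9
                             weight_decay=0.05)   # weight decay of 0.05
# Training runs for 300 epochs with a global batch size of 1,024 on four Tesla V100 GPUs.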
4.2 Accuracy Evaluation
Table 1 lists the results of applying the proposed method to the vision transformer with
the tiny and small models on the Cifar-10 dataset. As described above, the models were trained by feeding discrete-cosine-transformed images in units of N${\times}$N blocks into the vision transformer before patch embedding. When a 4${\times}$4 block size
2D-DCT was adopted on the Cifar-10 dataset, the top-1 accuracy increased by 4.09%
and 0.36% for the tiny and small models, respectively. When a 16${\times}$16 block
size 2D-DCT was adopted, the top-1 accuracy improved by 4.57% and 3.7% for the tiny
and small models, respectively. When the block size was larger than 16${\times}$16, the performance was inferior to the other settings because detailed spatial information was lost. Therefore, experiments with block sizes larger than 32${\times}$32 were not performed. In
particular, when 2D-DCT was performed with an image size of 224${\times}$224, all
spatial features of the image were lost because of the characteristics of 2D-DCT,
as shown in Fig. 2(d).
Table 2 shows the results of the same experiment on the Cifar-100 dataset with the tiny and small models.
When the 4${\times}$4 block size 2D-DCT was adopted, the top-1 accuracy increased
by 3.86% and 3.59% for the tiny and small models, respectively. When the 16${\times}$16
block size 2D-DCT was adopted, the top-1 accuracy increased by 5.36% and 8.92% for
the tiny and small models, respectively.
Experiments were also conducted on the Tiny-ImageNet dataset, which is relatively large, as shown in Table 3. When the 4${\times}$4 block size 2D-DCT was adopted, the top-1 accuracy increased
by 2.92% and 5.37% for the tiny and small models, respectively. When the 16${\times}$16
block size 2D-DCT was adopted, it increased by 3.66% and 5.49% for the tiny and small
models, respectively.
As a result, the proposed method increased performance the most in all cases when the 16${\times}$16 block size 2D-DCT patch embedding was applied. This result is attributed to the patches being cut into 16${\times}$16 pixels during the patch embedding process of DeiT. The increase in computational cost and the decrease in speed were negligible.
In addition, as the proposed method can be applied directly to most vision transformer models that use patch embedding, its compatibility is excellent, making it an easy and general way to improve performance.
Table 1. Accuracy of the Proposed Method on the CIFAR-10.

Model | DCT block size | Top1-Acc. (%) | Top5-Acc. (%)
DeiT-Tiny | - | 79.92 | 98.8
DeiT-Tiny | 2 | 84.42 | 99.18
DeiT-Tiny | 4 | 84.01 | 99.2
DeiT-Tiny | 8 | 84.27 | 99.21
DeiT-Tiny | 16 | 84.49 | 99.26
DeiT-Tiny | 32 | 82.7 | 99.21
DeiT-Small | - | 79.73 | 98.76
DeiT-Small | 2 | 75.96 | 98.42
DeiT-Small | 4 | 80.09 | 98.73
DeiT-Small | 8 | 80.67 | 99.0
DeiT-Small | 16 | 83.43 | 99.07
DeiT-Small | 32 | 53.82 | 93.68
Table 2. Accuracy of the Proposed Method on the CIFAR-100.

Model | DCT block size | Top1-Acc. (%) | Top5-Acc. (%)
DeiT-Tiny | - | 66.17 | 89.79
DeiT-Tiny | 2 | 68.95 | 91.36
DeiT-Tiny | 4 | 70.03 | 91.55
DeiT-Tiny | 8 | 69.64 | 91.34
DeiT-Tiny | 16 | 71.53 | 92.07
DeiT-Tiny | 32 | 67.15 | 90.35
DeiT-Small | - | 61.13 | 86.2
DeiT-Small | 2 | 61.19 | 85.34
DeiT-Small | 4 | 64.72 | 88.12
DeiT-Small | 8 | 66.31 | 88.59
DeiT-Small | 16 | 70.05 | 90.75
DeiT-Small | 32 | 38.47 | 38.76
Table 3. Accuracy of the Proposed Method on the Tiny-ImageNet.

Model | DCT block size | Top1-Acc. (%) | Top5-Acc. (%)
DeiT-Tiny | - | 54.22 | 77.84
DeiT-Tiny | 2 | 57.24 | 79.88
DeiT-Tiny | 4 | 57.14 | 79.94
DeiT-Tiny | 8 | 56.81 | 80.17
DeiT-Tiny | 16 | 57.88 | 80.62
DeiT-Tiny | 32 | 52.45 | 76.75
DeiT-Small | - | 48.78 | 73.7
DeiT-Small | 2 | 53.26 | 76.28
DeiT-Small | 4 | 54.15 | 76.94
DeiT-Small | 8 | 52.75 | 76.88
DeiT-Small | 16 | 54.27 | 77.66
DeiT-Small | 32 | 28.42 | 53.66
5. Conclusion
Modern DL models perform well because they can independently extract and process the information they need from an image. In this study, however, because an image is simply a light signal, the DL model was helped to process the image better by exploiting frequency-domain transforms that have long been studied in signal processing. The proposed method shows remarkable performance improvements in all experimental cases, while the computational cost and latency remain nearly unchanged. The proposed method can be applied directly to other transformer-based models and can be extended to tasks such as object detection, instance segmentation, semantic segmentation, and depth estimation.
ACKNOWLEDGMENTS
This research was supported by the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2023-RS-2022-00156295) supervised by the IITP (Institute for Information & Communications Technology Planning & Evaluation).
REFERENCES
A. Dosovitskiy et al., “An image is worth 16x16 words: Transformers for image recognition
at scale,” arXiv preprint arXiv:2010.11929, 2020.
Z. Dai, H. Liu, Q. V. Le, and M. Tan, “Coatnet: Marrying convolution and attention
for all data sizes,” in Proc. Advances in Neural Information Processing Systems, 2021,
vol. 34, pp. 3965-3977.
H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou, “Training
data-efficient image transformers & distillation through attention,” in Proceedings
of the International Conference on Machine Learning, 2021, pp. 10347-10357.
K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, “Masked autoencoders are
scalable vision learners,” arXiv preprint arXiv:2111.06377, 2021.
Z. Liu et al., “Swin transformer: Hierarchical vision transformer using shifted windows,”
in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021,
pp. 10012-10022.
Z. Xia, X. Pan, S. Song, L. E. Li, and G. Huang, “Vision Transformer with Deformable
Attention,” arXiv preprint arXiv:2201.00520, 2022.
K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in Proceedings of the
IEEE international conference on computer vision, 2017, pp. 2961-2969.
A. G. Howard et al., “Mobilenets: Efficient convolutional neural networks for mobile
vision applications,” arXiv preprint arXiv:1704.04861, 2017.
Z. Liu et al., “Swin Transformer V2: Scaling Up Capacity and Resolution,” arXiv preprint
arXiv:2111.09883, 2021.
N. Ahmed, T. Natarajan, and K. R. Rao, “Discrete cosine transform,” IEEE Transactions on Computers, vol. C-23, no. 1, pp. 90-93, 1974.
J. Shin and H. Kim, “RL-SPIHT: Reinforcement Learning based Adaptive Selection of
Compression Ratio for 1-D SPIHT Algorithm,” IEEE Access, vol. 9, pp. 82485-82496,
2021.
H. Kim, A. No, and H.-J. Lee, “SPIHT Algorithm with Adaptive Selection of Compression
Ratio Depending on DWT Coefficients,” IEEE Transactions on Multimedia, vol. 20, no.
12, pp. 3200-3211, Dec. 2018.
Y. Rao, W. Zhao, Z. Zhu, J. Lu, and J. Zhou, “Global filter networks for image classification,”
in Proceedings of the Advances in Neural Information Processing Systems, 2021, vol.
34.
K. Xu, M. Qin, F. Sun, Y. Wang, Y.-K. Chen, and F. Ren, “Learning in the Frequency
Domain,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), Jun. 2020, pp. 1740-1749.
X. Shen et al., “DCT-Mask: Discrete Cosine Transform Mask Representation for Instance
Segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), Jun. 2021, pp. 8720-8729.
C. Scribano, G. Franchini, M. Prato, and M. Bertogna, “DCT-Former: Efficient Self-Attention
with Discrete Cosine Transform,” arXiv preprint arXiv:2203.01178, 2022.
A. Krizhevsky, G. Hinton, and others, “Learning multiple layers of features from tiny
images,” 2009.
Y. Le and X. S. Yang, “Tiny ImageNet Visual Recognition Challenge,” 2015.
J. Choi, D. Chun, H. Kim, and H.-J. Lee, “Gaussian yolov3: An accurate and fast object
detector using localization uncertainty for autonomous driving,” in Proc. IEEE/CVF
Int. Conf. Computer Vision, 2019, pp. 502-511.
G. K. Wallace, “The JPEG still picture compression standard,” IEEE Transactions on
Consumer Electronics, vol. 38, no. 1, pp. xviii-xxxiv, 1992.
A. Vaswani et al., “Attention is all you need,” in Proc. Advances in Neural Information
Processing Systems, 2017, vol. 30.
S. Khan, M. Naseer, M. Hayat, S. W. Zamir, F. S. Khan, and M. Shah, “Transformers
in vision: A survey,” ACM Computing Surveys (CSUR), 2021.
NVIDIA, P. Vingelmann, and F. H. P. Fitzek, CUDA, release: 10.2.89. 2020. [Online].
Available:
R. Wightman, PyTorch Image Models. GitHub, 2019. doi: 10.5281/zenodo.4414861.
M. Ehrlich, L. Davis, S.-N. Lim, and A. Shrivastava, “Quantization Guided JPEG Artifact Correction,” in Proceedings of the European Conference on Computer Vision (ECCV), 2020.
K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,”
in Proceedings of the IEEE conference on computer vision and pattern recognition,
2016, pp. 770-778.
M. Tan and Q. Le, “Efficientnet: Rethinking model scaling for convolutional neural
networks,” in Proceedings of the International conference on machine learning. PMLR,
2019, pp. 6105-6114.
M. Tan and Q. Le, “Efficientnetv2: Smaller models and faster training,” in Proceedings
of the International Conference on Machine Learning. PMLR, 2021, pp. 10096-10106.
S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in Advances in Neural Information Processing Systems, vol. 28, 2015.
J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei, “Deformable convolutional
networks,” in Proc. IEEE/CVF Int. Conf. Computer Vision, 2017, pp. 764-773.
I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint
arXiv:1711.05101, 2017.
Author
Jongho Lee received his B.S. degree in Electrical and Information Engineering from
Seoul National University of Science and Technology, Seoul, Korea, in 2020. Currently,
he is a graduate student of Seoul National University of Science and Technology, Seoul,
Korea. In 2020, he was a research student at the Korea Institute of Science and Technology
(KIST), Seoul, Korea. In 2022, he was a visiting student at the University of Wisconsin-Madison, Wisconsin, USA. His research interests include deep learning and machine learning algorithms for computer vision tasks.
Hyun Kim received his B.S., M.S., and Ph.D. degrees in Electrical Engineering and
Computer Science from Seoul National University, Seoul, Korea, in 2009, 2011 and 2015,
respectively. From 2015 to 2018, he was with the BK21 Creative Research Engineer Development
for IT, Seoul National University, Seoul, Korea, as a BK Assistant Professor. In 2018,
he joined the Department of Electrical and Information Engineering, Seoul National
University of Science and Technology, Seoul, Korea, where he is currently working
as an Associate Professor. His research interests include algorithms, computer architecture, memory, and SoC design for low-complexity multimedia applications and deep neural networks.