
School of Computer Science and Engineering, Kyungpook National University, Daegu 41566, Korea ({w.shin, nbaek}@knu.ac.kr)



Keywords: Video processing, Massively parallel processing, Mobile architecture, Mobile computing, GPGPU, OpenCL

1. Introduction

Recently, the mobile processor manufacturer Qualcomm introduced 8K UHD (Ultra High Definition) video recording to its chipsets [1]. The chipset can process 24 frames of 8K video per second, bringing the creation and consumption of UHD video to the general public. In addition, captured UHD video can be broadcast from a mobile phone to the internet in real time using 5G (fifth-generation) communication services [2-5]. A study by ETRI (Electronics and Telecommunications Research Institute, Korea) forecast that demand for 8K UHD content will increase as more 8K UHD display panels are supplied in 2025 [6,7].

Frames from a live stream of UHD video can be utilized in various applications through image-processing techniques. One major area that uses real-time video to extract information is lane and object detection in vehicle-mounted video recorders, which alert drivers to hazardous situations. Video frames can also be used to detect artifacts on a PCB (Printed Circuit Board) during the soldering process [8] and to detect fire from a surveillance camera [9]. Moreover, many conventional AR (Augmented Reality) applications analyze video frames to calibrate a camera or detect simple object information [10,11].

Captured video is also utilized in many deep learning applications based on DNNs (Deep Neural Networks), RNNs (Recurrent Neural Networks), or CNNs (Convolutional Neural Networks) [12]. Applications include sophisticated object segmentation, human emotion recognition, situation awareness, and video analysis [13-17]. Moreover, several efforts have been made to process video with neural networks on mobile devices because recent mobile processors have dedicated NPUs (Neural Processing Units) or AI (Artificial Intelligence) accelerators.

One major time-consuming part of these deep learning-based methods is the pre-processing of video frames. Pre-processing includes conventional image-processing operations, e.g., color-space conversion, thresholding (binarization), and convolutional filtering. Fig. 1 gives examples of image pre-processing methods widely used for neural networks.
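
As a concrete illustration (written for this discussion rather than taken from the paper), these three steps map directly onto standard OpenCV C++ calls; the input file name and threshold value are placeholders.

    #include <opencv2/opencv.hpp>

    // Illustrative pre-processing pipeline: color-space conversion,
    // thresholding (binarization), and a 3x3 convolutional (Sobel) filter.
    int main() {
        cv::Mat frame = cv::imread("frame.png");   // placeholder input image
        cv::Mat gray, binary, edges;
        cv::cvtColor(frame, gray, cv::COLOR_BGR2GRAY);            // color conversion
        cv::threshold(gray, binary, 128, 255, cv::THRESH_BINARY); // binarization
        cv::Sobel(gray, edges, CV_8U, 1, 0, 3);                   // convolutional filter
        return 0;
    }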

To gain as much information as possible, a wide-angle, high-resolution camera is used to capture video, which leads to longer pre-processing times. Therefore, minimizing the pre-processing time of a high-resolution video frame is an important factor in increasing the efficiency of a deep learning pipeline.

The Sobel operator is one of the most widely used convolutional image pre-processing methods; it detects edge information and is used widely in classification tasks [11,18-20]. The de facto standard computer vision library, OpenCV (Open Source Computer Vision Library), also provides image pre-processing algorithms, including the Sobel operator [21]. OpenCV offers a good general solution for image-processing algorithms, but it lacks optimization for specific functions owing to its generality.

This paper introduces an optimized image pre-processing method, specialized in image convolution for very high-resolution video, using OpenCL (Open Computing Language) [22]. The proposed scheme can fully utilize the resources of embedded systems with low-end mobile processors.

The remainder of this paper is organized as follows. The related works section reviews previous research on accelerating image convolution. The section on the optimized OpenCL kernel architecture scheme introduces the technique used to process the convolution kernel. The performance evaluation section describes a prototype video player implementation, evaluates the performance, and compares it with the well-known image-processing library. Finally, the last section concludes the paper with a discussion of the effects of the proposed scheme.

Fig. 1. Examples of image pre-processing methods for neural networks.

2. Related Works

Several efforts have been made to accelerate image processing using OpenCL. A recent study [23] showed that the Sobel operation can be processed on a 2K image in 0.926 ms. However, that work targeted a desktop platform rather than a mobile platform. Table 1 lists the performance specifications of the Nvidia GTX1060, which was used in the previous study [23], and the ARM Mali-T880, which was used in this paper.

Although both GPUs (Graphics Processing Units) were launched in the same year, the performance gap between the two chipsets is apparent because of the limited power consumption of mobile architectures. Therefore, another approach to accelerating image processing on mobile architectures is required.

Another study [24] optimized image processing by manipulating data to minimize load and store times, achieving up to 3.3-times faster processing than a conventional OpenCL Sobel operation on a mobile GPU. Such data-level optimization is a good solution for accelerating parallel tasks; a sketch of the general idea follows.
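
The OpenCL C fragment below illustrates the general idea of such data-level optimization; it is a simplified sketch written for this discussion, not the kernel of [24]. A single 16-byte vectorized load replaces 16 scalar loads, reducing the number of memory transactions per work-item.

    // OpenCL C sketch: each work-item moves 16 pixels with one wide
    // load/store pair instead of 16 scalar load/store pairs.
    __kernel void copy_row_vectorized(__global const uchar *src,
                                      __global uchar *dst,
                                      const int width)
    {
        int offset = get_global_id(0) * 16;             // 16 pixels per work-item
        if (offset + 16 <= width) {
            uchar16 pixels = vload16(0, src + offset);  // one wide load
            vstore16(pixels, 0, dst + offset);          // one wide store
        }
    }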

Because the goal of this study was to process video frames while playing real-time video, it is important to consider that GPU resources are already being used to decode the video. Performance deteriorates if additional resources are spent on data manipulation, owing to frequent context switching between the different tasks.

The parallel implementation provided by OpenCV also offers a good general solution for convolution, including the Sobel operator. The library is likewise implemented with OpenCL to utilize parallel processing, but it is not tuned for the platform it runs on. Additionally, a conversion from the FFmpeg-decoded YV12 image format to the OpenCV RGB image format must be performed before processing, which requires multiple sequential reads of memory and leads to reduced performance [25].

Table 1. Specifications of the GTX1060 and the Mali-T880.

Specification | GTX1060 | Mali-T880
First Launch | Q3 2016 | Q1 2016
Maximum Clock | 1709 MHz | 650 MHz
Dedicated Memory | 6 GB | None
Maximum Power Consumption | 120 W | 7 W
32-bit Floating-Point Operations per Second | 4375 GFLOPS | 265 GFLOPS

3. Optimizing OpenCL Kernel Architecture Scheme

OpenCL was used as the parallel-processing API (Application Programming Interface) because this study aimed to perform real-time image processing while playing video in real time on mobile architectures. The Exynos 8890 processor was targeted because of its wide deployment in the Samsung Galaxy S7. The chipset has an integrated GPU (Graphics Processing Unit), the ARM Mali-T880, which supports a maximum work-group size of $(256, 256, 256)$.
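
Such per-device limits can be confirmed at run time through the standard OpenCL host API; the following minimal fragment (illustrative, with error handling omitted) queries the work-group and per-dimension work-item limits of the first available GPU.

    #include <CL/cl.h>
    #include <cstdio>

    // Query the maximum work-group size and the per-dimension
    // work-item limits of the first available GPU device.
    int main() {
        cl_platform_id platform;
        cl_device_id device;
        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

        size_t max_wg;
        size_t max_items[3];
        clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                        sizeof(max_wg), &max_wg, NULL);
        clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_ITEM_SIZES,
                        sizeof(max_items), max_items, NULL);

        printf("max work-group size: %zu\n", max_wg);
        printf("max work-item sizes: (%zu, %zu, %zu)\n",
               max_items[0], max_items[1], max_items[2]);
        return 0;
    }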

In the proposed scheme, each thread processes one pixel. A thread is the minimum unit of a program that runs kernel source code on a single GPU core. Multiple threads can be launched physically and simultaneously, up to the work-group size, and multiple work-groups can be constructed logically and launched simultaneously on the GPU. They are typically arranged into ND-Range (n-dimensional range) data types in OpenCL. Therefore, although the total number of pixels in an 8K $(7680 * 4320)$ video frame is larger than the maximum work-group size of the chipset, the threads can be divided into multiple work-groups within the size of the ND-Range.

The work-group size should be adjusted adaptively so that it approaches, but does not exceed, the maximum work-group size. Because launching OpenCL kernels with sizes that are powers of two is ideal, the total number of threads $N_{threads}$ is first calculated using the equation

(1)
$ N_{threads}=\left(\frac{Width_{f}-1}{64}+1\right)\times Height_{f} $

where $Width_{f}$ and $Height_{f}$ represent the width and height of the video frame, and the division is an integer division (i.e., the width is divided into 64-pixel spans, rounded up). With $N_{threads}$, the optimal adaptive work-group size $(X, Y, Z)$ can then be calculated with the procedure in Fig. 2; a host-side sketch of the resulting launch follows.
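
The sketch below computes Eq. (1) and enqueues the kernel. Interpreting Eq. (1) literally, each work-item covers a 64-pixel span of one row; this interpretation is an assumption made for illustration, and the local (work-group) size is left to the runtime here, whereas the actual implementation chooses it adaptively via the procedure in Fig. 2.

    #include <CL/cl.h>

    // Compute the launch geometry of Eq. (1) and enqueue the kernel.
    // Assumption for illustration: one work-item per 64-pixel span.
    void launch_convolution(cl_command_queue queue, cl_kernel kernel,
                            size_t width, size_t height)
    {
        size_t spans_per_row = (width - 1) / 64 + 1;   // ceil(width / 64)
        // Eq. (1): total work-items = spans_per_row * height
        size_t global_size[2] = { spans_per_row, height };

        // The prototype derives an adaptive local size (Fig. 2); passing
        // NULL lets the OpenCL runtime pick one and keeps the sketch short.
        clEnqueueNDRangeKernel(queue, kernel, 2, NULL,
                               global_size, NULL, 0, NULL, NULL);
    }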

The convolution kernel used was the Sobel kernel shown in Fig. 3. A general solution for all 3 * 3 convolution kernels could be obtained by passing the kernel coefficients as a parameter. In the proposed prototype, however, a fixed kernel was used for the best performance. Although fixed kernels lack generality from a programming standpoint, the kernel itself can be adjusted if other kernels are needed.
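
A minimal OpenCL C kernel with the fixed Sobel coefficients of Fig. 3 might look as follows; the single-channel buffer layout, the one-pixel-per-work-item mapping, and the border handling are assumptions made for illustration rather than the exact production kernel.

    // Fixed 3x3 Sobel on a single-channel (luminance) plane,
    // one work-item per pixel; border pixels are skipped for brevity.
    __kernel void sobel3x3(__global const uchar *src,
                           __global uchar *dst,
                           const int width, const int height)
    {
        int x = get_global_id(0);
        int y = get_global_id(1);
        if (x < 1 || y < 1 || x >= width - 1 || y >= height - 1) return;

        #define P(i, j) ((int)src[(y + (j)) * width + (x + (i))])
        int gx = -P(-1,-1) + P(1,-1) - 2*P(-1,0) + 2*P(1,0) - P(-1,1) + P(1,1);
        int gy = -P(-1,-1) - 2*P(0,-1) - P(1,-1) + P(-1,1) + 2*P(0,1) + P(1,1);
        #undef P

        uint mag = min(abs(gx) + abs(gy), 255u);  // L1 gradient magnitude
        dst[y * width + x] = (uchar)mag;
    }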

OpenCL programs can access the texture memory area directly, whereas CPU-based programs must copy the image data to the main memory area. Thus, with OpenCL, there is no need to convert the image color space before use in the kernels, and the image-type conversion overhead is minimized.
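
On mobile chipsets with unified memory, this zero-copy behavior is commonly obtained by wrapping the decoder's output in an OpenCL buffer with CL_MEM_USE_HOST_PTR; the fragment below sketches this under that assumption for the luminance (Y) plane of a YV12 frame.

    #include <CL/cl.h>

    // Wrap the decoded Y (luminance) plane in an OpenCL buffer without
    // copying. On a unified-memory mobile GPU, CL_MEM_USE_HOST_PTR lets
    // the runtime use the decoder's memory directly, so no RGB conversion
    // or staging copy is needed.
    cl_mem wrap_luma_plane(cl_context ctx, unsigned char *y_plane,
                           size_t width, size_t height)
    {
        cl_int err;
        cl_mem buf = clCreateBuffer(ctx,
                                    CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR,
                                    width * height, y_plane, &err);
        return (err == CL_SUCCESS) ? buf : NULL;
    }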

Fig. 2. Adaptive work-group size calculating procedure.
Fig. 3. Sobel convolution kernels.
Fig. 4. Flowcharts of each thread in the prototype video player implementation.

4. Performance Evaluation

If a commercial player were used to process frames, multiple unnecessary image conversions from the YV12 color space to the RGB color space would be required. Therefore, a simple video player was implemented to evaluate the performance of the proposed OpenCL kernel. The video player consists of two main threads: a decoder thread and an image processor thread.

The decoder thread decodes a video file until it reaches the EOF (end of file) and pushes each decoded frame into a global queue. The decoder is implemented using the FFmpeg library and fully supports GPU hardware acceleration for video decoding. The frame processor thread takes a decoded frame from the queue and launches the OpenCL kernel or OpenCV function to process it. Fig. 4 shows the flowchart of each thread in the prototype video player, and a skeleton of this structure is sketched below.
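
The two-thread structure of Fig. 4 can be summarized by the following C++ skeleton; decode_next_frame() and process_frame() are hypothetical stand-ins for the FFmpeg decoding step and the OpenCL/OpenCV convolution step, respectively.

    #include <condition_variable>
    #include <mutex>
    #include <queue>
    #include <thread>

    struct Frame { /* decoded YV12 planes, width, height, ... */ };

    static std::queue<Frame> frame_queue;   // global frame queue
    static std::mutex queue_mutex;
    static std::condition_variable queue_cv;
    static bool decoding_done = false;

    // Hypothetical stand-ins for the FFmpeg decoder and the processing step.
    bool decode_next_frame(Frame &f);       // returns false at EOF
    void process_frame(const Frame &f);     // OpenCL kernel or OpenCV call

    void decoder_thread() {
        Frame f;
        while (decode_next_frame(f)) {      // decode until EOF
            std::lock_guard<std::mutex> lock(queue_mutex);
            frame_queue.push(f);
            queue_cv.notify_one();
        }
        std::lock_guard<std::mutex> lock(queue_mutex);
        decoding_done = true;
        queue_cv.notify_one();
    }

    void processor_thread() {
        for (;;) {
            std::unique_lock<std::mutex> lock(queue_mutex);
            queue_cv.wait(lock, [] { return !frame_queue.empty() || decoding_done; });
            if (frame_queue.empty()) break; // EOF reached and queue drained
            Frame f = frame_queue.front();
            frame_queue.pop();
            lock.unlock();
            process_frame(f);               // convolution time measured here
        }
    }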

Using the prototype implementation, the frame-processing performance was measured and compared with that of the OpenCV library. The HEVC (High Efficiency Video Coding) and H.264 codecs were used in the experiment. Although both codecs require high computing resources, they are used widely to minimize file size. Only HEVC was used for the 8K video. Table 2 lists the resolution, codec, and bitrate of each video used in the experiment. The playback time of each video is the same.

Table 3 lists the performance for each resolution, codec, and method, measured on the Exynos 8890 processor. The table includes the decoding and image convolution times in each case. The results show that the proposed method outperforms OpenCV for all resolutions and codecs. The proposed scheme can handle a single frame of 8K video in approximately one second (834 ms for decoding plus 172 ms for convolution), even though the frame exceeds the maximum work-group size and therefore requires context switching. A higher frame rate can be expected if the video decoding time is reduced with a state-of-the-art chipset.

Another noticeable result is the frame decoding time of each method. Decoding was faster when OpenCV was used for convolution than when the proposed OpenCL scheme was used. This decoding performance gap occurs because the proposed scheme uses all the available physical threads of the GPU, leaving fewer GPU resources for decoding. Nevertheless, the overall time consumed per video frame was always smaller with the proposed scheme.

Table 2. Video information used for the experiment.

Resolution | Codec | Bitrate
8K (7680*4320) | HEVC | 88 Mbps
5K (5120*2880) | HEVC / H.264 | 28 Mbps
4K (3840*2160) | HEVC / H.264 | 22 Mbps
1080P (1920*1080) | HEVC / H.264 | 6 Mbps

Table 3. Video decoding and convolution processing times of each method (CV: OpenCV; CL: proposed OpenCL scheme) for different video resolutions.

Resolution | Codec | Method | Decoding | Convolution
8K (7680*4320) | HEVC | CV | 713 ms | 410 ms
8K (7680*4320) | HEVC | CL | 834 ms | 172 ms
5K (5120*2880) | HEVC | CV | 323 ms | 198 ms
5K (5120*2880) | HEVC | CL | 387 ms | 94 ms
5K (5120*2880) | H.264 | CV | 59 ms | 185 ms
5K (5120*2880) | H.264 | CL | 70 ms | 78 ms
4K (3840*2160) | HEVC | CV | 228 ms | 115 ms
4K (3840*2160) | HEVC | CL | 246 ms | 68 ms
4K (3840*2160) | H.264 | CV | 58 ms | 109 ms
4K (3840*2160) | H.264 | CL | 62 ms | 57 ms
1080P (1920*1080) | HEVC | CV | 54 ms | 30 ms
1080P (1920*1080) | HEVC | CL | 62 ms | 16 ms
1080P (1920*1080) | H.264 | CV | 30 ms | 27 ms
1080P (1920*1080) | H.264 | CL | 28 ms | 16 ms

5. Conclusion

An optimized OpenCL kernel scheme that can perform convolution on video frames, including 8K UHD video, was designed. The proposed design achieved better overall performance than the well-known conventional image-processing library, OpenCV. The performance gap arises because the OpenCV parallel algorithms focus on generality across image-processing algorithms, whereas the current design focuses on a kernel specialized for convolution. On the other hand, the video decoding time of the proposed scheme was longer than with OpenCV: because the scheme fully utilizes GPU resources, fewer GPU resources remain available for decoding.

In the future, the proposed scheme is expected to be adapted to processors that can decode 8K HEVC video in real time. In addition, the image-processing workload is expected to be distributed between the central processing unit and the GPU to achieve higher performance. Furthermore, a solution for video with resolutions beyond 8K should be considered because such frames exceed the maximum work-group size of the chipset.

ACKNOWLEDGMENTS

This research was supported by Kyungpook National University Research Fund, 2020.

REFERENCES

1. Qualcomm, 2020, Snapdragon 865 5G Mobile Platform, Qualcomm, CA, USA.
2. Vo N., Duong T. Q., Tuan H. D., Kortun A., 2017, Optimal Video Streaming in Dense 5G Networks With D2D Communications, IEEE Access, Vol. 6, pp. 209-223.
3. Argyriou A., Poularakis K., Iosifidis G., Tassiulas L., 2017, Video Delivery in Dense 5G Cellular Networks, IEEE Network, Vol. 31, No. 4, pp. 28-34.
4. Nightingale J., Salva-Garcia P., Calero J. M. A., Wang Q., 2016, 5G-QoE: QoE Modelling for Ultra-HD Video Streaming in 5G Networks, IEEE Transactions on Broadcasting, Vol. 64, No. 2, pp. 621-634.
5. Tan B., Lu J., Wu J., Zhang D., Zhang Z., 2018, Toward a Network Slice Design for Ultra High Definition Video Broadcasting in 5G, IEEE Wireless Communications, Vol. 25, No. 4, pp. 88-94.
6. Kim S. C., Oh H. J., Yim H. J., Hyun E. H., Choi D. J., 2019, Trends of Cloud and Virtualization in Broadcast Infra, Electronics and Telecommunications Trends, Vol. 34, No. 3, pp. 23-33.
7. Lee J. S., Yoon K. S., 2012, Technical and Industrial Trends of Ultra High Definition Contents of the Level of 8K, Electronics and Telecommunications Trends, Vol. 27, No. 3, pp. 101-109.
8. Baek N., Kim K. J., 2017, An Artifact Detection Scheme with CUDA-based Image Operations, Cluster Computing, Vol. 20, pp. 749-755.
9. Moon C. B., Kim B. M., Kim D.-S., 2019, Real-time Parallel Image-processing Scheme for a Fire-control System, IEIE Transactions on Smart Processing & Computing, Vol. 8, pp. 27-35.
10. Redmon J., Divvala S., Girshick R., Farhadi A., Jun. 2016, You Only Look Once: Unified, Real-Time Object Detection, in Proc. of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 779-788.
11. Dim J., Takamura T., 2013, Alternative Approach for Satellite Cloud Classification: Edge Gradient Application, Advances in Meteorology, Vol. 2013, No. 11, pp. 1-8.
12. Aggarwal C. C., 2018, Neural Networks and Deep Learning, Springer, Cham, Switzerland.
13. Bao L., Wu B., Liu W., Jun. 2018, CNN in MRF: Video Object Segmentation via Inference in a CNN-Based Higher-Order Spatio-Temporal MRF, in Proc. of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5977-5986.
14. Fan Y., Lu X., Li D., Liu Y., Oct. 2016, Video-based Emotion Recognition Using CNN-RNN and C3D Hybrid Networks, in Proc. of the 18th ACM International Conference on Multimodal Interaction (ICMI), pp. 445-450.
15. Lu N., Wu Y., Feng L., Song J., 2019, Deep Learning for Fall Detection: Three-Dimensional CNN Combined With LSTM on Video Kinematic Data, IEEE Journal of Biomedical and Health Informatics, Vol. 23, No. 1, pp. 314-323.
16. Amerini I., Galteri L., Caldelli R., Bimbo A. D., Oct. 2019, Deepfake Video Detection through Optical Flow Based CNN, in Proc. of the 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), pp. 1205-1207.
17. Jung S., Kim Y., Hwang E., 2018, Real-time Car Tracking System Based on Surveillance Videos, EURASIP Journal on Image and Video Processing.
18. Wang W., 2009, Reach on Sobel Operator for Vehicle Recognition, in Proc. of the 2009 International Joint Conference on Artificial Intelligence, pp. 448-451.
19. Kutty S. B., Saaidin S., Yunus P. N. A. M., Hassan S. A., 2014, Evaluation of Canny and Sobel Operator for Logo Edge Detection, in Proc. of the 2014 International Symposium on Technology Management and Emerging Technologies, pp. 153-156.
20. Wong K. Y. E., Chekima A., Dargham J. A., Sainarayana G., 2008, Palmprint Identification Using Sobel Operator, in Proc. of the 2008 10th International Conference on Control, Automation, Robotics and Vision, pp. 1338-1341.
21. Kaehler A., Bradski G., 2016, Learning OpenCV 3: Computer Vision in C++ with the OpenCV Library, O'Reilly Media, Inc., CA, USA.
22. Stone J. E., Gohara D., Shi G., 2010, OpenCL: A Parallel Programming Standard for Heterogeneous Computing Systems, Computing in Science & Engineering, Vol. 12, No. 3, pp. 66-73.
23. Sanida T., Sideris A., Dasygenis M., 2020, A Heterogeneous Implementation of the Sobel Edge Detection Filter Using OpenCL, in Proc. of the 2020 9th International Conference on Modern Circuits and Systems Technologies (MOCAST), pp. 1-4.
24. Gandhi B. R., 2018, OpenCL Optimization: Accelerating the Sobel Filter on Adreno GPU, Qualcomm Developer Network, Qualcomm, CA, USA.
25. Li D., Bui V., Chang L.-C., 2016, Performance Comparison of State-of-Art Lossless Video Compression Methods, in Proc. of the 2016 International Conference on Computational Science and Computational Intelligence (CSCI), pp. 386-391.

Author

Woosuk Shin

Woosuk Shin is currently a Ph.D. student in the School of Computer Science and Engineering at Kyungpook National University, Korea. He received his B.A. and M.S. degrees in Computer Science and Engineering from Kyungpook National University in 2016 and 2018, respectively. His main research topic is massively parallel processing on heterogeneous platforms using general-purpose graphics processing units (GPGPU). He is also interested in practical graphics applications.

Nakhoon Baek

Nakhoon Baek is currently a professor in the School of Computer Science and Engineering at Kyungpook National University, Korea. He received his B.A., M.S., and Ph.D. degrees in Computer Science from the Korea Advanced Institute of Science and Technology (KAIST) in 1990, 1992, and 1997, respectively. His research interests include graphics standards, graphics algorithms, real-time rendering, big data visualization, and massively parallel processing. He is also the Chief Engineer of Dassomey.com Inc., Korea, and is currently visiting the Division of Computer Science and Engineering at Louisiana State University.