Woosuk Shin
Nakhoon Baek*
(School of Computer Science and Engineering, Kyungpook National University, Daegu 41566, Korea
{w.shin, nbaek}@knu.ac.kr)
Copyright © The Institute of Electronics and Information Engineers (IEIE)
Keywords
Video processing, Massively parallel processing, Mobile architecture, Mobile computing, GPGPU, OpenCL
1. Introduction
Recently, the mobile processor manufacturer Qualcomm introduced 8K UHD (Ultra High Definition) video recording capability in its chipset [1]. The chipset can process 24 frames of 8K video per second, thereby enabling the creation and consumption of UHD video by the general public. In addition, captured UHD video can be broadcast from a mobile phone to the internet in real time using 5G (fifth-generation) communication services [2-5]. A study by ETRI (Electronics and Telecommunications Research Institute, Korea) forecasted that the demand for 8K UHD content would increase as more 8K UHD display panels are supplied in 2025 [6,7].
The frames from a live UHD video stream can be utilized in various applications using image-processing techniques. One major area that uses real-time video to extract information is lane or object detection in vehicle-mounted video recorders, which alerts drivers to hazardous situations. Video frames can also be used to detect artifacts on a PCB (Printed Circuit Board) during the soldering process [8], and to detect fire from a surveillance camera [9]. Moreover, many conventional AR (Augmented Reality) applications analyze video frames to calibrate a camera or detect simple object information [10,11].
Captured video is also utilized in many deep learning applications combined with DNNs (Deep Neural Networks), RNNs (Recurrent Neural Networks), or CNNs (Convolutional Neural Networks) [12]. The applications include sophisticated object segmentation, human emotion recognition, situation awareness, and video analysis [13-17]. Moreover, several efforts have been made to process video with neural networks on mobile devices because recent mobile processors have dedicated NPUs (Neural Processing Units) or AI (Artificial Intelligence) accelerators.
One major time-consuming part of these deep learning-based methods is the pre-processing of the video frame. Pre-processing includes conventional image processing, e.g., color-space conversion, thresholding (binarization), and convolutional filtering. Fig. 1 gives an example of image pre-processing widely used for neural networks.
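As a concrete illustration, the following OpenCL C sketch shows a minimal per-pixel pre-processing kernel that performs grayscale conversion followed by thresholding; the kernel and argument names are illustrative, not taken from the implementation in this paper.

```c
// Minimal per-pixel pre-processing sketch (illustrative names):
// grayscale conversion followed by thresholding (binarization).
__kernel void preprocess(__global const uchar4 *src,   // RGBA input pixels
                         __global uchar *dst,          // binarized output
                         const int width,
                         const int height,
                         const uchar threshold)
{
    const int x = get_global_id(0);
    const int y = get_global_id(1);
    if (x >= width || y >= height) return;   // mask padded threads

    const uchar4 p = src[y * width + x];
    // ITU-R BT.601 luma approximation
    const float gray = 0.299f * p.x + 0.587f * p.y + 0.114f * p.z;
    dst[y * width + x] = (gray > threshold) ? 255 : 0;
}
```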
To gain as much information as possible, a wide-angle high-resolution camera is used to capture video, which leads to longer pre-processing times. Therefore, minimizing the pre-processing time for a high-resolution video frame is an important factor in increasing the efficiency of a deep learning process.
The Sobel operator is one of the most widely used convolutional image pre-processing methods. Sobel operators detect edge information and are used widely in classification tasks [11,18-20]. The de facto standard computer vision library OpenCV (Open Source Computer Vision) also provides image pre-processing algorithms, including the Sobel operator [21]. OpenCV offers a good general solution for image processing algorithms, but it lacks optimization for specific functions owing to its generality.
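For reference, a typical use of OpenCV's general-purpose Sobel routine looks like the following C++ sketch; file names are placeholders.

```cpp
// Baseline: edge detection with OpenCV's general-purpose Sobel operator.
#include <opencv2/imgproc.hpp>
#include <opencv2/imgcodecs.hpp>

int main() {
    cv::Mat src = cv::imread("frame.png", cv::IMREAD_GRAYSCALE);
    cv::Mat gx, gy, ax, ay, mag;
    cv::Sobel(src, gx, CV_16S, 1, 0, 3);        // horizontal gradient, 3x3 kernel
    cv::Sobel(src, gy, CV_16S, 0, 1, 3);        // vertical gradient
    cv::convertScaleAbs(gx, ax);                // back to 8-bit
    cv::convertScaleAbs(gy, ay);
    cv::addWeighted(ax, 0.5, ay, 0.5, 0, mag);  // approximate gradient magnitude
    cv::imwrite("edges.png", mag);
    return 0;
}
```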
This paper introduces an optimized image pre-processing method, specialized in image convolution for very high-resolution videos, using OpenCL (Open Computing Language) [22]. The proposed scheme can fully utilize the resources of embedded systems with low-end mobile processors.
The remainder of this paper is organized as follows. The related works section introduces previous research on accelerating image convolution. The section on optimizing the OpenCL kernel architecture scheme introduces an optimization technique for processing a convolution kernel. The performance evaluation section includes a prototype video player implementation; in addition, the performance is evaluated and compared with a well-known image processing library. Finally, the last section concludes the paper with a discussion of the effects of the proposed scheme.
Fig. 1. Examples of image pre-processing methods for neural networks.
2. Related Works
Several efforts have been made to accelerate image processing using OpenCL. A recent study [23] showed that a Sobel operation on a 2K image can be processed in 0.926 milliseconds. That work, however, was based on a desktop platform rather than on mobile platforms. Table 1 lists the performance specifications of the Nvidia GTX1060, which was used in the previous study [23], and the ARM Mali-T880, which was used in this paper.
Although both GPUs (Graphics Processing Units) were launched in the same year, the performance gap between the two chipsets is apparent because of the limited power consumption of mobile architectures. Therefore, another approach to accelerating image processing on mobile architectures is required.
Another study [24] optimized image processing by manipulating the data layout to minimize load and store times. That work achieved up to 3.3 times faster processing than a conventional Sobel operation using OpenCL on a mobile GPU. Such data-level optimization is a good solution for accelerating parallel tasks.
Because the goal of this study is to process video frames while playing real-time video, it is important to consider that GPU resources are already being used to decode the video. Performance will deteriorate if additional resources are spent manipulating data, owing to frequent context switching between different tasks.
The parallel library provided by OpenCV also offers a good general solution for convolution, including the Sobel operator. The library is likewise implemented with OpenCL to exploit parallel processing, but it does not take its running platform into account. Additionally, a conversion from the ffmpeg-decoded YV12 image format to the OpenCV RGB image format must be conducted before processing an image, which requires multiple sequential reads of memory, leading to reduced performance [25].
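For illustration, this extra conversion pass can be sketched as follows, assuming a contiguous YV12 buffer from the decoder; the buffer and size names are placeholders.

```cpp
// Sketch of the YV12-to-RGB conversion OpenCV needs before processing;
// `yv12Data`, `width`, and `height` are placeholders for the ffmpeg output.
#include <opencv2/imgproc.hpp>

cv::Mat toBGR(unsigned char *yv12Data, int width, int height)
{
    // YV12 stores a full-resolution Y plane followed by quarter-resolution
    // V and U planes, i.e., height * 3/2 rows of single-channel data.
    cv::Mat yv12(height * 3 / 2, width, CV_8UC1, yv12Data);
    cv::Mat bgr;
    cv::cvtColor(yv12, bgr, cv::COLOR_YUV2BGR_YV12);  // extra full-frame pass
    return bgr;
}
```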
Table 1. Specifications of GTX1060 and Mali-T880.

Specification | GTX1060 | Mali-T880
First Launch | Q3, 2016 | Q1, 2016
Maximum Clock | 1709 MHz | 650 MHz
Dedicated Memory | 6 GB | None
Maximum Power Consumption | 120 Watt | 7 Watt
32-bit Floating-Point Operations per Second | 4375 GFLOPS | 265 GFLOPS
3. Optimizing OpenCL Kernel Architecture Scheme
OpenCL was used as the parallel processing API (Application Programming Interface) because this study aimed to perform real-time image processing while playing real-time video on mobile architectures. The Exynos 8890 processor was targeted because of its wide deployment, being embedded in the Samsung Galaxy S7. The chipset has an integrated GPU (Graphics Processing Unit), the ARM Mali-T880, which supports a maximum work-group size of $\textit{(256, 256, 256)}$.
In the proposed scheme, each thread processes a single pixel. A thread is the minimum unit of a program running kernel source code on a single GPU core. Multiple threads are launched physically and simultaneously within the size of a work-group, and multiple work-groups can be logically constructed and launched simultaneously on the GPU. They are typically arranged into the ND-Range (n-dimensional range) data types of OpenCL. Therefore, although the total number of pixels of an 8K $\textit{(7680 * 4320)}$ video frame is larger than the maximum work-group size of the chipset, the threads can be divided into multiple work-groups within the size of the ND-Range.
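A minimal host-side sketch of such a launch is shown below; the 16 * 16 work-group shape is an illustrative example, and `queue` and `kernel` are assumed to be created elsewhere.

```cpp
// Enqueue a per-pixel 2-D ND-Range, padding the global size up to a
// multiple of the work-group size (required by OpenCL 1.x); the padded
// threads must be masked out inside the kernel.
#include <CL/cl.h>
#include <cstddef>

cl_int launchPerPixel(cl_command_queue queue, cl_kernel kernel,
                      size_t width, size_t height)
{
    const size_t local[2] = { 16, 16 };   // example work-group shape
    auto roundUp = [](size_t v, size_t m) { return ((v + m - 1) / m) * m; };
    const size_t global[2] = { roundUp(width,  local[0]),
                               roundUp(height, local[1]) };
    return clEnqueueNDRangeKernel(queue, kernel, 2, nullptr,
                                  global, local, 0, nullptr, nullptr);
}
```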
The work-group size should be adjusted adaptively, so that it approaches but does not exceed the maximum work-group size. Because launching OpenCL kernels with a power-of-two number of threads is most efficient, it is important to calculate the total number of threads $\textit{N}_{threads}$ using the equation

$$N_{threads} = 2^{\left\lceil \log_{2}\left(Width_{f} \times Height_{f}\right) \right\rceil},$$

where $\textit{Width}_{f}$ and $\textit{Height}_{f}$ represent the width and height of the video frame. With $\textit{N}_{threads}$, the optimal adaptive work-group size $\textit{(X, Y, Z)}$ can then be calculated with the procedure in Fig. 2.
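Since the exact steps are given in Fig. 2, the following C++ sketch is only one plausible reading of such a procedure, under the assumption that the power-of-two thread budget is capped by the device work-group limit and its exponent is split as evenly as possible across the three dimensions.

```cpp
// One plausible reading of the Fig. 2 procedure (an assumption, not the
// paper's exact code): cap the power-of-two thread budget by the device
// work-group limit, then split the exponent across (X, Y, Z).
#include <algorithm>
#include <cmath>
#include <cstddef>

void adaptiveWorkGroupSize(std::size_t nThreads, std::size_t maxGroupSize,
                           std::size_t &x, std::size_t &y, std::size_t &z)
{
    const std::size_t budget = std::min(nThreads, maxGroupSize);
    const unsigned e =
        static_cast<unsigned>(std::floor(std::log2(double(budget))));
    // Distribute the exponent as evenly as possible: ex + ey + ez == e.
    const unsigned ex = (e + 2) / 3, ey = (e + 1) / 3, ez = e / 3;
    x = std::size_t(1) << ex;
    y = std::size_t(1) << ey;
    z = std::size_t(1) << ez;
}
```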
The convolution kernel used was the Sobel kernel, as shown in Fig. 3. A general solution could be derived for all 3*3 convolution kernels by passing the kernel coefficients as a parameter. In the proposed prototype, however, a fixed kernel was used for the best performance. Although fixed kernels lack generality from a programming standpoint, the kernel itself can be adjusted if other kernels are needed.
OpenCL programs can access the texture memory area directly, whereas CPU-based programs must copy the image data to main memory. Thus, in the case of OpenCL, there is no need to convert the image color space for use in the kernels, and the image type conversion overhead can be minimized.
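An OpenCL C sketch combining these two points, the fixed Sobel coefficients of Fig. 3 and direct access to the decoded frame as an image object, is given below; the identifier names and the normalized RGBA image format are assumptions.

```c
// Sketch: fixed 3x3 Sobel kernel reading the frame directly as an image
// object (no host-side color conversion); assumes a normalized RGBA format.
__constant sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |
                           CLK_ADDRESS_CLAMP_TO_EDGE |
                           CLK_FILTER_NEAREST;

__kernel void sobel(__read_only image2d_t src, __write_only image2d_t dst)
{
    const int2 p = (int2)(get_global_id(0), get_global_id(1));
    const int2 dim = get_image_dim(src);
    if (p.x >= dim.x || p.y >= dim.y) return;   // mask padded threads

    // Luma of the 3x3 neighborhood (BT.601 approximation).
    float s[3][3];
    for (int j = -1; j <= 1; ++j)
        for (int i = -1; i <= 1; ++i) {
            float4 c = read_imagef(src, smp, p + (int2)(i, j));
            s[j + 1][i + 1] = 0.299f * c.x + 0.587f * c.y + 0.114f * c.z;
        }

    // Fixed Sobel coefficients, unrolled instead of passed as a parameter.
    float gx = -s[0][0] - 2.0f * s[1][0] - s[2][0]
             +  s[0][2] + 2.0f * s[1][2] + s[2][2];
    float gy = -s[0][0] - 2.0f * s[0][1] - s[0][2]
             +  s[2][0] + 2.0f * s[2][1] + s[2][2];

    float g = clamp(native_sqrt(gx * gx + gy * gy), 0.0f, 1.0f);
    write_imagef(dst, p, (float4)(g, g, g, 1.0f));
}
```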
Fig. 2. Adaptive work-group size calculating procedure.
Fig. 3. Sobel convolution kernels.
Fig. 4. Flow charts of each thread for prototype video player implementation.
4. Performance Evaluation
If a commercial player is used to process frames, multiple unnecessary image conversions
from YV12 color space to RGB color space should be conducted. Therefore, a simple
video player was implemented to evaluate the performance of the proposed OpenCL kernel.
The video player consisted of two main threads, a decoder thread, and an image processor
thread.
The decoder thread decodes the video file until it reaches the EOF (end of file), queuing each decoded frame to a global queue. The decoder is implemented using the ffmpeg library and fully supports GPU hardware acceleration for video decoding. The frame-processor thread obtains a decoded frame from the queue and launches the OpenCL kernel or OpenCV function to process the frame. Fig. 4 shows flowcharts of each thread of the prototype video player implementation.
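The following C++ sketch outlines this producer-consumer structure; `decodeNextFrame` and `runConvolution` are hypothetical stand-ins for the ffmpeg-based decoding and OpenCL kernel launch steps, not the paper's actual API.

```cpp
// Two-thread player structure (Fig. 4): a decoder thread queues frames,
// a processor thread consumes them.
#include <condition_variable>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>

struct Frame { /* decoded YV12 planes, width, height, ... */ };

std::optional<Frame> decodeNextFrame();   // assumed ffmpeg-based decoder
void runConvolution(const Frame &frame);  // assumed OpenCL kernel launcher

std::queue<Frame> frames;
std::mutex m;
std::condition_variable cv;
bool eof = false;

void decoderThread() {
    while (auto f = decodeNextFrame()) {        // decode until EOF
        std::lock_guard<std::mutex> lk(m);
        frames.push(*f);
        cv.notify_one();
    }
    { std::lock_guard<std::mutex> lk(m); eof = true; }
    cv.notify_one();
}

void processorThread() {
    for (;;) {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [] { return !frames.empty() || eof; });
        if (frames.empty()) break;              // EOF reached and drained
        Frame f = frames.front(); frames.pop();
        lk.unlock();
        runConvolution(f);                      // launch the OpenCL kernel
    }
}
```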
Using the prototype implementation, the frame-processing performance was measured and compared with that of the OpenCV library. The HEVC (High Efficiency Video Coding) and H264 codecs were used in the experiment. Although both codecs require considerable computing resources, they are used widely to minimize file size. Only HEVC was used for the 8K video. Table 2 lists the video resolution, codec, and bitrate information used for the experiment. The playback time of each video is the same.
Table 3 lists the performance for each resolution, codec, and method, measured on the Exynos 8890 processor. The table includes the decoding and image convolution times for each resolution, codec, and method. The results show that the proposed method performs better than OpenCV for all resolutions and codecs. The proposed scheme can process a single frame of 8K video in approximately one second, even though it requires context switching because the total thread count exceeds the maximum work-group size. A higher frame rate can be expected if the video decoding time can be reduced with a state-of-the-art chipset.
Another noticeable result is the frame decoding time for each method. Decoding was faster when OpenCV was used for convolution than when OpenCL was used. This decoding performance gap occurred because the proposed scheme makes heavier use of GPU resources, as it uses all the available physical threads of the GPU. Nevertheless, the overall time consumed for each video frame was always smaller with the proposed scheme.
Table 2. Video information used for the experiment.

Resolution | Codec | Bitrate
8K (7680*4320) | HEVC | 88 Mbps
5K (5120*2880) | HEVC / H264 | 28 Mbps
4K (3840*2160) | HEVC / H264 | 22 Mbps
1080P (1920*1080) | HEVC / H264 | 6 Mbps
Table 3. Video decoding and convolution processing time of each method for videos of different resolutions.

Resolution | Codec | Method | Decoding | Convolution
8K (7680*4320) | HEVC | CV | 713 ms | 410 ms
8K (7680*4320) | HEVC | CL | 834 ms | 172 ms
5K (5120*2880) | HEVC | CV | 323 ms | 198 ms
5K (5120*2880) | HEVC | CL | 387 ms | 94 ms
5K (5120*2880) | H264 | CV | 59 ms | 185 ms
5K (5120*2880) | H264 | CL | 70 ms | 78 ms
4K (3840*2160) | HEVC | CV | 228 ms | 115 ms
4K (3840*2160) | HEVC | CL | 246 ms | 68 ms
4K (3840*2160) | H264 | CV | 58 ms | 109 ms
4K (3840*2160) | H264 | CL | 62 ms | 57 ms
1080P (1920*1080) | HEVC | CV | 54 ms | 30 ms
1080P (1920*1080) | HEVC | CL | 62 ms | 16 ms
1080P (1920*1080) | H264 | CV | 30 ms | 27 ms
1080P (1920*1080) | H264 | CL | 28 ms | 16 ms
5. Conclusion
An optimized OpenCL kernel scheme that can perform convolution on video frames, including 8K UHD video, was designed. The proposed design achieved better overall performance than the well-known conventional image-processing library OpenCV. The performance gap occurred because the OpenCV parallel algorithms focus on generalizing image processing algorithms overall, whereas the current design focuses on a specialized kernel for convolution. On the other hand, the video decoding time of the proposed scheme was longer than when OpenCV was used, because the scheme fully utilizes the GPU resources, leaving fewer GPU resources available for decoding.
In the future, the proposed scheme is expected to be adopted on a processor that can decode 8K HEVC video in real time. In addition, the image processing workload is expected to be distributed between the central processing unit and the GPU to achieve higher performance. Furthermore, a solution for video with more than 8K resolution should be considered because it exceeds the maximum work-group size of the chipset.
ACKNOWLEDGMENTS
This research was supported by Kyungpook National University Research Fund, 2020.
REFERENCES
Qualcomm, 2020, Snapdragon 865 5G Mobile Platform, Qualcomm, CA, USA
Vo N., Duong T. Q., Tuan H. D., Kortun A., 2017, Optimal Video Streaming in Dense
5G Networks With D2D Communications, IEEE Access, Vol. 6, pp. 209-223
Argyriou A., Poularakis K., Iosifidis G., Tassiulas L., 2017, Video Delivery in Dense
5G Cellular Networks, IEEE Network, Vol. 31, No. 4, pp. 28-34
Nightingale J., Salva-Garcia P., Calero J. M. A., Wang Q., 2016, 5G-QoE: QoE Modelling for Ultra-HD Video Streaming in 5G Networks, IEEE Transactions on Broadcasting, Vol. 64, No. 2, pp. 621-634
Tan B., Lu J., Wu J., Zhang D., Zhang Z., 2018, Toward a Network Slice Design for
Ultra High Definition Video Broadcasting in 5G, IEEE Wireless Communications, Vol.
25, No. 4, pp. 88-94
Kim S. C., Oh H. J., Yim H. J., Hyun E. H., Choi D. J., 2019, Trends of Cloud and Virtualization in Broadcast Infra, Electronics and Telecommunications Trends, Vol. 34, No. 3, pp. 23-33
Lee J. S., Yoon K. S., 2012, Technical and Industrial Trends of Ultra High Definition
Contents of the level of 8K, Electronics and Telecommunications Trends, Vol. 27, No.
3, pp. 101-109
Baek N., Kim K.J., 2017, An artifact detection scheme with CUDA-based image operations,
Cluster Comput, Vol. 20, pp. 749-755
Moon C. B., Kim B. M., Kim D-S., 2019, Real-time Parallel Image-processing Scheme
for a Fire-control System, IEIE Transactions on Smart Processing & Computing, Vol.
8, pp. 27-35
Redmon J., Divvala S., Girshick R., Farhadi A., Jun 2016, You Only Look Once: Unified,
Real-Time Object Detection, in Proc. of 2016 IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), pp. 779-788
Dim J., Takamura T., 2013, Alternative Approach for Satellite Cloud Classification:
Edge Gradient Application, Advances in Meteorology, Vol. 2013, No. 11, pp. 1-8
Aggarwal C. C., 2018, Neural Networks and Deep Learning, Springer, Cham, Germany
Bao L., Wu B., Liu W., Jun. 2018, CNN in MRF: Video Object Segmentation via Inference
in A CNN-Based Higher-Order Spatio-Temporal MRF, in Proc. of 2018 IEEE/CVF Conference
on Computer Vision and Pattern Recognition (CVPR), pp. 5977-5986
Fan Y., Lu X., Li D., Liu Y., Oct. 2016, Video-based emotion recognition using CNN-RNN
and C3D hybrid networks, in Proc. of 18th ACM International Conference on Multimodal
Interaction (ICMI), pp. 445-450
Lu N., Wu Y., Feng L., Song J., 2019, Deep Learning for Fall Detection: Three-Dimensional
CNN Combined With LSTM on Video Kinematic Data, IEEE Journal of Biomedical and Health
Informatics, Vol. 23, No. 1, pp. 314-323
Amerini I., Galteri L., Caldelli R., Bimbo A. D., Oct. 2019, Deepfake Video Detection
through Optical Flow Based CNN, in Proc. of 2019 IEEE/CVF International Conference
on Computer Vision Workshop (ICCVW), pp. 1205-1207
Jung S., Kim Y., Hwang E., 2018, Real-time car tracking system based on surveillance
videos, EURASIP Journal on Image and Video Processing
Wang W., 2009, Reach on Sobel Operator for Vehicle Recognition, in Proc. of 2009 International
Joint Conference on Artificial Intelligence, pp. 448-451
Kutty S. B., Saaidin S., Yunus P. N. A. M., Hassan S. A., 2014, Evaluation of canny
and sobel operator for logo edge detection, in Proc. of 2014 International Symposium
on Technology Management and Emerging Technologies, pp. 153-156
Wong K. Y. E., Chekima A., Dargham J. A., Sainarayana G., 2008, Palmprint identification
using Sobel operator, in Proc. of 2008 10th International Conference on Control, Automation,
Robotics and Vision, pp. 1338-1341
Kaehler A., Bradski G., 2016, Learning OpenCV 3: computer vision in C++ with the OpenCV library, O'Reilly Media, Inc., CA, USA
Stone J. E., Gohara D., Shi G., 2010, OpenCL: A parallel programming standard for
heterogeneous computing systems, Computing in science & engineering, Vol. 12, No.
3, pp. 66-73
Sanida T., Sideris A., Dasygenis M., 2020, A Heterogeneous Implementation of the Sobel
Edge Detection Filter Using OpenCL, in Proc. of 2020 9th International Conference
on Modern Circuits and Systems Technologies (MOCAST), pp. 1-4
Gandhi B. R., 2018, OpenCL Optimization: Accelerating the Sobel Filter on Adreno GPU, Qualcomm Developer Network, Qualcomm, CA, USA
Li D., Bui V., Chang L.-C., 2016, Performance Comparison of State-of-Art Lossless
Video Compression Methods, in Proc. of 2016 International Conference on Computational
Science and Computational Intelligence (CSCI), pp. 386-391
Author
Woosuk Shin is now a Ph.D. student in the School of Computer Science and Engineering
at Kyungpook National University, Korea. He received his B.A. and M.S. degrees in
Computer Science and Engineering from Kyungpook National University in 2016 and 2018,
respectively. His main research topic is massively parallel processing on heterogeneous
platforms using general-purpose graphics processing units (GPGPU). He also has an
interest in practical graphics applications.
Nakhoon Baek is currently a professor in the School of Computer Science and Engineering at Kyungpook National University, Korea. He received his B.A., M.S., and Ph.D. degrees
in Computer Science from Korea Advanced Institute of Science and Technology (KAIST)
in 1990, 1992, and 1997, respectively. His research interests include graphics standards,
graphics algorithms, real-time rendering, big data visualization, and massively parallel
processing. He is also the Chief Engineer of Dassomey.com Inc., Korea. He is now also
visiting the Division of Computer Science and Engineering at Louisiana State University.