Woosuk Shin
Nakhoon Baek*
(School of Computer Science and Engineering, Kyungpook National University, Daegu 41566, Korea
{w.shin, nbaek}@knu.ac.kr)
Copyright © The Institute of Electronics and Information Engineers (IEIE)
Keywords
Video processing, Massively parallel processing, Mobile architecture, Mobile computing, GPGPU, OpenCL
1. Introduction
Recently, the mobile processor manufacturer Qualcomm introduced 8K UHD (Ultra High Definition) video recording capability in its chipset [1]. The chipset can process 24 frames of 8K video per second, thereby enabling the creation and consumption of UHD video by the general public. In addition, captured UHD video can be broadcast from a mobile phone to the internet in real time using 5G (fifth-generation) communication services [2-5]. A study by ETRI (Electronics and Telecommunications Research Institute, Korea) forecasted that the demand for 8K UHD content would increase as more 8K UHD display panels are supplied in 2025 [6,7].
The frames from a live UHD video stream can be utilized in various applications using image-processing techniques. One major area that uses real-time video to extract information is lane or object detection in vehicle-mounted video recorders, which alerts drivers to hazardous situations. Video frames can also be used to detect artifacts on a PCB (Printed Circuit Board) during the soldering process [8], and to detect fire from a surveillance camera [9]. Moreover, many conventional AR (Augmented Reality) applications analyze video frames to calibrate a camera or detect simple object information [10,11].
Captured video is also utilized in many deep learning applications combined with DNNs (Deep Neural Networks), RNNs (Recurrent Neural Networks), or CNNs (Convolutional Neural Networks) [12]. The applications include sophisticated object segmentation, human emotion recognition, situation awareness, and video analysis [13-17]. Moreover, several efforts have been made to process video with neural networks on mobile devices because recent mobile processors have dedicated NPUs (Neural Processing Units) or AI (Artificial Intelligence) accelerators.
One major time-consuming part of these deep learning-based methods is the pre-processing of the video frame. Pre-processing includes conventional image processing, e.g., color-space conversion, thresholding (binarization), and convolutional filtering. Fig. 1 gives an example of image pre-processing widely used for neural networks.
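As a concrete illustration, the following OpenCL C sketch shows a minimal per-pixel pre-processing kernel that performs grayscale conversion followed by thresholding; the kernel and argument names are illustrative, not taken from the implementation in this paper.

```c
// Minimal per-pixel pre-processing sketch (illustrative names):
// grayscale conversion followed by thresholding (binarization).
__kernel void preprocess(__global const uchar4 *src,   // RGBA input pixels
                         __global uchar *dst,          // binarized output
                         const int width,
                         const int height,
                         const uchar threshold)
{
    const int x = get_global_id(0);
    const int y = get_global_id(1);
    if (x >= width || y >= height) return;   // mask padded threads

    const uchar4 p = src[y * width + x];
    // ITU-R BT.601 luma approximation
    const float gray = 0.299f * p.x + 0.587f * p.y + 0.114f * p.z;
    dst[y * width + x] = (gray > threshold) ? 255 : 0;
}
```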
To gain as much information as possible, a wide-angle high-resolution camera is used to capture video, which leads to longer pre-processing times. Therefore, minimizing the pre-processing time for a high-resolution video frame is an important factor in increasing the efficiency of a deep learning process.
The Sobel operator is one of the most widely used convolutional image pre-processing methods. Sobel operators detect edge information and are used widely in classification tasks [11,18-20]. The de facto standard computer vision library OpenCV (Open Source Computer Vision) also provides image pre-processing algorithms, including the Sobel operator [21]. OpenCV offers a good general solution for image processing algorithms, but it lacks optimization for specific functions owing to its generality.
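For reference, a typical use of OpenCV's general-purpose Sobel routine looks like the following C++ sketch; file names are placeholders.

```cpp
// Baseline: edge detection with OpenCV's general-purpose Sobel operator.
#include <opencv2/imgproc.hpp>
#include <opencv2/imgcodecs.hpp>

int main() {
    cv::Mat src = cv::imread("frame.png", cv::IMREAD_GRAYSCALE);
    cv::Mat gx, gy, ax, ay, mag;
    cv::Sobel(src, gx, CV_16S, 1, 0, 3);        // horizontal gradient, 3x3 kernel
    cv::Sobel(src, gy, CV_16S, 0, 1, 3);        // vertical gradient
    cv::convertScaleAbs(gx, ax);                // back to 8-bit
    cv::convertScaleAbs(gy, ay);
    cv::addWeighted(ax, 0.5, ay, 0.5, 0, mag);  // approximate gradient magnitude
    cv::imwrite("edges.png", mag);
    return 0;
}
```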
This paper introduces an optimized image pre-processing method, specialized in image convolution for very high-resolution videos, using OpenCL (Open Computing Language) [22]. The proposed scheme can fully utilize the resources of embedded systems with low-end mobile processors.
The remainder of this paper is organized as follows. The related works section introduces previous research on accelerating image convolution. The section on optimizing the OpenCL kernel architecture scheme introduces an optimization technique for processing a convolution kernel. The performance evaluation section includes a prototype video player implementation; in addition, the performance is evaluated and compared with a well-known image processing library. Finally, the last section concludes the paper with a discussion of the effects of the proposed scheme.
Fig. 1. Examples of image pre-processing methods for neural networks.
2. Related Works
Several efforts have been made to accelerate image processing using OpenCL. A recent study [23] showed that a Sobel operation on a 2K image can be processed in 0.926 milliseconds. That work, however, was based on a desktop platform rather than on mobile platforms. Table 1 lists the performance specifications of the Nvidia GTX1060, which was used in the previous study [23], and the ARM Mali-T880, which was used in this paper.
Although both GPUs (Graphics Processing Units) were launched in the same year, the performance gap between the two chipsets is apparent because of the limited power consumption of mobile architectures. Therefore, another approach to accelerating image processing on mobile architectures is required.
Another study [24] optimized image processing by manipulating the data layout to minimize load and store times. That work achieved up to 3.3 times faster processing than a conventional Sobel operation using OpenCL on a mobile GPU. Such data-level optimization is a good solution for accelerating parallel tasks.
Because the goal of this study is to process video frames while playing real-time video, it is important to consider that GPU resources are already being used to decode the video. Performance will deteriorate if additional resources are spent manipulating data, owing to frequent context switching between different tasks.
The parallel library provided by OpenCV also offers a good general solution for convolution, including the Sobel operator. The library is likewise implemented with OpenCL to exploit parallel processing, but it does not take its running platform into account. Additionally, a conversion from the ffmpeg-decoded YV12 image format to the OpenCV RGB image format must be conducted before processing an image, which requires multiple sequential reads of memory, leading to reduced performance [25].
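For illustration, this extra conversion pass can be sketched as follows, assuming a contiguous YV12 buffer from the decoder; the buffer and size names are placeholders.

```cpp
// Sketch of the YV12-to-RGB conversion OpenCV needs before processing;
// `yv12Data`, `width`, and `height` are placeholders for the ffmpeg output.
#include <opencv2/imgproc.hpp>

cv::Mat toBGR(unsigned char *yv12Data, int width, int height)
{
    // YV12 stores a full-resolution Y plane followed by quarter-resolution
    // V and U planes, i.e., height * 3/2 rows of single-channel data.
    cv::Mat yv12(height * 3 / 2, width, CV_8UC1, yv12Data);
    cv::Mat bgr;
    cv::cvtColor(yv12, bgr, cv::COLOR_YUV2BGR_YV12);  // extra full-frame pass
    return bgr;
}
```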
Table 1. Specifications of GTX1060 and Mali-T880.

Specification | GTX1060 | Mali-T880
First Launch | Q3, 2016 | Q1, 2016
Maximum Clock | 1709 MHz | 650 MHz
Dedicated Memory | 6 GB | None
Maximum Power Consumption | 120 Watt | 7 Watt
32-bit Floating-Point Operations per Second | 4375 GFLOPS | 265 GFLOPS
3. Optimizing OpenCL Kernel Architecture Scheme
OpenCL was used as the parallel processing API (Application Programming Interface) because this study aimed to perform real-time image processing while playing real-time video on mobile architectures. The Exynos 8890 processor was targeted because of its wide deployment, being embedded in the Samsung Galaxy S7. The chipset has an integrated GPU (Graphics Processing Unit), the ARM Mali-T880, which supports a maximum work-group size of $\textit{(256, 256, 256)}$.
In the proposed scheme, each thread processes a single pixel. A thread is the minimum unit of a program running kernel source code on a single GPU core. Multiple threads are launched physically and simultaneously within the size of a work-group, and multiple work-groups can be logically constructed and launched simultaneously on the GPU. They are typically arranged into the ND-Range (n-dimensional range) data types of OpenCL. Therefore, although the total number of pixels of an 8K $\textit{(7680 * 4320)}$ video frame is larger than the maximum work-group size of the chipset, the threads can be divided into multiple work-groups within the size of the ND-Range.
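A minimal host-side sketch of such a launch is shown below; the 16 * 16 work-group shape is an illustrative example, and `queue` and `kernel` are assumed to be created elsewhere.

```cpp
// Enqueue a per-pixel 2-D ND-Range, padding the global size up to a
// multiple of the work-group size (required by OpenCL 1.x); the padded
// threads must be masked out inside the kernel.
#include <CL/cl.h>
#include <cstddef>

cl_int launchPerPixel(cl_command_queue queue, cl_kernel kernel,
                      size_t width, size_t height)
{
    const size_t local[2] = { 16, 16 };   // example work-group shape
    auto roundUp = [](size_t v, size_t m) { return ((v + m - 1) / m) * m; };
    const size_t global[2] = { roundUp(width,  local[0]),
                               roundUp(height, local[1]) };
    return clEnqueueNDRangeKernel(queue, kernel, 2, nullptr,
                                  global, local, 0, nullptr, nullptr);
}
```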
The work-group size should be adjusted adaptively, so that it approaches but does not exceed the maximum work-group size. Because launching OpenCL kernels with a power-of-two number of threads is most efficient, it is important to calculate the total number of threads $\textit{N}_{threads}$ using the equation

$$N_{threads} = 2^{\left\lceil \log_{2}\left(Width_{f} \times Height_{f}\right) \right\rceil},$$

where $\textit{Width}_{f}$ and $\textit{Height}_{f}$ represent the width and height of the video frame. With $\textit{N}_{threads}$, the optimal adaptive work-group size $\textit{(X, Y, Z)}$ can then be calculated with the procedure in Fig. 2.
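Since the exact steps are given in Fig. 2, the following C++ sketch is only one plausible reading of such a procedure, under the assumption that the power-of-two thread budget is capped by the device work-group limit and its exponent is split as evenly as possible across the three dimensions.

```cpp
// One plausible reading of the Fig. 2 procedure (an assumption, not the
// paper's exact code): cap the power-of-two thread budget by the device
// work-group limit, then split the exponent across (X, Y, Z).
#include <algorithm>
#include <cmath>
#include <cstddef>

void adaptiveWorkGroupSize(std::size_t nThreads, std::size_t maxGroupSize,
                           std::size_t &x, std::size_t &y, std::size_t &z)
{
    const std::size_t budget = std::min(nThreads, maxGroupSize);
    const unsigned e =
        static_cast<unsigned>(std::floor(std::log2(double(budget))));
    // Distribute the exponent as evenly as possible: ex + ey + ez == e.
    const unsigned ex = (e + 2) / 3, ey = (e + 1) / 3, ez = e / 3;
    x = std::size_t(1) << ex;
    y = std::size_t(1) << ey;
    z = std::size_t(1) << ez;
}
```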
The convolution kernel used was the Sobel kernel, as shown in Fig. 3. A general solution could be derived for all 3*3 convolution kernels by passing the kernel coefficients as a parameter. In the proposed prototype, however, a fixed kernel was used for the best performance. Although fixed kernels lack generality from a programming standpoint, the kernel itself can be adjusted if other kernels are needed.
OpenCL programs can access the texture memory area directly, whereas CPU-based programs must copy the image data to main memory. Thus, in the case of OpenCL, there is no need to convert the image color space for use in the kernels, and the image type conversion overhead can be minimized.
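An OpenCL C sketch combining these two points, the fixed Sobel coefficients of Fig. 3 and direct access to the decoded frame as an image object, is given below; the identifier names and the normalized RGBA image format are assumptions.

```c
// Sketch: fixed 3x3 Sobel kernel reading the frame directly as an image
// object (no host-side color conversion); assumes a normalized RGBA format.
__constant sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |
                           CLK_ADDRESS_CLAMP_TO_EDGE |
                           CLK_FILTER_NEAREST;

__kernel void sobel(__read_only image2d_t src, __write_only image2d_t dst)
{
    const int2 p = (int2)(get_global_id(0), get_global_id(1));
    const int2 dim = get_image_dim(src);
    if (p.x >= dim.x || p.y >= dim.y) return;   // mask padded threads

    // Luma of the 3x3 neighborhood (BT.601 approximation).
    float s[3][3];
    for (int j = -1; j <= 1; ++j)
        for (int i = -1; i <= 1; ++i) {
            float4 c = read_imagef(src, smp, p + (int2)(i, j));
            s[j + 1][i + 1] = 0.299f * c.x + 0.587f * c.y + 0.114f * c.z;
        }

    // Fixed Sobel coefficients, unrolled instead of passed as a parameter.
    float gx = -s[0][0] - 2.0f * s[1][0] - s[2][0]
             +  s[0][2] + 2.0f * s[1][2] + s[2][2];
    float gy = -s[0][0] - 2.0f * s[0][1] - s[0][2]
             +  s[2][0] + 2.0f * s[2][1] + s[2][2];

    float g = clamp(native_sqrt(gx * gx + gy * gy), 0.0f, 1.0f);
    write_imagef(dst, p, (float4)(g, g, g, 1.0f));
}
```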
Fig. 2. Adaptive work-group size calculating procedure.
Fig. 3. Sobel convolution kernels.
Fig. 4. Flow charts of each thread for prototype video player implementation.
4. Performance Evaluation
If a commercial player is used to process frames, multiple unnecessary image conversions
from YV12 color space to RGB color space should be conducted. Therefore, a simple
video player was implemented to evaluate the performance of the proposed OpenCL kernel.
The video player consisted of two main threads, a decoder thread, and an image processor
thread.
The decoder thread decodes the video file until it reaches the EOF (end of file), queuing each decoded frame to a global queue. The decoder is implemented using the ffmpeg library and fully supports GPU hardware acceleration for video decoding. The frame-processor thread obtains a decoded frame from the queue and launches the OpenCL kernel or OpenCV function to process the frame. Fig. 4 shows flowcharts of each thread of the prototype video player implementation.
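The following C++ sketch outlines this producer-consumer structure; `decodeNextFrame` and `runConvolution` are hypothetical stand-ins for the ffmpeg-based decoding and OpenCL kernel launch steps, not the paper's actual API.

```cpp
// Two-thread player structure (Fig. 4): a decoder thread queues frames,
// a processor thread consumes them.
#include <condition_variable>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>

struct Frame { /* decoded YV12 planes, width, height, ... */ };

std::optional<Frame> decodeNextFrame();   // assumed ffmpeg-based decoder
void runConvolution(const Frame &frame);  // assumed OpenCL kernel launcher

std::queue<Frame> frames;
std::mutex m;
std::condition_variable cv;
bool eof = false;

void decoderThread() {
    while (auto f = decodeNextFrame()) {        // decode until EOF
        std::lock_guard<std::mutex> lk(m);
        frames.push(*f);
        cv.notify_one();
    }
    { std::lock_guard<std::mutex> lk(m); eof = true; }
    cv.notify_one();
}

void processorThread() {
    for (;;) {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [] { return !frames.empty() || eof; });
        if (frames.empty()) break;              // EOF reached and drained
        Frame f = frames.front(); frames.pop();
        lk.unlock();
        runConvolution(f);                      // launch the OpenCL kernel
    }
}
```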
Using the prototype implementation, the frame-processing performance was measured and compared with that of the OpenCV library. The HEVC (High Efficiency Video Coding) and H264 codecs were used in the experiment. Although both codecs require considerable computing resources, they are used widely to minimize file size. Only HEVC was used for the 8K video. Table 2 lists the video resolution, codec, and bitrate information used for the experiment. The playback time of each video is the same.
Table 3 lists the performance for each resolution, codec, and method, measured on the Exynos 8890 processor. The table includes the decoding and image convolution times for each resolution, codec, and method. The results show that the proposed method performs better than OpenCV for all resolutions and codecs. The proposed scheme can process a single frame of 8K video in approximately one second, even though it requires context switching because the total thread count exceeds the maximum work-group size. A higher frame rate can be expected if the video decoding time can be reduced with a state-of-the-art chipset.
Another noticeable result is the frame decoding time for each method. Decoding was faster when OpenCV was used for convolution than when OpenCL was used. This decoding performance gap occurred because the proposed scheme makes heavier use of GPU resources, as it uses all the available physical threads of the GPU. Nevertheless, the overall time consumed for each video frame was always smaller with the proposed scheme.
Table 2. Video information used for the experiment.

Resolution | Codec | Bitrate
8K (7680*4320) | HEVC | 88 Mbps
5K (5120*2880) | HEVC / H264 | 28 Mbps
4K (3840*2160) | HEVC / H264 | 22 Mbps
1080P (1920*1080) | HEVC / H264 | 6 Mbps
Table 3. Video decoding and convolution processing time of each method for videos of different resolutions.

Resolution | Codec | Method | Decoding | Convolution
8K (7680*4320) | HEVC | CV | 713 ms | 410 ms
8K (7680*4320) | HEVC | CL | 834 ms | 172 ms
5K (5120*2880) | HEVC | CV | 323 ms | 198 ms
5K (5120*2880) | HEVC | CL | 387 ms | 94 ms
5K (5120*2880) | H264 | CV | 59 ms | 185 ms
5K (5120*2880) | H264 | CL | 70 ms | 78 ms
4K (3840*2160) | HEVC | CV | 228 ms | 115 ms
4K (3840*2160) | HEVC | CL | 246 ms | 68 ms
4K (3840*2160) | H264 | CV | 58 ms | 109 ms
4K (3840*2160) | H264 | CL | 62 ms | 57 ms
1080P (1920*1080) | HEVC | CV | 54 ms | 30 ms
1080P (1920*1080) | HEVC | CL | 62 ms | 16 ms
1080P (1920*1080) | H264 | CV | 30 ms | 27 ms
1080P (1920*1080) | H264 | CL | 28 ms | 16 ms
5. Conclusion
An optimized OpenCL kernel scheme that can perform convolution on video frames, including 8K UHD video, was designed. The proposed design achieved better overall performance than the well-known conventional image-processing library OpenCV. The performance gap occurred because the OpenCV parallel algorithms focus on generalizing image processing algorithms overall, whereas the current design focuses on a specialized kernel for convolution. On the other hand, the video decoding time of the proposed scheme was longer than when OpenCV was used, because the scheme fully utilizes the GPU resources, leaving fewer GPU resources available for decoding.
In the future, the proposed scheme is expected to be adopted on a processor that can decode 8K HEVC video in real time. In addition, the image processing workload is expected to be distributed between the central processing unit and the GPU to achieve higher performance. Furthermore, a solution for video with more than 8K resolution should be considered because it exceeds the maximum work-group size of the chipset.
ACKNOWLEDGMENTS
This research was supported by Kyungpook National University Research Fund, 2020.
REFERENCES
Qualcomm, 2020, Snapdragon 865 5G Mobile Platform, Qualcomm, CA, USA
Vo N., Duong T. Q., Tuan H. D., Kortun A., 2017, Optimal Video Streaming in Dense
5G Networks With D2D Communications, IEEE Access, Vol. 6, pp. 209-223
Argyriou A., Poularakis K., Iosifidis G., Tassiulas L., 2017, Video Delivery in Dense
5G Cellular Networks, IEEE Network, Vol. 31, No. 4, pp. 28-34
Nightingale J., Salva-Garcia P., Calero J. M. A., Wang Q., 2016, 5G-QoE: QoE Modelling for Ultra-HD Video Streaming in 5G Networks, IEEE Transactions on Broadcasting, Vol. 64, No. 2, pp. 621-634
Tan B., Lu J., Wu J., Zhang D., Zhang Z., 2018, Toward a Network Slice Design for
Ultra High Definition Video Broadcasting in 5G, IEEE Wireless Communications, Vol.
25, No. 4, pp. 88-94
Kim S. C., Oh H. J., Yim H. J., Hyun E. H., Choi D. J., 2019, Trends of Cloud and Virtualization in Broadcast Infra, Electronics and Telecommunications Trends, Vol. 34, No. 3, pp. 23-33
Lee J. S., Yoon K. S., 2012, Technical and Industrial Trends of Ultra High Definition
Contents of the level of 8K, Electronics and Telecommunications Trends, Vol. 27, No.
3, pp. 101-109
Baek N., Kim K.J., 2017, An artifact detection scheme with CUDA-based image operations,
Cluster Comput, Vol. 20, pp. 749-755
Moon C. B., Kim B. M., Kim D-S., 2019, Real-time Parallel Image-processing Scheme
for a Fire-control System, IEIE Transactions on Smart Processing & Computing, Vol.
8, pp. 27-35
Redmon J., Divvala S., Girshick R., Farhadi A., Jun 2016, You Only Look Once: Unified,
Real-Time Object Detection, in Proc. of 2016 IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), pp. 779-788
Dim J., Takamura T., 2013, Alternative Approach for Satellite Cloud Classification:
Edge Gradient Application, Advances in Meteorology, Vol. 2013, No. 11, pp. 1-8
Aggarwal C. C., 2018, Neural Networks and Deep Learning, Springer, Cham, Germany
Bao L., Wu B., Liu W., Jun. 2018, CNN in MRF: Video Object Segmentation via Inference
in A CNN-Based Higher-Order Spatio-Temporal MRF, in Proc. of 2018 IEEE/CVF Conference
on Computer Vision and Pattern Recognition (CVPR), pp. 5977-5986
Fan Y., Lu X., Li D., Liu Y., Oct. 2016, Video-based emotion recognition using CNN-RNN
and C3D hybrid networks, in Proc. of 18th ACM International Conference on Multimodal
Interaction (ICMI), pp. 445-450
Lu N., Wu Y., Feng L., Song J., 2019, Deep Learning for Fall Detection: Three-Dimensional
CNN Combined With LSTM on Video Kinematic Data, IEEE Journal of Biomedical and Health
Informatics, Vol. 23, No. 1, pp. 314-323
Amerini I., Galteri L., Caldelli R., Bimbo A. D., Oct. 2019, Deepfake Video Detection
through Optical Flow Based CNN, in Proc. of 2019 IEEE/CVF International Conference
on Computer Vision Workshop (ICCVW), pp. 1205-1207
Jung S., Kim Y., Hwang E., 2018, Real-time car tracking system based on surveillance
videos, EURASIP Journal on Image and Video Processing
Wang W., 2009, Reach on Sobel Operator for Vehicle Recognition, in Proc. of 2009 International
Joint Conference on Artificial Intelligence, pp. 448-451
Kutty S. B., Saaidin S., Yunus P. N. A. M., Hassan S. A., 2014, Evaluation of canny
and sobel operator for logo edge detection, in Proc. of 2014 International Symposium
on Technology Management and Emerging Technologies, pp. 153-156
Wong K. Y. E., Chekima A., Dargham J. A., Sainarayana G., 2008, Palmprint identification
using Sobel operator, in Proc. of 2008 10th International Conference on Control, Automation,
Robotics and Vision, pp. 1338-1341
Kaehler A., Bradski G., 2016, Learning OpenCV 3: computer vision in C++ with the OpenCV library, O'Reilly Media, Inc., CA, USA
Stone J. E., Gohara D., Shi G., 2010, OpenCL: A parallel programming standard for
heterogeneous computing systems, Computing in science & engineering, Vol. 12, No.
3, pp. 66-73
Sanida T., Sideris A., Dasygenis M., 2020, A Heterogeneous Implementation of the Sobel
Edge Detection Filter Using OpenCL, in Proc. of 2020 9th International Conference
on Modern Circuits and Systems Technologies (MOCAST), pp. 1-4
Gandhi B. R., 2018, OpenCL Optimization: Accelerating the Sobel Filter on Adreno GPU, Qualcomm Developer Network, Qualcomm, CA, USA
Li D., Bui V., Chang L.-C., 2016, Performance Comparison of State-of-Art Lossless
Video Compression Methods, in Proc. of 2016 International Conference on Computational
Science and Computational Intelligence (CSCI), pp. 386-391
Author
Woosuk Shin is now a Ph.D. student in the School of Computer Science and Engineering
at Kyungpook National University, Korea. He received his B.A. and M.S. degrees in
Computer Science and Engineering from Kyungpook National University in 2016 and 2018,
respectively. His main research topic is massively parallel processing on heterogeneous
platforms using general-purpose graphics processing units (GPGPU). He also has an
interest in practical graphics applications.
Nakhoon Baek is currently a professor in the School of Computer Science and Engineering at Kyungpook National University, Korea. He received his B.A., M.S., and Ph.D. degrees
in Computer Science from Korea Advanced Institute of Science and Technology (KAIST)
in 1990, 1992, and 1997, respectively. His research interests include graphics standards,
graphics algorithms, real-time rendering, big data visualization, and massively parallel
processing. He is also the Chief Engineer of Dassomey.com Inc., Korea. He is now also
visiting the Division of Computer Science and Engineering at Louisiana State University.