1. (Inter-university Semiconductor Research Center (ISRC), Department of Electrical and Computer Engineering, Seoul National University / Seoul 08826, Korea)
2. (Research Center for Electrical and Information Technology, Department of Electrical and Information Engineering, Seoul National University of Science and Technology / Seoul 01811, Korea hyunkim@seoultech.ac.kr)

Object detection, Embedded system, Deep learning, Autonomous driving, NVIDIA Jetson AGX Xavier

## 1. Introduction

Recently, deep neural network (DNN)-based object detection [1] with camera sensors has shown better detection accuracy than humans, significantly increasing its importance in the object detection part of autonomous vehicles [2,3]. For autonomous vehicles, the real-time detection speed of object detectors is essential for reducing latency while maintaining a high detection accuracy so that the control system can respond quickly [4]. In addition, reducing power consumption is also essential for autonomous vehicles that operate on battery-generated power. Therefore, autonomous vehicles typically operate based on embedded systems, making it difficult to detect objects in real-time using limited hardware resources, even with a relatively fast and highly efficient DNN-based one-stage detector [5].

To overcome this limitation, lightweight DNN-based object detectors that support a real-time detection speed in embedded platforms and corresponding lightweight and low-power implementation techniques have been proposed in various previous studies [5-12]. These algorithms have focused on improving the detection speed significantly by reducing the computing cost, thereby allowing DNN algorithms to be used in embedded platforms. However, they suffer from accuracy loss compared to conventional object detection algorithms.

Given that improved accuracy is essential for the practical deployment of these lightweight algorithms in autonomous driving, various techniques have been actively studied to enhance the accuracy of lightweight networks [13-15]. Choi $\textit{et al.}$ [13] proposed a model for predicting the localization uncertainty in a lightweight network and used the predicted uncertainty in post-processing to improve the accuracy significantly. On the other hand, the increased computing cost for post-processing leads to a decrease in the overall detection speed of the model. Zhang $\textit{et al.}$ [14] and Xiao $\textit{et al.}$ [15] enhanced the accuracy by constructing an additional layer in lightweight networks. Unfortunately, these methods also increased the computing cost and decreased the detection speed.

To enhance the detection speed in embedded platforms, this study proposes a parallel processing scheme for CNN operations in a GPU and Non-maximum Suppression (NMS) operations in a CPU, thereby hiding the NMS-processing time in the GPU-processing time while maintaining accuracy. Generally, one-stage object detectors, used widely for autonomous driving, process the input images as square images in the training and inference steps [16]. Because the camera sensors employed in recent autonomous-driving applications are wide-angle [17], a preprocessing step is required to convert the inputs to square images. This conversion distorts the original input image. Although CNNs can be well trained to recognize objects in distorted ($\textit{i.e.}$, square) images [16], the accuracy is significantly degraded for inputs with the same ratio as the original image ($\textit{i.e.}$, wide-angle). To address these problems, this study proposes a new data augmentation technique that considers multiple images and various image ratios in the training step, thereby enabling the model to cope robustly with various input sizes and ratios without a detection-speed penalty during the inference phase. Furthermore, in the inference phase, the input image is resized to the ratio of the autonomous driving ($\textit{i.e.}$, wide-angle) camera, thus improving detection speed and further increasing the accuracy in autonomous-driving embedded systems.

By applying all these proposed methods, the mean average precision (mAP) is improved by 1.14 percent points (pp) in the Berkeley deep drive (BDD) dataset and 1.34 pp in the KITTI dataset. The detection speed is also improved by 22.54 % in the BDD and 24.67 % in the KITTI compared to the baseline algorithm, enabling faster and more accurate detection.

## 2. Proposed Acceleration Methods

### 2.1 Non-maximum Suppression Hiding

To enhance the detection speed of object detectors in embedded platforms, this study proposes a parallel processing technique for the convolution operations on the GPU and NMS operations on the CPU, thereby hiding the NMS-processing time within the convolution-processing time. Figs. 1(a) and (b) show the detection process of the conventional algorithm and the process after applying the proposed technique, respectively. As shown in Fig. 1(a), the baseline algorithm does not process the next input image until all operations on the current input image are complete. In autonomous-driving applications, where the images are input through streaming, this processing structure is inefficient in terms of hardware utilization. Accordingly, the proposed method employs a pipeline structure to address this issue. As shown in Fig. 1(b), using multi-thread processing, the NMS calculation of frame T is hidden by the GPU calculation of frame T+1. That is, the GPU result of frame T stored in the buffer is post-processed by the CPU while the GPU performs the convolution operations of frame T+1. The two tasks are synchronized so that the CPU processes each result at the right time relative to the GPU operation. This is possible because there is no data dependency between the GPU inference calculation of frame T+1 and the CPU post-processing task of frame T. Therefore, the proposed method is highly efficient in terms of hardware utilization and enables a continuous flow of input images to be processed efficiently.
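The pipeline described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: `gpu_inference` and `cpu_nms` are hypothetical stand-ins that only sleep to mimic the GPU and CPU work, and the one-slot queue plays the role of the buffer that holds the GPU result of frame T while frame T+1 is being inferred.

```python
import queue
import threading
import time


def gpu_inference(frame):
    """Hypothetical stand-in for the CNN forward pass (sleeps to mimic GPU time)."""
    time.sleep(0.02)
    return {"frame": frame, "raw": f"boxes-{frame}"}


def cpu_nms(raw):
    """Hypothetical stand-in for CPU post-processing (sleeps to mimic NMS time)."""
    time.sleep(0.01)
    return raw["frame"], f"nms({raw['raw']})"


def run_sequential(frames):
    """Baseline: finish every operation on a frame before starting the next one."""
    return [cpu_nms(gpu_inference(f)) for f in frames]


def run_pipelined(frames):
    """Post-process frame T on a worker thread while frame T+1 is inferred."""
    buf = queue.Queue(maxsize=1)  # buffer holding the GPU output of frame T
    results = []

    def post_worker():
        while True:
            raw = buf.get()
            if raw is None:  # sentinel: no more frames
                break
            results.append(cpu_nms(raw))

    worker = threading.Thread(target=post_worker)
    worker.start()
    for f in frames:
        # The worker handles the previous frame while this call is running.
        buf.put(gpu_inference(f))
    buf.put(None)
    worker.join()
    return results


if __name__ == "__main__":
    frames = list(range(10))
    for fn in (run_sequential, run_pipelined):
        start = time.perf_counter()
        fn(frames)
        print(fn.__name__, f"{time.perf_counter() - start:.3f} s")
```

Both functions return the same detections in the same order; only the wall-clock time differs, because the simulated NMS of one frame overlaps with the simulated inference of the next. The overlap is safe for the reason given in the text: the two tasks share no data.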

### 2.2 Proposed Data Augmentation in Training Step

Among previous augmentation studies, the representative Mixup [18] and RICAP [19] methods enable a one-stage detector to learn various features based on a square input image, but the detection accuracy decreases for non-square image ratios. In addition, in a fully convolutional network (FCN), which is a typical network structure employed in recent object detectors, the amount of computation of the entire network is determined by the input image size. Therefore, the total computational cost of the FCN is sensitive to the size of the input image; the computational cost of the network decreases with decreasing size of the input image, increasing the detection speed. On the other hand, this leads to a decrease in accuracy. To address this problem, this subsection proposes a new data augmentation technique for autonomous driving embedded systems that enhances detection accuracy without compromising the detection speed.

Fig. 2 shows the proposed data augmentation method using a single image. The example on the left side in Fig. 2 shows data augmentation maintaining the ratio of the original image. In the first step, the original image is cropped randomly. This cropped image is inserted into a square training plate while maintaining its ratio. This process prevents the shapes of the objects in the original image from becoming distorted. In the final step, conventional data augmentation schemes, such as flip, saturation, hue, and exposure changes, are applied [20,21]. The example on the right side in Fig. 2 shows data augmentation without maintaining the ratio of the original image. In the first step, the image is cropped randomly, resized into a square, and inserted into a square training plate. Because the original ratio is not maintained, this method produces an affine-transform effect on the objects. In the final step, the aforementioned conventional augmentation techniques are also applied to produce the final training image. These two augmentation techniques are used during the training phase to enable the CNN to learn various objects, ratios, and features.
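The geometry of the two single-image variants can be sketched as follows. This is only an illustration under stated assumptions: the lower bound on the crop size (`min_frac`), the centered placement on the plate, and the 512-pixel plate side are hypothetical choices that the text does not specify, and the function names are invented for this example. Pixel operations and the conventional augmentations (flip, saturation, hue, exposure) are omitted.

```python
import random


def random_crop_box(img_w, img_h, min_frac=0.5, rng=random):
    """Pick a random crop (x, y, w, h) inside an img_w x img_h image.

    min_frac is a hypothetical lower bound on each crop side.
    """
    w = rng.randint(int(img_w * min_frac), img_w)
    h = rng.randint(int(img_h * min_frac), img_h)
    x = rng.randint(0, img_w - w)
    y = rng.randint(0, img_h - h)
    return x, y, w, h


def place_on_plate(crop_w, crop_h, plate=512, keep_ratio=True):
    """Return (x, y, w, h) of the crop after insertion into a square plate.

    keep_ratio=True : scale the crop to fit the plate without distortion
                      (centered placement is an assumption).
    keep_ratio=False: stretch the crop to fill the whole plate, which
                      changes the aspect ratio of the objects.
    """
    if not keep_ratio:
        return 0, 0, plate, plate
    scale = plate / max(crop_w, crop_h)
    new_w = max(1, round(crop_w * scale))
    new_h = max(1, round(crop_h * scale))
    return (plate - new_w) // 2, (plate - new_h) // 2, new_w, new_h


if __name__ == "__main__":
    rng = random.Random(0)
    _, _, w, h = random_crop_box(1280, 720, rng=rng)
    print("ratio kept:     ", place_on_plate(w, h, keep_ratio=True))
    print("ratio not kept: ", place_on_plate(w, h, keep_ratio=False))
```

For an uncropped 1280×720 frame, the ratio-preserving variant occupies a 512×288 band of the plate, whereas the other variant fills all 512×512 pixels and stretches the objects vertically.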

##### Fig. 2. Example of the proposed data augmentation using a single image.

Moreover, the use of multiple images rather than a single image can greatly increase the diversity of the data and prevent the overfitting of the CNN with deep layers [19]. Fig. 3 shows the proposed data augmentation technique using two images. In Fig. 3, the augmentation processes that maintain and do not maintain the original ratio are the same as in Fig. 2. When two images are used, the square training plate area is divided into two. Each preprocessed input image ($\textit{i.e.}$, image with or without maintaining ratio) is inserted independently into the divided area. The area of the training plate is divided based on the width because the input ratio of the autonomous driving camera is wide-angle. Thus, four training images are generated using two input images, allowing the training data to be expanded. The left side in Fig. 4 shows an example of forming a training plate using two images. The boundary line on the left side in Fig. 4 is not fixed at the center; it can move in the movable direction ($\textit{i.e.}$, up and down), further increasing the number of cases of training data.

##### Fig. 4. Example of partitioning a training plate according to various images.

To expand the diversity of data beyond two images, a training image is also produced using four images. As shown in the right example in Fig. 4, images with or without the maintained ratio are inserted in each of the four areas partitioned according to the boundary position and line, thus producing the new training data. The boundary position can move within a specific range in the movable direction ($\textit{i.e.}$, left, right, up, and down), producing more diverse data. Using this proposed augmentation, new global features, which refer to the formation of a new image by combining multiple patches from multiple images [19], can be produced to prevent overfitting. Thus, the accuracy is improved significantly.
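The partitioning of the training plate for two and four images can be sketched as below. The `jitter` parameter, which bounds how far the boundary may move from the center, is a hypothetical stand-in for the "specific range" mentioned in the text, and `partition_plate` is an illustrative name rather than a function from the authors' code.

```python
import random


def partition_plate(plate=512, n_images=2, jitter=0.2, rng=random):
    """Split a square plate into regions (x, y, w, h), one per input image.

    n_images=2: one horizontal boundary line that moves up and down, so
                both regions keep the full plate width (wide regions).
    n_images=4: a boundary point that moves left/right and up/down.
    jitter is a hypothetical bound on the boundary offset, expressed as a
    fraction of the plate side around the center.
    """
    lo = int(plate * (0.5 - jitter))
    hi = int(plate * (0.5 + jitter))
    by = rng.randint(lo, hi)  # vertical position of the horizontal boundary
    if n_images == 2:
        return [(0, 0, plate, by), (0, by, plate, plate - by)]
    if n_images == 4:
        bx = rng.randint(lo, hi)  # horizontal position of the boundary point
        return [
            (0, 0, bx, by),
            (bx, 0, plate - bx, by),
            (0, by, bx, plate - by),
            (bx, by, plate - bx, plate - by),
        ]
    raise ValueError("n_images must be 2 or 4")


if __name__ == "__main__":
    rng = random.Random(1)
    print(partition_plate(512, 2, rng=rng))
    print(partition_plate(512, 4, rng=rng))
```

Each region would then receive one preprocessed image, with or without its original ratio, so the regions always tile the plate exactly.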

### 2.3 Proposed Image Resize in Inference Step

This subsection proposes a preprocessing scheme that enhances the detection speed by eliminating unnecessary operations in object detectors for an autonomous-driving embedded system. In actual autonomous driving, in general, Full HD ($\textit{i.e.}$, 1920${\times}$1080) is used as the input resolution [22]. However, because this Full HD size causes serious power consumption and speed degradation, it is common to automatically resize the high-resolution image and process it at a lower resolution. Fig. 5 shows the difference in image resizing with the conventional object detectors and after applying the proposed technique in the inference phase. While most previous methods [16] resize the image with a square ratio ($\textit{i.e.}$, Conventional in Fig. 5), YOLO-based object detectors [6,13] maintain the actual input ratio when resizing to enhance the accuracy. However, as shown in the Baseline image in Fig. 5, YOLO-based object detectors also execute operations on the square size input by filling in certain pixel values in the margin area in the image ($\textit{i.e.}$, letterbox). In other words, a square input that maintains the input image ratio is generated and processed in the network. Although this provides the advantage of maintaining the input ratio for high accuracy, there is no benefit of reducing the computational cost because the total input still has the same image size of the square ratio.

##### Fig. 5. Examples of the image resize method in the inference phase on the conventional object detector, baseline algorithm, and proposed scheme.

The proposed inference preprocessing technique compensates for this inefficient structure by removing the unnecessary upper and lower letterbox areas of the baseline method, thereby computing convolution operations only on meaningful pixels and greatly reducing the computational cost. Furthermore, because the algorithm trained with the proposed data augmentation technique robustly detects objects in diverse input ratios, resizing the image while maintaining the original ratio in the inference phase reduces the computational cost while significantly enhancing the detection accuracy. In other words, the computation saved by removing the letterbox can be spent on scaling the image up again, considering the trade-off between the accuracy and computing cost, thereby improving accuracy.
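To make the cost of the letterbox concrete, the following sketch computes how much of a square letterboxed input is padding. It assumes a 512×512 network input and uses the original image sizes of the two datasets; the function names are illustrative only.

```python
def letterbox_content(img_w, img_h, side=512):
    """Size of the image content after ratio-preserving resize into a square."""
    scale = side / max(img_w, img_h)
    return round(img_w * scale), round(img_h * scale)


def padding_fraction(img_w, img_h, side=512):
    """Fraction of the square input that is letterbox padding."""
    w, h = letterbox_content(img_w, img_h, side)
    return 1 - (w * h) / (side * side)


if __name__ == "__main__":
    for name, (w, h) in {"BDD": (1280, 720), "KITTI": (1242, 375)}.items():
        content = letterbox_content(w, h)
        print(name, content, f"padding = {padding_fraction(w, h):.1%}")
```

For a 1280×720 frame, the content occupies 512×288 pixels, so 43.75 % of the square input consists of padding rows on which the convolutions are still computed. The wider KITTI frames leave an even larger share of padding.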

## 3. Experimental Results

### 3.1 Experimental Environment

The proposed methods are evaluated by experiments using the BDD [23] and KITTI [24] datasets, which are widely used in autonomous-driving research. The same datasets, baseline algorithm ($\textit{i.e.}$, tiny Gaussian YOLOv3), open-source code, and experimental settings used in [13] are used for a fair comparison.

### 3.2 Accuracy Evaluation

Table 1 lists the mAP and floating-point operations (FLOPs) results of the baseline algorithm [13] and the algorithms applying the proposed methods for the BDD and KITTI datasets. The input resolution is set to 512${\times}$512 as in [13] to compare the accuracy fairly, except for the proposed image resize method. It is noteworthy that BDD has more classes than KITTI and higher data diversity. In addition, because BDD applies the strict IoU threshold ($\textit{i.e.}$, IoU>0.75) for all classes in the evaluation, it shows a lower mAP than KITTI [24]. When applying the proposed data augmentation process (+Proposed augmentation in Table 1), the mAP is improved by 0.98 pp and 0.96 pp compared to the baseline algorithm for the BDD and KITTI datasets, respectively. It should be noted that as the proposed data augmentation scheme is applied only to the training phase, the computing cost during the inference phase is not increased. Finally, by applying the proposed resize technique maintaining the original image ratio rather than resizing to a square (+Proposed resize in Table 1), the mAP is improved by 1.14 pp for the BDD and 1.34 pp for the KITTI dataset with respect to the baseline. It is noteworthy that applying the proposed resize technique, which removes unnecessary letterbox operations, causes the input image to become smaller than the square size ($\textit{i.e.}$, 512${\times}$512). Accordingly, considering the trade-off between the accuracy and computing cost, the input can be scaled up again while maintaining the original image ratio ($\textit{i.e.}$, 672${\times}$384 in BDD and 768${\times}$256 in KITTI; the reason for these values is described in Section 3.3), thereby simultaneously improving the accuracy and reducing the computing cost ($\textit{i.e.}$, FLOPs).

##### Table 1. Accuracy and FLOPs comparison.
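| Dataset | Method | mAP (%) | Diff. | FLOPs (${\times}$10$^{9}$) | Input size |
|---|---|---|---|---|---|
| BDD test set | Baseline [13] | 8.56 | | 8.27 | 512×512 |
| BDD test set | + Proposed augmentation | 9.54 | +0.98 | 8.27 | 512×512 |
| BDD test set | + Proposed resize | 9.70 | +1.14 | 8.14 | 672×384 |
| KITTI validation set | Baseline [13] | 68.69 | | 8.26 | 512×512 |
| KITTI validation set | + Proposed augmentation | 69.65 | +0.96 | 8.26 | 512×512 |
| KITTI validation set | + Proposed resize | 70.03 | +1.34 | 6.19 | 768×256 |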

Table 2 lists the accuracy of the existing data augmentation studies [18,19] and the proposed techniques. For the square image size, Mixup [18], RICAP [19], and the proposed method resize the image into a square using the conventional method shown in Fig. 5 in the inference phase. In contrast, when maintaining the original ratio, they resize the image using the proposed scheme shown in Fig. 5. For square images, both previous techniques, Mixup [18] and RICAP [19], improve the accuracy compared to the baseline for the BDD and KITTI datasets; however, for input images that maintain the original ratio, their accuracy is reduced greatly. For KITTI, which comprises wider images than Full HD, the previous techniques significantly degrade the accuracy compared to the baseline. Even in this case, the proposed method improves the mAP by 0.25 pp for the BDD and 1.85 pp for the KITTI dataset compared to the baseline. In other words, the proposed augmentation method improves the accuracy for both square and original-ratio input images. When comparing the accuracy of Mixup [18] and RICAP [19] for square images with that of the proposed method maintaining the original ratio, the proposed method also shows better results. Moreover, the FLOPs of the algorithm applying the proposed method ($\textit{i.e.}$, 8.14${\times}$10$^{9}$ on BDD and 6.19${\times}$10$^{9}$ on KITTI) are smaller than the FLOPs of Mixup [18] and RICAP [19] ($\textit{i.e.}$, 8.27${\times}$10$^{9}$ on BDD and 8.26${\times}$10$^{9}$ on KITTI), which use the square image size.

##### Table 2. Accuracy comparison with previous augmentation studies according to image resize.
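| Dataset | Method | mAP (%) | Input size (Square) | mAP (%) | Input size (Resize) |
|---|---|---|---|---|---|
| BDD test set | Baseline [13] | 8.56 | 512×512 | 9.45 | 672×384 |
| BDD test set | + Mixup [18] | 8.87 | 512×512 | 7.79 | 672×384 |
| BDD test set | + RICAP [19] | 9.59 | 512×512 | 8.78 | 672×384 |
| BDD test set | + Proposed Aug. | 9.54 | 512×512 | 9.70 | 672×384 |
| KITTI validation set | Baseline [13] | 68.69 | 512×512 | 68.18 | 768×256 |
| KITTI validation set | + Mixup [18] | 69.59 | 512×512 | 38.88 | 768×256 |
| KITTI validation set | + RICAP [19] | 69.66 | 512×512 | 52.44 | 768×256 |
| KITTI validation set | + Proposed Aug. | 69.65 | 512×512 | 70.03 | 768×256 |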

### 3.3 Detection Speed Evaluation

Table 3 shows the inference, post-processing, and total processing times of the baseline ($\textit{i.e.}$, tiny Gaussian YOLOv3 [13]) and the algorithm applying the proposed schemes on NVIDIA Jetson AGX Xavier [25] for the BDD and KITTI datasets. The image size for the proposed resize technique is set to match the FLOPs as closely as possible. Hence, it is set to 672${\times}$384 for the BDD and 768${\times}$256 for the KITTI dataset, while the remaining methods use 512${\times}$512. Specifically, the size of the original BDD image is 1280${\times}$720, and that of the original KITTI image is 1242${\times}$375. Therefore, the horizontal:vertical ratio is 1.78:1 and 3.3:1, respectively. To approximate the number of pixels ($\textit{i.e.}$, 262,144 pixels) of the baseline square image ($\textit{i.e.}$, 512${\times}$512), the image size for the proposed resize technique is set to 672${\times}$384 in BDD and 768${\times}$256 in KITTI, while approximately matching the ratio of the original image.
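The chosen input sizes can be checked with a few lines of arithmetic. The sketch below only restates numbers given in the text (image sizes and the 512×512 baseline); the helper name `describe` is illustrative.

```python
BASELINE_PIXELS = 512 * 512  # 262,144 pixels


def describe(w, h):
    """Aspect ratio, pixel count, and pixel count relative to the baseline."""
    return {
        "ratio": round(w / h, 2),
        "pixels": w * h,
        "vs_baseline": round(w * h / BASELINE_PIXELS, 3),
    }


if __name__ == "__main__":
    sizes = {
        "BDD original": (1280, 720),
        "BDD resized": (672, 384),
        "KITTI original": (1242, 375),
        "KITTI resized": (768, 256),
    }
    for name, (w, h) in sizes.items():
        print(f"{name:15s}", describe(w, h))
```

The resized BDD input has 258,048 pixels (about 98.4 % of the baseline) at a ratio of 1.75:1, and the resized KITTI input has 196,608 pixels (75 %) at 3:1. These shares are in line with the FLOPs ratios in Table 1 (8.14/8.27 and 6.19/8.26), as expected for a fully convolutional network whose cost scales with the input size.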

The proposed technique shows greater improvements in performance in default mode, which is typically used in NVIDIA Jetson AGX Xavier. The proposed NMS hiding (+Prop. NMS hiding in Table 3) reduces the total processing time by 7.64 ms for the BDD dataset compared to the baseline. Adding the proposed preprocessing technique (+Prop. Resize in Table 3) brings the total reduction to 7.9 ms (22.54 %) with respect to the baseline. For the KITTI dataset, applying all proposed techniques reduces the total processing time by 6.57 ms (24.67 %) with respect to the baseline. Finally, applying the proposed schemes to the baseline algorithm improves the accuracy while achieving a detection speed of 36.83 fps (=1000 ms/27.15 ms) for the BDD dataset and 49.85 fps (=1000 ms/20.06 ms) for the KITTI dataset, thereby enabling real-time detection to support faster autonomous driving than the baseline algorithm.
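The reported savings and frame rates follow directly from the total processing times in Table 3, as the short check below shows. The helper name `speed_gain` is illustrative.

```python
def speed_gain(baseline_ms, proposed_ms):
    """Return (time saved in ms, relative reduction in %, resulting fps)."""
    saved = baseline_ms - proposed_ms
    return (
        round(saved, 2),
        round(100 * saved / baseline_ms, 2),
        round(1000 / proposed_ms, 2),
    )


if __name__ == "__main__":
    # Total times (ms) from Table 3: baseline vs. all proposed schemes.
    print("BDD  ", speed_gain(35.05, 27.15))
    print("KITTI", speed_gain(26.63, 20.06))
```

With NMS hidden behind the GPU work of the next frame, the total time per frame reduces to the inference time plus the confidence calculation, for example 26.81 ms + 0.34 ms = 27.15 ms for BDD.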

##### Table 3. Results of processing time required in default mode of NVIDIA Jetson AGX Xavier.
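| Dataset | Method | Inference (GPU) (ms) | Post processing (CPU): Confidence calculation (ms) | Post processing (CPU): NMS (ms) | Total (ms) |
|---|---|---|---|---|---|
| BDD test set | Baseline [13] | 27.05 | 0.36 | 7.64 | 35.05 |
| BDD test set | + Prop. NMS hiding | 27.05 | 0.36 | - | 27.41 |
| BDD test set | + Prop. Resize | 26.81 | 0.34 | - | 27.15 |
| KITTI validation set | Baseline [13] | 26.31 | 0.16 | 0.16 | 26.63 |
| KITTI validation set | + Prop. NMS hiding | 26.31 | 0.16 | - | 26.47 |
| KITTI validation set | + Prop. Resize | 19.91 | 0.15 | - | 20.06 |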

## 4. Conclusion

This study has proposed a method that enhances the detection speed of an object detector while significantly improving the accuracy in NVIDIA Jetson AGX Xavier, an embedded platform for autonomous driving. To improve the detection speed, this study has proposed a parallel processing scheme for the convolution and post-processing operations. This study has also proposed new data augmentation and image resize techniques. Applying the proposed methods to the baseline achieves outstanding performance gains in accuracy and detection speed, enabling accurate and real-time detection for autonomous driving embedded platforms.

### ACKNOWLEDGMENTS

This work was supported in part by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No.2020-0-01304, Development of Self-learnable Mobile Recursive Neural Network Processor Technology) and in part by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education under Grant NRF-2019R1A6A1A03032119.

### REFERENCES

[1] Choi J., Elezi I., Lee H.-J., Farabet C., Alvarez J. M., Oct. 2021, Active Learning for Deep Object Detection via Probabilistic Modeling, in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), pp. 10264-10273

[2] Ravindran R., et al., Mar. 2021, Multi-Object Detection and Tracking, Based on DNN, for Autonomous Vehicles: A Review, IEEE Sens. J., Vol. 21, No. 5, pp. 5668-5677

[3] Zhao X., et al., May 2020, Fusion of 3D LIDAR and Camera Data for Object Detection in Autonomous Vehicle Applications, IEEE Sens. J., Vol. 20, No. 9, pp. 4901-4913

[4] Choi J., Chun D., Kim H., Lee H.-J., Oct. 2019, Gaussian YOLOv3: An accurate and fast object detector using localization uncertainty for autonomous driving, in Proc. IEEE Int. Conf. Comput. Vis. (ICCV)

[5] Wong A., Shafiee M. J., Li F., Chwyl B., May 2018, Tiny SSD: A Tiny Single-Shot Detection Deep Convolutional Neural Network for Real-Time Embedded Object Detection, in Proc. 15th Conf. on Comput. Robot Vision (CRV), pp. 95-101

[6] Redmon J., Farhadi A., 2018, YOLOv3: An incremental improvement, arXiv preprint, arXiv:1804.02767

[7] Nguyen D. T., Nguyen T. N., Kim H., Lee H.-J., 2019, A High-Throughput and Power-Efficient FPGA Implementation of YOLO CNN for Object Detection, IEEE Trans. Very Large Scale Integr. (VLSI) Syst., Vol. 27, No. 8, pp. 1861-1873

[8] Nguyen D. T., Hung N. H., Kim H., Lee H.-J., 2020, An Approximate Memory Architecture for Energy Saving in Deep Learning Applications, IEEE Trans. Circuits Syst. I, Reg. Papers, Vol. 67, No. 5, pp. 1588-1601

[9] Nguyen D. T., Kim H., Lee H.-J., Chang I. J., May 2018, An approximate memory architecture for a reduction of refresh power consumption in deep learning applications, in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), pp. 1-5

[10] Kang D., Kang D., Kang J., Yoo S., Ha S., Mar. 2018, Joint optimization of speed, accuracy, and energy for embedded image recognition systems, in Proc. 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 715-720

[11] Sandler M., Howard A., Zhu M., Zhmoginov A., Chen L., Jun. 2018, MobileNetV2: Inverted Residuals and Linear Bottlenecks, in Proc. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4510-4520

[12] Nguyen X. T., Nguyen T. N., Lee H.-J., Kim H., Dec. 2020, An Accurate Weight Binarization Scheme for CNN Object Detectors with Two Scaling Factors, IEIE Transactions on Smart Processing & Computing, Vol. 9, No. 6, pp. 497-503

[13] Choi J., Chun D., Lee H.-J., Kim H., Aug. 2020, Uncertainty-based Object Detector for Autonomous Driving Embedded Platforms, in Proc. IEEE Int. Conf. Artif. Intell. Circuits Syst. (AICAS), pp. 16-20

[14] Zhang Y., Shen Y., Zhang J., Apr. 2019, An improved tiny-yolov3 pedestrian detection algorithm, Int. J. Light Electron Opt., Vol. 183, pp. 17-23

[15] Xiao D., et al., Jul. 2019, A target detection model based on improved tiny-yolov3 under the environment of mining truck, IEEE Access, Vol. 7

[16] Zhao Q., et al., Jan. 2019, M2Det: A single-shot object detector based on multi-level feature pyramid network, in Proc. AAAI Conf. Artif. Intell. (AAAI), pp. 9259-9266

[17] SEKONIX Corp., Feb. 2020, SF332X-10X Family Preliminary Datasheet, [Online]. Available: http://sekolab.com/products/camera/

[18] Zhang H., et al., Apr. 2018, mixup: Beyond empirical risk minimization, in Proc. Int. Conf. Learn. Represent. (ICLR), pp. 1-13

[19] Takahashi R., Matsubara T., Uehara K., 2020, Data augmentation using random image cropping and patching for deep CNNs, IEEE Trans. Circuits Syst. Video Technol., Vol. 30, No. 9, pp. 2917-2931

[20] Krizhevsky A., Sutskever I., Hinton G. E., Dec. 2012, ImageNet classification with deep convolutional neural networks, in Proc. Adv. Neural Inf. Process. Syst., pp. 1097-1105

[21] He K., Zhang X., Ren S., Sun J., Jun. 2016, Deep residual learning for image recognition, in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 770-778

[22] Hemmati M., B-Abhari M., Niar S., 2019, Adaptive Vehicle Detection for Real-time Autonomous Driving System, in Proc. Des. Autom. and Test in Eur. Conf. & Exhib.

[23] Yu F., et al., Jun. 2020, BDD100K: A Diverse Driving Dataset for Heterogeneous Multitask Learning, in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR)

[24] Geiger A., Lenz P., Urtasun R., Jun. 2012, Are we ready for autonomous driving? The KITTI vision benchmark suite, in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 3354-3361

[25] NVIDIA Corp., Dec. 17, 2018, NVIDIA Xavier Documentation

## Author

##### Jiwoong Choi

Jiwoong Choi received his B.S. degree in electrical and electronics engineering from Chung-ang University, Seoul, South Korea, in 2015, and M.S. and Ph.D. degrees in electrical and computer engineering from Seoul National University, Seoul, South Korea, in 2017 and 2021, respectively. He is currently a Deep Learning Research Engineer at NVIDIA, Santa Clara, CA, USA.

##### Dayoung Chun

Dayoung Chun received her B.S. degree in Electronics Engineering from Sogang University, Seoul, Korea, in 2018. She is working toward an integrated M.S. and Ph.D. degree in Electrical and Computer Engineering at Seoul National University, Seoul. Her research interests include the algorithms and architectures of deep learning, and GPU architecture for computer vision.

##### Hyuk-Jae Lee

Hyuk-Jae Lee received his B.S. and M.S. degrees in electronics engineering from Seoul National University, South Korea, in 1987 and 1989, respectively, and his Ph.D. degree in Electrical and Computer Engineering from Purdue University, West Lafayette, IN, in 1996. From 1998 to 2001, he was with the Server and Workstation Chipset Division, Intel Corporation, Hillsboro, OR, as a Senior Component Design Engineer. From 1996 to 1998, he was with the Faculty of the Department of Computer Science, Louisiana Tech University, Ruston, LA. In 2001, he joined the School of Electrical Engineering and Computer Science, Seoul National University, South Korea, where he is currently a Professor. He is a Founder of Mamurian Design, Inc., a fabless SoC design house for multimedia applications. His research interests are in the areas of computer architecture and SoC design for multimedia applications.

##### Hyun Kim

Hyun Kim received his B.S., M.S. and Ph.D. degrees in Electrical Engineering and Computer Science from Seoul National University, Seoul, Korea, in 2009, 2011 and 2015, respectively. From 2015 to 2018, he was with the BK21 Creative Research Engineer Development for IT, Seoul National University, Seoul, Korea, as a BK Assistant Professor. In 2018, he joined the Department of Electrical and Information Engineering, Seoul National University of Science and Technology, Seoul, Korea, where he is currently working as an Assistant Professor. His research interests are in the areas of algorithms, computer architecture, memory, and SoC design for low-complexity multimedia applications and deep neural networks.