
1. (Department of Image, Graduate School of Advanced Imaging Science, Multimedia and Film, Chung-Ang University / Seoul 06974, Korea {jhun, heegwang, chanyeong}@ipis.cau.ac.kr, paikj@cau.ac.kr)
2. (Graduate School of Artificial Intelligence, Chung-Ang University / Seoul 06974, Korea {mgkim, yhmok}@ipis.cau.ac.kr)

Data augmentation, Synthetic data generation, SOD, UAV

## 1. Introduction

Object detection is one of the most actively researched computer vision tasks and is used in many applications, such as disaster relief, intelligent monitoring systems, and the defense industry. In disaster relief, object detection in drone images is used to find an unconscious person in a hard-to-approach area. Achieving high accuracy in unconscious person detection requires a large amount of high-quality training data, but collecting it takes considerable time and human resources. Moreover, it is difficult to obtain unconscious person data in a variety of environments.

To solve this problem, we propose a novel data augmentation method based on synthetic data generation. To generate synthetic unconscious person data for drone images, we first extract the object from a reference image to obtain a foreground mask that carries the object information. In this process, the loss of edge information must be minimized so that only the object is extracted. For this reason, we use a salient object detection (SOD) method, $\boldsymbol{U}^{2}$Net [1], which is well suited to extracting a single object. We detect the most salient object in the input image and generate a foreground mask that contains only the object information without the background.

When synthesizing the foreground mask and the target background image, we have to consider an incompatibility between these two images. Therefore, it is necessary to generate shadows on the foreground mask to make more natural synthetic data. Finally, the refined foreground mask and the drone background image are synthesized.

## 2. Related Work

Data augmentation is widely used in object detection and image classification to generate additional training data. Taylor et al. proposed geometric augmentation methods such as flipping, rotating, and cropping [2]. These methods transform the training data while preserving the labels, which reduces the cost of collecting training data, and they are widely used to improve the performance of CNNs. Zhang et al. proposed MixUp, which alleviates memorization in deep networks and sensitivity to adversarial examples [3]. This method improves classification performance by training on weighted linear interpolations of two example images and their labels.
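
The weighted linear interpolation used by MixUp can be illustrated with a minimal sketch. It assumes flattened images and one-hot labels stored as plain Python lists; the function name `mixup` and the fixed weight `lam` are illustrative choices (in the MixUp paper the weight is drawn from a Beta distribution rather than fixed).

```python
def mixup(image_a, label_a, image_b, label_b, lam):
    """Blend two flattened images and their one-hot labels with weight lam in [0, 1]."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    # The same convex combination is applied to the pixels and to the labels.
    mixed_image = [lam * a + (1.0 - lam) * b for a, b in zip(image_a, image_b)]
    mixed_label = [lam * a + (1.0 - lam) * b for a, b in zip(label_a, label_b)]
    return mixed_image, mixed_label


# Two tiny "images" with two pixels each, belonging to class 0 and class 1.
image, label = mixup([0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0], lam=0.25)
```

With `lam=0.25` the mixed label assigns weight 0.25 to the first class and 0.75 to the second, mirroring the proportions used for the pixels.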

Takahashi et al. proposed a random image cropping and patching (RICAP) method that combines randomly cropped patches from four images into one image [4]. Although this method combines four images, it can use resizing instead of random cropping and leaves no wasted area in the combined image. Summers et al. proposed eight types of mixing methods, including CutMix and Mosaic, as an improvement over existing methods for mixing two images [5]. Zhong et al. proposed a data augmentation method that erases object information in an image to address object occlusion [6]. This method selects a random area of the input image and fills it with a specific value such as random noise, the ImageNet mean value, 0, or 255. Yun et al. proposed CutMix, a method combining MixUp and CutOut [7,8]. This method fills a randomly removed area of one image with a patch extracted from another image.
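
As a rough illustration of the erasing idea described above, the following sketch fills one randomly placed rectangle with a constant value. It is a simplification under stated assumptions: the image is a single-channel nested list, the block size is fixed by the caller (the original method samples the area and aspect ratio), and the name `random_erase` is hypothetical.

```python
import random


def random_erase(image, erase_h, erase_w, fill=0, rng=None):
    """Return a copy of a 2-D image with one random erase_h x erase_w block set to fill."""
    rng = rng or random.Random()
    height, width = len(image), len(image[0])
    if erase_h > height or erase_w > width:
        raise ValueError("erase block is larger than the image")
    # Choose the top-left corner so that the block stays inside the image.
    top = rng.randint(0, height - erase_h)
    left = rng.randint(0, width - erase_w)
    erased = [row[:] for row in image]  # copy so the input is left untouched
    for y in range(top, top + erase_h):
        for x in range(left, left + erase_w):
            erased[y][x] = fill
    return erased


original = [[1] * 4 for _ in range(4)]
erased = random_erase(original, erase_h=2, erase_w=2, fill=0, rng=random.Random(0))
```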

In addition to existing geometric augmentation methods, data augmentation using a Generative Adversarial Network (GAN) has been studied [9]. Tanaka et al. proposed a GAN-based data augmentation method to compensate for class imbalance in a dataset [10]. Using a decision tree classifier, they showed that synthetic data generated by a GAN can replace real data. Remez et al. proposed a GAN-based data augmentation method using copy and paste [11]. This method generates images for weakly supervised segmentation tasks.

Dwibedi et al. proposed a synthetic data generation method using cut and paste [12]. This method randomly places a foreground mask on a background image. Tripathi et al. proposed a method that uses a synthesizer network to place a foreground mask at a suitable location in a background image [13]. In this method, information is lost in the edge area of the foreground mask during synthesis, so the synthetic image can look unnatural.

Bang et al. proposed an object extracting method using cut and paste from an image obtained by an Unmanned Aerial Vehicle (UAV) and synthesizing it into a UAV reference image using illumination, blur, and scale [14]. This method has difficulties estimating the object to be extracted from the UAV image and is not suitable for natural synthetic data generation.

## 3. The Proposed Method

We propose a data augmentation method using synthetic data generation for unconscious person detection in drone images. First, the unconscious person is extracted from an input image by SOD, which separates the salient object from the background to give the foreground object. Next, shadow generation is applied to create a natural foreground mask. Then, background data are generated by data augmentation using geometric transformations. The final image is obtained by synthesizing the background image and the refined foreground mask. Fig. 1 shows a flowchart of the proposed method. The proposed method can be formulated as follows:

##### (1)
$$I_{h}=f_{h}\left(f_{s}\left(f_{u}\left(I_{x}\right)\right),f_{c}\left(I_{y}\right)\right)$$

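Here, $I_{x}$ and $I_{y}$ are the input image containing an unconscious person and the background image of the UAV environment, respectively. $f_{u}$ denotes SOD using $U^{2}$Net, and $M_{u}=f_{u}(I_{x})$ is its result. $f_{s}$ denotes shadow generation, and $M_{s}=f_{s}(M_{u})$ is the image with a shadow added to the estimated $M_{u}$. $f_{c}$ denotes background augmentation, and $I_{b}=f_{c}(I_{y})$ is the augmented background image. Finally, $f_{h}$ synthesizes $M_{s}$ and $I_{b}$ into the synthetic image $I_{h}$.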
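
The composition in Eq. (1) can be read as a four-stage pipeline. The sketch below only mirrors that structure: the stage functions are passed in as arguments, and the string-building stand-ins in the usage lines are placeholders for illustration, not the actual SOD, shadow, augmentation, or synthesis steps.

```python
def synthesize(i_x, i_y, f_u, f_s, f_c, f_h):
    """Apply the stages of Eq. (1) in order and return the synthetic image I_h."""
    m_u = f_u(i_x)        # salient object extracted from the person image
    m_s = f_s(m_u)        # foreground with a generated shadow
    i_b = f_c(i_y)        # augmented background image
    return f_h(m_s, i_b)  # final composite


# Placeholder stages that only record the order in which they were applied.
result = synthesize(
    "person.jpg",
    "field.jpg",
    f_u=lambda x: f"sod({x})",
    f_s=lambda m: f"shadow({m})",
    f_c=lambda y: f"augment({y})",
    f_h=lambda m, b: f"paste({m}, {b})",
)
```

Keeping the stages as separate callables makes it straightforward to swap one of them, for example a different background augmentation, without touching the others.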

### 3.1 Salient Object Detection

First, to generate synthetic data for detecting an unconscious person in drone images, we use $U^{2}$Net, a representative deep learning-based salient object detection method. $U^{2}$Net has a deeply nested $U$Net structure that maintains high resolution without significantly increasing the computational cost. It finds the most salient, attention-grabbing object in the image and segments it. Because receptive fields of multiple sizes are mixed in the residual U-block (RSU), $U^{2}$Net can capture more contextual information at various scales. In addition, the pooling operations used in the RSU block allow the depth of the overall architecture to increase without a significant increase in computational cost.

Additionally, unlike conventional SOD methods, $U^{2}$Net does not use a pre-trained backbone network. It is trained end to end on the target data, so it can adapt to that data and achieve better performance. In this study, our goal is to separate salient objects accurately using $U^{2}$Net. As shown in Fig. 2, we use $U^{2}$Net to separate the salient area in an image that contains an object, and we obtain a foreground object image without background by eliminating the remaining area.

To generate an object bounding box for the synthesis step, size and position information are required. Therefore, we calculate the range of the $x$ and $y$ coordinates of the object and crop the corresponding area. This also simplifies data labeling. Traditional data labeling requires a great deal of cost, effort, and time, whereas the proposed synthetic data generation method yields the label information of an object as a by-product of extraction and placement. We therefore expect that the labeling cost can be significantly reduced and that the accuracy of unconscious person detection can be further increased.
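
Computing the coordinate range of the object and cropping to it can be sketched as follows. This is a minimal illustration that assumes the saliency result has already been thresholded into a binary mask stored as a nested list; the helper names `mask_bounding_box` and `crop` are hypothetical.

```python
def mask_bounding_box(mask):
    """Return (x_min, y_min, x_max, y_max), inclusive, of the nonzero pixels, or None."""
    ys = [y for y, row in enumerate(mask) if any(row)]
    if not ys:
        return None  # the mask contains no object
    xs = [x for row in mask for x, value in enumerate(row) if value]
    return min(xs), ys[0], max(xs), ys[-1]


def crop(image, box):
    """Cut the inclusive bounding box out of a 2-D image."""
    x_min, y_min, x_max, y_max = box
    return [row[x_min:x_max + 1] for row in image[y_min:y_max + 1]]


mask = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0],
]
box = mask_bounding_box(mask)
patch = crop(mask, box)
```

For this mask the object spans columns 1 to 3 and rows 1 to 2, so the cropped patch is 3 pixels wide and 2 pixels high.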

##### Fig. 2. Object extraction using $\boldsymbol{U}^{2}$Net.

### 3.2 Shadow Generation

To generate more elaborate and natural unconscious person data, we propose a method that combines a foreground mask with a shadow of the object. As shown in Fig. 3, the foreground mask is used as the input. The foreground mask is first transformed into a binary mask, and we then adjust the degree of Gaussian blur [15] and the transparency to generate a refined shadow that resembles a real shadow.

Because a shadow region normally lies outside the object contour, we resize and shift the original foreground mask so that the refined shadow and the original foreground mask can be composited adaptively. Finally, we paste the two together to generate a refined foreground mask. This mask blends into the background more naturally in the synthesis step than a simple foreground mask, so it serves as essential high-quality synthetic data for detecting an unconscious person.
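
The shadow steps described above (binary mask, Gaussian blur, transparency, and an offset so the shadow falls outside the contour) can be sketched without any imaging library. Everything here is an assumption made for illustration: the offset `dx`, `dy`, the blur `sigma`, the `opacity`, and the function names are hypothetical, only shifting is shown (no resizing), and a practical implementation would use an image-processing library instead of nested lists.

```python
import math


def gaussian_kernel(sigma, radius):
    """Normalized 1-D Gaussian weights for offsets -radius..radius."""
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma)) for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_rows(img, kernel):
    """Convolve every row with the kernel, replicating edge pixels."""
    radius = len(kernel) // 2
    width = len(img[0])
    return [
        [
            sum(k * row[min(max(x + i - radius, 0), width - 1)] for i, k in enumerate(kernel))
            for x in range(width)
        ]
        for row in img
    ]


def transpose(img):
    return [list(col) for col in zip(*img)]


def gaussian_blur(img, sigma=1.0, radius=2):
    """Separable Gaussian blur: filter the rows, then the columns."""
    kernel = gaussian_kernel(sigma, radius)
    return transpose(blur_rows(transpose(blur_rows(img, kernel)), kernel))


def shift(img, dx, dy):
    """Move the image dx pixels right and dy pixels down, filling with zeros."""
    h, w = len(img), len(img[0])
    return [
        [img[y - dy][x - dx] if 0 <= y - dy < h and 0 <= x - dx < w else 0.0 for x in range(w)]
        for y in range(h)
    ]


def shadow_alpha(foreground_alpha, dx=1, dy=1, sigma=1.0, opacity=0.5):
    """Return a soft shadow opacity map for a foreground alpha channel."""
    binary = [[1.0 if a > 0 else 0.0 for a in row] for row in foreground_alpha]
    soft = gaussian_blur(shift(binary, dx, dy), sigma)
    # The shadow is hidden wherever the object itself covers the pixel.
    return [
        [opacity * s * (1.0 - b) for s, b in zip(soft_row, bin_row)]
        for soft_row, bin_row in zip(soft, binary)
    ]


alpha = [[0] * 5 for _ in range(5)]
alpha[2][2] = 255  # a one-pixel "object" in the centre
shadow = shadow_alpha(alpha)
```

The resulting map is zero under the object, strongest next to it in the offset direction, and fades with distance, which is the qualitative behaviour the text describes.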

### 3.3 Various UAV Background Generation

Before placing refined unconscious person images on the UAV images, various background images are required. However, it is challenging to obtain various background images because of legal restrictions and environmental factors that affect UAV operation. To solve this problem, an existing data augmentation method was used for UAV background image generation. First, the foreground object was cut out from the unconscious person dataset in the UAV environment provided by AIHub [18], and the surrounding background was used as a reference. As shown in Fig. 4, a background image was acquired through the UAV for each altitude. A data augmentation method that maintains the resolution and naturalness of the background image is expected to yield synthetic data similar to real-world data.

Because we aim to generate synthetic data similar to real-world data, we utilized existing data augmentation methods restrictively. Data augmentation methods such as mirroring [16] and natural color-shifting [17] are actively used, which are helpful for the construction of synthetic data and maintaining naturalness.
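
Mirroring and a mild color shift are simple enough to show directly. The sketch assumes an RGB image stored as rows of `(r, g, b)` tuples with 8-bit channels; the per-channel offsets are arbitrary example values, and the function names are hypothetical.

```python
def mirror(image):
    """Flip an image left to right."""
    return [row[::-1] for row in image]


def clamp(value):
    """Keep a channel value inside the 8-bit range."""
    return max(0, min(255, value))


def shift_color(image, delta):
    """Add a per-channel offset (dr, dg, db) to every pixel."""
    dr, dg, db = delta
    return [[(clamp(r + dr), clamp(g + dg), clamp(b + db)) for r, g, b in row] for row in image]


background = [[(10, 20, 30), (250, 0, 128)]]
mirrored = mirror(background)
shifted = shift_color(background, (10, -5, 0))
```

Both operations keep the image resolution unchanged, which matches the stated aim of restricting augmentation to transformations that preserve naturalness.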

### 3.4 Synthetic Image Generation

We scaled the refined unconscious person objects so that they match the background images taken at each altitude. Taking the object size at an altitude of 10 meters as the reference (weight 1.0), we applied weights of 0.8, 0.6, and 0.4 at 15, 20, and 25 meters, respectively. Then, for object diversity, each object was rotated and inverted using data augmentation methods and synthesized at a random location on the background image.
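
The altitude weights translate into a lookup table. The sketch below uses the four weights stated above; the function name `scaled_size`, the rounding to whole pixels, and the minimum size of one pixel are assumptions for illustration.

```python
# Scale factors relative to the object size at 10 m, as listed in the text.
ALTITUDE_SCALE = {10: 1.0, 15: 0.8, 20: 0.6, 25: 0.4}


def scaled_size(width, height, altitude_m):
    """Return the object size in pixels after applying the altitude weight."""
    scale = ALTITUDE_SCALE[altitude_m]
    return max(1, round(width * scale)), max(1, round(height * scale))


sizes = {altitude: scaled_size(100, 50, altitude) for altitude in ALTITUDE_SCALE}
```

A 100 x 50 pixel object at 10 m therefore becomes 80 x 40, 60 x 30, and 40 x 20 pixels at 15, 20, and 25 m.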

At this point, each synthetic image needs a corresponding annotation for training the object detection model, so we calculate the coordinates ($X_{\min }$, $Y_{\min }$), ($X_{\max }$, $Y_{\max }$) of the refined synthetic unconscious person objects. Fig. 5 shows the process of synthesizing an unconscious person object with a background image. Scale adjustment according to the altitude and rotation from the data augmentation step are applied to generate synthetic foreground masks for various environments, which are then synthesized with the background image. Fig. 6 shows examples of the generated synthetic images. The overall process produces natural synthetic data by placing the refined foreground mask of an unconscious person object on the background image.
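
Because the paste location is chosen by the generator, the bounding box follows directly from it. The sketch below picks a random location and derives ($X_{\min}$, $Y_{\min}$), ($X_{\max}$, $Y_{\max}$); the conversion to normalized center/size values is shown only as one common label layout used by YOLO-style trainers, and the text does not state which format was actually used. The function names and image sizes are hypothetical, and the object size is assumed to be the size after scaling and rotation.

```python
import random


def random_box(bg_w, bg_h, obj_w, obj_h, rng=None):
    """Pick a paste location and return (x_min, y_min, x_max, y_max) in pixels."""
    rng = rng or random.Random()
    if obj_w > bg_w or obj_h > bg_h:
        raise ValueError("object is larger than the background")
    x_min = rng.randint(0, bg_w - obj_w)
    y_min = rng.randint(0, bg_h - obj_h)
    return x_min, y_min, x_min + obj_w, y_min + obj_h


def to_normalized_center(box, bg_w, bg_h, class_id=0):
    """Convert corner coordinates to (class, x_center, y_center, width, height) in [0, 1]."""
    x_min, y_min, x_max, y_max = box
    return (
        class_id,
        (x_min + x_max) / 2 / bg_w,
        (y_min + y_max) / 2 / bg_h,
        (x_max - x_min) / bg_w,
        (y_max - y_min) / bg_h,
    )


box = random_box(1920, 1080, 120, 60, rng=random.Random(0))
label = to_normalized_center((100, 50, 300, 150), 1000, 500)
```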

## 4. Experimental Result

The AIHub real-world dataset [18] and the synthetic dataset were used in experiments to verify the proposed method. The real-world dataset consists of 1,032 images for training, 200 images for validation, and 200 images for testing. It contains unconscious person objects with class information and the bounding box coordinates of each object. The synthetic dataset was generated from 1,032 UAV background images and 100 unconscious person images, giving 12,018 synthetic images for object detection.

For training, three settings were compared: synthetic images only, real-world images only, and synthetic plus real-world images, with 13,050 training images in total (12,018 synthetic and 1,032 real-world). Only the real-world dataset was used for validation and testing. The experiment was conducted on an RTX 3090 (24 GB), and the object detection models YOLOv4 [19], YOLOv5 [20], and EfficientDet [21] were trained with a batch size of 16 and a learning rate of 0.00001.

Table 1 compares the mAP obtained with each training dataset. The three models trained on the synthetic dataset alone achieved performance on the real-world test set similar to that of the models trained on the real-world dataset. Using the synthetic and real-world datasets together gave the best result for every model. This suggests that the synthetic images are sufficiently similar to real-world images to complement them. In other words, if the real-world dataset is small or difficult to build, training on both the real-world and synthetic datasets is likely to give the best performance. The proposed method is therefore effective as a data augmentation method when data acquisition is difficult. Furthermore, because factors such as the altitude-dependent object scale and natural shadow generation are handled automatically, the method requires less time and fewer human resources than conventional data acquisition.

##### Table 1. Comparison of object detection models.
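| Training dataset | Model | mAP@.5 | mAP@.5:.95 |
|---|---|---|---|
| Real-world Dataset | EfficientDet-D1 [21] | 0.581 | 0.158 |
| Real-world Dataset | YOLOv4-s [19] | 0.838 | 0.340 |
| Real-world Dataset | YOLOv5-s [20] | 0.806 | 0.302 |
| Synthetic Dataset | EfficientDet-D1 | 0.590 | 0.167 |
| Synthetic Dataset | YOLOv4-s | 0.853 | 0.333 |
| Synthetic Dataset | YOLOv5-s | 0.833 | 0.345 |
| Synthetic Dataset + Real-world Dataset | EfficientDet-D1 | 0.622 | 0.178 |
| Synthetic Dataset + Real-world Dataset | YOLOv4-s | 0.873 | 0.433 |
| Synthetic Dataset + Real-world Dataset | YOLOv5-s | 0.864 | 0.440 |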
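
To make the differences in Table 1 easier to read, the short script below computes the absolute change in both metrics between two training settings. The numbers are copied from Table 1; the dictionary layout and the function name `gain` are just one convenient way to organize them.

```python
# (mAP@.5, mAP@.5:.95) per model, copied from Table 1.
MAP = {
    "real": {
        "EfficientDet-D1": (0.581, 0.158),
        "YOLOv4-s": (0.838, 0.340),
        "YOLOv5-s": (0.806, 0.302),
    },
    "synthetic": {
        "EfficientDet-D1": (0.590, 0.167),
        "YOLOv4-s": (0.853, 0.333),
        "YOLOv5-s": (0.833, 0.345),
    },
    "synthetic+real": {
        "EfficientDet-D1": (0.622, 0.178),
        "YOLOv4-s": (0.873, 0.433),
        "YOLOv5-s": (0.864, 0.440),
    },
}


def gain(model, baseline="real", candidate="synthetic+real"):
    """Absolute change (mAP@.5, mAP@.5:.95) of candidate over baseline for one model."""
    base = MAP[baseline][model]
    cand = MAP[candidate][model]
    return tuple(round(c - b, 3) for c, b in zip(cand, base))


combined_gains = {model: gain(model) for model in MAP["real"]}
synthetic_only_yolov4 = gain("YOLOv4-s", candidate="synthetic")
```

The combined setting improves both metrics for all three models, while synthetic-only training for YOLOv4-s raises mAP@.5 by 0.015 but lowers mAP@.5:.95 by 0.007 relative to real-world training.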

## 5. Conclusion

In this paper, we proposed a synthesis-based data augmentation method for training an object detection model. First, we extract a foreground mask of an unconscious person using $\boldsymbol{U}^{2}$Net and generate an elaborate refined foreground mask with a shadow effect using binary masking, blurring, and opacity control. Then, we scale the refined foreground mask according to the altitude of the background to generate the final unconscious person detection dataset. In the object detection evaluation, models trained on the synthetic dataset performed comparably to or better than those trained on the real-world dataset on most metrics, and combining both datasets performed best. We attribute this to reducing the difference between synthetic data and real-world images: the synthetic data cover more varied object appearances, environments, and augmentations than the AIHub dataset. We expect that the proposed method can be used to construct datasets more effectively for object detection and action recognition.

### ACKNOWLEDGMENTS

This work was supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant, which is funded by the Korean government (MSIT) (2021-0-01341, Artificial Intelligence Graduate School Program (Chung-Ang University)), and financially supported by the Institute of Civil-Military Technology Cooperation Program funded by the Defense Acquisition Program Administration and Ministry of Trade, Industry and Energy of the Korean government under grant No. UM20311RD3.

### REFERENCES

1. Xuebin Q., Zichen Z., Chenyang H., Martin J., Aug. 2020, U2-Net: Going deeper with nested U-structure for salient object detection, Pattern Recognition, Vol. 106
2. Taylor L., Geoff N., Nov. 2018, Improving deep learning with generic data augmentation, IEEE Symposium Series on Computational Intelligence, pp. 1542-1547
3. Zhang H., Moustapha C., Yann N D., David L., Oct. 2017, mixup: Beyond empirical risk minimization, arXiv preprint arXiv:1710.09412
4. Takahashi R., Takashi M., Kuniaki U., 2019, Data augmentation using random image cropping and patching for deep CNNs, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 30, No. 9, pp. 2917-2931
5. Summers C., Dinneen M. J., Mar. 2019, Improved mixed-example data augmentation, Winter Conference on Applications of Computer Vision, pp. 1262-1270
6. Zhong Z., Liang Z., Guoliang K., Yi Y., 2020, Random erasing data augmentation, Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, No. 7, pp. 13001-13008
7. DeVries T., Graham W., Nov. 2017, Improved regularization of convolutional neural networks with cutout, arXiv preprint arXiv:1708.04552
8. Sangdoo Y., Dongyoon H., Seong joon O., Junsuk C., 2019, CutMix: Regularization strategy to train strong classifiers with localizable features, Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6023-6032
9. Creswell A., Tom W., Vincent D., Kai A., Jan. 2018, Generative adversarial networks: An overview, IEEE Signal Processing Magazine, Vol. 35, No. 1, pp. 53-65
10. Tanaka F., Henrique K., Claus Aranha, Apr. 2019, Data augmentation using GANs, arXiv preprint arXiv:1904.09135
11. Remez T., Huang J., Brown M., 2018, Learning to segment via cut-and-paste, in Proc. ECCV, pp. 37-52
12. Dwibedi D., Ishan M., Martial Hebert, 2017, Cut, paste and learn: Surprisingly easy synthesis for instance detection, Proceedings of the IEEE International Conference on Computer Vision, pp. 1301-1310
13. Tripathi S., Siddhartha C., Amit A., Ambrish T., 2019, Learning to generate synthetic data via compositing, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 461-470
14. Sengdeok B., Francis B., Somin P., Wontae K., Jul. 2020, Image augmentation to improve construction resource detection using generative adversarial networks, cut and paste, image transformation techniques, Automation in Construction, Vol. 115
15. Hummel R. A., Kimia B., Steven W., 1987, Deblurring Gaussian blur, Computer Vision, Graphics, and Image Processing, Vol. 38, No. 1, pp. 66-80
16. Mariani G., Florian S., Roxana I., Costas B., Jun. 2018, BAGAN: Data augmentation with balancing GAN, arXiv preprint arXiv:1803.09655
17. Tellez D., Geert L., Peter B., Wouter B., Dec. 2019, Quantifying the effects of data augmentation and stain color normalization in convolutional neural networks for computational pathology, Medical Image Analysis, Vol. 58
18. Hwang S., Minsong K., Seung L., Sanghoon P., 2022, Cut and continuous paste towards real-time deep fall detection, arXiv preprint arXiv:2202.10687, pp. 1775-1779
19. Bochkovskiy A., Chien-Yao Wang, Hong-Yuan M., Apr. 2020, YOLOv4: Optimal speed and accuracy of object detection, arXiv preprint arXiv:2004.10934
20. Zhou F., Huailin Z., Zhen N., 2021, Safety helmet detection based on YOLOv5, International Conference on Power Electronics, Computer Applications, pp. 6-11
21. Tan M., Ruoming P., Quoc LE V., 2020, EfficientDet: Scalable and efficient object detection, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10781-10790

## Author

##### Junghoon Sung

Junghoon Sung was born in Suwon, Korea, in 1994. He received a B.S. degree in Urban Engineering from Hyupsung University, South Korea, in 2019. Currently, he is pursuing an M.S. degree in image processing at Chung-Ang University.

##### Heegwang Kim

Heegwang Kim was born in Seoul, Korea, in 1992. He received a B.S. degree in electronic engineering from Soongsil University, Korea, in 2016. He received an M.S. degree in Image Science from Chung-Ang University, Korea, in 2018. Currently, he is pursuing a Ph.D. degree in image engineering at Chung-Ang University.

##### Mingi Kim

Mingi Kim was born in Okcheon, Korea, in 1996. He received a B.S. degree in data analysis from Hannam University, South Korea, in 2021. He is currently pursuing an M.S. degree with the Department of Artificial Intelligence, Chung-Ang University.

##### Yeongheon Mok

Yeongheon Mok was born in Seoul, Korea, in 1997. He received a B.S. degree from the Department of Digital Imaging Engineering, Chung-Ang University, Korea, in 2021. He is currently pursuing an M.S. degree with the Department of Artificial Intelligence, Chung-Ang University.

##### Chanyeong Park

Chanyeong Park was born in Seoul, South Korea, in 1997. He received a B.S. degree in computer science from Coventry University in 2021. Currently, he is pursuing an M.S. degree in image processing at Chung-Ang University.

##### Joonki Paik

Joonki Paik was born in Seoul, South Korea, in 1960. He received a B.S. degree in control and instrumentation engineering from Seoul National University in 1984 and M.Sc. and Ph.D. degrees in electrical engineering and computer science from Northwestern University in 1987 and 1990, respectively. From 1990 to 1993, he joined Samsung Electronics, where he designed image stabilization chipsets for consumer camcorders. Since 1993, he has been a member of the faculty of Chung-Ang University, Seoul, Korea, where he is currently a professor with the Graduate School of Advanced Imaging Science, Multimedia, and Film. From 1999 to 2002, he was a visiting professor with the Department of Electrical and Computer Engineering, University of Tennessee, Knoxville. Since 2005, he has been the director of the National Research Laboratory in the field of image processing and intelligent systems. From 2005 to 2007, he served as the dean of the Graduate School of Advanced Imaging Science, Multimedia, and Film. From 2005 to 2007, he was the director of the Seoul Future Contents Convergence Cluster established by the Seoul Research and Business Development Program. In 2008, he was a full-time technical consultant for the System LSI Division of Samsung Electronics, where he developed various computational photographic techniques, including an extended depth of field system. He has served as a member of the Presidential Advisory Board for Scientific/Technical Policy with the Korean Government and is currently serving as a technical consultant for the Korean Supreme Prosecutor's Office for computational forensics. He was a two-time recipient of the Chester-Sall Award from the IEEE Consumer Electronics Society, the Academic Award from the Institute of Electronic Engineers of Korea, and the Best Research Professor Award from Chung-Ang University. 
He has served the Consumer Electronics Society of the IEEE as a member of the editorial board, vice president of international affairs, and director of sister and related societies committee.