Junghoon Sung¹, Heegwang Kim¹, Mingi Kim², Yeongheon Mok², Chanyeong Park¹, and Joonki Paik¹,²
¹ Department of Image, Graduate School of Advanced Imaging Science, Multimedia and Film, Chung-Ang University, Seoul 06974, Korea ({jhun, heegwang, chanyeong}@ipis.cau.ac.kr, paikj@cau.ac.kr)
² Graduate School of Artificial Intelligence, Chung-Ang University, Seoul 06974, Korea ({mgkim, yhmok}@ipis.cau.ac.kr)
Copyright © The Institute of Electronics and Information Engineers (IEIE)
Keywords
Data augmentation, Synthetic data generation, SOD, UAV
1. Introduction
One of the most researched computer vision tasks is object detection, which has been
used in many applications, such as disaster relief, intelligent monitoring systems,
and the defense industry. In the field of disaster relief, object detection in drone
images is used to detect an unconscious person located in a hard-to-approach area.
Therefore, achieving high accuracy in unconscious person detection requires a large, high-quality training dataset, but collecting one demands considerable time and human resources. Moreover, it is difficult to obtain unconscious person data in a variety of environments.
To solve this problem, we propose a novel data augmentation method using synthetic
data generation. In synthetic data generation for unconscious person data in drone
images, we first extract the object to obtain a foreground mask with object information
from a reference image. In this process, the loss of the object's edge information must be minimized so that only the object is extracted. For this reason, we use a salient object detection (SOD) method, $U^{2}$Net [1], which is well suited to single-object detection. We detect the most salient object in the input image and generate a foreground mask that contains only the object information, without the background.
When synthesizing the foreground mask with the target background image, we must account for the incompatibility between the two images. Therefore, shadows are generated on the foreground mask to make the synthetic data more natural. Finally, the refined foreground mask and the drone background image are synthesized.
2. Related Work
Data augmentation is widely used in object detection and image classification tasks
to generate additional training datasets. Taylor et al. proposed geometric augmentation
methods such as flipping, rotating, and cropping [2]. These methods artificially transform training data by preserving labels to reduce
the cost of collecting training data and are widely used to improve the performance
of CNNs. Zhang et al. proposed MixUp, which solves the memorization problem of deep
learning networks or sensitive issues in adversarial examples [3]. This method improves the classification performance by using weighted linear interpolation
of two example images and labels.
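For illustration, this weighted interpolation can be sketched in a few lines of NumPy (a minimal version; the function name and the `alpha` default are our own, not taken from [3]):

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Blend two training examples and their labels with a single
    Beta(alpha, alpha)-distributed weight, as in MixUp."""
    rng = rng or np.random.default_rng()
    lam = float(rng.beta(alpha, alpha))
    x = lam * x1 + (1.0 - lam) * x2   # pixel-wise interpolation
    y = lam * y1 + (1.0 - lam) * y2   # label interpolation (one-hot labels)
    return x, y
```

Because the labels are interpolated with the same weight as the pixels, the network is trained on convex combinations of examples rather than hard targets.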
Takahashi et al. proposed a random image cropping and patching (RICAP) method that combines randomly cropped patches from four images into one image [4]. Although this method combines four images, the patches are resized rather than simply cropped, so the whole image area is used without waste. Summers et al. proposed eight mixing-method types, including CutMix and Mosaic, as improvements of existing methods for mixing two images [5]. Zhong et al. proposed a data augmentation method that eliminates object information in an image to address object occlusion [6]; it selects a random area of the input image and fills it with a specific value such as random noise, the ImageNet mean, 0, or 255. Yun et al. proposed a method combining MixUp and CutOut [7,8], which pastes a patch extracted from one image into a randomly erased region of another.
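The cut-and-paste mixing described above can be sketched as follows (a simplified single-pair version under our own naming; the Beta(1,1) area sampling is one common choice, not necessarily the exact recipe of [7,8]):

```python
import numpy as np

def cutmix(x1, y1, x2, y2, rng=None):
    """Paste a random patch of image x2 into x1 and mix the labels
    by the pasted area ratio."""
    rng = rng or np.random.default_rng()
    h, w = x1.shape[:2]
    lam = float(rng.beta(1.0, 1.0))              # target area ratio kept from x1
    cut_h = int(h * np.sqrt(1.0 - lam))
    cut_w = int(w * np.sqrt(1.0 - lam))
    cy, cx = int(rng.integers(h)), int(rng.integers(w))   # patch centre
    t, b = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, h)
    l, r = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, w)
    out = x1.copy()
    out[t:b, l:r] = x2[t:b, l:r]
    # recompute the mixing weight from the clipped patch area
    lam = 1.0 - (b - t) * (r - l) / (h * w)
    return out, lam * y1 + (1.0 - lam) * y2
```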
In addition to existing geometric augmentation methods, data augmentation using a
Generative Adversarial Network (GAN) has been studied [9]. Tanaka et al. proposed a GAN-based data augmentation method to compensate for class imbalance in a dataset [10]; the synthetic data made with the GAN can replace real data, as verified with a decision tree. Remez et al. proposed a GAN-based data augmentation method using copy and paste [11], which generates images for weakly supervised segmentation tasks.
Dwibedi et al. proposed a synthetic data generation method using cut and paste [12], in which a foreground mask is randomly placed on a background image. Tripathi et al. proposed synthesizing a foreground mask at a suitable location in a background image through a synthesizer network [13]. In this method, information loss occurs in the edge area of the synthesized foreground mask, so the synthetic image can look unnatural.
Bang et al. proposed an object extracting method using cut and paste from an image
obtained by an Unmanned Aerial Vehicle (UAV) and synthesizing it into a UAV reference
image using illumination, blur, and scale [14]. This method has difficulties estimating the object to be extracted from the UAV
image and is not suitable for natural synthetic data generation.
3. The Proposed Method
We propose a data augmentation method using synthetic data generation for unconscious
person detection in a drone image. First, the unconscious person object is extracted
by SOD from an input image. The foreground object is obtained by separating the salient
object from the background using SOD. Shadow generation helps to create a natural
foreground mask. Then, background data are generated by data augmentation using geometric
transformations. The image result is derived by synthesizing the background image
and refined foreground mask. Fig. 1 shows a flowchart of the proposed method, which can be formulated as
$M_{u} = \mathrm{SOD}(I_{x})$, $M_{s} = \mathrm{Shadow}(M_{u})$, $I_{b} = \mathrm{Aug}(I_{y})$, and $I_{h} = \mathrm{Comp}(M_{s}, I_{b})$,
where $I_{x}$ and $I_{y}$ are inputs containing an unconscious person and a background image of the UAV environment, respectively. $M_{u}$ is the result of SOD using $U^{2}$Net, $M_{s}$ is $M_{u}$ with an added shadow, $I_{b}$ is an augmented background image, and $I_{h}$ is the synthetic image of $M_{s}$ and $I_{b}$.
Fig. 1. Our proposed architecture. $I_{x}$: reference person image, $M_{u}$: foreground mask, $M_{s}$: refined foreground mask, $I_{y}$: background image, $I_{b}$: augmented background image, $I_{h}$: result image.
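As a sketch, the flow of Fig. 1 can be expressed as a simple composition of the four stages (the arguments `sod`, `add_shadow`, `augment_background`, and `composite` are placeholders for the operations of Sections 3.1-3.4, not names from the paper):

```python
def generate_synthetic_image(I_x, I_y, sod, add_shadow, augment_background, composite):
    """One pass of the proposed pipeline over a person image I_x
    and a UAV background image I_y."""
    M_u = sod(I_x)                 # foreground mask via U^2-Net (Sec. 3.1)
    M_s = add_shadow(M_u)          # refined mask with drop shadow (Sec. 3.2)
    I_b = augment_background(I_y)  # mirrored / color-shifted background (Sec. 3.3)
    I_h = composite(M_s, I_b)      # altitude-aware paste (Sec. 3.4)
    return I_h
```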
3.1 Salient Object Detection
First, we use $U^{2}$Net, a representative deep learning-based salient object detection method, to generate synthetic data for detecting an unconscious person in drone images. $U^{2}$Net has a deeply nested U-Net structure that maintains high resolution without significantly increasing computational cost. It finds the most salient, attention-grabbing object in the image and segments it. More contextual information can be captured at various scales because receptive fields of multiple sizes are mixed in the residual U-block (RSU). In addition, the pooling operations used in the RSU block effectively increase the depth of the overall architecture without a significant increase in computational cost.
Additionally, unlike conventional SOD methods, $U^{2}$Net does not use a pre-trained backbone network. It is trained end to end on the target data, which leads to better performance by adapting to the data efficiently. In this study, our goal is to separate salient objects accurately using $U^{2}$Net. As shown in Fig. 2, we separate the salient area of an image in which an object exists, and the foreground object image without background is obtained by eliminating the remaining space.
To generate an object bounding box for the synthesis step, size and position information are required. Therefore, we calculate the range of the $x$ and $y$ coordinates of the object and crop the corresponding area. For data labeling, this object extraction method is much better than existing approaches: traditional data labeling requires a great deal of cost, effort, and time, whereas the proposed synthetic data generation method yields the label information of an object automatically. The cost can thus be significantly reduced, and the accuracy of unconscious person detection can be further increased.
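The bounding-box computation described above reduces to finding the coordinate range of the non-zero mask pixels. A minimal NumPy sketch (function name ours):

```python
import numpy as np

def mask_to_bbox(mask, threshold=0):
    """Return (x_min, y_min, x_max, y_max) of the non-zero region of a
    single-object saliency mask, or None if the mask is empty."""
    ys, xs = np.nonzero(mask > threshold)
    if ys.size == 0:
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())
```

The cropped object is then `image[y_min:y_max + 1, x_min:x_max + 1]`, and the same four coordinates serve directly as the detection label.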
Fig. 2. Object extraction using $U^{2}$Net.
3.2 Shadow Generation
To generate more elaborate and natural unconscious person data, we propose a method that combines the foreground mask with a shadow of the object. As shown in Fig. 3, the foreground mask obtained from the input image is first converted to a binary mask, and Gaussian blur [15] and transparency are applied to generate a refined shadow similar to a real one. Because the shadow region normally lies outside the object contour, we resize and shift the original foreground mask to composite the refined shadow and the original foreground mask adaptively. Finally, we paste these two components together to generate a refined foreground mask. This mask can be applied to the synthesis step more naturally than the simple foreground mask, so it serves as essential high-quality synthetic data for detecting an unconscious person.
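A minimal NumPy sketch of this shadow-generation step, under our own illustrative choices for blur width, opacity, and offset (the paper does not specify exact values):

```python
import numpy as np

def _blur(a, sigma):
    """Separable Gaussian blur implemented with 1D convolutions."""
    r = max(int(3 * sigma), 1)
    k = np.exp(-np.arange(-r, r + 1) ** 2 / (2.0 * sigma ** 2))
    k /= k.sum()
    a = np.apply_along_axis(np.convolve, 0, a, k, mode="same")
    return np.apply_along_axis(np.convolve, 1, a, k, mode="same")

def _shift(a, dy, dx):
    """Shift a 2D array by (dy, dx), padding with zeros."""
    out = np.zeros_like(a)
    h, w = a.shape
    out[max(dy, 0):h + min(dy, 0), max(dx, 0):w + min(dx, 0)] = \
        a[max(-dy, 0):h - max(dy, 0), max(-dx, 0):w - max(dx, 0)]
    return out

def add_drop_shadow(fg_rgba, sigma=5.0, opacity=0.5, offset=(8, 8)):
    """Refine an RGBA foreground cut-out with a soft drop shadow:
    binarize the alpha channel, blur it, dim it, shift it, and
    composite it beneath the object."""
    alpha = (fg_rgba[..., 3] > 0).astype(np.float32)   # binary object mask
    shadow = _shift(_blur(alpha, sigma), *offset) * opacity
    out = fg_rgba.astype(np.float32)
    out[..., 3] = np.clip(255 * shadow * (1 - alpha) + out[..., 3], 0, 255)
    out[..., :3] *= alpha[..., None]                   # shadow pixels render black
    return out.astype(np.uint8)
```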
Fig. 3. Examples of drop shadow images.
3.3 Various UAV Background Generation
Before placing refined unconscious person images on UAV images, various background images are required. However, it is challenging to obtain them because of legal restrictions and environmental factors of UAV operation. To solve this problem, existing data augmentation methods were used for UAV background image generation. First, the foreground object was cut out of the unconscious person dataset in the UAV environment provided by AIHub [18], and the surrounding background was used as a reference. As shown in Fig. 4, a background image was acquired by the UAV at each altitude. A data augmentation method that maintains the resolution and naturalness of the background image is expected to yield synthetic data similar to real-world data.
Because we aim to generate synthetic data similar to real-world data, we used existing data augmentation methods selectively. Methods such as mirroring [16] and natural color shifting [17] are actively used; they help construct synthetic data while maintaining naturalness.
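The two augmentations used here, mirroring and color shifting, can be sketched as follows (the function name and the shift range are our own illustrative choices):

```python
import numpy as np

def augment_background(img, rng=None, max_shift=20):
    """Generate a UAV background variant by horizontal mirroring
    and a mild random per-channel color shift."""
    rng = rng or np.random.default_rng()
    out = img[:, ::-1].copy()                                # horizontal mirror
    shift = rng.integers(-max_shift, max_shift + 1, size=3)  # small RGB offset
    return np.clip(out.astype(np.int16) + shift, 0, 255).astype(np.uint8)
```

Keeping `max_shift` small preserves the natural appearance of the background, which matters here because the synthetic data should remain close to real-world imagery.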
Fig. 4. UAV background images using data augmentation method by altitude: (a) Input; (b) mirroring; (c) color shifting; (d) mirroring and color shifting.
3.4 Synthetic Image Generation
We conducted a data scaling process to synthesize the refined unconscious person objects with the various background images adaptively according to altitude. With the altitude of 10 meters weighted by 1.0, 15 meters is weighted by 0.8, 20 meters by 0.6, and 25 meters by 0.4. Then, for object diversity, the objects are synthesized at random locations on the background image after rotation and inversion using data augmentation methods. Annotations corresponding to each synthetic image are needed to train the object detection model, and the coordinates ($X_{\min}$, $Y_{\min}$), ($X_{\max}$, $Y_{\max}$) of each refined synthetic unconscious person object can be calculated directly.
Fig. 5 shows the process of synthesizing an unconscious person object with a background image. Scale adjustment according to altitude and rotation are applied, a synthetic foreground mask for various environments is generated, and the result is composited with the background image. Fig. 6 shows examples of the generated synthetic images. The overall process produces natural synthetic data by placing the refined foreground mask of an unconscious person object on a background image.
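The altitude-dependent scaling, random placement, and annotation computation can be sketched as follows (a nearest-neighbour resize stands in for whatever resampling the authors used; rotation and inversion are omitted for brevity):

```python
import numpy as np

# altitude (m) -> scale weight relative to the 10 m reference (Sec. 3.4)
ALTITUDE_SCALE = {10: 1.0, 15: 0.8, 20: 0.6, 25: 0.4}

def paste_object(background, fg_rgba, altitude, rng=None):
    """Scale the refined foreground mask by altitude, paste it at a random
    location, and return the image plus its (x_min, y_min, x_max, y_max) label."""
    rng = rng or np.random.default_rng()
    s = ALTITUDE_SCALE[altitude]
    h, w = fg_rgba.shape[:2]
    nh, nw = max(int(h * s), 1), max(int(w * s), 1)
    # nearest-neighbour resize of the RGBA cut-out
    ys = np.arange(nh) * h // nh
    xs = np.arange(nw) * w // nw
    fg = fg_rgba[ys][:, xs]
    H, W = background.shape[:2]
    y0 = int(rng.integers(0, H - nh + 1))     # random placement
    x0 = int(rng.integers(0, W - nw + 1))
    out = background.copy()
    alpha = fg[..., 3:4] / 255.0              # alpha-blend object over background
    out[y0:y0 + nh, x0:x0 + nw] = (alpha * fg[..., :3]
                                   + (1 - alpha) * out[y0:y0 + nh, x0:x0 + nw]
                                   ).astype(np.uint8)
    return out, (x0, y0, x0 + nw, y0 + nh)
```

The returned coordinates are exactly the ($X_{\min}$, $Y_{\min}$), ($X_{\max}$, $Y_{\max}$) annotation required for training, obtained for free from the paste location.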
Fig. 5. Synthesizing refined unconscious person and UAV background image.
Fig. 6. Examples of synthetic unconscious person dataset.
4. Experimental Result
The AIHub real-world dataset [18] and our synthetic dataset were used in experiments to verify the proposed method. The real-world dataset consists of 1,032 images for training, 200 for validation, and 200 for testing; it contains unconscious person objects with class information and the bounding box coordinates of each object. The synthetic dataset was generated from 1,032 UAV background images and 100 unconscious person images, yielding 12,018 synthetic images for object detection.
For the training step, 13,050 images were used across the synthetic, real-world, and synthetic + real-world settings. Only the real-world dataset was used for validation and testing. The experiments were conducted on an RTX 3090 (24 GB), and the object detection models YOLOv4 [19], YOLOv5 [20], and EfficientDet [21] were trained with a batch size of 16 and a learning rate of 0.00001.
Table 1 shows a comparative analysis of the mAP for each dataset. The three models trained on the synthetic dataset achieved performance on the real-world test set similar to those trained on real data. Notably, using the synthetic and real-world datasets together gave the best results. This indicates that the synthetic dataset is sufficiently similar to real-world images that combining both datasets yields the best mAP score. In other words, when the real-world dataset is small or difficult to build, training on both real-world and synthetic data achieves the best performance. Therefore, the proposed method is an effective data augmentation method in situations where data acquisition is difficult. Furthermore, because it leverages factors such as altitude-dependent object scaling and natural shadow generation, it costs less time and fewer human resources than normal data acquisition.
Table 1. Comparison of object detection models.

| Model | mAP@.5 | mAP@.5:.95 |
| --- | --- | --- |
| Real-world Dataset | | |
| EfficientDet-D1 [21] | 0.581 | 0.158 |
| YOLOv4-s [19] | 0.838 | 0.340 |
| YOLOv5-s [20] | 0.806 | 0.302 |
| Synthetic Dataset | | |
| EfficientDet-D1 | 0.590 | 0.167 |
| YOLOv4-s | 0.853 | 0.333 |
| YOLOv5-s | 0.833 | 0.345 |
| Synthetic Dataset + Real-world Dataset | | |
| EfficientDet-D1 | 0.622 | 0.178 |
| YOLOv4-s | 0.873 | 0.433 |
| YOLOv5-s | 0.864 | 0.440 |
5. Conclusion
In this paper, we proposed a synthetic data augmentation method for training an object detection model. First, we extract a foreground mask of an unconscious person using $U^{2}$Net and generate an elaborate refined foreground mask with a shadow effect using binary masking, blurring, and opacity control. Then, we scale the refined foreground mask according to the altitude of the background to generate the final unconscious person detection dataset. The synthetic dataset achieved better performance than the real-world dataset in the object detection evaluation. This is because we reduced the gap between synthetic and real-world images by giving the synthetic data more varied expressions of objects, environments, and augmentations than the AIHub dataset. We expect the proposed method to enable more effective dataset construction for object detection and action recognition.
ACKNOWLEDGMENTS
This work was supported by the Institute of Information & communications Technology
Planning & Evaluation (IITP) grant, which is funded by the Korean government (MSIT)
(2021-0-01341, Artificial Intelligence Graduate School Program (Chung-Ang University)),
and financially supported by the Institute of Civil-Military Technology Cooperation
Program funded by the Defense Acquisition Program Administration and Ministry of Trade,
Industry and Energy of Korean government under grant No. UM20311RD3.
REFERENCES
[1] Qin X., Zhang Z., Huang C., Dehghan M., Zaiane O. R., Jagersand M., Aug. 2020, U2-Net: Going deeper with nested U-structure for salient object detection, Pattern Recognition, Vol. 106
[2] Taylor L., Nitschke G., Nov. 2018, Improving deep learning with generic data augmentation, IEEE Symposium Series on Computational Intelligence, pp. 1542-1547
[3] Zhang H., Cisse M., Dauphin Y. N., Lopez-Paz D., Oct. 2017, mixup: Beyond empirical risk minimization, arXiv preprint arXiv:1710.09412
[4] Takahashi R., Matsubara T., Uehara K., 2019, Data augmentation using random image cropping and patching for deep CNNs, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 30, No. 9, pp. 2917-2931
[5] Summers C., Dinneen M. J., Mar. 2019, Improved mixed-example data augmentation, Winter Conference on Applications of Computer Vision, pp. 1262-1270
[6] Zhong Z., Zheng L., Kang G., Li S., Yang Y., 2020, Random erasing data augmentation, Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, No. 7, pp. 13001-13008
[7] DeVries T., Taylor G. W., Nov. 2017, Improved regularization of convolutional neural networks with cutout, arXiv preprint arXiv:1708.04552
[8] Yun S., Han D., Oh S. J., Chun S., 2019, CutMix: Regularization strategy to train strong classifiers with localizable features, Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6023-6032
[9] Creswell A., White T., Dumoulin V., Arulkumaran K., Jan. 2018, Generative adversarial networks: An overview, IEEE Signal Processing Magazine, Vol. 35, No. 1, pp. 53-65
[10] Tanaka F. H. K. S., Aranha C., Apr. 2019, Data augmentation using GANs, arXiv preprint arXiv:1904.09135
[11] Remez T., Huang J., Brown M., 2018, Learning to segment via cut-and-paste, Proc. ECCV, pp. 37-52
[12] Dwibedi D., Misra I., Hebert M., 2017, Cut, paste and learn: Surprisingly easy synthesis for instance detection, Proceedings of the IEEE International Conference on Computer Vision, pp. 1301-1310
[13] Tripathi S., Chandra S., Agrawal A., Tyagi A., 2019, Learning to generate synthetic data via compositing, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 461-470
[14] Bang S., Baek F., Park S., Kim W., Jul. 2020, Image augmentation to improve construction resource detection using generative adversarial networks, cut-and-paste, and image transformation techniques, Automation in Construction, Vol. 115
[15] Hummel R. A., Kimia B., Zucker S. W., 1987, Deblurring Gaussian blur, Computer Vision, Graphics, and Image Processing, Vol. 38, No. 1, pp. 66-80
[16] Mariani G., Scheidegger F., Istrate R., Bekas C., Jun. 2018, BAGAN: Data augmentation with balancing GAN, arXiv preprint arXiv:1803.09655
[17] Tellez D., Litjens G., Bándi P., Bulten W., Dec. 2019, Quantifying the effects of data augmentation and stain color normalization in convolutional neural networks for computational pathology, Medical Image Analysis, Vol. 58
[18] Hwang S., Kim M., Lee S., Park S., 2022, Cut and continuous paste towards real-time deep fall detection, arXiv preprint arXiv:2202.10687, pp. 1775-1779
[19] Bochkovskiy A., Wang C.-Y., Liao H.-Y. M., Apr. 2020, YOLOv4: Optimal speed and accuracy of object detection, arXiv preprint arXiv:2004.10934
[20] Zhou F., Zhao H., Nie Z., 2021, Safety helmet detection based on YOLOv5, International Conference on Power Electronics, Computer Applications, pp. 6-11
[21] Tan M., Pang R., Le Q. V., 2020, EfficientDet: Scalable and efficient object detection, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10781-10790
Authors
Junghoon Sung was born in Suwon, Korea, in 1994. He received a B.S. degree in Urban
Engineering from Hyupsung University, South Korea, in 2019. Currently, he is pursuing
an M.S. degree in image processing at Chung-Ang University.
Heegwang Kim was born in Seoul, Korea, in 1992. He received a B.S. degree in electronic
engineering from Soongsil University, Korea, in 2016. He received an M.S. degree in
Image Science from Chung-Ang University, Korea, in 2018. Currently, he is pursuing
a Ph.D. degree in image engineering at Chung-Ang University.
Mingi Kim was born in Okcheon, Korea, in 1996. He received a B.S. degree in data
analysis from Hannam University, South Korea, in 2021. He is currently pursuing an
M.S. degree with the Department of Artificial Intelligence, Chung-Ang University.
Yeongheon Mok was born in Seoul, Korea, in 1997. He received a B.S. degree from
the Department of Digital Imaging Engineering from Chung-Ang University, Korea, in
2021. He is currently pursuing an M.S. degree with the Department of Artificial Intelligence,
Chung-Ang University.
Chanyeong Park was born in Seoul, South Korea, in 1997. He received a B.S. degree
in computer science from Coventry University in 2021. Currently, he is pursuing an
M.S. degree in image processing at Chung-Ang University.
Joonki Paik was born in Seoul, South Korea, in 1960. He received a B.S. degree
in control and instrumentation engineering from Seoul National University in 1984
and M.Sc. and Ph.D. degrees in electrical engineering and computer science from Northwestern
University in 1987 and 1990, respectively. From 1990 to 1993, he joined Samsung Electronics,
where he designed image stabilization chipsets for consumer camcorders. Since 1993,
he has been a member of the faculty of Chung-Ang University, Seoul, Korea, where he
is currently a professor with the Graduate School of Advanced Imaging Science, Multimedia,
and Film. From 1999 to 2002, he was a visiting professor with the Department of Electrical
and Computer Engineering, University of Tennessee, Knoxville. Since 2005, he has been
the director of the National Research Laboratory in the field of image processing
and intelligent systems. From 2005 to 2007, he served as the dean of the Graduate
School of Advanced Imaging Science, Multimedia, and Film. From 2005 to 2007, he was
the director of the Seoul Future Contents Convergence Cluster established by the Seoul
Research and Business Development Program. In 2008, he was a full-time technical consultant
for the System LSI Division of Samsung Electronics, where he developed various computational
photographic techniques, including an extended depth of field system. He has served
as a member of the Presidential Advisory Board for Scientific/Technical Policy with
the Korean Government and is currently serving as a technical consultant for the Korean
Supreme Prosecutor's Office for computational forensics. He was a two-time recipient
of the Chester-Sall Award from the IEEE Consumer Electronics Society, the Academic
Award from the Institute of Electronic Engineers of Korea, and the Best Research Professor
Award from Chung-Ang University. He has served the Consumer Electronics Society of
the IEEE as a member of the editorial board, vice president of international affairs,
and director of sister and related societies committee.