Mingi Kim1, Heegwang Kim2, Chanyeong Park2, Joonki Paik1,2,*
1 Graduate School of Artificial Intelligence, Chung-Ang University, Seoul 06974, Korea (mgkim@ipis.cau.ac.kr)
2 Department of Image, Graduate School of Advanced Imaging Science, Multimedia and Film, Chung-Ang University, Seoul 06974, Korea ({heegwang, chanyeong}@ipis.cau.ac.kr, paikj@cau.ac.kr)
Copyright © The Institute of Electronics and Information Engineers (IEIE)
Keywords
Small object detection, Deep learning, Light-weight, Attention mechanism
1. Introduction
Object detection using deep learning has become popular in various applications, and
drone-based detection is particularly useful in fields such as the military, industry,
security, and transportation. In particular, drone-based traffic analysis is gaining
increasing attention for tasks such as traffic jam identification, illegal parking
detection, and intelligent traffic control. However, drone images differ from those
captured by CCTV or vehicle-mounted cameras: they are taken at high altitudes and
varying angles and mainly contain small objects with diverse feature shapes. Because
general object detection models require many parameters, high power consumption, and
a large memory footprint, they are not suitable for the low-power embedded systems on
drones.
Model scaling techniques [28,29] can be used to create an efficient model suitable for
drones and other low-power embedded systems. A typical model scaling technique reduces
the size of a model by changing the depth, width, and input resolution of the backbone
network [30,31]. Building on this idea, we propose a novel, efficient, lightweight deep
neural network model for small vehicle detection in drone images. YOLOv4-s, the lightest
version for real-time object detection, is used as the baseline model [1]. Since drone
images mostly contain small objects, we removed the head layer that detects large objects
and performed efficient model scaling for small object detection.
To compensate for the lost information, an attention stacked hourglass network (ASHN)
was added to the middle level of the backbone network, where feature fusion is performed.
The ASHN is designed to restore the multi-scale and filter features lost through the
preceding steps. The purpose of this study is to design a model that can be used for
specific tasks in hardware-limited environments such as drones. Through the proposed
methods, we achieved an efficient lightweight deep neural network model that detects
small vehicles in drone images. The paper is organized as follows: Section 2 reviews
related work on small object detection and model scaling techniques, Section 3 presents
the proposed efficient lightweight deep neural network model, Section 4 summarizes the
experimental results, and Section 5 concludes the paper.
2. Related Work
2.1 Object Detection in UAV Images
Unmanned aerial vehicles (UAVs) equipped with cameras can flexibly acquire ground
images without geographical restrictions. UAV images are therefore widely used to detect
humans, vehicles, and military targets in applications such as search and rescue. For
robust and accurate detection from UAV images in real-life environments, state-of-the-art
deep learning-based object detectors must be scaled down to reduce their size and memory
footprint. We briefly discuss existing detectors in terms of the number of stages.
2.1.1 Two-stage Detector
Regions with convolutional neural network features (R-CNN) [5] is a two-stage detector that sequentially performs region proposal and classification.
Several enhanced methods have been proposed to improve its performance.
Faster R-CNN [2] computes regions of interest (ROIs) with a region proposal network (RPN) instead
of selective search; the RPN accelerates ROI computation on the GPU and improves learning accuracy.
Cascade R-CNN [3] uses a sequence of classifiers, each of which receives the bounding boxes of the
previous stage and performs a new classification task under the assumption that the boxes become
more accurate at each step; the classifier at each subsequent stage is trained with a higher
intersection-over-union (IoU) threshold than the previous one. For balanced learning,
Libra R-CNN [4] addresses the imbalance in object detection at three levels: sample, feature, and
objective. It solves the imbalance problem by integrating three new components:
IoU-balanced sampling, a balanced feature pyramid, and a balanced $\ell_1$ loss.
2.1.2 One-stage Detector
The single shot multibox detector (SSD) [6] and ``you only look once'' (YOLO) [39] are one-stage detectors that perform classification and region proposal at the
same time. They improve detection speed by replacing the last fully connected
(FC) layer of the network with a convolution layer. SSD estimates objects using
default boxes with different scales and aspect ratios for each feature map cell.
M2Det [7] has a multi-level feature pyramid network (MLFPN) that consists of three modules
to find objects of different sizes and appearance complexity. The feature fusion
module (FFM) creates a base feature by fusing shallow and deep features from the
backbone, and thinned U-shape modules (TUMs) are stacked, with the second version of
the feature fusion module (FFMv2) connecting each TUM to the next. M2Det uses
multi-level, multi-scale features through scale-wise feature concatenation and
channel-wise attention, and the model combines the MLFPN with SSD in an end-to-end form.
YOLOv4 combines a CSPDarknet53-based backbone architecture with spatial pyramid pooling
(SPP) [8] and a path aggregation network (PAN) [9]. This enables fast learning and inference, as well as high performance with a single
GPU [1]. In addition, various data augmentation techniques were presented to improve the
detection performance without increasing inference time.
A fast, lightweight detector is needed for drones and low-power embedded systems.
However, widely used one-stage detectors are generally evaluated on MS COCO [10] and are
difficult to apply directly to UAV images, which are captured at high altitudes and
various angles; their accuracy drops because of complex backgrounds, small object sizes,
and appearance changes with viewing angle. Small object detection suitable for UAV
images is therefore challenging.
2.2 Small Object Detection
Detecting small objects is a challenging task in computer vision because of the limited
number of pixels per object and the imbalanced amount of information between the background
and objects. To detect small objects, it is common practice to make the CNN deeper to
obtain a higher-level feature map containing semantic information about the object, at
the cost of losing low-level spatial information. To address this problem, various
methods combining shallow and deep features have been proposed [6,9,11-14]; such methods
retain shallow-level features even in deeper layers.
Another challenge is the limited amount of contextual information available for small
objects, which are as small as 32$\times$32 pixels [15]. Local context carries very important
information such as the edge, color, and texture of an object. To compensate for the lack
of local pixel context, the filter size of the network can be increased, or deconvolution
layers can be added to recover higher-resolution feature maps [16-18].
Recently, multi-scale feature maps have been widely used for small object detection.
However, the matching ratio between feature maps and ground-truth small objects is still
insufficient because of inappropriate anchor settings, which leads to lower performance
on small objects than on large objects. To address this imbalance, several methods
generate positive examples for small objects using multi-scale feature maps and anchor
boxes [14,19-21].
Only anchors with high IoU scores are designated as positive examples, and all others
are considered negative, which results in a severe imbalance between positive and
negative examples. To mitigate this, some methods resample or re-weight examples
according to the data distribution so that positive and negative examples contribute
in similar proportions during training [4,6,22-25]. Another approach designs a new loss
function that re-weights the imbalanced positive and negative examples at each training
step [26,27].
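As a toy illustration of the IoU-based labeling rule above (the 0.7/0.3 thresholds follow common RPN-style settings and are not taken from the cited papers), the following sketch shows how few anchors end up positive:

```python
import numpy as np

def label_anchors(ious, pos_thr=0.7, neg_thr=0.3):
    """Label anchors from an (num_anchors, num_gt) IoU matrix:
    1 = positive, 0 = negative (background), -1 = ignored."""
    best_iou = ious.max(axis=1)
    labels = np.full(len(ious), -1)
    labels[best_iou < neg_thr] = 0
    labels[best_iou >= pos_thr] = 1
    return labels

# With thousands of anchors and only a few small objects, negatives dominate.
rng = np.random.default_rng(0)
ious = rng.uniform(0.0, 0.25, size=(8000, 5))        # most anchors barely overlap any object
ious[rng.choice(8000, 40, replace=False), 0] = 0.8   # a few anchors fit a ground-truth box well
labels = label_anchors(ious)
print((labels == 1).sum(), (labels == 0).sum())      # 40 positives vs. 7960 negatives
```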
2.3 Model Scaling
Model scaling is a technique that adjusts the size of a model by changing its width,
depth, and input resolution, the factors that determine the size and computational cost
of a baseline model [28,29]. Model scale-up with these factors was applied in
Scaled-YOLOv4 [30] and EfficientDet [31]. Width scaling changes the number of filters
(channels); a wider network is commonly understood to extract finer information. Depth
scaling changes the number of layers, and resolution scaling changes the resolution of
the input image.
For EfficientDet, various model scaling conditions were tested. Increasing width or
depth makes training converge earlier, while increasing resolution raises accuracy; in
other words, the change in resolution has a strong effect on performance under model
scaling. Scaling all three elements at the same time was shown to produce the best
performance. Therefore, Scaled-YOLOv4 and EfficientDet fix a base model and adjust the
three elements through a scaling factor to reach the target model size. Conversely,
scaling down produces a lightweight model that is smaller than the base model. It
adjusts the same three elements as scaling up, and when done carefully it can minimize
performance degradation while reducing model size.
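As a concrete reference point (not part of the proposed method), EfficientNet [29] couples the three factors through a single compound coefficient $\phi$:
$$d = \alpha^{\phi}, \quad w = \beta^{\phi}, \quad r = \gamma^{\phi}, \qquad \text{s.t.}\;\; \alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2,\;\; \alpha, \beta, \gamma \geq 1,$$
where $d$, $w$, and $r$ multiply the baseline depth, width, and resolution. Scaling up increases $\phi$; scaling down, the direction exploited in this work, instead reduces these factors below one.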
3. The Proposed Method
3.1 Head Layer Removal for Small Objects
The baseline YOLOv4-s has three head layers [1]. For a 640$\times$640 input image, small
objects are detected on 80$\times$80 feature maps, medium objects on 40$\times$40 feature
maps, and large objects on 20$\times$20 feature maps. However, since drone images are
taken at high altitudes, most objects in them are small or medium sized. Therefore, we
removed the 20$\times$20 head layer and the connected neck layer for large objects.
This reduces the number of weights and the processing time of the non-maximum suppression
(NMS) step. The reduced model has six anchor boxes in two head layers, whereas the
original model has nine anchor boxes in three head layers. This is a common way to obtain
a lightweight model for small object detection [1], and we apply it to effectively detect
small objects in drone images.
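The idea can be summarized with a minimal sketch; the scale names and anchor values below are illustrative placeholders rather than the actual YOLOv4-s configuration (the anchors used in our experiments are re-estimated with k-means, as described in Section 4):

```python
# Minimal sketch of head layer elimination (illustrative names and anchor sizes,
# not the actual YOLOv4-s configuration).
baseline_heads = {
    "P3_80x80": {"stride": 8,  "anchors": [(10, 13), (16, 30), (33, 23)]},       # small objects
    "P4_40x40": {"stride": 16, "anchors": [(30, 61), (62, 45), (59, 119)]},      # medium objects
    "P5_20x20": {"stride": 32, "anchors": [(116, 90), (156, 198), (373, 326)]},  # large objects
}

# Drone images rarely contain large objects, so the 20x20 head (and the neck
# branch feeding it) is removed, leaving 6 of the 9 anchor boxes.
scaled_heads = {name: head for name, head in baseline_heads.items() if name != "P5_20x20"}

assert sum(len(head["anchors"]) for head in scaled_heads.values()) == 6
```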
3.2 Model Scaling
Fig. 1 shows the architecture of the proposed scaled-down model. The resolution of the
input image is fixed at 640$\times$640 without scaling down, but the model is scaled down
in both width and depth. While scaling down the width and depth, we tested how performance
changes with each scaling element. The Visdrone2019-Det dataset [32], which consists of
10 classes captured in various environments, was used for training. This work focuses on
detecting small vehicles, so we conducted experiments to find the reference points for
depth and width scaling using only the car, bus, and truck (vehicle) classes of
Visdrone2019-Det. The proposed scaling method differs from existing methods in that we
change the model structure by estimating the depth and width scaling levels that are
effective for learning small object features.
Fig. 1. Architecture of the proposed model.
3.2.1 Model Depth Scaling
Fig. 2 shows the depth scaling of the proposed network. To find the reference point for
depth scaling, we compared two models with the original: Depth-SC-V1, in which layer
removal starts at the point where the feature map reaches 20$\times$20, and Depth-SC-V2,
in which removal starts at the 40$\times$40 feature map. Each model adds one or two
convolution layers to match the feature map size when performing feature fusion between
the neck and head layers. In all tables, models without the HLE label keep the original
three head layers.
Table 1 shows the results of depth scaling. Compared with the original model, the
depth-scaled models achieved higher performance with fewer parameters. In most detection
networks, shallow levels carry global spatial information, whereas at deep levels the
feature map shrinks and carries mainly specific, semantic information. Small objects,
however, provide very little feature information to learn from, so as the network becomes
deeper and the feature map is downsampled further, small object features may be lost.
This can confuse the training process: learning becomes inefficient, and unnecessary
layers and filters remain, adding computational load. In this experiment, we observed
that removing the deep layers with reduced feature map sizes allowed small object
features to be learned efficiently and increased recall. This also gave us a criterion
for efficiently learning small object features. Therefore, the proposed network selects
Depth-SC-V2 as the depth scaling point.
Fig. 2. Depth scaling. The existing model is scaled down while maintaining its structure, and the structure is changed in the Depth-SC-V1 and Depth-SC-V2 models. The thickness of each feature map indicates its number of layers.
Table 1. Model Scaling Results.
| Scaling | Model | Parameters | Precision | Recall | mAP@.5 | mAP@.5:.95 |
|---|---|---|---|---|---|---|
| Baseline | YOLOv4-s (ORI) | 8.06M | 0.426 | 0.603 | 0.558 | 0.367 |
| Depth scaling | Depth-SC-V1 | 6.28M | 0.402 | 0.595 | 0.556 | 0.368 |
| Depth scaling | Depth-SC-V2 | 6.40M | 0.414 | 0.62 | 0.57 | 0.380 |
| Width scaling | Width-SC-V1 | 6.48M | 0.392 | 0.609 | 0.555 | 0.365 |
| Width scaling | Width-SC-V2 | 2.73M | 0.396 | 0.594 | 0.549 | 0.358 |
| Compound scaling | Depth-SC-V2 + Width-SC-V1 | 5.80M | 0.417 | 0.628 | 0.579 | 0.39 |
| Compound scaling | Depth-SC-V2 + Width-SC-V2 | 2.42M | 0.386 | 0.612 | 0.555 | 0.369 |
| Compound scaling + HLE | Depth-SC-V2 + Width-SC-V1 + HLE | 2.55M | 0.490 | 0.580 | 0.568 | 0.387 |
| Compound scaling + HLE | Depth-SC-V2 + Width-SC-V2 + HLE | 1.59M | 0.468 | 0.583 | 0.556 | 0.374 |
3.2.2 Model Width Scaling
For width scaling, we compared two models with the original: Width-SC-V1, which limits
the maximum number of filters to 256, and Width-SC-V2, which limits it to 128. Table 1
shows the results of width scaling. The number of filters has a large effect on model
size, and the lighter the model, the larger this effect becomes. More filters clearly
allow more detailed information to be learned, but when a lightweight model is required,
it is important to find the minimum number of filters that is still sufficient.
In the width scaling experiment, we observed that the more the model is scaled down, the
lower its performance becomes. Compared with the original model, Width-SC-V1 is slightly
lighter while preserving similar performance. Width-SC-V2 loses slightly more performance
but achieves a significant reduction in model size. In terms of performance, Width-SC-V1
should be adopted, but Width-SC-V2 is attractive in terms of weight reduction. Therefore,
both Width-SC-V1 and Width-SC-V2 were combined with depth scaling to test compound scaling.
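A minimal sketch of the capping rule behind width scaling is shown below; the channel progression is an illustrative example, not the actual YOLOv4-s channel list:

```python
# Width scaling by capping the per-layer filter (channel) count.
def scale_width(filter_counts, max_filters):
    """Clip every layer's channel count to the given maximum."""
    return [min(c, max_filters) for c in filter_counts]

baseline = [32, 64, 128, 256, 512, 512]    # illustrative channel progression
width_sc_v1 = scale_width(baseline, 256)   # -> [32, 64, 128, 256, 256, 256]
width_sc_v2 = scale_width(baseline, 128)   # -> [32, 64, 128, 128, 128, 128]
print(width_sc_v1, width_sc_v2)
```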
3.2.3 Model Compound Scaling
Both Width-SC-V1 and Width-SC-V2 were combined with the Depth-SC-V2 model for compound
scaling. The experimental results are shown in Table 1. Although the model was made
lighter by scaling down the depth and width, the overall performance was maintained or
even improved, because eliminating unnecessary layers and filters yields a model structure
that is efficient for small object detection. Combining compound scaling with the head
layer elimination of Section 3.1 is shown in the last rows of Table 1: a significant
reduction in model size was achieved by down-scaling together with head layer elimination.
An important observation from these two experiments is that performance improved through
model scaling alone, which shows that the model structure itself matters in small object
detection.
3.2.4 Model Scaling Network
Based on the experimental results, the proposed network removes the deep convolution
layers at the bottom of the network starting from the point (Depth-SC-V2) where the
feature map becomes 40$\times$40 in the baseline network. In addition, the network is
scaled down by combining it with the width scaling model (Width-SC-V1) that limits the
number of filters per layer to 256. Purely in terms of weight reduction, the combination
of Depth-SC-V2 and Width-SC-V2 is also a good option; the former combination was chosen
as a compromise between light weight and performance. Thus, we designed a network that
preserves object location information and global features by minimizing the reduction of
feature maps through depth scaling, and that uses the minimum necessary number of filters
through width scaling. The proposed deep neural network applies model scaling to the depth
and width of the backbone network to construct an efficient network suitable for small
object detection on low-power embedded hardware. However, some feature information is
inevitably lost through model scaling.
3.3 Attention Stacked Hourglass Network
The ASHN, shown in Fig. 3, was added to compensate for the lost feature information and
to detect small objects effectively. The original hourglass network [33] was proposed for
human pose estimation and extracts diverse feature information through repeated
downsampling and upsampling across multiple scales. In addition, by combining feature
maps, it can re-estimate features of the overall image. A stacked hourglass network [33]
stacks multiple hourglass networks. We use this structure to compensate for the feature
information lost through depth and width scaling. To put more weight on small object
features, an attention mechanism is added to the hourglass network, yielding the attention
stacked hourglass network shown in Fig. 3.
Fig. 3. Attention Stacked Hourglass Network.
3.3.1 Feature Fusion using Attention Stacked Hourglass Network
In the backbone network, the shallow, medium, and deep levels carry different feature
information. M2Det [7] showed that traffic signs, cars, and pedestrians have different
feature characteristics and are detected at different levels and scales. Since most
objects in drone images are small and contain little feature information, it is better
to focus on features suited to small objects.
Low-level features are generally useful for detecting small objects, but high-level
features that capture the context of the image are also required; using both improves
detection performance. Therefore, feature fusion is conducted with the attention stacked
hourglass network to extract small object features from the network.
3.3.2 Attention Stacked Hourglass Network Structure
Two things were considered when designing the ASHN, as shown in Fig. 3. First, it should
supplement the feature information across various feature map sizes that is lost through
depth scaling. Second, it should supplement the filter information needed to learn
diverse, fine features that is lost through width scaling. The ASHN was therefore
designed as follows: two hourglass networks are stacked, features are extracted at the
80$\times$80, 40$\times$40, and 20$\times$20 feature map sizes by max pooling twice, and
the number of filters increases by 128 each time the feature map size decreases.
We aimed to extract scale-specific features from shallow-level feature information. The
repeated downsampling and upsampling of the hourglass network compensates for the feature
map sizes lost through backbone depth scaling, and increasing the number of filters inside
the hourglass network compensates for the filters removed by backbone width scaling, so
feature fusion is performed with richer information.
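A PyTorch-style sketch of such a two-stage hourglass is given below. The block design, channel counts, and fusion details are simplifying assumptions rather than the exact ASHN implementation, and the CBAM-style attention described next would be inserted into each residual block:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Plain residual block; in the ASHN a CBAM attention module is added inside it."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
            nn.Conv2d(c_out, c_out, 3, padding=1), nn.BatchNorm2d(c_out))
        self.skip = nn.Conv2d(c_in, c_out, 1) if c_in != c_out else nn.Identity()
    def forward(self, x):
        return torch.relu(self.body(x) + self.skip(x))

class Hourglass(nn.Module):
    """One hourglass: 80x80 -> 40x40 -> 20x20 by max pooling, then back up,
    with the channel count growing by 128 at each downsampling."""
    def __init__(self, c=128):
        super().__init__()
        self.pool = nn.MaxPool2d(2)
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.down1, self.down2 = ResidualBlock(c, c + 128), ResidualBlock(c + 128, c + 256)
        self.up1, self.up2 = ResidualBlock(c + 256, c + 128), ResidualBlock(c + 128, c)
    def forward(self, x):
        d1 = self.down1(self.pool(x))      # 40x40
        d2 = self.down2(self.pool(d1))     # 20x20
        u1 = self.up1(self.up(d2)) + d1    # fuse 40x40 features
        return self.up2(self.up(u1)) + x   # fuse back to 80x80

# Two hourglasses stacked, as in the ASHN.
ashn = nn.Sequential(Hourglass(128), Hourglass(128))
print(ashn(torch.randn(1, 128, 80, 80)).shape)  # torch.Size([1, 128, 80, 80])
```

Because each hourglass returns to the resolution of its input, the stack can be dropped into the middle of the network where feature fusion is performed.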
Additionally, to learn small object features effectively, an attention module with the
CBAM [34] structure is added to each residual block so that the ASHN concentrates its
weights on small objects. Attention modules are broadly divided into channel attention [35]
and spatial attention [36]. Channel attention extracts one value per channel that
summarizes the important part of the feature in that channel; passing these values through
a sigmoid and multiplying them with the input feature map gives a high weight to the
channels that need attention.
Spatial attention creates a one-channel feature map through max and average pooling across
all channels; applying a sigmoid to this map and multiplying it with the input feature map
gives a high weight to the spatial positions that need attention. Thus, channel attention
attends to ``what,'' and spatial attention attends to ``where.''
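A compact sketch of a CBAM-style block is shown below; the reduction ratio and the 7$\times$7 kernel are common defaults reported in [34] and are used here as assumptions:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """'What' to attend to: one weight per channel from global average/max pooling."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False))
    def forward(self, x):
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))
        return x * torch.sigmoid(avg + mx)

class SpatialAttention(nn.Module):
    """'Where' to attend to: a one-channel map from channel-wise average/max pooling."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)
    def forward(self, x):
        avg = torch.mean(x, dim=1, keepdim=True)
        mx, _ = torch.max(x, dim=1, keepdim=True)
        return x * torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class CBAM(nn.Module):
    """Channel attention followed by spatial attention, applied sequentially [34]."""
    def __init__(self, channels):
        super().__init__()
        self.ca, self.sa = ChannelAttention(channels), SpatialAttention()
    def forward(self, x):
        return self.sa(self.ca(x))

print(CBAM(128)(torch.randn(1, 128, 40, 40)).shape)  # torch.Size([1, 128, 40, 40])
```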
CBAM [34] is applied to our network as a module that processes channel attention and
spatial attention sequentially. Table 2 shows the result of adding the ASHN to the two
compound-scaled models adopted in Section 3.2. Although the models are lighter than the
original, their performance improves: feature fusion using the ASHN contributes to this
improvement by concentrating weights on the features of small objects. In particular,
precision increases significantly. Small objects are easily confused with the background
and with other objects precisely because they are so small, and the proposed network
mitigates this problem, which is reflected in the large precision gain.
Table 2. Compound + Head Layer Elimination + ASHN Experimental Results.
| Model | Parameters | Precision | Recall | mAP@.5 | mAP@.5:.95 |
|---|---|---|---|---|---|
| YOLOv4-s (Ori) | 8.06M | 0.426 | 0.603 | 0.558 | 0.367 |
| Depth-SC-V2 + Width-SC-V1 + HLE + ASHN | 5.73M | 0.534 | 0.60 | 0.591 | 0.399 |
| Depth-SC-V2 + Width-SC-V2 + HLE + ASHN | 4.77M | 0.502 | 0.579 | 0.571 | 0.384 |
4. Experimental Results
The experiments were conducted on an RTX 3090 (24 GB) GPU. YOLOv4-s (Ori) and our network
were tested under the same conditions: a batch size of 16, a learning rate of 0.00261,
and a 640$\times$640 input size. Anchor boxes were obtained with k-means clustering: the
original model was trained with nine anchor boxes and our network with six.
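A minimal sketch of the anchor clustering step is shown below; it uses plain k-means with a Euclidean distance for brevity (YOLO-style implementations commonly use an IoU-based distance), and the box sizes are synthetic, whereas the paper clusters the ground-truth boxes of the training set:

```python
import numpy as np

def kmeans_anchors(wh, k, iters=100, seed=0):
    """Cluster ground-truth (width, height) pairs into k anchor boxes."""
    rng = np.random.default_rng(seed)
    centers = wh[rng.choice(len(wh), k, replace=False)]
    for _ in range(iters):
        # Assign each box to its nearest anchor (Euclidean distance for simplicity).
        labels = np.argmin(((wh[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = wh[labels == j].mean(axis=0)
    return centers[np.argsort(centers.prod(axis=1))]  # sorted by box area

# Synthetic box sizes for illustration only.
boxes_wh = np.abs(np.random.randn(1000, 2)) * 40 + 5
print(kmeans_anchors(boxes_wh, k=6))  # six anchors for the two remaining heads
```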
Table 3 compares the number of parameters, precision, recall, and mAP on each dataset.
Precision reflects false detections, while recall, an important indicator in small object
detection, reflects missed detections. mAP is the area under the precision-recall curve
measured at each IoU threshold and is used as the detection evaluation metric. In this
study, vehicles (cars, trucks, and buses) were evaluated for precision and recall at an
IoU of 0.5. In Visdrone2019-Det, the car and van classes were both treated as cars.
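Evaluation at a single IoU threshold can be sketched as follows (a simplified greedy matcher; mAP additionally integrates precision over recall as the confidence threshold varies and, for mAP@.5:.95, averages over IoU thresholds from 0.5 to 0.95):

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-9)

def precision_recall(detections, gts, iou_thr=0.5):
    """Greedily match detections (sorted by confidence) to ground-truth boxes."""
    matched, tp = set(), 0
    for det in detections:
        best_j = max(range(len(gts)), key=lambda j: iou(det, gts[j]), default=None)
        if best_j is not None and best_j not in matched and iou(det, gts[best_j]) >= iou_thr:
            matched.add(best_j)
            tp += 1
    fp, fn = len(detections) - tp, len(gts) - len(matched)
    precision = tp / (tp + fp + 1e-9)  # low precision -> many false detections
    recall = tp / (tp + fn + 1e-9)     # low recall -> many missed objects
    return precision, recall

dets = [(10, 10, 50, 50), (200, 200, 240, 240)]   # hypothetical detections
gts = [(12, 12, 52, 52), (100, 100, 140, 140)]    # hypothetical ground truth
print(precision_recall(dets, gts))  # (~0.5, ~0.5): one match, one false alarm, one miss
```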
Our network has 5.73M parameters, which is about 1.4 times fewer than the original
YOLOv4-s (ORI) model. In addition, a high mAP was achieved by supplementing small object
feature information through feature fusion with the ASHN. In particular, precision
improves much more than the other indicators; since precision reflects false detections,
our network is more robust against false detections than the original model. Moreover,
our network is not limited to small vehicles and is also applicable to small objects of
various classes.
Table 4 shows the experimental results for the 10 classes (pedestrian, person, car, van,
bus, truck, motor, bicycle, awning tricycle, and tricycle) used in the Visdrone-DET2019
challenge. The results of this experiment can be summarized as follows. When designing a
model for small object detection, an overly deep model can hinder the learning of small
object features, so the depth scale plays an important role in small object detection
tasks. In addition, feature fusion using the ASHN contributes to model performance by
extracting small object features at various scales and with various filters.
For general object detection models, accuracy is roughly proportional to model size, but
no single model performs well in every task. The contribution of this study is a method
to efficiently construct a model that detects small objects and can run in a low-power
embedded environment. A more efficient model can be designed by constructing the model
differently for each domain, application, and class.
Fig. 4 shows qualitative comparison results on the Visdrone-DET2019 test set. Cars are
drawn in red, trucks in green, and buses in blue. Our model performed better than the
original model at various angles, altitudes, and illumination levels. As shown in Table 3,
our network is robust against misclassification. In the low-light scene, the original
model produced misclassifications and missed detections, whereas our network detected
the objects correctly. Also, unlike the original model, which detects the background as
a class at high altitude, our network detects small objects well.
Fig. 4. Experimental results on the Visdrone-DET2019 test set. Each example consists of three images: the top shows the ground truth (GT), the middle the original model, and the bottom our network.
Table 3. Experimental Results with Various Datasets.
| Dataset | Model | Parameters | Precision | Recall | mAP@.5 | mAP@.5:.95 |
|---|---|---|---|---|---|---|
| Visdrone-DET2019 [32] | YOLOv4-s (Ori) | 8.06M | 0.426 | 0.603 | 0.558 | 0.367 |
| Visdrone-DET2019 [32] | Our network | 5.73M | 0.534 | 0.60 | 0.591 | 0.399 |
| UAVDT [37] | YOLOv4-s (Ori) | 8.06M | 0.655 | 0.991 | 0.992 | 0.747 |
| UAVDT [37] | Our network | 5.73M | 0.785 | 0.991 | 0.992 | 0.774 |
| CARPK [38] | YOLOv4-s (Ori) | 8.06M | 0.505 | 0.996 | 0.996 | 0.819 |
| CARPK [38] | Our network | 5.73M | 0.682 | 0.997 | 0.997 | 0.844 |
Table 4. Visdrone2019-Det 10 Class Experimental Results.
| Model | Parameters | Precision | Recall | mAP@.5 | mAP@.5:.95 |
|---|---|---|---|---|---|
| YOLOv4-s (Ori) | 8.08M | 0.296 | 0.426 | 0.364 | 0.213 |
| Our network | 5.74M | 0.372 | 0.429 | 0.391 | 0.233 |
5. Conclusion
In this paper, we proposed an efficient, lightweight deep neural network model for small
vehicle detection in a drone environment. Considering the drone environment, in which
small objects dominate, the model was made lightweight by eliminating the head layer
responsible for large objects and by efficient model scaling. In addition, we compensated
for the lost information through feature fusion using the ASHN and focused on the feature
information of small objects. As a result, the number of model parameters was reduced by
a factor of about 1.4 compared to the original model, while the mAP improved.
The model can be applied in various ways, such as traffic jam identification, illegal
parking detection, and intelligent traffic systems. It can also be applied to low-power
embedded platforms such as CCTV and portable cameras as well as drones. Lastly, our
network is not limited to vehicles and can be used for small object detection tasks
involving various classes.
ACKNOWLEDGMENTS
This work was supported by the Institute of Information & Communications Technology
Planning & Evaluation (IITP) grant, which is funded by the Korean government (MSIT)
(2021-0-01341, Artificial Intelligence Graduate School Program (Chung-Ang University)),
and financially supported by the Institute of Civil-Military Technology Cooperation
Program funded by the Defense Acquisition Program Administration and Ministry of Trade,
Industry and Energy of Korean government under grant No. UM20311RD3.
REFERENCES
A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, ``Yolov4: Optimal speed and accuracy
of object detection,'' arXiv preprint arXiv:2004.10934, 2020.
S. Ren, K. He, R. Girshick, and J. Sun, ``Faster r-cnn: Towards real-time object
detection with region proposal networks,'' Advances in neural information processing
systems, vol. 28, pp. 91-99, 2015.
Z. Cai and N. Vasconcelos, ``Cascade r-cnn: Delving into high quality object detection,''
in Proceedings of the IEEE conference on computer vision and pattern recognition,
2018, pp. 6154-6162.
J. Pang, K. Chen, J. Shi, H. Feng, W. Ouyang, and D. Lin, ``Libra r-cnn: Towards balanced
learning for object detection,'' in Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, 2019, pp. 821-830.
R. Girshick, J. Donahue, T. Darrell, and J. Malik, ``Rich feature hierarchies for
accurate object detection and semantic segmentation,'' in Proceedings of the IEEE
conference on computer vision and pattern recognition, 2014, pp. 580-587.
W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, ``Ssd:
Single shot multibox detector,'' in European conference on computer vision. Springer,
2016, pp. 21-37.
Q. Zhao, T. Sheng, Y. Wang, Z. Tang, Y. Chen, L. Cai, and H. Ling, ``M2det: A single-shot
object detector based on multi-level feature pyramid network,'' in Proceedings of
the AAAI conference on artificial intelligence, vol. 33, no. 01, 2019, pp. 9259-9266.
P. Purkait, C. Zhao, and C. Zach, ``Spp-net: Deep absolute pose regression with synthetic
views,'' arXiv preprint arXiv:1712.03452, 2017.
S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia, ``Path aggregation network for instance
segmentation,'' in Proceedings of the IEEE conference on computer vision and pattern
recognition, 2018, pp. 8759-8768.
T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C.
L. Zitnick, ``Microsoft coco: Common objects in context,'' in European conference
on computer vision. Springer, 2014, pp. 740-755.
Z. Cai, Q. Fan, R. S. Feris, and N. Vasconcelos, ``A unified multi-scale deep convolutional
neural network for fast object detection,'' in European conference on computer vision.
Springer, 2016, pp. 354-370.
C.-Y. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg, ``Dssd: Deconvolutional single
shot detector,'' arXiv preprint arXiv:1701.06659, 2017.
T. Kong, A. Yao, Y. Chen, and F. Sun, ``Hypernet: Towards accurate region proposal
generation and joint object detection,'' in Proceedings of the IEEE conference on
computer vision and pattern recognition, 2016, pp. 845-853.
T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, ``Feature
pyramid networks for object detection,'' in Proceedings of the IEEE conference on
computer vision and pattern recognition, 2017, pp. 2117-2125.
Y. Liu, P. Sun, N. Wergeles, and Y. Shang, ``A survey and performance evaluation of
deep learning methods for small object detection,'' Expert Systems with Applications,
p. 114602, 2021.
R. Girshick, J. Donahue, T. Darrell, and J. Malik, ``Rich feature hierarchies for
accurate object detection and semantic segmentation,'' in Proceedings of the IEEE
conference on computer vision and pattern recognition, 2014, pp. 580-587.
L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, ``Deeplab: Semantic
image segmentation with deep convolutional nets, atrous convolution, and fully connected
crfs,'' IEEE transactions on pattern analysis and machine intelligence, vol. 40, no.
4, pp. 834-848, 2017.
F. Yu and V. Koltun, ``Multi-scale context aggregation by dilated convolutions,''
arXiv preprint arXiv:1511.07122, 2015.
Y. Li, Y. Chen, N. Wang, and Z. Zhang, ``Scale-aware trident networks for object detection,''
in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019,
pp. 6054-6063.
B. Singh and L. S. Davis, ``An analysis of scale invariance in object detection snip,''
in Proceedings of the IEEE conference on computer vision and pattern recognition,
2018, pp. 3578-3587.
B. Singh, M. Najibi, and L. S. Davis, ``Sniper: Efficient multi-scale training,''
arXiv preprint arXiv:1805.09300, 2018.
M. Kisantal, Z. Wojna, J. Murawski, J. Naruniec, and K. Cho, ``Augmentation for small
object detection,'' arXiv preprint arXiv:1902.07296, 2019.
B. Zoph, E. D. Cubuk, G. Ghiasi, T.-Y. Lin, J. Shlens, and Q. V. Le, ``Learning data
augmentation strategies for object detection,'' in European Conference on Computer
Vision. Springer, 2020, pp. 566-583.
A. Shrivastava, A. Gupta, and R. Girshick, ``Training region-based object detectors
with online hard example mining,'' in Proceedings of the IEEE conference on computer
vision and pattern recognition, 2016, pp. 761-769.
Y. Cao, K. Chen, C. C. Loy, and D. Lin, ``Prime sample attention in object detection,''
in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,
2020, pp. 11 583-11 591.
K. Chen, J. Li, W. Lin, J. See, J. Wang, L. Duan, Z. Chen, C. He, and J. Zou, ``Towards
accurate one-stage object detection with ap-loss,'' in Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, 2019, pp. 5119-5127.
Q. Qian, L. Chen, H. Li, and R. Jin, ``Dr loss: Improving object detection by distributional
ranking,'' in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, 2020, pp. 12 164-12 172.
P. Dollár, M. Singh, and R. Girshick, ``Fast and accurate model scaling,'' in Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 924-932.
M. Tan and Q. Le, ``Efficientnet: Rethinking model scaling for convolutional neural
networks,'' in International Conference on Machine Learning. PMLR, 2019, pp. 6105-6114.
C.-Y. Wang, A. Bochkovskiy, and H.-Y. M. Liao, ``Scaled-yolov4: Scaling cross stage
partial network,'' in Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, 2021, pp. 13 029-13 038.
M. Tan, R. Pang, and Q. V. Le, ``Efficientdet: Scalable and efficient object detection,''
in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition,
2020, pp. 10 781-10 790.
D. Du, P. Zhu, L. Wen, X. Bian, H. Lin, Q. Hu, T. Peng, J. Zheng, X. Wang, Y. Zhang
et al., ``Visdrone-det2019: The vision meets drone object detection in image challenge
results,'' in Proceedings of the IEEE/CVF International Conference on Computer Vision
Workshops, 2019, pp. 0-0.
A. Newell, K. Yang, and J. Deng, ``Stacked hourglass networks for human pose estimation,''
in European conference on computer vision. Springer, 2016, pp. 483-499.
S. Woo, J. Park, J.-Y. Lee, and I. S. Kweon, ``Cbam: Convolutional block attention
module,'' in Proceedings of the European conference on computer vision (ECCV), 2018,
pp. 3-19.
J. Hu, L. Shen, and G. Sun, ``Squeeze-and-excitation networks,'' in Proceedings
of the IEEE conference on computer vision and pattern recognition, 2018, pp. 7132-7141.
J. Park, S. Woo, J.-Y. Lee, and I. S. Kweon, ``Bam: Bottleneck attention module,''
arXiv preprint arXiv:1807.06514, 2018.
D. Du, Y. Qi, H. Yu, Y. Yang, K. Duan, G. Li, W. Zhang, Q. Huang, and Q. Tian, ``The
unmanned aerial vehicle benchmark: Object detection and tracking,'' in Proceedings
of the European Conference on Computer Vision (ECCV), 2018, pp. 370-386.
M.-R. Hsieh, Y.-L. Lin, and W. H. Hsu, ``Drone-based object counting by spatially
regularized regional proposal network,'' in Proceedings of the IEEE international
conference on computer vision, 2017, pp. 4145-4153.
J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, ``You only look once: Unified,
real-time object detection,'' in Proceedings of the IEEE conference on computer vision
and pattern recognition, 2016, pp. 779-788.
Author
Mingi Kim was born in Okcheon, Korea, in 1996. He received a B.S. degree in data
analysis from Hannam University, South Korea, in 2021. He is currently pursuing an
M.S. degree with the Department of Artificial Intelligence, Chung-Ang University.
Heegwang Kim was born in Seoul, Korea, in 1992. He received a B.S. degree in electronic
engineering from Soongsil University, Korea, in 2016. He received an M.S. degree in
image science from Chung-Ang University, Korea, in 2018. Currently, he is pursuing
a Ph.D. degree in image engineering at Chung-Ang University.
Chanyeong Park was born in Seoul, South Korea, in 1997. He received a B.S. degree
in computer science from Coventry University in 2021. Currently, he is pursuing an
M.S. degree in image processing at Chung-Ang University. His research interests include
object detection and monocular 3D object detection.
Joonki Paik was born in Seoul, South Korea, in 1960. He received a B.S. degree in
control and instrumentation engineering from Seoul National University in 1984 and
M.Sc. and Ph.D. degrees in electrical engineering and computer science from Northwestern
University in 1987 and 1990, respectively. From 1990 to 1993, he worked at Samsung
Electronics, where he designed image stabilization chipsets for consumer camcorders.
Since 1993, he has been a member of the faculty of Chung-Ang University, Seoul, Korea,
where he is currently a professor with the Graduate School of Advanced Imaging Science,
Multimedia, and Film. From 1999 to 2002, he was a visiting professor with the Department
of Electrical and Computer Engineering, University of Tennessee, Knoxville. Since
2005, he has been the director of the National Research Laboratory in the field of
image processing and intelligent systems. From 2005 to 2007, he served as the dean
of the Graduate School of Advanced Imaging Science, Multimedia, and Film. From 2005
to 2007, he was the director of the Seoul Future Contents Convergence Cluster established
by the Seoul Research and Business Development Program. In 2008, he was a full-time
technical consultant for the System LSI Division of Samsung Electronics, where he
developed various computational photographic techniques, including an extended depth
of field system. He has served as a member of the Presidential Advisory Board for
Scientific/Technical Policy with the Korean Government and is currently serving as
a technical consultant for the Korean Supreme Prosecutor's Office for computational
forensics. He is a two-time recipient of the Chester-Sall Award from the IEEE Consumer
Electronics Society, the Academic Award from the Institute of Electronic Engineers
of Korea, and the Best Research Professor Award from Chung-Ang University. He has
served the Consumer Electronics Society of the IEEE as a member of the editorial board,
vice president of international affairs, and director of sister and related societies
committee.