Mingi Kim1, Heegwang Kim2, Chanyeong Park2, Joonki Paik1,2,*
1 Graduate School of Artificial Intelligence, Chung-Ang University, Seoul 06974, Korea (mgkim@ipis.cau.ac.kr)
2 Department of Image, Graduate School of Advanced Imaging Science, Multimedia and Film, Chung-Ang University, Seoul 06974, Korea ({heegwang, chanyeong}@ipis.cau.ac.kr, paikj@cau.ac.kr)
Copyright © The Institute of Electronics and Information Engineers (IEIE)
Keywords
Small object detection, Deep learning, Light-weight, Attention mechanism
1. Introduction
Object detection using deep learning has become popular in various applications, and
drone-based detection is particularly useful in fields such as the military, industry,
security, and transportation. In particular, drone-based traffic analysis is gaining
increasing attention for tasks such as traffic jam identification, illegal parking
detection, and intelligent traffic control. However, drone images differ from those
captured by CCTV or vehicle-mounted cameras: they are taken at high altitudes and
varying angles and mainly contain small objects with diverse feature shapes. Because
general object detection models require many parameters, high power consumption, and
a large memory footprint, they are not suitable for the low-power embedded systems on
drones.
Model scaling techniques [28,29] can be used to create an efficient model suitable for
drones and other low-power embedded systems. A typical model scaling technique reduces
the size of a model by changing the depth, width, and input resolution of the backbone
network [30,31]. Building on this idea, we propose a novel, efficient, lightweight deep
neural network model for small vehicle detection in drone images. YOLOv4-s, the lightest
version for real-time object detection, is used as the baseline model [1]. Since drone
images mostly contain small objects, we removed the head layer that detects large objects
and performed efficient model scaling for small object detection.
To compensate for the lost information, an attention stacked hourglass network (ASHN)
was added to the middle level of the backbone network, where feature fusion is performed.
The ASHN is designed to restore the multi-scale and filter features lost through the
preceding steps. The purpose of this study is to design a model that can be used for
specific tasks in hardware-limited environments such as drones. Through the proposed
methods, we achieved an efficient lightweight deep neural network model that detects
small vehicles in drone images. The paper is organized as follows: Section 2 reviews
related work on small object detection and model scaling techniques, Section 3 presents
the proposed efficient lightweight deep neural network model, Section 4 summarizes the
experimental results, and Section 5 concludes the paper.
2. Related Work
2.1 Object Detection in UAV Images
Unmanned aerial vehicles (UAVs) equipped with cameras can flexibly acquire ground
images without geographical restrictions. UAV images are therefore widely used to detect
humans, vehicles, and military targets in applications such as search and rescue. For
robust and accurate detection from UAV images in real-life environments, state-of-the-art
deep learning-based object detectors must be scaled down to reduce their size and memory
footprint. We briefly discuss existing detectors in terms of the number of stages.
2.1.1 Two-stage Detector
Regions with convolutional neural network features (R-CNN) [5] is a two-stage detector that sequentially performs region proposal and classification.
Several enhanced methods have been proposed to improve its performance.
Faster R-CNN [2] computes regions of interest (ROIs) with a region proposal network (RPN) instead
of selective search; the RPN accelerates ROI computation on the GPU and improves learning accuracy.
Cascade R-CNN [3] uses a sequence of classifiers, each of which receives the bounding boxes of the
previous stage and performs a new classification task under the assumption that the boxes become
more accurate at each step; the classifier at each subsequent stage is trained with a higher
intersection-over-union (IoU) threshold than the previous one. For balanced learning,
Libra R-CNN [4] addresses the imbalance in object detection at three levels: sample, feature, and
objective. It solves the imbalance problem by integrating three new components:
IoU-balanced sampling, a balanced feature pyramid, and a balanced $\ell_1$ loss.
2.1.2 One-stage Detector
The single shot multibox detector (SSD) [6] and ``you only look once'' (YOLO) [39] are one-stage detectors that perform classification and region proposal at the
same time. They improve detection speed by replacing the last fully connected
(FC) layer of the network with a convolution layer. SSD estimates objects using
default boxes with different scales and aspect ratios for each feature map cell.
M2Det [7] has a multi-level feature pyramid network (MLFPN) that consists of three modules
to find objects of different sizes and appearance complexity. The feature fusion
module (FFM) creates a base feature by fusing shallow and deep features from the
backbone, and thinned U-shape modules (TUMs) are stacked, with the second version of
the feature fusion module (FFMv2) connecting each TUM to the next. M2Det uses
multi-level, multi-scale features through scale-wise feature concatenation and
channel-wise attention, and the model combines the MLFPN with SSD in an end-to-end form.
YOLOv4 combines a CSPDarknet53-based backbone architecture with spatial pyramid pooling
(SPP) [8] and a path aggregation network (PAN) [9]. This enables fast learning and inference, as well as high performance with a single
GPU [1]. In addition, various data augmentation techniques were presented to improve the
detection performance without increasing inference time.
A fast, lightweight detector is needed for drones and low-power embedded systems.
However, widely used one-stage detectors are generally evaluated on MS COCO [10] and are
difficult to apply directly to UAV images, which are captured at high altitudes and
various angles; their accuracy drops because of complex backgrounds, small object sizes,
and appearance changes with viewing angle. Small object detection suitable for UAV
images is therefore challenging.
2.2 Small Object Detection
Detecting small objects is a challenging task in computer vision because of the limited
number of pixels per object and the imbalanced amount of information between the background
and objects. To detect small objects, it is common practice to make the CNN deeper to
obtain a higher-level feature map containing semantic information about the object, at
the cost of losing low-level spatial information. To address this problem, various
methods combining shallow and deep features have been proposed [6,9,11-14]; such methods
retain shallow-level features even in deeper layers.
Another challenge is the limited amount of contextual information available for small
objects, which are as small as 32$\times$32 pixels [15]. Local context carries very important
information such as the edge, color, and texture of an object. To compensate for the lack
of local pixel context, the filter size of the network can be increased, or deconvolution
layers can be added to recover higher-resolution feature maps [16-18].
Recently, multi-scale feature maps have been widely used for small object detection.
However, the matching ratio between feature maps and ground-truth small objects is still
insufficient because of inappropriate anchor settings, which leads to lower performance
on small objects than on large objects. To address this imbalance, several methods
generate positive examples for small objects using multi-scale feature maps and anchor
boxes [14,19-21].
Only anchors with high IoU scores are designated as positive examples, and all others
are considered negative, which results in a severe imbalance between positive and
negative examples. To mitigate this, some methods resample or re-weight examples
according to the data distribution so that positive and negative examples contribute
in similar proportions during training [4,6,22-25]. Another approach designs a new loss
function that re-weights the imbalanced positive and negative examples at each training
step [26,27].
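As a toy illustration of the IoU-based labeling rule above (the 0.7/0.3 thresholds follow common RPN-style settings and are not taken from the cited papers), the following sketch shows how few anchors end up positive:

```python
import numpy as np

def label_anchors(ious, pos_thr=0.7, neg_thr=0.3):
    """Label anchors from an (num_anchors, num_gt) IoU matrix:
    1 = positive, 0 = negative (background), -1 = ignored."""
    best_iou = ious.max(axis=1)
    labels = np.full(len(ious), -1)
    labels[best_iou < neg_thr] = 0
    labels[best_iou >= pos_thr] = 1
    return labels

# With thousands of anchors and only a few small objects, negatives dominate.
rng = np.random.default_rng(0)
ious = rng.uniform(0.0, 0.25, size=(8000, 5))        # most anchors barely overlap any object
ious[rng.choice(8000, 40, replace=False), 0] = 0.8   # a few anchors fit a ground-truth box well
labels = label_anchors(ious)
print((labels == 1).sum(), (labels == 0).sum())      # 40 positives vs. 7960 negatives
```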
2.3 Model Scaling
Model scaling is a technique that adjusts the size of a model by changing its width,
depth, and input resolution, the factors that determine the size and computational cost
of a baseline model [28,29]. Model scale-up with these factors was applied in
Scaled-YOLOv4 [30] and EfficientDet [31]. Width scaling changes the number of filters
(channels); a wider network is commonly understood to extract finer information. Depth
scaling changes the number of layers, and resolution scaling changes the resolution of
the input image.
For EfficientDet, various model scaling conditions were tested. Increasing width or
depth makes training converge earlier, while increasing resolution raises accuracy; in
other words, the change in resolution has a strong effect on performance under model
scaling. Scaling all three elements at the same time was shown to produce the best
performance. Therefore, Scaled-YOLOv4 and EfficientDet fix a base model and adjust the
three elements through a scaling factor to reach the target model size. Conversely,
scaling down produces a lightweight model that is smaller than the base model. It
adjusts the same three elements as scaling up, and when done carefully it can minimize
performance degradation while reducing model size.
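As a concrete reference point (not part of the proposed method), EfficientNet [29] couples the three factors through a single compound coefficient $\phi$:
$$d = \alpha^{\phi}, \quad w = \beta^{\phi}, \quad r = \gamma^{\phi}, \qquad \text{s.t.}\;\; \alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2,\;\; \alpha, \beta, \gamma \geq 1,$$
where $d$, $w$, and $r$ multiply the baseline depth, width, and resolution. Scaling up increases $\phi$; scaling down, the direction exploited in this work, instead reduces these factors below one.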
3. The Proposed Method
3.1 Head Layer Removal for Small Objects
The baseline YOLOv4-s has three head layers [1]. For a 640$\times$640 input image, small
objects are detected on 80$\times$80 feature maps, medium objects on 40$\times$40 feature
maps, and large objects on 20$\times$20 feature maps. However, since drone images are
taken at high altitudes, most objects in them are small or medium sized. Therefore, we
removed the 20$\times$20 head layer and the connected neck layer for large objects.
This reduces the number of weights and the processing time of the non-maximum suppression
(NMS) step. The reduced model has six anchor boxes in two head layers, whereas the
original model has nine anchor boxes in three head layers. This is a common way to obtain
a lightweight model for small object detection [1], and we apply it to effectively detect
small objects in drone images.
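The idea can be summarized with a minimal sketch; the scale names and anchor values below are illustrative placeholders rather than the actual YOLOv4-s configuration (the anchors used in our experiments are re-estimated with k-means, as described in Section 4):

```python
# Minimal sketch of head layer elimination (illustrative names and anchor sizes,
# not the actual YOLOv4-s configuration).
baseline_heads = {
    "P3_80x80": {"stride": 8,  "anchors": [(10, 13), (16, 30), (33, 23)]},       # small objects
    "P4_40x40": {"stride": 16, "anchors": [(30, 61), (62, 45), (59, 119)]},      # medium objects
    "P5_20x20": {"stride": 32, "anchors": [(116, 90), (156, 198), (373, 326)]},  # large objects
}

# Drone images rarely contain large objects, so the 20x20 head (and the neck
# branch feeding it) is removed, leaving 6 of the 9 anchor boxes.
scaled_heads = {name: head for name, head in baseline_heads.items() if name != "P5_20x20"}

assert sum(len(head["anchors"]) for head in scaled_heads.values()) == 6
```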
3.2 Model Scaling
Fig. 1 shows the architecture of the proposed scaled-down model. The resolution of the
input image is fixed at 640$\times$640 without scaling down, but the model is scaled down
in both width and depth. While scaling down the width and depth, we tested how performance
changes with each scaling element. The Visdrone2019-Det dataset [32], which consists of
10 classes captured in various environments, was used for training. This work focuses on
detecting small vehicles, so we conducted experiments to find the reference points for
depth and width scaling using only the car, bus, and truck (vehicle) classes of
Visdrone2019-Det. The proposed scaling method differs from existing methods in that we
change the model structure by estimating the depth and width scaling levels that are
effective for learning small object features.
Fig. 1. Architecture of the proposed model.
3.2.1 Model Depth Scaling
Fig. 2 shows the depth scaling of the proposed network. To find the reference point for
depth scaling, we compared two models with the original: Depth-SC-V1, in which layer
removal starts at the point where the feature map reaches 20$\times$20, and Depth-SC-V2,
in which removal starts at the 40$\times$40 feature map. Each model adds one or two
convolution layers to match the feature map size when performing feature fusion between
the neck and head layers. In all tables, models without the HLE label keep the original
three head layers.
Table 1 shows the results of depth scaling. Compared with the original model, the
depth-scaled models achieved higher performance with fewer parameters. In most detection
networks, shallow levels carry global spatial information, whereas at deep levels the
feature map shrinks and carries mainly specific, semantic information. Small objects,
however, provide very little feature information to learn from, so as the network becomes
deeper and the feature map is downsampled further, small object features may be lost.
This can confuse the training process: learning becomes inefficient, and unnecessary
layers and filters remain, adding computational load. In this experiment, we observed
that removing the deep layers with reduced feature map sizes allowed small object
features to be learned efficiently and increased recall. This also gave us a criterion
for efficiently learning small object features. Therefore, the proposed network selects
Depth-SC-V2 as the depth scaling point.
Fig. 2. Depth scaling. The existing model is scaled down while maintaining its structure, and the structure is changed in the Depth-SC-V1 and Depth-SC-V2 models. The thickness of each feature map indicates its number of layers.
Table 1. Model Scaling Results.
| Scaling | Model | Parameters | Precision | Recall | mAP@.5 | mAP@.5:.95 |
|---|---|---|---|---|---|---|
| Baseline | YOLOv4-s (ORI) | 8.06M | 0.426 | 0.603 | 0.558 | 0.367 |
| Depth scaling | Depth-SC-V1 | 6.28M | 0.402 | 0.595 | 0.556 | 0.368 |
| Depth scaling | Depth-SC-V2 | 6.40M | 0.414 | 0.62 | 0.57 | 0.380 |
| Width scaling | Width-SC-V1 | 6.48M | 0.392 | 0.609 | 0.555 | 0.365 |
| Width scaling | Width-SC-V2 | 2.73M | 0.396 | 0.594 | 0.549 | 0.358 |
| Compound scaling | Depth-SC-V2 + Width-SC-V1 | 5.80M | 0.417 | 0.628 | 0.579 | 0.39 |
| Compound scaling | Depth-SC-V2 + Width-SC-V2 | 2.42M | 0.386 | 0.612 | 0.555 | 0.369 |
| Compound scaling + HLE | Depth-SC-V2 + Width-SC-V1 + HLE | 2.55M | 0.490 | 0.580 | 0.568 | 0.387 |
| Compound scaling + HLE | Depth-SC-V2 + Width-SC-V2 + HLE | 1.59M | 0.468 | 0.583 | 0.556 | 0.374 |
3.2.2 Model Width Scaling
For width scaling, we compared two models with the original: Width-SC-V1, which limits
the maximum number of filters to 256, and Width-SC-V2, which limits it to 128. Table 1
shows the results of width scaling. The number of filters has a large effect on model
size, and the lighter the model, the larger this effect becomes. More filters clearly
allow more detailed information to be learned, but when a lightweight model is required,
it is important to find the minimum number of filters that is still sufficient.
In the width scaling experiment, we observed that the more the model is scaled down, the
lower its performance becomes. Compared with the original model, Width-SC-V1 is slightly
lighter while preserving similar performance. Width-SC-V2 loses slightly more performance
but achieves a significant reduction in model size. In terms of performance, Width-SC-V1
should be adopted, but Width-SC-V2 is attractive in terms of weight reduction. Therefore,
both Width-SC-V1 and Width-SC-V2 were combined with depth scaling to test compound scaling.
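A minimal sketch of the capping rule behind width scaling is shown below; the channel progression is an illustrative example, not the actual YOLOv4-s channel list:

```python
# Width scaling by capping the per-layer filter (channel) count.
def scale_width(filter_counts, max_filters):
    """Clip every layer's channel count to the given maximum."""
    return [min(c, max_filters) for c in filter_counts]

baseline = [32, 64, 128, 256, 512, 512]    # illustrative channel progression
width_sc_v1 = scale_width(baseline, 256)   # -> [32, 64, 128, 256, 256, 256]
width_sc_v2 = scale_width(baseline, 128)   # -> [32, 64, 128, 128, 128, 128]
print(width_sc_v1, width_sc_v2)
```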
3.2.3 Model Compound Scaling
Both Width-SC-V1 and Width-SC-V2 were combined with the Depth-SC-V2 model for compound
scaling. The experimental results are shown in Table 1. Although the model was made
lighter by scaling down the depth and width, the overall performance was maintained or
even improved, because eliminating unnecessary layers and filters yields a model structure
that is efficient for small object detection. Combining compound scaling with the head
layer elimination of Section 3.1 is shown in the last rows of Table 1: a significant
reduction in model size was achieved by down-scaling together with head layer elimination.
An important observation from these two experiments is that performance improved through
model scaling alone, which shows that the model structure itself matters in small object
detection.
3.2.4 Model Scaling Network
Based on the experimental results, the proposed network removes the deep convolution
layers at the bottom of the network starting from the point (Depth-SC-V2) where the
feature map becomes 40$\times$40 in the baseline network. In addition, the network is
scaled down by combining it with the width scaling model (Width-SC-V1) that limits the
number of filters per layer to 256. Purely in terms of weight reduction, the combination
of Depth-SC-V2 and Width-SC-V2 is also a good option; the former combination was chosen
as a compromise between light weight and performance. Thus, we designed a network that
preserves object location information and global features by minimizing the reduction of
feature maps through depth scaling, and that uses the minimum necessary number of filters
through width scaling. The proposed deep neural network applies model scaling to the depth
and width of the backbone network to construct an efficient network suitable for small
object detection on low-power embedded hardware. However, some feature information is
inevitably lost through model scaling.
3.3 Attention Stacked Hourglass Network
The ASHN, shown in Fig. 3, was added to compensate for the lost feature information and
to detect small objects effectively. The original hourglass network [33] was proposed for
human pose estimation and extracts diverse feature information through repeated
downsampling and upsampling across multiple scales. In addition, by combining feature
maps, it can re-estimate features of the overall image. A stacked hourglass network [33]
stacks multiple hourglass networks. We use this structure to compensate for the feature
information lost through depth and width scaling. To put more weight on small object
features, an attention mechanism is added to the hourglass network, yielding the attention
stacked hourglass network shown in Fig. 3.
Fig. 3. Attention Stacked Hourglass Network.
3.3.1 Feature Fusion using Attention Stacked Hourglass Network
In the backbone network, the shallow, medium, and deep levels carry different feature
information. M2Det [7] showed that traffic signs, cars, and pedestrians have different
feature characteristics and are detected at different levels and scales. Since most
objects in drone images are small and contain little feature information, it is better
to focus on features suited to small objects.
Low-level features are generally useful for detecting small objects, but high-level
features that capture the context of the image are also required; using both improves
detection performance. Therefore, feature fusion is conducted with the attention stacked
hourglass network to extract small object features from the network.
3.3.2 Attention Stacked Hourglass Network Structure
Two things were considered when designing the ASHN, as shown in Fig. 3. First, it should
supplement the feature information across various feature map sizes that is lost through
depth scaling. Second, it should supplement the filter information needed to learn
diverse, fine features that is lost through width scaling. The ASHN was therefore
designed as follows: two hourglass networks are stacked, features are extracted at the
80$\times$80, 40$\times$40, and 20$\times$20 feature map sizes by max pooling twice, and
the number of filters increases by 128 each time the feature map size decreases.
We aimed to extract scale-specific features from shallow-level feature information. The
repeated downsampling and upsampling of the hourglass network compensates for the feature
map sizes lost through backbone depth scaling, and increasing the number of filters inside
the hourglass network compensates for the filters removed by backbone width scaling, so
feature fusion is performed with richer information.
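A PyTorch-style sketch of such a two-stage hourglass is given below. The block design, channel counts, and fusion details are simplifying assumptions rather than the exact ASHN implementation, and the CBAM-style attention described next would be inserted into each residual block:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Plain residual block; in the ASHN a CBAM attention module is added inside it."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
            nn.Conv2d(c_out, c_out, 3, padding=1), nn.BatchNorm2d(c_out))
        self.skip = nn.Conv2d(c_in, c_out, 1) if c_in != c_out else nn.Identity()
    def forward(self, x):
        return torch.relu(self.body(x) + self.skip(x))

class Hourglass(nn.Module):
    """One hourglass: 80x80 -> 40x40 -> 20x20 by max pooling, then back up,
    with the channel count growing by 128 at each downsampling."""
    def __init__(self, c=128):
        super().__init__()
        self.pool = nn.MaxPool2d(2)
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.down1, self.down2 = ResidualBlock(c, c + 128), ResidualBlock(c + 128, c + 256)
        self.up1, self.up2 = ResidualBlock(c + 256, c + 128), ResidualBlock(c + 128, c)
    def forward(self, x):
        d1 = self.down1(self.pool(x))      # 40x40
        d2 = self.down2(self.pool(d1))     # 20x20
        u1 = self.up1(self.up(d2)) + d1    # fuse 40x40 features
        return self.up2(self.up(u1)) + x   # fuse back to 80x80

# Two hourglasses stacked, as in the ASHN.
ashn = nn.Sequential(Hourglass(128), Hourglass(128))
print(ashn(torch.randn(1, 128, 80, 80)).shape)  # torch.Size([1, 128, 80, 80])
```

Because each hourglass returns to the resolution of its input, the stack can be dropped into the middle of the network where feature fusion is performed.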
Additionally, to learn small object features effectively, an attention module with the
CBAM [34] structure is added to each residual block so that the ASHN concentrates its
weights on small objects. Attention modules are broadly divided into channel attention [35]
and spatial attention [36]. Channel attention extracts one value per channel that
summarizes the important part of the feature in that channel; passing these values through
a sigmoid and multiplying them with the input feature map gives a high weight to the
channels that need attention.
Spatial attention creates a one-channel feature map through max and average pooling across
all channels; applying a sigmoid to this map and multiplying it with the input feature map
gives a high weight to the spatial positions that need attention. Thus, channel attention
attends to ``what,'' and spatial attention attends to ``where.''
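A compact sketch of a CBAM-style block is shown below; the reduction ratio and the 7$\times$7 kernel are common defaults reported in [34] and are used here as assumptions:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """'What' to attend to: one weight per channel from global average/max pooling."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False))
    def forward(self, x):
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))
        return x * torch.sigmoid(avg + mx)

class SpatialAttention(nn.Module):
    """'Where' to attend to: a one-channel map from channel-wise average/max pooling."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)
    def forward(self, x):
        avg = torch.mean(x, dim=1, keepdim=True)
        mx, _ = torch.max(x, dim=1, keepdim=True)
        return x * torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class CBAM(nn.Module):
    """Channel attention followed by spatial attention, applied sequentially [34]."""
    def __init__(self, channels):
        super().__init__()
        self.ca, self.sa = ChannelAttention(channels), SpatialAttention()
    def forward(self, x):
        return self.sa(self.ca(x))

print(CBAM(128)(torch.randn(1, 128, 40, 40)).shape)  # torch.Size([1, 128, 40, 40])
```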
CBAM [34] is applied to our network as a module that processes channel attention and
spatial attention sequentially. Table 2 shows the result of adding the ASHN to the two
compound-scaled models adopted in Section 3.2. Although the models are lighter than the
original, their performance improves: feature fusion using the ASHN contributes to this
improvement by concentrating weights on the features of small objects. In particular,
precision increases significantly. Small objects are easily confused with the background
and with other objects precisely because they are so small, and the proposed network
mitigates this problem, which is reflected in the large precision gain.
Table 2. Compound + Head Layer Elimination + ASHN Experimental Results.
| Model | Parameters | Precision | Recall | mAP@.5 | mAP@.5:.95 |
|---|---|---|---|---|---|
| YOLOv4-s (Ori) | 8.06M | 0.426 | 0.603 | 0.558 | 0.367 |
| Depth-SC-V2 + Width-SC-V1 + HLE + ASHN | 5.73M | 0.534 | 0.60 | 0.591 | 0.399 |
| Depth-SC-V2 + Width-SC-V2 + HLE + ASHN | 4.77M | 0.502 | 0.579 | 0.571 | 0.384 |
4. Experimental Results
The experiments were conducted on an RTX 3090 (24 GB) GPU. YOLOv4-s (Ori) and our network
were tested under the same conditions: a batch size of 16, a learning rate of 0.00261,
and a 640$\times$640 input size. Anchor boxes were obtained with k-means clustering: the
original model was trained with nine anchor boxes and our network with six.
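A minimal sketch of the anchor clustering step is shown below; it uses plain k-means with a Euclidean distance for brevity (YOLO-style implementations commonly use an IoU-based distance), and the box sizes are synthetic, whereas the paper clusters the ground-truth boxes of the training set:

```python
import numpy as np

def kmeans_anchors(wh, k, iters=100, seed=0):
    """Cluster ground-truth (width, height) pairs into k anchor boxes."""
    rng = np.random.default_rng(seed)
    centers = wh[rng.choice(len(wh), k, replace=False)]
    for _ in range(iters):
        # Assign each box to its nearest anchor (Euclidean distance for simplicity).
        labels = np.argmin(((wh[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = wh[labels == j].mean(axis=0)
    return centers[np.argsort(centers.prod(axis=1))]  # sorted by box area

# Synthetic box sizes for illustration only.
boxes_wh = np.abs(np.random.randn(1000, 2)) * 40 + 5
print(kmeans_anchors(boxes_wh, k=6))  # six anchors for the two remaining heads
```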
Table 3 compares the number of parameters, precision, recall, and mAP on each dataset.
Precision reflects false detections, while recall, an important indicator in small object
detection, reflects missed detections. mAP is the area under the precision-recall curve
measured at each IoU threshold and is used as the detection evaluation metric. In this
study, vehicles (cars, trucks, and buses) were evaluated for precision and recall at an
IoU of 0.5. In Visdrone2019-Det, the car and van classes were both treated as cars.
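Evaluation at a single IoU threshold can be sketched as follows (a simplified greedy matcher; mAP additionally integrates precision over recall as the confidence threshold varies and, for mAP@.5:.95, averages over IoU thresholds from 0.5 to 0.95):

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-9)

def precision_recall(detections, gts, iou_thr=0.5):
    """Greedily match detections (sorted by confidence) to ground-truth boxes."""
    matched, tp = set(), 0
    for det in detections:
        best_j = max(range(len(gts)), key=lambda j: iou(det, gts[j]), default=None)
        if best_j is not None and best_j not in matched and iou(det, gts[best_j]) >= iou_thr:
            matched.add(best_j)
            tp += 1
    fp, fn = len(detections) - tp, len(gts) - len(matched)
    precision = tp / (tp + fp + 1e-9)  # low precision -> many false detections
    recall = tp / (tp + fn + 1e-9)     # low recall -> many missed objects
    return precision, recall

dets = [(10, 10, 50, 50), (200, 200, 240, 240)]   # hypothetical detections
gts = [(12, 12, 52, 52), (100, 100, 140, 140)]    # hypothetical ground truth
print(precision_recall(dets, gts))  # (~0.5, ~0.5): one match, one false alarm, one miss
```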
Our network has 5.73M parameters, which is about 1.4 times fewer than the original
YOLOv4-s (ORI) model. In addition, a high mAP was achieved by supplementing small object
feature information through feature fusion with the ASHN. In particular, precision
improves much more than the other indicators; since precision reflects false detections,
our network is more robust against false detections than the original model. Moreover,
our network is not limited to small vehicles and is also applicable to small objects of
various classes.
Table 4 shows the experimental results for the 10 classes (pedestrian, person, car, van,
bus, truck, motor, bicycle, awning tricycle, and tricycle) used in the Visdrone-DET2019
challenge. The results of this experiment can be summarized as follows. When designing a
model for small object detection, an overly deep model can hinder the learning of small
object features, so the depth scale plays an important role in small object detection
tasks. In addition, feature fusion using the ASHN contributes to model performance by
extracting small object features at various scales and with various filters.
For general object detection models, accuracy is roughly proportional to model size, but
no single model performs well in every task. The contribution of this study is a method
to efficiently construct a model that detects small objects and can run in a low-power
embedded environment. A more efficient model can be designed by constructing the model
differently for each domain, application, and class.
Fig. 4 shows qualitative comparison results on the Visdrone-DET2019 test set. Cars are
drawn in red, trucks in green, and buses in blue. Our model performed better than the
original model at various angles, altitudes, and illumination levels. As shown in Table 3,
our network is robust against misclassification. In the low-light scene, the original
model produced misclassifications and missed detections, whereas our network detected
the objects correctly. Also, unlike the original model, which detects the background as
a class at high altitude, our network detects small objects well.
Fig. 4. Experimental results on the Visdrone-DET2019 test set. Each example consists of three images: the top shows the ground truth (GT), the middle the original model, and the bottom our network.
Table 3. Experimental Results with Various Datasets.
| Dataset | Model | Parameters | Precision | Recall | mAP@.5 | mAP@.5:.95 |
|---|---|---|---|---|---|---|
| Visdrone-DET2019 [32] | YOLOv4-s (Ori) | 8.06M | 0.426 | 0.603 | 0.558 | 0.367 |
| Visdrone-DET2019 [32] | Our network | 5.73M | 0.534 | 0.60 | 0.591 | 0.399 |
| UAVDT [37] | YOLOv4-s (Ori) | 8.06M | 0.655 | 0.991 | 0.992 | 0.747 |
| UAVDT [37] | Our network | 5.73M | 0.785 | 0.991 | 0.992 | 0.774 |
| CARPK [38] | YOLOv4-s (Ori) | 8.06M | 0.505 | 0.996 | 0.996 | 0.819 |
| CARPK [38] | Our network | 5.73M | 0.682 | 0.997 | 0.997 | 0.844 |
Table 4. Visdrone2019-Det 10 Class Experimental Results.
| Model | Parameters | Precision | Recall | mAP@.5 | mAP@.5:.95 |
|---|---|---|---|---|---|
| YOLOv4-s (Ori) | 8.08M | 0.296 | 0.426 | 0.364 | 0.213 |
| Our network | 5.74M | 0.372 | 0.429 | 0.391 | 0.233 |
5. Conclusion
In this paper, we proposed an efficient, lightweight deep neural network model for small
vehicle detection in a drone environment. Considering the drone environment, in which
small objects dominate, the model was made lightweight by eliminating the head layer
responsible for large objects and by efficient model scaling. In addition, we compensated
for the lost information through feature fusion using the ASHN and focused on the feature
information of small objects. As a result, the number of model parameters was reduced by
a factor of about 1.4 compared to the original model, while the mAP improved.
The model can be applied in various ways, such as traffic jam identification, illegal
parking detection, and intelligent traffic systems. It can also be applied to low-power
embedded platforms such as CCTV and portable cameras as well as drones. Lastly, our
network is not limited to vehicles and can be used for small object detection tasks
involving various classes.
ACKNOWLEDGMENTS
This work was supported by the Institute of Information & Communications Technology
Planning & Evaluation (IITP) grant, which is funded by the Korean government (MSIT)
(2021-0-01341, Artificial Intelligence Graduate School Program (Chung-Ang University)),
and financially supported by the Institute of Civil-Military Technology Cooperation
Program funded by the Defense Acquisition Program Administration and Ministry of Trade,
Industry and Energy of Korean government under grant No. UM20311RD3.
REFERENCES
A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, ``Yolov4: Optimal speed and accuracy
of object detection,'' arXiv preprint arXiv:2004.10934, 2020.
S. Ren, K. He, R. Girshick, and J. Sun, ``Faster r-cnn: Towards real-time object
detection with region proposal networks,'' Advances in neural information processing
systems, vol. 28, pp. 91-99, 2015.
Z. Cai and N. Vasconcelos, ``Cascade r-cnn: Delving into high quality object detection,''
in Proceedings of the IEEE conference on computer vision and pattern recognition,
2018, pp. 6154-6162.
J. Pang, K. Chen, J. Shi, H. Feng, W. Ouyang, and D. Lin, ``Libra r-cnn: Towards balanced
learning for object detection,'' in Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, 2019, pp. 821-830.
R. Girshick, J. Donahue, T. Darrell, and J. Malik, ``Rich feature hierarchies for
accurate object detection and semantic segmentation,'' in Proceedings of the IEEE
conference on computer vision and pattern recognition, 2014, pp. 580-587.
W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, ``Ssd:
Single shot multibox detector,'' in European conference on computer vision. Springer,
2016, pp. 21-37.
Q. Zhao, T. Sheng, Y. Wang, Z. Tang, Y. Chen, L. Cai, and H. Ling, ``M2det: A single-shot
object detector based on multi-level feature pyramid network,'' in Proceedings of
the AAAI conference on artificial intelligence, vol. 33, no. 01, 2019, pp. 9259-9266.
P. Purkait, C. Zhao, and C. Zach, ``Spp-net: Deep absolute pose regression with synthetic
views,'' arXiv preprint arXiv:1712.03452, 2017.
S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia, ``Path aggregation network for instance
segmentation,'' in Proceedings of the IEEE conference on computer vision and pattern
recognition, 2018, pp. 8759-8768.
T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C.
L. Zitnick, ``Microsoft coco: Common objects in context,'' in European conference
on computer vision. Springer, 2014, pp. 740-755.
Z. Cai, Q. Fan, R. S. Feris, and N. Vasconcelos, ``A unified multi-scale deep convolutional
neural network for fast object detection,'' in European conference on computer vision.
Springer, 2016, pp. 354-370.
C.-Y. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg, ``Dssd: Deconvolutional single
shot detector,'' arXiv preprint arXiv:1701.06659, 2017.
T. Kong, A. Yao, Y. Chen, and F. Sun, ``Hypernet: Towards accurate region proposal
generation and joint object detection,'' in Proceedings of the IEEE conference on
computer vision and pattern recognition, 2016, pp. 845-853.
T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, ``Feature
pyramid networks for object detection,'' in Proceedings of the IEEE conference on
computer vision and pattern recognition, 2017, pp. 2117-2125.
Y. Liu, P. Sun, N. Wergeles, and Y. Shang, ``A survey and performance evaluation of
deep learning methods for small object detection,'' Expert Systems with Applications,
p. 114602, 2021.
R. Girshick, J. Donahue, T. Darrell, and J. Malik, ``Rich feature hierarchies for
accurate object detection and semantic segmentation,'' in Proceedings of the IEEE
conference on computer vision and pattern recognition, 2014, pp. 580-587.
L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, ``Deeplab: Semantic
image segmentation with deep convolutional nets, atrous convolution, and fully connected
crfs,'' IEEE transactions on pattern analysis and machine intelligence, vol. 40, no.
4, pp. 834-848, 2017.
F. Yu and V. Koltun, ``Multi-scale context aggregation by dilated convolutions,''
arXiv preprint arXiv:1511.07122, 2015.
Y. Li, Y. Chen, N. Wang, and Z. Zhang, ``Scale-aware trident networks for object detection,''
in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019,
pp. 6054-6063.
B. Singh and L. S. Davis, ``An analysis of scale invariance in object detection snip,''
in Proceedings of the IEEE conference on computer vision and pattern recognition,
2018, pp. 3578-3587.
B. Singh, M. Najibi, and L. S. Davis, ``Sniper: Efficient multi-scale training,''
arXiv preprint arXiv:1805.09300, 2018.
M. Kisantal, Z. Wojna, J. Murawski, J. Naruniec, and K. Cho, ``Augmentation for small
object detection,'' arXiv preprint arXiv:1902.07296, 2019.
B. Zoph, E. D. Cubuk, G. Ghiasi, T.-Y. Lin, J. Shlens, and Q. V. Le, ``Learning data
augmentation strategies for object detection,'' in European Conference on Computer
Vision. Springer, 2020, pp. 566-583.
A. Shrivastava, A. Gupta, and R. Girshick, ``Training region-based object detectors
with online hard example mining,'' in Proceedings of the IEEE conference on computer
vision and pattern recognition, 2016, pp. 761-769.
Y. Cao, K. Chen, C. C. Loy, and D. Lin, ``Prime sample attention in object detection,''
in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,
2020, pp. 11 583-11 591.
K. Chen, J. Li, W. Lin, J. See, J. Wang, L. Duan, Z. Chen, C. He, and J. Zou, ``Towards
accurate one-stage object detection with ap-loss,'' in Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, 2019, pp. 5119-5127.
Q. Qian, L. Chen, H. Li, and R. Jin, ``Dr loss: Improving object detection by distributional
ranking,'' in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, 2020, pp. 12 164-12 172.
P. Dollár, M. Singh, and R. Girshick, ``Fast and accurate model scaling,'' in Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 924-932.
M. Tan and Q. Le, ``Efficientnet: Rethinking model scaling for convolutional neural
networks,'' in International Conference on Machine Learning. PMLR, 2019, pp. 6105-6114.
C.-Y. Wang, A. Bochkovskiy, and H.-Y. M. Liao, ``Scaled-yolov4: Scaling cross stage
partial network,'' in Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, 2021, pp. 13 029-13 038.
M. Tan, R. Pang, and Q. V. Le, ``Efficientdet: Scalable and efficient object detection,''
in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition,
2020, pp. 10 781-10 790.
D. Du, P. Zhu, L. Wen, X. Bian, H. Lin, Q. Hu, T. Peng, J. Zheng, X. Wang, Y. Zhang
et al., ``Visdrone-det2019: The vision meets drone object detection in image challenge
results,'' in Proceedings of the IEEE/CVF International Conference on Computer Vision
Workshops, 2019, pp. 0-0.
A. Newell, K. Yang, and J. Deng, ``Stacked hourglass networks for human pose estimation,''
in European conference on computer vision. Springer, 2016, pp. 483-499.
S. Woo, J. Park, J.-Y. Lee, and I. S. Kweon, ``Cbam: Convolutional block attention
module,'' in Proceedings of the European conference on computer vision (ECCV), 2018,
pp. 3-19.
J. Hu, L. Shen, and G. Sun, ``Squeeze-and-excitation networks,'' in Proceedings
of the IEEE conference on computer vision and pattern recognition, 2018, pp. 7132-7141.
J. Park, S. Woo, J.-Y. Lee, and I. S. Kweon, ``Bam: Bottleneck attention module,''
arXiv preprint arXiv:1807.06514, 2018.
D. Du, Y. Qi, H. Yu, Y. Yang, K. Duan, G. Li, W. Zhang, Q. Huang, and Q. Tian, ``The
unmanned aerial vehicle benchmark: Object detection and tracking,'' in Proceedings
of the European Conference on Computer Vision (ECCV), 2018, pp. 370-386.
M.-R. Hsieh, Y.-L. Lin, and W. H. Hsu, ``Drone-based object counting by spatially
regularized regional proposal network,'' in Proceedings of the IEEE international
conference on computer vision, 2017, pp. 4145-4153.
J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, ``You only look once: Unified,
real-time object detection,'' in Proceedings of the IEEE conference on computer vision
and pattern recognition, 2016, pp. 779-788.
Author
Mingi Kim was born in Okcheon, Korea, in 1996. He received a B.S. degree in data
analysis from Hannam University, South Korea, in 2021. He is currently pursuing an
M.S. degree with the Department of Artificial Intelligence, Chung-Ang University.
Heegwang Kim was born in Seoul, Korea, in 1992. He received a B.S. degree in electronic
engineering from Soongsil University, Korea, in 2016. He received an M.S. degree in
image science from Chung-Ang University, Korea, in 2018. Currently, he is pursuing
a Ph.D. degree in image engineering at Chung-Ang University.
Chanyeong Park was born in Seoul, South Korea, in 1997. He received a B.S. degree
in computer science from Coventry University in 2021. Currently, he is pursuing an
M.S. degree in image processing at Chung-Ang University. His research interests include
object detection and monocular 3D object detection.
Joonki Paik was born in Seoul, South Korea, in 1960. He received a B.S. degree in
control and instrumentation engineering from Seoul National University in 1984 and
M.Sc. and Ph.D. degrees in electrical engineering and computer science from Northwestern
University in 1987 and 1990, respectively. From 1990 to 1993, he worked at Samsung
Electronics, where he designed image stabilization chipsets for consumer camcorders.
Since 1993, he has been a member of the faculty of Chung-Ang University, Seoul, Korea,
where he is currently a professor with the Graduate School of Advanced Imaging Science,
Multimedia, and Film. From 1999 to 2002, he was a visiting professor with the Department
of Electrical and Computer Engineering, University of Tennessee, Knoxville. Since
2005, he has been the director of the National Research Laboratory in the field of
image processing and intelligent systems. From 2005 to 2007, he served as the dean
of the Graduate School of Advanced Imaging Science, Multimedia, and Film. From 2005
to 2007, he was the director of the Seoul Future Contents Convergence Cluster established
by the Seoul Research and Business Development Program. In 2008, he was a full-time
technical consultant for the System LSI Division of Samsung Electronics, where he
developed various computational photographic techniques, including an extended depth
of field system. He has served as a member of the Presidential Advisory Board for
Scientific/Technical Policy with the Korean Government and is currently serving as
a technical consultant for the Korean Supreme Prosecutor's Office for computational
forensics. He is a two-time recipient of the Chester-Sall Award from the IEEE Consumer
Electronics Society, the Academic Award from the Institute of Electronic Engineers
of Korea, and the Best Research Professor Award from Chung-Ang University. He has
served the Consumer Electronics Society of the IEEE as a member of the editorial board,
vice president of international affairs, and director of sister and related societies
committee.