
  1. (Department of Image Engineering, Processing and Intelligent Systems Laboratory, Graduate School of Advanced Imaging Science, Multimedia and Film, Chung-Ang University, Seoul 06974, Korea)



Keywords: Metadata, Object segmentation, Surveillance system

1. Introduction

Recently, surveillance systems have increased the number of cameras in order to monitor crime scenes, traffic accidents, and natural disasters more accurately. Surveillance systems operate in two modes: real-time monitoring and specific-scene searching, also called video summarization. In real-time monitoring, adding cameras widens the monitored region at the cost of requiring more human operators. From another viewpoint, adding cameras causes a dramatic rise in both storage cost and computational requirements in proportion to the number of cameras, because both operation modes run on every camera and video.

To address these problems, metadata extraction methods have been researched. In a surveillance system, metadata is a set of descriptors that represents an object in real time or a scene in non-real time. Conventional metadata extraction methods use color, shape, and texture features. Each feature has different attributes, which play an important role in object detection, tracking, and recognition [1-3].

Metadata generation methods are classified by camera type into static camera and non-static camera-based algorithms. The key property of a static camera is an image with a static background. Kim $\textit{et al.}$ proposed an object detection and metadata extraction method based on a Gaussian Mixture Model (GMM) and blob analysis [4]. Kim's method separates foreground from background through GMM-based background modeling and then detects candidate object regions. Blob analysis removes a candidate region when it is smaller than a threshold. After a region-growing process, metadata is extracted from the remaining blobs.

Geronimo $\textit{et al.}$ proposed an Adaptive Gaussian Mixture Model (AGMM)-based method for object detection and for human action and appearance extraction [5]. The method uses GMM-based object detection. A GMM-based method can detect an object through background and foreground modeling when the static camera observes a sufficiently moving object. On the other hand, if the object is static or rarely moves, it is classified as background.

Fig. 1. Block diagram of the proposed metadata extraction algorithm.
../../Resources/ieie/IEIESPC.2021.10.3.209/fig1.png

Yuk $\textit{et al.}$ proposed motion block-based object detection for metadata extraction [6]. In this method, the similarity of motion blocks is used for region growing by merging. Paek $\textit{et al.}$ proposed Lucas-Kanade optical-flow-based moving object detection and metadata extraction using an Active Shape Model (ASM) and color space transformation [7]. Paek's method assumes a stable illumination condition for the color space transformation. However, real-world illumination is unstable and changeable.

Jung $\textit{et al.}$ proposed a 3D modeling-based metadata extraction method [8]. Jung's method performs camera calibration with object detection and tracking results. After camera calibration, ellipsoid model-based metadata extraction is executed. Yun $\textit{et al.}$ proposed representative patch selection from detected object patches [9]. The accuracy of Yun's method depends on the patch size of the object; as a result, a small object reduces detection performance.

As summarized above, most metadata extraction methods for a static camera use background subtraction or frame subtraction. However, subtraction between adjacent frames is unsuitable for a non-static camera because the camera movement itself becomes an object candidate, which decreases object detection accuracy. To overcome this limitation of static camera-based algorithms, non-static camera-based algorithms have been studied.

Chavda $\textit{et al.}$ proposed an object detection and metadata extraction method that performs background subtraction on the first frame of a Pan-Tilt-Zoom (PTZ) camera [10]. Chavda's method cannot detect multiple objects at the same time because the PTZ camera has a limited field of view (FoV). Hou $\textit{et al.}$ proposed an object detection and metadata extraction method based on a pre-trained Deformable Part Model (DPM) and Histogram of Oriented Gradients (HOG) in a non-static camera [11]. Although Hou's method can be applied to a non-static camera using pre-trained detectors, its detection and metadata extraction performance decreases for fast-moving objects and complex backgrounds.

Wu $\textit{et al.}$ proposed object detection and metadata extraction using whole-frame trajectories based on Lagrangian particle trajectories in a non-static camera [12]. Detection and metadata extraction are performed by decomposing the trajectories into camera motion and object motion over the entire sequence. The accuracy of this method decreases as the difference between the camera and object trajectories decreases.

As described above, metadata extraction methods for a non-static camera use features of both object and camera motion. Non-static camera-based algorithms are more challenging than static camera-based algorithms because the motion in a non-static camera is generated by the movement of both the object and the camera. Moreover, illumination changes and swaying leaves also appear as motion in the captured video. Hence, an alternative object detection method that guarantees extraction of the overall object shape is required for metadata extraction.

A notable alternative is segmentation. Seemanthini $\textit{et al.}$ proposed clustering-based segmentation and metadata extraction [13]. Seemanthini's method performs part-level segmentation of the input frame. After clustering, cluster similarity is used for region merging; the merged region becomes an object and is fed into a metadata extraction module. Patel $\textit{et al.}$ generated a saliency map by estimating distance with an average filter and a Gaussian low-pass filter on a video sequence [14]. Metadata is extracted by thresholding and a morphological operation on the saliency map.

In segmentation-based methods, the object segmentation result directly affects the accuracy of metadata generation. As a result, segmenting a wide-angle camera image with a dominant background leads to inaccurate segmentation and incorrect metadata. This paper proposes segmentation-based metadata extraction for static and non-static cameras with various FoVs. The proposed method is designed as a three-stage framework consisting of a deep learning-based detector, a segmentation module for background elimination, and metadata extraction. The paper is organized as follows. Section 2 presents deep learning-based object detection, the improved DeepLab v3+ for background elimination, and the metadata extraction method. Section 3 presents comparative results between the proposed method and existing methods for static and non-static cameras. Section 4 concludes the paper and describes future work.

2. Metadata Extraction of Static and Non-static Camera

In this section, we propose a robust metadata extraction method applicable to video captured by both static and non-static cameras. The proposed method consists of three steps: i) YOLO v3-based object detection, ii) improved DeepLab v3+-based background elimination with scale robustness, and iii) comprehensive metadata extraction. Fig. 1 shows the block diagram of the proposed algorithm. This paper adopts a deep learning-based object detector for real-time detection in static and non-static cameras [15] because the proposed method focuses on accurate metadata extraction rather than on object detection itself. Nevertheless, we additionally train the object detection model by gathering and augmenting data to enhance detection accuracy across diverse camera fields of view.

After fine-tuning, to achieve accurate object segmentation within the candidate object region marked by a bounding box, the proposed method improves the DeepLab v3+ model, which eliminates the background. The proposed method enhances the model by removing a specific feature map in the feature pyramid of the original DeepLab v3+ encoder network. To better represent small objects, the proposed segmentation model generates a new feature map that captures the detail information of small objects. Finally, metadata is extracted from the estimated object information without background. The proposed metadata consists of the representative color, size, aspect ratio, and patch of an object.
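The overall flow can be summarized with a short sketch. The names `detect_objects`, `segment_object`, and `extract_metadata` below are hypothetical placeholders for the fine-tuned YOLO v3 detector, the improved DeepLab v3+ segmenter, and the metadata module of Section 2.3; they are not APIs defined in this paper.

```python
# Minimal sketch of the three-stage pipeline, assuming hypothetical wrappers:
# detector.detect_objects (fine-tuned YOLO v3), segmenter.segment_object
# (improved DeepLab v3+), and extract_metadata (Section 2.3).
def process_frame(frame, detector, segmenter, extract_metadata):
    records = []
    for box in detector.detect_objects(frame):               # stage 1: candidate regions
        crop = frame[box.top:box.bottom, box.left:box.right]
        mask = segmenter.segment_object(crop)                 # stage 2: background elimination
        records.append(extract_metadata(crop, mask))          # stage 3: color/size/ratio/patch
    return records
```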

2.1 Deep Learning-based Object Detection

The object detection module aims at accurate object shape extraction without loss of the object region in static and non-static cameras and under various camera FoVs. From the viewpoint of conventional object detection, the environment of surveillance cameras is tightly constrained. For example, when a surveillance system uses both static and non-static cameras, the cameras are assumed to have similar FoVs because the system shares a single object detection module across all cameras.

Fig. 2 shows the challenge caused by differences in camera FoV. In Figs. 2(a) and (b), the pairs of corresponding objects have different representations because the cameras have extremely different FoVs, which yields dissimilar size and shape for the same object. In contrast, in Figs. 2(c) and (d), the objects marked with red bounding boxes have similar size and shape. These phenomena are caused by the lens FoV and viewpoint of the camera.

The selection of a proper object detection algorithm considers the mentioned attributes and environments of the surveillance camera. In addition, detection speed for real-time metadata extraction is considered, so we adopt a suitable detector from existing methods. For objective model selection, we executed an ablation study with deep learning-based object detectors, including YOLO v3, Faster R-CNN, and the Single Shot MultiBox Detector (SSD) [16,17]. We finally adopted the YOLO v3 detector based on the results of the ablation study, which are presented in the experimental results.

YOLO v3 is still unsuitable as preprocessing for metadata extraction, even though its adoption was decided by the ablation study, because the basic YOLO v3 is trained with a public dataset that excludes diverse FoVs and viewpoints. The proposed method, however, deals with standard-lens and fisheye-lens cameras as well as mobile and static cameras. Hence, YOLO v3 was additionally trained by collecting and augmenting data to enhance performance. The dataset for fine-tuning was gathered by considering object scales and angles. We collected images from various locations and additionally trained the YOLO v3 detector. Fig. 3 shows sample images of the fine-tuning dataset.

Fig. 4 presents experimental results of the basic YOLO v3 and the additionally trained YOLO v3. Fig. 4(a) shows that the original YOLO v3 detector cannot detect smaller objects. On the other hand, YOLO v3 with fine-tuning can detect smaller object regions that the original YOLO v3 fails to detect. For computational efficiency, the proposed algorithm detects objects at frame intervals; we set the interval to 24 frames, as sketched below.
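A minimal sketch of the interval-based detection loop follows. The `detector.detect_objects` call is the same hypothetical wrapper as above; reusing the last detections between detector runs is one possible interpretation, not a detail stated in the paper.

```python
# Sketch of interval-based detection with a fine-tuned YOLO v3 wrapper.
DETECTION_INTERVAL = 24  # the paper runs the detector every 24 frames

def detect_with_interval(frames, detector, interval=DETECTION_INTERVAL):
    last_boxes = []
    for idx, frame in enumerate(frames):
        if idx % interval == 0:
            last_boxes = detector.detect_objects(frame)  # fresh detection
        yield idx, last_boxes                            # reuse boxes between runs
```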

Fig. 2. Object detection results in images with different viewpoint and camera FoV (a)-(c) static camera, (d) Dash camera. (a) and (b) are recorded at the same place.
../../Resources/ieie/IEIESPC.2021.10.3.209/fig2.png
Fig. 3. Fine-tuning dataset.
../../Resources/ieie/IEIESPC.2021.10.3.209/fig3.png
Fig. 4. The comparison result of original YOLO v3 and fine-tuned YOLO v3 (a) result of YOLO v3 without fine-tuning, (b) result of YOLO v3 with fine-tuning.
../../Resources/ieie/IEIESPC.2021.10.3.209/fig4.png

2.2 Object Extraction without Background

Object detection results are commonly represented as bounding boxes, and every detection result includes background near the object. However, the background in a bounding box contaminates the color metadata because the background region is often as large as or larger than the foreground; as a result, the background color contributes as much information as, or more than, the foreground color. To reduce the effect of the background, the proposed method performs an additional segmentation process to extract accurate metadata. In the segmentation process, we consider a robust multi-scale segmentation method that eliminates the background of detected objects with various shapes and scales.

The proposed segmentation method uses the Atrous Spatial Pyramid Pooling (ASPP) model. The ASPP structure constructs a feature pyramid with multiple receptive fields to extract detail information from multi-scale objects. A popular ASPP-based model is DeepLab v3+ [18]. DeepLab v3+ applies atrous convolution to extend the receptive field at the same computational cost as standard convolution. Atrous convolution is defined as:

(1)
$y\left[i\right]=\sum _{k}x\left[i+r\cdot k\right]w\left[k\right]$,

where $x$, $w$, and $r$ are the input, convolution filter, and atrous rate, respectively. However, the original DeepLab v3+ performs segmentation at the cost of losing detail information for small objects.
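As a concrete illustration of Eq. (1), the snippet below shows an atrous (dilated) convolution sketched in PyTorch; the framework choice is an assumption, since the paper does not state one. The dilation rate widens the receptive field while keeping the number of weights and the output size unchanged.

```python
import torch
import torch.nn as nn

# Atrous convolution as in Eq. (1): rate r = 6 vs. a standard 3x3 convolution.
x = torch.randn(1, 256, 64, 64)                                      # toy feature map
standard = nn.Conv2d(256, 256, kernel_size=3, padding=1)             # r = 1
atrous = nn.Conv2d(256, 256, kernel_size=3, padding=6, dilation=6)   # r = 6

print(standard(x).shape, atrous(x).shape)  # both: torch.Size([1, 256, 64, 64])
```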

To solve this problem, we propose an improved DeepLab v3+ model. In the feature pyramid, wide and narrow receptive fields are suitable for large-scale and small-scale objects, respectively. We exploit this relationship between object scale and receptive field. The improved model is designed by fusing feature maps with small receptive fields extracted from the feature pyramid of the original DeepLab v3+.

Fig. 5 shows the proposed DeepLab v3+ model. The improved DeepLab v3+ deletes the largest feature map in the encoder network of the original DeepLab v3+ model because this feature map, which condenses the semantic information, destroys important details of small objects. In Fig. 5, the green dashed box represents the eliminated feature map. To compensate for the information of large objects and to enhance the detail information of small objects, we generate a novel feature map by concatenating feature maps with different expansion ratios so that segment information of small objects is preserved. After the concatenation, we add a 3 x 3 and a 1 x 1 convolution layer.

In Fig. 5, the yellow dashed box is the novel feature map. The encoder network of the improved DeepLab v3+ is constructed by replacing the deleted feature map with this novel feature map, while the decoder network remains the same as in the original DeepLab v3+. The proposed method transfers both the context and the important details of small objects to the output feature map used for eliminating the background. Furthermore, the network of the improved DeepLab v3+ is lighter than the original because it removes a layer.
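The following PyTorch sketch illustrates the described encoder modification: the largest-receptive-field branch is dropped and replaced by a fused map built from the small-rate branches, followed by 3 x 3 and 1 x 1 convolutions. The atrous rates, channel sizes, and class name here are illustrative assumptions; the exact design follows Fig. 5 of the paper.

```python
import torch
import torch.nn as nn

def aspp_branch(in_ch, out_ch, rate):
    """3 x 3 atrous convolution branch of an ASPP module."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=rate, dilation=rate, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class SmallObjectASPP(nn.Module):
    """Sketch of the modified ASPP head: the largest-rate branch is removed
    and replaced by a fused map built from the small-rate branches."""

    def __init__(self, in_ch=2048, out_ch=256, small_rates=(6, 12)):
        super().__init__()
        self.conv1x1 = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.small = nn.ModuleList(aspp_branch(in_ch, out_ch, r) for r in small_rates)
        # novel feature map: concatenate the small-rate maps, then 3x3 and 1x1 convs
        self.fuse = nn.Sequential(
            nn.Conv2d(out_ch * len(small_rates), out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.project = nn.Sequential(
            nn.Conv2d(out_ch * (2 + len(small_rates)), out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        feats = [self.conv1x1(x)] + [branch(x) for branch in self.small]
        novel = self.fuse(torch.cat(feats[1:], dim=1))  # replaces the deleted branch
        return self.project(torch.cat(feats + [novel], dim=1))
```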

Fig. 6 shows comparative results of background elimination by the original DeepLab v3+ and the proposed method. The second and third columns of Fig. 6 represent background elimination results for objects detected by a static camera and a non-static camera, respectively. As the yellow boxes in Figs. 6(b) and (e) show, the original DeepLab v3+ misclassifies part of the background as an object. On the other hand, as shown in Figs. 6(c) and (f), the improved DeepLab v3+ correctly eliminates the background regions where the original DeepLab v3+ fails.

Fig. 5. The framework of the proposed DeepLab v3+ model.
../../Resources/ieie/IEIESPC.2021.10.3.209/fig5.png
Fig. 6. Experimental results of comparing the background elimination of the original DeepLab v3+ and the improved DeepLab v3+ in the detected object by static and non-static cameras (a), (d) detected object region, (b), (e) original DeepLab v3+-based background elimination result, (c), (f) improved DeepLab v3+ based background elimination result.
../../Resources/ieie/IEIESPC.2021.10.3.209/fig6.png

2.3 Metadata Extraction

In this section, we describe the proposed metadata extraction method for the detected object without background. The proposed metadata consists of the representative color, size, aspect ratio, and patch of the object. Color metadata is the most effective object attribute information; it enhances the accuracy of video search by distinguishing the colors of the upper body and lower body.

Size and aspect ratio are used for verification when the detector falsely detects a pedestrian. In general, the height of a pedestrian is greater than the width. Using this characteristic, the size and aspect ratio metadata increase the efficiency of object search by excluding falsely detected objects. The object patch is important metadata in a multi-camera environment.

Most existing object search algorithms use an object patch that includes background information. Unfortunately, in a multi-camera system, each object is captured against a dissimilar background. The proposed object patch metadata therefore represents the object feature clearly, because the suggested patch image is free of background effects.
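As a concrete illustration of the size and aspect-ratio verification described above, the following sketch rejects detections that cannot plausibly be pedestrians; the threshold values are illustrative assumptions, not values from the paper.

```python
# Sketch of the size/aspect-ratio verification; thresholds are assumptions.
MIN_AREA = 32 * 64          # assumed minimum pedestrian area in pixels
MAX_WIDTH_TO_HEIGHT = 0.8   # pedestrians are normally taller than they are wide

def is_plausible_pedestrian(width, height):
    if width <= 0 or height <= 0:
        return False
    return width * height >= MIN_AREA and width / height <= MAX_WIDTH_TO_HEIGHT
```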

Color metadata extraction methods are distinguished, by how color names are designated, into color chip and model-based methods. The color chip and model-based methods extract the representative color based on pre-defined color names and learned color distributions, respectively. Although the color chip method is effective in specific applications, it is limited by illumination conditions. The model-based method, on the other hand, learns color distributions that reflect the illumination conditions of a real-world environment.

Considering this problem, the proposed method adopts a PLSA-based generative model to extract an object's representative color metadata from images acquired in the real world [19]. The PLSA-based generative model uses a set of images retrieved from Google to learn representative colors of real-world images. To this end, the method selects 11 representative color names and learns the color distribution by collecting 250 training images for each color. The PLSA-based generative model is defined as:

(2)
$M_{C}=\sum p\left(f_{p}\left| f_{c}\right.\right)p\left(f_{c}\left| f_{D}\right.\right)$,

where $M_{C}$, $p\left(\cdot\right)$, $f_{p}$, $f_{c}$, and $f_{D}$ are the color metadata, conditional probability, object $L^{*}a^{*}b^{*}$ color value, representative color, and detected object region, respectively.
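The sketch below shows one way to apply Eq. (2) in practice: a learned lookup table gives $p\left(f_{c}\mid f_{p}\right)$ per color bin, and the probabilities are accumulated over object pixels only (the background has already been removed). The `color_name_prob` table and the `lab_to_bin` mapping are assumed inputs, standing in for the model learned as in [19].

```python
import numpy as np

# Sketch of Eq. (2): accumulate color-name probabilities over object pixels.
COLOR_NAMES = ["black", "blue", "brown", "grey", "green", "orange",
               "pink", "purple", "red", "white", "yellow"]

def representative_colors(lab_pixels, color_name_prob, lab_to_bin, top_k=3):
    bins = lab_to_bin(lab_pixels)                  # map each object pixel to a bin index
    scores = color_name_prob[bins].sum(axis=0)     # sum p(name | pixel) over the object
    ranked = np.argsort(scores)[::-1][:top_k]      # three most likely color names
    return [COLOR_NAMES[i] for i in ranked]
```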

As shown in Fig. 7(c), the PLSA-based generative model extracts a robust representative color regardless of real-world illumination conditions. The size and aspect ratio are estimated from the object edge region, as in Fig. 7(d). Fig. 7(e) shows the object patch metadata, which is extracted by storing the pixels of the object without background. The presented metadata excluding color is defined as:

(3)
$M_{S}=W\times H,\;\; W=f_{Rx}^{S}-f_{Lx}^{S},\;\; H=f_{Ty}^{S}-f_{By}^{S},\;\; M_{R}=W/H,\;\; M_{P}=\left\{f_{P}\,|\,f_{P}\in f_{D},\; f_{P}\notin f_{B}\right\}$,

where $W$, $H$, $f_{Rx}^{S}$, $f_{Lx}^{S}$, $f_{Ty}^{S}$, $f_{By}^{S}$, $f_{P}$, and $f_{B}$ are the width of the object segmentation, height of the object segmentation, right, left, top, and bottom boundaries of the segmentation region, a pixel of the segmented object, and the background region, respectively.
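A minimal sketch of Eq. (3) follows, computing the size, aspect ratio, and background-free patch from a binary segmentation mask; the function and variable names are illustrative, not part of the paper.

```python
import numpy as np

# Sketch of Eq. (3): metadata from the binary mask of the improved DeepLab v3+
# (1 = object pixel, 0 = background).
def shape_metadata(crop, mask):
    ys, xs = np.nonzero(mask)                  # object pixel coordinates
    if xs.size == 0:
        return None                            # nothing segmented inside this box
    width = xs.max() - xs.min() + 1            # f_Rx^S - f_Lx^S
    height = ys.max() - ys.min() + 1           # |f_Ty^S - f_By^S|
    patch = crop * mask[..., None]             # keep object pixels, zero the background
    return {"size": int(width * height),
            "aspect_ratio": width / height,
            "patch": patch}
```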

The extracted metadata is stored in a database as described in Table 1. The stored colors are the three colors most frequently extracted by the PLSA-based generative model from the object without background. Table 2 lists the 11 color names learned by the PLSA-based generative model.

Table 1. Object metadata configuration.

Camera id          Camera identification number
Video id           Video identification number
Frame number       Frame number
Object metadata    Object identification number, object size, object aspect ratio,
                   object representative color, object patch

Table 2. Representative color list.

Color      Index
Black      1
Blue       2
Brown      3
Grey       4
Green      5
Orange     6
Pink       7
Purple     8
Red        9
White      10
Yellow     11
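One possible storage layout for the fields of Table 1 is sketched below using SQLite; the table and column names are assumptions, since the paper does not specify a database schema.

```python
import sqlite3

# Sketch of a metadata store following Table 1 (names are assumptions).
SCHEMA = """
CREATE TABLE IF NOT EXISTS object_metadata (
    camera_id     INTEGER,   -- camera identification number
    video_id      INTEGER,   -- video identification number
    frame_number  INTEGER,
    object_id     INTEGER,   -- object identification number
    object_size   INTEGER,   -- W x H of the segmented object
    aspect_ratio  REAL,      -- W / H
    color1        INTEGER,   -- indices into the 11-color list of Table 2
    color2        INTEGER,
    color3        INTEGER,
    patch         BLOB       -- background-free object patch
);
"""

conn = sqlite3.connect("metadata.db")
conn.execute(SCHEMA)
conn.commit()
conn.close()
```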

Fig. 7. The presented metadata extraction result (a) detected object region using YOLO v3, (b) object region without background, (c) representative color metadata result, (d) object edge result, (e) object patch metadata result.
../../Resources/ieie/IEIESPC.2021.10.3.209/fig7.png

3. Experimental Results

In this section, we consider diverse environments to verify the objective performance of the proposed method. Public and handcrafted datasets were used in the experiments. The public dataset is DukeMTMC-ReID [20], which provides detected object regions and consists of 16,522 training images and 17,661 test images for object re-identification. To cover harder cases, the handcrafted dataset was acquired using static cameras and dash cameras: the static cameras use a 34° standard lens and a 180° fisheye lens, and the dash camera uses a 170° fisheye lens.

We evaluate the color extraction performance with background elimination using the public and handcrafted datasets. For the experiments, we additionally trained the YOLO v3 module by gathering and augmenting data from 20,000 pedestrians to enhance detection performance. The improved DeepLab v3+ for background elimination was trained with the PASCAL VOC 2012 dataset.

3.1 Ablation Study to Adopt Deep Learning-based Object Detector

Table 3 presents the result of an ablation study to select a proper object detector for the proposed method. As shown in Table 3, SSD has the fastest detection speed of 54.73 ms but a much lower average detection performance (0.46) than the other detectors. When objects are small in images captured with a fisheye lens, the detection performance of SSD is extremely low, at 0.32 and 0.37.

Faster R-CNN presents the highest detection performance of 0.84 in the ablation study. However, it has the slowest detection speed of 2864.28 ms. Although this method shows accurate detection, it is unsuited to the proposed method, which requires real-time object detection. On the other hand, the YOLO v3 detector records a detection performance of 0.78, slightly lower than the Faster R-CNN result, while its average detection speed of 431.15 ms is more than six times faster than that of Faster R-CNN. This makes it suitable for the proposed real-time object detection.

Furthermore, the additionally trained YOLO v3 achieves a detection performance 0.06 higher than the original YOLO v3, matching that of Faster R-CNN. Its detection speed of 305.27 ms is faster than both the original YOLO v3 and Faster R-CNN. As a result, the proposed method detects candidate object regions using the additionally trained YOLO v3 detector.

Table 3. Result of the ablation study using existing deep learning detectors (detection performance, with detection time in parentheses).

Lens                  SSD                Faster R-CNN         YOLO v3             YOLO v3 with fine-tuning
34° standard lens 1   0.50 (57.89 ms)    0.78 (2690.85 ms)    0.72 (458.96 ms)    0.77 (301.48 ms)
34° standard lens 2   0.64 (54.15 ms)    0.94 (2593.20 ms)    0.89 (299.42 ms)    0.93 (302.07 ms)
180° fisheye lens 1   0.50 (57.52 ms)    0.88 (2808.78 ms)    0.79 (432.54 ms)    0.89 (320.28 ms)
180° fisheye lens 2   0.32 (52.46 ms)    0.75 (3194.20 ms)    0.74 (510.15 ms)    0.76 (301.16 ms)
170° fisheye lens     0.37 (51.68 ms)    0.84 (3034.37 ms)    0.75 (454.69 ms)    0.84 (301.40 ms)
Average               0.46 (54.73 ms)    0.84 (2864.28 ms)    0.78 (431.15 ms)    0.84 (305.27 ms)

3.2 Metadata Extraction without Background Information

The background elimination methods for metadata extraction were compared using GMM, the original DeepLab v3+, and the enhanced model. Fig. 8 presents the GMM-based metadata extraction results for a static camera with a 34° standard lens. As shown in Figs. 8(d) and (e), GMM-based background elimination loses the object region and the metadata for small and slow-moving objects because the movement features of such objects are insufficient.

Fig. 9 shows the original DeepLab v3+-based background elimination and metadata extraction results for a static camera with a 34° standard lens. The original DeepLab v3+ can eliminate the background around a slow-moving object because the deep learning-based segmentation method does not rely on movement features. However, as shown in Fig. 9(d), the original DeepLab v3+ still loses and misclassifies part of the object region, and the metadata is then lost, as shown in Fig. 9(e), because its feature maps lose semantic information of small objects.

On the other hand, Fig. 10 shows the improved DeepLab v3+-based background elimination and metadata extraction results for a static camera with a 34° standard lens. As shown in Fig. 10(d), the enhanced model accurately eliminates the background in the yellow box region where GMM and the original DeepLab v3+ fail. As shown in Fig. 10(e), the accurate background elimination improves the metadata, because the enhanced model does not use movement features and reinforces the semantic information of small objects.

Fig. 11 shows the background elimination and metadata extraction results for a non-static camera with a 170° fisheye lens. In a non-static camera environment, the GMM-based background elimination method cannot detect an object region at all. Therefore, we compared only the original DeepLab v3+ and the enhanced model to evaluate background elimination and metadata extraction in non-static cameras.

Figs. 11(d) and (e) show the original DeepLab v3+-based background elimination and color metadata extraction results. The original DeepLab v3+ presents low background elimination performance because part of the background is classified as an object, as shown in the yellow box in Fig. 11(d), and consequently it extracts inaccurate metadata, as shown in Fig. 11(e). On the other hand, the improved DeepLab v3+ accurately eliminates the background in the region where the original DeepLab v3+ fails, as shown in Fig. 11(f), and extracts accurate metadata, as shown in Fig. 11(g).

Fig. 12 shows the background elimination and color metadata extraction performance of DeepLab v3+ and the improved DeepLab v3+ on a static camera with a 180° fisheye lens. As shown in Fig. 12(a), the 180° fisheye lens introduces object distortion. For this reason, the original DeepLab v3+ fails to eliminate part of the background, as shown in the yellow box in Fig. 12(d), and partially fails to extract the metadata, as shown in Fig. 12(e). In contrast, the improved DeepLab v3+ accurately eliminates the background in the region where the original DeepLab v3+ fails, as shown in the yellow box in Fig. 12(f), and thus extracts accurate metadata, as shown in Fig. 12(g).

The representative color metadata extraction accuracy of the proposed method is defined as:

(4)
$Acc=\left(\left|f_{gt}\cap f_{c}\right|/f_{t}\right)\times 100,$

where $f_{gt}$, $f_{c}$, and $f_{t}$ represent the color metadata ground truth, the extracted color metadata, and the number of test images, respectively. Table 4 evaluates the effect of the background on metadata extraction using the DukeMTMC-ReID dataset. In Table 4, the compared settings are objects with background, original DeepLab v3+-based background elimination, and improved DeepLab v3+-based background elimination. The metadata of objects with background shows noticeably lower representative color extraction performance because, with the background included, it is difficult to distinguish the upper body from the lower body.
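A minimal sketch of the accuracy measure in Eq. (4) follows; the function name and list-based interface are illustrative assumptions.

```python
# Sketch of Eq. (4): percentage of images whose extracted representative
# color matches the ground-truth color label.
def color_accuracy(extracted, ground_truth):
    assert len(extracted) == len(ground_truth)
    correct = sum(1 for e, g in zip(extracted, ground_truth) if e == g)
    return 100.0 * correct / len(ground_truth)
```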

As shown in Table 4, the object with background yields very low accuracies of 67.54% and 52.89% on the training dataset and 66.47% and 50.62% on the test dataset. Compared with the original DeepLab v3+, the proposed method improves upper-body clothing color extraction accuracy by 3.4% and lower-body accuracy by 1.5% on the training dataset, and by 3.4% and 3.0%, respectively, on the test dataset.

Fig. 13 shows the metadata extraction results for the DukeMTMC-ReID dataset. As shown in Fig. 13(b), the metadata of an object with background includes more background information than object information; as a result, the object metadata is weaker than the background metadata. The original DeepLab v3+-based result in Fig. 13(c) still contains unnecessary information from part of the background. On the other hand, the improved DeepLab v3+-based method in Fig. 13(d) eliminates the background accurately and then extracts enhanced metadata and shape information.

Table 4. Performance comparison between the original DeepLab v3+ method and our method using DukeMTMC-ReID.

DukeMTMC-ReID dataset                              Upper-body clothing   Lower-body clothing
Object with background        Training dataset     67.54%                52.89%
                              Test dataset         66.47%                50.62%
Original DeepLab v3+ method   Training dataset     83.8%                 83.5%
                              Test dataset         82.2%                 81.5%
Our method                    Training dataset     87.2%                 85.0%
                              Test dataset         85.6%                 84.5%

Fig. 8. GMM-based background elimination and metadata extraction result on a static camera using a 34° standard lens (a) input frame, (b) GMM frame, (c) ground-truth object, (d) object detection result using GMM, (e) color metadata extraction result.
../../Resources/ieie/IEIESPC.2021.10.3.209/fig8.png
Fig. 9. Original DeepLab v3+-based background elimination and metadata extraction result on a static camera using a 34° standard lens (a) input frame, (b) YOLO v3 result, (c) detected object result, (d) original DeepLab v3+-based background elimination result, (e) color metadata extraction result.
../../Resources/ieie/IEIESPC.2021.10.3.209/fig9.png
Fig. 10. Improved DeepLab v3+-based background elimination and metadata extraction result on a static camera using a 34° standard lens (a) input frame, (b) YOLO v3 result, (c) detected object result, (d) improved DeepLab v3+-based background elimination result, (e) color metadata extraction result.
../../Resources/ieie/IEIESPC.2021.10.3.209/fig10.png
Fig. 11. Comparative experiment results of DeepLab v3+ and improved DeepLab v3+ on a non-static camera using a 170° fisheye lens (a) input frame, (b) YOLO v3 result, (c) detected object region result, (d) original DeepLab v3+-based background elimination result, (e) original DeepLab v3+-based color metadata extraction result, (f) improved DeepLab v3+-based background elimination result, (g) improved DeepLab v3+-based color metadata extraction result.
../../Resources/ieie/IEIESPC.2021.10.3.209/fig11.png
Fig. 12. Comparative experiment results of DeepLab v3+ and improved DeepLab v3+ on a static camera using a 180° fisheye lens (a) input frame, (b) YOLO v3 result, (c) detected object region result, (d) original DeepLab v3+-based background elimination result, (e) original DeepLab v3+-based color metadata extraction result, (f) improved DeepLab v3+-based background elimination result, (g) improved DeepLab v3+-based color metadata extraction result.
../../Resources/ieie/IEIESPC.2021.10.3.209/fig12.png
Fig. 13. Metadata extraction result of DukeMTMC-Reid dataset (a) input object, (b) metadata of object with the background, (c) original DeepLab v3+-based metadata result, (d) improved DeepLab v3+-based metadata result.
../../Resources/ieie/IEIESPC.2021.10.3.209/fig13.png

4. Conclusion

This paper proposed a metadata extraction method to reduce the human effort and storage required for real-time and non-real-time monitoring in an intelligent surveillance system. The proposed method adopts the YOLO v3 detector to detect objects of interest. We also proposed an improved DeepLab v3+ that is robust to multiple scales and addresses the original DeepLab v3+'s weak background elimination performance for small objects.

The improved DeepLab v3+ is used to extract an accurate object without background. Finally, the metadata, consisting of the object's representative color, size, aspect ratio, and patch, is extracted from the object without background. The performance of the proposed method was validated through experiments using public and handcrafted datasets. Consequently, the proposed metadata extraction method can be applied to a wide range of surveillance systems, such as object search and large public space monitoring in multi-camera and mobile camera-based surveillance systems.

ACKNOWLEDGMENTS

This work was partly supported by a grant from the Institute for Information & communications Technology Promotion (IITP) funded by the Korea government (MSIT) (2017-0-00250, Intelligent Defense Boundary Surveillance Technology Using Collaborative Reinforced Learning of Embedded Edge Camera and Image Analysis) and by the ICT R&D program of MSIP/IITP [2014-0-00077, development of global multi-target tracking and event prediction techniques based on real-time large-scale video analysis].

REFERENCES

1 
Garcia-Lamont F., Cervantes J., Lopez A., Rodriguez L., 2018, Segmentation of images by color features: A survey, Neurocomputing, Vol. 292, pp. 1-27.
2 
Yang M., Kpalma K., Ronsin J., 2008, A survey of shape feature extraction techniques, Pattern Recognition Techniques.
3 
Humeau-Heurtier A., 2019, Texture feature extraction methods: A survey, IEEE Access, Vol. 7, pp. 8975-9000.
4 
Kim T., Kim D., Kim P., Kim P., Dec. 2016, The Design of Object-of-Interest Extraction System Utilizing Metadata Filtering from Moving Object, Journal of KIISE, Vol. 43, No. 12, pp. 1351-1355.
5 
Geronimo D., Kjellstrom H., Aug. 2014, Unsupervised surveillance video retrieval based on human action and appearance, Proc. IEEE Int. Conf. Pattern Recognition, pp. 4630-4635.
6 
Yuk J. S-C., Wong K-Y. K., Chung R. H-Y., Chow K. P., Chin F. Y-L., Tsang K. S-H., 2007, Object-based surveillance video retrieval system with real-time indexing methodology, Proc. Int. Conf. Image Analysis and Recognition, pp. 626-637.
7 
Paek I., Park C., Ki M., Park K., Paik J., Nov. 2007, Multiple-view object tracking using metadata, Proc. Int. Conf. Wavelet Analysis and Pattern Recognition, Vol. 1, No. 1, pp. 12-17.
8 
Jung J., Yoon I., Lee S., Paik J., June 2016, Normalized Metadata Generation for Human Retrieval Using Multiple Video Surveillance Cameras, Sensors, Vol. 16, No. 7, pp. 1-9.
9 
Yun S., Yun K., Kim S. W., Yoo Y., Jeong J., Aug. 2014, Visual surveillance briefing system: Event-based video retrieval and summarization, Proc. 11th IEEE Int. Conf. Advanced Video and Signal Based Surveillance (AVSS), pp. 204-209.
10 
Chavda H. K., Dhamecha M., 2017, Moving object tracking using PTZ camera in video surveillance system, Proc. Int. Conf. Energy, Communication, Data Analytics and Soft Computing.
11 
Hou L., Wan W., Lee K-H., Hwang J-N., Okopal G., Pitton J., 2017, Robust Human Tracking Based on DPM Constrained Multiple-Kernel from a Moving Camera, Journal of Signal Processing Systems, Vol. 86, No. 1, pp. 27-39.
12 
Wu S., Oreifej O., Shah M., Nov. 2011, Action recognition in videos acquired by a moving camera using motion decomposition of Lagrangian particle trajectories, Proc. IEEE Int. Conf. Computer Vision (ICCV), pp. 1419-1426.
13 
Seemanthini K., Manjunath S. S., Jan. 2018, Human detection and tracking using HOG for action recognition, Procedia Computer Science, Vol. 132, pp. 1317-1326.
14 
Patel C. I., Garg S., Zaveri T., Banerjee A., Aug. 2018, Human action recognition using fusion of features for unconstrained video sequences, Computers and Electrical Engineering, Vol. 70, pp. 284-301.
15 
Redmon J., Farhadi A., April 2018, YOLOv3: An Incremental Improvement, arXiv preprint arXiv:1804.02767.
16 
Ren S., He K., Girshick R., Sun J., 2015, Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, Advances in Neural Information Processing Systems 28 (NIPS).
17 
Liu W., Anguelov D., Erhan D., Szegedy C., Reed S., Fu C.-Y., Berg A. C., Sept. 2016, SSD: Single Shot MultiBox Detector, European Conference on Computer Vision (ECCV), pp. 21-37.
18 
Chen L.-C., Zhu Y., Papandreou G., Schroff F., Adam H., 2018, Encoder-decoder with atrous separable convolution for semantic image segmentation, European Conference on Computer Vision (ECCV).
19 
van de Weijer J., Schmid C., Verbeek J., Larlus D., 2009, Learning color names for real-world applications, IEEE Transactions on Image Processing, Vol. 18, No. 7, pp. 1512-1523.
20 
Zheng Z., Zheng L., Yang Y., 2017, Unlabeled Samples Generated by GAN Improve the Person Re-Identification Baseline in Vitro, arXiv preprint arXiv:1701.07717.

Author

Heungmin Oh
../../Resources/ieie/IEIESPC.2021.10.3.209/au1.png

Heungmin Oh was born in Busan, Korea, in 1994. He received a B.S. in computer engineering from Silla University, Korea, in 2020. Currently, he is pursuing an M.S. in digital imaging engineering at Chung-Ang University. His research interests include object segmentation and artificial intelligence.

Minjung Lee
../../Resources/ieie/IEIESPC.2021.10.3.209/au2.png

Minjung Lee was born in Busan, Korea, in 1994. She received a B.S. degree in electronics engineering from Silla University in 2017 and an M.S. degree in image engineering in 2019 from Chung-Ang University. She is currently working toward a Ph.D. in image engineering at Chung-Ang University, Seoul. Her research interest includes geometric distortion correction, object parsing, and feature extraction.

Hyungtae Kim
../../Resources/ieie/IEIESPC.2021.10.3.209/au3.png

Hyungtae Kim was born in Seoul, Korea, in 1986. He received a B.S. degree from the Department of Electrical Engineering of Suwon University in 2012 and an M.S. degree in image engineering in 2015 from Chung-Ang University. He is currently pursuing a Ph.D. in image engineering at Chung-Ang University. His research interests include multi-camera calibration based on large-scale video analysis.

Joonki Paik
../../Resources/ieie/IEIESPC.2021.10.3.209/au4.png

Joonki Paik was born in Seoul, Korea, in 1960. He received a BSc in Control and Instrumentation Engineering from Seoul National University in 1984. He received an MSc and a PhD in Electrical Engineering and Computer Science from Northwestern University in 1987 and 1990, respectively. From 1990 to 1993, he worked at Samsung Electronics, where he designed image stabilization chip sets for consumer camcorders. Since 1993, he has been on the faculty at Chung-Ang University, Seoul, Korea, where he is currently a Professor in the Graduate School of Advanced Imaging Science, Multimedia, and Film. From 1999 to 2002, he was a Visiting Professor in the Department of Electrical and Computer Engineering at the University of Tennessee, Knoxville. Dr. Paik was a recipient of the Chester Sall Award from the IEEE Consumer Electronics Society, the Academic Award from the Institute of Electronic Engineers of Korea, and the Best Research Professor Award from Chung-Ang University. He has served the IEEE Consumer Electronics Society as a member of the editorial board. Since 2005, he has been the head of the National Research Laboratory in the field of image processing and intelligent systems. In 2008, he worked as a full-time technical consultant for the System LSI Division at Samsung Electronics, where he developed various computational photographic techniques, including an extended depth-of-field (EDoF) system. From 2005 to 2007, he served as Dean of the Graduate School of Advanced Imaging Science, Multimedia, and Film. From 2005 to 2007, he was Director of the Seoul Future Contents Convergence (SFCC) Cluster established by the Seoul Research and Business Development (R&BD) Program. Dr. Paik is currently serving as a member of the Presidential Advisory Board for Scientific/Technical Policy of the Korean government and is a technical consultant for the Korean Supreme Prosecutor’s Office for computational forensics.