
  1. (Department of Image, Chung-Ang University, Seoul, 06974, Korea hyungtae@ipis.cau.ac.kr)
  2. (Department of Artificial Intelligence, Chung-Ang University, Seoul, 06974, Korea paikj@cau.ac.kr)



Keywords: Metadata generation, Abnormal recognition, Metadata retrieval, Intelligent surveillance system

1. Introduction

Video surveillance systems play a crucial role in preventing crimes, ensuring public safety, and capturing critical incidents. However, traditional surveillance systems heavily rely on human operators to monitor video streams, which can be prone to errors, inefficiencies, and limitations in real-time analysis. To overcome these challenges, intelligent video surveillance systems have emerged as a promising solution, leveraging advanced technologies such as object detection, tracking, abnormal behavior recognition, and metadata extraction and retrieval.

Intelligent video surveillance systems have witnessed remarkable advancements in object detection and tracking techniques, leading to automation and enhanced monitoring capabilities. State-of-the-art technologies have significantly improved the accuracy and efficiency of surveillance systems. Region-based methods, such as Faster R-CNN [1] and YOLO [2], have gained considerable attention for their ability to detect objects with high precision and real-time performance. Motion-based approaches, including optical flow [3] and Kalman filter-based methods [4], have been widely utilized to track objects in dynamic scenes with robustness and accuracy. Other notable techniques in object detection and tracking are summarized as follows. Liu et al. proposed a real-time object detection method, called Single Shot MultiBox Detector (SSD), that achieves high accuracy by utilizing multiple layers with different resolutions [5]. Tan et al. proposed an efficient and scalable object detection framework called EfficientDet that achieves impressive performance by optimizing the trade-off between accuracy and computational efficiency [6]. A series of YOLO-based methods has also been proposed; these methods rank among the state of the art in real-time object detection [7-9].

Wojke et al. proposed a deep learning-based object tracking algorithm called Deep-SORT that combines appearance information with motion cues for robust and accurate tracking [10]. Zhou et al. proposed a tracking-by-detection approach called CenterTrack that incorporates object center estimation to improve tracking performance in crowded scenes [11]. Xu et al. proposed the anchor-free PP-YOLOE paradigm [9], and a similar anchor-free mechanism was proposed in [12]; both achieve state-of-the-art performance in multiple object tracking.

Abnormal behavior recognition plays a vital role in intelligent video surveillance systems, enabling the identification and classification of unusual activities or events that deviate from normal patterns. Extensive research has been conducted in this field, utilizing machine learning algorithms, deep neural networks, and statistical models. Several notable works include the use of convolutional neural networks (CNNs) for abnormal event detection [13], anomaly detection based on spatiotemporal features [14], and the modeling of behavior using hidden Markov models (HMMs) [15].

Metadata extraction and retrieval techniques complement the intelligent video surveillance system by facilitating efficient data management and rapid scene retrieval. Metadata encompasses essential information such as object attributes, timestamps, and spatial coordinates, providing valuable context for analyzing and searching video content. Various approaches have been proposed for metadata extraction, including object-based methods [16], feature extraction algorithms [17], and deep learning-based techniques [18]. In terms of retrieval, SQL-based querying has proven effective in enabling fast and accurate retrieval of relevant scenes based on specific criteria.

The proposed intelligent video surveillance system integrates object detection, tracking, abnormal behavior recognition, metadata extraction, and retrieval modules as shown in Fig. 1. These components collectively form a comprehensive framework that enhances surveillance capabilities and streamlines the analysis process. The system architecture encompasses a network of edge cameras for real-time monitoring and an analysis server for centralized processing and storage. This scalable and adaptable architecture ensures seamless integration with existing surveillance infrastructure, maximizing the system’s efficiency and effectiveness.

This paper is organized as follows. In Section 2, we introduce the proposed system and provide an overview of how representative basic metadata is generated. Section 3 focuses on the generation of event metadata through the recognition of abnormal behaviors. We explore the techniques and algorithms employed to detect and classify these abnormal behaviors, enabling the extraction of specific event metadata. In Section 4, we delve into the design of the metadata structure and discuss the methodologies used for efficient video retrieval. In Section 5, we present the experimental results obtained through the implementation of our proposed system. Finally, in Section 6, we draw conclusions based on the findings presented throughout the paper.

Fig. 1. Overview of the proposed system. The proposed method is divided into three modules: i) edge cameras with detectors; ii) analysis servers including the multi-object tracker and abnormal behavior recognition; iii) monitoring for query-based video retrieval.
../../Resources/ieie/IEIESPC.2024.13.6.541/fig1.png

2. The Proposed System Architecture

In this section, we introduce the general procedure of the proposed surveillance system, including object detection and tracking, with the aim of generating representative metadata for video retrieval. As shown in Fig. 1, the proposed system consists of three parts: i) edge cameras, ii) analysis servers, and iii) monitoring. Each part plays a specific role as follows. Edge cameras are in charge of object detection when objects appear and generate low metadata, which serves as a seed for tracking and simple queries. In the analysis servers, the tracker groups detection results and extracts the trajectory and representative metadata, i.e., common information about the object such as clothing color and size. Anomaly recognition detects anomalous scenes and classifies the cases of abnormal behavior, which are separated into normal and event metadata. In general, the generated metadata is stored in the database; exceptionally, when a query is fed into the analysis servers, the video retrieved from the database is returned to the monitoring system.

2.1 Object Detection and Tracking

As mentioned above, there are diverse object detection and tracking methods. Since our framework focuses on video retrieval and must be applicable to both edge cameras and analysis servers, we consider three conditions when designing the proposed system: i) the detector should detect not only normal-sized objects but also small objects with the lowest possible false-positive rate, ii) the edge camera should run in real time without hardware or software constraints, and iii) the tracker should temporally associate detected objects and extract the metadata at the same time.

In addition, practical surveillance cameras come in many kinds, differing in various respects such as field of view (FoV), resolution, computational power, frame rate, pan, tilt, and zoom. These factors affect the performance of the detector. Furthermore, various scene events occur in real surveillance situations due to occlusions among objects, moving speed, illumination changes, and so on. Hence, our detector is selected to satisfy the considerations mentioned above on those cameras. The most widely used object detection methods are categorized according to their base algorithm as RCNN-based, YOLO-based, and SSD-based. The RCNN-based algorithms have a two-stage framework and are slower than the others. The SSD-based method is faster than RCNN-based detection, but misdetection occurs when small objects exist in the scene. YOLO detection is also a one-stage method similar to SSD, which is an advantage when applied to the surveillance scenarios handled in our cases. Furthermore, YOLO-based detection models have been ranked at the top of the recent state-of-the-art detectors. Consequently, we adopt YOLO-based detection methods as our detector. The specific YOLO version is determined by evaluating its performance on the edge camera and in practical monitoring scenarios.

In the past, off-the-shelf cameras often had limited computational ability, leading to a preference for tiny versions of object detection schemes. However, using lightweight detectors to reduce computational load came at the cost of less accurate detection [19]. This could result in false-negative and false-positive detections and missed objects in edge cameras. To address these issues, we perform complementary detection on the analysis server using two approaches.

The first complement to detection takes place during the tracking process. When updating the tracker, we extract tracking candidates in proximity to the previous object location. This allows us to detect objects that may have become smaller or larger than before, thus preventing objects from going undetected.

The second complement to detection occurs in anomaly recognition. Our anomaly recognition module classifies scenes into four classes: normal, falling, pushing, and crossing a wall. The "normal" class includes complex scenes, such as crowded areas with occlusions that degrade detection accuracy and may cause undetected objects in edge cameras. Once a scene is classified as normal, the anomaly detection module produces object metadata.

To enhance edge camera detection capabilities, we leverage various YOLO versions, including 3, 5, 7, and their modified sub-modules [20-22]. These versions have proven effective in addressing the challenges posed by edge cameras. However, we exclude the PP-YOLOE detection mechanism, because PP-YOLOE requires specific software configurations such as PaddlePaddle that have not been verified on the embedded device [9].

The detected object, denoted as $O$, is represented as follows:

(1)
$ O=\left\{O_{id},x_{l},y_{t},w,h\right\}, $

where $O_{id}$, $\left(x_{l},y_{t}\right)$, $w$, and $h$ are the object id, the top-left coordinates, and the width and height of the bounding box, respectively.

The edge camera encodes the detection results as:

(2)
$ C=\left\{C_{id},t,fn,O_{id},x_{l},y_{t},w,h\right\}, $

where $C_{id}$, $t$, and $fn$ represent the camera id, time, and frame number, respectively.

These encoded data are called low metadata, and after encoding, they are fed into the analysis servers together with the video stream. The collected low metadata contain redundant information because the camera captures images at 24 to 30 fps. Thus, both the representativeness of the metadata and the efficiency of retrieval are enhanced by applying a proper object tracker. In addition, the computational cost of anomaly detection has to be considered.
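
As a concrete illustration of Eqs. (1) and (2), the following Python sketch encodes one detection into a low-metadata record. The field names mirror the notation above, while the class and function names (LowMetadata, encode_detection) are our own illustrative choices, not the actual edge camera implementation.

```python
from dataclasses import dataclass, asdict

@dataclass
class LowMetadata:
    """Low metadata produced by an edge camera for one detection, cf. Eq. (2)."""
    cam_id: int   # camera id C_id
    t: float      # capture time
    fn: int       # frame number
    obj_id: int   # object id O_id
    x_l: int      # left coordinate of the bounding box
    y_t: int      # top coordinate of the bounding box
    w: int        # bounding box width
    h: int        # bounding box height

def encode_detection(cam_id, t, fn, detection):
    """Wrap a detector output O = {O_id, x_l, y_t, w, h} (Eq. (1)) into low metadata."""
    obj_id, x_l, y_t, w, h = detection
    return LowMetadata(cam_id, t, fn, obj_id, x_l, y_t, w, h)

# Example: one detection from camera 1 at frame 1.
record = encode_detection(cam_id=1, t=0.033, fn=1, detection=(0, 264, 347, 137, 412))
print(asdict(record))
```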

As mentioned at the beginning of this section, the object tracking module in the analysis servers associates the low metadata over time and extracts fine-grained metadata at the same time. Meanwhile, the computational cost in the analysis server is decoupled across tracking, metadata generation, and anomaly detection. In our analysis server architecture, object tracking and metadata generation are incorporated into one module, and an independent anomaly detection module supports object detection in challenging scenarios. Hence, we select the scale-adaptive version of the distractor-aware tracker (DAT) as our tracker [23]. The DAT method uses the color histograms of the object, $O$, and its surrounding region, $S$. The scale-adaptive ability of DAT is helpful for our system because most cameras undergo viewpoint changes that cause scale variation of objects, and the proposed metadata includes the aspect ratio, which is related to scale.

We define the basic tracking procedure for applying the DAT tracker. The procedure is organized into i) birth of the tracker, ii) update, and iii) death of the tracker. When low metadata is fed into the tracking module, the tracker is initialized; this is called tracker birth. The tracker birth, $O_{t}$, is defined over the detected objects of (1) as:

(3)
$ O_{t}=\arg \max \left(s_{v}\left(O\right)s_{d}\left(O\right)\right), \\ s_{v}\left(O\right)=\sum _{x\in O}P\left(x\in O|b_{x}\right), \\ s_{d}\left(O\right)=\sum _{x\in O}\exp \left(-\frac{\left| x-c_{t-1}\right| ^{2}}{2\sigma ^{2}}\right), $

where $s_{v}\left(\cdot \right)$ is the voting score, $s_{d}\left(\cdot \right)$ is the distance score from the object center $c$, and $b_{x}$ is the histogram bin assigned to the color components of pixel $x$ in the detected object; $P\left(x\in O|b_{x}\right)$ is the object likelihood of that bin.

In the tracking process, the tracker is updated as follows:

(4)
$ O_{t}^{new}=\arg \max \left(s_{v}\left(O_{t,i}\right)s_{d}\left(O_{t,i}\right)\right), \\ s_{v}\left(O_{t,i}\right)=\sum _{x\in O_{t,i}}P_{1:t-1}\left(x\in O|b_{x}\right), \\ s_{d}\left(O_{t,i}\right)=\sum _{x\in O_{t,i}}\exp \left(-\frac{\left| x-c_{t-1}\right| ^{2}}{2\sigma ^{2}}\right), $

where $O_{t,i}$ are the detected objects that serve as candidates to track, and $c_{t-1}$ is the center of the previous object.
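
The candidate scoring in Eqs. (3) and (4) can be sketched as follows. The per-pixel object probability $P_{1:t-1}(x\in O|b_{x})$ is represented here by a precomputed array prob_obj; this is a simplified illustration of the voting-times-distance selection rather than the full DAT implementation of [23], and all function names are hypothetical.

```python
import numpy as np

def voting_score(prob_obj):
    """Sum of per-pixel object probabilities over a candidate box, cf. s_v in Eq. (4).
    prob_obj: array of P(x in O | b_x) values for the candidate's pixels."""
    return float(np.sum(prob_obj))

def distance_score(pixel_coords, prev_center, sigma):
    """Gaussian weighting of candidate pixels around the previous center, cf. s_d in Eq. (4)."""
    d2 = np.sum((pixel_coords - prev_center) ** 2, axis=1)
    return float(np.sum(np.exp(-d2 / (2.0 * sigma ** 2))))

def select_candidate(candidates, prev_center, sigma=15.0):
    """Pick the candidate maximizing s_v * s_d, cf. the argmax in Eqs. (3)-(4)."""
    scores = [voting_score(c["prob_obj"]) * distance_score(c["coords"], prev_center, sigma)
              for c in candidates]
    return candidates[int(np.argmax(scores))]

# Toy usage: the first candidate is closer to the previous center and more object-like.
cands = [
    {"prob_obj": np.full((4, 4), 0.9), "coords": np.tile([101.0, 200.0], (16, 1))},
    {"prob_obj": np.full((4, 4), 0.4), "coords": np.tile([160.0, 240.0], (16, 1))},
]
best = select_candidate(cands, prev_center=np.array([100.0, 200.0]))
```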

The death of the tracker occurs when the tracked object disappears. At tracker death, the low-level metadata is refined and the representative metadata is generated. The trajectory of the object, $T_{O}$, is estimated from the sequence of objects detected by the tracker and defined as:

(5)
$ T_{O}=\left\{x_{s},y_{s},x_{1/3},y_{1/3},x_{2/3},y_{2/3},x_{e},y_{e}\right\}, $

where $x$ and $y$ are the horizontal and vertical coordinates, and the subscripts $s$, $1/3$, $2/3$, and $e$ denote the start, one-third, two-thirds, and end points of the trajectory, respectively. The trajectory becomes one of the representative metadata items used for retrieval, and the bounding box information is stored separately. The precise algorithms are described in the following subsection.
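
A minimal sketch of how the trajectory summary in Eq. (5) can be assembled from the per-frame object centers accumulated by the tracker; this is illustrative code under the assumption that the tracker exposes the centers as a list, not the system's exact routine.

```python
def summarize_trajectory(centers):
    """Reduce a tracked sequence of (x, y) centers to the eight values of Eq. (5):
    the start, one-third, two-thirds, and end points of the trajectory."""
    if len(centers) < 2:
        raise ValueError("a trajectory needs at least two points")
    n = len(centers)
    picks = [0, n // 3, (2 * n) // 3, n - 1]
    summary = []
    for idx in picks:
        x, y = centers[idx]
        summary.extend([x, y])
    return summary  # [x_s, y_s, x_1/3, y_1/3, x_2/3, y_2/3, x_e, y_e]

# Example: a straight left-to-right walk sampled over nine frames.
trajectory_metadata = summarize_trajectory([(100 + 10 * i, 400) for i in range(9)])
```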

2.2 Color and Aspect Ratio Metadata

In this subsection, the methods for generating the representative color and aspect ratio metadata are explained. The color and aspect ratio of an object are important keys for specifying that object among all objects in the recorded stream. In addition, changes in object attributes within video recorded by standard cameras are rare; hence, the color and aspect ratio are adopted as metadata in the proposed system. To extract the representative color, we adopt a probabilistic latent semantic analysis (PLSA)-based generative model [24]. The PLSA-based model is trained on a Google Images dataset to generate the color distributions. The object color distribution is accumulated during the update process of the tracker. As a result, we can easily extract the representative color by comparing the object distribution with the reference distributions. The PLSA model-based color extraction is robust to changes in illumination and viewpoint because the training of the reference distributions covers a brightness range from low to high. The number of proposed representative colors is 11: black, blue, brown, gray, green, orange, pink, purple, red, white, and yellow.
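
The mechanics of the representative-color step can be sketched as below: per-frame color histograms are accumulated over the tracker updates and compared against reference distributions of the 11 color names. The random reference table here is only a stand-in for the PLSA-trained model of [24], and the 64-bin histogram size and dot-product similarity are illustrative assumptions.

```python
import numpy as np

COLOR_NAMES = ["black", "blue", "brown", "gray", "green", "orange",
               "pink", "purple", "red", "white", "yellow"]

# Stand-in for the PLSA-trained reference distributions over quantized color bins.
rng = np.random.default_rng(0)
REFERENCE = rng.random((len(COLOR_NAMES), 64))
REFERENCE /= REFERENCE.sum(axis=1, keepdims=True)

def accumulate_histogram(per_frame_histograms):
    """Accumulate and normalize the per-frame color histograms of one tracked object."""
    acc = np.sum(per_frame_histograms, axis=0)
    return acc / acc.sum()

def top3_colors(object_hist):
    """Rank the 11 representative colors by similarity and keep ranks 1-3."""
    scores = REFERENCE @ object_hist
    order = np.argsort(scores)[::-1][:3]
    return [COLOR_NAMES[i] for i in order]

frames = rng.random((30, 64))  # toy per-frame histograms gathered during tracking
print(top3_colors(accumulate_histogram(frames)))
```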

The aspect ratio is estimated by normalizing the widths and heights accumulated during the tracking process. The aspect ratio, $R_{O}$, is defined as:

(6)
$ R_{O}=\frac{H_{O}}{W_{O}}=\frac{\sum _{{H_{i}}\in {O_{s,e}}}H_{i}}{\sum _{{W_{i}}\in {O_{s,e}}}W_{i}}, $

where $H_{O}$ and $W_{O}$ are the normalized height and width of the object, and $H_{i}$ and $W_{i}$ are the height and width of the $i$-th bounding box contained in the trajectory. The reason for using the aspect ratio as metadata is that it is less affected than the absolute size by changes in camera resolution, without requiring preprocessing such as calibration. Furthermore, (6) yields a normalized aspect ratio: the number of tracked frames appears in both the numerator and denominator and therefore cancels out.
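
A worked example of Eq. (6): because both sums run over the same tracked frames, the frame count cancels and no explicit division by the number of frames is needed (illustrative code).

```python
def aspect_ratio(boxes):
    """Eq. (6): ratio of summed heights to summed widths over all tracked boxes (w, h)."""
    total_w = sum(w for w, _ in boxes)
    total_h = sum(h for _, h in boxes)
    return total_h / total_w

# Three boxes of a standing pedestrian: the ratio stays near 3 even as the scale changes.
print(aspect_ratio([(60, 180), (70, 210), (80, 240)]))  # -> 3.0
```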

The trajectory, representative color, and aspect ratio constitute the basic metadata, which is used for query-based retrieval. For instance, each metadata item can be used to generate a single query, and combined metadata queries are also available. Single and combined query-based retrieval is generally used to search for normal scenes. Since object detection and tracking are designed under the assumption that the appearance and behavior of objects follow common patterns, this metadata-based retrieval has difficulty specifying abnormal scenes. Furthermore, normal scenes vastly outnumber abnormal ones, so a metadata-query-based abnormal search is inefficient. To solve this problem, we propose anomaly recognition to produce event metadata.

3. Event Metadata Generation by Abnormal Behavior Recognition

We propose two-stream convolutional neural network-based abnormal behavior recognition. The structure of the proposed abnormal recognition network is shown in Fig. 2. The proposed network fuses 2D-CNN [25,26] and 3D-CNN [27,28] architectures. In the 2D-CNN stream, the input video is converted into an image by frame morphing, as shown in Fig. 3. The spatial and temporal feature vectors are then concatenated using fully connected layer fusion.
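
The late fusion of the two streams can be sketched with tf.keras as below. The layer counts, input sizes, clip length, and four-way softmax are illustrative assumptions for exposition only and do not reproduce the exact backbones of [25-28] or the trained network reported here.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Spatial stream: a single morphed (afterimage) frame processed by 2-D convolutions.
spatial_in = layers.Input(shape=(224, 224, 3), name="morphed_frame")
x = layers.Conv2D(32, 3, activation="relu")(spatial_in)
x = layers.MaxPooling2D()(x)
x = layers.Conv2D(64, 3, activation="relu")(x)
x = layers.GlobalAveragePooling2D()(x)

# Temporal stream: a short clip of 16 frames processed by 3-D convolutions.
temporal_in = layers.Input(shape=(16, 112, 112, 3), name="clip")
y = layers.Conv3D(32, 3, activation="relu")(temporal_in)
y = layers.MaxPooling3D()(y)
y = layers.Conv3D(64, 3, activation="relu")(y)
y = layers.GlobalAveragePooling3D()(y)

# Late fusion by concatenation followed by fully connected layers.
fused = layers.Concatenate()([x, y])
fused = layers.Dense(128, activation="relu")(fused)
out = layers.Dense(4, activation="softmax", name="behavior")(fused)  # normal, pushing, falling, crossing wall

model = tf.keras.Model(inputs=[spatial_in, temporal_in], outputs=out)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```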

To train the proposed network, we collected a large-scale behavior dataset. It consists of 162 training and 12 test videos captured by surveillance cameras. As shown in Fig. 4, the proposed anomaly recognition aims to classify four categories: pushing (assault), falling, crossing a wall, and normal.

The proposed anomaly recognition approach differs from existing abnormal recognition methods, particularly in the number of abnormal cases it addresses. There are numerous classes of abnormal behavior, including assault, swooning, fighting, wandering, trespassing, and more, which can be categorized into two groups based on whether or not they can be defined by rules. However, training abnormal recognition models can be time-consuming and often requires dedicated training videos rather than just object detection and tracking data. Therefore, it becomes essential to adopt a practical number of classes.

As a result, our abnormal recognition model specifically focuses on handling the challenge of defining abnormal classes such as fighting, crossing a wall, and falling. As mentioned in Section 2.1, our dataset for abnormal recognition incorporates complex relationships between objects in various scenes. This feature empowers our anomaly detection network to provide robust support to the detector, particularly in challenging and complex scenes.

By concentrating on these specific cases and leveraging the complex relationship information, our anomaly detection approach optimally enhances the overall performance and efficiency of the system.

Fig. 2. The network architecture of the proposed two-stream abnormal behavior recognition; the upper and lower streams extract spatial and temporal features, respectively.
../../Resources/ieie/IEIESPC.2024.13.6.541/fig2.png
Fig. 3. The transition images of the frame morphing. The afterimage effects are shown in both figures.
../../Resources/ieie/IEIESPC.2024.13.6.541/fig3.png
Fig. 4. Sampled frame images of the proposed large-scale abnormal recognition dataset. Each image is sequentially sampled from the abnormal training videos.
../../Resources/ieie/IEIESPC.2024.13.6.541/fig4.png

4. Metadata Structure Design and Video Retrieval

This section introduces the SQL-based metadata structure and video retrieval according to specific queries. SQL is specialized for query-based programming; for this reason, we use SQL as our database management system. The proposed metadata structure has a hierarchy that consists of a common DB as the parent and two child DBs, the object DB and the abnormal DB. This separated hierarchical structure is more efficient and faster than a single compound metadata structure because, when searching for a clip with a specific query, the separated structure can concentrate the query on the relevant table while ignoring other data. Table 1 shows the parent structure of the proposed SQL-based metadata. In Table 1, $Order$ is the insertion order in SQL, $Cam\_ id$ is the identification number of the camera, and $Fr\_ no$ is the recorded frame number. $Obj\_ id$, $x,y,w,h$, $Color1$, $Color2$, and $Color3$ indicate the id of a detected object, its location information, and its representative colors up to rank 3.
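
A minimal sqlite3 sketch of this parent/child layout is given below; the table and column names are illustrative (Order is renamed ord because ORDER is an SQL keyword), so treat it as a picture of the hierarchy rather than the production schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Parent (common) table: per-frame detections with representative colors, cf. Table 1.
cur.execute("""CREATE TABLE common (
    ord INTEGER PRIMARY KEY, cam_id INTEGER, fr_no INTEGER, obj_id INTEGER,
    x INTEGER, y INTEGER, w INTEGER, h INTEGER,
    color1 INTEGER, color2 INTEGER, color3 INTEGER)""")

# Child table for object trajectories, cf. Table 2.
cur.execute("""CREATE TABLE object_trajectory (
    obj_id INTEGER, cam_id INTEGER, start_fr INTEGER, end_fr INTEGER,
    w INTEGER, h INTEGER, color1 INTEGER, color2 INTEGER, color3 INTEGER)""")

# Child table for abnormal events, cf. Table 3.
cur.execute("""CREATE TABLE abnormal_event (
    obj_id INTEGER, cam_id INTEGER, fr_no INTEGER,
    x INTEGER, y INTEGER, w INTEGER, h INTEGER,
    pushing INTEGER, falling INTEGER, crossing_wall INTEGER)""")

conn.commit()
```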

Table 2 represents the trajectory metadata structure, which is one of the child metadata structures. The database in Table 2 is more concise than the structure in Table 1. In Table 2, $Start\_ fr$ and $End\_ fr$ are the frame numbers at which the object appears and disappears, respectively. $w$ and $h$ denote the average width and height of the object corresponding to $Obj\_ id$. As in Table 1, $Color1$, $Color2$, and $Color3$ indicate the dominant representative colors up to rank 3.

Table 3 shows the metadata structure for events according to abnormal behaviors. The organization of the event metadata is similar to the common data structure. The aforementioned abnormal behavior classes are pushing, falling, crossing a wall, and normal.

However, the normal class is absent from the event metadata, because results classified as normal by the anomaly recognition are stored in the common database instead.

Fig. 5 shows our retrieval system configuration. Queries are combined by activating the corresponding check buttons, and detailed query options can be specified by filling out the text boxes or selecting a color. After the query is set up, the search results are listed at the bottom left. When a video path is selected, the summarized video is played in the right panel.
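
The checked query options in Fig. 5 translate directly into WHERE clauses on the child tables. Continuing the illustrative sqlite3 sketch from earlier in this section (the color index RED below is a hypothetical mapping, since the integer coding of the 11 colors is not fixed here), a single-camera, color-constrained query and an event query would look roughly as follows.

```python
RED = 8  # hypothetical index of "red" among the 11 representative colors

# Trajectories of red-dominant objects observed by camera 0, cf. Table 2.
cur.execute(
    "SELECT obj_id, start_fr, end_fr FROM object_trajectory "
    "WHERE cam_id = ? AND color1 = ?", (0, RED))
normal_hits = cur.fetchall()

# Frames in which camera 0 recorded a falling event, cf. Table 3.
cur.execute(
    "SELECT obj_id, fr_no FROM abnormal_event WHERE cam_id = ? AND falling = 1", (0,))
event_hits = cur.fetchall()
```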

Fig. 5. The example of retrieval based on the user query. The selectable queries are located at the left-top and the results of the search are listed below the query. The right side of the window displays a selected video stream.
../../Resources/ieie/IEIESPC.2024.13.6.541/fig5.png
Table 1. The practical database example of the parent structure in the proposed metadata. Detected objects are stored.

Order | Cam_id | Fr_no | Obj_id | x | y | w | h | Color1 | Color2 | Color3
0 | 1 | 1 | 0 | 264 | 347 | 137 | 412 | 1 | 4 | 2
1 | 1 | 1 | 1 | 1485 | 446 | 357 | 178 | 1 | 4 | 2
2 | 1 | 1 | 2 | 352 | 476 | 357 | 178 | 1 | 2 | 4
3 | 1 | 2 | 0 | 275 | 344 | 129 | 416 | 1 | 2 | 4
4 | 1 | 2 | 1 | 1484 | 447 | 358 | 177 | 1 | 2 | 3
5 | 1 | 2 | 2 | 360 | 475 | 238 | 150 | 1 | 4 | 9
6 | 1 | 3 | 0 | 284 | 343 | 122 | 414 | 1 | 2 | 4
7 | 1 | 3 | 1 | 1484 | 446 | 356 | 178 | 1 | 2 | 3
8 | 1 | 3 | 2 | 381 | 470 | 197 | 163 | 4 | 1 | 7
9 | 1 | 4 | 0 | 295 | 341 | 123 | 414 | 1 | 4 | 2
10 | 1 | 4 | 1 | 1487 | 445 | 355 | 180 | 1 | 9 | 7
11 | 1 | 4 | 2 | 1310 | 361 | 277 | 174 | 4 | 2 | 3

Table 2. Configuration of object trajectory database structure for simple and fast query-based data retrieval.

Obj_id | Cam_id | Start_fr | End_fr | w | h | Color1 | Color2 | Color3
0 | 0 | 96 | 229 | 130 | 171 | 5 | 9 | 4
1 | 0 | 53 | 112 | 79 | 145 | 5 | 8 | 4
2 | 0 | 265 | 394 | 81 | 118 | 5 | 8 | 7

Table 3. Example of the proposed abnormal database structure.

Obj_id | Cam_id | Fr_no | x | y | w | h | Pushing | Falling | Crossing Wall
1 | 0 | 321 | 1416 | 261 | 79 | 145 | 1 | 0 | 0
1 | 0 | 322 | 1392 | 270 | 78 | 144 | 1 | 0 | 0
1 | 0 | 323 | 1354 | 294 | 79 | 144 | 1 | 0 | 0
1 | 0 | 324 | 1362 | 294 | 78 | 144 | 1 | 0 | 0
1 | 0 | 325 | 1358 | 296 | 78 | 144 | 1 | 0 | 0
1 | 0 | 326 | 1354 | 296 | 78 | 144 | 1 | 0 | 0
1 | 0 | 327 | 1342 | 310 | 78 | 144 | 1 | 0 | 0
1 | 0 | 328 | 1338 | 298 | 78 | 144 | 1 | 0 | 0
1 | 0 | 329 | 1330 | 306 | 78 | 144 | 1 | 0 | 0
1 | 0 | 330 | 1270 | 332 | 78 | 144 | 1 | 0 | 0
1 | 0 | 331 | 1274 | 332 | 78 | 144 | 1 | 0 | 0

5. Experimental Results

This section presents the qualitative and quantitative experimental results for evaluating the performance of the proposed method. The experiments encompass color-based retrieval and behavior recognition on both public and handcrafted datasets. However, object detection and tracking are not covered here, as we employ existing methods as modules in our system.

The first set of experiments focuses on color extraction and retrieval verification using the pedestrian color-naming dataset [29]. In this experiment, test images are fed into our model, and the model outputs the predicted color. The rank 1 color prediction is considered as the representative color of the person. When we include predictions up to rank 3, the accuracy of color extraction increases. However, for better intuitive understanding, we generate a confusion matrix using only the rank 1 experiment. Fig. 6 illustrates the confusion matrix of representative color extraction based on the results from the pedestrian color-naming database.

The cell colors in the confusion matrix are proportional to the cell scores, with higher scores depicted in darker blue and lower scores shown in white. As evident in Fig. 6, similar colors such as white and gray, as well as the triple of orange, brown, and red, exhibit lower extraction accuracy. This outcome arises because the representative colors occupy adjacent regions of a continuous color space. To improve the accuracy of predicted colors, we consider representative colors up to rank 3.

The second experiment involves analyzing images captured during the Frankfurt Marathon using a smartphone. The main objective is to verify the practical applicability of our method under real-world conditions. Our Frankfurt Marathon dataset comprises 332 images with mean heights and widths of 374 and 149 pixels, respectively. Fig. 7 displays the validation images used for color extraction-based retrieval, as well as the 8 selected points required for the third experiment, which assesses color extraction performance under varying illumination conditions.

As depicted in Fig. 7, the validation images encompass both the Frankfurt Marathon DB and internet-collected images. In the Frankfurt data, we focus solely on the 8 specific colors, as people wearing other colors are not present during the marathon.

The accuracy of the color-based retrieval in Table 4 is calculated using the following equation:

(7)
$ Accuracy=\frac{Results}{GT}\times 100, $

where $Results$ represents the number of color prediction results, and $GT$ indicates the number of ground truths.

Table 4 provides the measurements for true positive, false positive, and accuracy, with ranks determined based on the first row. According to the table, the proposed color-based method achieves an accuracy of 86.23%.

This experiment demonstrates the robustness of color extraction, and of retrieval based on the extracted color, under varying illumination conditions. As observed in Table 4, the accuracy for most colors gradually increases with rank, except for orange. The lower accuracy for orange, pink, and red can be attributed to overlapping color distributions and the unequal region sizes of each color in the color space.

The third set of experiments investigates the impact of illumination changes on color extraction. We specifically selected eight points from the same color region, such as clothes, to evaluate their color extraction under varying illumination conditions. The test images containing these selected points are displayed in Fig. 7.

Fig. 8 presents one of the results related to the orange color. In the figure, each column represents the value and predicted color of a point. Despite allowing for up to rank 3 predictions, the model confidently assigns the first rank result with 100% confidence for each point. This justification supports our decision to consider up to rank 3 results. In our approach, each pixel is attributed to one of the representative colors. However, the region formed by the accumulation of pixels may consist of one color or more than one color due to illumination variations. To account for this, we retrained the PLSA-based model, aiming to control the effects of illumination changes, and thus, adopted the predicted color up to rank 3. This experiment validates the reasonableness of our system design. For further detailed experimental results, please refer to the Appendix, where we have expanded on the figures and conducted an in-depth analysis of the outcomes.

As stated in Section 3, we conducted experiments for abnormal behavior recognition using our database, which consists of three classes of abnormal behavior. These experiments were implemented in TensorFlow version 3.5 and tested on a system equipped with an Intel i7 CPU, 16 GB of RAM, and an NVIDIA GeForce GTX 1080 Ti. For the testing phase, we generated 297 test video clips by cropping specific segments from the test videos.

Fig. 9 shows the prediction result and its confidence in the recognition of abnormal behavior. The class with the highest confidence becomes the final behavior class. As shown in Fig. 9, the proposed method can classify the anomaly scenes.

Table 5 presents the accuracy of the proposed method. Accuracy is estimated using (7). In Table 5, the first row and column present the prediction results and video categories, respectively. Table 5 shows that the proposed method achieves 88.55%.

Fig. 6. Confusion matrix of color prediction on the pedestrian color-naming dataset as a public database.
../../Resources/ieie/IEIESPC.2024.13.6.541/fig6.png
Fig. 7. Configuration and information of validation images for representative color extraction.
../../Resources/ieie/IEIESPC.2024.13.6.541/fig7.png
Fig. 8. Impact of illumination changes on predicted color extraction results.
../../Resources/ieie/IEIESPC.2024.13.6.541/fig8.png
Fig. 9. The results of abnormal behavior prediction; pushing(fight), normal, and falling.
../../Resources/ieie/IEIESPC.2024.13.6.541/fig9.png
Table 4. Accuracy of Representative Color Extraction on the Frankfurt Marathon Dataset. This experiment includes 8 representative colors as the other 3 representative colors were not acquired in the Frankfurt marathon.

Color | Black | Blue | Gray | Green | Orange | Pink | Purple | Red | Total
Rank1(%) | 93.1 | 59.7 | 37.5 | 68.7 | 15.0 | 20.6 | 50.0 | 50.7 | 86.23
Rank2(%) | 97.5 | 86.6 | 45.8 | 90.6 | 30.0 | 62.0 | 100 | 74.6 | -
Rank3(%) | 97.5 | 96.1 | 91.6 | 92.1 | 30.0 | 75.8 | 100 | 83.5 | -

Table 5. Confusion matrix and overall accuracy of the proposed abnormal behavior recognition on the test video clips.

Results | Pushing | Falling | Crossing Wall | Normal | Count
Pushing | 120 | 3 | 1 | 12 | 136
Falling | 0 | 71 | 0 | 9 | 80
Crossing Wall | 0 | 0 | 72 | 8 | 80
Accuracy | 88.85%

6. Conclusion

This paper introduced an intelligent surveillance system incorporating anomaly recognition and metadata-based retrieval, aimed at efficiently and accurately retrieving specific objects. A key highlight of the proposed system is its modular design, which improves portability and adaptability. The color and aspect ratio extraction method showcased in this work demonstrates its versatility in both detection and tracking tasks. Furthermore, the standalone abnormal recognition module provides additional flexibility.

Moving forward, future research could focus on exploring instance segmentation as a potential replacement for object detection, thereby improving the accuracy of metadata extraction. Additionally, a comprehensive survey of tracking methods could be undertaken to consolidate the tracking and recognition networks, leading to enhanced system performance. By further advancing these areas, the proposed intelligent surveillance system can be strengthened, offering increased precision and reliability for object retrieval applications.

Overall, this paper has laid the foundation for an intelligent surveillance system that combines anomaly recognition, metadata-based retrieval, and modular design. The suggested future directions open exciting opportunities for further advancement, contributing to the development of even more robust and effective surveillance systems in the field.

ACKNOWLEDGMENTS

This work was supported by Institute for Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT). [No. 2014-0-00077, Development of global multi-target tracking and event prediction techniques based on real-time large-scale video analysis and 2021-0-01341, Artificial Intelligence Graduate School Program(Chung- Ang University)]

REFERENCES

1 
S. Ren, et al., ``Faster r-cnn: Towards real-time object detection with region proposal networks,'' Advances in neural information processing systems 2015.DOI
2 
J. Redmon, et al., ``You only look once: Unified, real-time object detection,'' In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779-788, 2016.DOI
3 
P. Voigtlaender, et al., ``MOTS: Multi-object tracking and segmentation,'' In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, pp. 7934-7943, 2019.URL
4 
F. Fahime, et al., ``Probabilistic Kalman filter for moving object tracking,'' Signal Processing: Image Communication, 82, 115751, 2020.DOI
5 
W. Liu, et al., ``SSD: Single shot multibox detector,'' In Proceedings of Computer Vision - ECCV 2016, Part I, Springer, pp. 21-37, 2016.DOI
6 
M. Tan, et al., ``EfficientDet: Scalable and efficient object detection,'' In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10781-10790, 2020.DOI
7 
C. Li. et al., ``Yolov6 v3. 0: A full-scale reloading,'' arXiv preprint arXiv:2301.05586 2023.DOI
8 
C. Y. Wang, et al., ``YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors,'' In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7464-7475, 2023.DOI
9 
S. Xu. et al. ``PP-YOLOE: An evolved version of YOLO,'' arXiv preprint arXiv:2203.16250 2022.DOI
10 
N. Wojke. et al. ``Simple online and realtime tracking with a deep association metric,'' In Proceedings of the 2017 IEEE international conference on image processing (ICIP). IEEE, pp. 3645-3649, 2017.DOI
11 
X. Zhou. et al. ``Objects as points,'' arXiv preprint arXiv:1904.07850 2019.DOI
12 
Y.H. Wang. et al. ``SMILEtrack: SiMIlarity LEarning for Multiple Object Tracking,'' arXiv preprint arXiv:2211.08824 2022.DOI
13 
M. Sakurada, et al., ``Anomaly detection using autoencoders with nonlinear dimensionality reduction,'' In Proceedings of the MLSDA 2014 2nd Workshop on Machine Learning for Sensory Data Analysis, pp. 4-11, 2014.DOI
14 
V. Mahadevan, et al., ``Anomaly detection in crowded scenes,'' In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, IEEE, pp. 1975-1981, 2010.DOI
15 
A.B. Chan. et al. ``Modeling, clustering, and segmenting video with mixtures of dynamic textures,'' IEEE transactions on pattern analysis and machine intelligence, 30, pp. 909-926, 2008.DOI
16 
J. Sivic, et al., ``Video Google: Efficient visual search of videos,'' Toward Category-Level Object Recognition, pp. 127-144, 2006.DOI
17 
I. Laptev. et al. ``On space-time interest points,'' International journal of computer vision, 64, pp. 107-123, 2005.DOI
18 
J. Carreira, et al., ``Quo vadis, action recognition? A new model and the Kinetics dataset,'' In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299-6308, 2017.DOI
19 
P. Jiang. et al. ``A Review of Yolo algorithm developments,'' Procedia Computer Science, 199, pp. 1066-1073, 2022.DOI
20 
J. Redmon, et al., ``YOLOv3: An incremental improvement,'' arXiv preprint arXiv:1804.02767, 2018.DOI
21 
G. Jocher. et al. ``Ultralytics/yolov5: v3. 0,'' 2020.DOI
22 
C.Y. Wang. et al. ``YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors,'' arXiv preprint arXiv:2207.02696 2022.DOI
23 
H. Possegger, et al., ``In defense of color-based model-free tracking,'' In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2113-2120, 2015.DOI
24 
J. Weijer. et al. ``Learning color names for real-world applications,'' IEEE Transactions on Image Processing, 18, pp. 1512-1523, 2009.DOI
25 
K. Simonyan, et al., ``Very deep convolutional networks for large-scale image recognition,'' arXiv preprint arXiv:1409.1556, 2014.DOI
26 
O. Russakovsky. et al. ``Imagenet large scale visual recognition challenge,'' International journal of computer vision, 115, pp. 211-252, 2015.DOI
27 
D. Tran, et al., ``Learning spatiotemporal features with 3D convolutional networks,'' In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497, 2015.DOI
28 
A. Karpathy, et al., ``Large-scale video classification with convolutional neural networks,'' In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1725-1732, 2014.DOI
29 
Z. Cheng. et al. ``Pedestrian color naming via convolutional neural network,'' In Proceedings of the Computer Vision-ACCV 2016: 13th Asian Conference on Computer Vision, Taipei, Taiwan, November 20-24, Revised Selected Papers, Part II 13. Springer, pp. 35-51, 2017.DOI
Hyungtae Kim
../../Resources/ieie/IEIESPC.2024.13.6.541/au1.png

Hyungtae Kim received his B.S. degree in the Department of Electrical Engineering from the University of Suwon, Korea, in 2012. He then received his M.S. degree in image processing at Chung-Ang University, Korea, in 2015. He is currently pursuing a Ph.D. degree in image processing at Chung-Ang University, Korea. He was involved in various projects, including the development of a high-performance visual big data discovery platform for large-scale real-time data analysis (DeepView) funded by MS&ICT and research on a video recording system for vehicles funded by Hyundai NGV. His research interests include applied optics, computer vision, functional safety, and SOTIF in SDV.

Joongchol Shin
../../Resources/ieie/IEIESPC.2024.13.6.541/au2.png

Joongchol Shin received the B.S. degree in information and communication engineering from Gyeong-Sang National University, Jinju, South Korea, in 2017 and the M.S. degree in digital imaging engineering from Chung-Ang University, Seoul, South Korea, where he is currently working toward the Ph.D. degree in digital imaging engineering. His research interests include digital image restoration, object recognition, and video analysis.

Seokmok Park
../../Resources/ieie/IEIESPC.2024.13.6.541/au3.png

Seokmok Park was born in Pusan, Korea, in 1991. He received a BSc in Mechatronics Engineering from Kyungsung University, Korea, in 2016. He received an MSc in Image Engineering at Chung-Ang University, Korea, in 2018. Currently, he is pursuing a PhD in Image Engineering at Chung-Ang University. His research interests include intelligent vehicle systems, including camera calibration, object detection and recognition, and optical character recognition.

Joonki Paik
../../Resources/ieie/IEIESPC.2024.13.6.541/au4.png

Joonki Paik was born in Seoul, South Korea, in 1960. Dr. Paik obtained his B.Sc. degree in control and instrumentation engineering from Seoul National University, Seoul, South Korea, in 1984. He then pursued his M.Sc. and Ph.D. degrees in electrical engineering and computer science at Northwestern University, Evanston, IL, USA, graduating in 1987 and 1990, respectively. From 1990 to 1993, Dr. Paik worked at Samsung Electronics, where he played a vital role in designing image stabilization chip sets for consumer camcorders. Since 1993, he has held a professorship at Chung-Ang University. Additionally, he served as a Visiting Professor at the Department of Electrical and Computer Engineering, University of Tennessee, Knoxville, TN, USA, from 1999 to 2002. Dr. Paik has held various leadership positions, including Full-time Technical Consultant for the System LSI Division of Samsung Electronics in 2008, Vice President of the IEEE Consumer Electronics Society from 2016 to 2018, President of the Institute of Electronics and Information Engineers (IEIE) in 2018, Executive Vice President of Research at Chung-Ang University in 2020, and Provost of Chung-Ang University from 2020 until the present. Furthermore, he has been the Dean of the Artificial Intelligence Graduate School at Chung-Ang University since 2021 and the Director of the Military AI Education Program supported by Ministry of Defense of Korea, at Chung-Ang University since 2022.