1. Introduction
Video surveillance systems play a crucial role in preventing crimes, ensuring public
safety, and capturing critical incidents. However, traditional surveillance systems
heavily rely on human operators to monitor video streams, which can be prone to errors,
inefficiencies, and limitations in real-time analysis. To overcome these challenges,
intelligent video surveillance systems have emerged as a promising solution, leveraging
advanced technologies such as object detection, tracking, abnormal behavior recognition,
and metadata extraction and retrieval.
Intelligent video surveillance systems have witnessed remarkable advancements in object
detection and tracking techniques, leading to automation and enhanced monitoring capabilities.
State-of-the-art technologies have significantly improved the accuracy and efficiency
of surveillance systems. Detection frameworks such as Faster R-CNN [1] and YOLO [2] have gained considerable attention for their ability to detect objects with high precision in real time. Motion-based approaches, including optical flow [3] and Kalman filter-based methods [4], have been widely utilized to track objects in dynamic scenes with robustness and accuracy. Additionally, other notable techniques in object detection and tracking
are summarized as follows. Liu et al. proposed a real-time object detection method,
called Single Shot MultiBox Detector (SSD), that achieves high accuracy by utilizing
multiple layers with different resolutions [5]. Tan et al. proposed an efficient and scalable object detection framework called
EfficientDet that achieves impressive performance by optimizing the trade-off between
accuracy and computational efficiency [6]. A series of YOLO-based methods have also been proposed; these methods rank at the top of the state of the art in real-time object detection [7-9].
Wojke et al. proposed a deep learning-based object tracking algorithm called Deep-SORT
that combines appearance information with motion cues for robust and accurate tracking
[10]. Zhou et al. proposed a tracking-by-detection approach called CenterTrack that incorporates
object center estimation to improve tracking performance in crowded scenes [11]. Xu et al. proposed an anchor-free paradigm based on PP-YOLOE [9], and a similar anchor-free mechanism was introduced in [12]; both achieve state-of-the-art performance in multiple object tracking.
Abnormal behavior recognition plays a vital role in intelligent video surveillance
systems, enabling the identification and classification of unusual activities or events
that deviate from normal patterns. Extensive research has been conducted in this field,
utilizing machine learning algorithms, deep neural networks, and statistical models.
Several notable works include the use of convolutional neural networks (CNNs) for
abnormal event detection [13], anomaly detection based on spatiotemporal features [14], and the modeling of behavior using hidden Markov models (HMMs) [15].
Metadata extraction and retrieval techniques complement the intelligent video surveillance
system by facilitating efficient data management and rapid scene retrieval. Metadata
encompasses essential information such as object attributes, timestamps, and spatial
coordinates, providing valuable context for analyzing and searching video content.
Various approaches have been proposed for metadata extraction, including object-based
methods [16], feature extraction algorithms [17], and deep learning-based techniques [18]. In terms of retrieval, SQL-based querying has proven effective in enabling fast
and accurate retrieval of relevant scenes based on specific criteria.
The proposed intelligent video surveillance system integrates object detection, tracking,
abnormal behavior recognition, metadata extraction, and retrieval modules as shown
in Fig. 1. These components collectively form a comprehensive framework that enhances surveillance
capabilities and streamlines the analysis process. The system architecture encompasses
a network of edge cameras for real-time monitoring and an analysis server for centralized
processing and storage. This scalable and adaptable architecture ensures seamless
integration with existing surveillance infrastructure, maximizing the system’s efficiency
and effectiveness.
This paper is organized as follows. In Section 2, we introduce the proposed system
and provide an overview of how representative basic metadata is generated. Section
3 focuses on the generation of event metadata through the recognition of abnormal
behaviors. We explore the techniques and algorithms employed to detect and classify
these abnormal behaviors, enabling the extraction of specific event metadata. In Section
4, we delve into the design of the metadata structure and discuss the methodologies
used for efficient video retrieval. In Section 5, we present the experimental results
obtained through the implementation of our proposed system. Finally, in Section 6,
we draw conclusions based on the findings presented throughout the paper.
Fig. 1. Overview of the proposed system. The proposed method is divided into three modules: i) edge cameras with detectors; ii) analysis servers, including the multi-object tracker and abnormal behavior recognition; iii) monitoring for video retrieval based on a query.
2. The Proposed System Architecture
In this section, we introduce the general procedures of the proposed surveillance system, including object detection and tracking, with the aim of generating representative metadata for video retrieval. As shown in Fig. 1, the proposed system consists of three parts: i) edge cameras, ii) analysis servers, and iii) monitoring. Each part plays a specific role as follows. Edge cameras are in charge of object detection when objects appear and generate low metadata, which serves as a seed for tracking and simple queries. In the analysis servers, the tracker groups the detection results and extracts the trajectory and representative metadata, i.e., common information about the object such as clothing color and size. The anomaly recognition module detects abnormal scenes and classifies the type of abnormal behavior; its results are separated into normal and event metadata. In general, the generated metadata are stored in the database. When a query is fed into the analysis servers, the corresponding video retrieved from the database is returned to the monitoring system.
2.1 Object Detection and Tracking
As aforementioned, there are diverse object detection and tracking methods. Since our framework focuses on video retrieval and must be applicable to both edge cameras and analysis servers, we consider three conditions when designing the proposed system: i) the detector should detect not only normal-sized objects but also small objects with the lowest possible false-positive rate; ii) the edge camera should run in real time without hardware or software constraints; and iii) the tracker should temporally associate detected objects and extract the metadata at the same time.
In addition, practical surveillance cameras come in many varieties, differing in aspects such as field of view (FoV), resolution, computational power, frame rate, pan, tilt, and zoom. These factors affect the performance of the detector. Furthermore, various scene events occur in real surveillance situations due to occlusions among objects, object speed, illumination changes, and so on. Hence, the detector is selected to satisfy the considerations above across such cameras.
The most widely used object detection methods can be categorized by their base algorithm as R-CNN-based, YOLO-based, and SSD-based. R-CNN-based algorithms use a two-stage framework and are slower than the others. SSD-based methods are faster than R-CNN-based detection, but misdetections occur when small objects exist in the scene. YOLO detection is also a one-stage method similar to SSD, which is an advantage for the surveillance scenarios handled in our cases. Furthermore, YOLO-based detection models have ranked at the top of recent state-of-the-art detectors. Consequently, we adopt a YOLO-based detection method as our detector. The specific YOLO version is determined by evaluating its performance on the edge camera and in practical monitoring scenarios.
In the past, off-the-shelf cameras often had limited computational ability, leading
to a preference for tiny versions of object detection schemes. However, using lightweight
detectors to reduce computational power came at the cost of less accurate detection
[19]. This could result in false-positive detections and missed
objects in edge cameras. To address these issues, we perform complementary detection
on the analysis server using two approaches.
The first complement of detection takes place during the tracking process. When updating
the tracker, we extract candidates for tracking in proximity to the previous object
location. This allows us to detect objects that may have become smaller or larger than before, thus recovering objects that would otherwise go undetected.
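A minimal sketch of this proximity-based candidate extraction is shown below; the box format and the enlarged search margin are illustrative assumptions rather than the exact rule used by our tracker.

```python
# Hedged sketch: gather detection candidates near the previous object location.
# The margin factor and the (x, y, w, h) box format are illustrative assumptions.
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_left, y_top, width, height)

def candidates_near(prev: Box, detections: List[Box], margin: float = 1.5) -> List[Box]:
    """Return detections whose centers fall inside a search window around the
    previous box, enlarged by `margin` to tolerate scale changes."""
    px, py, pw, ph = prev
    cx, cy = px + pw / 2.0, py + ph / 2.0            # previous object center
    half_w, half_h = margin * pw / 2.0, margin * ph / 2.0
    near = []
    for (x, y, w, h) in detections:
        dx, dy = x + w / 2.0, y + h / 2.0            # candidate center
        if abs(dx - cx) <= half_w and abs(dy - cy) <= half_h:
            near.append((x, y, w, h))
    return near
```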
The second complement to detection occurs in anomaly recognition. Our anomaly
recognition module classifies scenes into four classes: normal, falling, pushing,
and crossing a wall. The "normal" class includes complex scenes, such as crowded areas
with occlusions that degrade detection accuracy and may cause undetected objects in
edge cameras. Once a scene is classified as normal, the anomaly detection module produces
object metadata.
To enhance edge camera detection capabilities, we leverage various YOLO versions,
including 3, 5, 7, and their modified sub-modules [20-22]. These versions have proven to be effective in addressing the challenges posed by
edge cameras. However, we exclude the PP-YOLOE detection mechanism because PP-YOLOE relies on specific software configurations such as PaddlePaddle that have not been verified on the embedded device [9].
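As an illustration, a YOLO-family detector can be run on an edge frame as sketched below; the torch.hub entry point, the placeholder image path, and the confidence threshold are one possible configuration, not the exact edge-camera implementation.

```python
# Hedged sketch: run a YOLOv5 model on one frame and keep person detections.
# The torch.hub entry point, file name, and 0.4 threshold are assumptions.
import cv2
import torch

model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)

frame = cv2.imread("frame_000123.jpg")            # BGR image from the edge camera
results = model(frame[..., ::-1])                 # the model expects RGB input
det = results.xyxy[0]                             # tensor rows: [x1, y1, x2, y2, conf, cls]

boxes = []
for x1, y1, x2, y2, conf, cls in det.tolist():
    if conf >= 0.4 and int(cls) == 0:             # class 0 is "person" in COCO
        boxes.append((x1, y1, x2 - x1, y2 - y1))  # convert to (x_left, y_top, w, h)
```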
The detected object, denoted as $O$, is represented as $O=\{O_{id}, x_{l}, y_{t}, w, h\}$, where $O_{id}$, $(x_{l}, y_{t})$, $w$, and $h$ are the object id, the top-left coordinates, the width, and the height of the bounding box, respectively. The edge camera encodes the detection result as the tuple $\{C_{id}, t, fn, O\}$, where $C_{id}$, $t$, and $fn$ represent the camera id, time, and frame number, respectively.
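The low metadata produced at the edge can be serialized, for example, as one compact record per detection; the field names below mirror the symbols above, while the JSON serialization and the sample values are illustrative assumptions.

```python
# Hedged sketch of the low metadata record sent from an edge camera to the
# analysis server. Field names follow the symbols above; the JSON encoding
# and the sample values are illustrative assumptions.
import json
import time
from dataclasses import dataclass, asdict

@dataclass
class LowMetadata:
    cam_id: int      # C_id : camera identifier
    t: float         # t    : capture time (seconds since epoch)
    fn: int          # fn   : frame number
    obj_id: int      # O_id : per-camera object identifier
    x_l: int         # top-left x of the bounding box
    y_t: int         # top-left y of the bounding box
    w: int           # bounding-box width
    h: int           # bounding-box height

record = LowMetadata(cam_id=1, t=time.time(), fn=123, obj_id=0,
                     x_l=264, y_t=347, w=137, h=412)
payload = json.dumps(asdict(record))   # transmitted to the analysis server with the stream
```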
These encoded data are called low metadata; after encoding, they are fed into the analysis servers together with the video stream. The collected low metadata contain redundant information because the camera captures images at 24 to 30 fps. Thus, applying a proper object tracker enhances both the representativeness of the metadata and the efficiency of retrieval. In addition, the computational cost of anomaly detection has to be considered.
As mentioned at the beginning of this section, the object tracking module in the analysis servers associates the low metadata temporally and extracts the fine-grained metadata at the same time. Meanwhile, the computational cost in the analysis server is decoupled across tracking, metadata generation, and anomaly detection. In our analysis server architecture, object tracking and metadata generation are incorporated into one module, and the independent anomaly detection module supports object detection in challenging scenarios. Hence, we select the scale-adaptive version of the distractor-aware tracker (DAT) as our tracker [23]. The DAT method uses the color histograms of the object, $O$, and its surrounding region, $S$. The scale-adaptive ability of DAT suits our system because most cameras undergo viewpoint changes that cause scale variations of objects, and the proposed metadata includes the aspect ratio, which is related to scale.
We define the basic tracking procedure for applying the DAT tracker. The procedure is organized into i) birth of the tracker, ii) update of the tracker, and iii) death of the tracker. When low metadata is fed into the tracking module, the initialization of the tracker is called tracker birth. The tracker birth, $O_{t}$, is defined by (1), where $s_{v}(\cdot)$ is the voting score, $s_{d}(\cdot)$ is the distance score from the object center $c$, and $bx$ is the bin assigned to the color components of the detected object.
In the tracking process, the tracker is updated using the same voting and distance scores, where $O_{t,i}$ are the detected objects considered as tracking candidates and $c_{t-1}$ is the center of the previously tracked object.
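The sketch below illustrates the kind of color voting and distance weighting used by DAT-style trackers [23]; the object-versus-surrounding likelihood and the Gaussian distance weight follow the published DAT formulation and are an approximation of the scores $s_{v}(\cdot)$ and $s_{d}(\cdot)$ used here, not our exact implementation. The bin count and sigma are illustrative assumptions.

```python
# Hedged sketch of DAT-style scoring: a color-bin voting score from the
# object/surrounding histograms and a Gaussian distance score from the
# previous center. Bin count and sigma are illustrative assumptions.
import numpy as np

NUM_BINS = 16  # quantization bins per RGB channel (assumption)

def color_bins(patch_rgb: np.ndarray) -> np.ndarray:
    """Map each pixel of a uint8 RGB patch to a single histogram bin index."""
    q = (patch_rgb // (256 // NUM_BINS)).astype(np.int64)
    return q[..., 0] * NUM_BINS * NUM_BINS + q[..., 1] * NUM_BINS + q[..., 2]

def voting_score(bins_obj: np.ndarray, bins_surr: np.ndarray, bx: np.ndarray) -> np.ndarray:
    """s_v: likelihood that a pixel with bin bx belongs to the object,
    H_O(bx) / (H_O(bx) + H_S(bx)), as in the object-surrounding model of DAT."""
    n = NUM_BINS ** 3
    h_obj = np.bincount(bins_obj.ravel(), minlength=n).astype(float)
    h_surr = np.bincount(bins_surr.ravel(), minlength=n).astype(float)
    denom = h_obj + h_surr
    lik = np.divide(h_obj, denom, out=np.full(n, 0.5), where=denom > 0)
    return lik[bx]

def distance_score(center: np.ndarray, prev_center: np.ndarray, sigma: float = 30.0) -> float:
    """s_d: Gaussian weight on the distance from the previous object center c."""
    return float(np.exp(-np.sum((center - prev_center) ** 2) / (2.0 * sigma ** 2)))
```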
The death of the tracker occurs when the tracked object disappears. At tracker death, the low metadata are refined and the representative metadata are generated. The trajectory of the object, $T_{O}$, is estimated from the sequence of objects detected by the tracker and is defined as $T_{O}=\{(x_{s},y_{s}), (x_{1/3},y_{1/3}), (x_{2/3},y_{2/3}), (x_{e},y_{e})\}$, where $x$ and $y$ are the horizontal and vertical coordinates, and the subscripts $s$, $1/3$, $2/3$, and $e$ denote the start, one-third, two-thirds, and end points of the trajectory, respectively. The trajectory becomes one of the representative metadata for retrieval, and the bounding box information is stored separately. The precise procedures are described in the following subsection.
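A minimal sketch of how the four trajectory key points can be taken from the tracked centers is shown below; it assumes the per-frame object centers have already been collected during tracking.

```python
# Hedged sketch: build the trajectory metadata T_O from the sequence of
# tracked object centers at tracker death.
from typing import List, Tuple

def trajectory_keypoints(centers: List[Tuple[float, float]]) -> List[Tuple[float, float]]:
    """Return the start, 1/3, 2/3, and end points of the tracked trajectory."""
    if not centers:
        return []
    n = len(centers)
    idx = [0, n // 3, (2 * n) // 3, n - 1]        # start, 1/3, 2/3, end indices
    return [centers[i] for i in idx]

# Example: centers accumulated by the tracker for one object.
centers = [(100.0 + i, 200.0 + 0.5 * i) for i in range(90)]
T_O = trajectory_keypoints(centers)               # [(x_s, y_s), ..., (x_e, y_e)]
```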
2.2 Color and Aspect Ratio Metadata
In this subsection, the methods for generating the representative color and aspect ratio metadata are explained. The color and aspect ratio of an object are important keys for specifying the object among the objects in the recorded stream. In addition, these object attributes rarely change in videos recorded by standard cameras; hence, the color and aspect ratio are adopted as metadata in the proposed system. To extract the representative color, we adopt a generative model based on probabilistic latent semantic analysis (PLSA) [24]. The PLSA-based model is trained on the Google Images dataset to generate the reference color distributions. The object color distribution is accumulated during the update process of the tracker. As a result, we can easily extract the representative color by comparing the object distribution with the reference distributions. The PLSA-based color extraction is robust to changing illumination and viewpoint because the reference distributions are trained over a brightness range from low to high. The number of proposed representative colors is 11: black, blue, brown, gray, green, orange, pink, purple, red, white, and yellow.
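The reference distributions themselves come from the trained PLSA-based model [24]; the sketch below only shows how an accumulated object distribution can be compared against them to pick the top-3 representative colors, with histogram intersection used as an illustrative similarity measure.

```python
# Hedged sketch: choose the top-3 representative colors by comparing the
# object's accumulated color distribution with the per-color reference
# distributions produced by the trained PLSA-based model. Histogram
# intersection is an illustrative similarity measure, not the exact criterion.
import numpy as np

COLORS = ["black", "blue", "brown", "gray", "green", "orange",
          "pink", "purple", "red", "white", "yellow"]

def accumulate(per_frame_dists: list) -> np.ndarray:
    """Accumulate and normalize the per-frame color distributions of one track."""
    acc = np.sum(per_frame_dists, axis=0)
    return acc / acc.sum()

def top3_colors(obj_dist: np.ndarray, ref_dists: np.ndarray) -> list:
    """obj_dist: (n_bins,) accumulated object distribution.
    ref_dists: (11, n_bins) reference distribution per representative color."""
    scores = np.minimum(ref_dists, obj_dist[None, :]).sum(axis=1)   # intersection
    order = np.argsort(scores)[::-1][:3]                             # ranks 1..3
    return [COLORS[i] for i in order]
```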
The aspect ratio is estimated by normalizing the accumulated width and height during the tracking process. The aspect ratio, $R_{O}$, is denoted as $R_{O} = H_{O}/W_{O} = \left(\sum_{i} H_{i}\right)/\left(\sum_{i} W_{i}\right)$, where $H_{O}$ and $W_{O}$ are the normalized height and width of the object, and $H_{i}$ and $W_{i}$ are the height and width of the $i$-th bounding box contained in the trajectory. The reason for using the aspect ratio as metadata is that it is less affected than the object size by changes in camera resolution, without preprocessing such as calibration. Furthermore, (6) implies a normalized aspect ratio: the number of tracked frames cancels in both the numerator and the denominator.
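The following small sketch computes this normalized aspect ratio from the boxes collected along a trajectory; the sample width/height values are illustrative.

```python
# Hedged sketch of the normalized aspect ratio in (6): the per-frame count
# cancels between numerator and denominator.
from typing import List, Tuple

def aspect_ratio(boxes: List[Tuple[float, float]]) -> float:
    """boxes: list of (width, height) pairs collected along the trajectory."""
    total_w = sum(w for w, _ in boxes)
    total_h = sum(h for _, h in boxes)
    return total_h / total_w            # R_O = sum(H_i) / sum(W_i)

R_O = aspect_ratio([(137, 412), (129, 416), (122, 414)])  # sample widths/heights
```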
The trajectory, representative color, and aspect ratio constitute the basic metadata and are used for query-based retrieval. For instance, each piece of metadata can generate a single query, and a combined metadata query is also available. Single and combined query-based retrieval is generally used to search for normal scenes. Since object detection and tracking are designed under the assumption that the appearance and behavior of objects follow common patterns, this metadata-based retrieval has difficulty specifying abnormal scenes. Furthermore, normal scenes vastly outnumber abnormal ones, so searching for abnormal scenes with metadata queries alone is inefficient. To solve this problem, we propose anomaly recognition to produce event metadata.
3. Event Metadata Generation by Abnormal Behavior Recognition
We propose a two-stream convolutional neural network for abnormal behavior recognition. The structure of the proposed network is shown in Fig. 2. The network fuses 2D-CNN [25,26] and 3D-CNN [27,28] architectures. In the 2D-CNN stream, the input video is converted into a single image by frame morphing, as shown in Fig. 3. The spatial and temporal feature vectors are concatenated and fused by a fully connected layer.
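To make the fusion concrete, the sketch below outlines a two-stream model in tf.keras; the input shapes, layer widths, and the frame-morphing preprocessing are illustrative assumptions rather than the exact configuration of the proposed network.

```python
# Hedged sketch of a two-stream 2D/3D CNN with fully connected fusion.
# Shapes and layer sizes are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers, Model

NUM_CLASSES = 4  # pushing, falling, crossing wall, normal

# Spatial stream: a 2D CNN applied to the frame-morphed image.
img_in = layers.Input(shape=(224, 224, 3), name="morphed_image")
x = layers.Conv2D(32, 3, activation="relu")(img_in)
x = layers.MaxPooling2D()(x)
x = layers.Conv2D(64, 3, activation="relu")(x)
x = layers.GlobalAveragePooling2D()(x)

# Temporal stream: a 3D CNN applied to a short clip of frames.
clip_in = layers.Input(shape=(16, 112, 112, 3), name="frame_clip")
y = layers.Conv3D(32, 3, activation="relu")(clip_in)
y = layers.MaxPooling3D()(y)
y = layers.Conv3D(64, 3, activation="relu")(y)
y = layers.GlobalAveragePooling3D()(y)

# Fully connected fusion of the concatenated spatial and temporal features.
z = layers.Concatenate()([x, y])
z = layers.Dense(128, activation="relu")(z)
out = layers.Dense(NUM_CLASSES, activation="softmax")(z)

model = Model(inputs=[img_in, clip_in], outputs=out)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```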
To train the proposed network, we collected a large-scale behavior dataset consisting of 162 training and 12 test videos captured by surveillance cameras. As shown in Fig. 4, the proposed anomaly recognition aims to classify four categories: pushing (assault), falling, crossing a wall, and normal.
The proposed anomaly recognition approach differs from existing common abnormal recognition
methods, particularly concerning the number of abnormal cases it addresses. There
are numerous classes of abnormal behavior, including assault, swooning, fighting,
wandering, trespass, and more, which can be categorized into two groups based on whether
they can be defined as rule-based or not. However, training models for abnormal recognition
can be time-consuming and often require training videos rather than just object detection
and tracking data. Therefore, it becomes essential to adopt a practical number of
classes.
As a result, our abnormal recognition model specifically focuses on handling the challenge
of defining abnormal classes such as fighting, crossing a wall, and falling. As mentioned
in Section 2.1, our dataset for abnormal recognition incorporates complex relationships
between objects in various scenes. This feature empowers our anomaly detection network
to provide robust support to the detector, particularly in challenging and complex
scenes.
By concentrating on these specific cases and leveraging the complex relationship information,
our anomaly detection approach optimally enhances the overall performance and efficiency
of the system.
Fig. 2. The network architecture of the proposed two-stream abnormal recognition; upper and lower streams are used for spatial and temporal, respectively.
Fig. 3. The transition images of the frame morphing. The afterimage effects are shown in both figures.
Fig. 4. Sampled frame images of the proposed large-scale abnormal recognition dataset. Each image is sequentially sampled from the abnormal training videos.
4. Metadata Structure Design and Video Retrieval
This section introduces the SQL-based metadata structure and query-driven video retrieval. SQL is specialized for query-based data management; for this reason, we use an SQL database to manage the metadata. The proposed metadata structure is hierarchical, consisting of a common DB as the parent and two child DBs: the object DB and the abnormal DB. This separated hierarchical structure is more efficient and faster than a single compound structure because, when searching for a clip with a specific query, the separated structure allows the query to focus on the relevant table while ignoring other data. Table 1 shows the parent structure of the proposed SQL-based metadata. In Table 1, $Order$ is the insertion order in the SQL table, $Cam\_id$ is the identification number of the camera, and $Fr\_no$ is the recorded frame number. $Obj\_id$, $x, y, w, h$, $Color1$, $Color2$, and $Color3$ indicate the id of the detected object, its location information, and the representative colors up to rank 3.
Table 2 represents the trajectory metadata structure, which is one of the child metadata structures. The database in Table 2 is more concise than the structure in Table 1. In Table 2, $Start\_fr$ and $End\_fr$ are the frame numbers at which the object appears and disappears, respectively. $w$ and $h$ denote the average width and height of the object corresponding to $Obj\_id$. As in Table 1, $Color1$, $Color2$, and $Color3$ indicate the dominant representative colors up to rank 3.
Table 3 shows the metadata structure of events produced by anomaly recognition. The organization of the event metadata is similar to that of the common data structure. The aforementioned abnormal behavior classes include pushing, falling, crossing a wall, and normal; however, the normal class is absent from the event metadata because scenes classified as normal are stored in the common database.
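A minimal sketch of this hierarchical structure with SQLite is shown below; the SQL dialect and column types are assumptions, so the schema should be read as an illustration of Tables 1-3 rather than the deployed database definition.

```python
# Hedged sketch: create the parent (common) table and the two child tables
# (trajectory and abnormal) following Tables 1-3. SQLite column types are
# illustrative assumptions.
import sqlite3

conn = sqlite3.connect("surveillance_metadata.db")
cur = conn.cursor()

cur.execute("""CREATE TABLE IF NOT EXISTS common_db (
    "Order"  INTEGER PRIMARY KEY,
    Cam_id   INTEGER, Fr_no INTEGER, Obj_id INTEGER,
    x INTEGER, y INTEGER, w INTEGER, h INTEGER,
    Color1 INTEGER, Color2 INTEGER, Color3 INTEGER)""")

cur.execute("""CREATE TABLE IF NOT EXISTS trajectory_db (
    Obj_id INTEGER, Cam_id INTEGER,
    Start_fr INTEGER, End_fr INTEGER,
    w INTEGER, h INTEGER,
    Color1 INTEGER, Color2 INTEGER, Color3 INTEGER)""")

cur.execute("""CREATE TABLE IF NOT EXISTS abnormal_db (
    Obj_id INTEGER, Cam_id INTEGER, Fr_no INTEGER,
    x INTEGER, y INTEGER, w INTEGER, h INTEGER,
    Pushing INTEGER, Falling INTEGER, Crossing_Wall INTEGER)""")

conn.commit()
```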
Fig. 5 shows our retrieval system configuration. A combination of queries is activated by selecting the corresponding check boxes, and detailed query options are specified by filling in the text boxes or selecting a color. After the query is set up, the search results are listed at the bottom left. When a video path is selected, the summarized video is played on the right panel.
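For example, a combined query such as "a red object recorded by camera 1 that was involved in a pushing event" can be expressed over the child tables roughly as follows, reusing the cursor from the schema sketch above; the color code for "red" and the shared (Obj_id, Cam_id) join key are illustrative assumptions.

```python
# Hedged sketch: a combined metadata query over the trajectory and abnormal
# tables, returning the frame ranges to summarize. Color code 9 standing for
# "red" and the join key are illustrative assumptions.
rows = cur.execute("""
    SELECT t.Cam_id, t.Obj_id, t.Start_fr, t.End_fr
    FROM trajectory_db AS t
    JOIN abnormal_db  AS a
      ON a.Obj_id = t.Obj_id AND a.Cam_id = t.Cam_id
    WHERE t.Cam_id = 1
      AND 9 IN (t.Color1, t.Color2, t.Color3)   -- representative color "red"
      AND a.Pushing = 1
    GROUP BY t.Cam_id, t.Obj_id, t.Start_fr, t.End_fr
""").fetchall()

for cam_id, obj_id, start_fr, end_fr in rows:
    print(f"camera {cam_id}: object {obj_id} from frame {start_fr} to {end_fr}")
```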
Fig. 5. The example of retrieval based on the user query. The selectable queries are located at the left-top and the results of the search are listed below the query. The right side of the window displays a selected video stream.
Table 1. The practical database example of the parent structure in the proposed metadata. Detected objects are stored.
Order | Cam_id | Fr_no | Obj_id | x | y | w | h | Color1 | Color2 | Color3
0 | 1 | 1 | 0 | 264 | 347 | 137 | 412 | 1 | 4 | 2
1 | 1 | 1 | 1 | 1485 | 446 | 357 | 178 | 1 | 4 | 2
2 | 1 | 1 | 2 | 352 | 476 | 357 | 178 | 1 | 2 | 4
3 | 1 | 2 | 0 | 275 | 344 | 129 | 416 | 1 | 2 | 4
4 | 1 | 2 | 1 | 1484 | 447 | 358 | 177 | 1 | 2 | 3
5 | 1 | 2 | 2 | 360 | 475 | 238 | 150 | 1 | 4 | 9
6 | 1 | 3 | 0 | 284 | 343 | 122 | 414 | 1 | 2 | 4
7 | 1 | 3 | 1 | 1484 | 446 | 356 | 178 | 1 | 2 | 3
8 | 1 | 3 | 2 | 381 | 470 | 197 | 163 | 4 | 1 | 7
9 | 1 | 4 | 0 | 295 | 341 | 123 | 414 | 1 | 4 | 2
10 | 1 | 4 | 1 | 1487 | 445 | 355 | 180 | 1 | 9 | 7
11 | 1 | 4 | 2 | 1310 | 361 | 277 | 174 | 4 | 2 | 3
Table 2. Configuration of object trajectory database structure for simple and fast query-based data retrieval.
Obj_id | Cam_id | Start_fr | End_fr | w | h | Color1 | Color2 | Color3
0 | 0 | 96 | 229 | 130 | 171 | 5 | 9 | 4
1 | 0 | 53 | 112 | 79 | 145 | 5 | 8 | 4
2 | 0 | 265 | 394 | 81 | 118 | 5 | 8 | 7
Table 3. Example of the proposed abnormal database structure.
Obj_id | Cam_id | Fr_no | x | y | w | h | Pushing | Falling | Crossing Wall
1 | 0 | 321 | 1416 | 261 | 79 | 145 | 1 | 0 | 0
1 | 0 | 322 | 1392 | 270 | 78 | 144 | 1 | 0 | 0
1 | 0 | 323 | 1354 | 294 | 79 | 144 | 1 | 0 | 0
1 | 0 | 324 | 1362 | 294 | 78 | 144 | 1 | 0 | 0
1 | 0 | 325 | 1358 | 296 | 78 | 144 | 1 | 0 | 0
1 | 0 | 326 | 1354 | 296 | 78 | 144 | 1 | 0 | 0
1 | 0 | 327 | 1342 | 310 | 78 | 144 | 1 | 0 | 0
1 | 0 | 328 | 1338 | 298 | 78 | 144 | 1 | 0 | 0
1 | 0 | 329 | 1330 | 306 | 78 | 144 | 1 | 0 | 0
1 | 0 | 330 | 1270 | 332 | 78 | 144 | 1 | 0 | 0
1 | 0 | 331 | 1274 | 332 | 78 | 144 | 1 | 0 | 0
5. Experimental Results
This section presents the qualitative and quantitative experimental results for evaluating
the performance of the proposed method. The experiments encompass color-based retrieval
and behavior recognition on both public and handcrafted datasets. However, object
detection and tracking are not covered here, as we employ existing methods as modules
in our system.
The first set of experiments focuses on color extraction and retrieval verification
using the pedestrian color-naming dataset [29]. In this experiment, test images are fed into our model, and the model outputs the
predicted color. The rank 1 color prediction is considered as the representative color
of the person. When we include predictions up to rank 3, the accuracy of color extraction
increases. However, for better intuitive understanding, we generate a confusion matrix
using only the rank 1 experiment. Fig. 6 illustrates the confusion matrix of representative color extraction based on the
results from the pedestrian color-naming database.
The cell colors in the confusion matrix are proportional to the cell scores, with
higher scores depicted as darker blue and lower scores shown as white. As evident
in Fig. 6, similar colors such as white and gray, as well as the triple of orange, brown, and
red, exhibit lower accuracy in color extraction. This occurs because the representative colors lie close together in a continuous color space. To improve the accuracy of the predicted colors, we consider representative colors up to rank 3.
The second experiment involves analyzing images captured during the Frankfurt Marathon
using a smartphone. The main objective is to verify the practical applicability of
our method under real-world conditions. Our Frankfurt Marathon dataset comprises 332
images with mean heights and widths of 374 and 149 pixels, respectively. Fig. 7 displays the validation images used for color extraction-based retrieval, as well
as the 8 selected points required for the third experiment, which assesses color extraction
performance under varying illumination conditions.
As depicted in Fig. 7, the validation images encompass both the Frankfurt Marathon DB and internet-collected
images. In the Frankfurt data, we focus solely on the 8 specific colors, as people
wearing other colors are not present during the marathon.
The accuracy of the color-based retrieval in Table 4 is calculated as $\mathrm{Accuracy} = (Results/GT) \times 100\,(\%)$, where $Results$ represents the number of correct color prediction results and $GT$ indicates the number of ground truths.
Table 4 provides the measurements for true positive, false positive, and accuracy, with ranks
determined based on the first row. According to the table, the proposed color-based
method achieves an accuracy of 86.23%.
This experiment demonstrates the robustness of color extraction and retrieval based
on the extracted color under varying illumination conditions. As observed in Table 4, the accuracy for most colors gradually increases with rank, except for orange. The lower accuracy
for orange, pink, and red can be attributed to overlapping color distributions and
unequal region sizes of each color in the color space.
The third set of experiments investigates the impact of illumination changes on color
extraction. We specifically selected eight points from the same color region, such
as clothes, to evaluate their color extraction under varying illumination conditions.
The test images containing these selected points are displayed in Fig. 7.
Fig. 8 presents one of the results related to the orange color. In the figure, each column
represents the value and predicted color of a point. Despite allowing for up to rank
3 predictions, the model confidently assigns the first rank result with 100% confidence
for each point. This justification supports our decision to consider up to rank 3
results. In our approach, each pixel is attributed to one of the representative colors.
However, the region formed by the accumulation of pixels may consist of one color
or more than one color due to illumination variations. To account for this, we retrained
the PLSA-based model, aiming to control the effects of illumination changes, and thus,
adopted the predicted color up to rank 3. This experiment validates the reasonableness
of our system design. For further detailed experimental results, please refer to the
Appendix, where we have expanded on the figures and conducted an in-depth analysis
of the outcomes.
As stated in Section 3, we conducted experiments for abnormal behavior recognition using our database, which consists of three classes of abnormal behavior. These experiments were implemented in TensorFlow and tested on a system equipped with an Intel i7 CPU, 16 GB of RAM, and an NVIDIA GeForce GTX 1080 Ti. For the testing phase,
we generated 297 test video clips by cropping specific segments from the test videos.
Fig. 9 shows the prediction result and its confidence in the recognition of abnormal behavior.
The class with the highest confidence becomes the final behavior class. As shown in
Fig. 9, the proposed method can classify the anomaly scenes.
Table 5 presents the accuracy of the proposed method. Accuracy is estimated using (7). In Table 5, the first row and column present the prediction results and video categories, respectively.
Table 5 shows that the proposed method achieves 88.55%.
Fig. 6. Confusion matrix of color prediction on the pedestrian color-naming dataset as a public database.
Fig. 7. Configuration and information of validation images for representative color extraction.
Fig. 8. Impact of illumination changes on predicted color extraction results.
Fig. 9. The results of abnormal behavior prediction: pushing (fight), normal, and falling.
Table 4. Accuracy of Representative Color Extraction on the Frankfurt Marathon Dataset. This experiment includes 8 representative colors as the other 3 representative colors were not acquired in the Frankfurt marathon.
Color | Black | Blue | Gray | Green | Orange | Pink | Purple | Red | Total
Rank1(%) | 93.1 | 59.7 | 37.5 | 68.7 | 15.0 | 20.6 | 50.0 | 50.7 | 86.23
Rank2(%) | 97.5 | 86.6 | 45.8 | 90.6 | 30.0 | 62.0 | 100 | 74.6 | -
Rank3(%) | 97.5 | 96.1 | 91.6 | 92.1 | 30.0 | 75.8 | 100 | 83.5 | -
Table 5. Classification results and accuracy of the proposed abnormal behavior recognition on the test video clips.
Results | Pushing | Falling | Crossing Wall | Normal | Count
Pushing | 120 | 3 | 1 | 12 | 136
Falling | 0 | 71 | 0 | 9 | 80
Crossing Wall | 0 | 0 | 72 | 8 | 80
Accuracy | - | - | - | - | 88.85%
6. Conclusion
This paper introduced an intelligent surveillance system incorporating anomaly recognition
and metadata-based retrieval, aimed at efficiently and accurately retrieving specific
objects. A key highlight of the proposed system is its modular design, which improves
portability and adaptability. The color and aspect ratio extraction method showcased
in this work demonstrates its versatility in both detection and tracking tasks. Furthermore,
the standalone abnormal recognition module provides additional flexibility.
Moving forward, future research could focus on exploring instance segmentation as
a potential replacement for object detection, thereby improving the accuracy of metadata
extraction. Additionally, a comprehensive survey of tracking methods could be undertaken
to consolidate the tracking and recognition networks, leading to enhanced system performance.
By further advancing these areas, the proposed intelligent surveillance system can
be strengthened, offering increased precision and reliability for object retrieval
applications.
Overall, this paper has laid the foundation for an intelligent surveillance system
that combines anomaly recognition, metadata-based retrieval, and modular design. The
suggested future directions open exciting opportunities for further advancement, contributing
to the development of even more robust and effective surveillance systems in the field.
ACKNOWLEDGMENTS
This work was supported by Institute for Information & Communications Technology Planning
& Evaluation (IITP) grant funded by the Korea government (MSIT). [No. 2014-0-00077,
Development of global multi-target tracking and event prediction techniques based
on real-time large-scale video analysis and 2021-0-01341, Artificial Intelligence Graduate School Program (Chung-Ang University)].
REFERENCES
[1] S. Ren, et al., ``Faster R-CNN: Towards real-time object detection with region proposal networks,'' Advances in Neural Information Processing Systems, 2015.
[2] J. Redmon, et al., ``You only look once: Unified, real-time object detection,'' in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779-788, 2016.
[3] P. Voigtlaender, et al., ``MOTS: Multi-object tracking and segmentation,'' in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, pp. 7934-7943, 2019.
[4] F. Fahime, et al., ``Probabilistic Kalman filter for moving object tracking,'' Signal Processing: Image Communication, vol. 82, 115751, 2020.
[5] W. Liu, et al., ``SSD: Single shot multibox detector,'' in Proceedings of Computer Vision - ECCV 2016, Part I, Springer, pp. 21-37, 2016.
[6] M. Tan, et al., ``EfficientDet: Scalable and efficient object detection,'' in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10781-10790, 2020.
[7] C. Li, et al., ``YOLOv6 v3.0: A full-scale reloading,'' arXiv preprint arXiv:2301.05586, 2023.
[8] C. Y. Wang, et al., ``YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors,'' in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7464-7475, 2023.
[9] S. Xu, et al., ``PP-YOLOE: An evolved version of YOLO,'' arXiv preprint arXiv:2203.16250, 2022.
[10] N. Wojke, et al., ``Simple online and realtime tracking with a deep association metric,'' in Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), IEEE, pp. 3645-3649, 2017.
[11] X. Zhou, et al., ``Objects as points,'' arXiv preprint arXiv:1904.07850, 2019.
[12] Y. H. Wang, et al., ``SMILEtrack: SiMIlarity LEarning for Multiple Object Tracking,'' arXiv preprint arXiv:2211.08824, 2022.
[13] M. Sakurada, et al., ``Anomaly detection using autoencoders with nonlinear dimensionality reduction,'' in Proceedings of the MLSDA 2014 2nd Workshop on Machine Learning for Sensory Data Analysis, pp. 4-11, 2014.
[14] V. Mahadevan, et al., ``Anomaly detection in crowded scenes,'' in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, IEEE, pp. 1975-1981, 2010.
[15] A. B. Chan, et al., ``Modeling, clustering, and segmenting video with mixtures of dynamic textures,'' IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, pp. 909-926, 2008.
[16] J. Sivic, et al., ``Video Google: Efficient visual search of videos,'' Toward Category-Level Object Recognition, pp. 127-144, 2006.
[17] I. Laptev, et al., ``On space-time interest points,'' International Journal of Computer Vision, vol. 64, pp. 107-123, 2005.
[18] J. Carreira, et al., ``Quo vadis, action recognition? A new model and the Kinetics dataset,'' in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299-6308, 2017.
[19] P. Jiang, et al., ``A review of YOLO algorithm developments,'' Procedia Computer Science, vol. 199, pp. 1066-1073, 2022.
[20] J. Redmon, et al., ``YOLOv3: An incremental improvement,'' arXiv preprint arXiv:1804.02767, 2018.
[21] G. Jocher, et al., ``Ultralytics/yolov5: v3.0,'' 2020.
[22] C. Y. Wang, et al., ``YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors,'' arXiv preprint arXiv:2207.02696, 2022.
[23] H. Possegger, T. Mauthner, and H. Bischof, ``In defense of color-based model-free tracking,'' in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2113-2120, 2015.
[24] J. van de Weijer, et al., ``Learning color names for real-world applications,'' IEEE Transactions on Image Processing, vol. 18, pp. 1512-1523, 2009.
[25] K. Simonyan and A. Zisserman, ``Very deep convolutional networks for large-scale image recognition,'' arXiv preprint arXiv:1409.1556, 2014.
[26] O. Russakovsky, et al., ``ImageNet large scale visual recognition challenge,'' International Journal of Computer Vision, vol. 115, pp. 211-252, 2015.
[27] D. Tran, et al., ``Learning spatiotemporal features with 3D convolutional networks,'' in Proceedings of the IEEE International Conference on Computer Vision, pp. 4489-4497, 2015.
[28] A. Karpathy, et al., ``Large-scale video classification with convolutional neural networks,'' in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1725-1732, 2014.
[29] Z. Cheng, et al., ``Pedestrian color naming via convolutional neural network,'' in Proceedings of Computer Vision - ACCV 2016: 13th Asian Conference on Computer Vision, Taipei, Taiwan, Revised Selected Papers, Part II, Springer, pp. 35-51, 2017.
Hyungtae Kim received his B.S degree in the department of Electrical Engineering from
the University of Suwon, Korea in 2012. He then pursued his M.S. degree in image processing
at Chung-Ang University, Korea, in 2015. He is currently pursuing a Ph.D. degree in
image processing at Chung-Ang University, Korea. He was involved in various projects
including the development of high performance visual big data discovery platform for
large-scale real-time data analysis (DeepView) funded by MS&ICT and research of a
video recording system for vehicles funded by Hyundai NGV. His research interests
include applied optics, computer vision and functional safety, SOTIF in SDV.
Joongchol Shin received the B.S. degree in information and communication engineering
from Gyeong-Sang National University, Jinju, South Korea, in 2017 and the M.S. degree
in digital imaging engineering from Chung-Ang University, Seoul, South Korea, where
he is currently working toward the Ph.D. degree in digital imaging engineering. His
research interests include digital image restoration, object recognition, and video
analysis.
Seokmok Park was born in Pusan, Korea, in 1991. He received a BSc in Mechatronics
Engineering from Kyungsung University, Korea, in 2016. He received an MSc in Image
Engineering at Chung-Ang University, Korea, in 2018. Currently, he is pursuing a PhD
in Image Engineering at Chung-Ang University. His research interests include intelligent
vehicle systems, including camera calibration, object detection and recognition, and
optical character recognition.
Joonki Paik was born in Seoul, South Korea, in 1960. Dr. Paik obtained his B.Sc. degree
in control and instrumentation engineering from Seoul National University, Seoul,
South Korea, in 1984. He then pursued his M.Sc. and Ph.D. degrees in electrical engineering
and computer science at Northwestern University, Evanston, IL, USA, graduating in
1987 and 1990, respectively. From 1990 to 1993, Dr. Paik worked at Samsung Electronics,
where he played a vital role in designing image stabilization chip sets for consumer
camcorders. Since 1993, he has held a professorship at Chung-Ang University. Additionally,
he served as a Visiting Professor at the Department of Electrical and Computer Engineering,
University of Tennessee, Knoxville, TN, USA, from 1999 to 2002. Dr. Paik has held
various leadership positions, including Full-time Technical Consultant for the System
LSI Division of Samsung Electronics in 2008, Vice President of the IEEE Consumer Electronics
Society from 2016 to 2018, President of the Institute of Electronics and Information
Engineers (IEIE) in 2018, Executive Vice President of Research at Chung-Ang University
in 2020, and Provost of Chung-Ang University from 2020 until the present. Furthermore,
he has been the Dean of the Artificial Intelligence Graduate School at Chung-Ang University
since 2021 and the Director of the Military AI Education Program supported by Ministry
of Defense of Korea, at Chung-Ang University since 2022.