Shin Dong-yeon1
Lee Seong-won1,*
(Department of Computer Engineering, Kwangwoon University, Seoul 01897, Korea)
Copyright © The Institute of Electronics and Information Engineers (IEIE)
Keywords
Multi-object tracking, Swimming dataset, Scene detection network, cIoU
1. Introduction
Recent advances in artificial intelligence have significantly impacted various fields,
with multi-object tracking (MOT) being one of the most widely used applications in
image processing. MOT involves monitoring multiple objects simultaneously within continuous
frames. It is utilized for tracking objects in embedded systems from diverse fields
to detect abnormal behavior [1], and plays a crucial role in autonomous driving [2].
An essential objective is to identify and differentiate multiple objects. Achieving
this requires not only effective detection of specific objects but also continuous,
prolonged tracking of their presence. It is crucial that a detected object remains
distinguishable from other objects throughout the tracking process. If an object
temporarily goes undetected, or its identity cannot be confirmed as the same object
from the previous frame, it is recognized as a new object, significantly diminishing
tracking accuracy.
Leveraging diverse design strategies, deep learning networks have demonstrated outstanding
performance in object detection. Their superior performance ensures robust capabilities
in simultaneously and persistently tracking multiple objects. In this tracking-by-detection
approach, the coordinates of each object are obtained from the detector in every frame.
State-of-the-art (SOTA) tracking networks employ one-stage detectors
[3,4] to enhance accuracy, capitalizing on their speed and high precision [5-7].
While existing deep learning tracking networks primarily focus on videos featuring
static surroundings and well-defined object boundaries, such as pedestrians on roads
[8,9], this paper introduces a unique dataset centered on swimming [10,11]. In this dataset, the surrounding environment is highly dynamic, and occlusion is
particularly severe. Several distinctions set the swimming dataset apart from conventional
MOT datasets. First, frequent occlusion is not only caused by other objects but also
arises from dynamic surroundings such as water and intense spray. Second, even mild
spray can introduce noise, posing challenges for object detection. Third, swimmers
exhibit substantial posture deformations owing to rapid movements, especially during
departure and turns, and even while swimming in a straight line. These distinctive
characteristics are responsible for a more significant degradation in tracking accuracy
compared to other datasets.
In this paper, the FairMOT [12] tracking-by-detection method is employed as the baseline architecture. In addition
to FairMOT's object detection and its Re-ID branch, which is responsible for assigning
object IDs, a scene detection network that accounts for the effect of environmental
changes on detection and tracking is integrated into CenterNet [4], the feature extraction
network of FairMOT. This additional network adjusts the IoU metric used at the object
detection stage and the predicted object location of the Kalman filter.
In object detection, a modified IoU metric (cIoU) extends the existing IoU metric
to consider not only the overlap between the detected and predicted bounding boxes
but also additional factors such as center-point distance and aspect ratio. In cIoU,
a scene-specific metric is achieved by assigning weights to each input parameter based
on the scene information detected in each frame. Performance is further enhanced by
adaptively adjusting the degree to which the Kalman filter's predicted object location
is reflected in the detected bounding box from the network.
2. Related Work
2.1 CenterNet
CenterNet serves as the detection branch of the FairMOT network. It employs an anchor-free
method for object detection, where the key point representing the center of the object
is identified as the peak point in a heatmap generated using a Gaussian kernel. The
height and width of the object are then determined based on these center points. The
object's bounding box is predicted using the calculated height and width, and the
offset of the center point is predicted in order to mitigate errors introduced by
the stride applied during key point generation.
CenterNet's architecture utilizes backbone networks such as ResNet [14], Hourglass [15], and DLA [16] to predict key points on a heatmap. Features from the backbone network are fed
to separate heads, which predict the heatmap, box size, and offset. Each head is trained
with its respective loss, and together they contribute to the final prediction of
the object's bounding box.
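As a concrete illustration, the following is a minimal sketch of this anchor-free decoding step (not FairMOT's actual implementation; the shapes, single-class assumption, and top-k limit are illustrative):

```python
import torch
import torch.nn.functional as F

def decode_centers(heatmap, wh, offset, k=10):
    """CenterNet-style decoding sketch: local heatmap peaks become object
    centers, and the box-size and offset heads refine the boxes.
    Assumed shapes (batch 1, one class): heatmap (1,1,H,W),
    wh (1,2,H,W), offset (1,2,H,W)."""
    # A point survives only if it equals the max of its 3x3 neighborhood,
    # which acts as a cheap non-maximum suppression on the heatmap.
    peaks = heatmap == F.max_pool2d(heatmap, 3, stride=1, padding=1)
    scores, idx = (heatmap * peaks).view(-1).topk(k)

    W = heatmap.shape[3]
    ys = torch.div(idx, W, rounding_mode="floor").float()
    xs = (idx % W).float()
    # The offset head compensates for quantization caused by the stride.
    xs = xs + offset[0, 0].view(-1)[idx]
    ys = ys + offset[0, 1].view(-1)[idx]
    w, h = wh[0, 0].view(-1)[idx], wh[0, 1].view(-1)[idx]
    # Boxes as (x1, y1, x2, y2) in heatmap coordinates.
    boxes = torch.stack([xs - w / 2, ys - h / 2,
                         xs + w / 2, ys + h / 2], dim=1)
    return boxes, scores
```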
2.2 Deep SORT
In Deep SORT [17] (a tracking algorithm based on the tracking-by-detection method), improvements were
made to address the ID switching problem in SORT [18]. The network output Re-ID feature is applied to the matching algorithm, enhancing
the handling of bounding boxes from the detection network.
Deep SORT tracks an object's location by calculating the distance between the Kalman
filter's prediction, propagated from the previous frame, and the detections in the
current frame. Cascade matching and IoU matching are performed in Deep SORT for this
purpose. Cascade matching assesses similarity through the cosine distance of Re-ID
features and assigns matches using the Hungarian algorithm. IoU matching then computes
a cost of 1 − IoU for the remaining unmatched detections and predictions, with matching
again performed by the Hungarian algorithm on these costs.
2.3 Swimmer Tracking
In one study, three modules were developed for swimmer detection and tracking, comprising
background modeling, swimmer detection, and swimmer tracking [19]. Swimmers and swimming pool backgrounds are separated using background modeling,
and pixels related to the swimmer are grouped through a mean-shift clustering algorithm.
Following this image pre-processing, swimmers are detected using a cascaded boosting
learning algorithm (a type of machine learning). However, several issues arose in
the detection results. The accuracy varied based on the pixel size of the detector.
In [19], using a detector size of 10 × 10 pixels, the lowest hit rate observed was 56%. Detection
accuracy was affected by the dynamic background changes. Moreover, incorrect detections
were attributed to spray, a characteristic present in the swimming dataset. To address
false detections caused by spray, an appropriate threshold had to be set, and pre-processing
was implemented. In [19], a significant difference was observed between the pre-processed and non–pre-processed
images, with bounding box overlap even in the pre-processed image.
The study in [11] proposed a network system designed to automatically determine the number of strokes
by a swimmer from overhead race videos (ORVs). ORVs are captured for viewing or analytical
purposes, and are taken with broadcast or professional camera equipment, encompassing
scenarios with and without camera movement. In [11], a network based on the VGG16 architecture [20] was trained to predict swimming strokes. The stroke cycle was defined as a sine curve
in the range [0,1], with 1 corresponding to the stroke going past the ear and 0 when going past the
body. Additionally, YOLOv3 [21] was employed to detect the swimmers. For a comparative analysis based on network
size, both YOLOv3-416-tiny using Darknet15 as the backbone and YOLOv3-416 using Darknet53
as the backbone were utilized and compared. The model pretrained on the COCO dataset
was employed for training, and the SORT algorithm was implemented for tracking. The
tracking results revealed a high multi-object tracking accuracy (MOTA) score of 89.34%
for the training set, but the MOTA score for the test set was significantly lower
at 11.21%. While the system performed well for the swimming class, which had the most
data and was easier to track, other classes exhibited low MOTA values due to data
scarcity, changes in camera viewpoint, water refraction, and occlusion by individuals
near the swimming pool.
3. The Proposed Method
3.1 Scene Detection Network
We incorporated a scene detection head into the existing FairMOT network to classify
the scene in every frame. Following the backbone network, FairMOT branches into heatmap,
box size, offset, and Re-ID heads to generate output values. We introduced an additional
head for classifying swimming event scenes, and applied per-frame class labels to
the swimming dataset to train the network to classify the current input image. The
overall network architecture is shown in Fig. 1. Classes were categorized into Swimming, Turning, Diving, Finish, and On-block. The
Swimming class (Fig. 2) encompasses scenes after swimmers dive into the water (i.e., bodies submerged):
they swim until just before the return point, swim again after the turn, and finish
at the endpoint. However, since swimmers race at varying speeds, the moment
they initiate the turn differs. Consequently, in that scene, the class switches from
Swimming to Turning.
Fig. 1. The Architecture of the Entire Network.
Fig. 2. An Example of the Swimming Class.
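A minimal sketch of such a classification head is shown below; the channel count, pooling choice, and layer sizes are assumptions rather than the paper's exact configuration:

```python
import torch.nn as nn

class SceneHead(nn.Module):
    """Sketch of the added scene classification head: it pools the shared
    backbone feature map and predicts one of the five scene classes
    (Swimming, Turning, Diving, Finish, On-block)."""
    def __init__(self, in_channels=64, num_classes=5):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)   # collapse spatial dimensions
        self.fc = nn.Linear(in_channels, num_classes)

    def forward(self, feat):                  # feat: (B, C, H, W)
        x = self.pool(feat).flatten(1)        # (B, C)
        return self.fc(x)                     # per-frame class logits
```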
In the freestyle competition, when swimmers reach the return point (Fig. 3), the Turning class spans from the moment swimmers begin to turn their bodies while
putting their heads into the water until the last swimmer passes that point. For the
breaststroke competition, the Turning class starts at the point where swimmers reach
out and touch the wall and lasts until they push off the return point with their feet
and resume the breaststroke.
Fig. 3. An Example of the Turning Class.
As depicted in Fig. 4, the Diving class spans from the moment the swimmers' hands leave the blocks
until their whole bodies enter the water.
Fig. 4. An Example of the Diving Class.
As shown in Fig. 5, the Finish class spans from the moment swimmers stop at the end of the race through
to the moment they stand up while reaching out to the wall.
Fig. 5. An Example of the Finish Class.
The On-block class spans from the moment swimmers wait on the starting blocks before
the start of the race until their hands leave the blocks, as shown in Fig. 6. In the backstroke, the preparation posture in the On-block class differs because
swimmers hold the rods under the blocks and wait, as shown in Fig. 7; the class lasts until the moment they release the rods.
Fig. 6. An Example of the On-block Class.
Fig. 7. An Example of the Backstroke On-block Class.
3.2 The cIoU Metric
When tracking within the FairMOT network, the object's subsequent movement is predicted
using cascade matching and IoU matching in Deep SORT. We propose a weighted cIoU formula
[13] that extends the IoU used in Deep SORT's IoU matching by additionally considering
the distance between the center points of the two bounding boxes and their aspect
ratios. As shown in (1), the term measuring the distance between the center points
is multiplied by $\left(1-IoU\right)^{2}$ to reduce the center-distance penalty when
the IoU is high.
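A plausible reconstruction of (1), adapting the CIoU definition in [13] with the modifications described above, is

$$cIoU = IoU - \left(1-IoU\right)^{2}\,\frac{\rho^{2}\left(b,\,b'\right)}{c^{2}} - \omega v, \qquad v = \frac{4}{\pi^{2}}\left(\arctan\frac{w'}{h'} - \arctan\frac{w}{h}\right)^{2} \tag{1}$$

where $\rho\left(b,b'\right)$ is the Euclidean distance between the centers of the two bounding boxes, $c$ is the diagonal length of the smallest box enclosing both, and $v$ is the aspect-ratio consistency term from [13].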
Conversely, if the IoU is low, indicating significant inconsistency, the candidate
can be excluded from matching through the penalty imposed by the distance between
the centers of the bounding boxes. Moreover, when the aspect ratio changes significantly,
the weight $\omega $ imposes an additional penalty during matching, improving accuracy.
We propose adjusting the weights of this formula according to the class of the detected
scene. For each input image, the class predicted by the proposed scene detection network
determines the optimal $\omega $ value, found through a grid search, which is then
applied in the cIoU matching step.
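The following sketch applies the reconstructed formula with a per-class weight; the iou() helper from the earlier matching sketch is reused, and the weight table follows the grid-search values reported in Section 4:

```python
import math

# Optimal aspect-ratio weights per scene class, following the grid-search
# results in Section 4 (41 for Diving and On-block, 1 otherwise).
OMEGA = {"Swimming": 1, "Turning": 1, "Diving": 41, "Finish": 1, "On-block": 41}

def ciou(a, b, scene):
    """Weighted cIoU sketch under the reconstructed equation (1): the
    center-distance penalty is damped by (1 - IoU)^2 and the aspect-ratio
    term is scaled by the scene-dependent weight omega.
    Boxes are (x1, y1, x2, y2); iou() is the helper defined earlier."""
    i = iou(a, b)
    # Squared center distance, normalized by the enclosing box diagonal.
    ax, ay = (a[0] + a[2]) / 2, (a[1] + a[3]) / 2
    bx, by = (b[0] + b[2]) / 2, (b[1] + b[3]) / 2
    ex1, ey1 = min(a[0], b[0]), min(a[1], b[1])
    ex2, ey2 = max(a[2], b[2]), max(a[3], b[3])
    rho2 = (ax - bx) ** 2 + (ay - by) ** 2
    c2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2 + 1e-9
    # Aspect-ratio consistency term from the CIoU definition in [13].
    v = (4 / math.pi ** 2) * (
        math.atan((a[2] - a[0]) / (a[3] - a[1] + 1e-9))
        - math.atan((b[2] - b[0]) / (b[3] - b[1] + 1e-9))) ** 2
    return i - (1 - i) ** 2 * rho2 / c2 - OMEGA[scene] * v
```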
3.3 Adjusting the Kalman Filter Hyper-parameter
We propose adjusting position and velocity (hyper-parameters of the Kalman filter)
based on the scene class. Initially, we determine the optimal hyper-parameter value
for each class by varying each value over 100, 10, 1, 1/10, and 1/100 in a grid search.
Table 1 displays the optimal Kalman filter hyper-parameter values for each class identified
through the grid search. Afterward, when an input image is classified by the scene
detection network, the Kalman filter's hyper-parameters are set to the optimal values
for that class.
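A minimal sketch of applying the per-class values follows. The attribute names match the reference Deep SORT implementation, whose default weights are 1/20 and 1/160; treating Table 1's values as multipliers of those defaults is an assumption, not the paper's stated scheme:

```python
# Per-class multipliers for the Kalman filter's position and velocity
# uncertainty, taken from Table 1.
KF_PARAMS = {
    "Swimming": (100, 10),
    "Turning":  (100, 100),
    "Diving":   (1 / 100, 1 / 100),
    "Finish":   (100, 10),
    "On-block": (1 / 100, 1 / 100),
}

def apply_scene_params(kf, scene):
    """Scene-adaptive tuning sketch: scale the filter's base noise weights
    by the class-specific factors. The attributes follow the reference
    Deep SORT KalmanFilter; the scaling scheme here is an assumption."""
    pos_scale, vel_scale = KF_PARAMS[scene]
    kf._std_weight_position = (1 / 20) * pos_scale
    kf._std_weight_velocity = (1 / 160) * vel_scale
```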
Table 1. Optimized Hyper-parameters for each Class.

| Class    | Position | Velocity |
|----------|----------|----------|
| Swimming | 100      | 10       |
| Turning  | 100      | 100      |
| Diving   | 1/100    | 1/100    |
| Finish   | 100      | 10       |
| On-block | 1/100    | 1/100    |
4. Experiments
As shown in Table 2, we first ran the existing FairMOT on the class-labeled video sequences and measured
tracking performance using standard multi-object tracking metrics.
Table 2. Metrics from FairMOT Tracking.

| Classes                    | Videos  | IDF1  | IDP   | IDR   | MT | PT | ML | MOTA  | MOTP  |
|----------------------------|---------|-------|-------|-------|----|----|----|-------|-------|
| Swimming                   | SWIM-02 | 79.7% | 81.6% | 78%   | 7  | 1  | 0  | 85.4% | 23.8% |
| Swimming                   | SWIM-03 | 86.5% | 88.5% | 84.6% | 7  | 1  | 0  | 83.4% | 23.5% |
| Turning, Swimming          | SWIM-07 | 49.4% | 59.6% | 42.2% | 3  | 4  | 1  | 47.5% | 28.3% |
| Turning, Swimming          | SWIM-08 | 49.3% | 54.7% | 45%   | 3  | 5  | 0  | 53.5% | 30.6% |
| On-block, Diving, Swimming | SWIM-27 | 35.3% | 41.8% | 30.5% | 1  | 7  | 0  | 34.9% | 27.4% |
| On-block, Diving, Swimming | SWIM-31 | 42.2% | 48.2% | 37.5% | 3  | 5  | 0  | 51.3% | 22.3% |
| Swimming, Finish           | SWIM-06 | 52.3% | 57.5% | 48%   | 2  | 6  | 0  | 52.2% | 26.8% |
| Swimming, Finish           | SWIM-14 | 66.4% | 72.6% | 61.1% | 3  | 5  | 0  | 47.4% | 28.9% |
In SWIM-02 and -03, which consist solely of the Swimming class, MOTA scores were
high at 85.4% and 83.4%, respectively. In these sequences, swimmers move consistently
without significant changes in motion, allowing for stable tracking.
In SWIM-07 and -08 (sequences where swimmers reach the return point and turn), MOTA
scores were 47.5% and 53.5%, respectively. The lower MOTA scores compared to the Swimming
class can be attributed to the side-on camera angle, as shown in Fig. 2. Swimmers farther from the camera are less visible than those closer to it, which
makes tracking them challenging. Additionally, when swimmers turn, they are obscured
by spray from their movements in the water, leading to ID switching due to the significant
changes in motion.
SWIM-27 and -31, with MOTA scores of 34.9% and 51.3%, respectively, are sequences
that include On-block, Diving, and Swimming classes. In these sequences, the most
ID switching occurred when transitioning the class from On-block to Diving. When swimmers
extend their bodies from a crouching position into the dive, significant changes in
features and IoU take place, leading to failed matches in both cascade matching and
IoU matching in Deep SORT. Due to significant changes in motion, the IDs of some swimmers
switched not only when transitioning from On-block to Diving but also during the diving
motion.
SWIM-31 exhibited a higher MOTA score than SWIM-27. The reason is that when
transitioning from On-block to Diving, the IDs of swimmers changed, but these new
IDs were subsequently well-maintained. The ID changed at the moment the class switched,
owing to the change in IoU, but afterward, tracking was maintained through Re-ID features.
The difference was that in SWIM-31 it was easy to find Re-ID features, since there
was more training data on topless male swimmers. In contrast, SWIM-27 had relatively
more difficulty finding Re-ID features because the female swimmers wore full-body
swimsuits.
SWIM-06 and -14 are sequences that contain scenes at the end of the swimming competition,
concluding with the Finish class. The two sequences showed MOTA scores of 52.2% and
47.4%, respectively. During the Finish class, the camera composition changed to a
view beside the swimming pool, causing one swimmer to be covered by the referee and
leading to ID switching. ID switching also occurred due to spray in the swimming scenes,
and further ID switching occurred due to changes in motion and features as the Swimming
class transitioned to the Finish class.
We then experimented by changing the IoU formula to the cIoU formula in Deep SORT's
IoU matching and by weighting the aspect ratio. We first used a grid search to find
the optimal aspect-ratio weight for each class. As a result, the Diving and On-block
classes showed optimal MOTA values at a weight of 41, while the remaining classes
showed the highest values at a weight of 1. Based on these values, we set the optimal
weight for each scene through the scene detection network. The results are shown in
Table 3. The cIoU change does not appear to have affected IoU matching in SWIM-02, -03, -06,
and -14, while the MOTA score slightly increased in SWIM-07, -27, and -31. SWIM-02,
-03, -06, and -14 are sequences in which significant IoU changes do not occur, so
they were unaffected by the cIoU formula change. SWIM-07, -27, and -31 contain classes
with large IoU changes, such as On-block, Diving, and Turning, so the cIoU formula
affected performance.
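A sketch of this per-class search is given below; the candidate grid and the run_tracker() evaluation function are placeholders for the actual pipeline, not its real interface:

```python
def grid_search_omega(sequences, candidates=(1, 11, 21, 31, 41, 51)):
    """Per-class grid search sketch: evaluate tracking on held-out
    sequences for each candidate weight and keep the one maximizing MOTA.
    run_tracker() and its MOTA return value are hypothetical placeholders."""
    best = {}
    for scene_class, seqs in sequences.items():
        scores = {w: run_tracker(seqs, omega=w) for w in candidates}
        best[scene_class] = max(scores, key=scores.get)
    return best
```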
Table 3. Metrics from cIoU’s Weight Adjustment Tracking.

| Classes                    | Videos  | IDF1  | IDP   | IDR   | MT | PT | ML | MOTA  | MOTP  |
|----------------------------|---------|-------|-------|-------|----|----|----|-------|-------|
| Swimming                   | SWIM-02 | 79.7% | 81.6% | 78%   | 7  | 1  | 0  | 85.4% | 23.8% |
| Swimming                   | SWIM-03 | 86.5% | 88.5% | 84.6% | 7  | 1  | 0  | 83.4% | 23.5% |
| Turning, Swimming          | SWIM-07 | 49%   | 59%   | 41.9% | 3  | 4  | 1  | 47.7% | 28.4% |
| Turning, Swimming          | SWIM-08 | 49.3% | 54.7% | 45%   | 3  | 5  | 0  | 53.5% | 30.6% |
| On-block, Diving, Swimming | SWIM-27 | 37.8% | 45%   | 32.6% | 1  | 6  | 1  | 35.5% | 26.8% |
| On-block, Diving, Swimming | SWIM-31 | 41%   | 46.9% | 36.5% | 3  | 5  | 0  | 51.8% | 21.9% |
| Swimming, Finish           | SWIM-06 | 52.3% | 57.5% | 48%   | 2  | 6  | 0  | 52.2% | 26.8% |
| Swimming, Finish           | SWIM-14 | 66.2% | 72.5% | 60.9% | 3  | 5  | 0  | 47.2% | 28.9% |
In addition, through the scene detection network, the position and velocity hyper-parameters
of the Kalman filter can be selected and applied per scene, as shown in
Table 1. We present the results in Table 4. SWIM-02 and -03, which contain only the Swimming class, showed improved IDF1 and
MOTA scores when the hyper-parameters optimal for the Swimming class were applied.
In SWIM-07 and -08, where Turning and Swimming appear, IDF1 improved significantly,
and in SWIM-07, the MOTA score improved by 5.2 compared to the original FairMOT.
In SWIM-27 and -31, adjusting the Kalman filter's hyper-parameters was not significantly
effective. As seen in Table 1, the optimal values for the Diving and On-block classes show that these classes were
not strongly affected by the Kalman filter's hyper-parameter weights, and an inappropriate
hyper-parameter value at the switch from Diving to Swimming caused performance degradation.
In SWIM-06 and -14, hyper-parameter adjustment improved IDF1, and the ID was maintained
even when a swimmer was occluded by the referee. ID switching also improved at the
moment Swimming changed to Finish.
Table 4. Metrics from cIoU’s Weight and the Kalman Filter Hyper-parameter Adjustment.

| Classes                    | Videos  | IDF1  | IDP   | IDR   | MT | PT | ML | MOTA  | MOTP  |
|----------------------------|---------|-------|-------|-------|----|----|----|-------|-------|
| Swimming                   | SWIM-02 | 88.3% | 90.2% | 86.6% | 8  | 0  | 0  | 87.7% | 24.2% |
| Swimming                   | SWIM-03 | 91.9% | 93.8% | 90.1% | 7  | 1  | 0  | 84.1% | 23.4% |
| Turning, Swimming          | SWIM-07 | 71.2% | 84.4% | 61.6% | 4  | 3  | 1  | 52.7% | 28.2% |
| Turning, Swimming          | SWIM-08 | 73.9% | 80.3% | 68.5% | 2  | 6  | 0  | 53.4% | 32.4% |
| On-block, Diving, Swimming | SWIM-27 | 37.5% | 45.5% | 32%   | 1  | 6  | 1  | 35.5% | 26.7% |
| On-block, Diving, Swimming | SWIM-31 | 40.2% | 46.9% | 35.2% | 3  | 5  | 0  | 46.9% | 21.9% |
| Swimming, Finish           | SWIM-06 | 62.5% | 68%   | 57.8% | 2  | 6  | 0  | 47%   | 28%   |
| Swimming, Finish           | SWIM-14 | 70.4% | 75.8% | 65.8% | 2  | 6  | 0  | 48.1% | 28.8% |
5. Conclusion
In this paper, we proposed a method to improve tracking performance on a swimming
dataset with the FairMOT tracking network by modifying the IoU formula in Deep SORT's
IoU matching into a weighted cIoU formula that incorporates additional factors. This
method enhanced MOTA scores by up to 0.6% in specific sequences. Additionally, the
MOTA score improved by up to 5.2% through class-based adjustment of the Kalman filter's
hyper-parameters.
In future studies, we intend to enhance multi-object tracking performance by individually
analyzing and considering the scene characteristics of each object, especially when
dealing with swimmers detected in various states of motion, such as Turning.
ACKNOWLEDGMENTS
This work was supported by a National Research Foundation of Korea (NRF) grant funded
by the Korea government (MSIT) (NRF-2021R1F1A1060183), by a Korea Institute for Advancement
of Technology (KIAT) grant funded by the Korea Government (MOTIE) (P0017124, HRD Program
for Industrial Innovation), and by a Research Grant from Kwangwoon University in 2021.
REFERENCES
Shehzed, Ahsan, Ahmad Jalal, and Kibum Kim. "Multi-person tracking in smart surveillance
system for crowd counting and normal/abnormal events detection." 2019 international
conference on applied and engineering mathematics (ICAEM). IEEE, 2019.
Guo, Lie, et al. "Pedestrian tracking based on CamShift with Kalman prediction for
autonomous vehicles." International Journal of Advanced Robotic Systems 13.3 (2016):
120.
Redmon, Joseph, et al. "You only look once: Unified, real-time object detection."
Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.
Zhou, Xingyi, Dequan Wang, and Philipp Krähenbühl. "Objects as points." arXiv preprint
arXiv:1904.07850 (2019).
Aharon, Nir, Roy Orfaig, and Ben-Zion Bobrovsky. "BoT-SORT: Robust associations multi-pedestrian
tracking." arXiv preprint arXiv:2206.14651 (2022).
Wang, Yu-Hsiang. "SMILEtrack: SiMIlarity LEarning for Multiple Object Tracking." arXiv
preprint arXiv:2211.08824 (2022).
Maggiolino, Gerard, et al. "Deep OC-SORT: Multi-pedestrian tracking by adaptive re-identification."
arXiv preprint arXiv:2302.11813 (2023).
Milan, Anton, et al. "MOT16: A benchmark for multi-object tracking." arXiv preprint
arXiv:1603.00831 (2016).
Dendorfer, Patrick, et al. "MOT20: A benchmark for multi object tracking in crowded
scenes." arXiv preprint arXiv:2003.09003 (2020).
Woinoski, Timothy, Alon Harell, and Ivan V. Bajić. "Towards automated swimming analytics
using deep neural networks." arXiv preprint arXiv:2001.04433 (2020).
Woinoski, Timothy, and Ivan V. Bajić. "Swimmer stroke rate estimation from overhead
race video." 2021 IEEE International Conference on Multimedia & Expo Workshops (ICMEW).
IEEE, 2021.
Zhang, Yifu, et al. "FairMOT: On the fairness of detection and re-identification in
multiple object tracking." International Journal of Computer Vision 129 (2021): 3069-3087.
Zheng, Zhaohui, et al. "Distance-IoU loss: Faster and better learning for bounding
box regression." Proceedings of the AAAI conference on artificial intelligence. Vol.
34. No. 07. 2020.
He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of
the IEEE conference on computer vision and pattern recognition. 2016.
Newell, Alejandro, Kaiyu Yang, and Jia Deng. "Stacked hourglass networks for human
pose estimation." Computer Vision-ECCV 2016: 14th European Conference, Amsterdam,
The Netherlands, October 11-14, 2016, Proceedings, Part VIII 14. Springer International
Publishing, 2016.
Yu, Fisher, et al. "Deep layer aggregation." Proceedings of the IEEE conference on
computer vision and pattern recognition. 2018.
Wojke, Nicolai, Alex Bewley, and Dietrich Paulus. "Simple online and realtime tracking
with a deep association metric." 2017 IEEE international conference on image processing
(ICIP). IEEE, 2017.
Bewley, Alex, et al. "Simple online and realtime tracking." 2016 IEEE international
conference on image processing (ICIP). IEEE, 2016.
Sha, Long, et al. "Understanding and analyzing a large collection of archived swimming
videos." IEEE Winter Conference on Applications of Computer Vision. IEEE, 2014.
Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale
image recognition." arXiv preprint arXiv:1409.1556 (2014).
Redmon, Joseph, and Ali Farhadi. "YOLOv3: An incremental improvement." arXiv preprint
arXiv:1804.02767 (2018).
Author
Dong-yeon Shin received a B.S. in Electronic Engineering from Korea National University
of Transportation in 2022. Currently, he is pursuing an M.S. in Computer Engineering
from Kwangwoon University, South Korea. His research interests include multi-object
tracking and deep learning.
Seong-won Lee (Member, IEEE) received a B.Sc. and an M.Sc. in control and instrumentation
engineering from Seoul National University, South Korea, in 1988 and 1990, respectively,
and a Ph.D. in electrical engineering from the University of Southern California in
2003. From 1990 to 2004, he worked on VLSI/SoC design at Samsung Electronics Company
Ltd., South Korea. Since March 2005, he has been a Professor with the Department of
Computer Engineering, Kwangwoon University, Seoul, South Korea. His research interests
include image signal processing, signal processing SoC, edge AI systems, and computer
architectures.