1. Introduction
Visual object tracking is an essential task in computer vision, which aims to localize
the target object through consecutive frames of a given video. It has diverse applications,
including intelligent surveillance, autonomous driving, human-machine interaction,
and AR/VR. Considerable efforts have been made to design reliable trackers owing to
such a wide range of applications. However, complicated real-world environments,
e.g., deformations, occlusions, and fast motion, still make it difficult to resolve
the problem of visual object tracking reliably.
Recently, many studies have devised deep neural network-based trackers owing to the
success of deep learning. In particular, Siamese network architecture and its variants
have been popularly adopted for visual object tracking because they efficiently compute
the correlation between the target template and candidate regions in the latent space
while successfully handling shape variations and background clutter. Specifically,
SiamRPN [1] applied the region proposal network to the Siamese architecture for adopting local
constraints, which plays an important role in boosting the tracking accuracy. SiamRPN++
[2] further enabled the use of rich information encoded by the deep backbone network,
e.g., ResNet-50 [28]. Most recently, SiamAttn [3] developed self-attention and cross-attention modules to highlight target-relevant
features in the Siamese architecture efficiently. This model also attempted to refine
such re-calibrated features via bounding box and mask heads and showed significant
performance improvement. Although Siamese-based trackers have brought notable progress,
more robust features are still needed to resolve ambiguities arising from diverse real-world
scenarios.
The challenge of visual object tracking comes from a lack of semantic understanding
of the target object and its surroundings. The authors previously studied how to embed
such semantics in the latent space more accurately and found that feedback operations
are quite helpful in overcoming the challenges caused by drifting, target loss, and
confusion with target-like background clutter. This paper proposes a
simple yet powerful method for visual object tracking, called the Siamese feedback
network (SiamFB). The key idea of the proposed method is to exploit the feedback block
for progressively learning semantic features, which guide the model, in an iterative
manner, to be robust to a variety of variations, including background clutter. The
proposed feedback block is composed of attention and refinement modules. Specifically,
features encoded by the deep backbone network are re-calibrated in a manner similar
to that reported in [3], and these are then directly refined both channel-wise and spatially in the feedback
block. By feeding this high-level information back to the input of each feedback block,
the semantic relationship between the target object and the search region can be learned,
which leads to reliable tracking performance even in complex scenarios. Unlike SiamAttn
[3], the proposed method does not require an additional branch, e.g., branches for mask
learning. The main contributions of this paper can be summarized as follows:
· The authors propose to apply feedback operations to the Siamese architecture for
successfully learning the semantic relationship between the features of the target
object and the search region even under complicated situations.
· Instead of adopting additional branches for refining the extracted features, as has
been done widely in previous approaches, this paper proposes to conduct this process
implicitly within the feedback block by re-emphasizing the attentive features both
spatially and channel-wise.
· This paper highlights the effectiveness of the proposed method for visual object
tracking on various benchmark datasets. In addition, various ablation studies, as
well as the performance comparison with previous approaches, are also provided in
detail.
The remainder of this paper is organized as follows. The next section presents a brief
review of visual object tracking. The proposed method, i.e., the Siamese feedback
network, is explained in detail in Section 3. The experimental results and ablation
studies on the benchmark datasets are reported in Section 4. Section 5 presents the
conclusions.
2. Related Work
Since the problem of visual object tracking can be formulated as a matching problem
between the target template and the search region, computing correlation (i.e., similarity)
in an efficient way is considered an essential task. In this point of view, previous
methods for visual object tracking can be divided into two main groups: correlation
filter-based and neural network-based methods.
Correlation filter-based methods. As a pioneer, Bolme et al. [4] utilized multiple samples to estimate the correlation filter with the minimum output
sum of squared error. Their adaptive scheme worked well in situations where linearity
is guaranteed, e.g., without sudden changes in motion. However, such an assumption
rarely holds in real-world environments. To cope with this limitation,
Henriques et al. [5] proposed to adopt the kernel technique to consider the nonlinearity, and their work
operated very fast via the Fourier-based implementation. Owing to its high performance
and simplicity, many variants have been introduced in the vision community. For example,
Danelljan et al. [6] attempted to devise multi-resolution features with continuously learned filters to
improve the performance of the convolution operations. They further introduced a lightweight
scheme based on a factorized convolution operator, a compact generative model, and
a conservative update scheme [7]. Although correlation filter-based approaches have improved visual object tracking significantly,
they often cannot continuously capture the deformable nature of the target object satisfactorily
because of their limited representation ability.
Neural network-based methods. In this category, many studies have formulated the
problem of visual object tracking as a problem of discriminative detection. To resolve
this problem efficiently, Siamese-based trackers have attracted considerable attention
because of their high performance and speed. Specifically, Bertinetto et al. [8] adopted the Siamese network to calculate the score map indicating the position of
the target object in a deep architecture. Li et al. [1] proposed to apply the region proposal network [9], which has been widely employed for object detection and segmentation, to the Siamese-based
tracker to improve localization accuracy. Furthermore, they extended their scheme
to take advantage of the rich contextual information through the deep backbone network
[2]. On the other hand, Zhu et al. [10] attempted to alleviate the imbalanced distribution of training samples by learning
distractor-aware Siamese networks. Yu et al. [3] proposed a deformable Siamese attention network to adaptively exploit high-level features
in a class-agnostic manner while overcoming the limited receptive fields. For more
robust tracking against background clutter, Tan et al. [23] applied the non-local blocks [24], which consider the relationships among all positions of the features, to the
Siamese network. In addition to the classification and regression branches of the
typical Siamese-based tracker, Zhou et al. [25] introduced an anchor-free localization branch, which accurately estimates the target
center. Most recently, Jiang et al. [26] combined two networks trained separately, i.e., one for a mutual interaction between
the target and search branches of the Siamese network and the other for utilizing
various levels of information by the feature fusion technique. Although such Siamese-based
trackers have shown many possibilities, they still suffer from ambiguities arising in various
real-world environments.
3. Proposed Method
Inspired by the effectiveness of the feedback operation in generative tasks [11, 12, 27], this paper proposes exploiting the feedback block to progressively refine the features
of the target template and the search region. By feeding the high-level information
back to the input of the feedback block, the proposed method can encode more meaningful
features for the tracking task. This iterative process guides the network model to
learn the underlying properties of the corresponding input even under ambiguous cases,
which makes the correlation result in the Siamese-based network more accurate. Fig. 1 shows the overall architecture of the proposed method.
Fig. 1. Overall architecture of the proposed method (Siamese feedback network). Note that it consists of the proposed feedback blocks and Siamese region proposal network (SiamRPN).
3.1 Architecture Details
The proposed method, i.e., Siamese feedback network (SiamFB), utilizes a five-stage
ResNet-50 [28] as the backbone network. In particular, the features encoded by the last three stages
of ResNet-50 are fed into the proposed feedback block, composed of the attention and
refinement modules. The corresponding results are then input into three SiamRPN blocks
[2] to generate the response map via convolution operations. This map is finally fed
into the classification and bounding box heads to predict the candidate region containing
the target object, as shown in Fig. 1.
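To make the data flow concrete, a minimal PyTorch-style sketch of this pipeline is given below. The module names and the simple summation used to fuse the three stage-wise responses are placeholders for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SiamFBSketch(nn.Module):
    """Sketch of the overall data flow: backbone -> feedback blocks -> SiamRPN heads."""

    def __init__(self, backbone, feedback_blocks, rpn_heads):
        super().__init__()
        self.backbone = backbone                                # five-stage ResNet-50 feature extractor
        self.feedback_blocks = nn.ModuleList(feedback_blocks)   # one feedback block per used stage
        self.rpn_heads = nn.ModuleList(rpn_heads)               # three SiamRPN blocks [2]

    def forward(self, template, search):
        # Features from the last three backbone stages for both inputs (assumed interface).
        z_feats = self.backbone(template)[-3:]
        x_feats = self.backbone(search)[-3:]
        cls_outs, reg_outs = [], []
        for z, x, fb, rpn in zip(z_feats, x_feats, self.feedback_blocks, self.rpn_heads):
            z_ref, x_ref = fb(z, x)        # attention + refinement with feedback loops
            cls, reg = rpn(z_ref, x_ref)   # correlation and classification/regression heads
            cls_outs.append(cls)
            reg_outs.append(reg)
        # Stage-wise responses are fused here by a plain sum for simplicity.
        return sum(cls_outs), sum(reg_outs)
```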
Fig. 2 presents the detailed architecture of the proposed feedback block. For the attention
module, the principal components are arranged in a manner similar to that reported by
Yu et al. [3]. Spatial self-attention and channel self-attention are computed at each branch, while
cross-attention is calculated by utilizing the features from both the target and search
branches together. For further refinement, attentive features are adaptively re-calibrated
as follows: first of all, the sigmoid function is applied to attentive features to
extract the balancing factor based on its nonlinear response. Subsequently, the attentive
features are re-calibrated both spatially and channel-wise via the modules introduced
in [13] and [14], respectively, and the corresponding results are combined using the balancing factor
as follows:

$$F_{ref}^{T}=k^{T}\cdot F_{ch}^{T}+\left(1-k^{T}\right)\cdot F_{sp}^{T},$$
where $T$ is the index of feedback operations. $k^{T}$ denotes the balancing factor,
i.e., the output of the sigmoid function in Fig. 2. $F_{ch}^{T}$ and $F_{sp}^{T}$ denote the output features of the channel and spatial
calibration blocks with input $F_{att}^{T}$ (i.e., the attentive features in Fig. 2), respectively, and $F_{ref}^{T}$ denotes the refined output. Such re-calibrated features are combined adaptively because the
balancing factor is determined differently according to the input of the refinement
module (i.e., attentive features). The output of the refinement module returns to
the input of the feedback block, as shown in Fig. 2. Note that this fed-back feature is concatenated with the input feature from the corresponding
branch as follows:

$$F^{T+1}=C\left(F^{0},F_{ref}^{T}\right),$$
where the function $C\left(x_{1},x_{2}\right)$ denotes the concatenation of $x_{1}$
and $x_{2}$ along the channel direction, $F^{0}$ is the input feature from the corresponding
branch, and $F^{T+1}$ is the input to the next feedback operation. These final results are obtained separately
from branches for target and search region and then fed into the SiamRPN block [2] for the tracking task, as shown in Fig. 1.
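A minimal PyTorch sketch of the refinement module and the feedback loop described above is shown below. The channel and spatial calibration blocks follow the spirit of [14] and [13], the attention module is a placeholder (the actual block also computes cross-attention between the target and search branches), and the 1x1 convolution that folds the concatenated feedback back to the original channel size is an assumption for illustration.

```python
import torch
import torch.nn as nn

class RefinementSketch(nn.Module):
    """Refinement module: sigmoid balancing factor + channel/spatial re-calibration."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        # Channel calibration in the spirit of squeeze-and-excitation [14].
        self.channel_calib = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())
        # Spatial calibration in the spirit of the spatial attention in [13].
        self.spatial_calib = nn.Sequential(nn.Conv2d(channels, 1, 7, padding=3), nn.Sigmoid())

    def forward(self, f_att):
        k = torch.sigmoid(f_att)                   # balancing factor k^T
        f_ch = f_att * self.channel_calib(f_att)   # channel-wise re-calibration F_ch^T
        f_sp = f_att * self.spatial_calib(f_att)   # spatial re-calibration F_sp^T
        return k * f_ch + (1.0 - k) * f_sp         # adaptive combination F_ref^T

class FeedbackBlockSketch(nn.Module):
    """Feedback block seen from a single branch (cf. Fig. 2)."""

    def __init__(self, attention, channels, num_feedbacks=2):
        super().__init__()
        self.attention = attention                        # attention module (placeholder)
        self.refine = RefinementSketch(channels)
        self.fold = nn.Conv2d(2 * channels, channels, 1)  # fold concatenation back (assumption)
        self.T = num_feedbacks

    def forward(self, f_in):
        f = f_in
        for _ in range(self.T):
            f_att = self.attention(f)                         # attentive features F_att^T
            f_ref = self.refine(f_att)                        # refined features F_ref^T
            f = self.fold(torch.cat([f_in, f_ref], dim=1))    # feedback: concat with input feature
        return f
```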
3.2 Loss Function
The proposed network is trained based on the weighted sum of two loss functions, i.e.,
one for the classification and the other for the bounding box regression, as introduced
by Li et al. [2]. A negative log-likelihood loss [22] was adopted for the classification loss as follows:

$$\mathcal{L}_{cls}=-\sum_{i}\left[y_{i}\log\hat{y}_{i}+\left(1-y_{i}\right)\log\left(1-\hat{y}_{i}\right)\right],$$
where $y_{i}$ and $\hat{y}_{i}$ denote the label of the ground truth and the predicted
score estimated from the softmax function, respectively. Note that this classification
loss is computed for both positive and negative pairs. On the other hand, the bounding
box regression loss was calculated using a weighted $l_{1}$ loss as follows:

$$\mathcal{L}_{reg}=\frac{1}{n+\varepsilon}\sum_{\mathrm{anchors}}w\sum_{i=0}^{3}\left|g_{i}-\hat{g}_{i}\right|,$$
where $g_{i}$ is the ground truth for the parameters of the bounding box, i.e., $\left(x,y\right)$
positions of the center point $\left(i=0,1\right)$ and scales of the bounding box
$\left(i=2,3\right)$, and $\hat{g}_{i}$ is the estimated result of the corresponding
ones. $w$ denotes the weighting factor for the positive sample whose IoU value between
the anchor box and the corresponding ground truth is greater than 0.6. The anchor
box is treated as a negative sample if the IoU value is smaller than 0.3, and other
cases are discarded; for these negative and discarded anchors, $w$ is set to zero. $n$ is the number of positive anchor
boxes, and $\varepsilon $ is a very small value ($1\times 10^{-6}$ in this implementation)
to avoid zero division. By using the weighted sum of these loss functions, the proposed network
effectively learns to track the target objects as follows:

$$\mathcal{L}=\mathcal{L}_{cls}+\lambda\,\mathcal{L}_{reg},$$

where the balancing factor $\lambda =1.2$ was used in the present work, which was
set through extensive experiments.
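A rough PyTorch sketch of this objective is given below. The anchor layout, label encoding, and tensor shapes are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def tracking_loss(cls_score, cls_label, reg_pred, reg_target, pos_mask, lam=1.2, eps=1e-6):
    """Classification (negative log-likelihood) plus weighted l1 regression loss.

    cls_score:  (N, 2) classification logits per anchor.
    cls_label:  (N,) long tensor with 1 for positive and 0 for negative anchors.
    reg_pred, reg_target: (N, 4) bounding-box parameters (center position and scales).
    pos_mask:   (N,) float tensor, 1 for positive anchors (IoU > 0.6), 0 otherwise.
    """
    loss_cls = F.cross_entropy(cls_score, cls_label)   # negative log-likelihood over pairs
    n_pos = pos_mask.sum()                              # number of positive anchors n
    loss_reg = (pos_mask * (reg_pred - reg_target).abs().sum(dim=1)).sum() / (n_pos + eps)
    return loss_cls + lam * loss_reg                    # lambda = 1.2
```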
Fig. 2. Detailed architecture of the proposed feedback block, which consists of attention module and refinement module. Note that this figure illustrates the process of feedback operation from the viewpoint of the target branch shown in Fig. 1.
4. Experimental Results
This section reports the experimental results on benchmark datasets for visual
object tracking. The performance comparison with state-of-the-art methods is also
provided in detail.
4.1 Training
All the parameters for the proposed architecture were tuned using the stochastic gradient
descent (SGD) method for 20 epochs. The momentum and the weight decay factor were set
to $9\times 10^{-1}$ and $1\times 10^{-4}$, respectively. The learning rate was increased
linearly from $4\times 10^{-4}$ to $2\times 10^{-3}$ for the first five epochs to
warm up, and decreased exponentially from $2\times 10^{-3}$ to $2\times 10^{-4}$ for
the last 15 epochs. The backbone network was trained only for the last 10 epochs. Five
representative benchmarks, i.e., the MS COCO2017 [15], ImageNet-VID, ImageNet-DET [16], YouTube-VOS [17], and YouTube-BoundingBoxes [18] datasets, were used for training. Specifically, 117,266 images from the training
set of MS COCO2017 [15] were collected for learning the proposed model. This study also used 3,862 and 333,474
images of ImageNet-VID and ImageNet-DET [16], which are datasets for object detection in videos and images for the 2015 ILSVRC
competition, respectively. Moreover, 3,000 and 175,495 images were obtained from YouTube-VOS
[17] and YouTube-BoundingBoxes [18], which were constructed based on the high-resolution YouTube videos. Note that the
images from ImageNet-VID and YouTube-VOS [17] were duplicated during training because of their relatively small numbers. Each training
batch was composed of eight samples selected randomly from these five benchmark datasets.
The resolutions of the target template and the search region were $127\times 127$
pixels and $255\times 255$ pixels, respectively. Data augmentation techniques, e.g.,
shift and color transformation, were used to alleviate the overfitting problem. The
proposed method was implemented on the PyTorch framework, with four NVIDIA GTX Titan
Xp GPUs.
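For reference, the optimization schedule described above can be sketched in PyTorch as follows; the helper function and the stand-in module are illustrative only.

```python
import torch
import torch.nn as nn

def lr_at_epoch(epoch, warmup=5, total=20, lr_start=4e-4, lr_peak=2e-3, lr_end=2e-4):
    """Learning rate for a 0-indexed epoch: linear warm-up, then exponential decay."""
    if epoch < warmup:
        return lr_start + (lr_peak - lr_start) * epoch / (warmup - 1)
    t = (epoch - warmup) / (total - warmup - 1)
    return lr_peak * (lr_end / lr_peak) ** t

model = nn.Conv2d(3, 3, 3)  # stand-in for the tracker parameters
optimizer = torch.optim.SGD(model.parameters(), lr=lr_at_epoch(0),
                            momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda e: lr_at_epoch(e) / lr_at_epoch(0))
# scheduler.step() is called once per epoch so the learning rate follows lr_at_epoch.
```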
4.2 Performance Evaluations
The performance of the trackers was evaluated based on two benchmarks, i.e., VOT2016
[19] and VOT2018 [20] datasets, which have been used most widely for visual object tracking. VOT2016 [19] and VOT2018 [20] each consist of 60 video sequences taken under various scenarios, e.g., sports games,
roads, and animals. For quantitative evaluation, the methods were tested in terms
of accuracy (A), robustness (R), and expected average overlap (EAO). The accuracy
(A) measures the average overlap between the estimated bounding box and the corresponding
ground truth during successful tracking, whereas the robustness (R) counts the number
of tracking failures that require the target region to be reset. In addition,
the expected average overlap (EAO) estimates the average overlap achieved while the model
follows the target object without failure, reflecting both accuracy and robustness.
To help understand the accuracy (A) metric, Fig. 3 presents the overlap values, which range between 0 and 1, for several
frames. The average of these overlap values in all video sequences determines the
accuracy (A). As shown in the first column of Fig. 3, the overlap values are close to 1 when the model tracks the target object successfully.
On the other hand, the overlaps have small values when the model fails to estimate
the target region accurately, as shown in the second column of Fig. 3.
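For reference, the overlap underlying the accuracy (A) metric can be computed as in the simple sketch below, assuming axis-aligned boxes in (x1, y1, x2, y2) form; the VOT ground truth is annotated with rotated rectangles, so this is only an illustrative simplification.

```python
def overlap(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))  # intersection width
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))  # intersection height
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

# Example: overlap((0, 0, 10, 10), (5, 0, 15, 10)) == 1/3.
```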
Fig. 3. Some examples that show the overlap values of the bounding box estimated by SiamFB and the corresponding ground truth. Note that the overlap values are represented at the right bottom of each image.
The efficiency and robustness of the proposed method were assessed by comparing the
present results with those of eleven representative methods, i.e., two correlation filter-based methods
(C-COT [6] and ECO [7]) and nine neural network-based methods (SiamFC [8], SiamRPN [1], DaSiamRPN [10], SiamRPN++ [2], SiamMask [21], SiamAttn [3], Nocal-Siam [23], SiamCAN [25], and M-F-Siam [26]), for visual object tracking. Table 1 shows the corresponding results. The proposed tracker achieved 0.63 (0.63) accuracy,
0.16 (0.23) robustness, and 0.558 (0.482) expected average overlap on the VOT2016
(VOT2018) datasets, respectively. In particular, SiamFB shows a clear relative gain
of 6.3% in the EAO metric over SiamAttn [3] on the VOT2016 dataset, demonstrating that the proposed method can provide reliable
performance even without mask-based guidance. Furthermore, the proposed tracker achieved
the top performance in terms of the EAO metric, outperforming the state-of-the-art
methods, while showing accuracy and robustness competitive with previous methods
on both the VOT2016 and VOT2018 datasets. These results suggest that it is possible
to perform successful visual object tracking using only the simple feedback operations
and the refinement module in the proposed feedback block, without other complicated
techniques.
The processing speed in terms of frames per second (FPS) was measured on a single
NVIDIA GeForce RTX 2080 Ti GPU to analyze the real-time potential of
the proposed tracker. Table 2 lists the average speeds of the proposed method and other representative methods.
SiamFB achieves 68 FPS, which is sufficient for real-time applications. Moreover,
the proposed method runs faster than most of the recent Siamese network-based methods. These results
show that the proposed method has a sufficient ability to perform real-time tracking.
Table 1. Performance on VOT2016 and VOT2018
| Trackers | VOT2016 A ↑ | VOT2016 R ↓ | VOT2016 EAO ↑ | VOT2018 A ↑ | VOT2018 R ↓ | VOT2018 EAO ↑ |
| C-COT [6] | 0.54 | 0.24 | 0.331 | 0.49 | 0.32 | 0.267 |
| ECO [7] | 0.55 | 0.20 | 0.375 | 0.48 | 0.28 | 0.276 |
| SiamFC [8] | 0.53 | 0.46 | 0.235 | 0.50 | 0.59 | 0.188 |
| SiamRPN [1] | 0.56 | 0.26 | 0.344 | - | - | - |
| DaSiamRPN [10] | 0.61 | 0.22 | 0.411 | 0.59 | 0.28 | 0.383 |
| SiamRPN++ [2] | 0.64 | 0.20 | 0.464 | 0.60 | 0.23 | 0.415 |
| SiamMask [21] | 0.67 | 0.23 | 0.442 | 0.64 | 0.30 | 0.387 |
| SiamAttn [3] | 0.68 | 0.15 | 0.525 | 0.63 | 0.16 | 0.470 |
| Nocal-Siam [23] | 0.62 | 0.09 | 0.554 | 0.59 | 0.16 | 0.474 |
| SiamCAN [25] | 0.64 | 0.15 | 0.513 | 0.61 | 0.18 | 0.462 |
| M-F-Siam [26] | 0.58 | 0.27 | 0.335 | - | - | - |
| SiamFB (Ours) | 0.63 | 0.16 | 0.558 | 0.63 | 0.23 | 0.482 |
Table 2. Comparison of the speed on the VOT dataset
| Trackers | Processing speed (FPS) |
| C-COT [6] | 0.3 |
| ECO [7] | 6 |
| SiamFC [8] | 86 |
| SiamRPN [1] | 160 |
| DaSiamRPN [10] | 160 |
| SiamRPN++ [2] | 35 |
| SiamMask [21] | 55 |
| SiamAttn [3] | 33 |
| Nocal-Siam [23] | 38 |
| SiamCAN [25] | 45 |
| M-F-Siam [26] | 43 |
| SiamFB (Ours) | 68 |
* Note that the speeds of the previous methods are those reported in the environment of
each previous work.
Fig. 4 presents several results of visual object tracking on the VOT2016 dataset. The top
row in Fig. 4 is the first frame of each sequence, in which the target template is initialized,
and the others show the tracking results by the representative trackers. The proposed
method successfully chased the target object even under various ambiguities, e.g.,
occlusions and scale changes, whereas previous methods often show drifting results,
as shown in the leftmost two columns of Fig. 4. Because the Siamese-based trackers rely on the correlation result, target-like background
clutter might cause confusion, leading to a performance drop. This problem was overcome
by refining target-relevant features more precisely via the feedback block (see the
third and fourth columns of Fig. 4). Furthermore, as shown in the fifth and sixth columns of Fig. 4, SiamFB quickly re-acquires the target object after the occlusion while other models
still struggle to find the target. Fig. 5 shows the tracking results for a video acquired in an outdoor environment. The
proposed method was robust to various obstacles. Despite the occlusions caused by
trees or bollards and background clutter (e.g., bicycles and kickboards), the tracker
successfully chased the target object through consecutive frames. The proposed method
was also robust to changes in scale and appearance. Therefore, the Siamese feedback
network is effective for visual object tracking even in complicated real-world environments.
Fig. 4. Some examples of visual object tracking on the VOT2016 dataset. Note that the results by the proposed method and other three previous methods are represented with different colors.
Fig. 5. Results of the proposed method for our outdoor video. The frames were sampled every three seconds from the video sequence. In particular, the second and the fourth examples of the bottom row effectively show the robustness of the proposed method in the complex scene.
Fig. 6. Some examples of the failure cases by the proposed method on the VOT2016 dataset. Note that these results were obtained under low-light conditions.
4.3 Ablation Studies
Several comparative experiments were conducted to verify the effectiveness of the
proposed method. First, Table 3 shows the effects of using the refinement module and the feedback loop. The baseline
model indicates the architecture without the refinement module and the feedback loop
in Fig. 2. As shown in Table 3, re-calibrating the features in a spatial and channel-wise manner helps improve
the tracking performance. In particular, the performance was improved further when
a combination of spatial and channel calibration was used, compared to applying only
one of them to the baseline model. Furthermore, the feedback operations also boosted
the tracking performance. By refining the re-calibrated features via the feedback
loop, the network could learn more meaningful information about the target object
and the search region. Consequently, the proposed method achieved 0.558 (0.482) in
EAO, which is 9.8% (4.9%) higher than the performance of the baseline model on the
VOT2016 (VOT2018) datasets (see the comparison between the top and bottom rows of
Table 3). Therefore, the feedback block has a significant impact on the performance improvement
in visual object tracking.
Moreover, the tracking performance was analyzed according to the number of feedback
loops, as shown in Table 4. Note that these experiments were conducted with architectures containing the refinement
module. As shown in Table 4, the EAO value was improved greatly as the number of feedback loops was increased
from $T=0$ to $T=2$, whereas the performance decreased when $T=3$ on both the VOT2016
and VOT2018 datasets. The feedback operations effectively improved the performance,
but too many iterations probably caused overfitting of the model. In addition, the
robustness value increased slightly (i.e., the performance dropped slightly) when $T=1$ and
$T=2$ on the VOT2018 dataset. This is because information that is refined slightly incorrectly
in the first step of the feedback block can accumulate improperly during the additional
feedback loops and introduce errors into the target localization. Based on these results,
$T=2$ was adopted for the proposed method, which still outperformed the previous methods.
Table 3. Performance analysis of the proposed method according to changes in the network architectures.
| Methods | VOT2016 A ↑ | VOT2016 R ↓ | VOT2016 EAO ↑ | VOT2016 ∆EAO | VOT2018 A ↑ | VOT2018 R ↓ | VOT2018 EAO ↑ | VOT2018 ∆EAO |
| Baseline | 0.62 | 0.23 | 0.460 | - | 0.61 | 0.20 | 0.433 | - |
| Baseline+SC | 0.65 | 0.20 | 0.497 | +3.7% | 0.61 | 0.19 | 0.448 | +1.5% |
| Baseline+CC | 0.66 | 0.19 | 0.511 | +5.1% | 0.62 | 0.19 | 0.451 | +1.8% |
| Baseline+SC+CC | 0.62 | 0.17 | 0.515 | +5.5% | 0.61 | 0.18 | 0.457 | +2.4% |
| Baseline+SC+CC+Feedback (Ours) | 0.63 | 0.16 | 0.558 | +9.8% | 0.63 | 0.23 | 0.482 | +4.9% |
* Note that SC and CC denote spatial and channel calibration in the refinement module,
respectively.
Table 4. Performance analysis of the proposed method according to the number of feedback operations.
| Number of feedbacks | VOT2016 A ↑ | VOT2016 R ↓ | VOT2016 EAO ↑ | VOT2018 A ↑ | VOT2018 R ↓ | VOT2018 EAO ↑ |
| T=0 | 0.62 | 0.17 | 0.515 | 0.61 | 0.18 | 0.457 |
| T=1 | 0.63 | 0.15 | 0.532 | 0.63 | 0.22 | 0.471 |
| T=2 | 0.63 | 0.16 | 0.558 | 0.63 | 0.23 | 0.482 |
| T=3 | 0.62 | 0.16 | 0.523 | 0.61 | 0.19 | 0.443 |
4.4 Discussion and Future Work
Based on various experimental results shown in previous subsections, the advantages
and disadvantages of the proposed method can be summarized as follows:
Strong points: SiamFB successfully overcame various ambiguities in complicated
real-world environments, as demonstrated by qualitative and quantitative performance
that outperformed previous methods. The proposed method also satisfied the real-time
speed requirement. These advantages were achieved with a simple feedback process that does
not require additional parameters to learn.
Weak points: The proposed method often suffered from failure cases, as shown in Fig. 6. Specifically, unclear boundaries and blurred textures confuse the tracker under
low-light conditions; thus, it fails to track the target object consistently.
To compensate for the shortcomings mentioned above, a low-light image enhancement
process will be applied to SiamFB for more accurate tracking. Adding brightness
reduction to the data augmentation may also help make the model robust to low-light
conditions. Finally, practical studies on implementation on embedded platforms will
also be considered.
5. Conclusion
A novel method for visual object tracking was proposed. The feedback operation was
adopted with a simple refinement module to extract target-relevant features more accurately.
This iterative process efficiently guides the model to learn the target appearance
and its surroundings even under ambiguous conditions. Based on various experimental results, the
advantages and properties of the proposed method were analyzed in detail, confirming that it
can be applied effectively to the problem of visual object tracking.
ACKNOWLEDGMENTS
This work was supported by the Institute of Information & Communications Technology Planning
& Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2018-0-00207,
Immersive Media Research Laboratory).
REFERENCES
Li B., Yan J., Wu W., Zhu Z., Hu X., Jun. 2018, High performance visual tracking with
Siamese region proposal network, in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit.,
pp. 8971-8980
Li B., Wu W., Wang Q., Zhang F., Xing J., Yan J., Jun. 2019, SiamRPN++: evolution
of Siamese visual tracking with very deep networks, in Proc. IEEE Int. Conf. Comput.
Vis. Pattern Recognit., pp. 4282-4291
Yu Y., Xiong Y., Huang W., Scott M. R., Jun. 2020, Deformable Siamese attention networks
for visual object tracking, in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit.,
pp. 6728-6737
Bolme D. S., Beveridge J. R., Draper B. A., Lui Y. M., Jun. 2010, Visual object tracking
using adaptive correlation filter, in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit.,
pp. 2544-2550
Henriques J. F., Caseiro R., Martins P., Batista J., Mar. 2015, High-speed tracking
with kernelized correlation filters, IEEE Trans. Pattern Anal. Mach. Intell., Vol.
37, No. 5, pp. 583-596
Danelljan M., Robinson A., Khan F. S., Felsberg M., Oct. 2016, Beyond correlation
filters: learning continuous convolution operators for visual tracking, in Proc. Eur.
Conf. Comput. Vis., pp. 1-16
Danelljan M., Bhat G., Khan F. S., Felsberg M., Jun. 2017, ECO: efficient convolution
operators for tracking, in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., pp.
6638-6646
Bertinetto L., Valmadre J., Henriques J. F., Vedaldi A., Torr P. H. S., Nov. 2016,
Fully-convolutional Siamese networks for object tracking, in Proc. Eur. Conf. Comput.
Vis., pp. 850-865
Ren S., He K., Girshick R., Sun J., Jun. 2017, Faster R-CNN: towards real-time object
detection with region proposal networks, IEEE Trans. Pattern Anal. Mach. Intell.,
Vol. 39, No. 6, pp. 1137-1149
Zhu Z., Wang Q., Li B., Wu W., Yan J., Hu W., Sep. 2018, Distractor-aware Siamese
networks for visual object tracking, in Proc. Eur. Conf. Comput. Vis., pp. 101-117
Li Z., Yang J., Liu Z., Yang X., Jeon G., Wu W., Jun. 2019, Feedback network for image
super-resolution, in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., pp. 3862-3871
Kim J., Kim W., Dec. 2020, Attentive feedback feature pyramid network for shadow detection,
IEEE Signal Process. Lett., Vol. 27, pp. 1964-1968
Zhao T., Wu X., Jun. 2019, Pyramid feature attention network for saliency detection,
in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., pp. 3085-3094
Hu J., Shen L., Sun G., Jun. 2018, Squeeze-and-excitation networks, in Proc. IEEE
Int. Conf. Comput. Vis. Pattern Recognit., pp. 7131-7141
Lin T-Y., Maire M., Belongie S., Bourdev L., Girshick R., Hays J., Perona P., Ramanan
D., Zitnick C. L., Dollar P., Sep. 2014, Microsoft COCO: common objects in context,
in Proc. Eur. Conf. Comput. Vis., pp. 740-755
Russakovsky O., Deng J., Su H., Krause J., Satheesh S., Ma S., Huang Z., Karpathy
A., Khosla A., Bernstein M., Berg A. C., Fei-Fei L., 2015, ImageNet large scale visual
recognition challenge, Int. J. Comput. Vis., Vol. 115, No. 3, pp. 211-252
Xu N., Yang L., Fan Y., Yang J., Yue D., Liang Y., Price B., Cohen S., Huang T., Sep.
2018, YouTube-VOS: sequence-to-sequence video object segmentation, in Proc. Eur. Conf.
Comput. Vis., pp. 585-601
Real E., Shlens J., Mazzocchi S., Pan X., Vanhoucke V., Jul. 2017, YouTube-BoundingBoxes:
a large high-precision human-annotated data set for object detection in video, in
Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., pp. 5296-5305
Kristan M., et al. , Oct. 2016, The visual object tracking with VOT2016 challenge
results, in Proc. Eur. Conf. Comput. Vis.
Kristan M., et al. , Sep. 2018, The sixth visual object tracking VOT2018 challenge
results, in Proc. Eur. Conf. Comput. Vis.
Wang Q., Zhang L., Bertinetto L., Hu W., Torr P. H. S., Jun. 2019, Fast online object
tracking and segmentation: a unifying approach, in Proc. IEEE Int. Conf. Comput. Vis.
Pattern Recognit., pp. 1328-1338
Yao H., Zhu D-L., Jiang B., Yu P., Oct. 2019, Negative log likelihood ratio loss for
deep neural network classification, in Proc. Future Tech. Conf., pp. 276-282
Tan H., Zhang X., Zhang Z., Lan L., Zhang W., Luo Z., 2021, Nocal-Siam: Refining visual
features and response with advanced non-local blocks for real-time Siamese tracking,
IEEE Trans. Image Process., Vol. 30, pp. 2656-2668
Wang X., Girshick R., Gupta A., He K., Jun. 2018, Non-local neural networks, in Proc.
IEEE Int. Conf. Comput. Vis. Pattern Recognit., pp. 7794-7803
Zhou W., Wen L., Zhang L., Du D., Luo T., Wu Y., 2021, SiamCAN: Real-time visual tracking
based on Siamese center-aware network, IEEE Trans. Image Process., Vol. 30, pp. 3597-3609
Jiang M., Zhao Y., Kong J., Aug. 2021, Mutual learning and feature fusion Siamese
networks for visual object tracking, IEEE Trans. Circuits Syst. Video Technol., Vol.
31, No. 8, pp. 3154-3167
Li Q., Li Z., Lu L., Jeon G., Liu K., Yang X., Sep. 2019, Gated multiple feedback
network for image super-resolution, in Proc. Brit. Mach. Vis. Conf., pp. 1-12
He K., Zhang X., Ren S., Sun J., Jun. 2016, Deep residual learning for image recognition,
in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., pp. 770-778
Author
Mi-Gyeong Gwon is currently pursuing a B.S. degree with the Department of Electrical
and Electronics Engineering, Konkuk University, Seoul, South Korea. Her current research
interests include object detection and tracking, scene understanding, image enhancement,
and colorization.
Jinhee Kim received his B.S. degree in the Department of Electrical and Electronics
Engineering and a M.S. degree in Electronic, Information and Communication Engineering
from Konkuk University, Seoul, South Korea, in 2020 and 2021, respectively. He is
currently working at Hyundai Motor Company. His research interests include computer
vision, object detection and tracking, instance segmentation, and image enhancement.
This work was done when he was at Konkuk University.
Gi-Mun Um received his B.S, M.S., and Ph.D. degrees in electronic engineering from
Sogang University, Seoul, Rep. of Korea, in 1991, 1993, and 1998, respectively. Since
1998, he has worked for Electronics and Telecommunications Research Institute, Daejeon,
Rep. of Korea, and he is currently with the Realistic Media Research Section. He has
worked as a visiting research scientist at Communications Research Center Canada from
2001 to 2002. He participated in “F. IoT-ASM (F.747.8): requirements and reference
architecture for audience-selectable media service framework in the IoT environment”
as an editor of the ITU-T SG16 from 2014 to 2015. He is now working on 360VR, Light
Field Video, AR/VR/XR, and network-based media processing. His main research interests
include computer vision and multi-view/3D/AR video.
HeeKyung Lee received her B.S. degree in computer engineering from Yeungnam University,
Daegu, Rep. of Korea, in 1999, and her M.S. degree in engineering from the Information
and Communication University, Daejeon, Rep. of Korea, in 2002. Since 2002, she has
worked for Electronics and Telecommunications Research Institute Daejeon, Rep. of
Korea, where she is now serving as a senior member of the engineering staff. She participated
in “TV-Anytime” standardization and IPTV Metadata standardization. She was also involved
in the development of gaze tracking technology. Currently, she is working on 360VR,
AR, and MR. Her research interests include personalized service via metadata, HCI,
Gaze Tracking, Bi-directional advertisement and video content analysis, and VR/AR/MR.
Jeongil Seo was born in Goryoung, Korea, in 1971. He received his Ph.D. degree
in electronics from Kyoung-pook National University (KNU), Daegu, Korea, in 2005 for
his work on audio signal processing systems. He worked as a member of the engineering
staff at the Laboratory of Semiconductor, LG-semicon, Cheongju, Korea, from 1998 until
2000. He has worked as a director at the Immersive Media Research Section, Electronics
and Telecommunications Research Institute (ETRI), Daejeon, Korea, since 2000. His
research activities include image and video processing, audio processing, and realistic
broadcasting and media service systems.
Seong Yong Lim received his B.S. and M.S. degrees in electrical engineering from the Korea
Advanced Institute of Science and Technology (KAIST) in 1999 and 2011, respectively.
His research interests include a wide range of field-of-view applications, real-time
video processing, and network-based inference.
Seung-Jun Yang received his B.S. and M.S. degree in computer science from Suncheon
National University and Chonnam National University in 1999 and 2001, respectively.
Since 2001, he has been a principal researcher in the media research division of ETRI,
where he has developed advanced digital television technology, such as data broadcasting,
personalized broadcasting, emotional broadcasting, assistive broadcasting for the
disabled, and ultra-wide vision technology. He is currently working on research
into fundamental media and content technologies for hyper-realistic media space.
Wonjun Kim received his B.S. degree from the Department of Electronic Engineering,
Sogang University, Seoul, South Korea, in 2006, M.S. degree from the Department of
Information and Communications, Korea Advanced Institute of Science and Technology
(KAIST), Daejeon, South Korea, in 2008, and Ph.D. degree from the Department of Electrical
Engineering, KAIST, in 2012. From September 2012 to February 2016, he was a Research
Staff Member of the Samsung Advanced Institute of Technology (SAIT), South Korea.
Since March 2016, he has been with the Department of Electrical and Electronics Engineering,
Konkuk University, Seoul, where he is currently an Associate Professor. His research
interests include image and video understanding, computer vision, pattern recognition,
and biometrics, with an emphasis on background subtraction, saliency detection, and face and
action recognition. He has served as a regular reviewer for over 30 international
journals, including the IEEE Transactions on Image Processing, IEEE Transactions
on Circuits and Systems for Video Technology, IEEE Transactions on Multimedia, IEEE
Transactions on Cybernetics, IEEE Access, IEEE Signal Processing Letters, and so
on.