
IEIESPC(IEIE Transactions on Smart Processing and Computing)

IEIESPC Vol. 11, No. 1, p.24-33

ISSN (print) :

2287-5255

Received: 20 October 2021; Revised: 11 November 2021; Accepted: 17 November 2021

DOI :

https://doi.org/10.5573/IEIESPC.2021.11.1.24

Regular Paper

Siamese Feedback Network for Visual Object Tracking

Mi-Gyeong Gwon¹, Jinhee Kim¹, Gi-Mun Um², HeeKyung Lee², Jeongil Seo², Seong Yong Lim², Seung-Jun Yang², and Wonjun Kim¹*

¹ (Department of Electrical and Electronics Engineering, Konkuk University, Seoul 05029, Korea {kmk3942, tyt8131, wonjkim}@konkuk.ac.kr )
² (Immersive Media Research Section, Electronics and Telecommunications Research Institute, Daejeon 34129, Korea {gmum, lhk95, seoji, seylim, sjyang}@etri.re.kr )

^* Corresponding Author: Wonjun Kim

License :

This is an Open-Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.(www.theieie.org).

Abstract

Visual object tracking, one of the main topics in computer vision, aims to chase a target object in every frame of a video sequence. In particular, Siamese-based network architectures have been adopted widely for visual object tracking due to their correlation-based nature. On the other hand, the features encoded from the target template and the search image in Siamese branches still suffer from ambiguities, which are driven by complicated real-world environments, e.g., occlusions and rotations. This paper proposes the Siamese feedback network for robust object tracking. The key idea of the proposed method is to encode target-relevant features accurately via the feedback block, which is defined by a combination of attention and refinement modules. Specifically, interdependent features are extracted through self- and cross-attention operations. Subsequently, such re-calibrated features are refined both spatially and channel-wise. The refined features are then fed back to the input of the feedback block via the feedback loop. This is desirable because the high-level semantic information guides the feedback block to learn more meaningful properties of the target object and its surroundings. The experimental results show that the proposed method outperforms the state-of-the-art Siamese-based methods with relative gains of 0.72% and 1.69% in the expected average overlap on the VOT2016 and VOT2018 datasets, respectively. Overall, the proposed method is effective for visual object tracking, even in complicated real-world scenarios.

Keywords

Visual object tracking, Siamese feedback network, Target-relevant features

1. Introduction

Visual object tracking is an essential task in computer vision, which aims to localize the target object through consecutive frames of a given video. It has diverse applications, including intelligent surveillance, autonomous driving, human-machine interaction, and AR/VR. Considerable efforts have been made to design reliable trackers owing to such a wide range of applications. On the other hand, complicated real-world environments, e.g., deformations, occlusions, and fast motions, still impose great difficulties in resolving the problem of visual object tracking.

Recently, many studies have devised deep neural network-based trackers owing to the success of deep learning. In particular, the Siamese network architecture and its variants have been popularly adopted for visual object tracking because they efficiently compute the correlation between the target object and candidate frames in the latent space while successfully considering shape variations and background clutter. Specifically, SiamRPN ^[1] applied the region proposal network to the Siamese architecture to adopt local constraints, which play an important role in boosting the tracking accuracy. SiamRPN++ ^[2] further enabled the use of rich information encoded by a deep backbone network, e.g., ResNet-50 ^[28]. Most recently, SiamAttn ^[3] developed self-attention and cross-attention modules to highlight target-relevant features in the Siamese architecture efficiently. This model also attempted to refine such re-calibrated features via bounding box and mask heads and showed significant performance improvement. Although Siamese-based trackers have brought notable progress, more robust features are still needed to handle the ambiguities that arise in diverse real-world scenarios.

The challenge of visual object tracking comes from a lack of semantic understanding of the target object and its surroundings. The authors previously studied how to embed such semantics in the latent space more accurately and found that feedback operations are quite helpful in overcoming the challenges posed by drifting, missing the target, and confusion with target-like background clutter. This paper proposes a simple yet powerful method for visual object tracking, called the Siamese feedback network (SiamFB). The key idea of the proposed method is to exploit the feedback block to learn semantic features progressively, in an iterative manner, so that the model becomes robust to a wide range of variations, including background clutter. The proposed feedback block is composed of attention and refinement modules. Specifically, features encoded from the deep backbone network are re-calibrated in a manner similar to that of SiamAttn ^[3], and they are directly refined both channel-wise and spatially in the feedback block. By feeding this high-level information back to the input of each feedback block, the semantic relationship between the target object and the search region can be learned, which leads to reliable tracking performance even in complex scenarios. Unlike SiamAttn ^[3], the proposed method does not require an additional branch, e.g., branches for mask learning. The main contributions of this paper can be summarized as follows:

· The authors propose to apply feedback operations to the Siamese architecture for successfully learning the semantic relationship between the features of the target object and the search region even under complicated situations.

· Instead of adopting additional branches for refining extracted features, which has been used widely in previous approaches, this paper proposes to conduct this process implicitly within the feedback block by emphasizing the attentive features again both spatially and channel-wisely.

· This paper highlights the effectiveness of the proposed method for visual object tracking on various benchmark datasets. In addition, various ablation studies, as well as the performance comparison with previous approaches, are also provided in detail.

The remainder of this paper is organized as follows. The next section presents a brief review of visual object tracking. The proposed method, i.e., the Siamese feedback network, is explained in detail in Section 3. The experimental results and ablation studies on the benchmark datasets are reported in Section 4. Section 5 reports the conclusions.

2. Related Work

Since the problem of visual object tracking can be formulated as a matching problem between the target template and the search region, computing correlation (i.e., similarity) in an efficient way is considered an essential task. In this point of view, previous methods for visual object tracking can be divided into two main groups: correlation filter-based and neural network-based methods.

Correlation filter-based methods. As a pioneer, Bolme et al. ^[4] utilized multiple samples to estimate the correlation filter with the minimum output sum of squared error. Their adaptive scheme worked well in situations that guarantee linearity, e.g., without sudden changes in motion. On the other hand, such an assumption rarely holds in real-world environments. To cope with this limitation, Henriques et al. ^[5] proposed to adopt the kernel technique to consider the nonlinearity, and their method operated very fast via the Fourier-based implementation. Owing to its high performance and simplicity, many variants have been introduced in the vision community. For example, Danelljan et al. ^[6] attempted to devise multi-resolution features with continuously learned filters to improve the performance of the convolution operations. They further introduced a lightweight scheme based on a factorized convolution operator, a compact generative model, and a conservative update scheme ^[7]. Although correlation filter-based approaches improve visual object tracking significantly, they often cannot grasp the deformable nature of the target object satisfactorily because of their limited representation ability.

Neural network-based methods. In this category, many studies have formulated the problem of visual object tracking as a problem of discriminative detection. To resolve this problem efficiently, Siamese-based trackers have attracted considerable attention because of their high performance and speed. Specifically, Bertinetto et al. ^[8] adopted the Siamese network to calculate the score map indicating the position of the target object in a deep architecture. Li et al. ^[1] proposed to apply the region proposal network ^[9], which has been widely employed for object detection and segmentation, to the Siamese-based tracker to improve localization accuracy. Furthermore, they extended their scheme to take advantage of the rich contextual information through the deep backbone network ^[2]. On the other hand, Zhu et al. ^[10] attempted to alleviate the imbalanced distribution of training samples by learning distractor-aware Siamese networks. Yu et al. ^[3] proposed a deformable Siamese attention network to adaptively allow high-level features in a class-agnostic manner while overcoming the limited receptive fields. For more robust tracking against background clutter, Tan et al. ^[23] applied the non-local blocks ^[24], which consider the relationship between all positions of the features, to the Siamese network. In addition to the classification and regression branches of the typical Siamese-based tracker, Zhou et al. ^[25] introduced an anchor-free localization branch, which accurately estimates the target center. Most recently, Jiang et al. ^[26] combined two networks trained separately, i.e., one for a mutual interaction between the target and search branches of the Siamese network and the other for utilizing various levels of information by the feature fusion technique. Although such Siamese-based trackers have shown many possibilities, they still suffer from ambiguities from various real-world environments.
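The score map mentioned above can be illustrated with a minimal sketch of template-to-search cross-correlation. This is only a toy, single-channel version written in plain Python; the function names `cross_correlate` and `argmax_2d` and the example arrays are assumptions made for illustration and do not come from any of the cited trackers, which operate on deep multi-channel features.

```python
def cross_correlate(search, template):
    """Slide `template` over `search` and return the map of inner products.

    Both inputs are 2-D lists (single-channel feature maps); real trackers
    use multi-channel deep features instead of raw values.
    """
    sh, sw = len(search), len(search[0])
    th, tw = len(template), len(template[0])
    response = []
    for top in range(sh - th + 1):
        row = []
        for left in range(sw - tw + 1):
            score = sum(
                search[top + i][left + j] * template[i][j]
                for i in range(th)
                for j in range(tw)
            )
            row.append(score)
        response.append(row)
    return response


def argmax_2d(response):
    """Return the (row, col) position of the largest response value."""
    best = max(
        ((value, r, c) for r, row in enumerate(response) for c, value in enumerate(row)),
        key=lambda item: item[0],
    )
    return best[1], best[2]


# Hypothetical toy data: the template pattern is embedded in the search map
# with its top-left corner at row 1, column 2.
search = [
    [0, 0, 0, 0],
    [0, 0, 1, 2],
    [0, 0, 3, 4],
    [0, 0, 0, 0],
]
template = [[1, 2], [3, 4]]

response = cross_correlate(search, template)
peak = argmax_2d(response)
```

Here the peak of `response` marks the offset at which the template matches the search region best, which is the role the score map plays in a Siamese tracker.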

3. Proposed Method

Inspired by the power of the feedback operation shown in resolving the problem of generative tasks ^[11, 12, 27], this paper proposes exploiting the feedback block to progressively refine the features of the target template and the search region. By feeding the high-level information back to the input of the feedback block, the proposed method can encode more meaningful features for the tracking task. This iterative process guides the network model to learn underlying properties of the corresponding input even under ambiguous cases, which makes the convolution result more accurate in the Siamese-based network. Fig. 1 shows the overall architecture of the proposed method.

Fig. 1. Overall architecture of the proposed method (Siamese feedback network). Note that it consists of the proposed feedback blocks and Siamese region proposal network (SiamRPN).

3.1 Architecture Details

The proposed method, i.e., Siamese feedback network (SiamFB), utilizes a five-stage ResNet-50 ^[28] as the backbone network. In particular, the features encoded by the last three stages of ResNet-50 are fed into the proposed feedback block, composed of the attention and refinement modules. The corresponding results are then input into three SiamRPN blocks ^[2] to generate the response map via convolution operations. This map is finally used for classification and bounding box heads to predict the candidate region containing the target object, as shown in Fig. 1.

Fig. 2 presents the detailed architecture of the proposed feedback block. For the attention module, principal components are arranged in a manner similar to that reported by Yu et al. ^[3]. Spatial self-attention and channel self-attention are computed at each branch, while cross-attention is calculated by utilizing the features from both the target and search branches together. For further refinement, attentive features are adaptively re-calibrated as follows: first, the sigmoid function is applied to the attentive features to extract the balancing factor based on its nonlinear response. Subsequently, the attentive features are re-calibrated both spatially and channel-wise via the modules introduced in ^[13] and ^[14], respectively, and the corresponding results are combined using the balancing factor as follows:

(1)

$F_{out}^{T}=\left(1-k^{T}\right)F_{ch}^{T}+k^{T}F_{sp}^{T}$,

where $T$ is the index of feedback operations. $k^{T}$ denotes the balancing factor, i.e., the output of the sigmoid function in Fig. 2. $F_{ch}^{T}$ and $F_{sp}^{T}$ denote the output features of the channel and spatial calibration blocks with input $F_{att}^{T}$ (i.e., attentive features in Fig. 2), respectively. Such re-calibrated features were combined adaptively because the balancing factor is determined differently according to the input of the refinement module (i.e., attentive features). The output of the refinement module returns to the input of the feedback block, as shown in Fig. 2. Note that this feedback loop is concatenated to the input feature from the corresponding branch as follows:

(2)

$F_{in\_t}^{T}=C\left(F_{in\_t}^{T-1},F_{out}^{T-1}\right),\quad T>0$,

where the function $C\left(x_{1},x_{2}\right)$ denotes the concatenation of $x_{1}$ and $x_{2}$ along the channel direction. These final results are obtained separately from branches for target and search region and then fed into the SiamRPN block ^[2] for the tracking task, as shown in Fig. 1.
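As a rough illustration of Eqs. (1) and (2), the following sketch combines channel- and spatially-calibrated features with a sigmoid balancing factor and then concatenates the result to the block input. It is a minimal plain-Python sketch under stated assumptions: features are represented as a list of channels, each a flat list of values; the balancing factor is assumed to be applied element-wise; and the calibration modules themselves are not implemented (their outputs `f_ch` and `f_sp` are hypothetical example values). All function names are illustrative only.

```python
import math


def sigmoid(x):
    """Logistic function used to derive the balancing factor k."""
    return 1.0 / (1.0 + math.exp(-x))


def combine(f_att, f_ch, f_sp):
    """Eq. (1): F_out = (1 - k) * F_ch + k * F_sp, with k = sigmoid(F_att).

    Assumption: k has the same shape as the features and is applied
    element-wise.
    """
    f_out = []
    for att_c, ch_c, sp_c in zip(f_att, f_ch, f_sp):
        f_out.append([
            (1.0 - sigmoid(a)) * c + sigmoid(a) * s
            for a, c, s in zip(att_c, ch_c, sp_c)
        ])
    return f_out


def concat_channels(x1, x2):
    """Eq. (2): C(x1, x2), concatenation along the channel direction."""
    return x1 + x2


# Hypothetical one-channel example with two spatial positions.
f_in = [[1.0, 1.0]]    # input feature of the branch at step T-1
f_att = [[0.0, 0.0]]   # attentive features; sigmoid(0) gives k = 0.5
f_ch = [[2.0, 4.0]]    # stand-in for the channel calibration output
f_sp = [[4.0, 8.0]]    # stand-in for the spatial calibration output

f_out = combine(f_att, f_ch, f_sp)       # refinement output at step T-1
f_next = concat_channels(f_in, f_out)    # block input at step T
```

With k = 0.5 the output is simply the average of the two calibrated features. Note that a literal reading of Eq. (2) increases the channel count at every step; how the channel dimension is handled afterwards is not specified here and is left out of the sketch.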

3.2 Loss Function

The proposed network is trained based on the weighted sum of two loss functions, i.e., one for the classification and the other for the bounding box regression, as introduced by Li et al. ^[2]. A negative log-likelihood loss ^[22] was adopted for the classification loss as follows:

(3)

$L_{cls}=-\sum _{i=0}^{1}y_{i}\log \left(\hat{y}_{i}\right)$,

where $y_{i}$ and $\hat{y}_{i}$ denote the label of the ground truth and the predicted score estimated from the softmax function, respectively. Note that this classification loss is computed for both positive and negative pairs. On the other hand, the bounding box regression loss was calculated using weighted $l_{1}$ loss as follows:

(4)

$L_{reg}=\sum _{i=0}^{3}w\left| \hat{g}_{i}-g_{i}\right| $,

(5)

$w=\begin{cases}0, & \text{if } IoU\leq 0.6\\ \dfrac{1}{n+\varepsilon }, & \text{otherwise}\end{cases}$,

where $g_{i}$ is the ground truth for the parameters of the bounding box, i.e., $\left(x,y\right)$ positions of the center point $\left(i=0,1\right)$ and scales of the bounding box $\left(i=2,3\right)$, and $\hat{g}_{i}$ is the estimated result of the corresponding ones. $w$ denotes the weighting factor for the positive sample whose IoU value between the anchor box and the corresponding ground truth is greater than 0.6. The anchor box is treated as a negative sample if the IoU value is smaller than 0.3, and other cases are discarded, thus $w$ is set to zero. $n$ is the number of positive anchor boxes, and $\varepsilon $ is a very small value ($1\times 10^{-6}$ in this implementation) to avoid zero division. By using the sum of these loss functions, the proposed network effectively learns to track the target objects as follows:

(6)

$L_{\textit{total}}=L_{cls}+\lambda L_{reg}$,

where the balancing factor $\lambda =1.2$ was used in the present work, which was set through extensive experiments.
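The loss terms described in this subsection can be sketched for a single anchor as follows. This is a minimal plain-Python illustration, not the training code: the function names, the per-anchor formulation, and the example values are assumptions, while the thresholds ($IoU>0.6$ for positive samples), $\varepsilon =1\times 10^{-6}$, and $\lambda =1.2$ are taken from the text.

```python
import math

EPS = 1e-6  # small constant from the text, avoids division by zero


def classification_loss(label, probs):
    """Negative log-likelihood over two classes (negative=0, positive=1).

    `probs` are softmax scores; `label` is the ground-truth class index.
    """
    one_hot = [1.0 if i == label else 0.0 for i in range(2)]
    return -sum(y * math.log(p) for y, p in zip(one_hot, probs) if y > 0.0)


def regression_weight(iou, num_positive):
    """Weight w: zero unless the anchor is positive (IoU > 0.6)."""
    if iou <= 0.6:
        return 0.0
    return 1.0 / (num_positive + EPS)


def regression_loss(pred, target, iou, num_positive):
    """Weighted l1 loss over the four box parameters (x, y, and two scales)."""
    w = regression_weight(iou, num_positive)
    return sum(w * abs(p - g) for p, g in zip(pred, target))


def total_loss(l_cls, l_reg, lam=1.2):
    """Weighted sum of the two terms with balancing factor lambda."""
    return l_cls + lam * l_reg


# Hypothetical positive anchor: IoU 0.7, four positive anchors in the batch.
l_cls = classification_loss(1, [0.5, 0.5])
l_reg = regression_loss([1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0], 0.7, 4)
l_total = total_loss(l_cls, l_reg)
```

An anchor with IoU of 0.6 or lower contributes nothing to the regression term, which matches the statement that only positive samples are weighted.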

Fig. 2. Detailed architecture of the proposed feedback block, which consists of attention module and refinement module. Note that this figure illustrates the process of feedback operation from the viewpoint of the target branch shown in Fig. 1.

4. Experimental Results

This Section reports the experimental results based on benchmark datasets for visual object tracking. The performance comparison with state-of-the-art methods is also provided in detail.

4.1 Training

All the parameters of the proposed architecture were tuned using the stochastic gradient descent (SGD) method for 20 epochs. The momentum and weight decay were set to $9\times 10^{-1}$ and $1\times 10^{-4}$, respectively. The learning rate was increased linearly from $4\times 10^{-4}$ to $2\times 10^{-3}$ for the first five epochs to warm up, and decreased exponentially from $2\times 10^{-3}$ to $2\times 10^{-4}$ for the last 15 epochs. The backbone network was trained only during the last 10 epochs. Five representative benchmarks, i.e., the MS COCO2017 ^[15], ImageNet-VID, ImageNet-DET ^[16], YouTube-VOS ^[17], and YouTube-BoundingBoxes ^[18] datasets, were used for training. Specifically, 117,266 images from the training set of MS COCO2017 ^[15] were collected for learning the proposed model. This study also used 3,862 and 333,474 images of ImageNet-VID and ImageNet-DET ^[16], which are datasets for object detection in videos and images for the 2015 ILSVRC competition, respectively. Moreover, 3,000 and 175,495 images were obtained from YouTube-VOS ^[17] and YouTube-BoundingBoxes ^[18], which were constructed based on high-resolution YouTube videos. Note that the images from ImageNet-VID and YouTube-VOS ^[17] were used in duplicate because of their insufficient numbers. Each training batch was composed of eight samples selected randomly from those five benchmark datasets. The resolutions of the target template and the search region were $127\times 127$ pixels and $255\times 255$ pixels, respectively. Data augmentation techniques, e.g., shift and color transformation, were used to alleviate the overfitting problem. The proposed method was implemented on the PyTorch framework, with four NVIDIA GTX Titan Xp GPUs.
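The learning-rate schedule described above can be sketched as a simple function of the epoch index. This is a minimal sketch under assumptions that the text does not state: the rate is updated once per epoch, epochs are indexed from 0 to 19, the warm-up reaches $2\times 10^{-3}$ exactly at the fifth epoch, and the exponential decay is a geometric interpolation that reaches $2\times 10^{-4}$ exactly at the last epoch. The function name and parameter names are hypothetical.

```python
def learning_rate(epoch, warmup_epochs=5, total_epochs=20,
                  warmup_start=4e-4, peak=2e-3, final=2e-4):
    """Return the learning rate for a 0-indexed epoch.

    Linear warm-up over the first `warmup_epochs` epochs, then geometric
    (exponential) decay from `peak` to `final` over the remaining epochs.
    """
    if epoch < warmup_epochs:
        frac = epoch / (warmup_epochs - 1)
        return warmup_start + frac * (peak - warmup_start)
    decay_epochs = total_epochs - warmup_epochs
    frac = (epoch - warmup_epochs) / (decay_epochs - 1)
    return peak * (final / peak) ** frac


# Learning rate for each of the 20 training epochs.
schedule = [learning_rate(epoch) for epoch in range(20)]
```

Under these assumptions the rate rises over epochs 0 to 4, stays at the peak for epoch 5, and then shrinks by a constant factor per epoch until the end of training.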

4.2 Performance Evaluations

The performance of the trackers was evaluated based on two benchmarks, i.e., the VOT2016 ^[19] and VOT2018 ^[20] datasets, which have been used most widely for visual object tracking. VOT2016 ^[19] and VOT2018 ^[20] each consist of 60 video sequences taken under various scenarios, e.g., sports games, roads, and animals. For quantitative evaluation, the methods were tested in terms of accuracy (A), robustness (R), and expected average overlap (EAO). The accuracy (A) measures the average overlap between the estimated bounding box and the corresponding ground truth during successful tracking, whereas the robustness (R) counts the number of tracking failures that require a reset process for the target region. In addition, the expected average overlap (EAO) is the average overlap expected while the model chases the target object with no failure, reflecting both accuracy and robustness. To help understand the accuracy (A) metric, Fig. 3 presents the overlap values, which range between 0 and 1, on several frames. The average of these overlap values over all video sequences determines the accuracy (A). As shown in the first column of Fig. 3, the overlap values are close to 1 when the model tracks the target object successfully. On the other hand, the overlaps have small values when the model fails to estimate the target region accurately, as shown in the second column of Fig. 3.

Fig. 3. Some examples that show the overlap values of the bounding box estimated by SiamFB and the corresponding ground truth. Note that the overlap values are represented at the right bottom of each image.
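The overlap value used for the accuracy (A) metric can be sketched as the intersection over union of two boxes. The sketch below is a simplification made for illustration: it assumes axis-aligned boxes in (x, y, width, height) form and averages over whatever frames are passed in, without the reset handling of the full VOT protocol. The function names and the example boxes are hypothetical.

```python
def overlap(box_a, box_b):
    """Intersection over union of two axis-aligned (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def accuracy(pred_boxes, gt_boxes):
    """Average overlap over the given frames (simplified accuracy)."""
    overlaps = [overlap(p, g) for p, g in zip(pred_boxes, gt_boxes)]
    return sum(overlaps) / len(overlaps)


# Hypothetical two-frame example: a perfect match and a half-shifted box.
ground_truth = [(0.0, 0.0, 10.0, 10.0), (0.0, 0.0, 10.0, 10.0)]
predictions = [(0.0, 0.0, 10.0, 10.0), (5.0, 0.0, 10.0, 10.0)]

perfect = overlap(predictions[0], ground_truth[0])
shifted = overlap(predictions[1], ground_truth[1])
mean_overlap = accuracy(predictions, ground_truth)
```

A perfectly aligned box gives an overlap of 1, a box shifted by half its width gives one third, and disjoint boxes give 0, which corresponds to the range of values shown in Fig. 3.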

The efficiency and robustness of the proposed method were assessed by comparing the present results with eleven representative methods, i.e., two correlation-based methods (C-COT ^[6] and ECO ^[7]) and nine neural network-based methods (SiamFC ^[8], SiamRPN ^[1], DaSiamRPN ^[10], SiamRPN++ ^[2], SiamMask ^[21], SiamAttn ^[3], Nocal-Siam ^[23], SiamCAN ^[25], and M-F-Siam ^[26]), for visual object tracking. Table 1 shows the corresponding results. The proposed tracker achieved 0.63 (0.63) accuracy, 0.16 (0.23) robustness, and 0.558 (0.482) expected average overlap on the VOT2016 (VOT2018) dataset. In particular, SiamFB has a clear relative gain of 6.3% in the EAO metric over SiamAttn ^[3] on the VOT2016 dataset (0.558 versus 0.525), demonstrating that the proposed method can provide reliable performance even without mask-based guidance. Furthermore, the proposed tracker achieved the highest EAO among the compared state-of-the-art methods while showing accuracy and robustness competitive with previous methods on both the VOT2016 and VOT2018 datasets. These results suggest that it is possible to perform successful visual object tracking using only the simple feedback operations and the refinement module in the proposed feedback block, without other complicated techniques.
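The EAO gains quoted in this paper appear to be relative improvements over the reference tracker rather than differences in percentage points. The following short calculation, using only the EAO values listed in Table 1, illustrates that reading; the function name `relative_gain` is an assumption introduced for this example.

```python
def relative_gain(ours, reference):
    """Relative improvement of `ours` over `reference`, in percent."""
    return 100.0 * (ours - reference) / reference


# EAO values taken from Table 1.
gain_vs_siamattn_2016 = relative_gain(0.558, 0.525)   # SiamFB vs. SiamAttn, VOT2016
gain_vs_runner_up_2016 = relative_gain(0.558, 0.554)  # SiamFB vs. Nocal-Siam, VOT2016
gain_vs_runner_up_2018 = relative_gain(0.482, 0.474)  # SiamFB vs. Nocal-Siam, VOT2018
```

Rounded, these evaluate to about 6.3%, 0.72%, and 1.69%, which agrees with the figures stated in the text and in the abstract.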

The processing speed in terms of frames per second (FPS) was measured on a single NVIDIA GeForce RTX 2080 Ti GPU to analyze the potential of the proposed tracker for real-time operation. Table 2 lists the average speeds of the proposed method and other representative methods. SiamFB achieves 68 FPS, which is sufficient for real-time applications. Moreover, the proposed method was faster than most of the Siamese network-based methods listed, although the speeds of the previous methods were reported in different environments. These results show that the proposed method has a sufficient ability to perform real-time tracking.

Table 1. Performance on VOT2016 and VOT2018

Trackers	VOT2016 A ↑	VOT2016 R ↓	VOT2016 EAO ↑	VOT2018 A ↑	VOT2018 R ↓	VOT2018 EAO ↑
C-COT [6]	0.54	0.24	0.331	0.49	0.32	0.267
ECO [7]	0.55	0.20	0.375	0.48	0.28	0.276
SiamFC [8]	0.53	0.46	0.235	0.50	0.59	0.188
SiamRPN [1]	0.56	0.26	0.344	-	-	-
DaSiamRPN [10]	0.61	0.22	0.411	0.59	0.28	0.383
SiamRPN++ [2]	0.64	0.20	0.464	0.60	0.23	0.415
SiamMask [21]	0.67	0.23	0.442	0.64	0.30	0.387
SiamAttn [3]	0.68	0.15	0.525	0.63	0.16	0.470
Nocal-Siam [23]	0.62	0.09	0.554	0.59	0.16	0.474
SiamCAN [25]	0.64	0.15	0.513	0.61	0.18	0.462
M-F-Siam [26]	0.58	0.27	0.335	-	-	-
SiamFB (Ours)	0.63	0.16	0.558	0.63	0.23	0.482

Table 2. Comparison of the speed on the VOT dataset

Trackers	Processing speed (FPS)
C-COT [6]	0.3
ECO [7]	6
SiamFC [8]	86
SiamRPN [1]	160
DaSiamRPN [10]	160
SiamRPN++ [2]	35
SiamMask [21]	55
SiamAttn [3]	33
Nocal-Siam [23]	38
SiamCAN [25]	45
M-F-Siam [26]	43
SiamFB (Ours)	68

* Note that all speeds of the previous methods are reported in the environment of each previous work.

Fig. 4 presents several results of visual object tracking on the VOT2016 dataset. The top row in Fig. 4 is the first frame of each sequence, in which the target template is initialized, and the others show the tracking results by the representative trackers. The proposed method successfully chased the target object even under various ambiguities, e.g., occlusions and scale changes, whereas previous methods often show drifting results, as shown in the leftmost two columns of Fig. 4. Because the Siamese-based trackers rely on the correlation result, target-like background clutter might cause confusion, leading to a performance drop. This problem was overcome by refining target-relevant features more precisely via the feedback block (see the third and fourth columns of Fig. 4). Furthermore, as shown in the fifth and sixth columns of Fig. 4, the SiamFB quickly grasps the target objects after the occlusion while other models still struggle to find the target. Fig. 5 shows the tracking results for the video acquired from the outdoor environment. The proposed method was robust to various obstacles. Despite the occlusions caused by trees or bollards and background clutter (e.g., bicycles and kickboards), the tracker successfully chased the target object through consecutive frames. The proposed method was also robust to a change in scales and appearances. Therefore, the Siamese feedback network is effective in visual object tracking even under complicated real-world environments.

Fig. 4. Some examples of visual object tracking on the VOT2016 dataset. Note that the results of the proposed method and three other previous methods are shown in different colors.

Fig. 5. Results of the proposed method for our outdoor video. The frames were sampled every three seconds from the video sequence. In particular, the second and the fourth examples of the bottom row effectively show the robustness of the proposed method in the complex scene.

Fig. 6. Some examples of the failure cases by the proposed method on the VOT2016 dataset. Note that these results were obtained under low-light conditions.

4.3 Ablation Studies

Several comparative experiments were conducted to verify the effectiveness of the proposed method. First, Table 3 shows the effects of using the refinement module and the feedback loop. The baseline model indicates the architecture without the refinement module and the feedback loop in Fig. 2. As shown in Table 3, re-calibrating the features spatially and channel-wise helps improve the tracking performance. In particular, the EAO was improved further when a combination of spatial and channel calibration was used, compared to applying only one of them to the baseline model. Furthermore, the feedback operations also boosted the tracking performance. By refining the re-calibrated features via the feedback loop, the network could learn more meaningful information about the target object and the search region. Consequently, the proposed method achieved 0.558 (0.482) in EAO, which is 9.8 (4.9) percentage points higher than the baseline model on the VOT2016 (VOT2018) dataset (see the comparison between the top and bottom rows of Table 3). Therefore, the feedback block has a significant impact on the performance improvement in visual object tracking.

Moreover, the tracking performance was analyzed according to the number of feedback loops, as shown in Table 4. Note that these experiments were conducted with architectures containing the refinement module. As shown in Table 4, the EAO value improved considerably as the number of feedback loops was increased from $T=0$ to $T=2$, whereas it decreased when $T=3$ on both the VOT2016 and VOT2018 datasets. The feedback operations effectively improved the performance, but too many iterations probably caused overfitting of the model. In addition, the robustness value (R; lower is better) increased slightly, i.e., the performance dropped in terms of this metric, when $T=1$ and $T=2$ on the VOT2018 dataset. This is because information that is refined slightly incorrectly in the first step of the feedback block can accumulate during the additional feedback loops and cause errors in target localization. Based on these results, $T=2$ was adopted for the proposed method, which still outperformed the previous methods.

Table 3. Performance analysis of the proposed method according to changes in the network architectures.

Methods	VOT2016 A ↑	VOT2016 R ↓	VOT2016 EAO ↑	VOT2016 ∆EAO	VOT2018 A ↑	VOT2018 R ↓	VOT2018 EAO ↑	VOT2018 ∆EAO
Baseline	0.62	0.23	0.460	-	0.61	0.20	0.433	-
Baseline+SC	0.65	0.20	0.497	+3.7%	0.61	0.19	0.448	+1.5%
Baseline+CC	0.66	0.19	0.511	+5.1%	0.62	0.19	0.451	+1.8%
Baseline+SC+CC	0.62	0.17	0.515	+5.5%	0.61	0.18	0.457	+2.4%
Baseline+SC+CC+Feedback (Ours)	0.63	0.16	0.558	+9.8%	0.63	0.23	0.482	+4.9%

* Note that SC and CC denote spatial and channel calibration in the refinement module, respectively.
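The ∆EAO column of Table 3 can be reproduced as the absolute difference from the baseline EAO expressed in percentage points, as the following sketch shows. The dictionary layout and the function name `delta_points` are assumptions introduced for this check; the EAO values are copied from Table 3.

```python
BASELINE_EAO = {"VOT2016": 0.460, "VOT2018": 0.433}

# EAO values of each variant, copied from Table 3.
VARIANT_EAO = {
    "Baseline+SC": {"VOT2016": 0.497, "VOT2018": 0.448},
    "Baseline+CC": {"VOT2016": 0.511, "VOT2018": 0.451},
    "Baseline+SC+CC": {"VOT2016": 0.515, "VOT2018": 0.457},
    "Baseline+SC+CC+Feedback": {"VOT2016": 0.558, "VOT2018": 0.482},
}


def delta_points(eao, baseline):
    """Absolute EAO difference from the baseline, in percentage points."""
    return round(100.0 * (eao - baseline), 1)


# Recompute the delta column for every variant and dataset.
deltas = {
    name: {ds: delta_points(value, BASELINE_EAO[ds]) for ds, value in eao.items()}
    for name, eao in VARIANT_EAO.items()
}
```

The recomputed values match the table entries (for example, +9.8 and +4.9 for the full model), which indicates that this column reports differences in percentage points rather than relative gains.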

Table 4. Performance analysis of the proposed method according to the number of feedback operations.

Number of feedbacks	VOT2016 A ↑	VOT2016 R ↓	VOT2016 EAO ↑	VOT2018 A ↑	VOT2018 R ↓	VOT2018 EAO ↑
T=0	0.62	0.17	0.515	0.61	0.18	0.457
T=1	0.63	0.15	0.532	0.63	0.22	0.471
T=2	0.63	0.16	0.558	0.63	0.23	0.482
T=3	0.62	0.16	0.523	0.61	0.19	0.443

4.4 Discussion and Future Work

Based on various experimental results shown in previous subsections, the advantages and disadvantages of the proposed method can be summarized as follows:

Strong points: SiamFB successfully overcame various ambiguities in complicated real-world environments, as demonstrated by qualitative and quantitative results that outperformed previous methods. The proposed method also ran at real-time speed. These advantages were obtained using a simple feedback process that does not require additional parameters to learn.

Weak points: The proposed method still suffers from failure cases, as shown in Fig. 6. Specifically, unclear boundaries and blurred textures confuse the tracker under low-light conditions, so it fails to chase the target object consistently.

To compensate for the shortcomings mentioned above, a low-light image enhancement process will be applied to SiamFB for more accurate tracking. Adding brightness reduction to the data augmentation may also help make the model robust to low-light conditions. Finally, practical studies on implementation on embedded platforms will also be considered.

5. Conclusion

A novel method for visual object tracking was proposed. The feedback operation was adopted with a simple refinement module to extract target-relevant features more accurately. This iterative process efficiently guides the model to learn the target appearance and its surroundings against ambiguities. Based on various experimental results, the advantages and properties of the proposed method were analyzed in detail. The proposed method was effectively applied to the problem of visual object tracking.

ACKNOWLEDGMENTS

This work was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2018-0-00207, Immersive Media Research Laboratory).

REFERENCES

Li B., Yan J., Wu W., Zhu Z., Hu X., Jun. 2018, High performance visual tracking with Siamese region proposal network, in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., pp. 8971-8980

Li B., Wu W., Wang Q., Zhang F., Xing J., Yan J., Jun. 2019, SiamRPN++: evolution of Siamese visual tracking with very deep networks, in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., pp. 4282-4291

Yu Y., Xiong Y., Huang W., Scott M. R., Jun. 2020, Deformable Siamese attention networks for visual object tracking, in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., pp. 6728-6737

Bolme D. S., Beveridge J. R., Draper B. A., Lui Y. M., Jun. 2010, Visual object tracking using adaptive correlation filter, in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., pp. 2544-2550

Henrique J. F., Caseiro R., Martins P., Batista J., Mar. 2015, High-speed tracking with kernelized correlation filters, IEEE Trans. Pattern Anal. Mach. Intell., Vol. 37, No. 5, pp. 583-596

Danelljan M., Robinson A., Khan F. S., Felsberg M., Oct. 2016, Beyond correlation filters: learning continuous convolution operators for visual tracking, in Proc. Eur. Conf. Comput. Vis., pp. 1-16

Danelljan M., Bhat G., Khan F. S., Felsberg M., Jun. 2017, ECO: efficient convolution operators for tracking, in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., pp. 6638-6646

Bertinetto L., Valmadre J., Henriques J. F., Vedaldi A., Torr P. H. S., Nov. 2016, Fully-convolutional Siamese networks for object tracking, in Proc. Eur. Conf. Comput. Vis., pp. 850-865

Ren S., He K., Girshick R., Sun J., Jun. 2017, Faster R-CNN: towards real-time object detection with region proposal networks, IEEE Trans. Pattern Anal. Mach. Intell., Vol. 39, No. 6, pp. 1137-1149

Zhu Z., Wang Q., Li B., Wu W., Yan J., Hu W., Sep. 2018, Distractor-aware Siamese networks for visual object tracking, in Proc. Eur. Conf. Comput. Vis., pp. 101-117

Li Z., Yang J., Liu Z., Yang X., Jeon G., Wu W., Jun. 2019, Feedback network for image super-resolution, in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., pp. 3862-3871

Kim J., Kim W., Dec. 2020, Attentive feedback feature pyramid network for shadow detection, IEEE Signal Process. Lett., Vol. 27, pp. 1964-1968

Zhao T., Wu X., Jun. 2019, Pyramid feature attention network for saliency detection, in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., pp. 3085-3094

Hu J., Shen L., Sun G., Jun. 2018, Squeeze-and-excitation networks, in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., pp. 7131-7141

Lin T-Y., Maire M., Belongie S., Bourdev L., Girshick R., Hays J., Perona P., Ramanan D., Zitnick C. L., Dollár P., Sep. 2014, Microsoft COCO: common objects in context, in Proc. Eur. Conf. Comput. Vis., pp. 740-755

Russakovsky O., Deng J., Su H., Krause J., Satheesh S., Ma S., Huang Z., Karpathy A., Khosla A., Bernstein M., Berg A. C., Fei-Fei L., 2015, ImageNet large scale visual recognition challenge, Int. J. Comput. Vis., Vol. 115, No. 3, pp. 211-252

Xu N., Yang L., Fan Y., Yang J., Yue D., Liang Y., Price B., Cohen S., Huang T., Sep. 2018, YouTube-VOS: sequence-to-sequence video object segmentation, in Proc. Eur. Conf. Comput. Vis., pp. 585-601

Real E., Shlens J., Mazzocchi S., Pan X., Vanhoucke V., Jul. 2017, YouTube-BoundingBoxes: a large high-precision human-annotated data set for object detection in video, in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., pp. 5296-5305

Kristan M., et al., Oct. 2016, The visual object tracking VOT2016 challenge results, in Proc. Eur. Conf. Comput. Vis.

Kristan M., et al., Sep. 2018, The sixth visual object tracking VOT2018 challenge results, in Proc. Eur. Conf. Comput. Vis.

Wang Q., Zhang L., Bertinetto L., Hu W., Torr P. H. S., Jun. 2019, Fast online object tracking and segmentation: a unifying approach, in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., pp. 1328-1338

Yao H., Zhu D-L., Jiang B., Yu P., Oct. 2019, Negative log likelihood ratio loss for deep neural network classification, in Proc. Future Tech. Conf., pp. 276-282

Tan H., Zhang X., Zhang Z., Lan L., Zhang W., Luo Z., 2021, Nocal-Siam: Refining visual features and response with advanced non-local blocks for real-time Siamese tracking, IEEE Trans. Image Process., Vol. 30, pp. 2656-2668

Wang X., Girshick R., Gupta A., He K., Jun. 2018, Non-local neural networks, in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., pp. 7794-7803

Zhou W., Wen L., Zhang L., Du D., Luo T., Wu Y., 2021, SiamCAN: Real-time visual tracking based on Siamese center-aware network, IEEE Trans. Image Process., Vol. 30, pp. 3597-3609

Jiang M., Zhao Y., Kong J., Aug. 2021, Mutual learning and feature fusion Siamese networks for visual object tracking, IEEE Trans. Circuits Syst. Video Technol., Vol. 31, No. 8, pp. 3154-3167

Li Q., Li Z., Lu L., Jeon G., Liu K., Yang X., Sep. 2019, Gated multiple feedback network for image super-resolution, in Proc. Brit. Mach. Vis. Conf., pp. 1-12

He K., Zhang X., Ren S., Sun J., Jun. 2016, Deep residual learning for image recognition, in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., pp. 770-778

Author

Mi-Gyeong Gwon

Mi-Gyeong Gwon is currently pursuing a B.S. degree with the Department of Electrical and Electronics Engineering, Konkuk University, Seoul, South Korea. Her current research interests include object detection and tracking, scene understanding, image enhancement, and colorization.

Jinhee Kim

Jinhee Kim received his B.S. degree from the Department of Electrical and Electronics Engineering and his M.S. degree in Electronic, Information and Communication Engineering from Konkuk University, Seoul, South Korea, in 2020 and 2021, respectively. He is currently working at Hyundai Motor Company. His research interests include computer vision, object detection and tracking, instance segmentation, and image enhancement. This work was done while he was at Konkuk University.

Gi-Mun Um

Gi-Mun Um received his B.S., M.S., and Ph.D. degrees in electronic engineering from Sogang University, Seoul, Rep. of Korea, in 1991, 1993, and 1998, respectively. Since 1998, he has worked for the Electronics and Telecommunications Research Institute, Daejeon, Rep. of Korea, and he is currently with the Realistic Media Research Section. He worked as a visiting research scientist at the Communications Research Center Canada from 2001 to 2002. He participated in “F.IoT-ASM (F.747.8): requirements and reference architecture for audience-selectable media service framework in the IoT environment” as an editor of the ITU-T SG16 from 2014 to 2015. He is now working on 360VR, light field video, AR/VR/XR, and network-based media processing. His main research interests include computer vision and multi-view/3D/AR video.

HeeKyung Lee

HeeKyung Lee received her B.S. degree in computer engineering from Yeungnam University, Daegu, Rep. of Korea, in 1999, and her M.S. degree in engineering from the Information and Communication University, Daejeon, Rep. of Korea, in 2002. Since 2002, she has worked for the Electronics and Telecommunications Research Institute, Daejeon, Rep. of Korea, where she is now serving as a senior member of the engineering staff. She participated in “TV-Anytime” standardization and IPTV metadata standardization. She was also involved in the development of gaze tracking technology. Currently, she is working on 360VR, AR, and MR. Her research interests include personalized service via metadata, HCI, gaze tracking, bi-directional advertisement, video content analysis, and VR/AR/MR.

Jeongil Seo

Jeongil Seo was born in Goryoung, Korea, in 1971. He received his Ph.D. degree in electronics from Kyoung-pook National University (KNU), Daegu, Korea, in 2005 for his work on audio signal processing systems. He worked as a member of the engineering staff at the Laboratory of Semiconductor, LG-semicon, Cheongju, Korea, from 1998 until 2000. He has worked as a director at the Immersive Media Research Section, Electronics and Telecommunications Research Institute (ETRI), Daejeon, Korea, since 2000. His research activities include image and video processing, audio processing, and realistic broadcasting and media service systems.

Seong Yong Lim

Seong Yong Lim received his B.S. and M.S. degrees in electrical engineering from the Korea Advanced Institute of Science and Technology (KAIST) in 1999 and 2011, respectively. His research interests include a wide range of field-of-view applications, real-time video processing, and network-based inference.

Seung-Jun Yang

Seung-Jun Yang received his B.S. and M.S. degrees in computer science from Suncheon National University and Chonnam National University in 1999 and 2001, respectively. Since 2001, he has been a principal researcher in the media research division of ETRI, where he has developed advanced digital television technology, such as data broadcasting, personalized broadcasting, emotional broadcasting, assistive broadcasting for the disabled, and ultra-wide vision technology. He is currently researching fundamental media and content technologies for hyper-realistic media space.

Wonjun Kim

Wonjun Kim received his B.S. degree from the Department of Electronic Engineering, Sogang University, Seoul, South Korea, in 2006, his M.S. degree from the Department of Information and Communications, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea, in 2008, and his Ph.D. degree from the Department of Electrical Engineering, KAIST, in 2012. From September 2012 to February 2016, he was a Research Staff Member of the Samsung Advanced Institute of Technology (SAIT), South Korea. Since March 2016, he has been with the Department of Electrical and Electronics Engineering, Konkuk University, Seoul, where he is currently an Associate Professor. His research interests include image and video understanding, computer vision, pattern recognition, and biometrics, with an emphasis on background subtraction, saliency detection, and face and action recognition. He has served as a Regular Reviewer for over 30 international journals, including the IEEE Transactions on Image Processing, IEEE Transactions on Circuits and Systems for Video Technology, IEEE Transactions on Multimedia, IEEE Transactions on Cybernetics, IEEE Access, and IEEE Signal Processing Letters.
