Li Yuting
(Department of Electronic Engineering, Hanyang University / Korea sinclairlyt@hanyang.ac.kr)
Copyright © The Institute of Electronics and Information Engineers (IEIE)
Keywords
Autonomous driving, Multispectral pedestrian detection, Information fusion, Convolutional neural network, Deep learning
1. Introduction
Pedestrian detection is a popular task with a long history due to its important
applications in fields such as robotics, surveillance, and autonomous driving. Existing
pedestrian detectors that rely on visible-light cameras are largely limited to daytime use.
Detecting pedestrians remains a difficult task in adverse illumination environments,
such as nighttime, low-light exposure, and shadow.
To resolve the abovementioned problems, fusion of visible-light and thermal images
is a helpful solution, generating diverse and complementary features that serve
the detection task under challenging circumstances. A visible-light image usually
contains semantic and color information crucial to detection, but is sensitive to
the lighting. A thermal infrared image is captured in terms of radiated heat from
objects in the scene, and therefore, no external light sources are required. However,
the thermal image is typically troubled by the problem of low resolution, and a lack
of object shape and texture details. By integrating visible-light and thermal images,
these different, but complementary, characteristics can be effectively fused to achieve
a more robust perception of the environment. As shown in Fig. 1(a), the color and details of pedestrians' clothing are difficult to distinguish in thermal
images but can be easily obtained through visible-light images. As shown in Fig. 1(b), pedestrian visibility with a visible-light image is restricted because of the low-light
conditions. However, information on pedestrian intensity can be enhanced by using
thermal images. When the background is bright, the visual features of pedestrians
provided by visible-light images are more distinct from the background, whereas human
silhouettes in thermal images are ambiguous, as shown in Fig. 1(c).
To take better advantage of the complementary characteristics of visible-light
and thermal images for pedestrian detection, it is crucial to investigate how to efficiently
integrate visible-light and thermal sensors. In recent years, visible-light and thermal-camera
fusion-based pedestrian detection has achieved considerable improvement under various
lighting conditions [1-10, 23-25]. These studies proved that visible-light and thermal-image fusion is helpful in improving
pedestrian-detection accuracy in a variety of lighting environments, both daytime and
nighttime. Fig. 2 illustrates how fusion can improve detection performance. It is
clear that the detection results improve significantly through fusion. Fig. 2(a) has
one false negative and five false positives, six errors in total. However, after
fusion, both persons were detected without error, as shown in Fig. 2(b).
The rest of this paper is organized as follows. Section 2 presents a review of
the well-known multispectral pedestrian-detection methods. Section 3 provides an overview
of several multispectral object-detection benchmarks, and compares the performance
of well-known multispectral detectors. Section 4 discusses a few challenges in detection,
and puts forward some prospects for future study. Finally, conclusions are given in
Section 5.
Fig. 1. Visualization of the complementary characteristics of visible images (top) and thermal images (bottom) in (a) well-illuminated conditions, (b) poor illumination, and (c) with bright backgrounds.
Fig. 2. Effects of pedestrian detection with (a) only a visible-light image, (b) visible-light and thermal images.
2. Literature Review
Multispectral pedestrian detection approaches can be divided into three categories:
hand-crafted approaches, convolutional neural network (CNN)-based one-stage approaches,
and CNN-based two-stage approaches. Hand-crafted channel-based approaches need to
manually design the features that describe images. Conversely, CNN-based approaches
are able to extract features in an automatic way to avoid extracting and selecting
features manually.
2.1 Hand-crafted Methods for Multispectral Pedestrian Detection
In [1], a multispectral pedestrian detector based on the Aggregated Channel Feature (ACF)
was developed. The authors augmented the ACF channels with a thermal intensity
feature, T, and a Thermal Histogram of Oriented Gradients (THOG) computed from the
thermal image. ACF + T + THOG is then adopted as a new combined channel
feature, and an AdaBoost classifier is used to predict pedestrians. Experiments in
this work demonstrate that the fusion of multispectral data significantly improves
detection accuracy. However, this conventional hand-crafted method requires manually
designed features from which it is difficult to extract discriminative features in
order to describe pedestrians.
2.2 CNN-based Two-stage Multispectral Pedestrian Detection Methods
Different from the approaches based on hand-crafted channel features, the CNN-based
approaches have an advantage in that they can automatically extract informative features
by self-learning. The CNN-based two-stage multispectral pedestrian detectors are typically
built upon the classic two-stage CNN: the fast R-CNN and the faster R-CNN. In the
two-stage detectors, the first stage aims at generating a group of candidate proposals,
while the second stage screens the candidate proposals to accurately locate objects
and predict the class label.
Liu et al. [2] explored at which stage visible-light and thermal images are best fused between the
two branches of a faster R-CNN [11]. As shown in Fig. 3, the authors proposed four fusion architectures named Early Fusion, Halfway Fusion,
Late Fusion, and Score Fusion. The four architectures all refer to the faster R-CNN.
The experimental results show that Halfway Fusion, which fuses the two branches in
the middle layers, performs the best.
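As a rough illustration of the Halfway Fusion idea, the following PyTorch sketch concatenates mid-level feature maps from separate visible-light and thermal streams and reduces the channel count with a 1x1 Network-in-Network convolution, as in [2]. The layer configuration is a simplified stand-in, not the exact VGG-16 backbone of the paper.

```python
# Minimal sketch of Halfway Fusion: two conv streams, concatenation in the
# middle layers, and a 1x1 conv to halve the doubled channel count.
import torch
import torch.nn as nn

def make_stream(in_ch: int) -> nn.Sequential:
    """Small stand-in for the lower half of a VGG-style backbone."""
    return nn.Sequential(
        nn.Conv2d(in_ch, 64, 3, padding=1), nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
        nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(inplace=True),
    )

class HalfwayFusion(nn.Module):
    def __init__(self):
        super().__init__()
        self.visible = make_stream(3)   # RGB stream
        self.thermal = make_stream(1)   # single-channel thermal stream
        # 1x1 conv (Network-in-Network) reduces 256 channels back to 128
        self.nin = nn.Conv2d(256, 128, kernel_size=1)

    def forward(self, rgb, thermal):
        fused = torch.cat([self.visible(rgb), self.thermal(thermal)], dim=1)
        return self.nin(fused)  # fed to the shared RPN / detection head

feat = HalfwayFusion()(torch.randn(1, 3, 512, 640), torch.randn(1, 1, 512, 640))
print(feat.shape)  # torch.Size([1, 128, 256, 320])
```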
Based on [2], König et al. [3] developed a Fusion RPN and adopted a boosted decision tree (BDT) to replace the original classifier used by the faster R-CNN. RPN+BDT uses a
two-stream (visible-light stream and thermal stream) RPN to generate proposals, and
the BDT to classify the proposals. Before applying the BDT, features are taken
from the separate layers of the visible-light and thermal streams and from the fused layer.
In their work, the fused features are taken from the conv4_3 layers, whereas in [2] they are taken from the conv5_3 layers; the authors showed that the conv4_3 layers are the
most promising source of in-depth features for the BDT.
Like [2], they use a single concatenation layer for fusion, which leaves considerable room for improvement.
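A hedged sketch of this proposal-classification step: per-proposal features pooled from the visible-light, thermal, and fused layers are concatenated and fed to a boosted tree classifier. The feature dimensions, the random stand-in data, and the use of scikit-learn's gradient boosting are illustrative assumptions, not the exact BDT setup of [3].

```python
# Toy version of RPN+BDT classification: concatenate pooled per-proposal
# features from three sources and score proposals with boosted trees.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
n_proposals, d = 1000, 256                        # toy sizes, not the paper's
feat_visible = rng.normal(size=(n_proposals, d))  # pooled conv4_3 (visible)
feat_thermal = rng.normal(size=(n_proposals, d))  # pooled conv4_3 (thermal)
feat_fused = rng.normal(size=(n_proposals, d))    # pooled fused layer
X = np.concatenate([feat_visible, feat_thermal, feat_fused], axis=1)
y = rng.integers(0, 2, size=n_proposals)          # pedestrian / background

bdt = GradientBoostingClassifier(n_estimators=100, max_depth=3)
bdt.fit(X, y)
scores = bdt.predict_proba(X)[:, 1]               # pedestrian confidence
```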
Chen et al. [4] proposed a multilayer fusion CNN that further improved detection accuracy. They introduced
a summation fusion strategy, instead of concatenation fusion, to fuse layers from
the visible-light and thermal streams. The experimental results demonstrated that
summation fusion is more effective than concatenation fusion. For detection, they
extract features from all three feature maps.
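The difference between the two fusion operators compared in [4] is easy to see as tensor operations; a minimal sketch:

```python
# Concatenation doubles the channel dimension (and needs an extra 1x1 conv to
# reduce it), while element-wise summation keeps the channel count unchanged.
import torch

v = torch.randn(1, 128, 64, 80)   # visible-stream feature map
t = torch.randn(1, 128, 64, 80)   # thermal-stream feature map

concat_fused = torch.cat([v, t], dim=1)  # shape (1, 256, 64, 80)
sum_fused = v + t                        # shape (1, 128, 64, 80)
```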
All the above methods fuse visible-light information and thermal information
using equal weights, which is not robust enough under changeable lighting conditions.
In [5] and [6], adaptive weighting strategies were proposed for fusing multispectral images more
effectively. An illumination-aware scheme was proposed to learn the weight for fusing
visible-light and thermal images adaptively. For example, during the daytime, higher
weight is given to the visible-light channel. At night, the higher weight is given
to the thermal channel. This kind of adaptive weighting strategy has proven useful.
FRPN-Sum + TSS [7] and MSDS-RCNN [8] further design a subnetwork to process hard negative samples, which improves detection
accuracy. These methods jointly learn the semantic segmentation task and the pedestrian
detection task in two subnets: a proposal generation network and a classification
network. The semantic segmentation task has been shown to boost object detection
performance, because it forces the network to generate stronger semantic features at a high level.
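A minimal sketch of such an illumination-aware weighting scheme, in the spirit of [5] and [6]: a small gate network predicts a day/night score from the visible-light image, and the two modality features are blended with complementary weights. The gate architecture here is an assumption for illustration, not the exact subnetwork of either paper.

```python
# Illumination-aware fusion: a learned scalar w in [0, 1] weights the visible
# features, and (1 - w) weights the thermal features.
import torch
import torch.nn as nn

class IlluminationGate(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, 1), nn.Sigmoid(),  # w ~ 1 for daytime, ~ 0 at night
        )

    def forward(self, rgb, feat_visible, feat_thermal):
        w = self.net(rgb).view(-1, 1, 1, 1)
        # daytime -> rely on visible features; nighttime -> rely on thermal
        return w * feat_visible + (1.0 - w) * feat_thermal

gate = IlluminationGate()
fused = gate(torch.randn(2, 3, 512, 640),
             torch.randn(2, 128, 64, 80), torch.randn(2, 128, 64, 80))
```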
Zhang et al. [23] pointed out that the existing multispectral data are not in strict alignment, which
makes it hard for deep learning-based methods to accurately fuse features from two
weakly aligned modalities. To solve this problem, they developed an Aligned Region
CNN (AR-CNN). A Region Feature Alignment (RFA) module was designed to compute the
position shift used for adaptive alignment of the region features of the two modalities.
They also developed a new multimodal fusion approach that reweights features to
highlight more useful ones while suppressing less useful ones. A novel Region of
Interest (RoI) jitter strategy was also proposed to improve the robustness of the
detector.
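A hedged sketch of the region-alignment idea: a small head regresses a (dx, dy) shift from the concatenated region features and resamples the thermal RoI features accordingly before fusion. The shapes, layers, and use of an affine grid are illustrative assumptions, not AR-CNN's exact RFA module.

```python
# Predict a position shift between the two modalities and translate the
# thermal region features by that shift before fusing.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionFeatureAlignment(nn.Module):
    def __init__(self, channels=256, roi=7):
        super().__init__()
        self.shift_head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(2 * channels * roi * roi, 256), nn.ReLU(inplace=True),
            nn.Linear(256, 2),  # predicted (dx, dy) in normalized coordinates
        )

    def forward(self, roi_visible, roi_thermal):
        shift = self.shift_head(torch.cat([roi_visible, roi_thermal], dim=1))
        n = roi_thermal.size(0)
        # affine grid that translates the thermal RoI features by (dx, dy)
        theta = torch.zeros(n, 2, 3)
        theta[:, 0, 0] = theta[:, 1, 1] = 1.0
        theta[:, :, 2] = shift
        grid = F.affine_grid(theta, roi_thermal.shape, align_corners=False)
        aligned = F.grid_sample(roi_thermal, grid, align_corners=False)
        return roi_visible + aligned  # fuse the aligned region features

rfa = RegionFeatureAlignment()
out = rfa(torch.randn(8, 256, 7, 7), torch.randn(8, 256, 7, 7))
```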
Fig. 3. Four architectures to fuse visible and thermal images for multispectral pedestrian detection, explored in [2].
2.3 CNN-based One-stage Multispectral Pedestrian Detection Methods
Although the above-mentioned two-stage detectors achieve good performance, their
computation time is high. Different from two-stage detectors, one-stage detectors
omit the region-proposal generation procedure and encapsulate all operations in a
single-shot process, which makes them superior to two-stage detectors in terms of
computation time.
The gated fusion double single shot detector (GFD-SSD) [9] adopts two SSDs to process the input visible-light and thermal images,
aiming to achieve a better balance between detection accuracy and computation
time. In this work, two kinds of Gated Fusion Unit (GFU) are proposed to integrate
the feature maps from the middle layers of the two SSDs. The critical function of
the proposed GFU is to adaptively adjust the feature map combination between two modalities.
Besides, the authors proposed one gated fusion and four mixed fusion architectures
(named Mixed_Even, Mixed_Odd, Mixed_Early, and Mixed_Late, depending on which layers
are selected to use the GFUs) on the feature pyramid to integrate two SSDs in different
modalities. Experiments show that the GFD-SSD and its mixed fusion architecture achieved
both competitive accuracy and better inference runtime.
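A hedged sketch of the gating mechanism: per-location weights computed from the concatenated feature maps control each stream's contribution to the fused map. The paper's exact GFU equations differ in detail; this shows only the general idea.

```python
# Gated fusion: sigmoid gates derived from the joint features adaptively
# weight the visible and thermal feature maps at every spatial location.
import torch
import torch.nn as nn

class GatedFusionUnit(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.gate_v = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)
        self.gate_t = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, f_v, f_t):
        joint = torch.cat([f_v, f_t], dim=1)
        w_v = torch.sigmoid(self.gate_v(joint))  # per-pixel weight for visible
        w_t = torch.sigmoid(self.gate_t(joint))  # per-pixel weight for thermal
        return w_v * f_v + w_t * f_t

gfu = GatedFusionUnit(256)
fused = gfu(torch.randn(1, 256, 38, 38), torch.randn(1, 256, 38, 38))
```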
In [10], a RetinaNet-based fusion architecture [12] was presented. The authors exploited three kinds of fusion network based on the feature
pyramid network (FPN) [13], and they adopted a focal-loss function in RetinaNet to achieve more accurate detection
performance. However, the architecture of this method is complicated and carries redundant
parameters; as a result, the accuracy was not satisfactory, and the central processing
unit (CPU) time is high.
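For reference, the focal loss of RetinaNet [12], which [10] adopts, down-weights easy examples with a (1 - p_t)^gamma factor so training focuses on hard ones. A standard binary formulation, written out for clarity:

```python
# Binary focal loss as defined in the RetinaNet paper [12].
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """logits and targets share the same shape; targets are 0/1 floats."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)          # prob of true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

loss = focal_loss(torch.randn(16, 1000), torch.randint(0, 2, (16, 1000)).float())
```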
Zhang et al. proposed a Cyclic Fuse-and-Refine Module (CFRM) [24] implemented on a feature fusion single shot multibox detector (FSSD). The proposed
CFRM leverages the complementary information in multispectral features and
can be incorporated into any network.
Following the AR-CNN, Zhou et al. proposed a Modality Balance Network (MBNet)
[25] to further resolve the problem of modality imbalance. The MBNet extends
the SSD with a Differential Modality Aware Fusion (DMAF) module designed so
the two modalities complement each other. In addition, complementary features are selected
and aligned adaptively by the proposed illumination-aware feature alignment module.
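A hedged sketch of the DMAF idea as described above: the difference between the two modality features is squeezed into a channel-wise gate, and each stream is complemented with the gated features of the other. The tanh gate and residual form follow one reading of [25] and should be treated as an approximation, not the paper's exact module.

```python
# Differential fusion sketch: a channel gate derived from the feature
# difference lets each modality borrow complementary information.
import torch
import torch.nn as nn

class DMAF(nn.Module):
    def forward(self, f_v, f_t):
        diff = f_v - f_t                                         # differential features
        gate = torch.tanh(diff.mean(dim=(2, 3), keepdim=True))   # channel-wise gate
        f_v_out = f_v + f_t * gate     # thermal complements the visible stream
        f_t_out = f_t + f_v * (-gate)  # and vice versa, with the opposite sign
        return f_v_out, f_t_out

f_v, f_t = DMAF()(torch.randn(1, 128, 64, 80), torch.randn(1, 128, 64, 80))
```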
3. Datasets and Performance Comparisons
3.1 Datasets
KAIST multispectral pedestrian detection dataset: The KAIST dataset
[1] was collected using a dual visible-light/thermal camera installed on a car roof.
The training set consists of 25,086 pairs of visible-light/thermal images of day and
night scenes, sampled with two-frame skips. The test set consists of 2252 pairs of visible-light/thermal
images sampled with 20-frame skips, of which 1455 pairs were captured in the daytime, and
797 pairs at night. All images are strictly pre-aligned and of the same size
(640×512). The dataset is comprehensive and diverse, containing a wide range
of pedestrian sizes, various poses, and partially or heavily occluded
pedestrians. In particular, the dataset includes images under various illumination
environments (e.g., overexposed, in shadow, at nighttime, at dawn, and at dusk).
UTokyo dataset: The UTokyo dataset [14] includes 7512 groups of images, of which 3740 groups were collected during the day,
and the other 3772 groups at night. All images were collected in
a university scene at one frame per second using color, far-infrared, middle-infrared,
and near-infrared cameras. The dataset contains the following labeled categories:
bike, car, car stop, color cone, and person; 6066 groups of images are unaligned,
while the other 1466 groups are aligned. All images are 320×256.
OSU color-thermal dataset: The OSU color-thermal dataset [15] contains six sequences, three of which were captured under the same conditions. The
dataset contains 8544 pairs of visible-light/thermal images, which are 320×240
and well-aligned.
CVC-14: Visible-FIR Day-Night Pedestrian Sequence Dataset: The CVC-14
dataset [16] was captured with visible-light and far-infrared cameras during the
day and at night, providing one day sequence set and one night sequence set. The training
set contains 3695 pairs of daytime images and 3390 pairs of nighttime images, with
around 1500 annotated pedestrians in each of the daytime and nighttime sets. The testing
set contains 700 pairs of images, with around 2000 annotated pedestrians from daytime
and around 1500 from nighttime.
Table 1. Comparisons of detection accuracy and computation time on the KAIST test set.

| Methods | Miss rate (%), reasonable scale | Miss rate (%), far scale | Miss rate (%), daytime | Miss rate (%), nighttime | CPU time on the PC (s/frame) |
|---|---|---|---|---|---|
| ACF + T + THOG [1] | 54.27 | 91.42 | 64.17 | 63.99 | 0.10 |
| Halfway Fusion [2] | 37.00 | 81.59 | 36.84 | 35.49 | 0.19 |
| Fusion RPN + BDT [3] | 29.83 | 86.64 | 29.58 | 30.35 | - |
| MLF-CNN [4] | 25.65 | 77.05 | 25.22 | 26.60 | 0.15 |
| IATDNN + IAMSS [5] | 26.37 | - | 27.29 | 24.41 | 0.25 |
| IAF R-CNN [6] | 15.73 | - | 14.55 | 18.26 | 0.21 |
| FRPN-Sum + TSS [7] | 26.67 | 75.68 | 26.75 | 25.24 | 0.23 |
| MSDS-RCNN [8] | 11.63 | - | 10.60 | 13.73 | 0.23 |
| GFD-SSD [9] | 27.17 | - | 25.28 | 27.49 | 0.06 |
| ResNet-101 + FPN + Sum [10] | 27.60 | - | 27.92 | 25.77 | 0.13 |
| AR-CNN [23] | 9.34 | - | 9.94 | 8.38 | - |
| CFRM_3 [24] | 10.05 | - | 9.72 | 10.80 | - |
| MBNet [25] | 8.13 | 55.99 | 8.28 | 7.86 | 0.07 |
3.2 Performance Comparisons
This section compares the performance of 13 recently published, well-known multispectral
pedestrian detection methods. This study adopted the widely used log-average miss
rate over the false positives per image (FPPI) range of [$10^{-2}$, $10^{0}$] (as suggested
by Dollár et al. [17]) to evaluate detection performance. An IoU threshold of 0.5 was used for
matching predicted bounding boxes to ground-truth boxes. All comparisons
were evaluated using the KAIST multispectral pedestrian dataset. Table 1 compares the detection results in terms of different scales and different lighting
conditions. The results were examined using the miss rate for the different scales defined
in [1]: the reasonable scale (i.e., more than 55 pixels in height) and the far scale (i.e.,
less than 55 pixels). The results were also examined using the miss rate under different
lighting conditions: daytime and nighttime.
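For concreteness, a small sketch of computing the log-average miss rate from a miss-rate-versus-FPPI curve, following the common practice of averaging in the log domain over nine FPPI points evenly spaced in log-space:

```python
# Log-average miss rate over FPPI in [1e-2, 1e0], in the style of [17].
import numpy as np

def log_average_miss_rate(fppi, miss_rate):
    """fppi and miss_rate are parallel, ascending-FPPI arrays for one curve."""
    ref = np.logspace(-2.0, 0.0, num=9)  # nine reference FPPI points
    # for each reference point, take the miss rate at the largest FPPI <= ref
    mrs = [miss_rate[fppi <= r][-1] if np.any(fppi <= r) else 1.0 for r in ref]
    return np.exp(np.mean(np.log(np.maximum(mrs, 1e-10))))

fppi = np.array([0.005, 0.01, 0.05, 0.1, 0.5, 1.0])
mr = np.array([0.60, 0.45, 0.30, 0.20, 0.12, 0.08])
print(log_average_miss_rate(fppi, mr))
```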
From Table 1, we can see that the trend in the accuracy for nighttime and daytime images is different
for each method. For daytime, the illumination conditions in the environment are usually
good. In this case, the visible-light image contains more useful information for pedestrian
detection, while the thermal image contains less useful information. On the other
hand, for nighttime, the illumination conditions in the environment are adverse. In
this case, the thermal image contains more useful information about pedestrians, because
the thermal camera captures the radiated heat of the objects in the scene, which does
not require external lighting sources. However, the visible-light image contains little
useful information because the visible-light camera relies on external lighting. During
the training process of a deep neural network, the weights for fusing visible-light
and thermal images are automatically learned, and will be fixed after training is
finished. However, fusing weights for daytime or nighttime images should be adaptive
due to the changeable lighting conditions. That is the reason the trend in the accuracy
performance for nighttime and daytime images is different for each method.
Considering the above-mentioned situation, some work, such as IATDNN + IAMSS
[5] and IAF R-CNN [6], adopted an illumination-aware subnetwork to assign the fusion weights adaptively.
Although the miss rate decreased compared to previous methods [1-4], the overall performance was still not satisfactory. As shown in Table 1, for the reasonable-scale
case that contains both day and night images, the miss rate was 15.73% for the IAF
R-CNN [6], for example. However, this miss rate significantly improved to 11.63% with the MSDS-RCNN
[8]. This result demonstrates that the simultaneous learning of semantic segmentation
and object detection tasks is helpful for improving detection performance. However,
the CPU time for the MSDS-RCNN was 0.23 s/f, which is not real-time. Therefore, one-stage-based
methods were proposed, such as GFD-SSD [9], ResNet-101 + FPN + Sum [10], CFRM_3 [24], and MBNet [25]. Although the CPU time decreased, the miss rates of GFD-SSD [9] and ResNet-101 + FPN + Sum [10] were not satisfactory. AR-CNN [23] and MBNet [25] found that visible-light and thermal data that are not strictly aligned can degrade
the learning accuracy of deep neural networks. They proposed additional strategies
to compensate for the inconsistency between visible-light and thermal images, and
significantly improved detection accuracy. For example, the state-of-the-art MBNet
method [25] achieved a miss rate of 8.13% for a reasonable-scale case with a CPU time at 0.07
s/f, as shown in Table 1.
Table 1 also compares computation times, which were measured using the same machine: a standard
computer under Ubuntu 16.04 with a Core i7-4790k 4.0 GHz CPU and 32 GB of random access
memory. The graphics processing unit (GPU) used for the experiment was the NVIDIA
Titan X.
4. Challenges and Future Scope
Though promising results have been achieved in the multispectral pedestrian detection
area, there are still open challenges, such as detecting small pedestrians,
information lost when fusing visible-light and thermal images with different
properties, detecting occluded pedestrians, and the trade-off between
detection accuracy and speed. In the following, these challenges and
their possible solutions are discussed.
4.1 Small-pedestrian Detection
Existing multispectral pedestrian detectors perform well when detecting large
pedestrians. However, they are likely to miss small pedestrians. Accurately
detecting objects over a wide range of sizes is a crucial requirement in pattern recognition.
In a complicated environment containing pedestrians of various types and sizes,
existing methods usually try to extract discriminative features from regions of interest
using a fixed scale for the corresponding receptive field. However, a single fixed
receptive field can hardly cover the variously scaled objects in a real scene.
Besides, small pedestrians always have ambiguous appearances and blurred contours,
which makes it hard to discriminate them from backgrounds and other overlapping
objects. Large pedestrians contain rich information for the detection task, whereas
small pedestrians are difficult to recognize.
Many methods have been proposed recently to improve feature extraction for small-scale
objects. Those methods aim to bring in more contextual information and to increase the
spatial resolution of feature maps [18,19]. They typically add deconvolution layers, a
simple and effective strategy that enlarges the receptive field of the filters and
brings in more contextual information without adding redundant
parameters or computation time. Integrating the deconvolution technique into
a multispectral fusion network for small-sized object detection can be a future research
topic.
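A minimal sketch of this strategy: a transposed convolution upsamples a coarse, semantically strong feature map so it can be combined with a finer one. The layer sizes are illustrative.

```python
# Deconvolution (transposed convolution) to recover spatial resolution for
# small-object detection, as in DSSD-style architectures [18].
import torch
import torch.nn as nn

coarse = torch.randn(1, 256, 16, 20)   # deep, low-resolution feature map
fine = torch.randn(1, 128, 32, 40)     # shallower, high-resolution map

deconv = nn.ConvTranspose2d(256, 128, kernel_size=2, stride=2)
upsampled = deconv(coarse)             # shape (1, 128, 32, 40)
enriched = fine + upsampled            # fine features with added context
```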
4.2 Fusion of Visible-light and Thermal Images
The difference in resolution between a visible-light image and an infrared image
is large, and therefore, information can be lost during fusion.
In addition, infrared images may lose pedestrian thermal information in hot weather.
In this case, the fused image will contain noise, which will cause the detection rate
to decrease.
We can consider enhancing the features of infrared images before fusing them with
visible-light images. One way to enhance infrared images is through
saliency detection. Recent research has shown desirable saliency-detection
performance using a CNN [20]. The objective of saliency detection is to distinguish salient targets from their surroundings.
Saliency detection can reduce the complexity of the background so vital targets can
be easily detected, which is helpful in enhancing the region of interest (the pedestrian
region) in thermal images. It is natural to detect the vital pixels belonging to salient
regions that might contain pedestrians by learning informative features with
saliency detection methods. To this end, it is worth exploring how to take advantage
of saliency detection to enhance thermal images.
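A hedged sketch of this direction, with a placeholder predictor standing in for any CNN-based saliency model from the literature surveyed in [20]:

```python
# Boost likely pedestrian regions in the thermal image with a saliency map
# before fusion. `saliency_net` is a toy placeholder, not a real model.
import torch
import torch.nn as nn

saliency_net = nn.Sequential(             # placeholder saliency predictor
    nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(8, 1, 3, padding=1), nn.Sigmoid(),
)

thermal = torch.rand(1, 1, 512, 640)
saliency = saliency_net(thermal)          # values in [0, 1]
enhanced = thermal * (1.0 + saliency)     # amplify salient (pedestrian) regions
```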
4.3 Partially or Heavily Occluded Pedestrian Detection
Detecting occluded pedestrians is another challenging task for all pedestrian
detectors, and it is essential to resolve this limitation. This common challenge has been
studied by many researchers [21]. A pedestrian can be regarded as an integration of different body parts. We can use
a CNN to learn the features of each body part, and produce a corresponding score.
A low score denotes that the body part is occluded, whereas a high score denotes a
body part that is not occluded. During training, the features of each body part are
combined. The combined features are adopted to classify and localize the pedestrian.
The issues to resolve in the future are how many body parts to select
and how to efficiently combine the features of each body part in low-resolution IR
images. Additionally, we can address the occlusion issue through the loss function
by taking advantage of supervised learning: the loss can be designed to push
proposals away from overlapping ground-truth bounding boxes other than their target.
The aim is to force each proposal to focus on its real object and to avoid false
objects, so that the mis-detection rate due to occlusion can be reduced.
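A hedged sketch of such a part-based scoring head, in the spirit of [21]: an RoI feature map is split into vertical body-part bands, each part's visibility is scored, and the part features are weighted by those scores before classification. The number of parts, the pooling, and the weighting scheme are assumptions for illustration.

```python
# Part-aware classification head: low part scores (occluded parts) are
# down-weighted before the combined features are classified.
import torch
import torch.nn as nn

class PartAwareHead(nn.Module):
    def __init__(self, channels=256, n_parts=3):
        super().__init__()
        self.n_parts = n_parts
        self.part_score = nn.Linear(channels, 1)        # visibility per part
        self.classifier = nn.Linear(channels * n_parts, 2)

    def forward(self, roi):                             # roi: (N, C, H, W)
        parts = roi.chunk(self.n_parts, dim=2)          # e.g., head/torso/legs
        feats = []
        for p in parts:
            f = p.mean(dim=(2, 3))                      # pooled part feature
            s = torch.sigmoid(self.part_score(f))       # low score = occluded
            feats.append(f * s)                         # down-weight hidden parts
        return self.classifier(torch.cat(feats, dim=1))

logits = PartAwareHead()(torch.randn(4, 256, 12, 6))
```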
4.4 The Trade-off between Detection Accuracy and Speed
For autonomous driving, accurate and real-time pedestrian detection is necessary.
Balancing the trade-off between accuracy and computation time is a common problem
in the object detection area. A recent work [22] proposed a lightweight auxiliary network based on the SSD; combined with the existing
bottom-up and top-down networks, a bidirectional network was formed. Experiments
show that these two strategies can improve accuracy while saving computation time. Exploring
lightweight CNNs for multispectral pedestrian detection will be attractive in the future, because
much of the information in a CNN is redundant.
5. Conclusion
This paper reviewed multispectral pedestrian detection methods proposed in recent
years, dividing them into three categories and reviewing each of them. Then, four
multispectral pedestrian detection datasets were explained, and the performance of
13 well-known, recently published multispectral pedestrian detection methods was
compared. Finally, several current problems were discussed, and future research directions
for multispectral pedestrian detection were suggested.
ACKNOWLEDGMENTS
The author thanks Yunfan Chen for her great help during the whole process.
REFERENCES
Hwang S., Park J., Kim N., et al., June 2015, Multispectral pedestrian detection:
benchmark dataset and baseline., Proc. IEEE Conf. Computer Vision and Pattern Recognition,
Boston, MA, USA, pp. 1037-1045
Liu J., Zhang S., Wang S., et al. , September 2016, Multispectral deep neural networks
for pedestrian detection., Proc. British Machine Vision Conf., York, UK, pp. 1-13
König D., Adam M., Jarvers C., et al., July 2017, Fully convolutional region proposal
networks for multispectral person detection., Proc. IEEE Workshop on Computer Vision
and Pattern Recognition, Honolulu, HI, USA, pp. 243-250
Chen Y., Xie H., Shin H., 2018, Multi-layer fusion techniques using a CNN for multispectral
pedestrian detection., IET Computer Vision, Vol. 12, No. 8, pp. 1179-1187
Guan D., Cao Y., Yang J., Cao Y., Yang M.Y., 2019, Fusion of multispectral data through
illumination-aware deep neural networks for pedestrian detection., Information Fusion,
Vol. 50, pp. 148-157
Li C., Song D., Tong R., Tang M., 2019, Illumination-aware faster R-CNN for robust
multispectral pedestrian detection., Pattern Recognition, Vol. 85, pp. 161-171
Guan D., Cao Y., Yang J., Cao Y., Tisse C.L., 2018, Exploiting fusion architectures
for multispectral pedestrian detection and segmentation., Applied optics, Vol. 57,
No. 18, pp. d108-D116
Li C., Song D., Tong R., Tang M., 2018, Multispectral pedestrian detection via simultaneous
detection and segmentation., arXiv preprint arXiv:1808.04818.
Zheng Y., Izzat I.H., Ziaee S., 2019, GFD-SSD: Gated Fusion Double SSD for Multispectral
Pedestrian Detection., arXiv preprint arXiv:1903.06999.
Pei D., Jing M., Liu H., Jiang L., Sun F., 2020, A Fast RetinaNet Fusion Framework
for Multi-spectral Pedestrian Detection., Infrared Physics & Technology
Ren S., He K., Girshick R., Sun J., 2015, Faster r-cnn: Towards real-time object detection
with region proposal networks., In Advances in neural information processing systems,
pp. 91-99
Lin T.Y., Goyal P., Girshick R., He K., Dollár P., 2017, Focal loss for dense object
detection., In Proceedings of the IEEE international conference on computer vision,
pp. 2980-2988
Lin T.Y., Dollár P., Girshick R., He K., Hariharan B., Belongie S., 2017, Feature
pyramid networks for object detection., In Proceedings of the IEEE conference on computer
vision and pattern recognition, pp. 2117-2125
Takumi K., Watanabe K., Ha Q., et al., October 2017, Multispectral object detection
for autonomous vehicles., Proc. Thematic Workshops of ACM Multimedia, Mountain View, CA, USA
Davis J.W., Sharma V., 2007, Background-subtraction using contour-based fusion of
thermal and visible imagery, Comput. Vis. Image Underst., Vol. 106, No. 2-3, pp. 162-182
CVC-14: Visible-FIR Day-Night Pedestrian Sequence Dataset
Dollár P., Wojek C., Schiele B., et al. , 2012, Pedestrian detection: an evaluation
of the state of the art, IEEE Trans. Pattern Anal. Mach. Intell., Vol. 34, No. 4,
pp. 743-761
Fu C-Y., Liu W., Ranga A., Tyagi A., Berg A.C., 2017, DSSD: deconvolutional single shot detector.,
arXiv preprint arXiv:1701.06659
Chen L-C, Papandreou G, Kokkinos I, Murphy K, Yuille AL, 2017, DeepLab: semantic image
segmentation with deep convolutional nets, atrous convolution, and fully connected
CRFs., IEEE Trans Pattern Anal Mach Intell 40, Vol. 4, pp. 834-848.
Wang W., Lai Q., Fu H., Shen J., 2019, Salient object detection in the deep learning
era: An in-depth survey., arXiv preprint arXiv:1904.09146.
Zhang S., Wen L., Bian X., Lei Z., Li S.Z., 2018, Occlusion-aware r-cnn: detecting
pedestrians in a crowd., In Proceedings of the European Conference on Computer Vision,
pp. 637-653
Wang T., Anwer R.M., Cholakkal H., Khan F.S., Pang Y., Shao L., 2019, Learning rich
features at high-speed for single-shot object detection., In Proceedings of the IEEE
International Conference on Computer Vision, pp. 1971-1980
Zhang L., Zhu X.Y., Chen X.Y., Yang X., Lei Z., Liu Z.Y., 2019, Weakly aligned cross-modal
learning for multispectral pedestrian detection., In Proceedings of the IEEE International
Conference on Computer Vision., pp. 5127-5137
Zhang H., Fromont E., Lefevre S., Avignon B., 2020, Multispectral Fusion for Object
Detection with Cyclic Fuse-and-Refine Blocks., In Proceedings of the IEEE International
Conference on Image Processing.
Zhou K.L., Chen L.S., Cao X., 2020, Improving Multispectral Pedestrian Detection by
Addressing Modality Imbalance Problems., arXiv preprint arXiv:2008.03043.
Author
Yuting Li is currently working in the Department of Electronic Systems Engineering,
Hanyang University, South Korea. His research interests include autonomous driving,
convolutional neural networks, and deep learning.