(Department of Electronic Engineering, Hanyang University, Korea)

Autonomous driving, Multispectral pedestrian detection, Information fusion, Convolutional neural network, Deep learning

1. Introduction

Pedestrian detection is a long-studied task with important applications in fields such as robotics, surveillance, and autonomous driving. Existing pedestrian detectors that rely on visible-light cameras work reliably only in the daytime. Detecting pedestrians remains difficult under adverse illumination, such as at nighttime, in low-light exposures, and in shadows.

To resolve these problems, fusing visible-light and thermal images is a helpful solution, because it generates diverse and complementary features that serve the detection task under challenging circumstances. A visible-light image usually contains semantic and color information crucial to detection, but is sensitive to lighting. A thermal infrared image is formed from the heat radiated by objects in the scene, and therefore requires no external light source. However, thermal images typically suffer from low resolution and a lack of object shape and texture details. By integrating visible-light and thermal images, these different but complementary characteristics can be fused to achieve a more robust perception of the environment. As shown in Fig. 1(a), the color and details of pedestrians' clothing are difficult to distinguish in thermal images but can be easily obtained from visible-light images. As shown in Fig. 1(b), pedestrian visibility in a visible-light image is restricted under low-light conditions, whereas the intensity of pedestrians is enhanced in thermal images. When the background is bright, the visual features of pedestrians provided by visible-light images are more distinct from the background, whereas human silhouettes in thermal images are ambiguous, as shown in Fig. 1(c).

To take better advantage of the complementary characteristic features of visible-light and thermal images for pedestrian detection, it is crucial to explore how to integrate visible-light and thermal sensors efficiently. In recent years, pedestrian detection based on the fusion of visible-light and thermal cameras has improved considerably under various lighting conditions [1-10, 23-25]. These studies showed that visible-light and thermal-image fusion helps improve pedestrian-detection accuracy in a variety of lighting environments, both daytime and nighttime. Fig. 2 illustrates how fusion can improve detection performance. Fig. 2(a) has one false negative and five false positives, resulting in six errors. After fusion, however, two persons were detected without error, as shown in Fig. 2(b).

The rest of this paper is organized as follows. Section 2 presents a review of the well-known multispectral pedestrian-detection methods. Section 3 provides an overview of several multispectral object-detection benchmarks, and compares the performance of well-known multispectral detectors. Section 4 discusses a few challenges in detection, and puts forward some prospects for future study. Finally, conclusions are given in Section 5.

Fig. 1. Visualization of the complementary characteristics of visible images (top) and thermal images (bottom) in (a) well-illuminated conditions, (b) poor illumination, (c) with bright backgrounds.
Fig. 2. Effects of pedestrian detection with (a) only a visible-light image, (b) visible-light and thermal images.

2. Literature Review

Multispectral pedestrian detection approaches can be divided into three categories: hand-crafted approaches, convolutional neural network (CNN)-based one-stage approaches, and CNN-based two-stage approaches. Hand-crafted approaches require the features that describe images to be designed manually. Conversely, CNN-based approaches extract features automatically, which avoids manual feature extraction and selection.

2.1 Hand-crafted Methods for Multispectral Pedestrian Detection

In [1], a multispectral pedestrian detector based on the Aggregated Channel Feature (ACF) was developed. The authors extended ACF with two channels computed from the thermal image: the thermal intensity, T, and a Thermal Histogram of Oriented Gradients (THOG). ACF + T + THOG is then adopted as a new combined channel feature, and an AdaBoost classifier is used to predict pedestrians. Experiments in this work demonstrate that the fusion of multispectral data significantly improves detection accuracy. However, this conventional approach relies on manually designed features, which makes it difficult to extract features discriminative enough to describe pedestrians.

2.2 CNN-based Two-stage Multispectral Pedestrian Detection Methods

Unlike the approaches based on hand-crafted channel features, the CNN-based approaches have the advantage that they learn to extract informative features automatically. The CNN-based two-stage multispectral pedestrian detectors are typically built upon the classic two-stage CNN detectors: the fast R-CNN and the faster R-CNN. In a two-stage detector, the first stage generates a group of candidate proposals, while the second stage screens the candidate proposals to accurately locate objects and predict their class labels.

Liu et al. [2] explored the best position at which to fuse the visible-light and thermal branches of a faster R-CNN [11]. As shown in Fig. 3, the authors proposed four fusion architectures named Early Fusion, Halfway Fusion, Late Fusion, and Score Fusion, all of which build on the faster R-CNN. The experimental results show that Halfway Fusion, which fuses the two branches in the middle layers, performs the best.

Based on [2], König et al. [3] developed a Fusion RPN and adopted a boosted decision tree (BDT) [3] to replace the original classifier used by the faster R-CNN. RPN+BDT uses a two-stream (visible-light stream and thermal stream) RPN to generate proposals, and utilizes the BDT to classify the proposals. Before applying the BDT, they take features from the separate layers in the visible-light and thermal streams and from the fused layer. In their work, the fused features are picked from the conv4_3 layers, whereas in [2], the fused features are picked from the conv5_3 layers. The conv4_3 layers proved to be the most promising source of deep features for the BDT. Like [2], they use only one concatenation layer for fusion, which leaves significant room for improvement.

Chen et al. [4] proposed a multilayer fusion CNN that further improved detection accuracy. They introduced a summation fusion strategy, instead of concatenation fusion, to fuse layers from the visible-light and thermal streams. The experimental results demonstrated that summation fusion is more effective than concatenation fusion. For detection, they extract features from all three feature maps.
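The difference between the two fusion operators can be illustrated with a minimal sketch. It assumes toy feature maps stored as nested lists (channels x height x width); the function names `concat_fusion` and `sum_fusion` are hypothetical and are not taken from [4].

```python
from typing import List

FeatureMap = List[List[List[float]]]  # channels x height x width


def concat_fusion(visible: FeatureMap, thermal: FeatureMap) -> FeatureMap:
    """Stack the channels of both streams; the channel count doubles."""
    return visible + thermal


def sum_fusion(visible: FeatureMap, thermal: FeatureMap) -> FeatureMap:
    """Add the two streams element-wise; the channel count is unchanged."""
    if len(visible) != len(thermal):
        raise ValueError("summation fusion needs equal channel counts")
    return [
        [[v + t for v, t in zip(v_row, t_row)] for v_row, t_row in zip(v_ch, t_ch)]
        for v_ch, t_ch in zip(visible, thermal)
    ]


# Toy inputs (assumed values): one channel, 2x2 spatial size per modality.
visible = [[[1.0, 2.0], [3.0, 4.0]]]
thermal = [[[0.5, 0.5], [0.5, 0.5]]]

fused_concat = concat_fusion(visible, thermal)
fused_sum = sum_fusion(visible, thermal)
print(len(fused_concat), len(fused_sum))
```

Concatenation leaves it to the following layers to learn how to mix the two modalities, at the cost of more channels, whereas summation keeps the layer width fixed.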

All the above methods fuse visible-light information and thermal information using equal weights, which is not robust enough under changeable lighting conditions. In [5] and [6], adaptive weighting strategies were proposed for fusing multispectral images more effectively. An illumination-aware scheme learns the weights for fusing visible-light and thermal images adaptively. For example, during the daytime, a higher weight is given to the visible-light channel, whereas at night, a higher weight is given to the thermal channel. This kind of adaptive weighting strategy has proven useful. In a different direction, FRPN-Sum + TSS [7] and MSDS-RCNN [8] design a subnetwork to process hard negative samples, which further improves detection accuracy. These methods jointly learn the semantic segmentation task and the pedestrian detection task in two subnets: a proposal generation network and a classification network. Semantic segmentation has been shown to boost the performance of object detection, since the segmentation task forces the network to generate stronger high-level semantic features.
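The idea of illumination-aware weighting can be sketched as a convex combination of the two modality scores. This is an illustrative sketch only: the function name and the `day_weight` parameter are hypothetical, and in [5] and [6] the weight comes from a learned subnetwork that is not modelled here.

```python
def illumination_weighted_score(
    visible_score: float, thermal_score: float, day_weight: float
) -> float:
    """Blend two detection scores with an illumination-dependent weight.

    day_weight is assumed to lie in [0, 1]; values near 1 mean a well-lit
    scene (trust the visible-light stream), values near 0 mean a dark scene
    (trust the thermal stream).
    """
    if not 0.0 <= day_weight <= 1.0:
        raise ValueError("day_weight must be in [0, 1]")
    return day_weight * visible_score + (1.0 - day_weight) * thermal_score


# Assumed toy scores for one candidate box.
day_score = illumination_weighted_score(0.8, 0.4, day_weight=0.75)
night_score = illumination_weighted_score(0.2, 0.6, day_weight=0.25)
print(round(day_score, 3), round(night_score, 3))
```

With equal weights (`day_weight=0.5`), the function reduces to the fixed averaging that the earlier methods effectively use.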

Zhang et al. [23] pointed out that the existing multispectral data are not in strict alignment, which makes it hard for deep learning-based methods to accurately fuse features from two weakly aligned modalities. To solve this problem, they developed an Aligned Region CNN (AR-CNN). A Region Feature Alignment (RFA) module was designed to compute the position shift used for adaptive alignment of the region features of the two modalities. They also reweight features to highlight more useful features while suppressing less useful features by developing a new multimodal fusion approach. A novel Region of Interest (RoI) jitter strategy was also proposed to improve the robustness of the detector.

Fig. 3. Four architectures to fuse visible and thermal images for multispectral pedestrian detection, explored in [2].

2.3 CNN-based One-stage Multispectral Pedestrian Detection Methods

Although the above-mentioned two-stage detectors achieve good performance, their computation time is high. In contrast, one-stage detectors omit the region proposal generation procedure and encapsulate all computation in a single pass, which makes them superior to two-stage detectors in terms of computation time.

The gated fusion double single shot detector (GFD-SSD) [9] adopts two SSDs to process the input visible-light and thermal images, aiming for a better balance between detection accuracy and computation time. In this work, two kinds of Gated Fusion Unit (GFU) are proposed to integrate the feature maps from the middle layers of the two SSDs. The critical function of the proposed GFU is to adaptively adjust how the feature maps of the two modalities are combined. In addition, the authors proposed one gated fusion architecture and four mixed fusion architectures (named Mixed_Even, Mixed_Odd, Mixed_Early, and Mixed_Late, depending on which layers are selected to use the GFUs) on the feature pyramid to integrate two SSDs in different modalities. Experiments show that the GFD-SSD and its mixed fusion architectures achieved both competitive accuracy and better inference runtime.
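The general notion of a gate that adaptively balances two modalities can be sketched as follows. This is a generic sigmoid-gated blend written for illustration, not the GFU design of [9]; the function `gated_blend` and its parameters `w_visible`, `w_thermal`, and `bias` are hypothetical stand-ins for learned values.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_blend(
    visible: List[float],
    thermal: List[float],
    w_visible: float,
    w_thermal: float,
    bias: float,
) -> List[float]:
    """Blend two feature vectors with a per-element gate in (0, 1).

    The gate is computed from both inputs, so the mixing ratio changes
    with the feature values instead of being fixed in advance.
    """
    if len(visible) != len(thermal):
        raise ValueError("feature vectors must have the same length")
    fused = []
    for v, t in zip(visible, thermal):
        gate = sigmoid(w_visible * v + w_thermal * t + bias)
        fused.append(gate * v + (1.0 - gate) * t)
    return fused


# With zero weights and zero bias the gate is 0.5, i.e. plain averaging.
print(gated_blend([1.0, 3.0], [3.0, 1.0], 0.0, 0.0, 0.0))
```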

In [10], a RetinaNet-based [12] fusion architecture was presented. The authors explored three kinds of fusion network based on the feature pyramid network (FPN) [13], and adopted the focal-loss function of RetinaNet to achieve more accurate detection. However, the architecture is complicated and contains redundant parameters, so its performance was not satisfactory and its central processing unit (CPU) time is high.
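For reference, the focal loss of [12] for a single binary prediction can be sketched as below. The defaults `alpha=0.25` and `gamma=2.0` follow the values commonly reported for RetinaNet; whether [10] uses the same settings is an assumption, and the clamping constant is added here only for numerical safety.

```python
import math


def focal_loss(p: float, y: int, alpha: float = 0.25, gamma: float = 2.0) -> float:
    """Focal loss for one binary prediction.

    p is the predicted probability of the positive class and y is the
    ground-truth label (1 for pedestrian, 0 for background).
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    p_t = min(max(p_t, 1e-12), 1.0)  # avoid log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = focal_loss(0.9, 1)  # confident, correct prediction
hard = focal_loss(0.1, 1)  # confident, wrong prediction
print(easy < hard)
```

The factor `(1 - p_t) ** gamma` shrinks the loss of well-classified examples, so the many easy background samples seen by a one-stage detector contribute little to the gradient.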

Zhang et al. proposed a Cyclic Fuse-and-Refine Module (CFRM) [24] to be implemented on a feature fusion single shot multibox detector (FSSD). The proposed CFRM can leverage the complementary balance existing in multispectral features, and can be incorporated into any network.

Following the AR-CNN, Zhou et al. proposed a Modality Balance Network (MBNet) [25] to further resolve the problem of modality imbalance. The MBNet is built by extending the SSD, in which a Differential Modality Aware Fusion (DMAF) module is designed so that the two modalities complement each other. In addition, complementary features are selected and aligned adaptively by the proposed illumination-aware feature alignment module.

3. Datasets and Performance Comparisons

3.1 Datasets

$\textbf{KAIST multispectral pedestrian detection dataset:}$ The KAIST dataset [1] was collected using a dual visible-light/thermal camera installed on a car roof. The dataset consists of 25,086 pairs of day and night scenes as visible-light/thermal training images with two-frame skips. The test set consists of 2252 pairs of visible-light/thermal images with 20-frame skips, of which 1455 pairs were captured in the daytime, and 797 pairs captured at night. All images are strictly pre-aligned and of the same size (640${\times}$512). The dataset is comprehensive and diverse, containing a wide range of pedestrian sizes, various poses for pedestrians, and partially or heavily occluded pedestrians. In particular, the dataset includes images under various illumination environments (e.g., overexposed, in shadow, at nighttime, at dawn, and at dusk).

$\textbf{UTokyo dataset:}$ The UTokyo dataset [14] includes 7512 groups of images, of which 3740 groups were collected during the day and the other 3772 groups at night. All images were collected in a university scene at one frame per second by using color, far-infrared, middle-infrared, and near-infrared cameras. The dataset contains the following labeled categories: bike, car, car stop, color cone, and person. Of the image groups, 6066 are unaligned, while the other 1466 are in alignment. All images are 320${\times}$256.

$\textbf{OSU color-thermal dataset:}$ The OSU color-thermal dataset [15] contains six sequences, in which three were captured under the same conditions. The dataset contains 8544 pairs of visible-light/thermal images, which are 320${\times}$240 and well-aligned.

$\textbf{CVC-14: Visible-FIR Day-Night Pedestrian Sequence Dataset:}$ The CVC-14 [16] dataset images were taken using visible-light and far-infrared cameras during the day and at night, providing one daytime sequence set and one nighttime sequence set. The training set contains 3695 pairs of daytime images and 3390 pairs of nighttime images, which include 1500 annotated pedestrians in both daytime and nighttime images. The testing set contains 700 pairs of images, with around 2000 annotated pedestrians from daytime and around 1500 pedestrians from night.

Table 1. Comparisons of Detection Accuracy and Computation Time on the KAIST Test Set.

Columns: Method; Miss rate (%) in terms of different scales (Reasonable scale, Far scale); Miss rate (%) in terms of different lighting conditions; CPU time on the PC (seconds per frame).

Methods listed: ACF + T + THOG [1]; Halfway Fusion [2]; Fusion RPN+BDT [3]; FRPN-Sum + TSS [7]; ResNet-101 + FPN +Sum [10]; AR-CNN [23]; CFRM_3 [24]; MBNet [25].

3.2 Performance Comparisons

This section compares the performance of 13 recently published, well-known multispectral pedestrian detection methods. This study adopted the widely used log-average miss rate over a false-positives-per-image (FPPI) range of $[10^{-2}, 10^{0}]$ (as suggested by Dollár et al. [17]) to evaluate detection performance. A predicted bounding box is matched to a ground-truth box when their intersection over union (IoU) is larger than 0.5. Here, all comparisons were evaluated using the KAIST multispectral pedestrian dataset. Table 1 compares the detection results in terms of different scales and different lighting conditions. The results were examined using the miss rate for the different scales defined in [1]: the reasonable scale (i.e., more than 55 pixels in height), and the far scale (i.e., less than 55 pixels). The results were also examined using the miss rate for different lighting conditions: daytime and nighttime.
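The evaluation metric can be made concrete with a short sketch. It assumes boxes given as `(x1, y1, x2, y2)` corner coordinates and a miss-rate curve already computed as `(FPPI, miss rate)` pairs; the nine log-spaced reference points and the fallback miss rate of 1.0 follow the commonly used protocol of [17], and the function names are hypothetical.

```python
import math
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def log_average_miss_rate(
    curve: Sequence[Tuple[float, float]], num_points: int = 9
) -> float:
    """Geometric mean of the miss rate at log-spaced FPPI values in [1e-2, 1e0]."""
    points = sorted(curve)
    refs = [10 ** (-2 + 2 * i / (num_points - 1)) for i in range(num_points)]
    samples = []
    for ref in refs:
        # Miss rate at the largest FPPI not exceeding the reference point;
        # if the curve never reaches such a low FPPI, count everything as missed.
        reachable = [mr for fppi, mr in points if fppi <= ref]
        mr = reachable[-1] if reachable else 1.0
        samples.append(max(mr, 1e-10))
    return math.exp(sum(math.log(s) for s in samples) / len(samples))


# Assumed toy data: two overlapping boxes and a short miss-rate curve.
overlap = iou((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0))
is_match = overlap > 0.5
toy_curve = [(0.005, 0.40), (0.05, 0.25), (0.5, 0.10)]
print(round(overlap, 3), is_match, round(log_average_miss_rate(toy_curve), 3))
```

Averaging in log space gives each decade of FPPI equal influence, so a detector cannot score well only by performing well at high false-positive rates.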

From Table 1, we can see that the trend in accuracy for nighttime and daytime images differs from method to method. In the daytime, the illumination conditions are usually good. In this case, the visible-light image contains more useful information for pedestrian detection, while the thermal image contains less. At nighttime, on the other hand, the illumination conditions are adverse. In this case, the thermal image contains more useful information about pedestrians, because the thermal camera captures the heat radiated by the objects in the scene and does not require external light sources, whereas the visible-light image contains little useful information because the visible-light camera relies on external lighting. During the training of a deep neural network, the weights for fusing visible-light and thermal images are learned automatically and are fixed once training is finished. However, the fusion weights for daytime and nighttime images should adapt to the changing lighting conditions. This is why the trend in accuracy for nighttime and daytime images differs from method to method.

Considering the above-mentioned situation, some work, such as IATDNN + IAMSS [5] and IAF R-CNN [6], adopted an illumination-aware subnetwork to assign the fusion weights adaptively. Although the miss rate decreased compared to previous methods [1-4], the overall performance was still not satisfactory. As shown in Table 1, for the reasonable-scale case that contains both day and night images, the miss rate was 15.73% for the IAF R-CNN [6], for example. However, the miss rate fell to 11.63% with the MSDS-RCNN [8]. This result demonstrates that the simultaneous learning of semantic segmentation and object detection tasks is helpful for improving detection performance. However, the CPU time for the MSDS-RCNN was 0.23 s/f, which is not real-time. Therefore, one-stage methods were proposed, such as GFD-SSD [9], ResNet-101 + FPN +Sum [10], CFRM_3 [24], and MBNet [25]. Although the CPU time decreased, the miss rate for GFD-SSD [9] and ResNet-101 + FPN +Sum [10] was not satisfactory. AR-CNN [23] and MBNet [25] found that visible-light and thermal data that are not strictly aligned can degrade the learning accuracy of deep neural networks. They proposed additional strategies to compensate for the inconsistency between visible-light and thermal images, and significantly improved detection accuracy. For example, the state-of-the-art MBNet method [25] achieved a miss rate of 8.13% for the reasonable-scale case with a CPU time of 0.07 s/f, as shown in Table 1.
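The per-frame times quoted above are easier to compare with a camera frame rate when converted to frames per second. The sketch below only performs this conversion; the helper name is hypothetical, and the frame rate required for real-time operation depends on the application and is not specified here.

```python
def frames_per_second(seconds_per_frame: float) -> float:
    """Convert a per-frame processing time into a throughput."""
    if seconds_per_frame <= 0:
        raise ValueError("seconds_per_frame must be positive")
    return 1.0 / seconds_per_frame


# Per-frame times reported in the text for two detectors.
reported = [("MSDS-RCNN [8]", 0.23), ("MBNet [25]", 0.07)]
for name, spf in reported:
    print(f"{name}: {frames_per_second(spf):.1f} frames per second")
```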

Table 1 also compares computation times, which were measured using the same machine: a standard computer under Ubuntu 16.04 with a Core i7-4790k 4.0 GHz CPU and 32 GB of random access memory. The graphics processing unit (GPU) used for the experiment was the NVIDIA Titan X.

4. Challenges and Future Scope

Though promising results have been achieved in the multispectral pedestrian detection area, there are still open challenges, such as detection of small-sized pedestrians, information loss during fusion of visible-light and thermal images with different properties, occluded-pedestrian detection, and the trade-off between detection accuracy and speed. In the following, these challenges and their possible solutions are discussed.

4.1 Small-pedestrian Detection

Existing multispectral pedestrian detectors perform well when detecting large pedestrians. However, they are likely to miss small pedestrians. Accurately detecting objects across a wide range of sizes is a crucial requirement in pattern recognition. In a complicated environment that includes various types and sizes of pedestrian, existing methods usually try to extract discriminative features from regions of interest by using a fixed scale for the corresponding receptive field. However, it is difficult to cover objects of various scales in a real scene with a single receptive field. In addition, small pedestrians often have ambiguous appearances and blurred contours, which makes it hard to discriminate them from backgrounds and other overlapping objects. Large pedestrians contain rich information for the detection task, whereas small pedestrians are difficult to recognize.

Many methods have been proposed recently to improve feature extraction from small-scale objects. Those methods aim to bring in more context information and to increase the spatial resolution of feature maps [18,19]. They typically add deconvolution layers, a simple yet effective strategy that enlarges the receptive field of filters and brings in more contextual information without adding many redundant parameters or much computation time. Integrating the deconvolution technique into a multispectral fusion network for small-sized object detection can be a future research topic.
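How a deconvolution (transposed convolution) layer increases the spatial resolution of a feature map can be seen from its output-size formula. The sketch below uses the standard formula for a dilation of 1; the function name and the example layer settings are assumptions chosen for illustration.

```python
def deconv_output_size(
    in_size: int,
    kernel: int,
    stride: int,
    padding: int = 0,
    output_padding: int = 0,
) -> int:
    """Spatial output size of a transposed convolution (dilation of 1)."""
    return (in_size - 1) * stride - 2 * padding + kernel + output_padding


# Two common ways to double the resolution of a feature map.
print(deconv_output_size(10, kernel=2, stride=2))
print(deconv_output_size(19, kernel=4, stride=2, padding=1))
```

A stride of 2 doubles the width and height, so a coarse, semantically strong map can be brought back to the resolution at which small pedestrians still occupy several cells.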

4.2 Fusion of Visible-light and Thermal Images

The difference in resolution between a visible-light image and an infrared image is large, and therefore, information can be lost when performing information fusion. In addition, infrared images may lose pedestrian thermal information in hot weather. In this case, the fused image will contain noise, which will cause the detection rate to decrease.

We can consider enhancing the features of infrared images before fusing them with visible-light images. One way to enhance infrared images is through saliency detection. Recent research has shown desirable performance from saliency detection using a CNN [20]. The objective of saliency detection is to distinguish salient targets from their surroundings. It can reduce the complexity of the background so that vital targets are detected more easily, which helps enhance the region of interest (the pedestrian region) in thermal images. By learning informative features, saliency detection methods can identify the pixels that belong to salient regions likely to contain pedestrians. To this end, it is worth exploring how to take advantage of saliency detection to enhance thermal images.

4.3 Partially or Heavily Occluded Pedestrian Detection

Detecting occluded pedestrians is another challenging task for all pedestrian detectors, and it is essential to resolve this limitation. This common challenge has been studied by many researchers [21]. A pedestrian can be regarded as an integration of different body parts. We can use a CNN to learn the features of each body part and produce a corresponding score. A low score denotes that the body part is occluded, whereas a high score denotes a body part that is not occluded. During training, the features of each body part are combined, and the combined features are used to classify and localize the pedestrian. The issues to resolve in the future are how many body parts to select and how to efficiently combine the features of each body part in low-resolution IR images. Additionally, the occlusion issue can be addressed through the loss function by taking advantage of supervised learning. The loss function can be designed to push proposals away from overlapping ground-truth bounding boxes other than the one they are assigned to. The aim is to force each proposal to focus on its real object and to avoid false objects so that the mis-detection rate due to occlusions can be reduced.
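One possible way to combine body-part features with their visibility scores is a score-weighted average, sketched below. This combination rule is an assumption made for illustration and is not taken from [21]; the function name and the toy part features are hypothetical.

```python
from typing import List


def combine_part_features(
    part_features: List[List[float]], visibility_scores: List[float]
) -> List[float]:
    """Average part features, weighting each part by its visibility score.

    Parts with a low score (likely occluded) contribute little to the
    combined feature; parts with a high score dominate it.
    """
    if len(part_features) != len(visibility_scores):
        raise ValueError("one score is required per body part")
    total = sum(visibility_scores)
    if total <= 0:
        raise ValueError("at least one part must have a positive score")
    combined = [0.0] * len(part_features[0])
    for features, score in zip(part_features, visibility_scores):
        weight = score / total
        for i, value in enumerate(features):
            combined[i] += weight * value
    return combined


# Assumed toy example: the second part (e.g. legs) is fully occluded.
parts = [[1.0, 0.0], [0.0, 1.0]]
print(combine_part_features(parts, [1.0, 0.0]))
```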

4.4 The Trade-off between Detection Accuracy and Speed

For autonomous driving, accurate and real-time pedestrian detection is necessary. Balancing the trade-off between accuracy and computation time is a common problem in the object detection area. A recent work [22] proposes a lightweight auxiliary network based on SSD. In combination with the existing bottom-up and top-down networks, a bidirectional network was proposed. Experiments show that these two strategies can improve accuracy and save computing time. Exploring lightweight CNNs for multispectral pedestrian detection will be attractive in the future, because much of the information in a CNN is redundant.

5. Conclusion

This paper reviewed multispectral pedestrian detection methods proposed in recent years, dividing them into three categories and reviewing each of them. Then, four multispectral pedestrian detection datasets were explained, and the performance of 13 well-known, recently published multispectral pedestrian detection methods was compared. Finally, several current problems were discussed, and future research directions for multispectral pedestrian detection were suggested.


The author thanks Yunfan Chen for her great help during the whole process.


[1] Hwang S., Park J., Kim N., et al., June 2015, Multispectral pedestrian detection: benchmark dataset and baseline, Proc. IEEE Conf. Computer Vision and Pattern Recognition, Boston, MA, USA, pp. 1037-1045.
[2] Liu J., Zhang S., Wang S., et al., September 2016, Multispectral deep neural networks for pedestrian detection, Proc. British Machine Vision Conf., York, UK, pp. 1-13.
[3] König D., Adam M., Jarvers C., et al., July 2017, Fully convolutional region proposal networks for multispectral person detection, Proc. IEEE Workshop on Computer Vision and Pattern Recognition, Honolulu, HI, USA, pp. 243-250.
[4] Chen Y., Xie H., Shin H., 2018, Multi-layer fusion techniques using a CNN for multispectral pedestrian detection, IET Computer Vision, Vol. 12, No. 8, pp. 1179-1187.
[5] Guan D., Cao Y., Yang J., Cao Y., Yang M.Y., 2019, Fusion of multispectral data through illumination-aware deep neural networks for pedestrian detection, Information Fusion, Vol. 50, pp. 148-157.
[6] Li C., Song D., Tong R., Tang M., 2019, Illumination-aware faster R-CNN for robust multispectral pedestrian detection, Pattern Recognition, Vol. 85, pp. 161-171.
[7] Guan D., Cao Y., Yang J., Cao Y., Tisse C.L., 2018, Exploiting fusion architectures for multispectral pedestrian detection and segmentation, Applied Optics, Vol. 57, No. 18, pp. D108-D116.
[8] Li C., Song D., Tong R., Tang M., 2018, Multispectral pedestrian detection via simultaneous detection and segmentation, arXiv preprint arXiv:1808.04818.
[9] Zheng Y., Izzat I.H., Ziaee S., 2019, GFD-SSD: Gated Fusion Double SSD for Multispectral Pedestrian Detection, arXiv preprint arXiv:1903.06999.
[10] Pei D., Jing M., Liu H., Jiang L., Sun F., 2020, A Fast RetinaNet Fusion Framework for Multi-spectral Pedestrian Detection, Infrared Physics & Technology.
[11] Ren S., He K., Girshick R., Sun J., 2015, Faster R-CNN: Towards real-time object detection with region proposal networks, In Advances in Neural Information Processing Systems, pp. 91-99.
[12] Lin T.Y., Goyal P., Girshick R., He K., Dollár P., 2017, Focal loss for dense object detection, In Proceedings of the IEEE International Conference on Computer Vision, pp. 2980-2988.
[13] Lin T.Y., Dollár P., Girshick R., He K., Hariharan B., Belongie S., 2017, Feature pyramid networks for object detection, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2117-2125.
[14] Takumi K., Watanabe K., Ha Q., et al., Mountain View, Multispectral object detection for autonomous vehicles, Proc. Thematic Workshops of ACM Multimedia.
[15] Davis J.W., Sharma V., 2007, Background-subtraction using contour-based fusion of thermal and visible imagery, Comput. Vis. Image Underst., Vol. 106, No. 2-3, pp. 162-182.
[16] CVC-14: Visible-FIR Day-Night Pedestrian Sequence Dataset.
[17] Dollár P., Wojek C., Schiele B., et al., 2012, Pedestrian detection: an evaluation of the state of the art, IEEE Trans. Pattern Anal. Mach. Intell., Vol. 34, No. 4, pp. 743-761.
[18] Fu C.-Y., Liu W., Ranga A., Tyagi A., 2017, DSSD: deconvolutional single shot detector, arXiv:1701.06659.
[19] Chen L.-C., Papandreou G., Kokkinos I., Murphy K., Yuille A.L., 2017, DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs, IEEE Trans. Pattern Anal. Mach. Intell., Vol. 40, No. 4, pp. 834-848.
[20] Wang W., Lai Q., Fu H., Shen J., 2019, Salient object detection in the deep learning era: An in-depth survey, arXiv preprint arXiv:1904.09146.
[21] Zhang S., Wen L., Bian X., Lei Z., Li S.Z., 2018, Occlusion-aware R-CNN: detecting pedestrians in a crowd, In Proceedings of the European Conference on Computer Vision, pp. 637-653.
[22] Wang T., Anwer R.M., Cholakkal H., Khan F.S., Pang Y., Shao L., 2019, Learning rich features at high-speed for single-shot object detection, In Proceedings of the IEEE International Conference on Computer Vision, pp. 1971-1980.
[23] Zhang L., Zhu X.Y., Chen X.Y., Yang X., Lei Z., Liu Z.Y., 2019, Weakly aligned cross-modal learning for multispectral pedestrian detection, In Proceedings of the IEEE International Conference on Computer Vision, pp. 5127-5137.
[24] Zhang H., Fromont E., Lefevre S., Avignon B., 2020, Multispectral Fusion for Object Detection with Cyclic Fuse-and-Refine Blocks, In Proceedings of the IEEE International Conference on Image Processing.
[25] Zhou K.L., Chen L.S., Cao X., 2020, Improving Multispectral Pedestrian Detection by Addressing Modality Imbalance Problems, arXiv preprint arXiv:2008.03043.



Yuting Li is currently working in the Department of Electronic Systems Engineering, Hanyang University, South Korea. His research interests include autonomous driving, convolutional neural networks, and deep learning.