
Intelligence Software Team, Hanwha Systems Co., Ltd., Seongnam-si, Gyeonggi-do 13524, Korea ({seonghyun, ty.lee, jongsik.ahn, haemoon1205, kim.hyun.hak95, seoyoung.kim, byungin.choi}@hanwha.com)



Keywords: Infrared image, Object detection, Data augmentation, Generative model, Diffusion model

1. Introduction

Object detection is an image processing technique that classifies object types while determining the location and size of objects in the form of bounding boxes. It has been widely used in various industrial fields, such as autonomous driving, surveillance and security systems, and robotics. In particular, autonomous driving and security systems require high accuracy and diverse training data covering various environmental conditions to minimize errors caused by false detections.

However, visible-light images (hereafter, visible images) with insufficient information due to adverse conditions (e.g., fog, rain, darkness) increase object detection errors. Fig. 1 shows examples of object detection performance on visible and infrared images from the same viewpoint. Visible images can capture semantic information representing the background and objects, but their quality is easily affected by weather and illumination conditions. In contrast, infrared images capture the thermal radiation emitted from objects, thereby overcoming these environmental limitations. Therefore, infrared images can be used in various image processing algorithms, such as object detection and tracking, which are increasingly essential in night vision and surveillance systems [1]. However, infrared sensors are expensive and have limited capture conditions because thermal factors must be considered in various environments. In addition, the complexity of infrared image processing is high because of sensor degradation and temperature inversion between the background and objects. Thus, there are only a few public infrared image datasets: only two of the 244 publicly available object detection datasets contain infrared images [2]. Consequently, infrared image datasets are confined to specialized fields such as defense and medicine, making models difficult to train because of their low versatility and the small number of images [3].

Fig. 1. An example of object detection performance for visible and infrared image pairs.
../../Resources/ieie/IEIESPC.2024.13.5.443/fig1.png

Data augmentation techniques have been introduced to overcome the lack of image datasets and to ensure model performance. These techniques increase the diversity of image datasets in order to alleviate data bias, prevent overfitting, and improve model performance. However, although image processing-based algorithms achieve performance improvements, they are still inadequate to improve the absolute performance because of inherent constraints in the original dataset or the characteristics of infrared images. To overcome this limitation, many studies have proposed image generation methods, such as image-to-image translation algorithms, to compensate for the lack of training data [4]. Additionally, probabilistic approaches that overcome the imbalance between the generator and discriminator of generative adversarial networks (GANs) have led to several diffusion models, which estimate sample distributions according to the probability distribution of the images. With the ability to specify clear goals, such as a distribution range and fixed training objectives, diffusion models show superior performance for data augmentation [5].

In this study, we analyze object detection performance under infrared data augmentation based on diffusion models, according to the number of images, classes, and object sizes. First, we translate infrared images from visible images using diffusion models. Then, we train the object detection model on training datasets constructed from the translated infrared images at various ratios.

Finally, we analyze the performance of each object detection model trained on datasets with various mixed ratios ${\lambda}$, along with the factors that affect object detection performance. Furthermore, we show the relationship between the distributions of real infrared images and infrared images translated from visible images in the quantitative assessment of object detection.

The remainder of this paper is organized as follows: Section 2 briefly reviews data augmentation techniques, Section 3 describes the infrared image augmentation techniques using diffusion models, Section 4 discusses the experimental results, and Section 5 concludes the paper.

Table 1. Classification of image data augmentations.

| Category | Techniques |
|---|---|
| Image processing-based approaches | Color Space Transformation, Geometric Transformation, Kernel Filter |
| Learning-based approaches | Adversarial Training, Neural Style Transfer, GAN (Generative Adversarial Networks), Diffusion |

2. Related Work

2.1 Image Processing-based Image Data Augmentation

Data augmentation techniques can be broadly categorized into image processing-based and learning-based approaches [6], as summarized in Table 1. Image processing-based data augmentation mainly increases the number of images by transforming image characteristics or applying specific filters. For example, color space transformation provides robustness against color modifications of the same object because it can manipulate color information in various ways. Geometric transformation [7] increases the variability of objects through image resizing, rotation, and displacement, which improves robustness to changes in object form. Finally, kernel filters [8] emphasize the features of an image by applying specific filters. However, image processing-based data augmentation distorts the information in the image during the transformation process. Moreover, the performance improvement is insufficient for data lacking semantic information about objects, such as images obtained from specialized sensors like synthetic-aperture radar (SAR) and infrared.
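To make these operations concrete, the following Python sketch (our illustration, not part of the original paper; the OpenCV usage and parameter ranges are arbitrary choices) applies one representative operation from each image processing-based family in Table 1.

```python
# Illustrative sketch of the three image processing-based augmentation
# families in Table 1 (hypothetical parameters, using OpenCV and NumPy).
import cv2
import numpy as np

def color_space_transform(img: np.ndarray) -> np.ndarray:
    """Color space transformation: random hue/saturation jitter in HSV."""
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV).astype(np.int16)
    hsv[..., 0] = (hsv[..., 0] + np.random.randint(-10, 11)) % 180   # hue
    hsv[..., 1] = np.clip(hsv[..., 1] + np.random.randint(-30, 31), 0, 255)
    return cv2.cvtColor(hsv.astype(np.uint8), cv2.COLOR_HSV2BGR)

def geometric_transform(img: np.ndarray) -> np.ndarray:
    """Geometric transformation: random rotation, scaling, and shift."""
    h, w = img.shape[:2]
    m = cv2.getRotationMatrix2D((w / 2, h / 2),
                                np.random.uniform(-15, 15),   # angle
                                np.random.uniform(0.9, 1.1))  # scale
    m[:, 2] += np.random.uniform(-0.05, 0.05, size=2) * (w, h)  # shift
    return cv2.warpAffine(img, m, (w, h))

def kernel_filter(img: np.ndarray) -> np.ndarray:
    """Kernel filter: sharpening kernel that emphasizes image features."""
    kernel = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]], np.float32)
    return cv2.filter2D(img, -1, kernel)
```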

Fig. 2. Illustration of different architectures for learning-based image data augmentation.
../../Resources/ieie/IEIESPC.2024.13.5.443/fig2.png

2.2 Learning-based Image Data Augmentation

Learning-based image data augmentation trains on specific domain data to generate or translate images into a unique style, as in style transfer and adversarial training. Fig. 2 compares the three architectures commonly used for learning-based image data augmentation. The Variational Auto-Encoder (VAE) model [9] can generate an image with a distribution similar to that of the input image using an encoder and a decoder, as illustrated in Fig. 2(a). However, it generates low-quality images depending on the quality of the input image. The GAN model [10,11] generates images via an adversarial learning process, where the generator produces images whose quality the discriminator cannot distinguish from real ones, and the discriminator enhances its ability to distinguish between real and generated images, as illustrated in Fig. 2(b). However, the GAN model requires detailed parameter fine-tuning because of training instability. Finally, as illustrated in Fig. 2(c), the diffusion model [12] adds noise to the image and gradually learns to remove it during training, obtaining a data distribution similar to the original and thus producing high-quality images.
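For illustration, the adversarial learning process of Fig. 2(b) can be condensed into the minimal PyTorch sketch below; the toy fully-connected generator and discriminator, the latent size, and the learning rates are arbitrary placeholders rather than any model used in this paper.

```python
# Minimal sketch of GAN adversarial training (Fig. 2(b)); toy architectures.
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(100, 784), nn.Tanh())    # generator: z -> image
D = nn.Sequential(nn.Linear(784, 1), nn.Sigmoid())   # discriminator: p(real)
bce = nn.BCELoss()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

def train_step(real: torch.Tensor) -> None:
    """real: (batch, 784) flattened real images scaled to [-1, 1]."""
    b = real.size(0)
    fake = G(torch.randn(b, 100))

    # Discriminator update: push real -> 1, generated -> 0.
    opt_d.zero_grad()
    loss_d = bce(D(real), torch.ones(b, 1)) + \
             bce(D(fake.detach()), torch.zeros(b, 1))
    loss_d.backward()
    opt_d.step()

    # Generator update: fool the discriminator into labeling fakes as real.
    opt_g.zero_grad()
    loss_g = bce(D(fake), torch.ones(b, 1))
    loss_g.backward()
    opt_g.step()
```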

3. The Proposed Method

Our approach aims to improve object detection performance limited by the insufficiency of infrared image data. We generated the missing infrared image data with diffusion models, which are likelihood-based models known to generate high-quality images. We then analyzed object detection performance by the number of images, classes, and object sizes. Fig. 3 shows an overview of the infrared object detection process, which translates infrared images from visible images and trains an object detection model. Specifically, we employ the pixel space-based diffusion network Palette [13] and the latent space-based diffusion network BBDM [14] for image-to-image translation. The training datasets were constructed with various mixed ratios $\lambda$ between ground-truth infrared images and infrared images translated by the diffusion models. Finally, $\mathrm{mAP}_{0.5}$ is evaluated for object detection models trained on infrared image datasets with various mixed ratios $\lambda$.

Fig. 3. Overview of infrared object detection process using image data augmentation based on diffusion models.
../../Resources/ieie/IEIESPC.2024.13.5.443/fig3.png

3.1 Pixel Space-based Palette

Diffusion models consist of a forward process and a reverse process. The forward process is a Markovian process that iteratively adds Gaussian noise to the original image. In contrast, the reverse process reconstructs the original image from noise, yielding significant flexibility and tractability. Palette is a pixel space-based diffusion model that provides a unified framework for image-to-image translation tasks: colorization, inpainting, uncropping, and JPEG restoration. Palette trains a reverse process that inverts the forward process to translate an infrared image from a visible image. Given a noisy infrared image $\tilde{y}$,

(1)
$\tilde{y}=\sqrt{\gamma}\,y_{0}+\sqrt{1-\gamma}\,\epsilon, \quad \epsilon \sim \mathcal{N}(0, I),$

the goal is to recover the target infrared image $y_{0}$. Thus, we parameterize the network $f_{\theta}(x, \tilde{y}, \gamma)$ by the input visible image $x$, the noisy infrared image $\tilde{y}$, and the current noise level $\gamma$.

To optimize the loss function, we predict the noise vector $\epsilon$ as follows:

(2)
$\mathbb{E}_{(x,y)}\,\mathbb{E}_{\epsilon,\gamma}\left\| f_{\theta}(x, \tilde{y}, \gamma)-\epsilon \right\|^{2}.$

Because Palette minimizes the pixel-level difference between real and translated infrared images over the spatial dimensions, it can effectively produce high-quality translated images with faithful structure and texture details.
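A minimal PyTorch sketch of this objective is given below; `f_theta` stands for the conditional denoising network (a U-Net in the actual Palette model), and sampling the noise level uniformly is a simplification of the real noise schedule, so treat this as an illustration rather than the authors' implementation.

```python
# Sketch of the Palette-style conditional denoising loss, Eqs. (1)-(2).
import torch

def palette_loss(f_theta, x: torch.Tensor, y0: torch.Tensor) -> torch.Tensor:
    """x: visible images (B,C,H,W); y0: paired infrared targets (B,C,H,W)."""
    gamma = torch.rand(y0.size(0), 1, 1, 1)        # noise level (simplified)
    eps = torch.randn_like(y0)                     # Gaussian noise
    y_tilde = gamma.sqrt() * y0 + (1 - gamma).sqrt() * eps      # Eq. (1)
    return ((f_theta(x, y_tilde, gamma) - eps) ** 2).mean()     # Eq. (2)
```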

3.2 Latent Space-based BBDM

Although Palette faithfully translates infrared images from visible images while retaining detailed textures, it generally requires extensive computational resources because it operates in pixel space. To address this issue, latent space-based diffusion models have been developed that train with fewer computational resources. As a latent space-based diffusion model, BBDM conducts image-to-image translation between two domains using a stochastic Brownian bridge process, providing promising results. The encoder extracts the feature maps of the image and maps them to a high-dimensional latent space. The diffusion process then proceeds in the latent space according to a variance schedule, and the decoder translates the result into an infrared image. The variance schedule $\delta_{t}$ of the Brownian bridge diffusion process can be designed as

(3)
$\delta_{t}=2\left(m_{t}-m_{t}^{2}\right), \quad m_{t}=\frac{t}{T},$

where $T$ is the total number of diffusion steps. The sampling diversity can be tuned by the maximum variance $\delta_{max}$ at the middle step $t=\frac{T}{2}$. To translate infrared images from visible images, the objective function of BBDM is as follows:

(4)
$\mathbb{E}_{x_{0},\,y,\,\epsilon}\left\| m_{t}\left(y-x_{0}\right)+\sqrt{\delta_{t}}\,\epsilon-\epsilon_{\theta}\left(x_{t},t\right)\right\|^{2},$

where $x_{0}$ and $y$ denote the initial visible image state and the infrared image, respectively, and $\epsilon_{\theta}$ is the trained model that estimates ${\epsilon}$. BBDM effectively represents high-dimensional characteristics in the latent space via the encoder rather than running the diffusion process in pixel space, thereby improving learning efficiency and model generalization.
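Under the same caveats, Eqs. (3) and (4) can be sketched as follows; `eps_theta` and the latent inputs are placeholders, and the sampled bridge state uses the standard BBDM forward process $x_{t}=(1-m_{t})x_{0}+m_{t}y+\sqrt{\delta_{t}}\,\epsilon$.

```python
# Sketch of the BBDM Brownian-bridge schedule (Eq. (3)) and loss (Eq. (4)).
import torch

def bbdm_loss(eps_theta, x0: torch.Tensor, y: torch.Tensor,
              T: int = 1000) -> torch.Tensor:
    """x0: visible-domain latents; y: infrared-domain latents (B,C,H,W)."""
    t = torch.randint(1, T + 1, (x0.size(0), 1, 1, 1)).float()
    m_t = t / T
    delta_t = 2 * (m_t - m_t ** 2)                         # Eq. (3) variance
    eps = torch.randn_like(x0)
    x_t = (1 - m_t) * x0 + m_t * y + delta_t.sqrt() * eps  # bridge state
    target = m_t * (y - x0) + delta_t.sqrt() * eps         # Eq. (4) target
    return ((eps_theta(x_t, t) - target) ** 2).mean()
```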

4. Performance Evaluation

4.1 Experimental Settings

We evaluate infrared object detection performance using the FLIR dataset [15], which contains pairs of well-aligned visible and infrared images from real cameras, including over 375,000 annotations. The visible images were translated into infrared images using the diffusion models, and then the training datasets were constructed at various ratios of translated images to FLIR dataset images. We use the YOLOv5l-TA model [16] for object detection, which exhibits superior detection performance and is actively used in many industrial fields because of its outstanding inference speed. The quantitative metric for object detection performance is the mean Average Precision ($\mathrm{mAP}$) across different Intersection over Union (IoU) thresholds. Furthermore, the object detection model was trained in different settings to analyze the effect of the number of images; Table 2 compares the resulting object detection performance. In addition, we categorized object sizes based on the pixel scale range of width $W$ and height $H$, classifying the $\mathrm{mAP}$ metrics into $\mathrm{mAP}_{s}$, $\mathrm{mAP}_{m}$, and $\mathrm{mAP}_{l}$ for small, medium, and large objects, as shown in Table 3. Training with 2,000 and 3,000 images yielded equal results, with a $\mathrm{mAP}_{0.5}$ of 62.6%. This indicates that the object detection model reaches performance saturation; the diversity of the FLIR dataset is sufficient even with 3,000 images. Therefore, we set the baseline training set size to 3,000 images in all experiments.
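As an illustration of how such mixed training sets can be assembled (the helper and file layout below are hypothetical, not the authors' actual pipeline), a fraction $\lambda$ of the 3,000 real infrared images is replaced by their translated counterparts; since the translated images depict the same scenes, the original bounding-box annotations can be reused.

```python
# Hypothetical construction of a training list with mixed ratio lambda.
import random

def build_mixed_dataset(real_ids, real_dir: str, translated_dir: str,
                        lam: float = 0.2, seed: int = 0):
    """Replace a lam fraction of real infrared images with translated ones."""
    rng = random.Random(seed)
    swapped = set(rng.sample(list(real_ids), int(lam * len(real_ids))))
    return [f"{translated_dir}/{i}.png" if i in swapped
            else f"{real_dir}/{i}.png" for i in real_ids]
```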

Table 2. The results of object detection performance according to the number of images.

| Numbers | mAP0.5(%) | mAP(%) | mAPs(%) | mAPm(%) | mAPl(%) | mAPperson(%) | mAPcar(%) |
|---|---|---|---|---|---|---|---|
| 1,000 | 61.2 | 34.8 | 21.1 | 60.3 | 71.3 | 52.9 | 69.5 |
| 2,000 | 62.6 | 35.3 | 21.8 | 59.7 | 72.5 | 55.8 | 69.3 |
| 3,000 | 62.6 | 35.2 | 21.6 | 59.9 | 72.9 | 55.6 | 69.6 |

Table 3. The categories of object size within the pixel scale range.
../../Resources/ieie/IEIESPC.2024.13.5.443/tb3.png

4.2 Quantitative & Qualitative Evaluation

We evaluated quantitative image-to-image translation performance of the diffusion models using four metrics: Peak Signal-to-Noise Ratio (PSNR), Structural Similarity (SSIM), Inception Score (IS) [17], and Fréchet Inception Distance (FID) [18]. Higher PSNR, SSIM, and IS scores indicate better performance, whereas a lower FID score indicates better performance. Table 4 quantitatively compares pixel space-based Palette and latent space-based BBDM. BBDM yields higher PSNR and SSIM scores than Palette, indicating that its translated infrared images have structures and characteristics similar to the infrared ground truths. In contrast, Palette achieves a higher IS score than BBDM, implying that it translates infrared images with high quality and superior diversity. Furthermore, Palette yields a significantly better (lower) FID score than BBDM. This indicates that Palette preserves detailed information and effectively translates infrared images from visible images.
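For reference, the per-image metrics PSNR and SSIM can be computed with scikit-image as in the sketch below; IS and FID require an Inception network over the whole image set and are typically computed with dedicated tools (e.g., torch-fidelity), so they are omitted here.

```python
# Sketch: PSNR and SSIM between a real and a translated infrared image.
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_pair(real_ir: np.ndarray, translated_ir: np.ndarray):
    """Both inputs: uint8 grayscale infrared images of identical size."""
    psnr = peak_signal_noise_ratio(real_ir, translated_ir, data_range=255)
    ssim = structural_similarity(real_ir, translated_ir, data_range=255)
    return psnr, ssim
```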

Table 4. Quantitative comparison of the translated infrared image from visible image.

| Model | PSNR (↑) | SSIM (↑) | IS (↑) | FID (↓) |
|---|---|---|---|---|
| Palette | 18.92 | 0.4826 | 1.2158 | 53.81 |
| BBDM | 21.88 | 0.4992 | 1.1885 | 59.03 |

Fig. 4 shows a qualitative comparison of infrared images translated from visible images by the two diffusion models. Palette shows outstanding results for scenes with multiple objects and retains fine textures in the infrared images. It captures spatial information in pixel space and preserves detailed texture, yielding high-quality infrared images. In contrast, BBDM generates visual artifacts that lead to failures in detecting objects such as trees and cars. This is because the model focuses only on semantic information, while texture information is lost during the diffusion process. These results indicate that although BBDM achieves higher PSNR and SSIM scores, it changes the appearance of objects and eventually degrades object detection performance. In contrast, Palette matches the distribution of the translated infrared images to that of the real infrared images, resulting in promising IS and FID scores.

Fig. 4. Qualitative comparison of the translated infrared image.
../../Resources/ieie/IEIESPC.2024.13.5.443/fig4.png

4.3 Object Detection Performance Based on Mixed Ratio $\mathbf{\lambda}$

Table 5 compares the quantitative object detection performance of models trained on datasets with various mixed ratios $\lambda$ of translated infrared images. For better comparison, the relative improvement over the baseline score is marked in parentheses. A mixed ratio ${\lambda}$ of 100% refers to a training dataset composed entirely of infrared images translated by the diffusion models. At a mixed ratio ${\lambda}$ of 20%, Palette achieved 62.8% $\mathrm{mAP}_{0.5}$, a relative improvement of 0.3% over the baseline $\mathrm{mAP}_{0.5}$. This indicates that a moderate proportion of translated images improves object detection performance through superior generalization ability. However, further increasing the mixed ratio $\lambda$ degraded $\mathrm{mAP}_{0.5}$ because the translated images fail to fully cover the real infrared image distribution, e.g., at a mixed ratio ${\lambda}$ of 100%. BBDM exhibits a tendency similar to Palette's: with a mixed ratio ${\lambda}$ of 10%, it achieved 62.9% $\mathrm{mAP}_{0.5}$, a relative improvement of 0.5%. The larger performance drop of BBDM compared with Palette stems from the hazy appearance of its translated images, caused by undesirable discrepancies from the real infrared image distribution.

Table 5. The results of object detection according to the mixed ratio $\lambda$.

| Mixed Ratio λ(%) | Baseline mAP0.5(%) | Baseline mAP(%) | Palette mAP0.5(%) | Palette mAP(%) | BBDM mAP0.5(%) | BBDM mAP(%) |
|---|---|---|---|---|---|---|
| 10 | 62.6 | 35.2 | 62.6 | 35.2 | 62.9 (+0.5) | 35.3 (+0.3) |
| 20 | 62.6 | 35.2 | 62.8 (+0.3) | 35.2 | 62.1 | 35.1 |
| 30 | 62.6 | 35.2 | 62.4 | 35.3 (+0.3) | 61.6 | 34.6 |
| 50 | 62.6 | 35.2 | 62.5 | 35.3 (+0.3) | 61.4 | 34.3 |
| 100 | 62.6 | 35.2 | 57.2 | 29.4 | 39.4 | 19.7 |

4.4 Instance-level Analysis

We analyze instance-level object detection performance for training datasets with various mixed ratios ${\lambda}$ of infrared images translated by Palette and BBDM. Tables 6 and 7 compare object detection performance at the object size and class levels, respectively. With the mixed ratio $\lambda$ set to 20%, 30%, and 50%, Palette improves $\mathrm{mAP}_{s}$, $\mathrm{mAP}_{m}$, and $\mathrm{mAP}_{l}$ over the baseline because it considers the statistical characteristics of the infrared images. Furthermore, in Table 7, $\mathrm{mAP}_{car}$ consistently increases as the mixed ratio ${\lambda}$ increases. This indicates that the structures and texture details of cars are well captured by Palette owing to their relatively simple shape. Nevertheless, the overall $\mathrm{mAP}_{0.5}$ was not significantly affected, because small person and car instances account for a large proportion of the experimental dataset, as shown in Fig. 5. In contrast, the object detection performance of BBDM degraded as the mixed ratio $\lambda$ increased. BBDM failed to preserve representation ability during the diffusion process when translating infrared images, thereby degrading infrared image quality and overall object detection performance.

Fig. 5. Ratio by object size and class.
../../Resources/ieie/IEIESPC.2024.13.5.443/fig5.png
Table 6. The results of object detection performance based on the object size.

| Mixed Ratio λ(%) | Baseline mAPs(%) | Baseline mAPm(%) | Baseline mAPl(%) | Palette mAPs(%) | Palette mAPm(%) | Palette mAPl(%) | BBDM mAPs(%) | BBDM mAPm(%) | BBDM mAPl(%) |
|---|---|---|---|---|---|---|---|---|---|
| 10 | 21.6 | 59.9 | 72.9 | 21.7 (+0.5) | 59.7 | 72.4 | 21.8 (+0.9) | 59.9 | 72.7 |
| 20 | 21.6 | 59.9 | 72.9 | 21.6 | 60.1 (+0.3) | 73.2 (+0.4) | 21.5 | 59.9 | 73.4 (+0.7) |
| 30 | 21.6 | 59.9 | 72.9 | 21.7 (+0.5) | 60.0 (+0.2) | 74.2 (+1.8) | 21.1 | 59.3 | 72.2 |
| 50 | 21.6 | 59.9 | 72.9 | 21.7 (+0.5) | 60.3 (+0.7) | 73.8 (+1.2) | 20.8 | 59.6 | 71.6 |
| 100 | 21.6 | 59.9 | 72.9 | 17.3 | 52.1 | 65.7 | 7.7 | 40.0 | 60.2 |

Table 7. The results of object detection performance based on the object class.

| Mixed Ratio λ(%) | Baseline mAPperson(%) | Baseline mAPcar(%) | Palette mAPperson(%) | Palette mAPcar(%) | BBDM mAPperson(%) | BBDM mAPcar(%) |
|---|---|---|---|---|---|---|
| 10 | 55.6 | 69.6 | 55.7 (+0.3) | 69.6 | 55.4 | 70.3 (+1.0) |
| 20 | 55.6 | 69.6 | 55.9 (+0.5) | 69.7 (+0.3) | 54.6 | 69.7 (+0.3) |
| 30 | 55.6 | 69.6 | 54.9 | 69.9 (+0.5) | 53.8 | 69.3 |
| 50 | 55.6 | 69.6 | 54.6 | 70.5 (+1.3) | 53.5 | 69.3 |
| 100 | 55.6 | 69.6 | 42.0 | 72.4 (+4.0) | 22.9 | 55.9 |

4.5 Distribution Analysis

We verify the relationship between the distributions of real infrared images and infrared images translated from visible images. In particular, we utilize the Uniform Manifold Approximation and Projection (UMAP) [19] dimension reduction technique to visualize high-dimensional image features in a low-dimensional space. Fig. 6 illustrates the distributions of the real and translated infrared images. The distribution of images translated by Palette overlaps the real infrared image distribution more than that of BBDM. This overlap shows that the distribution of Palette's translated images is similar to that of the real infrared images, which is consistent with Palette's better quantitative and qualitative performance.
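A hedged sketch of this analysis with the umap-learn package is shown below; the feature extractor that produces the high-dimensional image vectors is assumed to exist elsewhere, and the UMAP settings are library defaults rather than the paper's exact configuration.

```python
# Sketch: project real and translated infrared features to 2-D with UMAP.
import numpy as np
import umap  # pip install umap-learn

def project_features(real_feats: np.ndarray, fake_feats: np.ndarray):
    """Each input: (N, D) array of high-dimensional image features."""
    reducer = umap.UMAP(n_components=2, random_state=42)
    emb = reducer.fit_transform(np.vstack([real_feats, fake_feats]))
    return emb[: len(real_feats)], emb[len(real_feats):]  # real, translated
```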

Fig. 6. The results of translated infrared images and Ground truth distribution.
../../Resources/ieie/IEIESPC.2024.13.5.443/fig6.png

5. Conclusion

In this study, we analyzed object detection performance with infrared data augmentation based on diffusion models according to the number of images, classes, and object sizes. We first used the pixel space-based method Palette and the latent space-based method BBDM to translate infrared images from visible images. Palette with a mixed ratio of 20% and BBDM with a mixed ratio of 10% improved $\mathrm{mAP}_{0.5}$ by 0.3% and 0.5%, respectively, compared to the baseline. In particular, we demonstrated that infrared data augmentation based on diffusion models can improve object detection performance, overcoming the lack of infrared image datasets. Finally, the experimental results confirmed that the more similar the real and translated infrared image distributions are, the better the qualitative and quantitative performance. Moreover, we showed that diffusion models can be used to improve object detection performance. Important directions for future work are to study the impact on object detection performance of datasets larger than 3,000 images and to develop a more effective diffusion model for translating infrared images from visible images.

REFERENCES

[1] S. Park, A. G. Vien, and C. Lee, "Cross-modal transformers for infrared and visible image fusion," IEEE Trans. Circuits Syst. Video Technol., vol. 34, no. 2, pp. 770-785, Feb. 2024.
[2] Papers with Code, object detection datasets. https://paperswithcode.com/datasets?task=object-detection
[3] P. Kaur, B. S. Khehra, and B. S. Mavi, "Data augmentation for object detection: A review," in Proc. IEEE Int. Midwest Symp. Circuits Syst., pp. 537-543, Aug. 2021.
[4] G. Mariani, F. Scheidegger, R. Istrate, C. Bekas, and C. Malossi, "BAGAN: Data augmentation with balancing GAN," arXiv:1803.09655, 2018.
[5] P. Dhariwal and A. Nichol, "Diffusion models beat GANs on image synthesis," in Proc. Adv. Neural Inf. Process. Syst., pp. 8780-8794, 2021.
[6] C. Shorten and T. M. Khoshgoftaar, "A survey on image data augmentation for deep learning," J. Big Data, vol. 6, no. 1, pp. 1-48, 2019.
[7] I. Golan and R. El-Yaniv, "Deep anomaly detection using geometric transformations," in Proc. Adv. Neural Inf. Process. Syst., pp. 9758-9769, 2018.
[8] K. He, J. Sun, and X. Tang, "Guided image filtering," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 6, pp. 1397-1409, Jun. 2013.
[9] R. Lopez et al., "Information constraints on auto-encoding variational Bayes," in Proc. Adv. Neural Inf. Process. Syst., pp. 6114-6125, 2019.
[10] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, "Image-to-image translation with conditional adversarial networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 1125-1134, 2017.
[11] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in Proc. IEEE Int. Conf. Comput. Vis., pp. 2223-2232, Oct. 2017.
[12] J. Ho, A. Jain, and P. Abbeel, "Denoising diffusion probabilistic models," in Proc. Adv. Neural Inf. Process. Syst., pp. 6840-6851, 2020.
[13] C. Saharia et al., "Palette: Image-to-image diffusion models," in Proc. ACM SIGGRAPH Conf., pp. 1-10, 2022.
[14] B. Li et al., "BBDM: Image-to-image translation with Brownian bridge diffusion models," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 1952-1961, 2023.
[15] FLIR ADAS dataset. https://www.flir.ca/oem/adas/adas-dataset-form/
[16] H. Kim, J. Ahn, T. Lee, and B. Choi, "The object detector for aerial image using high resolution feature extractor and attention module," J. Korean Inst. Inf. Electr. Commun. Technol., vol. 48, no. 1, pp. 1-11, Jan. 2023.
[17] T. Salimans et al., "Improved techniques for training GANs," in Proc. Adv. Neural Inf. Process. Syst., pp. 2234-2242, 2016.
[18] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, "GANs trained by a two time-scale update rule converge to a local Nash equilibrium," in Proc. Adv. Neural Inf. Process. Syst., pp. 6629-6640, 2017.
[19] L. McInnes, J. Healy, and J. Melville, "UMAP: Uniform manifold approximation and projection for dimension reduction," arXiv:1802.03426, 2020.
Seonghyun Park.
../../Resources/ieie/IEIESPC.2024.13.5.443/au1.png

Seonghyun Park received the B.S. degree in electrical, electronic, and control engineering from Hankyong National University, Anseong, South Korea, in 2020, and the M.S. degree in multimedia engineering from Dongguk University, Seoul, South Korea, in 2023. He is currently a Junior Researcher with the Intelligence S/W Team, Hanwha Systems Co., Ltd., Seongnam, South Korea. His current research interests include image processing and computational imaging.

Taeyoung Lee.
../../Resources/ieie/IEIESPC.2024.13.5.443/au2.png

Taeyoung Lee received the B.S. degree in information and control engineering from the robotics school, Kwangwoon University, Seoul, South Korea, in 2009, and the M.S. degree in control and instrumentation engineering from the robotics school, Kwangwoon University, Seoul, South Korea, in 2011. He is currently a Senior Researcher with the Intelligent S/W Team, Hanwha Systems Co., Ltd., Seongnam, South Korea. His current research interests include object detection, object tracking, and segmentation with deep learning, as well as generative AI.

Jongsik Ahn.
../../Resources/ieie/IEIESPC.2024.13.5.443/au3.png

Jongsik Ahn received the B.S. degree in mechanical engineering from Kyunghee University, Suwon, South Korea, in 2017, and the M.S. degree from the school of electronic and electrical engineering, Kyungpook National University, Daegu, South Korea, in 2022. He is currently a Researcher with the Intelligent S/W Team, Hanwha Systems Co., Ltd., Seongnam, South Korea. His current research interests include infrared image object detection, segmentation, and object tracking.

Haemoon Kim.
../../Resources/ieie/IEIESPC.2024.13.5.443/au4.png

Haemoon Kim received the B.S. degree in electrical, electronic, and control engineering from Hankyong National University, Anseong, South Korea, in 2020, and the M.S. degree in computer science and engineering from Hanyang University, Ansan, South Korea, in 2022. He is currently a Junior Researcher with the Intelligence S/W Team, Hanwha Systems Co., Ltd., Seongnam, South Korea. His current research interests include object detection, instance segmentation, and aerial image processing.

Hyunhak Kim.
../../Resources/ieie/IEIESPC.2024.13.5.443/au5.png

Hyunhak Kim received the B.S. and M.S. degrees in biomechanical engineering from Sungkyunkwan University (SKKU), South Korea, in 2020 and 2022, respectively. He is currently a Junior Researcher with the Intelligence S/W Team, Hanwha Systems Co., Ltd., Seongnam, South Korea. His current research interests include reinforcement learning, object detection, image processing, and language models.

Seoyoung Kim.
../../Resources/ieie/IEIESPC.2024.13.5.443/au6.png

Seoyoung Kim received the B.S. degree in chemical and biological engineering from Korea University, Seoul, South Korea, in 2020. She is currently a Junior Researcher with the Intelligence S/W Team, Hanwha Systems Co., Ltd., Seongnam, South Korea. Her current research interests include image processing and 3D vision.

Byungin Choi.
../../Resources/ieie/IEIESPC.2024.13.5.443/au7.png

Byungin Choi received the B.S., M.S., and Ph.D. degrees in electronic engineering from Hanyang University, Seoul, South Korea, in 2001, 2003, and 2008, respectively. He is currently the leader of the Intelligent S/W Team, Hanwha Systems Co., Ltd., Seongnam, South Korea. His current research interests include object detection, object tracking, and super resolution.