1. Introduction
Image fusion technology aims to associate and synthesize multi-source sensor image data to generate a unified estimation and judgment. As an effective information fusion technology, it has been used widely in automatic machine recognition, earth remote sensing, computer vision, military reconnaissance, medical image pathological change recognition, and other fields. The fused result is more complete, reliable, and accurate than any single information source [1,2]. Because an image is a particular form of signal, its signal characteristics and data content are affected by objective and subjective environments as well as uncontrollable factors, and therefore show distinct differences [3]. At the same time, the limitations of a single sensor make it difficult to meet the requirements of the fusion target through image processing alone. Multi-sensor image information fusion, despite its particularity and complexity, can better combine multi-source data information [4].
Interference from internal and external environmental factors greatly restricts the image processing and fusion process, and the actual presentation effect can differ significantly. Different types of images have different spatial scales, forms of expression, information characteristics, and properties, so these characteristics must be considered when performing image fusion. At the same time, the complexity of the changing network environment and the differences of images in time and space make data processing more difficult. The multi-sensor image fusion method proposed in this paper can process and transform image information at different scales, achieving better processing results. Moreover, the method can effectively reduce the oscillating effect that a moving target has on image information during motion and improve the accuracy of target image detection in a dynamic environment.
Y. B. et al. [5] proposed a multi-exposure image fusion method based on tensor decomposition (TD) and convolutional sparse representation (CSR). H. B. [6] proposed marking infrared image targets by semantic segmentation and applying different loss functions to the target and background areas during fusion. The experimental results showed that the fusion results had higher contrast in the target area and more texture details in the background area. Y. S. et al. [7] reported the fusion of multi-exposure images with three image quality attributes: contrast, saturation, and brightness. Multi-scale image fusion under different coefficients was realized by attribute weighting and by eliminating pixels with a poor visual effect, and the image quality was improved using a local-saturation color correction method. The experimental data showed that the algorithm could maintain image details and correct the color of an exposure-fused image. M. Zheng et al. [8] used adaptive structure decomposition to achieve a defogging effect for multi-exposure images, i.e., through spatial linear adjustment, image sequence extraction, and the application of fusion schemes to increase the acquisition of image information. Hence, a texture-energy-based method was studied to select the block size of the image structure decomposition adaptively. According to the experimental data, the method had high effectiveness and applicability. Y. Yang et al. [9] proposed a multi-layer feature convolutional neural network to fuse multiple input images. They generated a fused total clustered image using a weighted summation decision, and the method showed a good fusion effect in the experimental evaluation. Q. Zhang et al. [10] proposed a multi-sensor image fusion method under computer vision. The method was based on sparse representation and was compared with a dictionary-learning-based method under the premise that misregistration between the source images does not affect the result. In the experimental evaluation, sparse representation could better fuse multi-sensor images.
Robust principal component analysis (RPCA) is often used in moving target detection problems, but its performance degrades with a dynamic background and object movement. Therefore, S. Javed et al. [11] proposed a spatiotemporally structured algorithm, i.e., spatial and temporal normalization of the sparse components. The spatiotemporal subspace structure could effectively constrain the sparse components and yield a new target function. The experimental results showed that the algorithm could achieve better target detection on different data sets with good performance. D. P. Bavirisetti et al. [12] realized the induced transformation of different source images under the structure transfer attribute to guide the visual-significance detection of image filtering and the selection of weight maps, achieving the extraction of image information and the integration of pixels. A video sequence can show the changes in objects during motion capture. S. P. Yadav [13] improved the traditional frame difference algorithm with the help of MATLAB and considered noise interference and structural differences to achieve diverse coding methods. The results show that the video-sequence test algorithm can effectively recognize moving images with high robustness. Considering the current reliance on single medical imaging methods, S. P. Yadav et al. [14] used a wavelet transform to achieve multimodal medical image fusion. They used the wavelet transform, independent component analysis, and principal component analysis to perform image fusion, denoising, and data dimensionality reduction; this research idea can effectively improve the medical diagnosis effect. Based on the fact that the original driving-target image detection relies on RGB features, X. Ma et al. [15] innovatively proposed using a 2D image plane for a 3D point cloud spatial point representation. They introduced the PointNet backbone and a multimodal feature fusion module to realize 3D detection and image inference of automatic driving targets, improving the performance of monocular 3D target recognition. Previous salient object detection (SOD) methods based on RGB-D could not fully capture the complex correlation between RGB images and depth maps and did not consider the cross-hierarchy and continuity of information. Therefore, G. Li et al. realized interactive and adaptive information conversion and distinguished cross-modal features while enhancing RGB features from different sources through cross-modal depth-weighted combination and a depth algorithm. The method showed good applicability and effectiveness on five test data sets.
Fusion performance analysis showed that the proposed method performs well in visual quality and fusion metrics with less running time. A simultaneous guarantee of image quality and SNR is difficult in image information fusion, and the feature differences of moving images across different temporal and spatial backgrounds are noticeable. A moving-image information fusion analysis algorithm based on multiple sensors is therefore proposed to improve the SNR and information entropy of moving-image information fusion, reduce the standard mean square error, and improve the visual effect of moving-image information fusion.
Image information fusion technology is an important method of image information processing and analysis and can be carried out at three levels: pixel, feature, and decision. Ensuring the clarity and integrity of image fusion is a problem requiring attention in information processing. Multi-sensor information fusion is essentially a functional simulation of how the human brain processes complex problems. The method can observe and process various kinds of image information and exhibit complementarity and redundancy avoidance in space and time under different image optimization criteria. Research on image processing can account for the scale differences among image data with the help of the multi-sensor information transmission concept. As a locally connected network, a convolutional neural network (CNN) has the characteristics of local connectivity and weight sharing. It can perform convolution processing on a given image and extract features, and the transformation of its convolution kernel parameters can meet the requirements of image processing. Wavelet decomposition and color mode conversion of moving images can effectively reduce the impact of interference factors on image processing. The proposed method covers image feature extraction, information processing, and image fusion. It differs significantly from previous research in that it generates a moving-image sequence, effectively avoiding the loss of precision caused by missing moving-image frames and detection errors.
2. Moving Image Preprocessing
Moving images show different image scales owing to the difference in the information environment and information content. Therefore, it is necessary to preprocess the image before
image information fusion. The primary purpose of image preprocessing is to eliminate
the irrelevant information in the image, restore the useful and true information,
enhance the detectability of relevant information, and simplify the data to the maximum
extent to achieve the feature extraction and recognition of relevant information.
At the same time, wavelet decomposition and color space conversion of moving images
can reduce the impact of interference factors on image processing, ensure the quality
of image information, and better judge the quality of image fusion. A moving image originally refers to changes in the speed, time, and displacement of the physical properties of the image, i.e., the movement changes of objects expressed in the form of images. The image change caused by relative motion between an object and its surrounding environment can be called a moving image. The image may be fuzzy, distorted, or overlapping because of the uncertainty and variability of the motion, which also makes the fusion of moving images more difficult. The richness
and integrity of the image information displayed by the captured moving images at
different time intervals will also differ. The network environment and sensor types will also cause the content of the image information to be removed or retained to varying degrees [17]. At the same time, when moving image information is used for data input and digital
image conversion, it will inevitably reflect the signal characteristics under different
dimension levels. Identifying the signal characteristics can effectively capture the
commonness of different moving images in the fusion process. Therefore, the information of the moving image is analyzed while retaining its characteristics to obtain the image time series. The image time series is then decomposed by a wavelet to obtain the signal characteristics under different time-frequency resolutions, providing a richer information basis for image information fusion.
2.1 Moving Image Feature Extraction
In the feature extraction of moving images, it is necessary to divide the time and number of the pixel frames that constitute the image and then sample them step by step. A pixel frame refers to the complete sampling of an image, while a single-frame pixel refers to the image pixel frame at a specific time. The features of a single-frame pixel are then calculated using a CNN: the network extracts a convolutional feature map that represents the semantic information of the image, covers each pixel in the feature map with reference suggestion boxes, and determines whether each suggestion box is foreground or background. The classification of the foreground is determined according to the comprehensive characteristics of the subnet and the information of the mapping box [18]. The convolution-sharing feature is used to transmit and share the features of each single-frame pixel to form a feature network and determine the feature edges of the single-frame pixels of a moving image.
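The anchor-box scoring described above can be illustrated with a minimal sketch. The paper does not give the network details, so the layer sizes, anchor count, and the use of PyTorch below are assumptions; the sketch only shows how a shared convolutional feature map is scored at every position for foreground or background.

```python
# Hedged sketch (assumed architecture): a shared CNN backbone produces a feature
# map, and a small head scores each spatial position for every reference
# suggestion box (anchor) as foreground or background.
import torch
import torch.nn as nn

class ObjectnessHead(nn.Module):
    def __init__(self, in_channels=256, num_anchors=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 256, kernel_size=3, padding=1)
        # Two scores (background / foreground) per anchor at each feature-map pixel.
        self.score = nn.Conv2d(256, num_anchors * 2, kernel_size=1)

    def forward(self, feature_map):
        x = torch.relu(self.conv(feature_map))
        return self.score(x)          # shape: (N, num_anchors*2, H, W)

# Usage: a hypothetical 256-channel feature map extracted from one single-frame pixel image.
features = torch.randn(1, 256, 32, 32)
scores = ObjectnessHead()(features)
print(scores.shape)                   # torch.Size([1, 18, 32, 32])
```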
An image edge refers to a collection of pixels, located between the target and the background, whose surrounding pixels show a step change in grayscale. The edge feature can better reflect the clarity of the target. On the other hand, the actual image is blurred by optical factors, the performance of the image acquisition equipment, and the sampling rate [19]. Therefore, based on the characteristics of the human visual system, the key to whether the target image is clear is whether the edges between the target and the background are clear. The larger the image edge features, the clearer the presented effect. The specific calculation steps for the feature edge of a single-frame pixel are as follows:
The number of edge feature systems is defined as $E$ according to the motion position
of a single frame pixel and its corresponding number of time systems, and $E=e\left(n\right)$,
where $e\left(n\right)$ is the pixel corresponding to the amount of edge feature information.
The one-dimensional curve composed of a single frame pixel and its adjacent single
frame is calculated discretely to obtain two motion feature curves, $p\left(n\right)$
and $q\left(n\right)$, from which the calculation formula of the number of edge pixel
feature systems corresponding to a single frame pixel can be acquired, as shown below:
where $n$ represents the position of a single-frame pixel on the motion curve, $n=1,2,\ldots,m$, and $m$ represents the total amount of feature information.
Considering that the noise error of the moving image will affect $p\left(n\right)$
and $q\left(n\right)$, to ensure the number stability of feature systems, let $\varepsilon
\left(n\right)$ be the discrete coefficient of the number of feature systems, and
carry out $x+1$ iterative calculations. The number of feature systems after stable
optimization can be obtained as follows:
where $V^{g}$ represents the weighted value of the CNN convolution coefficient, and $e_{1}\left(i\right)$ and $e_{2}\left(j\right)$ represent the statistical characteristic quantities.
After the above calculations are completed, the single-frame pixel point $e\left(n\right)$ is defined as the center of the characteristic pixels, the outward radiation distance is $D$, and the corresponding angle of the characteristic pixel point is calculated to obtain a more accurate characteristic error coefficient. The feature information association area is composed of the pixel point $e\left(n\right)$ and all pixels within the radiation distance $D$, and can be described as
where $k$ represents the characteristic labels of all pixels within the radiation
distance range $D$ of the setting area $e\left(n\right)$.
The centers of two adjacent groups of feature single frame pixels of $e\left(n\right)$
are $e^{1}\left(n\right)$ and $e^{2}\left(n\right)$, and the feature angle relations
formed by $e^{1}\left(n\right)$ and $e\left(n\right)$ and $e\left(n\right)$ and $e^{2}\left(n\right)$
are $\theta ^{1}\left(n\right)$ and $\theta ^{2}\left(n\right)$, respectively. The
edge-coefficient curvature angle of the feature area composed of single-frame pixels is calculated as follows:
$\theta \left(n\right)$ and $e\left(n\right)$ satisfy a positive gradient relationship. Let the reference value of the moving-image feature coefficient be $f$; if $\theta \left(n\right)>f$, the following equation can be obtained:
Repeated iterative calculations on Eq. (5) are carried out to acquire the optimal coefficient number, and the range of the corresponding
coefficient number of $f$ is set to be less than 0.4. The value range corresponding
to $D$ is (4, 18).
The pixel characteristic parameters of the final moving image are
In Eq. (6), $G\left(n\right)$ represents the pixel characteristic parameters of the final moving image, and $f'$ represents the number of curvature angle systems of pixels in a single frame of edge features.
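Because Eqs. (1)-(6) are not reproduced above, the following is only a minimal, hedged sketch of the general idea: measuring edge strength around single-frame pixels from local gradients and keeping pixels whose normalized response exceeds the reference value $f$. The gradient operator and the normalization below are illustrative assumptions, not the paper's exact feature-system formulation.

```python
# Hedged sketch: gradient-based edge strength for a single-frame pixel image.
# The discrete gradient and the normalization are illustrative assumptions;
# the paper's Eqs. (1)-(6) define the actual feature-system coefficients.
import numpy as np

def edge_feature_map(frame, f=0.4):
    """Return a binary map of candidate edge-feature pixels and the local angle."""
    gy, gx = np.gradient(frame.astype(float))   # discrete derivatives, analogue of p(n), q(n)
    magnitude = np.hypot(gx, gy)
    magnitude /= magnitude.max() + 1e-12         # normalize to [0, 1]
    angle = np.arctan2(gy, gx)                   # analogue of the feature angle theta(n)
    mask = magnitude > f                         # keep pixels whose response exceeds f
    return mask, angle

if __name__ == "__main__":
    frame = np.random.rand(64, 64)
    mask, angle = edge_feature_map(frame)
    print(mask.sum(), "candidate edge-feature pixels")
```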
2.2 Generation of Moving Image Time Series
The moving image time series is generated by the mixed-function control-curve method, in which the HC-B\'{e}zier curve and the uniform B-spline curve with one shape parameter are defined in hyperbolic function space [16]. The second-order $\lambda $ function can be written as
Let $\alpha =\left(1-\lambda \right)\left(1-k^{2}\right)$ and $\beta =h\left(t-\lambda k\right)\overline{h}\left(1-\lambda \right)$. The mixed function with the parameters defined above is used to form the $\lambda $ function for control-edge coincidence interpolation, and the image contour curve is defined as
where $i=1,2,\ldots ,n$, and $n$ is the position of a single pixel on the motion time-series curve. Feature extraction is carried out for each contour point that produces the maximum gray value, and a combined surface composed of $A$ surface patches of order $N\times M$ is defined using the control points $C_{k}$:
The bright spot area at the edge of the studied image is decomposed into a set of
two-dimensional network points, which is expressed as
where $1\leq x\leq N$ and $1\leq y\leq M$. The elements of this set constitute the moving image time series.
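The exact HC-B\'{e}zier blending functions of Eqs. (7)-(9) are not reproduced above, so the sketch below only illustrates the general mechanism: sampling a control-point curve over the parameter $t$ to obtain an ordered sequence of contour points that serves as the moving-image time series. The de Casteljau evaluation used here is a standard B\'{e}zier construction and stands in for the paper's hyperbolic blend; the control points are hypothetical.

```python
# Hedged sketch: generate an ordered "time series" of contour points by sampling a
# control-point curve. Standard de Casteljau evaluation replaces the paper's
# HC-Bezier / uniform B-spline blend (Eqs. (7)-(9)), which is not shown here.
import numpy as np

def de_casteljau(control_points, t):
    """Evaluate a Bezier curve defined by control_points (k x 2) at parameter t."""
    pts = np.asarray(control_points, dtype=float)
    while len(pts) > 1:
        pts = (1.0 - t) * pts[:-1] + t * pts[1:]
    return pts[0]

def contour_time_series(control_points, num_samples=50):
    """Sample the curve at uniform parameter values to form the image time series."""
    ts = np.linspace(0.0, 1.0, num_samples)
    return np.array([de_casteljau(control_points, t) for t in ts])

# Usage with hypothetical contour control points C_k of a bright edge region.
C = [(0, 0), (2, 5), (6, 4), (8, 0)]
series = contour_time_series(C)
print(series.shape)   # (50, 2)
```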
2.3 Wavelet Decomposition of Moving Image
The time series of moving images obtained in Section 2.2 is further decomposed by a wavelet transform. The local features of moving images are generally determined by multiple pixels, and the wavelet transform method [18] is adopted to decompose the moving images considering the differences in semantic information and detail features at different scales.
The wavelet transform differs from the Fourier transform, and its multi-scale representation also differs from the traditional image pyramid decomposition. It can obtain decomposition sub-bands of images at different resolutions and spatial scales. The wavelet transform method generates the corresponding scale and displacement functions through a wavelet basis [19].
The definition of the wavelet basis is expressed as
where $z_{i}$ and $\sigma _{i}$ represent the scale factor and displacement factor, respectively.
The wavelet transform of the signal $L\left(t\right)$ can be expressed as
A moving image is a digital signal that generally does not satisfy the continuity condition, so $c_{1}$ and $c_{2}$ take discrete forms. The corresponding wavelet function is obtained after power-series processing of $c_{1}$, i.e., $c=c_{1}^{n}$, and can be expressed as
where $c_{2}$ is discretized uniformly. The form of $U\left(t\right)$ is shown below:
Finally, the discrete wavelet transform [20] is defined as
The wavelet decomposition of moving images is similar to filtering images with multiple groups of filter banks that automatically adjust their parameters to obtain sub-bands with different frequencies. Among the commonly used wavelet decomposition algorithms is the Mallat fast decomposition algorithm [21], whose mathematical expression is
where $p_{a}\left(i\right)$ refers to a high-pass filter; $p_{b}\left(j\right)$ refers to a low-pass filter; $a$ and $b$ refer to the row and column of the image; $\mu $ refers to the low-frequency part of the image; and $d_{1}$, $d_{2}$, and $d_{3}$ refer to the horizontal, vertical, and diagonal edge details of the image, respectively, i.e., the high-frequency part.
This paper selects the haar, db6, bior4.4, and sym8 wavelets as wavelet basis functions for decomposing the image three times. Each decomposition yields four sub-bands: the low-frequency approximation sub-band $LL$, the high-frequency vertical detail sub-band $HL$, the high-frequency diagonal detail sub-band $HH$, and the high-frequency horizontal detail sub-band $LH$. The low-frequency sub-band is further decomposed at the next decomposition level. Taking the three-level Haar wavelet decomposition as an example, the decomposition schematic is shown in Fig. 2, and a code sketch of this three-level decomposition is given after the figure.
Fig. 1. Moving image decomposition.
Fig. 2. Schematic diagram of wavelet decomposition.
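As a minimal illustration of the three-level decomposition described above, the PyWavelets sketch below decomposes a grayscale frame with each of the four chosen wavelet bases; at every level it yields a low-frequency approximation and three detail sub-bands. The random test image is a stand-in only, and the sub-band naming follows PyWavelets conventions rather than the paper's $LL/LH/HL/HH$ notation.

```python
# Sketch of the three-level 2-D wavelet decomposition described above, using
# PyWavelets. Each level yields three detail sub-bands; the final low-frequency
# approximation is the first element of the coefficient list.
import numpy as np
import pywt

frame = np.random.rand(512, 512)      # stand-in for a registered 512x512 moving-image frame

for basis in ("haar", "db6", "bior4.4", "sym8"):
    coeffs = pywt.wavedec2(frame, wavelet=basis, level=3)
    ll3 = coeffs[0]                   # low-frequency approximation after 3 levels
    h3, v3, d3 = coeffs[1]            # level-3 horizontal, vertical, diagonal detail sub-bands
    print(basis, "approximation shape:", ll3.shape, "diagonal detail shape:", d3.shape)
```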
3. Moving Image Fusion Algorithm
Based on the moving image features, time series, and decomposition results obtained in Section 2, a color space model (CSM) was established to ensure the color consistency of image fusion. Multiple sensors were then adopted to fuse the moving image. Fig. 2 shows how the image information is decomposed with the wavelet basis function to obtain sub-bands of different frequencies and directions. The wavelet decomposition of moving images is similar to filter decomposition, so the filter parameters can be adjusted to achieve a hierarchical division of signal features. The multi-objective particle swarm optimization (PSO) algorithm is used to optimize the image fusion result and improve the image fusion effect.
3.1 Construction of CSM
A moving image can be decomposed into three channel components: R, G, and B. Because the three channel components are correlated, they affect each other in the image fusion process, which is not conducive to the fusion calculation. IHS (Intensity, Hue, and Saturation) image fusion is based on color space conversion. The method converts RGB (Red, Green, and Blue) spatial information, transforming the image into one containing three independent components: I (Intensity) represents the intensity information (i.e., brightness), H (Hue) the distinguishing property between colors, and S (Saturation) the depth and concentration of the image colors. Compared with RGB space, the IHS color space is closer to how the human visual system perceives color. Different regions of the same unprocessed moving image will be blurred or clear owing to different depths of field. Consistent with grayscale multi-focus image fusion, moving image fusion combines the clear regions of different focus targets in multiple color multi-focus images into a single image. The IHS CSM was established to achieve this goal [22].
The IHS model is a CSM based on the three elements of human visual color. The I component mainly contains the gray information of the source image, and the H and S components together contain its spectral information [23]. In addition, the I component is the weighted average of the three color channels and is insensitive to noise. Therefore, the I component is selected to calculate the focusing degree, which characterizes the fusion degree of the pixel of interest in the resulting image. The IHS CSM is established based on the above reasons. Taking two moving images as an example, the schematic diagram of the CSM is shown in Fig. 3.
According to Fig. 3, the main steps of building the IHS CSM are as follows (a code sketch of the forward and inverse transforms is given after Fig. 3):
(1) The two moving images $A$ and $B$ are converted from RGB space to IHS space, and
the brightness components $I_{A}$ and $I_{B}$ of the two moving images are separated;
(2) The luminance component $I_{A}$ and the luminance component $I_{B}$ are decomposed
by DT-CWT to acquire the low-frequency component and high-frequency component of $I_{A}$
and $I_{B}$, respectively;
(3) In accordance with the fusion rules based on fuzzy theory, the moving image is
preliminarily fused to acquire $S_{A}\left(x,y\right)$, the low-frequency component,
and $H_{B}\left(x,y\right)$, the six high-frequency detail components;
(4) These low-frequency and high-frequency components are inversely transformed by
DT-CWT to obtain the fusion result $I_{AB}$;
(5) $S_{A}^{'}$ and $H_{B}^{'}$ are obtained by the weighted average of $S_{A}$ and
$H_{B}$, respectively, and the IHS inverse transformation is performed together with
$I_{AB}$;
(6) From IHS space to RGB space, the preliminary fusion results of the moving images
are obtained.
Fig. 3. IHS CSM principle.
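The paper does not reproduce the exact IHS transform matrices, so the sketch below uses one commonly cited linear variant (an assumption) to separate the intensity component I from two chrominance components, from which H and S can be derived; fusion operates on I, and the inverse transform returns to RGB, mirroring steps (1)-(6) above.

```python
# Hedged sketch of the IHS color-space conversion used in steps (1)-(6). The linear
# transform matrix below is one common variant from the pansharpening literature,
# assumed here; the paper's exact CSM equations are not reproduced.
import numpy as np

_FWD = np.array([[1/3, 1/3, 1/3],
                 [-np.sqrt(2)/6, -np.sqrt(2)/6, 2*np.sqrt(2)/6],
                 [1/np.sqrt(2), -1/np.sqrt(2), 0.0]])
_INV = np.linalg.inv(_FWD)

def rgb_to_ihs(rgb):
    """rgb: (H, W, 3) float array -> intensity I and chrominance v1, v2, each (H, W)."""
    i, v1, v2 = np.tensordot(rgb, _FWD.T, axes=1).transpose(2, 0, 1)
    return i, v1, v2

def ihs_to_rgb(i, v1, v2):
    """Inverse transform back to an (H, W, 3) RGB array."""
    stacked = np.stack([i, v1, v2], axis=-1)
    return np.tensordot(stacked, _INV.T, axes=1)

# Usage: in the full method, the intensity of image A would be replaced by a fused
# intensity I_AB before the inverse transform (steps 4-6); here I_AB is just i_a.
a = np.random.rand(64, 64, 3)
i_a, v1_a, v2_a = rgb_to_ihs(a)
fused_rgb = ihs_to_rgb(i_a, v1_a, v2_a)
print(np.allclose(fused_rgb, a))      # True: forward + inverse round-trips
```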
3.2 Moving Image Fusion Method based on Multi-sensor
Multi-sensor image fusion is a comprehensive analysis technology that spatially registers different image data of the same scene obtained by multiple different types of sensors. The advantageous information in each image is thus complementary and can be organically combined to produce new and more informative images [24]. Image fusion, a key branch and research hotspot of information fusion, has been applied extensively in fields such as machine vision, military remote sensing, and medical diagnosis [25].
Multi-sensor image fusion is a process that integrates images or image-sequence information of a specific scene acquired by multiple sensors, simultaneously or at different times, to generate new information for interpreting the scene. Let $r_{1},r_{2},\ldots ,r_{n}$ represent the measured data obtained by the sensors for the measured parameters. Owing to the influence of sensor accuracy and environmental interference, $r_{i}$ is random. Let its corresponding random variable be $R_{i}$; in practical applications, $R_{i}$ generally obeys a normal distribution, and the measured values of the sensors are assumed to be independent of each other. Owing to the randomness of environmental interference factors, the authenticity of $R_{i}$ can be determined only from the information contained in the measurement data $r_{1},r_{2},\ldots ,r_{n}$. Hence, the higher the authenticity of $r_{i}$, the more strongly $r_{i}$ is supported by the other measurement data. The degree to which $r_{i}$ is supported by $r_{j}$ is the possibility, judged from $r_{j}$, that the measured data $r_{i}$ are real. The concept of relative distance is introduced to express the support degree between two sets of sensor-measured data.
The relative distance $d_{ij}$ between the measured data of two sensors is defined
as the following expression:
A larger $d_{ij}$ indicates a greater difference between the two measured data, i.e., a smaller mutual support between them. The relative distance is defined based on the implicit information in the data, reducing the requirement for a priori information. A support function $\vartheta _{ij}$ is defined to further quantify the mutual support between the measured data. $\vartheta _{ij}$ should meet two conditions:
(1) $\vartheta _{ij}$ should be inversely related to the relative distance;
(2) $\vartheta _{ij}\in \left[0,1\right]$, which enables the measurement data processing to take advantage of the membership function in fuzzy set theory and avoids treating the mutual support between two measurement data as absolute.
The support function $\vartheta _{ij}$ is defined as
According to Eq. (18), a smaller relative distance between two measured data leads to greater mutual support between them. The support degree equals one when the relative distance is zero, i.e., between a measurement datum and itself. In contrast, a very small $\vartheta _{ij}$ means that the relative distance between the two data is considerable; in that case, it can be deemed that there is no mutual support between them, and $\vartheta _{ij}$ is meaningless. According to the practical background of the problem, a parameter $\xi \geq 0$ can be determined such that $\vartheta _{ij}=0$ when $d_{ij}\geq \xi $. When the relative distance between two measured data is largest, there is no mutual support between them, and the support function value reaches zero. Because the value of $\vartheta _{ij}$ declines from 1 to 0 over $d_{ij}\in \left[0,+\infty \right)$, it satisfies the properties of a support function. Furthermore, the fuzzy definition of the support function $\vartheta _{ij}$ better matches the authenticity of the practical problem. The method is easy to implement and makes the fusion result more accurate and stable.
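Eqs. (17) and (18) are not reproduced above, so the sketch below only illustrates the stated properties of the support function: $\vartheta_{ij}=1$ when $d_{ij}=0$, $\vartheta_{ij}$ decreasing with distance, and $\vartheta_{ij}=0$ once $d_{ij}\geq\xi$. The linear decay and the support-weighted fusion step are illustrative assumptions, not the paper's exact definitions.

```python
# Hedged sketch of the mutual-support idea in Section 3.2. The absolute-difference
# relative distance, the linear decay to the cutoff xi, and the support-weighted
# fusion below are assumptions consistent with the stated properties, not Eqs. (17)-(18).
import numpy as np

def support_matrix(measurements, xi):
    r = np.asarray(measurements, dtype=float)
    d = np.abs(r[:, None] - r[None, :])        # relative distance d_ij between sensors
    return np.clip(1.0 - d / xi, 0.0, 1.0)     # 1 at d=0, 0 once d >= xi

def fuse(measurements, xi):
    """Weight each sensor reading by the total support it receives from the others."""
    theta = support_matrix(measurements, xi)
    weights = theta.sum(axis=1) - 1.0          # exclude self-support (always 1)
    weights = np.maximum(weights, 0.0)
    if weights.sum() == 0.0:                   # no mutual support: fall back to the mean
        return float(np.mean(measurements))
    return float(np.dot(weights, measurements) / weights.sum())

# Usage: five sensor readings of the same quantity, one of them an outlier.
print(fuse([10.1, 10.3, 9.9, 10.2, 14.8], xi=2.0))   # close to ~10.1, outlier suppressed
```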
3.3 Moving Image Fusion Optimization based on the Multi-objective PSO Algorithm
The spatial conversion of image information and the multi-sensor fusion of different image data improve the accuracy and stability of the fusion results. In addition, a multi-objective PSO algorithm is used to optimize the fusion parameters and further improve the retention of image details and the richness of image information [26]. The multi-objective particle swarm optimization algorithm is efficient and straightforward, does not require complex parameter settings, and can satisfy the many objective evaluation indicators and fusion requirements of moving image fusion. These objective evaluation indicators can serve as the optimization objective functions [27]. It is difficult for a single objective function to cover the features of image fusion completely, whereas the multi-objective particle swarm optimization algorithm can enhance the image fusion capability by considering multiple objective functions. The algorithm takes the multiple motion sub-images obtained from training as training objects, whose number is analogous to the number of PSO space dimensions. It preprocesses the sampled images, takes the divided image blocks as input values, selects the individual with the best fitness to initialize the weights of the network structure, repeats the particle iteration optimization process, and calculates the best fitness of each example to obtain the optimal particle solution. The PSO algorithm produces a global optimal particle when solving single-objective problems and a set of non-dominated (i.e., non-inferior) solutions for multi-objective problems. Therefore, appropriate particles can be selected from this set of non-dominated solutions according to specific fusion needs [28,29].
The fusion parameter optimization algorithm proceeds as follows; a code sketch is given after the steps:
(1) Set two fusion parameters $\delta _{1}$ and $\delta _{2}$, and initialize two
particle populations $\lambda _{{\delta _{1}}}$ and $\lambda _{{\delta _{2}}}$ respectively,
corresponding to fusion parameters $\delta _{1}$ and $\delta _{2}$. The search space
of the two-particle populations is [0,1], and the number of particles is $N_{k}$. The initial position $\tau _{0}$ of each
particle in the particle swarm is generated randomly, and the initialization speed
$v_{0}$ is set to 0. Individual extremum $\delta _{best}$ is initialized.
(2) Calculate the optimization objective function value $\partial _{i}\left(k\right)$
corresponding to each particle in the population, $k=1,2,\ldots ,N_{s}$. $N_{s}$ is
the number of objective functions, and the optimization objective function is the
objective evaluation index of the selected image fusion.
(3) Initialize the external archive $O$ and store the non-dominated particles in $\delta
_{best}$ into the external archive.
(4) Perform the following operations and iterate to the maximum evolutionary algebra.
1) Calculate the crowding distance of the non-dominated solution set in the external archive, and sort archive $O$ in descending order of crowding distance. The calculation formula for the crowding distance is
where $\varpi _{dist}\left(i\right)$ represents the crowding distance of an individual, which is 0 at initialization, and $\partial _{i}\left(k+1\right)$ and $\partial _{i}\left(k-1\right)$ represent the corresponding objective function values, respectively.
2) Update the velocity of the particles. The update equation is
where $w$ represents the inertia weight, $E_{W}$ is the learning factor, and $\delta _{best}\left(i,j\right)$ is the individual extreme value, i.e., the best position found by the particle.
3) Update the position of particles. When updating the position of the particles,
keep the particles in the search space. If the particle crosses the boundary, the
particle position is the corresponding boundary value, and the particle velocity is
$-V\left[i,j\right]$, so a reverse search is carried out.
4) Set the iteration number to $T$ and mutate the particle swarm. The mutation operator is
where $\rho _{Up}$ and $\rho _{Low}$ refer to the upper and lower bounds of the search space, respectively.
5) For particles in the population, calculate and evaluate their objective function
values.
6) Update the external archive and insert the non-dominated particles in the current
population into the external archive.
7) Update the individual extreme value. If the current position of the particle is
better than that stored in the individual extreme value, $\delta _{best}\left[i\right]=\delta
\left[i\right]$.
(5) The external archive is the non-dominated solution set.
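The following is a minimal, hedged sketch of steps (1)-(5). The two toy objective functions, the inertia weight, the learning factors, and the uniform re-draw mutation are placeholders; in the actual method the objectives are the selected image fusion evaluation indices computed on the fused result, and the mutation operator is the one defined above with bounds $\rho_{Up}$ and $\rho_{Low}$.

```python
# Hedged sketch of the multi-objective PSO loop of Section 3.3 over two fusion
# parameters (delta_1, delta_2) in [0, 1]. Objectives, w, learning factors, and the
# mutation rate are illustrative placeholders, not the paper's exact settings.
import numpy as np

rng = np.random.default_rng(0)

def objectives(delta):
    """Placeholder objectives to minimize over the fusion parameters."""
    d1, d2 = delta
    return np.array([(d1 - 0.3) ** 2 + d2 ** 2,
                     (d1 - 0.7) ** 2 + (d2 - 1.0) ** 2])

def dominates(fa, fb):
    return np.all(fa <= fb) and np.any(fa < fb)

def non_dominated(positions, values):
    keep = [i for i, fi in enumerate(values)
            if not any(dominates(fj, fi) for j, fj in enumerate(values) if j != i)]
    return [positions[i].copy() for i in keep], [values[i].copy() for i in keep]

def crowding_sorted(archive_pos, archive_val):
    """Sort the archive in descending order of crowding distance."""
    vals = np.array(archive_val)
    n = len(vals)
    dist = np.zeros(n)
    for k in range(vals.shape[1]):
        order = np.argsort(vals[:, k])
        dist[order[0]] = dist[order[-1]] = np.inf
        span = vals[order[-1], k] - vals[order[0], k] or 1.0
        for idx in range(1, n - 1):
            dist[order[idx]] += (vals[order[idx + 1], k] - vals[order[idx - 1], k]) / span
    order = np.argsort(-dist)
    return [archive_pos[i] for i in order], [archive_val[i] for i in order]

def mopso(num_particles=20, iters=50, w=0.5, c1=1.5, c2=1.5, pm=0.1):
    pos = rng.random((num_particles, 2))                 # delta_1, delta_2 in [0, 1]
    vel = np.zeros_like(pos)
    pbest = pos.copy()
    pbest_val = np.array([objectives(p) for p in pos])
    archive_pos, archive_val = non_dominated(list(pos), list(pbest_val))

    for _ in range(iters):
        archive_pos, archive_val = crowding_sorted(archive_pos, archive_val)
        leader = archive_pos[0]                          # most-isolated archive member
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (leader - pos)
        pos = pos + vel
        # Keep particles inside the search space; reverse the velocity at the boundary.
        out = (pos < 0.0) | (pos > 1.0)
        pos = np.clip(pos, 0.0, 1.0)
        vel[out] = -vel[out]
        # Mutation (placeholder): re-draw a few coordinates uniformly within the bounds.
        mut = rng.random(pos.shape) < pm
        pos[mut] = rng.random(int(mut.sum()))
        vals = np.array([objectives(p) for p in pos])
        # Update individual extrema and the external archive.
        better = np.array([dominates(vals[i], pbest_val[i]) for i in range(num_particles)])
        pbest[better], pbest_val[better] = pos[better], vals[better]
        archive_pos, archive_val = non_dominated(archive_pos + list(pos),
                                                 archive_val + list(vals))
    return archive_pos, archive_val

pareto_pos, pareto_val = mopso()
print(len(pareto_pos), "non-dominated fusion-parameter candidates")
```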
The fusion processing of moving images is realized through the above steps. The proposed
method will be experimentally analyzed to verify its practical application value.
4. Experimental Research
Experimental analysis was conducted to verify the comprehensiveness and effectiveness of the multi-sensor-based moving image-information fusion analysis algorithm. In the experiment, the TD and CSR-based multi-exposure image fusion method and the semantic segmentation-based infrared and visible image fusion method were compared with the proposed method in multiple aspects.
4.1 Experimental Hardware Environment and Data Source
The experimental hardware environment included an Intel Core i5 M480 CPU @ 2.67 GHz, 8 GB of memory, and a 64-bit Windows 10 operating system. The images used in this study were obtained from the ImageNet database, the largest known image database. Five data sets were set up with 1000 images of various types from the database. The experimental images were 512 ${\times}$ 512 pixels with 256 gray levels after spatial registration.
4.2 Analysis of Experimental Results
The experimental indices were divided into objective and subjective evaluation indices.
The objective evaluation indices included the standard mean-square error, information entropy, and signal-to-noise ratio. The subjective evaluation index was the human
visual effect.
(1) Objective evaluation
The standard mean-square error (MSE) reflects the degree to which the fused image retains the information of the original image; a smaller value indicates a closer approximation. The SNR behaves oppositely to the standard MSE: a larger value indicates a better fusion effect. The calculation equations for the two are shown below:
where $\varepsilon _{i}$ represents the pixel gray value of the target scene and $\varepsilon _{j}$ is the pixel gray value of the fused image.
Image information entropy (IIE) is a key index for measuring the richness of the image information; comparing IIE values compares how well image details are expressed. The information entropy reflects how much information the image carries: a larger entropy indicates better fused-image quality. The amount of information tends to be largest when the probabilities of all gray levels in the image tend to be equal. The IIE is defined as
where $\theta _{i}$ is the ratio of the number of pixels with a gray value equal to $o$ to the total number of image pixels.
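Since Eqs. (22)-(24) are not reproduced here, the sketch below uses standard definitions of the MSE, SNR (in dB), and gray-level information entropy as an illustration; the exact normalizations in the paper may differ, so the code is only an assumption-level reference.

```python
# Hedged sketch of the objective evaluation indices. Standard definitions of MSE,
# SNR (in dB), and gray-level information entropy are used; the paper's exact
# Eqs. (22)-(24) may normalize differently.
import numpy as np

def mse(reference, fused):
    diff = reference.astype(float) - fused.astype(float)
    return float(np.mean(diff ** 2))

def snr_db(reference, fused):
    noise = np.sum((reference.astype(float) - fused.astype(float)) ** 2)
    signal = np.sum(reference.astype(float) ** 2)
    return float(10.0 * np.log10(signal / (noise + 1e-12)))

def information_entropy(image, levels=256):
    hist, _ = np.histogram(image, bins=levels, range=(0, levels))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# Usage on a hypothetical 8-bit reference frame and its fused result.
ref = np.random.randint(0, 256, (512, 512), dtype=np.uint8)
fused = np.clip(ref + np.random.normal(0, 5, ref.shape), 0, 255).astype(np.uint8)
print(mse(ref, fused), snr_db(ref, fused), information_entropy(fused))
```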
According to Eqs. (22) and (23), the MSE and SNR values of the images were calculated for the different data sets using the proposed method and the two comparison methods. The experiment was repeated three times, and the average MSE and SNR values were taken as the experimental results. The results are shown in Tables 1 and 2.
According to an analysis of the data in Table 1, for the image information of the different data sets, the MSE values of the multi-exposure image fusion method based on TD and CSR and of the infrared and visible image fusion method based on semantic segmentation were almost all above 500, and the maximum average MSE value of TD and CSR reached 1006.98. The minimum mean square error values of the two methods were 579.13 and 488.56, both higher than the 410.65 of the proposed method. The minimum average MSE of the proposed method was 286.25, and its maximum signal-to-noise ratio was 24.66. These results show that the proposed fusion method can effectively improve image similarity and reduce the error value.
According to the data analysis in Table 2, the SNR values of the TD and CSR-based multi-exposure image fusion method and of the semantic segmentation-based infrared and visible image fusion method were lower than those of the proposed method. The maximum SNR values of the two comparison methods were 12.36 and 12.39, respectively, while the maximum SNR value of the proposed method was 24.75, indicating that the proposed fusion method has a better fusion effect.
Table 1. Comparison results of the MSE values.
Dataset Number | The method put forward | TD and CSR | Semantic segmentation
1 | 410.65 | 579.13 | 488.56
2 | 348.73 | 632.44 | 592.94
3 | 286.25 | 720.13 | 826.03
4 | 412.66 | 1006.98 | 795.25
5 | 337.52 | 688.59 | 863.89
Table 2. Comparison results of the SNR values.
Dataset Number | The method put forward | TD and CSR | Semantic segmentation
1 | 23.43 | 11.22 | 9.85
2 | 18.94 | 12.18 | 10.53
3 | 24.66 | 10.99 | 11.48
4 | 18.42 | 10.52 | 12.41
5 | 16.53 | 12.23 | 10.01
According to the data analysis in Table 3, the feature classification accuracy of the method proposed in this paper was more than 91%, with a maximum of 95.16%, which was higher than that of the other two fusion algorithms. The classification accuracy of the other two fusion algorithms was below 90%, with maximum values of 85.24% and 86.14%. Table 4 shows that the maximum information entropy of the proposed method is 9.2, which is significantly higher than that of the two traditional methods. These results show that, after fusion with the proposed method, image detail retention was higher, the feature classification effect was better, the image richness was improved, and the fusion quality was better.
The information entropy of different methods was calculated using Eq. (24), and the results are shown in Fig. 4.
Table 3. Comparison results of the feature classification accuracy.
Dataset Number | The method put forward (%) | TD and CSR (%) | Semantic segmentation (%)
1 | 95.16 | 85.24 | 84.25
2 | 94.33 | 84.37 | 83.28
3 | 91.25 | 81.26 | 86.14
4 | 93.89 | 83.69 | 87.22
5 | 94.07 | 77.23 | 85.19
Table 4. Comparison results of the information entropy.
Dataset Number | The method put forward | TD and CSR | Semantic segmentation
1 | 9.13 | 8.26 | 8.35
2 | 9.15 | 8.77 | 8.41
3 | 9.20 | 8.45 | 8.33
4 | 9.18 | 8.96 | 8.26
5 | 9.17 | 8.87 | 8.98
(2) Subjective evaluation
The above objective evaluation results show that the proposed method has a good image fusion effect. To verify its application value, an image in the experimental set was selected arbitrarily for fusion processing, and the visual effects of image fusion with the different methods were compared. The results are shown in Fig. 5.
The original remote sensing data to be fused were preprocessed to reduce the impact of the spectrum and acquisition time on the image information and to limit data errors. The multi-source image data were spatially registered, i.e., the high-resolution image data were used as the reference datum, and control points were selected to perform geometric correction on the other images. The original image in Fig. 5 is the image to be processed, and the target image is the image after preprocessing. The original image was sparsely represented and convolved so that the image under the multi-sensor setting could be fused better. In addition, the redundant information in the lower-right part of Fig. 5(d) is eliminated, and the main feature information is preserved. At the same time, the small rectangular box in the semantic segmentation result in Fig. 5(d) shows that when the target images after different processing are fused, the displayed information features are relatively rich (the rectangular box on the right side of Fig. 5(d) marks the part added compared with the original image), and the image information can be smoothed on the basis of its integration.
The original image was input and preprocessed, including data format conversion and image filtering (wavelet transform). The color space information of the processed image was then converted, and the data information was fused with the multiple sensors. Before image fusion, the other images were calibrated and registered against the high-resolution image. The processed target image was sparsely represented and convolved, and particle swarm optimization was carried out, taking the requirements of multi-objective fusion into account, to find the optimal particle solution. The fused image, including semantic and scene segmentation, was resampled and post-processed to obtain the information features of the multiple motion images after fusion.
Fig. 5 shows that the fusion result of the proposed method retains rich image information and improves the spatial detail expression of the fused image. Compared with the traditional methods, the proposed method produced clearer images with more obvious detailed features, less image distortion, higher definition, and a better visual effect.
According to the experimental results, the image fused using the proposed method was better than those of the traditional fusion methods in both subjective visual effect and objective statistical data. Its performance in maintaining image information and enhancing spatial detail is improved.
Remote sensing image data were selected for the fusion algorithm comparison, and Figs. 6 and 7 present the results. The dynamic visual effects of the panchromatic images, which were processed by a wavelet transform and inverse wavelet transform and fused with spectral images by multi-sensor processing, were quite different. The research mainly evaluated the difference between spatial detail features and spectral information, for which information entropy and sharpness can reflect the image fusion quality. Figs. 6 and 7 show that the images (b) and (c) obtained after the wavelet transform and inverse wavelet transform exhibit some information distortion and blurring, and a certain deviation was noted between their characteristic information content and the original image. The proposed multi-sensor image fusion can effectively account for the differences between image information at different scales. The spectral distortion in Figs. 6(d) and 7(d) was less than in the other images, and the sharpness increased significantly, with better visual effects and presentation quality.
Fig. 4. Comparison results of the information entropy.
Fig. 5. Comparison of image fusion effects.
Fig. 6. Image fusion effect based on the region features.
Fig. 7. Image fusion effect based on the region features.