Obtaining landscape plant image data with drones and performing intelligent image
recognition can quickly provide management personnel with information on the overall
vegetation distribution and vegetation type structure of a landscape, helping them
manage landscape vegetation more scientifically and monitor vegetation growth. This
study designs an adaptive-threshold-based plant remote sensing image segmentation
and mask algorithm, together with an improved ResNet50 plant remote sensing image
recognition algorithm incorporating Squeeze-and-Excitation (SE) channel attention.
The two are combined to form a plant remote sensing image recognition model for landscape
design.
3.1. Plant Remote Sensing Image Segmentation and Mask Algorithm Based on Adaptive
Threshold
Firstly, an enhancement processing module is designed for the dataset of the plant
remote sensing image recognition model [11,12]. Because the original remote sensing images have high resolution and large size,
using them directly slows model training [13-15]. Therefore, the original images must be downsampled, reducing the image length and
width to $\tau $ times the original in equal proportion; that is, the original image
is sampled every $1/\tau $ pixels in the row and column directions [16]. Considering the high resolution of the dataset in this study, setting $\tau $ to
0.2 is appropriate. Because the backgrounds of landscape plant remote sensing images
contain a large amount of environmental noise, filtering is still required [17,18].
To minimize the loss of true information in the denoised image, median filtering
is the most appropriate choice. The image also needs a contrast stretching transformation.
Where features are concentrated, radiation intensities are similar and contrast is
low, which makes recognition more difficult for the algorithm. Because nonlinear
stretching is sensitive to parameters and can affect the stability of the recognition
model, grayscale stretching by linear stretching is more suitable for the dataset
of this study. The grayscale histogram of an image reveals its grayscale brightness
and contrast characteristics. Let the grayscale range of image $f(i,j)$ be $[a$, $b]$,
and the grayscale range of the linearly transformed image $g(i,j)$ be $[a'$, $b']$;
then $g(i,j)$ is calculated according to Eq. (1).
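For reference, the standard linear grayscale stretching formula consistent with these definitions, which Eq. (1) presumably follows (a sketch, assuming a direct linear mapping from $[a, b]$ to $[a', b']$), is:

$$g(i,j)=\frac{b'-a'}{b-a}\bigl(f(i,j)-a\bigr)+a'$$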
To be precise, a piecewise linear transformation is used here, which highlights the
grayscale range of the target of interest and suppresses the ranges of no interest.
Let the grayscale interval of the target of interest in the initial image $f(i,j)$
be $[a$, $b]$, and the overall grayscale interval of the image be $[0$, $M_{f}]$.
Eq. (2) can then be used to expand the grayscale range of the target of interest to $[c$, $d]$.
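A standard three-segment piecewise linear stretch consistent with these definitions, which Eq. (2) presumably resembles (an assumed form, in which $M_{g}$ denotes the maximum grayscale of the transformed image and is introduced here for illustration), is:

$$g(i,j)=\begin{cases}\dfrac{c}{a}\,f(i,j), & 0\le f(i,j)<a\\[4pt]\dfrac{d-c}{b-a}\bigl(f(i,j)-a\bigr)+c, & a\le f(i,j)\le b\\[4pt]\dfrac{M_{g}-d}{M_{f}-b}\bigl(f(i,j)-b\bigr)+d, & b<f(i,j)\le M_{f}\end{cases}$$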
In summary, the principles of linear stretching transformation and piecewise linear
stretching transformation are shown in Fig. 1.
Fig. 1. Principle demonstration of linear and piecewise linear stretching transformation.
Because non-ornamental buildings are not objects of landscape design, an algorithm
must be constructed to mask them. This study proposes a binary mask algorithm with
an adaptive threshold based on the maximum inter-class variance method.
Considering that the binary image mask method is simple to process, has low complexity,
and supports easy logical operations, and that the image to be processed has already
been denoised, the binary image method was chosen for mask processing. A binary image
contains only two values, 0 and 1, determined by a set grayscale threshold $T$: pixels
not higher than $T$ are mapped to 0, and the remaining pixels are mapped to 1. Clearly,
finding a reasonable segmentation threshold is the key to the mask. Here, the maximum
inter-class variance (Otsu) method is chosen for its fast computation and good segmentation
performance. The designed threshold calculation process based on the maximum inter-class
variance method is as follows. Firstly, for image $I(x,y)$, a threshold $T$ is set
to distinguish foreground from background, dividing the image grayscale levels into
two parts, $C_{1} =\{0$, $1$, $2$, ..., $T\}$ and $C_{2} =\{T+1$, $T+2$, ..., $n-1\}$.
The proportion of foreground pixels to the total image size is $\omega _{0} $, with
corresponding average grayscale value $\mu _{0} $; the proportion of background pixels
is $\omega _{1} $, with corresponding average grayscale value $\mu _{1} $. The total
average grayscale of the image is $\mu $, and the inter-class variance is $S$. If
the image size is $M\times N$, the number of pixels with grayscale values not greater
than $T$ is $N_{0} $, and the number with grayscale values greater than $T$ is $N_{1} $.
Therefore, in the first step, the foreground and background proportions can be calculated,
as shown in Eq. (3).
The second step is to calculate the total number $N_{all} $ of pixels, as shown in
Eq. (4).
Considering that the sum of background probability and foreground probability is 1,
the average grayscale value can be calculated according to Eq. (5).
The third step is to calculate the inter class variance, as shown in Eq. (6).
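For reference, the standard maximum inter-class variance relations matching the definitions above, which Eqs. (3)-(6) presumably take (a sketch of the likely forms), are:

$$\omega_{0}=\frac{N_{0}}{M\times N},\qquad \omega_{1}=\frac{N_{1}}{M\times N}$$

$$N_{all}=N_{0}+N_{1}=M\times N,\qquad \omega_{0}+\omega_{1}=1$$

$$\mu=\omega_{0}\mu_{0}+\omega_{1}\mu_{1}$$

$$S=\omega_{0}(\mu_{0}-\mu)^{2}+\omega_{1}(\mu_{1}-\mu)^{2}=\omega_{0}\omega_{1}(\mu_{0}-\mu_{1})^{2}$$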
Maximizing Eq. (6) over all candidate thresholds yields the required binary threshold $T_{k}$.
Using the binary mask based on the maximum inter-class variance method, the main
framework of the mask area can be extracted. To further optimize the details, a series
of morphological operations must be performed on the segmented image to improve the
mask effect. The morphological operations used here are erosion and dilation. From
a mathematical perspective, erosion or dilation is a convolution-like operation between
a complete or partial image (denoted as $A$) and its corresponding structuring element
(denoted as $B$).
In the erosion calculation, assuming $B$ erodes $A$, the calculation method is shown
in Eq. (7).
If $B$, after translation by $z$, can be completely contained within $A$, then the
set of such $z$ points forms the erosion of $A$ by $B$, and the intersection of the
translated $B$ with the complement of $A$ is empty, as shown in Eq. (8).
Similarly, the dilation of set $A$ by $B$, denoted $A\oplus B$, can be described using Eq. (9).
That is, if the mirror image of $B$ intersects $A$ after translation by $z$, the set
of such $z$ points forms the dilation of $A$ by $B$; the intersection of the translated
mirror image of $B$ with $A$ cannot be empty, that is, Eq. (10) holds.
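For reference, the standard set-theoretic definitions of erosion and dilation, which Eqs. (7)-(10) presumably follow (here $(B)_{z}$ denotes $B$ translated by $z$, $\hat{B}$ its mirror image, and $A^{c}$ the complement of $A$), are:

$$A\ominus B=\{\,z\mid (B)_{z}\subseteq A\,\}\quad\Longleftrightarrow\quad (B)_{z}\cap A^{c}=\varnothing$$

$$A\oplus B=\{\,z\mid (\hat{B})_{z}\cap A\neq\varnothing\,\}$$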
In summary, the image morphology processing after binary segmentation proceeds as
follows. The first step removes micro particles to protect the real target from being
masked. The second step connects disconnected parts through the erosion calculation.
The third step performs the dilation operation to cover adjacent areas. The fourth
step fills the internal holes. Finally, the binary image is inverted and multiplied
element-wise with the original image to generate the mask region. At this point, the
segmentation mask algorithm for plant remote sensing images is complete, and the overall
process is shown in Fig. 2.
Fig. 2. Segmentation mask algorithm for plant remote sensing image.
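For concreteness, the following Python/OpenCV sketch strings together the steps described above (downsampling, median filtering, Otsu binarization, morphological cleanup, and masking); the function choices, kernel sizes, and iteration counts are illustrative assumptions rather than the exact implementation used in the study.

```python
import cv2
import numpy as np

def segment_and_mask(path: str, tau: float = 0.2) -> np.ndarray:
    """Illustrative sketch of the segmentation/mask pipeline (assumed parameters)."""
    img = cv2.imread(path)                                    # original remote sensing image
    img = cv2.resize(img, None, fx=tau, fy=tau,               # downsample to tau times the size
                     interpolation=cv2.INTER_AREA)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    gray = cv2.medianBlur(gray, 5)                            # median filtering to suppress noise
    # Otsu: binary threshold chosen by maximum inter-class variance
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    kernel = np.ones((3, 3), np.uint8)
    binary = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)   # remove micro particles
    binary = cv2.erode(binary, kernel)                          # erosion step
    binary = cv2.dilate(binary, kernel, iterations=2)           # dilation to cover adjacent areas
    binary = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)  # fill internal holes
    mask = cv2.bitwise_not(binary)                              # invert the binary image
    return cv2.bitwise_and(img, img, mask=mask)                 # element-wise mask of the original
```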
3.2. Improved ResNet50 Plant Remote Sensing Image Recognition Model with Mixed SE
Channel Attention
After the preprocessing and segmentation masking of the plant remote sensing images
are completed, the images are input into the recognition model. The ResNet50 neural
network algorithm has excellent feature recognition ability and is chosen as the basis
of the recognition model. To further enhance the ResNet50 algorithm's ability to recognize
key features and its performance on small-sample datasets, the algorithm is improved
as follows.
Because landscape recognition tasks involve diverse types of landscapes and vegetation,
some of which have similar shapes, an SE channel attention module is added after each
Bottleneck block in ResNet50. The SE module adaptively recalibrates the features extracted
by the convolutional layers through two steps, ``Squeeze'' and ``Excitation''. In the
``Squeeze'' phase, the module captures the importance of each feature channel by global
average pooling over the spatial dimensions, generating a vector that describes the
global information. In the ``Excitation'' phase, channel importance is remapped by
fully connected layers, so that the network focuses more on features that contribute
to plant recognition and can effectively extract key plant features when dealing with
similar-looking vegetation. The final ResNet50 structure integrating the SE modules
and the transfer learning module is shown in Fig. 3.
Fig. 3. Improved ResNet50 structure incorporating the SE channel attention module.
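For concreteness, a minimal PyTorch sketch of an SE channel attention block of the kind described above is given below; the reduction ratio and layer names are assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation channel attention (illustrative sketch)."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)          # Squeeze: global average pooling
        self.excitation = nn.Sequential(                # Excitation: two FC layers
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                               # per-channel weights in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.squeeze(x).view(b, c)                  # (B, C): global channel descriptor
        w = self.excitation(w).view(b, c, 1, 1)         # remapped channel importance
        return x * w                                    # recalibrate the feature map
```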
The detailed structure of each convolution module in Fig. 3 is shown in Table 1. In the convolutional parameters column, the value after ``Conv'' gives the convolution
kernel size, the next value gives the number of filters in the convolutional module,
``stride'' gives the step size of the corresponding convolution, and ``FC'' denotes
a fully connected layer.
Table 1. Detailed structure of the improved ResNet50 convolution module incorporating
SE channel attention.
| Number | Convolutional layer | Convolutional parameters | Output size |
| #1 | Conv1 | Conv1, 14×14, 128, stride 2 | 224×224 |
| #2 | Conv2 | Max pool, 3×3, stride 2; [Conv2_1, 1×1, 128; Conv2_2, 3×3, 128; Conv2_3, 1×1, 128] ×3 | 112×112 |
| #3 | Conv3 | [Conv3_1, 1×1, 256; Conv3_2, 3×3, 256; Conv3_3, 1×1, 1024] ×4 | 56×56 |
| #4 | Conv4 | [Conv4_1, 1×1, 512; Conv4_2, 3×3, 512; Conv4_3, 1×1, 2048] ×6 | 28×28 |
| #5 | Conv5 | [Conv5_1, 1×1, 1024; Conv5_2, 3×3, 1024; Conv5_3, 1×1, 2048] ×3 | 14×14 |
Considering that the training dataset in plant remote sensing landscape recognition
tasks may be small in scale and may not cover enough plant and landscape species,
transfer learning is integrated into the construction of the recognition model. Transfer
learning can be divided into four types: feature-based, sample-based, relationship-based,
and model-parameter-based transfer. Considering the difficulty of implementation and
the type and scale of currently available academic plant image datasets, model-based
transfer learning is chosen here. Specifically, an image classification model pre-trained
on the ImageNet dataset is used to initialize the convolutional layer parameters of
the improved ResNet50 algorithm.
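A minimal sketch of this model-based transfer step, assuming a recent torchvision API and an illustrative class count, might look as follows.

```python
import torch.nn as nn
from torchvision import models

# Initialize convolutional layers from ImageNet-pretrained weights (model-based transfer)
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

# Replace the classification head for this task's plant/landscape categories
num_classes = 12  # illustrative assumption; set to the actual number of classes
backbone.fc = nn.Linear(backbone.fc.in_features, num_classes)

# Optionally freeze early layers so only later layers adapt to the plant dataset
for param in backbone.layer1.parameters():
    param.requires_grad = False
```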
The SE attention module designed in the model obtains channel weights from the feature
map and then fuses these weights with the feature map to emphasize data of higher
importance. SE-type attention modules are chosen because they require few additional
parameters and can capture the correlation between different channels, making them
well suited to the ResNet50 structure used in this study. Specifically, the SE module
is integrated into the residual modules of the ResNet50 algorithm, enabling the algorithm
to better learn the weight information of different channels in the feature map. The
structure of the residual module integrating SE channel attention is shown in Fig. 4.
Fig. 4. The residual module structure that integrates SE channel attention.
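As an illustration of this integration, the following PyTorch sketch places SE recalibration on the residual branch of a Bottleneck module, with dropout after the SE step as described later in the text; the wiring and hyperparameters are assumptions based on Fig. 4, not the authors' exact design.

```python
import torch
import torch.nn as nn

class SEBottleneck(nn.Module):
    """Bottleneck residual module with SE recalibration and dropout (illustrative sketch)."""
    def __init__(self, in_ch: int, mid_ch: int, out_ch: int,
                 reduction: int = 16, p_drop: float = 0.3):  # p_drop is an assumed value
        super().__init__()
        self.branch = nn.Sequential(  # standard 1x1 -> 3x3 -> 1x1 bottleneck
            nn.Conv2d(in_ch, mid_ch, 1, bias=False), nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 3, padding=1, bias=False), nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch),
        )
        self.se = nn.Sequential(      # SE: squeeze (pool) + excitation (1x1 convs acting as FC)
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(out_ch, out_ch // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch // reduction, out_ch, 1), nn.Sigmoid(),
        )
        self.drop = nn.Dropout2d(p_drop)  # dropout placed after the SE module, as in the text
        self.shortcut = (nn.Identity() if in_ch == out_ch
                         else nn.Conv2d(in_ch, out_ch, 1, bias=False))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.branch(x)
        out = out * self.se(out)                  # re-weight channels of the residual branch
        out = self.drop(out)
        return self.relu(out + self.shortcut(x))  # residual connection
```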
The loss function of the ResNet50 algorithm is also redesigned. In plant remote sensing
image recognition tasks, there is serious class imbalance among the different types
of plants and landscapes. The CrossEntropyLoss function weights all samples equally,
which leads to poor performance on hard-to-classify samples. Therefore, the Focal
Loss function is chosen here for the ResNet50 algorithm; the calculation of this function,
$focal\_loss(\rho _{t})$, is shown in Eq. (11).
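For reference, the standard focal loss form consistent with the symbol definitions below, which Eq. (11) presumably takes, is:

$$focal\_loss(\rho_{t})=-\alpha\,(1-\rho_{t})^{\gamma}\log(\rho_{t})$$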
In Eq. (11), $\alpha $ represents the category weight, $\rho _{t} $ represents the algorithm's
confidence in recognizing a sample, with a larger $\rho _{t} $ indicating stronger
recognition ability, and $\gamma $ represents the coefficient controlling the weight
of samples of different classification difficulty in the loss function. After multiple
experiments, setting $\gamma $ to 2.3 proved most appropriate for this study. To prevent
overfitting of the neural network, a random dropout module is incorporated into the
network after each SE module. Assume that during dropout each neuron stops working
with probability $p$. Before this step is carried out, for neuron $i$ the total input
is calculated according to Eq. (12), and the output is obtained by activation according to Eq. (13).
In Eq. (12), $z_{i}^{(l+1)} $ is the total input, $w_{i}^{(l+1)} $ is the neuron weight coefficient,
$y^{l} $ is the output of the connected neurons in layer $l$, and $b_{i}^{(l+1)} $
is the bias coefficient.
In Eq. (13), $y_{i}^{(l+1)} $ represents the corresponding prediction result, and $f(\cdot )$
represents the activation function. When the random discard operation is performed
with discard probability $p$, the neuron output is as shown in Eq. (14).
In Eq. (14), $\tilde{y}^{(l)} $ represents the output of layer $l$ after random discarding.
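For reference, the standard dropout formulation matching these definitions, which Eqs. (12)-(14) presumably follow (here the mask entries $r_{j}^{(l)}$ equal 1 with probability $1-p$, since $p$ is the probability of a neuron stopping work), is:

$$z_{i}^{(l+1)}=w_{i}^{(l+1)}y^{l}+b_{i}^{(l+1)},\qquad y_{i}^{(l+1)}=f\bigl(z_{i}^{(l+1)}\bigr)$$

$$r_{j}^{(l)}\sim \mathrm{Bernoulli}(1-p),\qquad \tilde{y}^{(l)}=r^{(l)}\ast y^{(l)},\qquad z_{i}^{(l+1)}=w_{i}^{(l+1)}\tilde{y}^{(l)}+b_{i}^{(l+1)}$$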
Here, the gradient descent method is used to optimize the network. The derivative
of $focal\_loss()$ is first taken to obtain the gradient, which updates the model
parameters until convergence, as shown in Eq. (15).
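The standard gradient descent update consistent with the symbols defined next, which Eq. (15) presumably takes, is:

$$\theta_{j+1}=\theta_{j}-lr\cdot\frac{\partial\, focal\_loss()}{\partial\theta_{j}}$$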
In Eq. (15), $\theta _{j} $ represents the model parameters at the $j$-th iteration, $\partial focal\_loss()/\partial \theta _{j} $
represents the calculated gradient, and $lr$ represents the network learning rate.
In summary, the calculation process of the plant
remote sensing image recognition model, which integrates the improved ResNet50 algorithm
and binary mask segmentation, is shown in Fig. 5.
Fig. 5. Calculation process of plant remote sensing image recognition model.