3.1 AlexNet Model and Its Generalized Finetune Processing Model
The AlexNet model, proposed by Alex Krizhevsky in the 2012 ILSVRC competition, is built on a deep CNN and is known for its low error rate in image classification, only 16.4% [21]. The model first takes a $227 \times 227$ image as input, which passes through convolutional layers (CLs), a nonlinear activation operation, pooling layers and normalisation; a dropout operation is then applied to prevent overfitting, the result passes through two consecutive fully connected layers, and the output is fed into a SoftMax classifier for classification. Dropout mitigates the overfitting problem that arises during model training, while the data are augmented through methods such as horizontal flipping. The ReLU function in the model acts as the nonlinear activation function of the neurons, making the training of the model more efficient [22]. The ReLU function requires fewer operations in the forward calculation and in back propagation when computing the gradients, omitting steps such as division; since ReLU is essentially a piecewise function, it is less computationally intensive and can handle the computation more efficiently. The structure of AlexNet is presented in Fig. 1.
Fig. 1. AlexNet model structure diagram.
As shown in Fig. 1, the AlexNet model includes five CLs and two fully connected layers, where the standard size of the input image is $227 \times 227$ pixels, while the feature map is compressed to $6\times6$ pixels after processing; the land cover task here involves fewer classification categories with fewer samples. Back propagation is based on the error of the CNN output and uses gradient descent to correct the neural network weights; the CNN uses the SGD algorithm to update the connection weights, and the main formula is shown in Equation (1).
In Equation (1), $\delta _{j}^{l} $ and $\delta _{j}^{l+1} $ are the sensitivities of the error to the bias for the neurons of the $j$-th feature map in layers $l$ and $l+1$. ${f}' $ is the derivative of the activation function $f$. $u_{j}^{l} $ is the value fed to the activation function for the neurons of the $j$-th feature map in layer $l$. $up$ is the upsampling function. $L$ is the loss function describing the error between samples and labels, and $P_{i}^{l-1} $ is the patch of neurons in $x_{i}^{l-1} $ that is multiplied by $\omega _{ij}^{l} $ in the convolution operation. $\omega _{t+1}^{l} $ and $\omega _{t}^{l} $ are the values of the weights of layer $l$ at iterations $t+1$ and $t$ respectively. $\eta $ is the learning rate, and $\mu $ is the momentum coefficient.
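For reference, a standard formulation consistent with these definitions (the usual CNN sensitivity back propagation together with a momentum SGD update; this is an assumed reconstruction, not necessarily the paper's exact expression for Equation (1)) is

$\delta _{j}^{l} ={f}'\!\left(u_{j}^{l}\right)\circ up\!\left(\delta _{j}^{l+1}\right),\qquad \dfrac{\partial L}{\partial \omega _{ij}^{l}} =\sum _{u,v}\left(\delta _{j}^{l}\right)_{uv}\left(P_{i}^{l-1}\right)_{uv},\qquad \omega _{t+1}^{l} =\omega _{t}^{l} -\eta \dfrac{\partial L}{\partial \omega _{t}^{l}} +\mu \left(\omega _{t}^{l} -\omega _{t-1}^{l}\right)$

where $\circ$ denotes element-wise multiplication.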
The basic structure of the CNN includes CLs, ReLU units, downsampling layers, normalisation, a Dropout strategy and a SoftMax classifier, which together make up the AlexNet model. The output of the channels in the CL is calculated as shown in Equation (2).
In Equation (2), $x_{j}^{l} $ is the output of the $j$-th channel of the $l$-th CL. It is the result of convolving the feature maps $x_{i}^{l-1} $ of the previous layer with the convolution kernels of the $j$-th feature map of the $l$-th CL and adding the bias, where $f$ is the activation function. The formula for $u_{j}^{l} $ is shown in Equation (3).
In Equation (3), $M_{j} $ is the set of feature maps in the previous layer used to compute $u_{j}^{l} $. $k_{ij}^{l} $ is the convolution kernel, and $b_{j}^{l} $ is the bias of the feature map after the operation.
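For reference, the standard convolutional-layer relations consistent with the definitions given for Equations (2) and (3) (an assumed reconstruction of the two formulas) are

$x_{j}^{l} =f\!\left(u_{j}^{l}\right),\qquad u_{j}^{l} =\sum _{i\in M_{j}}x_{i}^{l-1} \ast k_{ij}^{l} +b_{j}^{l}$

where $\ast$ denotes the convolution operation.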
The piecewise function form of ReLU is shown in Equation (4), where $z$ is the result of the convolution calculation in the previous feature map layer.
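The ReLU function itself takes the standard form

$f(z)=\max (0,z)=\begin{cases}z, & z>0 \\ 0, & z\le 0\end{cases}$

which is the form assumed here for Equation (4).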
The downsampling layer takes the input feature map, further reduces the model parameters, and outputs the downsampled feature map as shown in Equation (5). The formula for calculating $u_{j}^{l} $ is shown in Equation (6).
In Equation (6), $\beta $ is the coefficient of the downsampling operation. $u_{j}^{l} $ is the feature map of the $j$-th channel of the downsampling layer $l$. $x_{j}^{l-1} $ is the feature map of the previous layer on which the downsampling operation acts, and $b_{j}^{l} $ is the bias term of the downsampling layer. $down$ is the downsampling function, which takes the mean or maximum value of the feature map.
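For reference, a standard form of the downsampling relations consistent with these definitions (an assumed reconstruction of Equations (5) and (6)) is

$x_{j}^{l} =f\!\left(u_{j}^{l}\right),\qquad u_{j}^{l} =\beta \, down\!\left(x_{j}^{l-1}\right)+b_{j}^{l}$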
The equation for the corresponding local response normalisation operation is shown in Equation (7).
In Equation (7), $n$ is the number of adjacent kernel maps considered around the $i$-th kernel at a given position, $\alpha $, $k$ and $\beta $ are hyperparameters, $a_{x,y}^{j} $ denotes the result of the convolution operation followed by the activation function for the $j$-th kernel at position $(x,y)$, and $N$ is the total number of convolution kernels.
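The local response normalisation used in AlexNet, consistent with these definitions, takes the form

$b_{x,y}^{i} =a_{x,y}^{i} \Big/ \left(k+\alpha \sum _{j=\max (0,\, i-n/2)}^{\min (N-1,\, i+n/2)}\left(a_{x,y}^{j}\right)^{2} \right)^{\beta }$

where $b_{x,y}^{i} $ is the normalised response; this standard form is assumed here for Equation (7).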
The formula for the SoftMax function is shown in Equation (8), where $K$ is the number of neurons in the output layer and $z_{j} $ is the output value of the $j$-th category predicted by the model.
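The SoftMax function itself takes the standard form

$\sigma (z)_{j} =\dfrac{e^{z_{j}}}{\sum _{k=1}^{K}e^{z_{k}}},\qquad j=1,\dots ,K$

which is the form assumed here for Equation (8).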
The original AlexNet model is not directly applicable to remote sensing land cover image classification, so this study innovatively combines the structure of the AlexNet model with the land cover classification problem; finetune processing on the basis of the AlexNet model makes the model more widely applicable to remote sensing land cover image classification. Specifically, the finetune process initialises the model with fully trained model parameters. The merit of this method is that training on existing model parameters reduces the workload, and only a small number of adjustments are made on the selected samples until the model fits the data.
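As an illustration only (not the paper's actual code), a minimal PyTorch sketch of this finetune idea, assuming the torchvision AlexNet weights and a hypothetical number of land cover categories, could look as follows:

import torch.nn as nn
from torchvision import models

num_classes = 6  # hypothetical number of land cover categories

# Initialise the model with fully trained (ImageNet) parameters.
model = models.alexnet(pretrained=True)

# Freeze the convolutional feature extractor so that only a small
# number of adjustments are made to the remaining parameters.
for p in model.features.parameters():
    p.requires_grad = False

# Replace the last fully connected layer so that the output matches
# the land cover categories instead of the 1000 ImageNet classes.
model.classifier[6] = nn.Linear(4096, num_classes)

The frozen layers can later be unfrozen and trained with a small learning rate if the selected samples do not yet fit the data well.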
The primary steps in remote sensing land cover image classification are the initial processing of the image, followed by feature extraction, classification by the algorithm, and finally post-processing and accuracy evaluation. The remote sensing image classification workflow is shown in Fig. 2.
Fig. 2. Remote sensing image classification flowchart.
As shown in Fig. 2, remote sensing image classification first requires pre-processing of the image, since the learning efficiency of the algorithm is closely connected to the quality of the input data [23]. The pre-processing mainly comprises radiometric correction, geometric correction, image enhancement and noise removal, which reduce the impact of image defects on later processing. The second step is the selection and extraction of features, whose purpose is to improve the accuracy of image classification; with good features, even a simple algorithm can achieve good results. The features are divided into two kinds, spectral features and spatial texture features. Spectral features appear as the spectral value of the ground object, the grayscale value or the ratio between bands, while spatial features are the main features used in human recognition [24]. Features are the data that intuitively determine how good a classification is, and the key to accurate image classification is better features. The common algorithms used for supervised classification are the maximum likelihood method, SVMs and artificial neural networks (ANNs). The maximum likelihood method achieves relatively high accuracy, but does not perform well in terms of computational simplicity: it is not efficient enough and needs a large amount of data to compute. The maximum likelihood method builds a discriminant function based on Bayesian statistical methods, and the posterior probability of occurrence of $x$ is calculated as shown in Equation (9).
As shown in Equation (9), $S$ is the total number of categories, $y_{i} $ denotes the $i$-th category, whose prior probability is $p(y_{i} )$, and $p(x/y_{i} )$ is the conditional probability density function of $x$ for the $i$-th category; when $x$ satisfies Equation (10), the category of $x$ is $y_{i} $.
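For reference, the standard Bayesian forms consistent with these definitions (assumed reconstructions of Equations (9) and (10)) are

$p(y_{i} /x)=\dfrac{p(x/y_{i} )\, p(y_{i} )}{\sum _{j=1}^{S}p(x/y_{j} )\, p(y_{j} )} ,\qquad p(y_{i} /x)=\max _{j=1,\dots ,S} p(y_{j} /x)$

i.e. $x$ is assigned to the category with the largest posterior probability.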
SVMs require not only correct classification but also a high degree of confidence in the classification results. SVM classification transforms the input data from a low-dimensional space to a high-dimensional space using a kernel function, and seeks the classification hyperplane in the high-dimensional space that maximises the margin between the data. Firstly, samples with known classes are selected for training, and the quadratic optimisation formula is obtained by the Lagrangian dual transformation, as shown in Equation (11).
In Equation (11), $l$ is the total number of training samples, $k$ is the kernel function, and $a$ is the Lagrange multiplier; a common representation of the kernel function is shown in Equation (12).
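For reference, the standard dual optimisation problem and a common (radial basis function) kernel consistent with these definitions are given below; this is an assumed reconstruction of Equations (11) and (12), and the penalty parameter $C$ and kernel width $\sigma $ are standard symbols not defined in the text above:

$\max _{a} \sum _{i=1}^{l}a_{i} -\frac{1}{2} \sum _{i=1}^{l}\sum _{j=1}^{l}a_{i} a_{j} y_{i} y_{j}\, k(x_{i} ,x_{j} ),\qquad \text{s.t.}\ \sum _{i=1}^{l}a_{i} y_{i} =0,\ 0\le a_{i} \le C$

$k(x_{i} ,x_{j} )=\exp \left(-\dfrac{\left\| x_{i} -x_{j} \right\| ^{2} }{2\sigma ^{2} } \right)$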
The samples corresponding to the solution of the quadratic programming problem are the support vectors, and the classification function is given in Equation (13).
In Equation (13), $sgn$ is the sign function used to discriminate the samples and $b$ is the intercept term. The category of a sample is obtained by substituting the unknown sample $X$ taken from the unclassified image into the classification function.
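For reference, the standard SVM decision function consistent with these definitions (an assumed reconstruction of Equation (13)) is

$f(X)=sgn\left(\sum _{i=1}^{l}a_{i} y_{i}\, k(x_{i} ,X)+b\right)$

where $x_{i} $ and $y_{i} $ are the training samples and their class labels.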
The merit of the ANN model is its powerful nonlinear fitting ability. The ANN model can be trained on sample data of known categories by the back propagation gradient descent algorithm, and is then used to judge unclassified data. After classification is completed, the statistics commonly analysed to assess the accuracy of the results include the user accuracy, the overall classification accuracy and the Kappa coefficient. The last of these measures whether the classification result is consistent with the reference image, and is calculated as shown in Equation (14).
In Equation (14), $x_{ii} $ indicates the number of pixels for which the $i$-th category of the classification result agrees with the $i$-th category of the reference image. $x_{i+} =\sum _{j=1}^{n}x_{ij} $ is the total number of pixels assigned to the $i$-th category in the classification result, $n$ is the number of categories, and $N$ is the total number of all samples.
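For reference, the standard Kappa coefficient formula consistent with these definitions (assumed here for Equation (14), with $x_{+i} =\sum _{j=1}^{n}x_{ji} $ denoting the corresponding column total of the confusion matrix) is

$\kappa =\dfrac{N\sum _{i=1}^{n}x_{ii} -\sum _{i=1}^{n}x_{i+} x_{+i} }{N^{2} -\sum _{i=1}^{n}x_{i+} x_{+i} }$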
The flow of the finetuned AlexNet CNN model, as applied to the land cover classification study, is shown in Fig. 3.
Fig. 3. AlexNet model finetune land cover classification flow chart.
3.2 Construction of the Land Cover Classification Model Based on LCNet-27 and LCNet-13
The LCNet CNN contains LCNet-27 and LCNet-13. The overall process of land cover classification
research using the LCNet CNN model is shown in Fig. 4, which includes five stages: sample data preparation, model training, optimal sample
size selection, comparison of LCNet-27 and LCNet-13, and comparison with traditional
methods.
Fig. 4. LCNet model finetune land cover classification flow chart.
The sample data preparation phase uses different pixel sizes for sample data of different resolutions and normalises them to a standard input size. The model training phase adds labels to the acquired sample data to form the training data. In the optimal sample size selection phase, the trained model with the highest accuracy is selected as the best model; the classification results obtained from models trained with samples of different sizes are evaluated to determine the most suitable sample size for the model. In the LCNet-27 and LCNet-13 comparison phase, the best sample size is used as input, the best of the trained models is obtained, classification experiments are performed, and the accuracy of the obtained classification results is evaluated. In the traditional method comparison phase, the best sample size and the best model are used for training, and the obtained model is compared with the traditional methods.
A model plot for LCNet-27 is shown in Fig. 5.
Fig. 5. LCNet-27 model structure diagram.
As shown in Fig. 5, the LCNet-27 model is a CNN containing CLs, pooling layers and fully connected layers, numbering 3, 3 and 2 respectively. From the CLs to the fully connected and Softmax layers, the LCNet-27 model reduces the training sample size from $27 \times 27$ to $6\times6$. A model diagram for LCNet-13 is shown in Fig. 6.
Fig. 6. LCNet-13 model structure diagram.
As shown in Fig. 6, the LCNet-13 model is a CNN containing CLs and fully connected layers, two of each. The LCNet-13 model takes sample data with a size of $13 \times 13$ as input.
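As a purely illustrative structural sketch (the paper does not give channel counts or kernel sizes, so all such values below are assumptions rather than the actual LCNet configuration), an LCNet-27-like network with three CLs, three pooling layers and two fully connected layers could be written in PyTorch as:

import torch.nn as nn

class LCNet27(nn.Module):
    # Illustrative LCNet-27-like structure: 3 CLs, 3 pooling layers and
    # 2 fully connected layers for a 27x27 input patch; channel counts,
    # kernel sizes and the number of classes are assumptions.
    def __init__(self, in_channels=3, num_classes=6):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                       # 27x27 -> 13x13
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                       # 13x13 -> 6x6
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(3, stride=1, padding=1),  # size-preserving, stays 6x6
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 6 * 6, 256), nn.ReLU(),
            nn.Linear(256, num_classes),           # SoftMax applied in the loss
        )

    def forward(self, x):
        return self.classifier(self.features(x))

An LCNet-13-like variant would analogously use two CLs and two fully connected layers on a $13 \times 13$ input.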
In summary, in order to find a class of algorithms that can minimize or maximize the parameter values of the objective function, the objective function optimization tool used in this study is the LCNet model, which performs end-to-end joint optimization of the network parameters under bit rate, signal distortion and semantic misalignment constraints. The different designs of LCNet make it computationally efficient while maintaining high classification accuracy, which suits the practical application scenario of land cover classification explored in the study.