Mobile QR Code




Mobile computing, Incomplete image, Automatic labeling, Similarity measure

1. Introduction

With the advent of the information age, the Internet has penetrated everyone's life, and the rapid growth of multimedia database applications has sharply increased the scale of databases. Advances in information technology have brought corresponding progress to digital image processing of large volumes of data. Image processing is now widely used in human life, especially in application environments such as remote sensing [1], crop detection [2,3], medicine and meteorology [4], and electric power systems [5]. As the process of describing things and their characteristics becomes more complicated and diverse, image information greatly helps users. It can be said that increasingly innovative image processing technology is an important part of modern life.

Graphic information is stored on a computer by converting it into digital information, after which the user can operate the computer to manipulate the digitized image [6]. Therefore, the key to image processing is the computing power of the computer, which is often limited by its software and hardware. In the past, because of this limitation, a user could only realize the preprocessing of an image on a computer. However, with the development of personal computers, digital image processing technology has changed: image processing has become more diversified, its accuracy is higher, and the reproducibility of image processing algorithms is better. Therefore, after continuous development, today's graphics processing technology has been able to move beyond preprocessing toward image understanding performed by the computer, and image understanding and computer vision have become new challenges in image processing.

The importance of image processing has been constantly presented to the world, so a worldwide upsurge is emerging, and talent and resources are devoted to it in every corner of the world. Faced with a large number of digital images, how to effectively organize and manage these data to meet the needs of users has become an urgent and meaningful research topic. Image retrieval [7] provides an effective solution to these problems to some extent. In the early days, people added text annotations to images by hand and then performed image retrieval. When the size of an image database is small, the accuracy of manual annotation is high, and the annotation process is relatively simple. With the growth of digital images, traditional manual labeling has shown great limitations: (1) when the size of the image dataset is large, manual labeling needs much time and labor, and (2) for the same digital image, different people have different understandings, which introduces great subjectivity. Therefore, automatic image annotation algorithms have received extensive attention.

Image captioning [8,9] is a comprehensive problem combining computer vision, natural language processing, and machine learning, and it requires generating a paragraph of text describing a given image. The task is remarkably simple for humans, but it is a daunting challenge for computers. The model not only needs to be able to extract the features of the image but also to identify the objects in the image and finally use natural language to express the relationships between them. As two-dimensional data, an image has abundant spatial distribution information, including the spatial relationships between the objects contained in the image and the spatial structure of each object itself, which is of great significance for image retrieval and classification. However, in the real world, in addition to complete images, there are a large number of images that are damaged or intentionally obscured.

To some extent, the incompleteness of image information brings inconvenience and challenges to the interpretation and understanding of images. Therefore, it is necessary and important to find an effective method to label incomplete images, and many scholars have carried out in-depth discussion and research on image annotation. Srinivas discussed the importance of automatic and intelligent image annotation in view of the fact that manual image annotation takes much time, and analyzed the research results of automatic image annotation in the last decade to help remedy the defects of existing automatic image annotation methods [10].

Zhang et al. proposed an image region annotation framework based on the syntactic and semantic correlation between image segmentation regions. The results show that image annotation using this method performs well on the Corel 5K dataset, and the annotation accuracy is high [11]. Mehmood et al. designed a support vector machine based on the weighted average of a triangular histogram and applied the improved support vector machine to image retrieval and automatic semantic annotation. Qualitative and quantitative analysis of three image benchmarks confirmed the effectiveness of this method [12].

However, existing intelligent image annotation methods still have shortcomings, such as low efficiency and low precision. Therefore, by combining the scale-invariant feature transform (SIFT) algorithm with image region selection and similarity transfer, an automatic image annotation algorithm for a mobile computing environment is proposed. The proposed automatic image annotation algorithm can achieve intelligent image annotation efficiently and accurately, which plays an important role in image processing, image recognition, and other fields.

2. Overall Design of Automatic Marking of Incomplete Images in Mobile Computing Environment

The main steps of automatic labeling are image preprocessing, image feature extraction, image feature similarity measurement, model training, automatic labeling, and so on.

(1) Image preprocessing: In image annotation, the quality of an image will directly affect the effect of the annotation algorithm, so it is necessary to perform image preprocessing before extracting image features. Image preprocessing mainly removes the useless noise information in the image, strengthens the useful information, and enhances the robustness of image annotation.

(2) Image feature extraction: An image feature is a unique property of a certain type of image, and feature extraction realizes the quantitative expression of these image properties by programming with some mathematical means. Good features can often greatly improve the effect of image annotation.

(3) Similarity measurement of image features: The image to be marked has a defect, and the defective part cannot help in understanding the content information of the image. Therefore, the defective part is not taken into account in the similarity measurement, and the similarity of the overall content information of the image is determined by the similarity of the information contained in the visible part of the image. The missing part is selected with a rectangular selection box tangent to the edge of the missing region, and the image is divided proportionally along the extended edges of the rectangular selection box. If the image contains more information, the image is divided further as appropriate. The aim of image segmentation is to obtain meaningful partitions from the input image [10], which is basic work in the field of image processing and is also an important step for subsequent image processing and analysis. Therefore, before the low-level feature extraction of the image, in order to improve the annotation effect, all images are uniformly segmented in this form.

(4) Model training: Model training is the core part of the image labeling algorithm. Whatever kind of image labeling method is used, after obtaining good image feature expression, it constructs its own model and then learns an image feature to find the relevance between an image and semantic keywords.

(5) Automatic labeling: Automatic labeling of test images is done for different data or application scenarios.

3. Design of Automatic Labeling Algorithm for Incomplete Image

3.1 Image Preprocessing

According to the characteristics of a defective image, the defective part of the image cannot help us understand the content of the image. Therefore, when we measure the similarity of the image, we do not consider the defective part, and the similarity of the overall content of the image is determined by the similarity of the information contained in the visible part of the image. Image segmentation is basic work in the field of image processing and is also an important step in subsequent image processing and analysis. Thus, before extracting the low-level features of the image, in order to improve the effect of image annotation, all the images to be annotated are divided into regions and segmented uniformly.

In region division and image segmentation, the method of region selection was used. The aim of region selection is to select several regions from the whole image in a certain way, describe the content of the image based on these basic regions, and better mine the information of different objects in the image [12]. Image region selection methods are mainly divided into fixed division, image segmentation, and salient points. Among these, image segmentation is the most effective and longest-established method of region selection. Image segmentation aims to divide the image into regions corresponding to several objects so that each region corresponds to one object.
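As a concrete illustration of segmentation-based region selection, the following sketch partitions an image into object-like regions with the SLIC superpixel algorithm from scikit-image and collects the pixel set of each region. SLIC, the example image, and the parameter values are illustrative assumptions; the paper does not prescribe a specific segmentation algorithm here.

# Sketch: segmentation-based region selection using SLIC superpixels (scikit-image).
# SLIC and the parameter values are illustrative choices, not the paper's exact method.
import numpy as np
from skimage import data
from skimage.segmentation import slic

image = data.astronaut()                        # example RGB image, shape (H, W, 3)
labels = slic(image, n_segments=50, compactness=10, start_label=0)

# Collect the pixel coordinates belonging to each region.
regions = {k: np.argwhere(labels == k) for k in np.unique(labels)}
print(f"{len(regions)} regions; region 0 covers {len(regions[0])} pixels")

Each region can then be described and compared independently, which is the basis of the covariance representation introduced below.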

Image segmentation is basic research content in the field of computer vision. The regions after image segmentation are often irregular regions, and a regional covariance description method can be used to represent each region. Let I be a one-dimensional gray or three-dimensional color image and F be a W × H × d feature image extracted from I:

(1)
$ F\left(x,y\right)=\phi \left(I,x,y\right) $

$\left(x,y\right)$ is the coordinates of a feature point, W is the width of the image, H is the height of the image, d is the number of extracted features, and $\phi \left(\cdot \right)$ can be any mapping function, such as the image gray value, color, gradient, or filter response. For a given region R, let $z_{i}$ be the d-dimensional feature points inside R. Region R is represented by the covariance of the feature points:

(2)
$ C_{R}=\frac{1}{n-1}\sum _{i=1}^{n}\left(z_{i}-\mu \right)\left(z_{i}-\mu \right)^{T} $

$\mu =\frac{1}{n}\sum _{i=1}^{n}z_{i}$ is the mean of all feature points, and n is the number of pixels in the region.

Firstly, an image is segmented into different regions so that each region corresponds to an object. Then, each region after image segmentation is described using covariance. The difference from the original covariance description is that the region corresponding to the original covariance is a regular rectangular region, while the region corresponding to the covariance matrix here is an irregular region. Regular regions usually contain multiple objects, whereas a region after image segmentation usually corresponds to a single semantic object. Covariance description is a regional representation method based on the distribution of the region's feature points, and it is independent of the size of the region. For the same semantic target, the corresponding feature distributions will be similar, and so will the covariances. Representing the segmented regions by covariance therefore makes it possible to distinguish regions belonging to different objects.
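A minimal numerical sketch of the covariance descriptor in Eqs. (1) and (2) is given below. The per-pixel feature map φ used here (coordinates, intensity, and gradient magnitudes) is an assumed example; any pixel-wise mapping can be substituted, and the region may be an arbitrary, irregular set of pixels.

# Sketch of the region covariance descriptor, Eqs. (1)-(2).
# The feature map phi = (x, y, intensity, |dI/dx|, |dI/dy|) is an assumed example.
import numpy as np

def region_covariance(gray, region_mask):
    """gray: (H, W) float image; region_mask: (H, W) boolean mask of region R."""
    gy, gx = np.gradient(gray)                      # simple gradient features
    ys, xs = np.nonzero(region_mask)                # pixel coordinates inside R
    # d-dimensional feature points z_i = phi(I, x, y), Eq. (1)
    Z = np.stack([xs, ys, gray[ys, xs], np.abs(gx[ys, xs]), np.abs(gy[ys, xs])], axis=1)
    mu = Z.mean(axis=0)                             # mean feature point
    diff = Z - mu
    # C_R = 1/(n-1) * sum_i (z_i - mu)(z_i - mu)^T, Eq. (2)
    return diff.T @ diff / (len(Z) - 1)

# Usage on a synthetic image with a circular (irregular) region.
H, W = 64, 64
yy, xx = np.mgrid[0:H, 0:W]
img = np.sin(xx / 8.0) + 0.5 * np.cos(yy / 5.0)
mask = (xx - 32) ** 2 + (yy - 32) ** 2 < 15 ** 2
C = region_covariance(img, mask)
print(C.shape)    # (5, 5) descriptor, independent of the region's size

The resulting d × d matrix has the same size for regions of any shape or area, which is what makes the covariance representation convenient for comparing irregular segments.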

3.2 Image Feature Extraction

3.2.1 Pre-extraction of Image Features

SIFT is a feature description method used in the field of image processing. It is scale-invariant and works by detecting key points in the image. Therefore, SIFT is used to extract image features.

SIFT feature detection can be summarized in four basic steps. First, the extrema of the scale space are detected, and the positions of all images in different scale spaces are calculated; potential feature positions that are invariant to scale and rotation can be identified using Gaussian differences. Then, the candidate feature points are fitted to determine their location and scale, and the information at each location is compared; the higher the stability is, the better the selection of feature points will be. Next, the dominant direction is calculated by comparing gradient directions based on the image information.

These two steps are repeated to ensure that the algorithm achieves a relatively high degree of scale invariance. In the last stage, we compute the descriptor of each feature point, which is obtained from the gradients in the adjacent region around the feature point. The advantage of the gradient-based descriptor is that it captures the change information of the measured position more strongly, allowing for larger local shape deformation and illumination change.
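As a hedged illustration of the detection and description steps summarized above, the following sketch uses OpenCV's SIFT implementation. The contrast and edge thresholds discussed in the next paragraph correspond, in this implementation, to the contrastThreshold and edgeThreshold parameters; all parameter values shown are examples rather than the settings used in the paper.

# Sketch: SIFT keypoint detection and description with OpenCV (illustrative settings).
import cv2
import numpy as np

# Any 8-bit grayscale image works; a synthetic one keeps the example self-contained.
img = np.random.default_rng(0).integers(0, 256, (256, 256), dtype=np.uint8)

sift = cv2.SIFT_create(
    nfeatures=500,            # cap on the number of returned keypoints
    contrastThreshold=0.04,   # larger value -> fewer (weak) keypoints returned
    edgeThreshold=10,         # larger value -> fewer edge-like keypoints filtered out
    sigma=1.6,                # base smoothing, as noted in Section 3.2.1
)
keypoints, descriptors = sift.detectAndCompute(img, None)
print(len(keypoints), None if descriptors is None else descriptors.shape)  # descriptors: (n, 128)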

According to the data structure and data type selected in this paper, we can choose the optimal number of feature points returned by the algorithm. An absolute threshold is used to filter out poor feature points: the larger this threshold is, the smaller the number of returned feature points is. We also use a threshold to filter out edge responses: the larger this threshold is, the fewer feature points are filtered out. A Gaussian pyramid is a concept put forward in the scale-invariant feature transform. A Gaussian pyramid is composed of many groups (octaves), and each group contains several layers. The Gaussian pyramid is constructed by doubling the original image to form the 1st layer of Group 1 and smoothing the 1st layer of Group 1 to form the 2nd layer of Group 1. The Gaussian smoothing function is as follows:

(3)
$ G\left(r\right)=\frac{e^{-r^{2}/\left(2\sigma ^{2}\right)}}{\left(\sqrt{2\pi \sigma ^{2}}\right)^{N}} $

$\sigma $ is the standard deviation of the normal distribution; the larger the standard deviation is, the more blurred the image is. The blur radius r refers to the distance from the target point to the center, and N is the dimensionality. The Gaussian function in 2D space is:

(4)
$ G\left(x,y\right)=\frac{1}{2\pi \sigma ^{2}}e^{-\frac{\left(x-x_{0}\right)^{2}+\left(y-y_{0}\right)^{2}}{2\sigma ^{2}}} $

For parameter $\sigma $, a fixed value of 1.6 is used in SIFT. $\sigma $ is multiplied by a scale factor k to obtain a new smoothing factor, which is used to smooth the 2nd-layer image of Group 1, and the resulting image is taken as the 3rd layer. Repeating this process gives the L layers:

(5)
$ L=\log _{2}\left\{\min \left(W,H\right)\right\}-t $

Generally, the number of layers L is determined by the size of the image, and t is the logarithm of the dimension of the topmost image in the pyramid. In the same group, all layers have the same size, but their smoothing coefficients are different; the corresponding smoothing coefficients are $0,\sigma ,k\sigma ,k^{2}\sigma ,k^{3}\sigma ,\cdots ,k^{L-2}\sigma $.

The third layer from the top of Group 1 is downsampled by a factor of 2, and the resulting image is taken as the 1st layer of Group 2. Then, Gaussian smoothing is performed on the 1st layer of Group 2 to obtain the 2nd layer of Group 2. Proceeding as before, the L layers of Group 2 are obtained; layers in the same group have the same size, and the corresponding smoothing coefficients are $0,\sigma ,k\sigma ,k^{2}\sigma ,k^{3}\sigma ,\cdots ,k^{L-2}\sigma $, but Group 2 is half the size of Group 1.
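The following sketch builds such a Gaussian pyramid: each group contains progressively smoothed layers with coefficients σ, kσ, k²σ, and so on, and the next group starts from a 2× downsampled copy of the third layer from the top of the previous group. The use of scipy's gaussian_filter, the choice of k, and the omission of the initial 2× upsampling are simplifying assumptions for illustration.

# Sketch of the Gaussian pyramid of Section 3.2.1 (sigma0 = 1.6; k is an assumed value;
# the initial 2x upsampling of the original image is omitted for brevity).
import numpy as np
from scipy.ndimage import gaussian_filter

def gaussian_pyramid(image, layers_per_group=6, sigma0=1.6, t=3):
    k = 2 ** (1.0 / (layers_per_group - 3))
    # Number of groups; Eq. (5), L = log2(min(W, H)) - t, is used here for this count (assumed reading).
    n_groups = max(int(np.log2(min(image.shape))) - t, 1)
    pyramid, base = [], image.astype(float)
    for _ in range(n_groups):
        group = [base]
        for i in range(1, layers_per_group):
            # Each new layer smooths the previous one with the next coefficient sigma0 * k^(i-1).
            group.append(gaussian_filter(group[-1], sigma0 * k ** (i - 1)))
        pyramid.append(group)
        base = group[-3][::2, ::2]    # third layer from the top, downsampled by 2, seeds the next group
    return pyramid

pyr = gaussian_pyramid(np.random.rand(256, 256))
print(len(pyr), [g[0].shape for g in pyr])   # group count and per-group image sizes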

3.2.2 Image Feature Extraction

A Kalman filter analysis model of controllable direction of the incomplete image was constructed by using the method of region merging, and the sparse feature points of the incomplete image were described by $I(i,j)$ with the method of region equivalent histogram analysis. The target template $I_{\left(k\right)}(i,j)$ is as follows:

(6)
$ I\left(i,j\right)=\sum _{k=1}^{P}I_{\left(k\right)}\left(i,j\right)\times 2^{k-1} $

k represents the equivalent area control coefficient of the regional equivalent histogram, and the output of the layered feature extraction of the controllable direction of the incomplete image is:

(7)
$ W_{i,j}=\left\{\begin{array}{ll} 0 & \left| W_{i,j}\right| \leq \lambda \\ \mathrm{sgn}\left(W_{i,j}\right)\left(\left| W_{i,j}\right| -\lambda \right) & \left| W_{i,j}\right| >\lambda \end{array}\right. $

When $W_{i,j}=\left\{\begin{array}{ll} 0 & \left| W_{i,j}\right| \leq \lambda \\ W_{i,j}-\beta (W_{i,j}-\mu ) & \left| W_{i,j}\right| >\lambda \end{array}\right.$ is used, the histogram of each window is weighted. Combining this with the method of texture recognition, the difference functions of template matching are as follows:

(8)
$ \begin{align} y_{i}&=\sum _{k}h_{k}x_{2i-k} \\ \end{align} $
(9)
$ \begin{align} z_{i}&=\sum _{k}g_{k}x_{2i-k} \end{align} $

$h_{k}$ and $g_{k}$ represent the image fusion and filtering coefficients, $x_{i}$ is the linear distribution sequence of the original image texture, $y_{i}$ and $z_{i}$ are the fusion coefficients of the image features, and the current processing area $R_{i}$ is $A_{i}$. The gray-level distribution sequence of the incomplete image is reconstructed as:

(10)
$ x_{i}=\sum _{k}\left(\overline{h}_{i-2k}\,y_{k}+\overline{g}_{i-2k}\,z_{k}\right) $

Based on the histogram analysis of the matching template and linear tracking recognition, the iterative function of feature extraction output of the incomplete image is:

(11)
$ \begin{align} f_{X}(x)&=\frac{1}{\sqrt{2\pi }\sigma _{x}}e^{\frac{-(x-\mu _{x})^{2}}{2\sigma _{x}^{2}}} \\ \end{align} $
(12)
$ \begin{align} f_{\eta }(\eta )&=\frac{1}{\sqrt{2\pi }\sigma _{\eta }}e^{\frac{-(\eta -\mu _{\eta })^{2}}{2\sigma _{\eta }^{2}}} \end{align} $

Under the optimal feature matching, the edge feature extraction output of the incomplete image is as follows:

(13)
$ \begin{array}{c} E_{ext}\left(\varphi \right)=\lambda L_{g}\left(\varphi \right)+v\left(I\right)A_{g}\left(\varphi \right)\\ =\lambda \int _{\Omega }g\left(\nabla I\right)\sigma \left(\varphi \right)\left| \nabla \varphi \right| dxdy\\ +v\int _{\Omega }g\left(\nabla I\right)H\left(-\varphi \right)dxdy \end{array} $

The edge information weighting coefficients $\lambda $ and $\nu $ are constants, and $\lambda >0$. The distribution area of the feature extraction of the incomplete image is:

(14)
$ v\left(I\right)=c\cdot \mathrm{sgn}\left(\Delta G_{\sigma }\times I\right)\left(\left| \nabla G_{\sigma }\times I\right| \right) $

$I(x,y)$ is the gray histogram of the incomplete image, sgn(·) is the sign function, and $G_{\sigma }$ is the error coefficient. The directional histogram fusion algorithm is adopted to realize the feature extraction of the incomplete image.
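For concreteness, the coefficient thresholding rule of Eq. (7), which zeroes out small coefficients and shrinks the remaining ones toward zero, can be written as a short vectorized function. This is a plain soft-thresholding sketch and does not reproduce the full directional histogram fusion pipeline described above.

# Sketch of the thresholding rule in Eq. (7): coefficients with |W| <= lambda are set
# to zero, and larger coefficients are shrunk toward zero by lambda.
import numpy as np

def soft_threshold(W, lam):
    W = np.asarray(W, dtype=float)
    return np.where(np.abs(W) <= lam, 0.0, np.sign(W) * (np.abs(W) - lam))

coeffs = np.array([-2.0, -0.3, 0.1, 0.8, 3.5])
print(soft_threshold(coeffs, lam=0.5))   # [-1.5  0.   0.   0.3  3. ]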

3.3 Similarity Measure of Image Features

The similarity of the image to be labeled is measured against the set of annotated segmented sub-blocks. That is, the segmented sub-blocks $I=\left\{I_{1},I_{2},\cdots ,I_{s}\right\}$ of the image are used to obtain the nearest neighbors $I_{i}=\left\{I_{1}^{i},I_{2}^{i},\cdots ,I_{s}^{i}\right\}$ among the annotated sub-blocks. Each annotated image segmentation block is represented as $K$, and all annotated image segmentation blocks in the training set are represented by the matrix $I'=\left\{I_{1},I_{2},\cdots ,I_{i}\right\}\in R^{1\times si}$. Because each part of the image carries certain spatial information, each block of the image to be labeled is localized within the training set; that is, only the subset $I_{1}^{i}$ of the training set is taken into account when the nearest neighbor of $I_{1}$ is obtained. The low-level eigenvectors extracted from two partitioned blocks a and b are represented as $F_{a}\left(hsv_{1}^{a},hsv_{2}^{a},\cdots ,hsv_{256}^{a},tex_{1}^{a},tex_{2}^{a},\cdots ,tex_{t}^{a}\right)$ and $F_{b}\left(hsv_{1}^{b},hsv_{2}^{b},\cdots ,hsv_{256}^{b},tex_{1}^{b},tex_{2}^{b},\cdots ,tex_{t}^{b}\right)$, respectively. The distance between the two partitioned blocks is:

(15)
$ \begin{array}{l} d\left(F_{a},F_{b}\right)=\alpha \sum _{i=1}^{256}\left(hsv_{i}^{a}-hsv_{i}^{b}\right)^{2}+\beta \sum _{j=1}^{t}\left(tex_{j}^{a}-tex_{j}^{b}\right)^{2}\\ \alpha +\beta =1,\quad 0\leq \alpha ,\beta \leq 1 \end{array} $

Then, we can find the distance vector $D_{s}=\left\{d_{s1},d_{s2},\cdots ,d_{sk}\right\}$ between $K$ and $I_{s}$, with $d_{s1}<d_{s2}<\cdots <d_{sk}$; $d_{s1}$ is the smallest distance, i.e., the distance between the two most similar blocks. The matrix $D=\left[D_{1},D_{2},\cdots ,D_{s}\right]\in R^{1\times sk}$ contains the distances between all partition blocks of the incomplete image to be marked and their corresponding neighborhood blocks [13,14]. We define the similarity measure between sub-blocks as:

(16)
$ w_{ab}=\frac{\exp \left(-\frac{d_{ab}^{2}}{2}\right)}{\sqrt{2\pi }} $

$d_{ab}$ denotes the distance between partition block a of the incomplete image and its nearest neighbor block b. The closer the two blocks are, the greater the similarity measure is. The similarity measure between segmented sub-block $I_{s}$ of the incomplete image to be labeled and its corresponding neighborhood segmented sub-blocks is $W_{s}\left(w_{s1},w_{s2},\cdots ,w_{sk}\right)$, and the similarity measure matrix of $I$ is $W=\left[W_{1},W_{2},\cdots ,W_{s}\right]\in R^{1\times sk}$.
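The distance of Eq. (15), a weighted sum of squared HSV-histogram and texture-feature differences, and the Gaussian similarity of Eq. (16) translate directly into code. In the sketch below, the number of texture features and the weights α and β are illustrative values.

# Sketch of the sub-block distance (Eq. 15) and similarity measure (Eq. 16).
# Feature vectors concatenate a 256-bin HSV histogram with t texture features;
# t = 32 and the weights alpha/beta are assumed values.
import numpy as np

def block_distance(fa, fb, alpha=0.6, beta=0.4, hsv_bins=256):
    """Weighted squared-difference distance between two sub-block feature vectors."""
    d_hsv = np.sum((fa[:hsv_bins] - fb[:hsv_bins]) ** 2)
    d_tex = np.sum((fa[hsv_bins:] - fb[hsv_bins:]) ** 2)
    return alpha * d_hsv + beta * d_tex          # alpha + beta = 1

def block_similarity(d_ab):
    """Gaussian similarity of Eq. (16): smaller distances give larger weights."""
    return np.exp(-d_ab ** 2 / 2.0) / np.sqrt(2.0 * np.pi)

rng = np.random.default_rng(0)
fa, fb = rng.random(256 + 32), rng.random(256 + 32)
d = block_distance(fa, fb)
print(d, block_similarity(d))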

3.4 Training of Models

The training process for the model is shown in Fig. 1. In the training process, the input of the model is the extended training set of the sentence description generated by the model:

(17)
$ \begin{align} b_{v}&=W_{hi}\left[C_{\theta c}\left(I\right)\right] \\ \end{align} $
(18)
$ \begin{align} h_{m}&=f\left(W_{hx}x_{m}+W_{hh}h_{m-1}+b_{h}+\mathrm{I}\left(\mathrm{m}=1\right)\cdot b_{v}\right) \\ \end{align} $
(19)
$ \begin{align} y_{m}&=\mathrm{softmax}\left(W_{oh}h_{m}+b_{o}\right) \end{align} $

Here, $m$ ranges from 1 to $M$, and $C_{\theta c}\left(I\right)$ represents the output of the last layer of the CNN. $h_{m}$ is the output of the hidden layer, $h_{0}$ is initialized as a zero vector, and the input of the neurons in the hidden layer includes the expanded word vector $x_{m}$ and the previous moment's information $h_{m-1}$ (contextual information). However, we only consider the influence of the image information $b_{v}$ in the first step of training; experiments have shown that this is better than adding $b_{v}$ at each step. $x_{1}$ is a specific START vector, $x_{2}$ is the first word in the sentence, $x_{3}$ is the second word, and $x_{M}$ is the last word.

$y_{m}$ is the output of the output layer, indicating the probability of each word in the dictionary and the probability of a terminator. During training, the $y_{1}$ label corresponds to the first word in the sentence, the $y_{2}$ label corresponds to the second word, and the $y_{M}$ label corresponds to a specific END vector. The training of the model is then carried out on this basis.
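A compact numerical sketch of the recurrence in Eqs. (17)-(19) follows: the image feature enters only at the first step through b_v, each hidden state combines the current word vector with the previous hidden state, and the output is a softmax over the vocabulary plus an END token. The dimensions, random initialization, and the choice f = tanh are assumptions for illustration.

# Sketch of the decoding recurrence in Eqs. (17)-(19); sizes and f = tanh are assumed.
import numpy as np

rng = np.random.default_rng(0)
d_img, d_word, d_hid, vocab = 512, 128, 256, 1000    # illustrative sizes

W_hi = rng.normal(0, 0.01, (d_hid, d_img))     # maps the CNN feature to b_v, Eq. (17)
W_hx = rng.normal(0, 0.01, (d_hid, d_word))
W_hh = rng.normal(0, 0.01, (d_hid, d_hid))
W_oh = rng.normal(0, 0.01, (vocab + 1, d_hid)) # +1 output unit for the END token
b_h, b_o = np.zeros(d_hid), np.zeros(vocab + 1)

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def decode_step(x_m, h_prev, b_v, first_step):
    # Eq. (18): the image information b_v is injected only when m = 1.
    h_m = np.tanh(W_hx @ x_m + W_hh @ h_prev + b_h + (b_v if first_step else 0.0))
    y_m = softmax(W_oh @ h_m + b_o)            # Eq. (19)
    return h_m, y_m

cnn_feature = rng.normal(size=d_img)                 # C_theta_c(I), last CNN layer output
b_v = W_hi @ cnn_feature                             # Eq. (17)
h = np.zeros(d_hid)                                  # h_0 initialized to zero
words = [rng.normal(size=d_word) for _ in range(4)]  # x_1 = START vector, then word vectors
for m, x in enumerate(words):
    h, y = decode_step(x, h, b_v, first_step=(m == 0))
print(y.shape, round(float(y.sum()), 6))             # distribution over vocabulary + END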

Fig. 1. Training process for a model.
../../Resources/ieie/IEIESPC.2023.12.3.206/fig1.png

3.5 Image Annotation Algorithm

Regarding the incomplete image, the incomplete part and the display part of image information have relevance <note: ambiguous>. To eliminate the interference of the defective part on the image recognition, research use a certain spatial relationship between the subblocks of the image display part. This kind of spatial relation is more embodied between the partitioned sub-blocks and the partitioned sub-blocks of the image than between the objects and objects in the image. In view of this characteristic, in the process of automatic labeling of incomplete image, the fused spatial information is the proportion of the segmented sub-blocks of the image in the whole image spatial distribution (that is, the spatial structure information).

Table 1. Software and hardware conditions during the experiment.

Hardware
  CPU: Intel® Core™ i7-4790
  Physical memory: 16 GB
  Dominant frequency: 3.60 GHz

Software
  Operating system: CentOS 7
  Development language: Python 2.7
  Corpus preprocessing tool: WikiExtractor
  Word vector training tool: word2vec
  Keyword extraction tool: gensim
  Automatic image annotation evaluation tool: coco-caption

According to the idea of similarity transfer, the similarity between images and the relevance between their tagged words are related: the higher the similarity between images is, the closer their tagged words should be. Therefore, the similarity between images can be transferred to the relevance between their corresponding tagged words. The similarity measure between image blocks is used to transfer the similarity relationship between images in the process of annotation and to transform the similarity from images to annotated words. The similarity measure transfer matrix of image I is defined as $W^{*}$:

(20)
$ W^{*}=f\left(d\right)\cdot W $

In this equation, $f\left(d\right)$ is the transfer function of the similarity measure.

In this paper, all the annotation words corresponding to the partition blocks $I_{i}(i=1,2,\ldots ,9)$ of the unknown annotated image I are represented by the annotation word vector $T(t_{1},t_{2},\ldots ,t_{p})$, where p is the total number of annotation words and repeated annotation words are counted. For the nearest-neighbor block j of sub-block i of image I, each annotation word that appears in block j is marked as 1 in T, and each word that does not appear is marked as 0, yielding $T'$. We multiply $T'$ by the corresponding similarity measure transfer value $w_{ij}^{*}$ to obtain the similarity measure transfer vector of the annotation words:

(21)
$ M_{ij}=T'\times w_{ij}^{*} $

For the entire image,

(22)
$ M^{*}=\sum _{i=1}^{9}\sum _{j=1}^{k}M_{ij} $

$M^{*}$ is the similarity measure transfer vector corresponding to the annotation words of the image. To keep the values on a consistent scale, $M^{*}$ is normalized; then a threshold is set according to the actual situation, and the annotation words above the threshold are retained. Thus, the automatic labeling of the incomplete image is realized. During annotation, we set the iterative step $r_{t}=r_{0}/(1+0.001\eta t)$ with $\eta =10\mathrm{e}{-}5$. The sub-concept parameter of the label, K = 5, should not be too large; when the K value is too large, the training time of the algorithm increases. The overall algorithm flow is:

(1) Training process:

Input: training dataset $\left\{\left(B_{1},Y_{1}\right),\left(B_{2},Y_{2}\right),\cdots ,\left(B_{M},Y_{M}\right)\right\}$ and parameters $K$ and $\gamma _{t}$. Output: $u_{lk},V_{lk}\left(l=1,2,\cdots ,L+1;k=1,2,\cdots ,K\right)$.

1) Training.

2) Initialize $u_{lk}$ and $V_{lk}\left(l=1,2,\cdots ,L+1;k=1,2,\cdots ,K\right)$.

3) Circular execution.

The training images to be labeled are divided into regions and segmented, and the segmented image features are extracted to measure the similarity of the images.

(a) A package B and one of its associated tags is randomly selected.

(b) We obtain the key example and its corresponding sub-concept: $\left(X,k\right)=\arg \max _{X\in B,k\in \left\{1,\cdots ,K\right\}}f_{yk}\left(X\right)$.

(c) If $y$ is not a virtual label $\hat{y}$, then $\overline{Y}=\overline{Y}\cup \left\{\hat{y}\right\}$.

(d) Circular implementation: $i=1\colon \left| \overline{Y}\right| $.

(e) Random selection of an unrelated tag $\overline{y}$ from $\overline{Y}$ and selection of the key example $\overline{X}$ and its corresponding sub-concept $\overline{k}\colon \left(\overline{X},\overline{k}\right)=\arg \max _{\overline{X}\in B,\overline{k}\in \left\{1,\cdots ,K\right\}}f_{\overline{y}\overline{k}}\left(\overline{X}\right)$.

(f) If $f_{y}\left(X\right)-1<f_{\overline{y}}\left(X\right)$.

(g) Order $v=i$.

(h) Updating and standardizing $u_{yk},v_{yk},u_{\overline{y}\overline{k}},v_{\overline{y}\overline{k}}$.

4) Ending the cycle: The cycle ends when the termination conditions are met.

(2) Testing stage:

The associated label set for the test package $B_{test}$ is: $\left\{l\,\middle|\,1+f_{l}\left(B_{test}\right)>f_{\hat{y}}\left(B_{test}\right)\right\}$.
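To make the annotation-word transfer of Eqs. (20)-(22) and the final thresholding step concrete, the following sketch accumulates, for every sub-block and each of its neighbor blocks, a 0/1 indicator vector of the neighbor's annotation words weighted by the transferred similarity, then normalizes the result and keeps the words above a threshold. The transfer function f(d) = exp(-d) and the threshold value are assumptions; the paper leaves both to be set according to the actual situation.

# Sketch of similarity-transfer annotation, Eqs. (20)-(22), with an assumed f(d) = exp(-d).
import numpy as np

def annotate(neighbor_words, similarities, distances, vocab_size, threshold=0.3):
    """neighbor_words[i][j]: word indices annotating the j-th neighbor of sub-block i."""
    M_star = np.zeros(vocab_size)
    for i, blocks in enumerate(neighbor_words):
        for j, words in enumerate(blocks):
            T = np.zeros(vocab_size)
            T[words] = 1.0                                          # 0/1 annotation indicator vector T'
            w_star = np.exp(-distances[i][j]) * similarities[i][j]  # Eq. (20), transferred similarity
            M_star += T * w_star                                    # Eqs. (21)-(22)
    if M_star.max() > 0:
        M_star /= M_star.max()                                      # normalize to a consistent scale
    return np.nonzero(M_star >= threshold)[0]                       # keep words above the threshold

# Toy example: 2 sub-blocks, 2 neighbors each, vocabulary of 6 candidate words.
words = [[[0, 2], [2, 3]], [[2], [4]]]
sims = [[0.9, 0.5], [0.8, 0.2]]
dists = [[0.1, 0.4], [0.2, 1.5]]
print(annotate(words, sims, dists, vocab_size=6))   # indices of the retained annotation words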

4. Experimental Verification

4.1 Experimental Conditions

The software and hardware environment during the experiment is shown in Table 1.

Automatic image annotation datasets are mainly divided into a training set, a validation set, and a test set.

(1) Training set

The image auto-tagging training set is a Corel5K dataset (https://dumps.wikimedia.org/enwiki/latest/) containing 82,000 images, each with varying degrees of incompleteness and five descriptive sentences generated through Amazon's Mechanical Turk service.

(2) Validation set

For the Corel5K dataset, 500 images from the training set were selected as the validation set.

(3) Test set

The test set used for image annotation in this paper was 2500 images selected from the training set.

4.2 Experimental Results

4.2.1 Partial Image Annotation Results

Some of the experimental results of the designed automatic labeling algorithm are shown in Fig. 2. It can be seen from Fig. 2 that the proposed algorithm can label incomplete images effectively, regardless of whether the image is cropped, covered by foreign objects, or blurred over a large area. This is because the algorithm designed in this study divides the image into regions, which is conducive to the extraction of image features and improves the accuracy of image annotation after the features are extracted.

Fig. 2. Partial labeling results.
../../Resources/ieie/IEIESPC.2023.12.3.206/fig2.png

4.2.2 Recall Ratio

In the Corel5K dataset, 3000 different types of images were randomly selected as subjects. By comparing the completeness of image labeling between this method and other methods [7,8], the advantages of this method were verified. The results are shown in Fig. 3. From Fig. 3, we can see that the labels produced for the incomplete images are relatively complete, and the recall rate is 97%, which is obviously higher than the other two methods, so our method can effectively identify the image and has strong practicability.

Fig. 3. Label search results.
../../Resources/ieie/IEIESPC.2023.12.3.206/fig3.png

4.2.3 Precision Rate

The standard deviation value $\sigma $ of the normal distribution in formula (4) is obtained by iteration using:

(23)
$ \sigma =\frac{H}{K}\times 100\mathrm{\% } $

$H$ is the number of labels marked, and $K$ is the total number of labels. The smaller the $\sigma $ value is, the clearer the image is, the more accurate the result is, and the more valuable the method is.

Fig. 4 shows that after 300 iterations, the standard deviation of the normal distribution tends to be stable. The $\sigma $ value is 11 when the incomplete image is labeled automatically by this method, whereas the $\sigma $ values are 18 and 19 when the incomplete image is labeled by the methods of [7] and [8], respectively. Compared with the results of the other methods, the precision of the proposed method is higher because it constructs a Gaussian pyramid, which provides the logarithm of the dimension of the image at the top of the pyramid structure, reduces the standard deviation of the normal distribution, and improves the precision.

The peak signal-to-noise ratio (PSNR) was used as an indicator to evaluate the performance of the compared methods, as shown in Fig. 5. As the algorithms iterate, the PSNR of all three increases. At 200 iterations, the PSNR value of the proposed method is 15.76, the PSNR value of the method in [7] is 15.05, and the PSNR value of the method in [8] is 14.15. It can be seen that the PSNR value of the algorithm proposed in this study is obviously superior to those of the other two algorithms, which indicates better performance.

Fig. 6 shows the changes in the F1 values of the method in this paper and the methods in [7] and [8] during the iteration process. It can be seen that as the number of iterations increases, the F1 values of the models also increase. This shows that in the process of iterative training, the models fit the data and the fitting effect improves. When the F1 value increases to a certain extent, it stops increasing, and the F1 curve tends to be stable. This is because after enough iterative training, the model has reached its best performance, and its accuracy can no longer be improved.

It can be seen that the method in this paper reaches its best accuracy at 164 iterations, while the method in [7] needs 172 iterations, which is 8 more than the method in this paper, and the method in [8] needs 214 iterations, which is 50 more than the method in this paper. After reaching the best performance, the F1 value of the method in this paper is 0.95, while the F1 value of the method in [7] is 0.92, which is 0.03 lower, and the F1 value of the method in [8] is 0.89, which is 0.06 lower than the method in this paper.

Fig. 4. Label alignment results.
../../Resources/ieie/IEIESPC.2023.12.3.206/fig4.png
Fig. 5. PSNR value comparison.
../../Resources/ieie/IEIESPC.2023.12.3.206/fig5.png
Fig. 6. F1 value comparison.
../../Resources/ieie/IEIESPC.2023.12.3.206/fig6.png

5. Conclusion

In recent years, with the rapid development of the network, people need more and more information from network images. How to identify an incomplete image and mine more useful information from it has become an urgent problem. Based on an automatic image labeling algorithm, the image was preprocessed by selecting image regions, and the image segmentation sub-block matrix was constructed using SIFT. The experimental results showed that the recall rate of the proposed method reached 97%, the standard deviation of the normal distribution for automatic labeling after 300 iterations was 11, and the PSNR value was 15.76, which were obviously superior to those of the other two methods. This lays a foundation for further study of automatic image annotation. In the course of the experiment, relatively little sample data was used, which may lead to a certain deviation in the experimental results. Therefore, we need to continue to improve the experimental design in the future to obtain clearer and more accurate results.

REFERENCES

1 
L. Reichel, U. O. Ugwu, ``Tensor Krylov subspace methods with an invertible linear transform product applied to image processing,'' Applied Numerical Mathematics, 2021, 166: 186-207.
2 
W. Z. Liang, I. Possignolo, X. Qiao, et al., ``Utilizing digital image processing and two-source energy balance model for the estimation of evapotranspiration of dry edible beans in western Nebraska,'' Irrigation Science, 2021: 1-15.
3 
C. D. Tormena, R. C. S. Campos, G. G. Marcheafave, et al., ``Authentication of carioca common bean cultivars (Phaseolus vulgaris L.) using digital image processing and chemometric tools,'' Food Chemistry, 2021, 364(1): 130349.
4 
T. M. Svahn, R. Gordon, J. C. Ast, et al., ``Comparison of photon-counting and flat-panel digital mammography for the purpose of 3D imaging using a novel image processing method,'' Radiation Protection Dosimetry, 2021(3-4): 3-4.
5 
M. Talaat, M. Tayseer, A. Elzein, ``Digital image processing for physical basis analysis of electrical failure forecasting in XLPE power cables based on field simulation using finite-element method,'' IET Generation, Transmission & Distribution, 2020, 14(26): 6703-6714.
6 
Y. Long, G. S. Xia, S. Li, et al., ``On creating benchmark dataset for aerial image interpretation: Reviews, guidances, and Million-AID,'' IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2021, 14: 4205-4230.
7 
J. Song, ``Binary generative adversarial networks for image retrieval,'' International Journal of Computer Vision, 2020, 128: 2243-2264.
8 
A. G. Khatchatoorian, M. Jamzad, ``A new architecture to improve the accuracy of automatic image annotation systems,'' IET Computer Vision, 2020, 14(5): 214-223.
9 
W. X. Liao, P. He, J. Hao, et al., ``Automatic identification of breast ultrasound image based on supervised block-based region segmentation algorithm and features combination migration deep learning model,'' IEEE Journal of Biomedical and Health Informatics, 2019, 24(4): 984-993.
10 
R. Srinivas, ``An insight on image annotation approaches and their performances,'' Turkish Journal of Computer and Mathematics Education (TURCOMAT), 2021, 12(10): 5902-5910.
11 
J. Zhang, Y. Mu, S. Feng, et al., ``Image region annotation based on segmentation and semantic correlation analysis,'' IET Image Processing, 2018, 12(8): 1331-1337.
12 
Z. Mehmood, T. Mahmood, M. A. Javid, ``Content-based image retrieval and semantic automatic image annotation based on the weighted average of triangular histograms using support vector machine,'' Applied Intelligence, 2018, 48(1): 166-181.
13 
R. Ratnakumar, S. J. Nanda, ``A high speed roller dung beetles clustering algorithm and its architecture for real-time image segmentation,'' Applied Intelligence, 2021, 51(2): 4682-4713.
14 
L. Duan, S. Yang, D. Zhang, ``Multilevel thresholding using an improved cuckoo search algorithm for image segmentation,'' The Journal of Supercomputing, 2021, 77(7): 6734-6753.
15 
C. A. Xu, A. Cl, W. A. Li, et al., ``Diverse data augmentation for learning image segmentation with cross-modality annotations,'' Medical Image Analysis, 2021, 71: 102060.
16 
R. Hashimoto, J. Requa, T. Dao, et al., ``Artificial intelligence using convolutional neural networks for real-time detection of early esophageal neoplasia in Barrett's esophagus (with video),'' Gastrointestinal Endoscopy, 2020, 91(6): 1264-1271.
17 
Y. Chen, L. Liu, J. Tao, et al., ``The image annotation algorithm using convolutional features from intermediate layer of deep learning,'' Multimedia Tools and Applications, 2021, 80(3): 4237-4261.
18 
E. H. Houssein, K. Hussain, L. Abualigah, et al., ``An improved opposition-based marine predators algorithm for global optimization and multilevel thresholding image segmentation,'' Knowledge-Based Systems, 2021(1): 107348.

Qizhenshi Wang

../../Resources/ieie/IEIESPC.2023.12.3.206/au1.png

Qizhenshi Wang obtained his MSE in Software Engineering (2008) from UESTC, Chengdu. Presently, he is working as a professional lecturer in the Department of Animation Art, Zibo Vocational Institute, Zibo. He has been invited by domestic enterprises as a consultant to give various technical speeches on image processing, pattern recognition, and soft computing. As a domestic expert in this field, he has published research articles in well-known international and domestic journals and conference proceedings. His areas of interest include machine learning, image processing, pattern recognition, and information security.