Deep Network Learning based on TF-IDF Text Features for Electric Power Speech Text
Preprocessing Method
Zhao Xin1,*
Huang Changda1
(State Grid Xinjiang Electric Power Co., Ltd Marketing Service Center, Qinyang 454550,
China)
Copyright © The Institute of Electronics and Information Engineers (IEIE)
Keywords
Graph convolutional neural network, Text-based classification, TF-IDF, Electric power equipment, Text data recognition
1. Introduction
The popularization of the Internet has led to the rapid development of areas such
as cloud computing and the Internet of Things (IoT), resulting in an exponential growth
of Internet data. These data cover a variety of forms such as text, audio, video and
pictures, among which text data occupies an important position. Taking power data
sites as an example, the Internet is flooded with a large amount of relevant information
[1,2]. Meanwhile, with the rapid popularization of mobile devices, diversified social platforms such as WeChat and microblogs have emerged one after another. The rapid growth of data brings convenience to people's access to information [3,4], but it also means that people must spend considerable time extracting the part they need from a large amount of information. How to effectively obtain and organize information has therefore become an urgent problem, and data mining [5-7], information retrieval [8-10], and related information-processing methods have rapidly gained importance and development. In the 1950s, H.P. Luhn published a paper [10] that caused a great sensation in the field of text-based classification, pioneering the introduction of word frequency statistics into text-based classification research. In the 1960s, Maron published a paper entitled ``Automatic Indexing: An Experimental Inquiry'' in the Journal of the ACM [11], which had a profound impact on subsequent research on text-based classification for search engines. In 1973, Salton [12] and others first proposed the vector space model (VSM), which uses vectors to represent the feature terms of the text to be processed and gives the text a new representation according to a specific theory; this representation model remained in use for a long time. Until the end of the 1980s, text-based categorization was still dominated by knowledge engineering, and the CONSTRUE [13] system developed by the Carnegie Group was based on this technique. Entering the 21st century, the Internet reached a phase of rapid development, with the data produced by the network growing exponentially every day; people's needs also grew dramatically, and traditional manual classification could no longer keep up and was gradually phased out. Along with the development of artificial intelligence [14], machine learning [15], pattern recognition [16], statistical theory [17], and other disciplines, automatic text-based classification systems have gradually replaced manual classification techniques. Most of these systems are based on machine learning, are far more efficient than human experts, and still maintain a very high level of accuracy. Therefore, machine learning has been actively researched in the field of text-based classification, with methods such as naive Bayes [18], K-nearest neighbor [19], neural networks [20], and support vector machines [21]. With the continuous development of machine learning, deep network learning has also been studied [22]. Hidden textual features in textual data are not easy to uncover and extract with shallow neural networks, whose processing differs greatly from human thought patterns. The purpose of text-based classification is to bring the classification process closer to the human thinking process. Deep network learning is derived from machine learning but focuses more on longitudinal, multi-level data mining and analysis than shallow machine learning. It has a wide range of applications, especially in image processing [23] and speech recognition [24].
It is because of the excellent performance of deep network learning in these respects that it has gradually been applied to text-based classification in recent years. In 2003, the distributed representation of words was used in statistical language modeling by Bengio [25]. In 2008, the concept of word vectors was first proposed by Collobert et al. and later introduced into convolutional neural networks. Google introduced the Word2vec technique in 2013, which has been widely used in the field of text modeling. Word2vec trains each word by filtering out words with very high or very low frequencies in the text, combining the contextual information of the target word and representing it with a low-dimensional vector [26]. It can better represent the relationships between words and express their latent semantic information using low-dimensional vectors. Subsequently, Mikolov et al. [27] disclosed two methods to compute word vectors, CBOW and Skip-Gram, and accomplished efficient training of text sets using these two methods [28]. Under the leadership of Wu Jun [29], an automatic Chinese corpus classification system was developed in the Department of Electronic Engineering at Tsinghua University; based on the corpus correlation coefficient, it used word frequency and a stop word list to remove non-feature words and then performed classification. In 1999, Zou Tao [30] et al. introduced an automatic classification system for Chinese documents at Nanjing University. In 2000, Li Xiaoli and Shi Zhongzhi [31] of the Institute of Computing Technology, Chinese Academy of Sciences (CAS), developed a text-based classification system that reached a high level. Then Fan Yan [32] et al. at CSCU proposed a hypertext coordinated classifier, which used KNN and Bayesian algorithms and handled text similarity effectively.
2. Electricity Speech Text Data Mining with Graph Convolutional Networks
Meanwhile, in research on power services in the context of big data, power grid companies have accumulated massive and diverse power operation data. More than 80% of these data are unstructured, such as audio recordings and text data. The unstructured data mainly come from the customer service systems of power grid companies, and the text data contain customer fault reports, information queries, business processing, and other business needs [33]. Making full use of these text data and deeply understanding the real needs of customers is of great significance for further improving the level of power supply and electricity service and improving the user experience. Data mining technology based on traditional convolutional networks cannot characterize text data, so text mining technology combined with graph convolutional networks came into being. Text mining technology combines computer technology, artificial intelligence algorithms, etc., to extract valuable information from text [34-36]. At present, the applications of text mining in the electric power field mainly include power equipment state perception, fault diagnosis, and system reliability assessment [37-39], but its application in the field of power operation is limited. In this regard, this paper applies graph convolutional networks combined with text mining technology to the information processing of text data in power operations, to realize the text-based classification of power operations, deeply understand the needs of electric power customers, and then improve the service level of the power grid company.
The main purpose of modeling power equipment text data with this graph convolutional network is to extract spatial features in topological space. There are two methods to extract the features: one based on convolution in the spatial domain and the other based on convolution in the frequency domain. In layman's terms, spatial-domain convolution can be compared to convolving directly on the pixels of a picture, while frequency-domain convolution can be compared to taking the Fourier transform of a picture and then convolving. The process of computing the power speech text convolution can be described as follows: the input signal is first decomposed into impulse functions via the signal sampling theorem; the impulse response of the system to each impulse function is then found, and summing these impulse responses gives the zero-state response of the system to the input signal, as expressed by Eqs. (13) and (14) for the spatial-domain convolution:

$$x(t)=\int_{-\infty}^{+\infty}x(\tau)\,\delta(t-\tau)\,d\tau \tag{13}$$

$$y(t)=\int_{-\infty}^{+\infty}x(\tau)\,h(t-\tau)\,d\tau \tag{14}$$
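As a concrete illustration of this impulse-decomposition view, the following minimal Python sketch (with illustrative signal values only, not data from this paper) computes the discrete counterpart of Eq. (14) by summing shifted impulse responses and checks the result against NumPy's built-in convolution.

import numpy as np

# A minimal sketch of the zero-state response described above: the input
# signal is decomposed into weighted impulses, each impulse excites the
# system's impulse response h, and the shifted responses are summed.
x = np.array([1.0, 2.0, 0.5, -1.0])   # sampled input signal x[n] (illustrative)
h = np.array([0.5, 0.25, 0.125])      # system impulse response h[n] (illustrative)

# Direct summation, mirroring y[n] = sum_k x[k] * h[n - k]
y = np.zeros(len(x) + len(h) - 1)
for k, xk in enumerate(x):
    y[k:k + len(h)] += xk * h

# The same result via NumPy's built-in discrete convolution
assert np.allclose(y, np.convolve(x, h))
print(y)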
The basic idea of the preliminary design of the graph convolutional neural network and a concrete representation of its key processes are shown in Fig. 1. By iterating the convolution steps until the desired number of layers is reached, the local output function and the target output function of the graph convolutional neural network are obtained.
In addition, when the graph convolutional neural network is applied to TF-IDF text-based classification, the classification accuracy depends mainly on the input word vectors, and some word vector inputs do not take into account the important correlation information between word items and between words and documents, so the graph convolutional neural network is introduced to solve this problem. In this study, the main steps of text feature extraction first capture text data on the Internet based on data mining. After power speech text preprocessing, the cluttered unstructured text is transformed into structured data, and a combination of supervised and unsupervised learning methods is used to calculate the similarity of text feature values and extract them, so as to determine the optimal text features. However, the number of neighbor nodes of each node is not fixed, and the node features in the graph cannot be extracted directly with a traditional convolution kernel. The most important task is to find the association relationships that exist in the text information and so construct the TF-IDF graph vectors.

According to the above algorithmic model, the algorithmic flow of power operation information processing based on TF-IDF-LSTM is designed. The raw text of the electric power operation is taken as input, and data preprocessing operations such as text cleaning and text segmentation are then carried out. The extraction of text data features is further realized with the TF-IDF algorithm. Finally, the classification and recognition of power operation text are realized by a deep classification model. The convolution process captures the local structural and semantic information in power speech text and helps to understand the intrinsic patterns of the data, while the matrix transformation converts the raw text data into a matrix form suitable for convolution operations, which is crucial for representing node attributes and connectivity relationships. The combination of these two methods improves the model's ability to recognize and classify power operation text data and provides new ideas for solving the lack of effective application of power operation text data. The TF-IDF-based graph convolutional neural network text-based classification process for power equipment topics is shown in Fig. 2; it is mainly divided into two parts: the text feature extraction method of the Labeled-LDA model and the text-based classification model of the graph convolutional neural network.
Step 1: Power speech text preprocessing. Assume three collections, each containing the same $m$ training documents: $D_1 = \{d_1, d_2, \ldots, d_m\}$, $D_2 = \{d_1, d_2, \ldots, d_m\}$, and $D_3 = \{d_1, d_2, \ldots, d_m\}$. The power speech text preprocessing work of word segmentation, de-duplication, etc., is performed on $D_1$, $D_2$, and $D_3$, and each document paragraph is split into single sentences.
Step 2: The TF-IDF output is input into the Labeled-LDA model to obtain the feature matrix of the topic labels, after which the graph vectors of the power speech text are constructed.
Step 3: A graph network structure is constructed according to the electric power recognition method described above and input into the graph convolutional neural network model; after iterative training, a text feature matrix is obtained, and graph recognition and classification are performed.
Step 4: The topic label feature matrix $v_1$ and the text feature matrix $v_2$ are spliced to obtain the multi-source fusion features, after which the local output and the target output are computed. The multi-source fusion features are input into the Softmax classifier to obtain the classification results and, finally, the electric power speech text recognition results.
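As an illustration of Step 4, the following minimal sketch (random placeholder features and weights, not trained values; the names v1 and v2 follow the step above) splices the two feature matrices and applies a Softmax classifier.

import numpy as np

# A minimal sketch of Step 4: v1 is the topic-label feature matrix from the
# Labeled-LDA branch and v2 is the text feature matrix from the graph
# convolutional branch. All values here are illustrative placeholders.
rng = np.random.default_rng(0)
n_docs, d1, d2, n_classes = 8, 6, 16, 6

v1 = rng.random((n_docs, d1))             # topic-label features
v2 = rng.random((n_docs, d2))             # GCN text features
fused = np.concatenate([v1, v2], axis=1)  # multi-source feature splicing

W = rng.normal(scale=0.1, size=(d1 + d2, n_classes))  # classifier weights
logits = fused @ W

# Softmax classifier over the fused features
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)
print(probs.argmax(axis=1))               # predicted class per document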
Among them, the core idea of TF-IDF is that if a word appears at a high frequency in a certain text sample but appears less frequently in the other text samples of the total power speech text corpus, then the word can be considered to have a strong distinguishing ability for that power speech text sample and can be used as a classification label for the text data. Therefore, the TF-IDF algorithm uses the product of the term frequency and the inverse document frequency as the weight. The term frequency is calculated as follows:

$$\mathrm{TF}_{ij} = \frac{n_{ij}}{\sum_{k} n_{kj}}$$

where $n_{ij}$ is the number of occurrences of word $i$ in text $j$, and the summation term is the total number of all words in text $j$. The IDF describes the inverse of the frequency of occurrence of word $i$ in the other texts and is calculated as follows:

$$\mathrm{IDF}_{i} = \log\frac{|D|}{\left|\left\{j\colon i\in j\right\}\right| + 1}$$
where $|D|$ is the total number of power speech text samples and $\left|\left\{j\colon i\in j\right\}\right|$ is the number of texts containing word $i$. To avoid a zero denominator when no power speech text sample contains word $i$, 1 is usually added to the denominator. The specific TF-IDF text feature extraction process is shown in Fig. 3.
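As a concrete illustration of the two formulas above, the following minimal sketch (toy corpus, with the smoothed denominator $|\{j\colon i\in j\}|+1$ as assumed above) computes TF-IDF weights of individual words by hand.

import math
from collections import Counter

# A minimal sketch of the TF-IDF weights defined above. The toy corpus is
# illustrative only, not the paper's power speech text dataset.
corpus = [
    "transformer overheating fault reported by customer",
    "customer asks about electricity bill query",
    "transformer maintenance schedule query",
]
docs = [doc.split() for doc in corpus]
n_docs = len(docs)

def tf_idf(word, doc):
    counts = Counter(doc)
    tf = counts[word] / len(doc)            # TF_ij = n_ij / sum_k n_kj
    df = sum(1 for d in docs if word in d)  # |{j : i in j}|
    idf = math.log(n_docs / (df + 1))       # smoothed IDF_i
    return tf * idf

print(tf_idf("overheating", docs[0]))  # rare word: positive weight
print(tf_idf("customer", docs[0]))     # common word: weight driven to zero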
Fig. 1. Convolutional process and matrix transformation of graph convolutional neural
network.
Fig. 2. TF-IDF text-based classification modeling flow for graph convolutional neural networks.
Fig. 3. TF-IDF Text Feature Extraction with Graph Convolutional Neural Networks.
3. Experimental Results and Analysis of Electric Power Speech Text Data Mining
3.1 Preparation of the Experiment
All the power speech text experiments in this study were done on a computer running the Windows 10 operating system, with the following hardware configuration: Intel Core i7, 3.4 GHz, dual-core four-thread CPU, 16.00 GB RAM, and a 256 GB SSD. The algorithmic code of the experiments in this paper was implemented on the Jupyter platform, with Python 3.6 as the development language, using Excel 2010 and a MySQL relational database (version MySQL 5.5) for data storage and Navicat software for visual access.

The Labeled-LDA model proposed in this experiment [23] is compared with the traditional TF-IDF and LDA topic models for text feature extraction to validate the effectiveness of the algorithm proposed in the previous section and to ensure the validity of the experimental process; part of the core algorithm code is given in Table 1.
Table 1. Implementation Flow of Some Core Codes.
Code Implementation Flow
from sklearn.feature_extraction.text import TfidfVectorizer

# Define the text data
documents = [
    'This is the first document.',
    'This is the second document.',
    'This is the third document.',
    'This is the third document. The third document contains some repeated words.',
    'The fourth document is very similar to the third.',
]

# Initialize the TF-IDF vectorizer
vectorizer = TfidfVectorizer()

# Transform the text data into TF-IDF feature vectors
tfidf_matrix = vectorizer.fit_transform(documents)

# Output the IDF value of each word
print('IDF values for each word:')
print(vectorizer.idf_)

# Output the shape of the TF-IDF feature matrix
print('Shape of TF-IDF feature vector:')
print(tfidf_matrix.shape)

# Output the dense TF-IDF feature matrix
print('TF-IDF feature vector:')
print(tfidf_matrix.toarray())
3.2 Study of Experiment 1
In the data mining experiments that extract power equipment topics based on the LDA topic model, most of the literature [29,30] sets the model parameters $\alpha$ and $\beta$ as $\alpha = 50/k$ and $\beta = 0.01$, where $k$ is the number of latent topics, which is adjusted according to the application scenario and the actual situation.

Experiment 1 matches the keywords extracted by TF-IDF with the power equipment topic data extracted by LDA. Since the data fall into six categories, the number of topics $k$ is set to 6 and the number of Gibbs sampling iterations to 600. Table 2 gives an example of the LDA topic model word recognition results, where Topic_1, Topic_2, Topic_3, Topic_4, Topic_5, and Topic_6 are the topic numbers recognized by LDA. The results after TF-IDF computation are shown in Table 3.
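As a reference for this parameter setting, the following minimal sketch configures an LDA model with $k=6$, $\alpha=50/k$, and $\beta=0.01$ on a placeholder corpus. Note that scikit-learn fits LDA with variational inference rather than the Gibbs sampling used here, so this only approximates the configuration described.

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative stand-in corpus, not the paper's power speech text dataset.
corpus = [
    "transformer winding temperature alarm",
    "customer electricity bill query",
    "power outage fault report",
    "meter reading business processing",
    "substation equipment maintenance record",
    "voltage fluctuation complaint",
]

k = 6
X = CountVectorizer().fit_transform(corpus)
lda = LatentDirichletAllocation(
    n_components=k,
    doc_topic_prior=50 / k,   # alpha = 50/k
    topic_word_prior=0.01,    # beta = 0.01
    max_iter=600,             # mirrors the 600 iterations of the paper
    random_state=0,
)
doc_topic = lda.fit_transform(X)  # document-topic distribution
print(doc_topic.round(3))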
The Simhash similarity between the top 100 keywords by TF-IDF weight (dataset words ordered by frequency of occurrence) and the top 100 power equipment topic data by LDA weight is analyzed, and the results are shown in Fig. 4. It can be seen that the validity of the recognition classification is high.
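As an illustration of this comparison, the following minimal sketch (illustrative keyword lists and weights, not the paper's top-100 lists) computes 64-bit Simhash fingerprints for two weighted keyword lists and their similarity as one minus the normalized Hamming distance.

import hashlib

# A minimal 64-bit Simhash sketch for comparing two weighted keyword lists.
def simhash(keywords, bits=64):
    votes = [0] * bits
    for word, weight in keywords:
        h = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            votes[i] += weight if (h >> i) & 1 else -weight
    return sum(1 << i for i in range(bits) if votes[i] > 0)

def similarity(a, b, bits=64):
    # Similarity = 1 - normalized Hamming distance between fingerprints
    return 1 - bin(a ^ b).count("1") / bits

tfidf_top = [("transformer", 0.9), ("fault", 0.8), ("query", 0.5)]
lda_top = [("transformer", 0.7), ("fault", 0.6), ("outage", 0.4)]
print(similarity(simhash(tfidf_top), simhash(lda_top)))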
To verify the performance of the proposed algorithm in Experiment 1, the three feature selection algorithms are evaluated with a naive Bayes classifier in terms of classification accuracy, recall, and the F1 metric, comparing the three text feature extraction algorithms: TF-IDF, the traditional LDA topic model, and the Labeled-LDA model. Fig. 5(a) shows the comparison of accuracy, Fig. 5(b) the comparison of recall, and Fig. 5(c) the comparison of the F1 value of the three feature extraction methods.
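As a sketch of this evaluation protocol, the following minimal example (toy texts and labels, standing in for the power speech text dataset) feeds TF-IDF features, one of the three compared methods, to a naive Bayes classifier and computes precision, recall, and F1 with scikit-learn.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import precision_recall_fscore_support
from sklearn.naive_bayes import MultinomialNB

# Toy stand-in data: two classes (0 = fault report, 1 = billing query).
train_texts = ["fault report outage", "bill query payment",
               "fault alarm transformer", "payment query invoice"]
train_labels = [0, 1, 0, 1]
test_texts = ["transformer fault outage", "invoice bill payment"]
test_labels = [0, 1]

vec = TfidfVectorizer()
clf = MultinomialNB().fit(vec.fit_transform(train_texts), train_labels)
pred = clf.predict(vec.transform(test_texts))

precision, recall, f1, _ = precision_recall_fscore_support(
    test_labels, pred, average="macro")
print(precision, recall, f1)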
The method used in Experiment 1 improves the accuracy, recall, and F1 value; for example, the F1 value of the feature extraction with the Labeled-LDA model is higher than those of TF-IDF and the traditional LDA topic model by 1.82% and 3.92%, respectively, so the F1 value of the improved LDA topic model is higher. Overall, the accuracy of text feature extraction based on the Labeled-LDA model is higher than that of the traditional LDA topic model and TF-IDF feature extraction; the traditional LDA topic model extracts the power equipment topic data mainly through a fully probabilistic unsupervised model. Through the fusion algorithm of the traditional LDA topic model and TF-IDF, with TF-IDF serving as an additional label for the LDA categories, the feature topics can be determined effectively, so the text feature extraction method proposed in the previous section is more effective and stable.
Fig. 4. Extraction of validity degree for TF-IDF weight recognition classification.
Fig. 5. Comparison of extraction results of features under three different methods.
Table 2. LDA Identification Results.

Topic   | Power equipment data identification number X (X_y)
Topic_1 | M (28), C (11), D (89), E (25), A (28)
Topic_2 | F (21), A (71), L (11), N (11), B (35)
Topic_3 | X (13), J (12), H (11), K (38), E (11)
Topic_4 | F (41), G (33), J (11), L (08), F (39)
Topic_5 | A (51), S (34), D (11), O (48), R (45)
Topic_6 | D (16), C (56), B (11), A (47), T (751)
Table 3. TF-IDF Results for Identifying Critical Data Areas.

Major category identification | Key data area X (X)
Topic_1 | M (2), C (1), D (8), E (2), A (2)
Topic_2 | F (2), A (7), L (1), N (1), B (3)
Topic_3 | X (1), J (1), H (1), K (3), E (1)
Topic_4 | F (4), G (3), J (1), L (0), F (3)
Topic_5 | A (5), S (3), D (1), O (4), R (4)
Topic_6 | D (1), C (5), B (1), A (4), T (7)
3.3 Study of Experiment 2
In Experiment 2, the association relationships between word items and between words and documents are mined, graph vectors are constructed, and the application of the graph convolutional neural network to text-based classification is realized. The accuracy, recall, and F1 value of the graph convolutional neural network text-based classification model are tested by varying the proportion of the training set, the window size, and the word embedding dimension, which proves the reliability of the algorithm. The graph convolutional neural network model is applied to the text-based classification data mined in the experiment; following the literature [43], the number of convolutional layers in the Text-GCN model is set to 2, the learning rate to 0.03, the dropout to 0.5, and the regularization parameter of the loss function to 0.
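As a reference for the layer structure used here, the following minimal sketch (illustrative sizes and random placeholder weights, not the trained model) shows one graph convolutional layer of the form used in Text-GCN: $H' = \mathrm{ReLU}(\hat{A} H W)$ with the renormalized adjacency $\hat{A} = D^{-1/2}(A+I)D^{-1/2}$.

import numpy as np

# A minimal sketch of one graph convolutional layer.
rng = np.random.default_rng(0)
n_nodes, in_dim, out_dim = 5, 8, 4

A = (rng.random((n_nodes, n_nodes)) > 0.6).astype(float)
A = np.maximum(A, A.T)                 # symmetric adjacency
A_tilde = A + np.eye(n_nodes)          # add self-loops
d = A_tilde.sum(axis=1)
D_inv_sqrt = np.diag(d ** -0.5)
A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt  # renormalized adjacency

H = rng.random((n_nodes, in_dim))      # input node features
W = rng.normal(scale=0.1, size=(in_dim, out_dim))  # layer weights
H_out = np.maximum(A_hat @ H @ W, 0)   # ReLU activation
print(H_out.shape)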
In the Text-GCN experiments, the dimension of the word embeddings in the input layer is one of the most important hyperparameters of the model; if the dimension is not chosen properly, overfitting occurs. In this experiment, the word embedding dimension is increased in increments of 50, starting from a base of 50, and the experimental results are shown in Fig. 6(a). Fig. 6(a) shows the effect of different word embedding dimensions in the Text-GCN input layer on classification performance, where the horizontal coordinate is the word embedding dimension and the vertical coordinate is the evaluation index. Analyzing Fig. 6(a), the accuracy rises slowly as the dimension increases, and when the dimension reaches 300 the accuracy levels off at about 70 percent. It can be concluded that word embeddings with too low a dimension cannot propagate the text information well across the whole graph, while higher-dimensional word embeddings do not improve the classification performance and take more training time.

In the word co-occurrence model, the size of the scanning window has an important impact on learning the correlations between word items. In this experiment, the scanning window is incremented by 2 at a time, and the experimental results are shown in Figs. 6(b) and 6(d). The results give the accuracy of text-based classification under different window sizes, where the horizontal coordinate is the scanning window size and the vertical coordinate is the evaluation index of text-based classification. From Figs. 6(b) and 6(d), the accuracy rises slowly as the window size increases and levels off when the window size reaches 6. This reflects that too small a scanning window fails to capture the co-occurrence information between words, while too large a window weakens the correlation between words.

With the scanning window size and the word embedding dimension of the word co-occurrence model held fixed, the proportion of the training set is varied to test the accuracy, recall, and F1 value of text-based classification. Fig. 6(c) shows the effect of different training set proportions on the accuracy of text-based classification; the horizontal coordinate is the training set proportion and the vertical coordinate is the text-based classification index. From Fig. 6(c), the text-based classification accuracy is highest when the training set proportion is 75%. This further illustrates that the graph convolutional neural network text-based classification model achieves high-accuracy classification with limited category-labeled documents, and that text-based graph vectors can better capture the text category information.
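As context for the window analysis above, the following minimal sketch (toy tokens and an assumed window size of 6) computes window-based co-occurrence counts and positive PMI edge weights in the style of Text-GCN graph construction.

import math
from collections import Counter
from itertools import combinations

# Illustrative tokenized text, not the paper's power speech corpus.
tokens = "fault report transformer fault outage report customer query".split()
window_size = 6

# Slide a fixed-size window over the token sequence.
windows = [tokens[i:i + window_size]
           for i in range(max(1, len(tokens) - window_size + 1))]
word_count = Counter()
pair_count = Counter()
for w in windows:
    word_count.update(set(w))
    pair_count.update(frozenset(p) for p in combinations(set(w), 2))

n = len(windows)
for pair, c in pair_count.items():
    a, b = tuple(pair)
    pmi = math.log((c / n) / ((word_count[a] / n) * (word_count[b] / n)))
    if pmi > 0:  # Text-GCN keeps only positive-PMI word-word edges
        print(a, b, round(pmi, 3))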
Fig. 6. Classification features for text recognition under three different methods.
3.4 Study of Experiment 3
Experiment 3 verifies the effectiveness of the topic model text-based classification algorithm with graph convolutional neural networks. In this experiment, the topic category label matrices generated in Experiment 1 and the text feature matrices generated in Experiment 2 undergo multi-source fusion to achieve text-based classification. The parameters of Experiment 3 are selected from the parameters corresponding to the optimal experimental results of Experiments 1 and 2. Classification experiments compare the Text-GCN text-based classification model with several other classification models on the same dataset.
The experimental results are analyzed in Fig. 7. The accuracy of the text-based classification model combining Labeled-LDA and Text-GCN is 76.4%, which is higher than that of the Text-GCN classification model alone and of the model combining Labeled-LDA with a Softmax classifier. There are three main reasons: 1) constructing a graph structure with textual features can accurately capture the relationships between words and between words and documents for text-based classification; 2) word nodes can serve as a bridge that not only collects the category information of the text but also transfers it to the neighboring nodes of the word node, so that the textual information propagates through the entire graph network structure; 3) splicing the topic category labels with the text features carrying word and document information yields a multi-source feature fusion matrix in which the topic category labels complement the TF-IDF text feature matrix. This sufficiently shows that the text-based classification method of extracting text features with the Labeled-LDA model and then fusing them with the multi-source features of the graph convolutional neural network is very effective.
The traditional LDA combined with Softmax has the lowest accuracy, 66.1%, among the text-based classification models. However, the text-based classification model of Word2vec combined with TF-IDF has the highest accuracy, 81.5%, among the six models. The main reason is that Word2vec generates word vectors by modeling the relationships between context and target words, in both CBOW and Skip-Gram modes. With TF-IDF as input, Word2vec is trained on a large-scale corpus to generate word vector representations carrying contextual information about the target words, which performs well in text-based classification. Experiments on the power data text dataset show that the accuracy of the topic model text-based classification based on the graph convolutional neural network is 76.4%, the recall is 75.2%, and the F1 value is 75.8%, which are 3%, 3.4%, and 3.2% higher, respectively, than those of the plain graph convolutional neural network text-based classification method. Compared with text-based classification using the Labeled-LDA text feature extraction method, the accuracy increases by 3.5%, the recall by 1%, and the F1 value by 2.3%, proving that the TF-IDF graph CNN method proposed in this paper can effectively improve the accuracy of text-based classification and recognition of power speech.
In addition, as shown in Fig. 8, the complex textual data information generated by electric power equipment is well recognized and classified by the method of this paper; the overall trend of the data and its peaks and valleys are in good agreement, which proves the accuracy and efficiency of the method.
Fig. 7. Classification features for text recognition under three different methods.
Fig. 8. Schematic of recognition results with different complex text data.
4. Conclusions and Discussions
A graph convolutional neural network method for processing and analyzing electric power speech text data is proposed here. The details are as follows.
(1) We propose a method for processing power speech text data using graph convolutional neural networks. The original text is first cleaned and segmented, and then classified and recognized by a deep classification and recognition model. The effectiveness of the method is experimentally verified on the electric power text dataset, and the results show that the classification accuracy is highest when the training set proportion is 75%. This indicates that the text-based graph convolutional neural network classification model can achieve high-accuracy classification under the condition of limited category-labeled documents, and that the text-based graph vectors can better capture the text category information. The method provides a new idea for power speech text data processing and helps to improve the intelligence level of the power system.
(2) The accuracy of the TF-IDF graph convolutional neural network-based topic model text-based classification is 76.4%, the recall is 75.2%, and the F1 value is 75.8%, which are 3%, 3.4%, and 3.2% higher, respectively, than those of the graph convolutional neural network-based text-based classification method. Compared with text-based classification using the Labeled-LDA model-based text feature extraction method, the accuracy improves by 3.5%, the recall by 1%, and the F1 value by 2.3%. In addition, the method in this paper can recognize and classify the complex textual data information generated in electric power equipment, and the overall trend of the data and its peaks and valleys are in good agreement.
Currently, graph convolutional neural network models for power speech text data processing
face challenges, including extracting key information, handling heterogeneous data,
and improving generalization capabilities. Future research can explore the combination
of advanced techniques and optimization algorithms to enhance the model performance
and consider practical applications to improve the intelligence of power systems.
REFERENCES
Zhou R S, Wang Z J. A Review of a Text Classification Technique: K-Nearest Neighbor[C]//
International Conference on Computer Information Systems and Industrial Applications.
2015.
Mukherjee I, et al. An Improved Information Retrieval Approach to Short Text Classification.
International Journal of Information Engineering and Electronic Business, 2017, 9(4):31-37.
Wang J, Li L, Ren F. An improved method of keywords extraction based on short technology
text. Faculty, 2010.
Wang D, et al. Retrieval Methods of Natural Language Based on Automatic Indexing.
International Conference on Computer \& Computing Technologies in Agriculture. Springer
International Publishing, 2016:346-356.
Chi XX. Research of Information Filtering Model Based on BP Artificial Neural Network
and Genetic Algorithm. International Conference on Natural Computation. IEEE, 2010:
1788-1791.
Huang C, Trabelsi A, Qin X, et al. Seq2Emo for Multi-label Emotion Classification
Based on Latent Variable Chains Transformation. 2019.
Sundus K, Al-Haj F, Hammo B. A Deep Learning Approach for Arabic Text Classification[C]//2019
2nd International Conference on new Trends in Computing Sciences (ICTCS). 2019.
Zhang, et al. Detecting hate speech on Twitter using a convolution-GRU-based deep
neural network. ESWC 2018:745-760.
Liu D, Shi T, Didonato J A, et al. Application of genetic algorithm/k-nearest neighbor
method to the classification of renal cell carcinoma. IEEE, 2004.
Bolshoy A, et al. Mathematical Models for the Analysis of Natural-Language Documents.
Genome Clustering. Springer Berlin Heidelberg, 2010: 23-42.
Debra, et al. A Framework for Evaluating Automatic Indexing or Classification in the
Context of Retrieval. Journal of the Association for Information Science and Technology,
2016, 67(1): 3-16.
Salton G, Yang C S. On the specification of term values in automatic indexing. Journal
of Documentation, 1973, 29(4): 351-372.
Dong L, Leland R P. The adaptive control system of a MEMS gyroscope with time-varying
rotation rate. IEEE, 2005.
Kiritchenko S, Matwin S. Email Classification with Co-Training. Proceedings of CASCON, 2001:301-312.
Mitchell T M. Machine learning. McGraw-Hill, 2003.
Feng G, et al. Feature subset selection using naive Bayes for text classification.
Pattern recognition letters, 2015, 65(NOV.1): 109-115.
Deng Breaking, et al. A text-based classification method based on statistical distribution
and set theory. Journal of Beijing Institute of Technology, 2006(07): 589-592+597.
Qiang G. An Effective Algorithm for Improving the Performance of Naive Bayes for Text
Classification[C]//Second International Conference on Computer Research \& Development.
IEEE, 2010.
Trstenjak B, et al. KNN with TF-IDF based Framework for Text Categorization. Elsevier
Ltd, 2014:1356-1364.
Meirong Wang, Text-based classification algorithm based on convolutional neural network.
Journal of Jiamusi University (Natural Science Edition), 2017, 036(003): 354-357.
Costales J A, Tuquero A C B, Nolia N V, et al. The Development of Mobile-Based Symptom Analysis for Early Detection of Diseases Using Hyper-Tuned C-Support Vector Classification Algorithm[C]//2023 5th International Conference on Control and Robotics (ICCR). [2024-03-21].
Rodriguez-Cristerna A, Guerrero-Cedillo C P, Donati-Olvera G A, et al. Study of the
impact of image preprocessing approaches on the segmentation and classification of
breast lesions on ultrasound[C]//2017 14th International Conference on Electrical
Engineering, Computing Science and Automatic Control (CCE). IEEE, 2017.
Zhong S H, et al. Bilinear deep learning for image classification. Proceedings of
the 19th International Conference on Multimedia, 2011:343-352.
Kuniaki, et al. Audio-visual speech recognition using deep learning. Applied Intelligence,
2015, 42(4):722-737.
Bengio, et al. A Neural Probabilistic Language Model. Journal of Machine Learning Research, 2003, 3:1137-1155.
Collobert R, Weston J. A unified architecture for natural language processing: deep
neural networks with multitask learning. Machine Learning. Proceedings of the Twenty-Fifth
International Conference (ICML 2008), 2008:160-167.
Mikolov T, et al. Efficient Estimation of Word Representations in Vector Space. Proceedings
of the International Conference on Learning Representations, 2013:1-12.
Xue Chunxiang, Zhang Yufang. A review of research on Chinese text-based classification for the power data domain. Library and Intelligence Work, 2015, 057(014):134-139.
Wu Jun, et al. Automatic classification of Chinese corpus. Journal of Chinese Information,
1995, 9(4):25-32.
Shiwu X, Juan Y, Xia W. Design and implement of urban land classification and evaluation
information system based on data center[C]//2010 The 2nd Conference on Environmental
Science and Information Application Technology. IEEE, 2010.
Chun C, Xiaonan W, Yanling L. Design and realization of a DNA sequence classification
system based on support vector machines. Journal of China Agricultural University,
2005.
Fan Yan, et al. Performance study of hypertext coordinated classifier. Computer Research
and Development, 2000, 37(9): 1026-1031.
Ferris G R. Method of Storing Data Used in Backtesting a Computer Implemented Investment Trading Strategy: US11718751[P]. US20070244788A1 [2024-03-21].
Li C, Jian S, Min Z, et al. Multi-scenario Application of Power IoT Data Mining for
Smart Cities[C]//2019.
Ai M A M, Chen N C N, Ge X G X, et al. A CEP based ETL method of active distribution
network operation monitoring and controlling signal data. IET, 2016.
Ren Q, Zhuo X. Application of an improved K-means algorithm in gene expression data
analysis[C]// International Conference on Systems. IEEE, 2011.
Yang Dan, Zhu Shiling, Bian Zhengyu. Application of improved K-means-based algorithm
in text mining. Computer Technology and Development, 2019, 29(4):68-71.
Wang H, Wang H, Jiang L, et al. Research and application of improved K-means based
on MapReduce. Journal of Physics Conference Series, 2020,1651:012074.
Uckol H I, Ilhan S, Ozdemir A. Partial Discharge Pattern Classification based on Deep
Learning for Defect Identification in MV Cable Terminations[C]// 2020 IEEE International
Conference on High Voltage Engineering and Application (ICHVE). IEEE, 2020.
Xin Zhao received a Bachelor of Engineering degree from Liaoning University of Engineering and Technology in 2016, and currently works at the State Grid Xinjiang Electric Power Co., Ltd Marketing Service Center as a special person in charge. His research interests include channel management, big data analytics, industrial economy, and project management.

Changda Huang received a Bachelor's degree in Engineering from North China Electric Power University in 2016, and currently works at the State Grid Xinjiang Electric Power Co., Ltd Marketing Service Center as an operation supervisor. His research interests include high-quality service, channel management, big data analytics, and industrial economy.