3.1 Data Source and Preprocessing of Xi’an Tourist Attractions
After preprocessing the original review data, 2548 effective evaluation records remained. On this basis, a tourist perception model of the Xi'an scenic-spot image
was built using big data technology. The model first uses the term frequency–inverse
document frequency (TF-IDF) algorithm to analyze tourists' cognitive image,
then uses the Naive Bayes (NB) method to analyze their emotional image,
and finally uses the latent Dirichlet allocation (LDA) topic model to analyze the overall
image of the scenic spots. Before the tourist perception model of the image
of Xi'an tourist attractions could be constructed, the original data had to be cleaned with preprocessing
techniques. Taking Xi'an tourist attractions as an example, a tourist perception model
of the destination image was constructed from user review data on tourism
websites. The selected reviews reflect the real experiences of users during the
travel process, so the data are highly valid. The source material was online
evaluation data for Xi'an tourist attractions from Dianping.com and Ctrip, two
websites with large user bases and high comment volumes.
The online comment data were collected with a crawler program written in Python 3.0. The
data include the user number, comment content, comment time, and score. Because
the dataset is large and takes a long time to review, and older comments cannot
truly reflect the current image of the tourist attractions, the collection
window was set to 2020-2022. A total of 5463 original tourism evaluation records were
obtained. Table 1 shows a sample of the evaluation data for Xi'an tourist attractions.
Table 1. Partial evaluation data of Xi’an tourist attractions.
| Website source | User No. | Evaluation content | Evaluation time | Tourist rating |
| --- | --- | --- | --- | --- |
| Ctrip | M12****0563 | The scenery is beautiful, the tour guide's explanation is also very meticulous and professional, and we learned a lot of historical and cultural knowledge | 2022-06-18 | 4 |
| Ctrip | M18****6358 | The special snacks in the scenic spot are delicious and cheap | 2021-05-22 | 5 |
| Dianping.com | Dpuser_3119058389VIP | The weather in Huashan is suitable, especially at sunrise and sunset. It deserves the title of the most dangerous mountain in the world | 2022-08-23 | 4 |
| Dianping.com | Dpuser_4251826625VIP | The Great Tang Never Night City is suitable for a night visit outside of holidays, with a strong historical and cultural atmosphere | 2020-05-17 | 5 |
In the online review data, different tourists pay attention to different aspects of the
scenic spots, so the evaluation content amounts to a holistic, multi-dimensional
assessment. At the same time, the content and format of the evaluations differ greatly,
and these differences affect the whole research process and its results. Therefore,
before formally starting the analysis of the evaluation data, the data need to be
processed to eliminate repeated comments, garbled characters, comments that are too
short, and other low-quality comment data, so as to ensure the quality of the evaluation data.
Fig. 1 shows the specific preprocessing methods. The first is text de-duplication, the
purpose of which is to remove duplicate entries in the evaluation data and near-identical
comments from the same user. Identical evaluation content can be cleared with the pandas
df.drop_duplicates() and df.duplicated() functions. The second method is compressing
words and phrases. After text de-duplication, the quality of the comment data still cannot
meet the requirements of modeling and analysis, because de-duplication operates only on
whole comments, not on the phrases and words within a comment. This study therefore
deals with repeated words and phrases by compressing them.
Fig. 1. Specific methods of pretreatment.
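A minimal sketch of the de-duplication step, assuming the crawled reviews have been loaded into a pandas DataFrame with columns matching Table 1 (the file and column names used here are illustrative):

```python
import pandas as pd

# Hypothetical file of crawled reviews with columns matching Table 1.
df = pd.read_csv("xian_reviews.csv")

# Flag comments whose user number and comment content both repeat,
# then drop them, keeping the first occurrence.
duplicated_mask = df.duplicated(subset=["user_no", "content"], keep="first")
print(f"{duplicated_mask.sum()} duplicate comments found")
df = df.drop_duplicates(subset=["user_no", "content"], keep="first")
```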
The third method is to delete numbers, emoticons and symbols, English text, and short sentences
[16-18]. The length of tourists' comments varies widely: comments can run to hundreds
of characters and contain emoticons and different formats. Although some short comments
of only two or three words do describe an experience, no in-depth information can be
obtained from them, so they need to be deleted. For the deletion of short sentences,
the minimum comment length was set to six words: comments below this threshold were
deleted, and comments at or above it were retained for analysis.
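Continuing with the DataFrame df from the sketch above, the cleaning and short-comment filtering might look as follows; the regular expression and the jieba-based word count are illustrative choices, not taken from the text:

```python
import re
import jieba

def clean_comment(text: str) -> str:
    # Keep Chinese characters and basic Chinese punctuation; this strips digits,
    # Latin letters, emoticons, and other stray symbols.
    return re.sub(r"[^\u4e00-\u9fa5，。！？、]", "", text)

df["content"] = df["content"].astype(str).map(clean_comment)

# Keep only comments with at least six words (a quick jieba cut is used here
# purely for counting).
df = df[df["content"].map(lambda t: len(jieba.lcut(t)) >= 6)]
```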
The fourth method is to eliminate stop words. The stop words used in this study come
from a list compiled from the relevant scenic-spot comments, the stop-word list of the
Harbin Institute of Technology, the stop-word list of the Machine Intelligence Laboratory
of Sichuan University, and the Baidu stop-word list. Removal was performed with the
remove_stopwords function in Python's Gensim library. Fifth, Chinese word segmentation
and part-of-speech analysis were performed. To count the frequency of each word and
obtain the subject words and feature words in a comment, the comment content is divided
into valid words by Chinese word segmentation. In view of the particularity of online
reviews of scenic spots, this study added a customized dictionary of scenic-spot terms
on the basis of the Tsinghua University dictionary and the HowNet dictionary. The part
of speech of each segmented valid word was then analyzed. The tool used for word
segmentation was the Python jieba package.
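The segmentation and stop-word steps can be sketched as follows; the dictionary and stop-word file names are illustrative, and a simple set-based filter is used here in place of Gensim's remove_stopwords, whose built-in list targets English:

```python
import jieba
import jieba.posseg as pseg

# Hypothetical file names for the customized scenic-spot dictionary and the merged
# stop-word list described above.
jieba.load_userdict("scenic_spot_dict.txt")
with open("merged_stopwords.txt", encoding="utf-8") as f:
    stopwords = {line.strip() for line in f if line.strip()}

def segment(comment: str):
    # Return (word, part-of-speech) pairs with stop words filtered out.
    return [(w.word, w.flag) for w in pseg.cut(comment) if w.word not in stopwords]

print(segment("大唐不夜城的夜景非常震撼"))
```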
3.2 Construction of Visitor Perception Model Applying Big Data Technology
After preprocessing the raw data, a visitor perception model that analyzes the valid
comment set with the TF-IDF algorithm, NB network, and LDA topic model was constructed.
The model computes the TF-IDF value of each word and ranks the
top 50 feature words by the size of their TF-IDF values to obtain the key
topics frequently mentioned in tourist evaluations. These key feature words obtained
by the TF-IDF algorithm help to construct the dimensions of tourists' cognitive image
analysis and help researchers understand the tourism-related matters that tourists
care about.
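A minimal sketch of this feature-word extraction, assuming the preprocessed comments are available as space-separated token strings; scikit-learn's TfidfVectorizer and the ranking-by-maximum choice are illustrative assumptions, not taken from the text:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical segmented comments, joined into space-separated token strings.
comments = ["风景 很 美 讲解 专业", "小吃 好吃 便宜", "夜景 震撼 历史 氛围 浓厚"]

# The default token pattern drops single-character tokens, so a whitespace-based
# pattern is used to keep every segmented Chinese word.
vectorizer = TfidfVectorizer(token_pattern=r"(?u)\S+")
tfidf = vectorizer.fit_transform(comments).toarray()

# Rank feature words by their largest TF-IDF value in any comment and keep the top 50.
scores = tfidf.max(axis=0)
terms = np.array(vectorizer.get_feature_names_out())
top50 = terms[np.argsort(scores)[::-1][:50]]
print(top50)
```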
The model then uses the NB network to classify the tourists' evaluation text emotionally,
obtaining the main emotions of the tourists about various tourism matters. Finally,
the model uses the LDA topic model based on the results of the TF-IDF algorithm and
NB network analysis to construct the relationship between emotional evaluation and
tourism matters and conduct a thematic clustering analysis for an overall evaluation
of tourist attractions. The TF-IDF algorithm is a numerical statistical method that
is used as a weighting factor in the search process of user modeling, text mining,
and information retrieval. The value of this factor increases in proportion to the number
of times a word appears in a comment [19,20]. TF has many variants, including logarithmically scaled, Boolean, and raw-count
forms, and can be written as $f(t,d)$. IDF is a measure of the information
provided by a word and can be written as $idf(t,D)$. Expression (1) defines the text set,
where the total number of texts is $N$, $D$ is the random variable over the text collection,
$d$ is an element of $D$, and $i$ indexes the $i$-th text. Expression (2) defines the word
set of the text collection, where $M$ is the total number of words and $W$ is the random
variable over the words. Assuming that all elements of $D$ are equally probable, the
probability $P(d_{i})$ of each text is given by Eq. (3).
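In the standard set notation assumed here, the text set, the word set, and the uniform document probability are:

$$D=\left\{d_{1},d_{2},\cdots ,d_{N}\right\}, \qquad W=\left\{w_{1},w_{2},\cdots ,w_{M}\right\}, \qquad P\left(d_{i}\right)=\frac{1}{N},\quad i=1,2,\cdots ,N.$$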
The amount of information calculated for each document is $-\lg \left(\frac{1}{N}\right)$,
and the entropy of the random variable $D$ is given by Eq. (4).
We set the number of documents containing the word $w_{i}$ to $N_{i}$. If each of these
documents is equally probable, the amount of information per document is $-\lg
\left(\frac{1}{N_{i}}\right)$, and the entropy of $D$ restricted to the documents
containing $w_{i}$ is given by Eq. (5).
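Under this uniform assumption, the two entropies take the standard forms assumed here:

$$H(D)=-\sum _{i=1}^{N}\frac{1}{N}\lg \left(\frac{1}{N}\right)=\lg N, \qquad H\left(D\left| w_{i}\right.\right)=-\sum _{j=1}^{N_{i}}\frac{1}{N_{i}}\lg \left(\frac{1}{N_{i}}\right)=\lg N_{i}.$$

Their difference, $\lg N-\lg N_{i}=\lg \left(\frac{N}{N_{i}}\right)$, is the information gained by observing $w_{i}$, which is exactly the IDF factor.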
Documents that do not contain $w_{i}$ have probability 0 in the selected subset, so $N-N_{i}$
does not appear in formula (5). If a word $w_{i}$ is drawn arbitrarily from the text collection,
the frequency of $w_{i}$ in document $d_{j}$ is denoted $f_{ij}$, the frequency of $w_{i}$ in the
whole collection is $f_{{w_{i}}}$, and the total number of words in the collection is $F$;
these quantities satisfy the relation in Eq. (6).
The mutual information value $M\left(\Delta ,\Omega \right)$ is then given by Eq. (7),
and the calculation expression in the form of $f_{ij}$ follows as Eq. (8).
The IDF factor refers to the change of information quantity after observing a specific
word, and the TF factor refers to the probability estimate of actually observing a
word. Eqs. (7) and (8) refer to two different aspects. When TF refers to $f_{{w_{i}}}$, TF-IDF refers to
the measurement of word selection. When TF refers to $f_{ij}$, TF-IDF refers to the
measure of word weight [21-23].
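Assuming the conventional product of the two factors, the two variants discussed above can be written as:

$$\mathrm{tfidf}\left(w_{i}\right)=f_{{w_{i}}}\cdot \lg \left(\frac{N}{N_{i}}\right), \qquad \mathrm{tfidf}\left(w_{i},d_{j}\right)=f_{ij}\cdot \lg \left(\frac{N}{N_{i}}\right),$$

where the first form serves word selection over the whole collection and the second weights word $w_{i}$ within a single comment $d_{j}$.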
An NB network is a probability distribution over a group of random variables and can
be divided into static and dynamic NB networks; the difference is that a dynamic NB
network also considers the impact of time on the results. An
NB network can be written as $G=\left(I,L\right)$, where $L$ is the set
of directed segments connecting the nodes and $I$ is the set of all nodes in the
network structure. An NB network thus consists of two parts: the variable nodes
and the directed segments between them. Each directed segment carries a conditional
probability value. If two nodes are not connected, the corresponding random variables
can be considered independent of each other, and the conditional probability value
is 0.
We denote the directed acyclic network graph by $S$. The joint probability distribution
of the variables $X=\left\{x_{1},x_{2},\cdots ,x_{n}\right\}$, written $P\left(x_{1},x_{2},\cdots
,x_{n}\right)$, is given by Eq. (9).
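Assuming the standard factorization of a directed acyclic network, this joint distribution reads:

$$P\left(x_{1},x_{2},\cdots ,x_{n}\right)=\prod _{i=1}^{n}P\left(x_{i}\left| P_{ai}\right.\right).$$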
In Eq. (9), $P_{ai}$ refers to the set of parent nodes of variable $x_{i}$. The calculation expression of
joint probability coding of variable $X=\left\{x_{1},x_{2},\cdots ,x_{n}\right\}$
is Eq. (10).
In Eq. (10), $\theta _{i}$ refers to the parameter variable. The vector formed by the parameter
set is referred to by $\theta _{s}$. The joint probability distribution obtained from
the decomposition of $S$ is $S^{b}$. The calculation expression of local distribution
function is given in Eq. (11).
Eq. (11) can be understood as a continuous variable regression function and discrete variable
regression function. The construction of the NB network model is as follows. We determine
the properties of the node variables, set the value range, and determine the conditional
probability of the directed segment between the nodes. From the perspective of the
reasoning direction of the NB network structure diagram, conditional probability can
be divided into a prior probability and a posterior probability. A prior probability
is obtained from background knowledge and historical data. The posterior probability
is calculated on the basis of the prior probability. The two probabilities have the
same form. We let $w_{1},\cdots ,w_{i},\cdots ,w_{n}$ denote all of the categories,
and the NB network equation is given in Eq. (12).
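In the standard Bayes form assumed here, the equation reads:

$$P\left(w_{i}\left| x\right.\right)=\frac{P\left(x\left| w_{i}\right.\right)P\left(w_{i}\right)}{P\left(x\right)}, \qquad P\left(x\right)=\sum _{j=1}^{n}P\left(x\left| w_{j}\right.\right)P\left(w_{j}\right).$$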
In Eq. (12), $P\left(x\left| w_{i}\right.\right)$ refers to the likelihood function of category
$w_{i}$ with respect to feature vector $x$. $P\left(w_{i}\right)$ refers to the prediction
of the probability of occurrence of various categories. $P\left(x\left| w_{i}\right.\right)$
refers to the probability of occurrence of feature vector $x$ in category $w_{i}$.
$P\left(w_{i}\left| x\right.\right)$ is the posterior probability, and $P\left(x\right)$
is the total probability obtained by summing $P\left(x\left| w_{j}\right.\right)P\left(w_{j}\right)$ over all categories.
A flow chart of NB network modeling is shown in Fig. 2. The key step is to determine the conditional probabilities and causal
relationships among the variables based on the database and expert knowledge. The model
determines the relationships between the variables by learning the NB network structure [24,25].
Fig. 2. NB network modeling flow chart.
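The emotional classification step can be sketched as follows, assuming a small manually labeled subset of comments (1 = positive, 0 = negative) is available for training; scikit-learn's MultinomialNB is used here purely for illustration and is not prescribed by the text:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Segmented comments joined back into space-separated strings (hypothetical examples).
train_texts = ["风景 很 美 讲解 专业", "排队 时间 太 长 体验 差"]
train_labels = [1, 0]  # 1 = positive emotion, 0 = negative emotion

# Fit a Naive Bayes classifier on TF-IDF features of the labeled comments.
clf = make_pipeline(TfidfVectorizer(token_pattern=r"(?u)\S+"), MultinomialNB())
clf.fit(train_texts, train_labels)

# Predict the sentiment polarity of the remaining (unlabeled) comments.
print(clf.predict(["夜景 震撼 氛围 很好"]))
```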
The LDA topic model is a three-layer Bayesian probability model, which includes a
three-layer structure for the text, topic, and word. The graph model is shown in Fig. 3. White circles and orange circles refer to hidden variables and observed variables,
respectively. $\alpha $ and ${\beta}$ refer to the hyperparameters of topic distribution
and term distribution, respectively. $\overset{\rightarrow }{\theta }_{d}$ and $\overset{\rightarrow
}{\varphi }_{k}$ refer to the subject distribution of the text and the word distribution
under the subject, respectively. $z_{d,n}$ and $w_{d,n}$ refer to the topic assigned to the
$n$-th word item of text $d$ and the $n$-th word item itself, respectively. The number of topics is $K$,
and the total number of words in text $d$ is $N_{d}$. Eq. (13) gives the topic distribution of each text based on probability.
Eq. (14) refers to the term distribution of each topic $z\in \left\{1,2,\cdots ,K\right\}$
based on probability.
The joint probability of the implicit variables and the observed variables under the
given parameters is given by Eq. (15), in which $\Phi $ denotes the set of topic-word
distributions $\left\{\overset{\rightarrow }{\varphi }_{k}\right\}_{k=1}^{K}$.
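In the standard notation assumed here, the distributions in Eqs. (13) and (14) and the joint probability in Eq. (15) can be written as:

$$\overset{\rightarrow }{\theta }_{d}\sim \mathrm{Dirichlet}\left(\alpha \right), \qquad \overset{\rightarrow }{\varphi }_{k}\sim \mathrm{Dirichlet}\left(\beta \right),\quad k\in \left\{1,2,\cdots ,K\right\},$$

$$p\left(\overset{\rightarrow }{w}_{d},\overset{\rightarrow }{z}_{d},\overset{\rightarrow }{\theta }_{d},\Phi \left| \alpha ,\beta \right.\right)=p\left(\overset{\rightarrow }{\theta }_{d}\left| \alpha \right.\right)\,p\left(\Phi \left| \beta \right.\right)\prod _{n=1}^{N_{d}}p\left(z_{d,n}\left| \overset{\rightarrow }{\theta }_{d}\right.\right)\,p\left(w_{d,n}\left| \overset{\rightarrow }{\varphi }_{z_{d,n}}\right.\right).$$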
LDA is used to identify the topic information implied in a large text set or a large
corpus. For all documents in this corpus, LDA has the following generation process.
First, it draws a topic from the document's topic distribution. It then draws a word
from the word distribution of the selected topic. This process is repeated in a loop
until all the word positions in the document have been generated. The
LDA topic model can automatically identify the topic of the document.
The Gibbs sampling algorithm is easy to understand and not difficult to implement,
and its extraction effect is particularly good when topics are extracted from a large
number of samples. Therefore, the Gibbs sampling algorithm was used to estimate the
parameters of the LDA topic model. Using the LDA topic model, the topic probabilities
of the positive-emotion texts and the negative-emotion texts can be calculated. At
the same time, the distribution probabilities of the words contained in each topic
(the topic vector) are obtained, and finally the clustering result for each topic is
obtained. The LDA thematic clustering results are then refined, and the overall image
perception of Xi'an tourist attractions is summarized.
When determining the number of topics in a document set, the choice of the number of
topics greatly affects the quality of topic modeling. Therefore, the optimal number
of topics must be determined before the LDA topic model is formally established. This
study selected perplexity as the indicator for choosing the optimal number of topics:
the lower the perplexity, the better the number of topics. The Gibbs sampling method
was used to calculate the perplexity for topic numbers between 2 and 40, and the
relationship between perplexity and the number of topics was plotted.
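A minimal sketch of this topic-number selection, assuming the segmented and stop-word-filtered comments are available as token lists; Gensim is used here for illustration (it fits LDA with variational inference rather than the Gibbs sampling named above), and the data and variable names are hypothetical:

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Hypothetical token lists produced by the preprocessing steps in Section 3.1.
docs = [["兵马俑", "历史", "震撼"], ["夜景", "大唐不夜城", "氛围"]]
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

for k in range(2, 41):
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k,
                   passes=10, random_state=1)
    # log_perplexity returns the per-word likelihood bound; 2 ** (-bound) gives the
    # perplexity, which should be lowest near the best number of topics.
    perplexity = 2 ** (-lda.log_perplexity(corpus))
    print(k, perplexity)
```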