
  1. (Vincent Mary School of Science and Technology, Assumption University / Bangkok, Thailand, {kwan, paitoon} )
  2. (Library and Information Science Center, Chongqing Three Gorges Medical College / Chongqing, China )

Keywords: Website, Crawler, Topic model, Topic transfer, Development

1. Introduction

Educational data often reflect development trends in education, such as teacher quality and teaching ability. Analyzing a college's website data is an effective way to understand how it develops. The website acts as a gateway: important decisions and major historical events are continuously published there as news announcements, so these data can be mined to obtain important information.

Chongqing Three Gorges Medical College is a medical-oriented college in China, responsible for cultivating medical graduates for society. It is therefore of great social value to analyze the historical trends of the college. This paper uses a topic model to analyze the information on the official website of this college and then derives the development trends at the decision-making level from topic transfer over time, providing intellectual support for the comprehensive development of the college.

2. Related Work

Analyzing website data has long been a major research topic with many applications. Cha et al. improved the topic model to analyze the relationship graph of social-network data and then categorized the edges and nodes in the graph based on topic similarity [6]. Their model was quite effective in providing relevant recommendations within a social network. Rohani et al. ran the Latent Dirichlet Allocation (LDA) model on 90,527 social media records in the Aviation and Airport Management domain in 2015, detecting topic facets and extracting their dynamics over time [7]. Fuchs et al. proposed a social media analysis model based on trending topics extracted from Twitter and Google; they used textual context to enrich the trends, which helped identify semantically related topics, and their work outperformed several baselines, including knowledge-graph modeling using DBPedia and directly comparing articles or terms [8]. Wang et al. proposed a hot-topic detection approach based on bursty-term identification to help people assimilate the news immediately, considering both frequency and topicality properties to detect bursty terms and hot topics [9]. Later, in 2018, LDA was used to model the correlation of news items with stock-price time-series data [10]: past news items were used for training, the similarity between past and current news items was calculated to build the model, and the time points were shifted to predict the future. Li et al. extracted web news topics and used a model to detect topic-content evolution based on topic clusters, proposing a quantification method for topic content [11]. They showed the ``increase or decrease together'' law of topic intensity evolution. 
Sulova constructed models combining structured and unstructured data from databases, web pages, and server log files to organize the data from web applications and then provided a summary of the data [12]. Wu et al. proposed a multi-view learning framework that incorporates news titles and bodies as different views of news to learn unified news representations, achieving good performance in news topic prediction [13]. Zhang et al. proposed a dynamic topical community detection method that integrates link, text, and time to detect communities and topics [14]; the method can find communities and their topics with temporal variations. Since the Covid-19 pandemic presented a challenge to the global research community in 2020, Zhang et al. explored the continuously changing research trace of the pandemic based on resilience theory [15]. By extracting characteristic words from early Covid-19 research articles, they found that the pandemic significantly disrupted existing research. Bide et al. proposed a cross-event evolutionary detection framework to detect cross-events from similar time features [16]; segmentation clustering was conducted based on similarity computation, using the Bidirectional Encoder Representations from Transformers (BERT) model to encode tweets into vectors. In 2022, Zhang et al. designed a Bi-directional Long Short-Term Memory-Conditional Random Field (Bi-LSTM-CRF) model for patent entity extraction and proposed a topic evolution path identification method based on knowledge outflow and inflow to calculate the semantic relationship between topics of shared entities [17]. ZareRavasan et al. used the topic model on the abstracts of 2824 articles published between 1990 and 2020, reporting that topics such as information system (IS) social practice, IS emerging services, and IS sustainability had gained momentum [18]. Feng et al. 
used Feature Maximization (FM) measurements, combined with the contrast ratio, to select features and perform a diachronic analysis [19]. They developed an integrated method based on the Keywords-based Text Representation Matrix (KTRM) and Lamirel’s EC index, which performed well in analyzing diachronic topic evolution. Ding et al. proposed an enhanced latent semantic model based on user comments and regularization factors to capture the time-evolution features of potential topics [20]; their model can capture changes in users' interests and provide the evolutionary relationship between users' potential topics and product ratings. Finally, Churchill et al. presented a survey of topic models, tracing their origins back to the 1990s, comparing the models and their evaluation metrics, and laying the foundation for the next generation of topic models [21].

3. Crawler Designing and Data Cleaning

A crawler was implemented to obtain the data from the website. First, all links were collected. Each link was then checked, and non-standard content was filtered out, so that only the pages meeting the criteria listed in Table 1 were retained. The crawled data were cleaned according to these criteria.

The crawled data were transformed into a dictionary following the format in Table 1, representing the complete page data. Simultaneously, a simplified text containing only the title and content was written to a ``.txt'' file, which is the data fed into the LDA model.
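The filtering step can be sketched as a simple predicate over the crawled fields. This is an illustrative sketch only, not the authors' crawler code; the field names (url, title, content, pub_time, pub_source) are assumptions introduced here for clarity.

```python
def is_qualified(page: dict) -> bool:
    """Return True if a crawled page meets the Table 1 criteria."""
    url = page.get("url")
    if not url:                                   # criterion: url exists
        return False
    if not url.endswith(("htm", "html")):         # criterion: url ends with htm/html
        return False
    # the page must carry a title, content, publishing time, and source
    required = ("title", "content", "pub_time", "pub_source")
    return all(page.get(field) for field in required)

page = {"url": "https://example.edu/news/1.html", "title": "t",
        "content": "c", "pub_time": "2020-01-01", "pub_source": "s"}
print(is_qualified(page))                               # True
print(is_qualified({"url": "https://example.edu/a.php"}))  # False
```

Pages passing this check would then be serialized both as the full dictionary record and as the simplified title-plus-content text file.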

Table 1. Qualified Page Criteria.

Criterion
url exists
url string ends with "htm" or "html"
page contains title
page contains content
page contains publishing time
page contains publishing source

4. Topic Modeling

LDA [1] is a classic model that generates topics automatically. It evolved from the Unigram Model (UM) through Latent Semantic Analysis (LSA) [2] and probabilistic Latent Semantic Analysis (PLSA) [3]. Its concept is easy to understand: the whole corpus results from a term-generation process. Two distributions are involved: each document has a corresponding document-topic distribution, and each topic has a corresponding topic-term distribution. In the generation process, for each term position in a document, a topic is first chosen from the document-topic distribution, and a term is then chosen from that topic's topic-term distribution. Figs. 2 and 3 present the plate notation of LDA and the generation process, respectively. There are two main ways to estimate the LDA model. The first is Variational Inference, which involves the EM (Expectation Maximization) algorithm [4]. In the E-step, the coupling between the latent variables is broken through the variational assumption, and the variational distribution is obtained. In the M-step, the variational parameters are fixed, and the expectation is maximized through a series of Newton steps. The parameters are finally obtained through iteration. In recent years, an easier computation method has emerged: Gibbs sampling [5]. The idea behind it is that a Markov chain reaches its stationary state through continuous sampling. Gibbs sampling is a special case of MCMC (Markov Chain Monte Carlo), used mainly for multi-dimensional random variables: the joint distribution is approximated by repeatedly sampling each variable from its conditional distribution given the others. In LDA, the topic assignment of each term is resampled according to the Gibbs sampling formula, and the estimates converge when the number of sampling iterations is large enough.
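To make the sampling procedure concrete, the following is a minimal collapsed Gibbs sampler for LDA, written as an illustrative sketch rather than the implementation used in this study. Documents are assumed to be lists of integer word IDs; alpha and beta are symmetric Dirichlet priors.

```python
import numpy as np

def lda_gibbs(docs, n_vocab, n_topics, n_iter=100, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA. docs: list of lists of word IDs."""
    rng = np.random.default_rng(seed)
    n_docs = len(docs)
    ndk = np.zeros((n_docs, n_topics))   # document-topic counts
    nkw = np.zeros((n_topics, n_vocab))  # topic-term counts
    nk = np.zeros(n_topics)              # tokens assigned to each topic
    z = [rng.integers(n_topics, size=len(d)) for d in docs]  # random init
    for d, doc in enumerate(docs):
        for w, k in zip(doc, z[d]):
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]              # remove the current assignment
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                # full conditional p(z = k | everything else)
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + n_vocab * beta)
                k = rng.choice(n_topics, p=p / p.sum())
                z[d][i] = k              # record the new assignment
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    # posterior estimates of the two distributions described above
    theta = (ndk + alpha) / (ndk + alpha).sum(axis=1, keepdims=True)
    phi = (nkw + beta) / (nkw + beta).sum(axis=1, keepdims=True)
    return theta, phi                    # document-topic, topic-term
```

After enough iterations, theta and phi approximate the document-topic and topic-term distributions, which is exactly what the framework extracts for each document and topic.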

LDA has proven to be an effective model for topic analysis and word clustering. The selection of the topic number K should be determined by evaluation, which is described in the next section.

Fig. 1. Flow chart of the proposed topic-analyzing framework.
Fig. 2. LDA plate notation.
Fig. 3. LDA generating process.

5. Experiment

5.1 Data

The BeautifulSoup library of Python was used to implement the crawler, and 13682 links were obtained in total. After data cleaning, 8264 qualified links remained. The data were transformed into two file types. The first was 8264 ``.txt'' files ready to be fed into the LDA model; each ``.txt'' file contained only a title and content. The second was a large dictionary-form file comprising 8264 items. Fig. 4 presents the format of each item.

For the 8264 ``.txt'' files, that is, 8264 documents, the title length ranged from 10 to 20 Chinese characters, and the content length from approximately 500 to 1000 Chinese characters.

After obtaining the 8264 clean documents, Chinese word segmentation must be performed before they can be fed into LDA. Here, the well-known Jieba library of Python was used. After text preprocessing, the term-index map, the index-term map, and the corresponding segmented documents were obtained.
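The mapping step above can be sketched as follows. The function name build_vocab is hypothetical; in the actual pipeline each document would first be segmented with jieba.lcut, but the mapping logic is identical for any token lists.

```python
def build_vocab(segmented_docs):
    """Build term->index and index->term maps from tokenized documents.
    For the Chinese corpus, each document would be jieba.lcut(text)."""
    term2id, id2term = {}, {}
    for doc in segmented_docs:
        for term in doc:
            if term not in term2id:          # assign IDs in first-seen order
                term2id[term] = len(term2id)
                id2term[term2id[term]] = term
    # encode each document as a list of word IDs for the LDA model
    encoded = [[term2id[t] for t in doc] for doc in segmented_docs]
    return term2id, id2term, encoded

docs = [["topic", "model", "news"], ["news", "topic"]]
term2id, id2term, encoded = build_vocab(docs)
print(encoded)   # [[0, 1, 2], [2, 0]]
```

The encoded documents are what a sampler such as the one in Section 4 consumes, and the index-term map translates learned topic-term distributions back into readable words.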

Fig. 4. Dictionary format of items.

5.2 Evaluation Metric

Since perplexity cannot represent semantic coherence, two topic-coherence calculation methods were used for evaluation. The first is the topic coherence score, calculated as

Coherence-Score(t)$=\sum _{i=1}^{N-1}\sum _{j=i+1}^{N}\log \frac{C\left(t_{i},t_{j}\right)+1}{C\left(t_{i}\right)}$

where N denotes the number of terms on the top list of a specific topic; it is set by the researcher and was set to 15 in the present study. $C\left(t_{i},t_{j}\right)$ is the count of documents in which $t_{i}$ and $t_{j}$ both appear, and $C\left(t_{i}\right)$ is the count of documents in which $t_{i}$ appears. The other metric is the point-wise mutual information (PMI) score, which can be expressed as follows:

PMI-Score(t)$=\frac{2}{N\left(N-1\right)}\cdot \sum _{i=1}^{N-1}\sum _{j=i+1}^{N}\log \frac{p\left(t_{i},t_{j}\right)}{p\left(t_{i}\right)p\left(t_{j}\right)}$

where N is the same as above; $p\left(t_{i},t_{j}\right)$ is the co-occurrence probability of $t_{i}$ and $t_{j}$, computed as ``(the count of documents in which $t_{i}$ and $t_{j}$ both appear) / (the count of total documents in the corpus)''; $p\left(t_{i}\right)$ is the occurrence probability of $t_{i}$, computed as ``(the count of documents in which $t_{i}$ appears) / (the count of total documents in the corpus)''. Note that the above are the coherence score and PMI score for one topic; the final coherence score and PMI score are the averages over all topics.
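Under these definitions, both per-topic scores can be computed directly from document-frequency counts. The sketch below is illustrative; it assumes the top-N terms are passed in descending order of probability (so that $C(t_i)$ with $i<j$ matches the formula), and it skips zero co-occurrence pairs in the PMI sum to avoid log 0, a detail the formulas leave unspecified.

```python
import math

def doc_counts(docs, terms):
    """docs: list of sets of terms (one set per document).
    Returns document frequencies C(t) and co-frequencies C(t_i, t_j)."""
    single = {t: sum(t in d for d in docs) for t in terms}
    pair = {(a, b): sum(a in d and b in d for d in docs)
            for i, a in enumerate(terms) for b in terms[i + 1:]}
    return single, pair

def coherence_score(docs, terms):
    """Coherence-Score(t) for one topic; terms ordered by probability."""
    single, pair = doc_counts(docs, terms)
    return sum(math.log((pair[(a, b)] + 1) / single[a])
               for i, a in enumerate(terms) for b in terms[i + 1:])

def pmi_score(docs, terms):
    """PMI-Score(t) for one topic; zero co-occurrence pairs are skipped."""
    single, pair = doc_counts(docs, terms)
    n_docs, n = len(docs), len(terms)
    total = sum(math.log((pair[(a, b)] / n_docs)
                         / ((single[a] / n_docs) * (single[b] / n_docs)))
                for i, a in enumerate(terms) for b in terms[i + 1:]
                if pair[(a, b)] > 0)
    return 2.0 / (n * (n - 1)) * total
```

The reported metrics are these per-topic values averaged over all K topics.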

5.3 Results of Topics

The coherence score fluctuated, reaching its highest value of around 15.9 when the topic number was 10 (Fig. 5). Topic numbers of 15, 25, 45, and 40 also produced relatively high scores, while four points sat at lower positions, with topic numbers of 5, 20, 30, and 50, respectively. In Fig. 6, the PMI score increased as the topic number ranged from 5 to 40 and decreased drastically after that, peaking at a topic number of 40. Therefore, 40 was chosen as the topic number: at this point, the topic coherence was relatively high with the second-largest standard deviation, and the PMI score was the largest, also with the second-largest standard deviation. Note that a large standard deviation indicates that the topics are well separated from one another, so this is a good division.
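The selection rule amounts to picking the candidate K with the highest averaged PMI score, with coherence and the standard deviations consulted as tie-breakers as discussed above. A minimal sketch, with purely hypothetical score values for illustration (not the paper's measured values):

```python
def select_topic_number(candidates, pmi_of):
    """Pick the topic number K whose averaged PMI score is highest."""
    return max(candidates, key=pmi_of)

# hypothetical PMI scores per K, for illustration only
scores = {5: 0.21, 10: 0.25, 20: 0.31, 30: 0.34, 40: 0.38, 50: 0.29}
best_k = select_topic_number(scores.keys(), scores.get)
print(best_k)  # 40
```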

Overall, this paper considered K = 40 a proper value for the topic number. All 40 topics, with corresponding titles given by human interpretation, are listed in Table 3. Owing to limited space, eight representative topics were chosen, each with its top 10 Chinese phrases, as shown in Table 4. For example, topic-1, topic-12, and topic-30 were all categorized as ``Welcome'', but each focuses on a different facet. Topic-1 concerns orientation programs for the students, such as training sessions and the dean’s message. Topic-12 covers student enrollment, such as sign-up and specialty selection. Topic-30 focuses on the welcoming preparations made by the college.

Fig. 5. Topic coherence score with different topic numbers K.
Fig. 6. PMI score with different topic numbers K.
Table 2. Details of Coherence Score and PMI with the Respective Standard Deviation.

Topic number | Coherence score | Coherence score std | PMI std
(The numeric entries of this table did not survive extraction.)
Table 3. Title of all topics.

Topic titles (the topic-ID column did not survive extraction): Poverty alleviation; Traditional Chinese medicine; New semester starts; Party construction; Innovation & Research; College’s mission; Dean’s message; Students’ competitions; Medical alliance; Staff promotion; Jobs & Employment; Teacher training; Project declaration; Organization construction; Academic lectures; Security & Pandemics; Students’ internship; Vocational education; Welcome new students; Inspection of teaching; Nation development; Student sign up; College activities; Discipline construction; Construction of work style; Specialties construction; Enrollment affairs; Community & Volunteers; College development; Caring for teachers; Faculty & Department; Teaching ability; Academic affairs; Practical education; Strong rainfall.

Table 4. Some Representative Topics with Their Top 10 Chinese Phrases.

(The topic-ID column did not survive extraction; each line below lists one topic's translated top phrases.)
- representative, classmate, health training, students, hope, graduation, ceremony, youth, scholarship, sports meeting, newborn
- sign up, students, examinee, examination, admission, time, top-up exam, complete, recruit students, information, ordinary, selection
- work, students, prepare, school opens, service, welcome new students, library, parents, guarantee, scene, canteen, security
- activity, healthy, volunteer, society, service, resident, science popularization, community, practice, theme
- competition, contest, skill, big match, competitor, national, the first prize, the final round, won, test
- activity, classmate, dorm, culture, apartment, campus, exhibition, civilization, cultural festival, the most beautiful star in campus
- train, skill, theory, participate in, assistant, practicing, operation, ability, conduct, physician
- teacher, teaching, curriculum, promote, basics, level, design, teaching quality, young teachers, develop, to open up

5.4 Topic Transfer

First, the documents were sorted in chronological order using the ``date'' feature, and the topics for each document were sorted by probability. The earliest document dated from 2014: the website underwent a revision and upgrade that year and the previous data were cleared, so historical data were available only from 2014 onward. The topics were then counted over all documents in the same year, considering only the top N topics per document; here, N was set to 5. The topic counts over these top-five positions were sorted, and the five most frequent topics were taken. Table 5 lists the topic evolution by ``year''. Similarly, documents within the same month were grouped to obtain the topic evolution by ``month'', as listed in Table 6.
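The counting procedure can be sketched as follows, assuming each document carries an ISO-style date string and a probability-ranked list of topic IDs (the function and field names here are illustrative, not the study's code):

```python
from collections import Counter, defaultdict

def topic_transfer(documents, top_n=5, top_k=5):
    """documents: list of (date_string, ranked_topic_ids) pairs.
    For each year, count how often each topic appears among a document's
    top-N topics, then keep the top-K most frequent topics."""
    per_year = defaultdict(Counter)
    for date, ranked_topics in documents:
        year = date[:4]                         # dates like "2015-09-01"
        per_year[year].update(ranked_topics[:top_n])
    return {year: counts.most_common(top_k)
            for year, counts in sorted(per_year.items())}

docs = [("2015-03-01", [5, 36, 4, 32, 13, 7]),
        ("2015-09-10", [36, 5, 13, 2, 4, 9]),
        ("2016-01-15", [35, 11, 25, 10, 36, 1])]
print(topic_transfer(docs)["2016"])  # each of the five 2016 topics appears once
```

Grouping by `date[:7]` instead of `date[:4]` would yield the monthly variant used for Table 6.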

Table 5 shows that the news in 2014 focused on topic-12 (student sign-up), topic-14 (specialty construction), topic-18 (students’ academic affairs), topic-38 (practical education), and topic-39 (recruitment); however, this observation carries little confidence because only one document from 2014 survived. In 2015, the number of articles became substantial. The focal points were topic-5 (medical alliance), topic-36 (caring about teaching staff), topic-4 (dean’s message), topic-32 (student activities), and topic-13 (discipline construction). In 2016, the issues moved to topic-35 (examination), topic-11 (inspection of teaching), topic-25 (faculty meeting), topic-10 (vocational education), and topic-36, which had appeared the previous year. In 2017, three topics overlapped with 2016, namely topic-35, topic-36, and topic-10, and two new topics emerged: topic-38 (practical education) and topic-37 (teaching ability). In 2018, two new topics appeared, topic-14 (specialty construction) and topic-17 (faculty and department), alongside three topics from the two previous years: topic-10, topic-11, and topic-36. In 2019, three topics overlapped with the previous year: topic-36, topic-11, and topic-14. Two topics, topic-36 and topic-11, remained in 2020. Note that topic-9 (security and pandemics) climbed sharply here, mainly because of the outbreak of the Covid-19 pandemic at the beginning of 2020. Topic-16 (construction of school running and new development) also appeared for the first time, against the background of large financial support from the Chongqing city government. The topics in 2021 were quite different from those of 2020; among the five topics, only topic-16 appeared again. 
The other four topics were topic-21 (party construction), topic-31 (nation development), topic-14 (which had appeared in 2014 and 2019), and topic-29 (students’ internship). There were only 331 articles in 2022 because data collection ended in August. Three topics from 2020 and one from 2021 reappeared in 2022: topic-35, topic-36, topic-16, and topic-29. Across the whole period, topic-36 appeared almost every year; topic-35 and topic-11 were the next most frequent, followed by topic-10.

Some interesting information can be obtained from the data in Table 6. (1) The months with the fewest articles are February and August because they fall within the winter and summer vacations, respectively. (2) Chinese colleges often start the new academic year in September; thus, topic-1 (new semester starts) and topic-4 (dean’s message) appear in September. Occasionally, the start is postponed to October, so topic-1 also appears in October. (3) January and July often mark the end of the semesters; therefore, topic-35 (examination) appears in both months. (4) Summer vacation always falls in August and can last for more than a month; teachers often use this period to improve themselves, so topic-7 (teacher training) appears in August.

Table 5. Topics transition over the years.

Year   Count of Articles   Top 5 Topics (count)
2014   1                   12 (1), 14 (1), 18 (1), 38 (1), 39 (1)
2015   —                   5 (195), 36 (132), 4 (111), 32 (111), 13 (106)
2016   —                   35 (301), 11 (289), 25 (287), 10 (284), 36 (283)
2017   —                   35 (394), 38 (355), 36 (352), 37 (317), 10 (313)
2018   —                   14 (304), 10 (287), 17 (258), 11 (257), 36 (256)
2019   —                   36 (359), 2 (270), 11 (254), 21 (254), 14 (253)
2020   —                   36 (212), 16 (204), 35 (175), 25 (144), 11 (142)
2021   —                   21 (302), 16 (297), 31 (288), 14 (275), 29 (269)
2022   331                 35 (87), 36 (80), 17 (70), 16 (65), 29 (64)

Table 6. Topics transition over months.

Month       Count of Articles   Top 5 Topics (count)
January     —                   36 (181), 35 (164), 11 (142), 25 (118), 21 (114)
February    —                   30 (58), 25 (44), 35 (42), 12 (41), 5 (31)
March       —                   36 (194), 35 (191), 11 (137), 25 (130), 5 (128)
April       —                   36 (215), 10 (170), 35 (167), 21 (155), 25 (155)
May         —                   10 (218), 24 (211), 36 (195), 11 (180), 14 (173)
June        —                   36 (254), 37 (243), 14 (235), 17 (232), 24 (224)
July        —                   36 (154), 35 (151), 23 (118), 37 (115), 21 (108)
August      —                   35 (66), 30 (55), 14 (45), 7 (40), 2 (38), 10 (36)
September   —                   36 (205), 30 (192), 1 (172), 4 (172), 11 (168)
October     —                   10 (143), 36 (130), 1 (121), 38 (120), 11 (119)
November    —                   11 (201), 10 (198), 24 (184), 5 (180), 31 (170)
December    —                   10 (198), 36 (194), 14 (180), 11 (179), 38 (177)
6. Conclusion

Manually analyzing the huge amount of information on the network and obtaining knowledge from it is an exhausting task. To address this problem for educational data, an analysis framework was constructed to trace topic evolution automatically. Taking Chongqing Three Gorges Medical College as a case study, a crawler was designed to obtain the website data, which were then cleaned and restructured and fed into the topic-processing module. The optimal topic number, the term distribution within each topic, and the topic transfer over the period were obtained. Owing to time constraints, Gaussian LDA was not considered for comparative experiments, and frequent-item mining was not included in this paper; both could be conducted later. This work can provide an administrative summary for the leadership of an educational organization, enabling better control of its development direction, and can offer ideas and management experience to other colleges.


Acknowledgments

This work was supported by the Chongqing Three Gorges Medical College of China (No. 2019XZYB13) and by the Chongqing Association of Higher Education under Chongqing Municipal of China (No. CQGJ21B128). Here author Lei would like to express his gratitude to his spouse, Zheng Teng.




Lei Peng

Lei Peng is a Ph.D. candidate at the Vincent Mary School of Science and Technology at Assumption University, Thailand. He received his B.Eng. from the Computer Science and Technology department at Shangqiu University of China and his M.Eng. from the Computer Science and Technology department at Guizhou University of China. His research areas include information retrieval, text mining, topic modeling, and time-series data mining.

Kwankamol Nongpong

Kwankamol Nongpong is a professor at the Vincent Mary School of Science and Technology, Assumption University, Thailand. Her research interests are text processing, natural language processing, big data processing and analysis, program analysis, and enterprise applications. Nongpong also works closely with industry, where she has consulted for companies and government agencies on data standards, project management, business process improvement, and ERP system development.

Paitoon Porntrakoon

Paitoon Porntrakoon received his Ph.D. in Information Technology from Assumption University, Thailand in the year 2018. He is currently the Graduate Program Director in Information Technology of the Vincent Mary School of Science and Technology at Assumption University, Thailand. His research interests are similarity searching, location detection, trust and distrust, social commerce, and Thai sentiment analysis.