Doohong Park 1,2
Donggoo Kang 3
Joonki Paik 3,4
1 (ROK Army, South Korea, opdho@naver.com)
2 (National Defense AI School, Chung-Ang University, Seoul, South Korea)
3 (Department of Image, Chung-Ang University, Seoul, South Korea, dgkang@ipis.cau.ac.kr)
4 (Department of Artificial Intelligence, Chung-Ang University, Seoul, South Korea, paikj@cau.ac.kr)
Copyright © The Institute of Electronics and Information Engineers (IEIE)
Keywords
Defense issue analysis, Topic modeling, LDA, BERT
1. Introduction
Defense Innovation 4.0 is the current government's defense reform plan. It was announced
as one of the 110 national agenda tasks on May 3, 2022, by the 20th Presidential Transition Committee [1]. The plan aims to redesign the overall defense posture at a level comparable to founding a new military and to build an AI-based science and technology force within the defense sector. A further key focus is addressing the shortage of military human resources and minimizing casualties. Since the announcement of the national tasks, the Ministry of National Defense
has embarked on the comprehensive pursuit of Defense Innovation 4.0, as evidenced
in Table 1. From the establishment of the Defense Innovation 4.0 Task Force [2] to the unveiling of the basic plan [3], the ministry has invested substantial effort. However, despite these endeavors,
an analysis regarding the extent to which Defense Innovation 4.0 has become an issue
in the general media or among citizens, and the impact it has exerted, has not yet
been conducted.
In this paper, we aim to analyze the level of public interest and awareness regarding Defense Innovation 4.0, the current government's defense reform initiative. We additionally seek to investigate objectively whether private media address issues related to Defense Innovation 4.0 within the sphere of national defense. To achieve
this, articles concerning defense matters from private media sources were collected
using the Korea Press Foundation's BigKinds platform [4].
Table 1. Defense Innovation 4.0 progress.
| Progress Overview | Key Content | Remarks (Lead) |
| 2022.07.01 Establishment of Defense Innovation 4.0 Task Force | Conception of the basic concept of "Defense Innovation 4.0" | Deputy Minister of Defense |
| 2022.07.14 1st Meeting of Defense Innovation 4.0 Task Force | Issuance of the Ministry of Defense guidelines for "Defense Innovation 4.0" | Minister of Defense |
| 2022.08.10 2nd Meeting of Defense Innovation 4.0 Task Force | Review of tasks in the basic plan of "Defense Innovation 4.0" | Minister of Defense |
| 2022.08.12 1st Seminar on Defense Innovation 4.0 | Gathering opinions of internal and external military experts on the direction of "Defense Innovation 4.0" | Deputy Minister of Defense |
| 2022.09.27 2nd Seminar on Defense Innovation 4.0 | Presenting major issues and alternatives for "Defense Innovation 4.0" | Minister of Defense |
| 2022.10.26 3rd Seminar on Defense Innovation 4.0 | Proposal of development strategies for defense R&D systems and augmentation processes, the foundation of "Defense Innovation 4.0" | Chairman of the National Defense Committee & Minister of Defense |
| 2022.11.16 4th Seminar on Defense Innovation 4.0 | Forming a consensus among key military officials regarding "Defense Innovation 4.0" | Minister of Defense |
| 2023.03.03 Announcement of Defense Innovation 4.0 Basic Plan | Presidential endorsement of the "Defense Innovation 4.0 Basic Plan" | Ministry of Defense |
For effective analysis, we employ the Latent Dirichlet Allocation (LDA) method to
identify the top N significant topics from these articles. The LDA method filters
out unnecessary noise and models word associations with specific topics. Through LDA
topic modeling, we identify representative topics in three categories: 1) pre-Defense
Innovation 4.0 articles, 2) post-Defense Innovation 4.0 articles, and 3) articles
specifically related to Defense Innovation 4.0. To assess the impact of Defense Innovation
4.0, we compute sentence similarity among topics using the bidirectional encoder representations
from transformers (BERT) model. This approach allows us to analyze the evolution of
defense issues from the official announcement date of Defense Innovation 4.0.
Moreover, this research underscores the importance of leveraging advanced machine-learning
techniques in policy analysis. By employing methods like LDA and BERT, we can provide
a nuanced understanding of how major policy initiatives are reflected in public discourse.
This approach not only aids in measuring the effectiveness of communication strategies
but also highlights the evolving nature of public opinion and media focus. Such insights
are invaluable for policymakers to adapt and respond proactively to public sentiment,
ensuring that the objectives of initiatives like Defense Innovation 4.0 are met with
informed support and constructive feedback.
Our objectives are threefold:
• Analyze public and media awareness of South Korea's Defense Innovation 4.0 reform
plan using LDA topic modeling on media articles before and after its announcement.
• Introduce text mining methods, specifically LDA and BERT-based cosine similarity,
to quantify changes in the prominence and significance of defense innovation themes.
• Compare the most prominent topics associated with Defense Innovation 4.0 to those
in general defense-related discussions, revealing the policy shift's potential impacts
and implications on future discourses and priorities.
2. Related Work
2.1. Topic Modeling and Issue Analysis
In recent years, there has been a growing interest in applying topic modeling techniques
to analyze large-scale text data in various domains, including the military domain.
This section provides an overview of the existing literature on this topic and highlights
some key contributions that have influenced our research.
One of the earliest works on topic modeling for analyzing military documents is by
Wu et al. [5], who argue that the surge in network data presents new opportunities for military
intelligence acquisition but stretches traditional intelligence analysis methods.
They propose enhancing military intelligence analysis by integrating data mining technology
and creating a network military intelligence analysis model relying on data mining.
Additionally, Addo et al. [6] conducted a timely analysis of 362,566 tweets surrounding the US troop withdrawal
from Afghanistan and consequent fall of the Afghan government, employing text mining
techniques such as sentiment analysis and word clouds. Findings indicate varied topics
and predominantly negative reactions, implying that social media platforms serve as
crucial sources for real-time assessment and potentially informing rapid response
strategies during crises.
Expounding further, Lee et al. [7] maintain that adapting national defense policies to changing domestic and international security conditions and foreign strategies, coupled with technological innovation in defense operations, is imperative. Applying topic modeling and LDA analysis, they distill fourteen themes in defense policy research and verify their connections via IDM analysis; the outcomes are intended to bolster understanding of defense technology spheres and guide institutional policymaking.
2.2. Latent Dirichlet Allocation
Topic modeling is a method used to discern the structure of topics within textual
data, identifying the topics each document covers within a document set. Various techniques
have been developed for topic analysis, including prominent examples such as latent Dirichlet allocation (LDA) [8], latent semantic analysis (LSA) [9], and non-negative matrix factorization (NMF) [10].
LDA, a prominent algorithm in topic modeling, is widely utilized across various domains.
It assumes documents to be mixtures of specific topics and models the process in which
topics generate words. LDA represents the probability of generating a given document
set with a stochastic modeling approach, tracing back the assumed process of document
generation. Utilizing the Dirichlet distribution, it defines document-topic distribution
and topic-word distribution, modeling the distribution of topics within each document
and the words generated by each topic. Key parameters of the LDA model include document-topic
distribution, topic-word distribution, and the number of topics. Researchers employing
LDA for topic modeling determine an appropriate number of topics, enabling the extraction
of topics from given documents and prediction of words generated by those topics.
LDA makes three primary assumptions. Firstly, it assumes that `documents can encompass
various topics, and the extent to which a document holds a particular topic can be
represented as a probability vector.' This is mathematically expressed as $P(t\mid
d)$, where `$t$' represents a topic and `$d$' signifies a document.
The second assumption of LDA is that 'topics can be represented by the ratio of words
used within that topic.' This can be expressed in the topic-specific word generation
probability distribution denoted as $P(w \mid t)$, where `$w$' represents a word and
`$t$' signifies a topic. The third assumption states that 'the likelihood of specific
words being used in a document can be expressed as the product of the probability
distributions from the first and second assumptions.' Given that LDA topic modeling
learns topic vectors represented by words, it's essential to consider the high dimensionality
associated with the number of words when analyzing topics.
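One way to state the third assumption explicitly, using the notation introduced above, is the standard LDA mixture identity (written here for clarity; it follows directly from the first two assumptions):
$$P(w \mid d) \;=\; \sum_{t=1}^{K} P(w \mid t)\, P(t \mid d),$$
which combines the document-topic distribution of the first assumption with the topic-word distribution of the second.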
In the LDA generative process depicted in Fig. 1, $\alpha$, $\beta$, and $K$ represent the initial parameter settings. `$D$' signifies the set of documents, `$K$' denotes the number of topics, and `$N$' stands for the number of words within a document. $\theta_d$ represents the Dirichlet-distributed topic proportions of document `$d$,' while $\varphi_k$ signifies the word distribution of topic `$k$.' $Z_{d,n}$ refers to the topic assigned to the $n$-th word position in document `$d$,' and $W_{d,n}$ represents the specific word at the $n$-th position in document `$d$'.
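To illustrate how these quantities map onto a concrete implementation, the following is a minimal sketch using the gensim library [18]; the toy corpus and parameter values are hypothetical placeholders, not the settings used in this study.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Hypothetical toy corpus: each document is a list of (already extracted) keywords.
docs = [
    ["defense", "innovation", "ai", "technology"],
    ["missile", "launch", "north", "korea"],
    ["vaccine", "covid", "soldier", "unit"],
]

dictionary = Dictionary(docs)                   # maps words to integer ids
corpus = [dictionary.doc2bow(d) for d in docs]  # bag-of-words representation

# num_topics, alpha, and eta correspond to K, alpha, and beta in the plate notation of Fig. 1.
lda = LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=2,    # K
    alpha="auto",    # document-topic prior (theta_d ~ Dirichlet(alpha))
    eta="auto",      # topic-word prior (phi_k ~ Dirichlet(beta))
    passes=10,
    random_state=0,
)

# theta_d: topic distribution of a document; phi_k: word distribution of a topic.
print(lda.get_document_topics(corpus[0]))
print(lda.show_topic(0, topn=4))
```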
2.3. BERT
Bidirectional Encoder Representations from Transformers (BERT) has become one of the
most popular pretrained language representation models in natural language processing
[11]. BERT leverages the Transformer architecture [12] and is pretrained on large unlabeled corpora using two objectives: masked language
modeling and next sentence prediction. This pretraining scheme allows BERT to learn
bidirectional representations that incorporate context from both directions.
After pretraining, BERT can be fine-tuned on downstream NLP tasks by adding task-specific
output layers. This transfer learning approach allows BERT to achieve state-of-the-art
results on a wide range of tasks including question answering, sentiment analysis,
natural language inference, and named entity recognition [13]. Key advantages of BERT include its bidirectional training, larger model size, and
fine-tuning based transfer learning.
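As a brief illustration of this fine-tuning pattern, the sketch below (using the Hugging Face transformers library; the checkpoint name and label count are illustrative assumptions) adds a task-specific classification head on top of a pretrained BERT encoder:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical example: a pretrained BERT checkpoint with a new 2-class output layer.
checkpoint = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# The randomly initialized classification head is trained on the downstream task,
# while the pretrained encoder weights are fine-tuned end to end.
inputs = tokenizer("Defense Innovation 4.0 emphasizes AI-based capability.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # (1, 2): one score per class
```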
2.4. Similarity Analysis
Similarity analysis is used to measure and compare similarities between textual data. It aids in understanding the
relevance between sentences, words, or specific words within sentences, and analyzing
how similar two documents are in terms of the topics they cover. Similarity analysis
involves quantifying and comparing textual data using various methods and techniques.
Cosine similarity is a widely used technique for estimating the semantic similarity
between two texts [14]. It measures the cosine of the angle between two vectors representing the texts in
a multi-dimensional space. Cosine similarity is often applied on tf-idf weighted bag-of-words
vectors. With the emergence of dense word embeddings like word2vec [15] and GloVe [16], cosine similarity between the mean embedding vectors of two texts has become a popular
semantic similarity technique.
More recently, contextual word embeddings from pretrained language models like BERT
[11] have shown impressive gains on similarity tasks [17]. BERT-based sentence embeddings can effectively capture semantic closeness via cosine
similarity. Our approach employs cosine similarity between BERT embeddings to assess
similarity. We find BERT captures nuanced contextual signals that are missed by traditional
methods like tf-idf. Fine-tuning BERT for our similarity objective yields further
improvements in performance.
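A minimal sketch of this idea is shown below, using the sentence-transformers library associated with [17]; the model name and example sentences are illustrative assumptions rather than the exact configuration used in this work.

```python
from sentence_transformers import SentenceTransformer, util

# Hypothetical multilingual sentence encoder; any BERT-family sentence encoder works here.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

sentence_a = "Defense innovation accelerates AI-based technology in the military."
sentence_b = "The ministry promotes artificial intelligence for future defense capability."

# Encode each sentence into a fixed-length embedding vector.
emb_a = model.encode(sentence_a, convert_to_tensor=True)
emb_b = model.encode(sentence_b, convert_to_tensor=True)

# Cosine similarity between the two embeddings (1.0 = identical direction).
score = util.cos_sim(emb_a, emb_b).item()
print(f"cosine similarity: {score:.3f}")
```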
In the similarity measurement step, as depicted in Fig. 2, the primary approach involves utilizing cosine similarity to compare the embeddings
of two sentences, quantifying the semantic similarity between sentences.
Fig. 2. BERT-based similarity analysis.
3. Proposed Method
3.1. Dataset Configuration
The dataset of defense-related news articles was collected from the BigKinds database of the Korea Press Foundation, covering May 1, 2021, to May 31, 2023 (roughly one year before and one year after the announcement of the national tasks on May 3, 2022). A total of 17,794 articles containing 'national defense,' 'army,' 'navy,' 'air force,' and 'marine' in their titles or main texts were gathered from 11 national daily newspapers and 5 broadcasting companies. Among these, 1,001 articles were excluded from analysis because they were duplicates or contained insignificant content such as personal notices, condolences, and photo reports. Therefore, 16,793 articles were analyzed, and a summary of the dataset composition is presented in Table 2.
Table 2. News article dataset composition.
| Category | Defense-related news article dataset |
| Collection Period | From May 1, 2021 to May 31, 2023 |
| Media Outlets | 11 national daily newspapers: Kyunghyang Shinmun (1,274), Kukmin Ilbo (1,124), Naeil Shinmun (190), Donga Ilbo (907), Munhwa Ilbo (737), Seoul Shinmun (1,612), Segye Ilbo (2,158), Chosun Ilbo (1,414), JoongAng Ilbo (1,694), Hankyoreh (939), Hankook Ilbo (1,082); 5 broadcasting stations: KBS (971), MBC (455), OBS (287), SBS (401), YTN (2,549) |
| Analysis Target | Analyzed articles (16,793); excluded from analysis (1,001) |
Table 3. Number of collected articles by keyword.
| Date | National defense | Army | Navy | Air Force | Marine |
| 2021.05 | 88 | 139 | 111 | 131 | 21 |
| 2021.06 | 254 | 300 | 192 | 1924 | 59 |
| 2021.07 | 118 | 188 | 143 | 572 | 43 |
| 2021.08 | 150 | 265 | 526 | 566 | 89 |
| 2021.09 | 117 | 138 | 159 | 185 | 29 |
| 2021.10 | 150 | 187 | 236 | 230 | 68 |
| 2021.11 | 70 | 180 | 86 | 174 | 41 |
| 2021.12 | 82 | 106 | 77 | 171 | 27 |
| 2022.01 | 124 | 106 | 124 | 242 | 31 |
| 2022.02 | 109 | 113 | 84 | 103 | 18 |
| 2022.03 | 259 | 157 | 234 | 239 | 93 |
| 2022.04 | 138 | 97 | 265 | 212 | 69 |
| 2022.05 | 125 | 139 | 119 | 161 | 37 |
| 2022.06 | 145 | 98 | 193 | 215 | 50 |
| 2022.07 | 83 | 121 | 261 | 150 | 64 |
| 2022.08 | 101 | 136 | 157 | 437 | 40 |
| 2022.09 | 88 | 128 | 131 | 216 | 279 |
| 2022.10 | 285 | 158 | 258 | 364 | 44 |
| 2022.11 | 119 | 117 | 131 | 295 | 37 |
| 2022.12 | 153 | 151 | 78 | 287 | 24 |
| 2023.01 | 139 | 142 | 85 | 171 | 32 |
| 2023.02 | 130 | 131 | 193 | 196 | 40 |
| 2023.03 | 152 | 111 | 179 | 173 | 60 |
| 2023.04 | 113 | 153 | 154 | 190 | 56 |
| 2023.05 | 116 | 160 | 124 | 172 | 57 |
In June 2021, there was a notably high number of news articles related to the keyword 'air force,' suggesting a concentrated focus on certain events during that period. To allow a substantive examination of defense-related issues, no deliberate effort was made to artificially balance the article counts. However, articles containing duplicate content, personal notices, condolences, or photo reports, which were unlikely to yield meaningful results, were excluded from the analysis.
3.2. LDA based Topic Modeling
For topic modeling, we employed the Latent Dirichlet Allocation (LDA) method to extract the core topics from the collected articles. The news articles related to national defense were collected from the BigKinds database provided by the Korea Press Foundation [4].
To begin, we extracted keywords from the articles as a means of eliminating noisy information, since the articles were obtained via a web-crawling algorithm and thus may contain redundant words. For keyword extraction, we employed the KPF-KeyBERT model, which has been pre-trained on articles from the Korea Press Foundation and exhibits high performance in Korean-language keyword extraction, as illustrated in Table 4 and Fig. 4.
Table 4. Comparison evaluation results of KPF-BERT and existing BERT models.
| Category | KLUE-NLI (Natural Language Inference) | KorQuAD (Machine Reading Comprehension) |
| Evaluation method | Accuracy | Accuracy regression rate |
| BERT base | 73.30% | 90.02% |
| KoBERT (SKT) | 79.53% | 71.36% |
| KoBERT (ETRI) | 80.56% | 82.00% |
| KPF BERT | 87.67% | 94.95% |
Fig. 3. Number of monthly collected articles by keyword.
Fig. 4. Keyword extraction using KeyBERT.
Fig. 5. Process of BERT-based similarity analysis.
The article is encoded into a fixed-length vector representation capturing semantic
meaning using a pretrained BERT model. Subsequently, keywords and n-grams are extracted
from the document. A bag-of-words (BOW) approach represents the text based only on
word frequencies ignoring word order. N-grams divide the text into continuous word
or character sequences of length N during document and sentence analysis. Next, cosine
similarity between vectors identifies the most similar keywords. Finally, the words
and phrases most representative of the full document are extracted.
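A minimal sketch of this keyword extraction step is given below using the open-source KeyBERT package with a generic multilingual encoder; the actual pipeline uses the KPF-KeyBERT model pre-trained on Korea Press Foundation articles, so the model name, example text, and parameters here are illustrative assumptions only.

```python
from keybert import KeyBERT
from sentence_transformers import SentenceTransformer

# Illustrative encoder; the study uses the KPF-KeyBERT model instead.
encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
kw_model = KeyBERT(model=encoder)

article = (
    "The Ministry of National Defense announced the Defense Innovation 4.0 basic plan, "
    "emphasizing AI-based technology, manned-unmanned teaming, and force restructuring."
)

# Extract candidate unigrams/bigrams, embed them, and rank them by cosine similarity
# to the document embedding.
keywords = kw_model.extract_keywords(
    article,
    keyphrase_ngram_range=(1, 2),  # n-grams of length 1-2
    stop_words="english",
    top_n=5,
)
print(keywords)  # list of (keyword, similarity score) pairs
```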
Given the extracted keywords, we performed LDA topic modeling on the keywords from articles published before and after the announcement of Defense Innovation 4.0. We set 10 topics as the optimal number to avoid excessive topic overlap. The inferred topics were visualized using the gensim library [18] and pyLDAvis [19] to present inter-topic distances and representative keywords in a coherent manner. To optimize LDA model performance, hyperparameters including the number of iterations and the number of topics were tuned. Appropriately determining the topic quantity is critical: comparing models with 10 and 11 topics revealed substantially higher topic overlap and closer topic proximity with 11 topics, which made interpreting the distinct characteristics of each topic difficult. Hence, 10 topics was deemed optimal. Fig. 6 shows visualization results on one year of news articles before (top) and after (bottom) the presentation of the national tasks, demonstrating the impact of selecting 10 versus 11 topics.
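A minimal sketch of this tuning and visualization step is shown below, assuming a gensim LdaModel and hypothetical `corpus` and `dictionary` objects prepared from the extracted keywords; the parameter values are placeholders, not our exact settings.

```python
import pyLDAvis
import pyLDAvis.gensim_models
from gensim.models import LdaModel

def fit_and_visualize(corpus, dictionary, num_topics, out_html):
    """Train an LDA model with a given topic count and save an inter-topic distance map."""
    lda = LdaModel(
        corpus=corpus,
        id2word=dictionary,
        num_topics=num_topics,   # compared with 10 vs. 11 topics in our experiments
        passes=20,
        random_state=0,
    )
    panel = pyLDAvis.gensim_models.prepare(lda, corpus, dictionary)
    pyLDAvis.save_html(panel, out_html)
    return lda

# Visual inspection of topic overlap in the saved maps guided the choice of 10 topics.
# lda10 = fit_and_visualize(corpus, dictionary, 10, "lda_10_topics.html")
# lda11 = fit_and_visualize(corpus, dictionary, 11, "lda_11_topics.html")
```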
Fig. 6. Visualization results according to the change in the number of topics ($\pmb{10
\rightarrow 11}$) before (top) and after (bottom) the announcement of national affairs.
3.3. BERT-based Similarity Analysis
The extracted topics are utilized for similarity analysis to examine the effects of
Defense Innovation 4.0 on the news article topics. We employed the RoBERTa [13] model to conduct the similarity analysis. RoBERTa is a model based on BERT but enhanced with a larger dataset, a larger model size, and a longer training duration. The process of conducting the similarity analysis experiments is illustrated in Fig. 5.
Given a query $q$, we use the fine-tuned RoBERTa model to encode the query and a set of candidate documents $D$. The encoding process can be represented as
$$E_q = \mathrm{RoBERTa}(q), \qquad E_d = \mathrm{RoBERTa}(d), \quad d \in D,$$
where $E_q$ and $E_d$ are the embeddings of the query and candidate documents, respectively. We compute the cosine similarity between the query embedding and each candidate document embedding to obtain a ranking score
$$\mathrm{score}(q, d) = \frac{E_q \cdot E_d}{|E_q|\,|E_d|},$$
where $|E_q|$ and $|E_d|$ are the magnitudes of the query and candidate document embeddings, respectively.
For each topic, we employed the pre-trained RoBERTa model to embed the set of 10 representative keywords per topic. Additionally, we computed embeddings for the three topics extracted from the Defense Innovation 4.0 press release. Subsequently, we computed cosine similarity by calculating vector inner products between the vectorized Defense Innovation 4.0 topics and the 10 defense issue topics extracted before and after the presentation of the national tasks. These scores were quantified and visualized as heatmaps for ease of analysis.
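A condensed sketch of this step is given below; it uses a generic multilingual sentence encoder and toy topic-keyword strings as stand-ins for the fine-tuned RoBERTa model and the actual extracted topics, so all names and values are illustrative assumptions.

```python
import matplotlib.pyplot as plt
from sentence_transformers import SentenceTransformer, util

# Stand-in encoder; the study uses a fine-tuned RoBERTa model.
encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

# Toy examples: each topic is a sentence built from its representative keywords.
innovation_topics = [
    "technology system innovation future industry weapon",
    "military AI development structure operation capability",
    "defense science power efficiency data concept",
]
news_topics = [
    "United States China Taiwan security government",
    "North Korea missile launch exercise provocation",
    "incident investigation sergeant victim prosecution",
]

# Encode both topic sets and compute the pairwise cosine-similarity matrix.
emb_innovation = encoder.encode(innovation_topics, convert_to_tensor=True)
emb_news = encoder.encode(news_topics, convert_to_tensor=True)
sim_matrix = util.cos_sim(emb_innovation, emb_news).cpu().numpy()

# Visualize the scores as a heatmap (rows: Defense Innovation 4.0 topics, cols: news topics).
plt.imshow(sim_matrix, cmap="viridis", vmin=0.0, vmax=1.0)
plt.colorbar(label="cosine similarity")
plt.xlabel("defense news topics")
plt.ylabel("Defense Innovation 4.0 topics")
plt.savefig("similarity_heatmap.png", dpi=150)
```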
4. Experimental Results
4.1. Topic Modeling Results
4.1.1 Before Defense Innovation 4.0
We conducted LDA on defense-related news articles published before the presentation of the national tasks (May 1, 2021, to May 2, 2022, comprising 8,895 articles). Table 5 lists the top 10 words best representing each topic. The comprehensive visualization
of the LDA topic modeling results is depicted in Fig. 7.
Fig. 7. Visualization of defense news topic modeling before national task announcement.
Table 5. Topic modeling results before Defense Innovation 4.0.
| Topic | Top 10 keywords |
| Topic 1 | Incident, Air Force, Sergeant, Investigation, Sexual Harassment, Victim, Harm, Ministry of National Defense, Perpetrator, Inquiry |
| Topic 2 | United States, China, Missile, North Korea, South Korea, Rocket, Government, Space, Palestine, Japan |
| Topic 3 | Vaccine, COVID, Vaccination, People, Start, Vehicle, Situation, Thought, Meals, Level |
| Topic 4 | Israel, United States, Hamas, Attack, Gaza Strip, President, United Kingdom, US Military, War, Airstrike |
| Topic 5 | President, Ministry of National Defense, Minister, Incident, Chief, That day, Citizens, Representative, National Assembly, Seo Wook |
| Topic 6 | Thoughts, Citizens, Anchor, Representative, President, Members of Parliament, Democratic Party, Politics, Words, Talk |
| Topic 7 | Women, Military, Men, Society, Education, Sergeant, Illegal, Service, Female Soldiers, Myanmar |
| Topic 8 | Air Force, Accident, Navy, Crash, Fighter Jet, Debris, Occurrence, Helicopter, Flight, Maritime |
| Topic 9 | Army, Unit, Soldier, Officer, Enlisted, Navy, Military Headquarters, Soldiers, Lieutenant Colonel, Mobile Phone |
| Topic 10 | Coast Guard, Crew, Fishing Boat, Missing, Captain, Search, Ulleungdo Island, Fishing Industry, Fishing Operation, Maritime |
Upon analyzing the visualization results, the topics can be broadly categorized into
five groups based on their distribution. The 1st group, represented by Topic 1 in
the first quadrant, is positioned distinctly apart from other topics, indicating lower
relevance compared to the rest. Additionally, with the largest circle size in the
visualization at 18.4%, it can be inferred that this topic holds the highest contribution
among defense-related issue articles before the announcement of national tasks. The
2nd group, situated in the fourth quadrant, includes Topics 2, 3, 4, 6, and 7. These
topics are closely adjacent, indicating a higher level of interrelatedness among them.
Each topic's contribution reveals Topic 2 at 17.8%, Topic 3 at 12.8%, Topic 4 at 11.2%,
Topic 6 at 10%, and Topic 7 at 10%. The 3rd group encompasses Topic 5, positioned
across the first and fourth quadrants. It stands somewhat distant from other topics
without significant overlaps, indicating its relatively lower relevance and independent
nature, with a contribution of 10.5%. The 4th group involves Topics 8 and 9, spanning
the second and third quadrants. These topics, adjacent to each other, demonstrate
higher relevance but are notably distinct from the remaining groups. Topic 8 contributes
5.5%, while Topic 9 contributes 5.4%. Lastly, the 5th group comprises Topic 10 located
in the third quadrant, considerably distant from other groups, indicating its independent
nature. With a contribution of 0.4%, it shows minimal relevance to defense issues
before the announcement of national tasks.
4.1.2 After Defense Innovation 4.0
The results of topic modeling for defense-related news articles after the announcement
of national tasks (from May 3, 2022, to May 31, 2023, comprising 7,898 articles) are
presented in Table 6. The visualization of these results is depicted in Fig. 8.
Fig. 8. Visualization of defense news topic modeling after announcement of national
affairs.
Table 6. Topic modeling results after Defense Innovation 4.0.
| Topic | Top 10 keywords |
| Topic 1 | United States, China, South Korea, Taiwan, Japan, world, security, military, government, US military |
| Topic 2 | Government, Seoul, nation, society, center, candidates, women, COVID-19, support, target |
| Topic 3 | President, citizens, government, lawmakers, conversation, anchor, thoughts, remarks, Yoon Suk-yeol, representative |
| Topic 4 | Ukraine, Russia, Russian army, war, attack, region, Ukrainian army, president, missile, Putin |
| Topic 5 | person, murder, name, South Korea, oneself, think, photo, navy, family, country |
| Topic 6 | North Korea, missile, launch, exercise, provocation, ROK-US, ballistic, response, that day, navy |
| Topic 7 | Incident, investigation, sergeant, charges, Air Force, victim, Ministry of National Defense, inquiry, death, prosecution |
| Topic 8 | That day, area, damage, navy, residents, rescue, nearby, vehicle, sea, deployment |
| Topic 9 | Special investigation, police, accident, suspicion, discovery, confirmation, investigation, occurrence, Marine Corps, army |
| Topic 10 | Air wing, junior grade, Osan, transportation, human rights committee, memo, labor organization, public relations, illegal activities, informant |
The visualization analysis reveals a division of topics into 6 main groups based on
their distribution. The 1st group, represented by Topic 1 located in the 4th quadrant,
stands distinctly independent without overlapping with other groups, indicating relatively
lower relevance with the rest of the topics. With a contribution of 18.2%, Topic 1
can be considered as the primary representative of defense issues post the national
task announcement. Group 2 comprises Topics 2, 3, and 5 positioned in the 1st and
4th quadrants, closely adjacent to each other and distanced from other groups. Topic
2 contributes 15.9%, Topic 3 contributes 15%, and Topic 5 contributes 10.3%. Group
3 consists of Topic 4 positioned in the 4th quadrant, presenting an independent topic
distant from others with a contribution of 12.4%. Group 4 involves Topics 6 and 8
located in the 3rd quadrant, indicating a notably high association between these adjacent
topics. Topic 6 contributes 9.4%, while Topic 8 contributes 6.2%. Group 5, represented
by Topic 7 in the 2nd quadrant, stands apart from other topics, indicating an independent
aspect and contributing 7.3%. Lastly, Group 6 encompasses Topics 9 and 10 in the 2nd
and 3rd quadrants, respectively. These topics show considerable distance from other
groups. While Topic 9 contributes 5.1%, Topic 10 displays a low contribution of 0.2%,
indicating relatively low relevance to defense issues after the national task announcement.
Comparing the visualization analysis before and after the national task announcement,
focusing on the highest-contributing Topic 1, there's a notable shift. Previously,
Topic 1 was positioned in the 1st quadrant, maintaining a considerable distance from
other topics, indicating its independent nature. However, post the national task announcement,
Topic 1 now shares the 4th quadrant with Topics 2 and 4, and notably closer proximity
to the other four topics (Topics 2, 3, 4, and 5).
4.1.3 Topic modeling results on Defense Innovation 4.0
The topic modeling results of the Defense Innovation 4.0 Basic Plan press release
are shown in Table 7, and the corresponding visualization is depicted in Fig. 9.
Table 7. Defense Innovation 4.0 Basic Plan press release topic modeling results.
| Topic | Top 10 keywords |
| Topic 1 | Technology, System, Innovation, Field, Future, System, Industry, Weapon, Battle, Missile |
| Topic 2 | Military, AI, Development, Structure, Ministry of Defense, Organization, Operation, Advanced, Capability, Training |
| Topic 3 | Defense, Science, Base, Power, Stage, Field, Center, Efficiency, Data, Concept |
Fig. 9. Similarity analysis process.
According to the visualization analysis, Topic 1, Topic 2, and Topic 3 are positioned
far apart from each other, indicating their low interrelatedness and independence
as topics. Moreover, in terms of contribution, Topic 1 holds 42.5%, Topic 2 comprises
30.7%, and Topic 3 accounts for 26.8%. All three topics are considered representative
of the Defense Innovation 4.0.
4.2. Similarity Analysis Results
Using the RoBERTa model, we conducted a cosine similarity analysis between topics, visualizing the results with heatmaps to facilitate comprehension. In Figs. 10 and 11, the vertical axis represents the topics derived from the Defense Innovation 4.0 Basic Plan press release (labeled as Topics 1, 2, and 3), while the horizontal axis presents the 10 comparative topics. These comparative topics were generated from sentences containing 10 representative keywords each. The sentences were embedded using the RoBERTa model to transform them into multi-dimensional vectors. We then calculated the cosine similarity by computing the inner product of these vectorized sentences and visualized the scores with heatmaps in Figs. 10 and 11.
Fig. 10. Similarity matrix between topics before Defense Innovation 4.0 announcement.
Fig. 11. Similarity matrix between topics after Defense Innovation 4.0 announcement.
Fig. 10 depicts the cosine similarity comparison between Defense Innovation 4.0 topics and
those from the pre-announcement period, showing that only two topics have a similarity
score of 0.4 or above. Conversely, Fig. 11 illustrates the post-policy announcement similarity analysis, where seven topics
exhibit a significant similarity of 0.4 or higher. This shift suggests an increase
in defense-related topics aligning more closely with the Defense Innovation 4.0 initiative
after the policy announcement, indicating a greater emphasis on these areas in subsequent
discussions.
Furthermore, focusing on the most influential topic, Topic 1, in both pre and post-policy
announcement periods reveals insightful trends. In Fig. 10, the first column shows the pre-announcement similarities for Defense Innovation
4.0 Topic 1 as 0.19, Topic 2 as 0.34, and Topic 3 as 0.26, reflecting lower similarities.
However, post-announcement data in Fig. 11's first column indicates higher similarities: 0.36 for Topic 1, 0.47 for Topic 2,
and 0.34 for Topic 3. This significant increase in similarity scores post-policy announcement
suggests an amplified focus on Defense Innovation 4.0 themes, particularly in the
most influential topic, Topic 1.
These findings imply that the Defense Innovation 4.0 policy announcement has heightened
the focus on defense innovation and technology within media coverage of defense issues.
Private media outlets appear to increasingly align their discussions with the government's
defense reform agenda, underscoring the growing recognition of defense innovation
and technology as critical components in addressing modern defense challenges. This
alignment highlights the media's role in emphasizing the importance of the Defense
Innovation 4.0 initiative and its relevance to contemporary defense discourse.
4.3. Performance Comparison
To further validate our approach, we conducted a performance comparison between our RoBERTa-based method and open-source language models such as GPT-2 [22] and Longformer [23]. Figs. 12 and 13 show the similarity matrices before and after the announcement. Both GPT-2 and Longformer exhibit high similarity scores across all instances, suggesting a reduced sensitivity to contextual variations. This indicates that RoBERTa may be better suited for tasks requiring a nuanced understanding of textual changes over time.
GPT-2 and Longformer struggle with similarity detection due to their designs. GPT-2, trained for next-token prediction, does not capture full-text relationships well, resulting in high, uniform similarity scores that miss contextual changes. Longformer, optimized for long documents with a mixed attention mechanism, also shows consistently high similarity scores, indicating that it cannot pinpoint subtle shifts in shorter texts.
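For reference, a rough sketch of how such models can be pressed into the same comparison is shown below, using mean pooling over hidden states from the Hugging Face transformers library; the checkpoint names and pooling choice are illustrative assumptions rather than our exact setup.

```python
import torch
from transformers import AutoTokenizer, AutoModel

def mean_pooled_embedding(checkpoint, text):
    """Embed a text as the mean of the model's last hidden states (a simple, generic pooling)."""
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModel.from_pretrained(checkpoint)
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)            # (dim,)

text_a = "Defense innovation emphasizes AI-based technology."
text_b = "North Korea launched another ballistic missile."

# Compare how different backbones score the same sentence pair.
for ckpt in ["gpt2", "allenai/longformer-base-4096"]:
    emb_a = mean_pooled_embedding(ckpt, text_a)
    emb_b = mean_pooled_embedding(ckpt, text_b)
    score = torch.nn.functional.cosine_similarity(emb_a, emb_b, dim=0).item()
    print(f"{ckpt}: cosine similarity = {score:.3f}")
```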
Fig. 12. Similarity matrix before (left) and after (right) using GPT-2.
Fig. 13. Similarity matrix before (left) and after (right) using Longformer.
5. Conclusion and Future Work
In this paper, we analyzed the government's current defense reform plan, Defense Innovation 4.0, through advanced language modeling techniques. We performed this research by gathering and analyzing defense-related news articles published before and after the official announcement of this initiative. Employing BERT-based models for keyword extraction and LDA for topic modeling, we were able to identify and examine the defense topics that attracted attention during these periods. Additionally, we utilized RoBERTa to perform a similarity analysis on the main topics extracted from the Defense Innovation 4.0 announcement document, which enabled us to trace shifts in focus on defense-related issues before and after the policy was unveiled. Our findings indicate a notably closer alignment between the core topics of Defense Innovation 4.0 and defense news topics post-announcement, especially for the most significant topic, Topic 1, from our LDA results.
However, this work has limitations, chiefly the temporal scope of our data collection, which was confined to one year before and after the policy's announcement, and its focus on a selected group of media outlets. To build upon our findings, future
research should extend the analysis to cover the entire duration of the previous administration
led by President Moon Jae-in, as well as to examine defense-related publications throughout
the current government's term. This broader temporal analysis could facilitate a more
nuanced understanding of the shifts in defense disclosure. Furthermore, a detailed
examination of the specific tasks within Defense Innovation 4.0 and their respective
levels of media interest could enrich our understanding of the initiative's impact.
Expanding the scope of future investigations in these directions promises to offer
deeper and more comprehensive insights into the dynamics of defense innovation disclosure.
ACKNOWLEDGMENTS
This work was supported partly by the Institute of Information & Communications Technology
Planning & Evaluation (IITP) grant funded by Korea government (MSIT) [2021-0-01341,
Artificial Intelligent Graduate School Program(Chung-Ang University)], and partly
by the Institute of Information & Communications Technology Planning & Evaluation
(IITP) grant funded by the Korea government(MSIT) (No.2022-0-00601, Military AI Development
and Management Program).
REFERENCES
Republic of Korea Policy Briefing, ``110 major national agenda policies of the Yoon
Suk-yeol administration,'' registered on May 2, 2022.

Ministry of National Defense, ``Accelerating the establishment of `Defense Innovation
4.0' basic plan,'' registered on August 10, 2022.

Ministry of National Defense, ``Defense Innovation 4.0 basic plan press release,''
registered on March 3, 2023.

Korea Press Foundation, BigKinds, searched on November 2, 2023.

S. Wu, J. Liu, and L. Liu, ``Modeling method of internet public information data mining
based on probabilistic topic model,'' The Journal of Supercomputing, vol. 75, pp.
5882-5897, 2019.

A. P. C. Addo, S. K. Akpatsa, M. Dorgbefu Jr, J. C. Dagadu, V. D. Tattrah, N. K. Dzoagbe,
D. D. Fiawoo, and J. Nartey, ``Topic Modeling and Sentiment Analysis of US’ Afghan
Exit Twitter Data: A Text Mining Approach,'' International Journal of Information
and Management Sciences, vol. 34, no. 1, pp. 51-64, 2023.

J. Lee, ``Trend analysis in defense policy studies using topic modeling,'' Journal
of The Korean Institute of Defense Technology, vol. 1, no. 2, pp. 1-4, 2019.

D. M. Blei, A. Y. Ng, and M. I. Jordan, ``Latent Dirichlet allocation,'' Journal of
Machine Learning Research, vol. 3, pp. 993-1022, January 2003.

S. T. Dumais, ``Latent semantic analysis,'' Annual Review of Information Science and
Technology (ARIST), vol. 38, pp. 189-230, 2004.

D. Lee and H. S. Seung, ``Algorithms for non-negative matrix factorization,'' Advances
in Neural Information Processing Systems, vol. 13, pp. 535-541, 2000.

J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, ``BERT: Pre-training of deep bidirectional
transformers for language understanding,'' arXiv preprint arXiv:1810.04805, 2018.

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser,
and I. Polosukhin, ``Attention is all you need,'' Advances in Neural Information Processing
Systems, vol. 30, pp. 5998-6008, 2017.

Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer,
and V. Stoyanov, ``RoBERTa: A robustly optimized BERT pretraining approach,'' arXiv
preprint arXiv:1907.11692, 2019.

A. Singhal, ``Modern information retrieval: A brief overview,'' IEEE Data Engineering
Bulletin, vol. 24, no. 4, pp. 35-43, 2001.

T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, ``Distributed representations
of words and phrases and their compositionality,'' Advances in Neural Information
Processing Systems, vol. 26, pp. 3111-3119, 2013.

J. Pennington, R. Socher, and C. D. Manning, ``GloVe: Global vectors for word representation,''
Proc. of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP),
pp. 1532-1543, 2014.

N. Reimers and I. Gurevych, ``Sentence-BERT: Sentence embeddings using siamese BERT-networks,''
arXiv preprint arXiv:1908.10084, 2019.

R. Rehurek and P. Sojka, ``Gensim - Python framework for vector space modelling,''
NLP Centre, Faculty of Informatics, Masaryk University, Brno, Czech Republic, vol.
3, no. 2, 2011.

C. Sievert and K. Shirley, ``LDAvis: A method for visualizing and interpreting topics,''
Proceedings of the Workshop on Interactive Language Learning, Visualization, and Interfaces,
pp. 63-70, 2014.

G.-W. Jeon, I. Kang, and J.-h. Jeon, ``Systematic analysis on the trend of defense
technologies using topic modeling: A case of an armoured fighting vehicle,'' Industrial
Innovation Research, vol. 36, no. 1, pp. 69-94, 2020.

M.-J. Kwon, ``Identifying Seoul city issues based on topic modeling of news article,''
Proc. of the Korean Society of Broadcasting and Media Engineering Conference, pp.
11-13, 2019.

J. Achiam, S. Adler, S. Agarwal, L. Ahmad, A. Akkaya, M. Aleman, et al., ``GPT-4 technical
report,'' arXiv preprint arXiv:2303.08774, 2023.

I. Beltagy, M. E. Peters, and A. Cohan, ``Longformer: The long-document transformer,''
arXiv preprint arXiv:2004.05150, 2020.

Author
Doohong Park received his B.S. degree in computer engineering from Soongsil University
in 2010. In 2023, he was chosen for specialized military AI training by the Ministry
of National Defense and successfully completed the AI Core Talent Development Policy
Course at Chung-Ang University’s National Defense AI School. Also, he received an
M.S. degree in defense information management from the National Defense University,
South Korea, in 2024. Commissioned as an officer in the South Korean Army in 2010,
he has been serving in the army. From 2021 to 2022, he worked in the Army’s AI policy
department, and since 2024, he has been employed at the Defense AI Center of the Agency
for Defense Development under the Ministry of National Defense. His research interests
include classification and prediction through big data analysis, natural language
processing, and, particularly, object detection and recognition in image processing.
Donggoo Kang received his M.S. degree in AI imaging at Chung-Ang University, South
Korea, in 2020. Currently, he is pursuing a Ph.D. degree in AI imaging at Chung-Ang
University. His research interests include computational photography and human-object
interaction discovery.
Joonki Paik completed his B.S. degree in control and instrumentation engineering from
Seoul National University in 1984. He continued his education in the United States,
earning his M.S. and Ph.D. degrees in electrical engineering and computer science
from Northwestern University in 1987 and 1990, respectively. Dr. Paik began his career
at Samsung Electronics from 1990 to 1993, where he played a key role in designing
image stabilization chipsets for consumer camcorders. In 1993, he joined the faculty
at Chung-Ang University in Seoul, Korea. He is currently a professor with the Graduate
School of Advanced Imaging Science, Multimedia, and Film at the university. From 1999
to 2002, he served as a visiting professor in the Department of Electrical and Computer
Engineering at the University of Tennessee, Knoxville. Since 2005, Dr. Paik has been
the director of a national research laboratory in Korea specializing in image processing
and intelligent systems. He held the position of Dean for the Graduate School of Advanced
Imaging Science, Multimedia, and Film from 2005 to 2007 and concurrently served as
the director of the Seoul Future Contents Convergence Cluster. In 2008, Dr. Paik took
on the role of a full-time technical consultant for the Systems LSI Division of Samsung
Electronics. Here, he developed various computational photographic techniques, including
an extended depth of field system. Dr. Paik has had a notable influence in scientific
and governmental circles in Korea. He is a member of the Presidential Advisory Board
for Scientific/Technical Policy with the Korean Government and serves as a technical
consultant for computational forensics with the Korean Supreme Prosecutor’s Office.
His accolades include being a two-time recipient of the Chester-Sall Award from the
IEEE Consumer Electronics Society. He has also received the Academic Award from the
Institute of Electronic Engineers of Korea and the Best Research Professor Award from
Chung-Ang University. He has actively participated in various professional societies.
He served the Consumer Electronics Society of the IEEE in several capacities, including
as a member of the Editorial Board, Vice President of International Affairs, and Director
of Sister and Related Societies Committee. In 2018, he was appointed as the president
of the Institute of Electronics and Information Engineers. Since 2020, Dr. Paik has
held the position of Vice President of Academic Affairs at Chung-Ang University. In
an exceptional move in 2021, he simultaneously assumed the roles of Vice President
of Research and Dean of the Artificial Intelligence Graduate School at Chung-Ang University
for a one-year term. Expanding his scope of responsibilities in 2022, Dr. Paik accepted
a five-year appointment as Project Manager for the Military AI Education Program under
Korea’s Department of Defense. With a career spanning over three decades, Dr. Joonki
Paik has made significant contributions to the fields of image processing, intelligent
systems, and higher education.