Doohong Park 1,2
Donggoo Kang 3
Joonki Paik 3,4
1 (ROK Army, South Korea, opdho@naver.com)
2 (National Defense AI School, Chung-Ang University, Seoul, South Korea)
3 (Department of Image, Chung-Ang University, Seoul, South Korea, dgkang@ipis.cau.ac.kr)
4 (Department of Artificial Intelligence, Chung-Ang University, Seoul, South Korea, paikj@cau.ac.kr)
Copyright © The Institute of Electronics and Information Engineers (IEIE)
Keywords
Defense issue analysis, Topic modeling, LDA, BERT
1. Introduction
Defense Innovation 4.0 is the current government's defense reform plan. It was announced
as one of the 110 national agenda tasks on May 3, 2022, by the 20th Presidential Transition Committee [1]. The plan aims to redesign the overall defense posture at a level comparable to founding a new military and to build an AI-based science and technology force within the defense sector. A further key focus is addressing the shortage of military human resources and minimizing casualties. Since the announcement of the national tasks, the Ministry of National Defense
has embarked on the comprehensive pursuit of Defense Innovation 4.0, as evidenced
in Table 1. From the establishment of the Defense Innovation 4.0 Task Force [2] to the unveiling of the basic plan [3], the ministry has invested substantial effort. However, despite these endeavors,
an analysis regarding the extent to which Defense Innovation 4.0 has become an issue
in the general media or among citizens, and the impact it has exerted, has not yet
been conducted.
In this paper, we aim to analyze the level of public interest and awareness regarding Defense Innovation 4.0, the current government's defense reform initiative. We additionally seek to investigate objectively whether private media address issues related to Defense Innovation 4.0 within the sphere of national defense. To achieve
this, articles concerning defense matters from private media sources were collected
using the Korea Press Foundation's BigKinds platform [4].
Table 1. Defense Innovation 4.0 progress.
| Progress Overview | Key Content | Remarks (Lead) |
| 2022.07.01 Establishment of Defense Innovation 4.0 Task Force | Conception of the basic concept of "Defense Innovation 4.0" | Deputy Minister of Defense |
| 2022.07.14 1st Meeting of Defense Innovation 4.0 Task Force | Issuance of the Ministry of Defense guidelines for "Defense Innovation 4.0" | Minister of Defense |
| 2022.08.10 2nd Meeting of Defense Innovation 4.0 Task Force | Review of tasks in the basic plan of "Defense Innovation 4.0" | Minister of Defense |
| 2022.08.12 1st Seminar on Defense Innovation 4.0 | Gathering opinions of internal and external military experts on the direction of "Defense Innovation 4.0" | Deputy Minister of Defense |
| 2022.09.27 2nd Seminar on Defense Innovation 4.0 | Presenting major issues and alternatives for "Defense Innovation 4.0" | Minister of Defense |
| 2022.10.26 3rd Seminar on Defense Innovation 4.0 | Proposal of development strategies for defense R&D systems and augmentation processes, the foundation of "Defense Innovation 4.0" | Chairman of the National Defense Committee & Minister of Defense |
| 2022.11.16 4th Seminar on Defense Innovation 4.0 | Forming a consensus among key military officials regarding "Defense Innovation 4.0" | Minister of Defense |
| 2023.03.03 Announcement of Defense Innovation 4.0 Basic Plan | Presidential endorsement of the "Defense Innovation 4.0 Basic Plan" | Ministry of Defense |
For effective analysis, we employ the Latent Dirichlet Allocation (LDA) method to
identify the top N significant topics from these articles. The LDA method filters
out unnecessary noise and models word associations with specific topics. Through LDA
topic modeling, we identify representative topics in three categories: 1) pre-Defense
Innovation 4.0 articles, 2) post-Defense Innovation 4.0 articles, and 3) articles
specifically related to Defense Innovation 4.0. To assess the impact of Defense Innovation
4.0, we compute sentence similarity among topics using the bidirectional encoder representations
from transformers (BERT) model. This approach allows us to analyze the evolution of
defense issues from the official announcement date of Defense Innovation 4.0.
Moreover, this research underscores the importance of leveraging advanced machine-learning
techniques in policy analysis. By employing methods like LDA and BERT, we can provide
a nuanced understanding of how major policy initiatives are reflected in public discourse.
This approach not only aids in measuring the effectiveness of communication strategies
but also highlights the evolving nature of public opinion and media focus. Such insights
are invaluable for policymakers to adapt and respond proactively to public sentiment,
ensuring that the objectives of initiatives like Defense Innovation 4.0 are met with
informed support and constructive feedback.
Our objectives are threefold:
• Analyze public and media awareness of South Korea's Defense Innovation 4.0 reform
plan using LDA topic modeling on media articles before and after its announcement.
• Introduce text mining methods, specifically LDA and BERT-based cosine similarity,
to quantify changes in the prominence and significance of defense innovation themes.
• Compare the most prominent topics associated with Defense Innovation 4.0 to those
in general defense-related discussions, revealing the policy shift's potential impacts
and implications on future discourses and priorities.
2. Related Work
2.1. Topic Modeling and Issue Analysis
In recent years, there has been a growing interest in applying topic modeling techniques
to analyze large-scale text data in various domains, including the military domain.
This section provides an overview of the existing literature on this topic and highlights
some key contributions that have influenced our research.
One of the earliest works on topic modeling for analyzing military documents is by
Wu et al. [5], who argue that the surge in network data presents new opportunities for military
intelligence acquisition but stretches traditional intelligence analysis methods.
They propose enhancing military intelligence analysis by integrating data mining technology
and creating a network military intelligence analysis model relying on data mining.
Additionally, Addo et al. [6] conducted a timely analysis of 362,566 tweets surrounding the US troop withdrawal
from Afghanistan and consequent fall of the Afghan government, employing text mining
techniques such as sentiment analysis and word clouds. Findings indicate varied topics
and predominantly negative reactions, implying that social media platforms serve as
crucial sources for real-time assessment and potentially informing rapid response
strategies during crises.
Expounding further, Lee et al. [7] maintain that adapting national defense policies to changing domestic and international security conditions and foreign strategies, coupled with technological innovation in defense operations, is imperative. Applying topic modeling and LDA analysis, they distill fourteen themes in defense policy research and verify their connections via IDM analysis; the outcomes are intended to bolster understanding of defense technology spheres and guide institutional policymaking.
2.2. Latent Dirichlet Allocation
Topic modeling is a method used to discern the structure of topics within textual
data, identifying the topics each document covers within a document set. Various techniques
have been developed for topic analysis, including prominent examples such as latent Dirichlet allocation (LDA) [8], latent semantic analysis (LSA) [9], and non-negative matrix factorization (NMF) [10].
LDA, a prominent algorithm in topic modeling, is widely utilized across various domains.
It assumes documents to be mixtures of specific topics and models the process in which
topics generate words. LDA represents the probability of generating a given document
set with a stochastic modeling approach, tracing back the assumed process of document
generation. Utilizing the Dirichlet distribution, it defines document-topic distribution
and topic-word distribution, modeling the distribution of topics within each document
and the words generated by each topic. Key parameters of the LDA model include document-topic
distribution, topic-word distribution, and the number of topics. Researchers employing
LDA for topic modeling determine an appropriate number of topics, enabling the extraction
of topics from given documents and prediction of words generated by those topics.
LDA makes three primary assumptions. Firstly, it assumes that `documents can encompass
various topics, and the extent to which a document holds a particular topic can be
represented as a probability vector.' This is mathematically expressed as $P(t\mid
d)$, where `$t$' represents a topic and `$d$' signifies a document.
The second assumption of LDA is that 'topics can be represented by the ratio of words
used within that topic.' This can be expressed in the topic-specific word generation
probability distribution denoted as $P(w \mid t)$, where `$w$' represents a word and
`$t$' signifies a topic. The third assumption states that 'the likelihood of specific
words being used in a document can be expressed as the product of the probability
distributions from the first and second assumptions.' Given that LDA topic modeling
learns topic vectors represented by words, it's essential to consider the high dimensionality
associated with the number of words when analyzing topics.
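One way to state the third assumption explicitly, using the notation introduced above, is the standard LDA mixture identity (written here for clarity; it follows directly from the first two assumptions):
$$P(w \mid d) \;=\; \sum_{t=1}^{K} P(w \mid t)\, P(t \mid d),$$
which combines the document-topic distribution of the first assumption with the topic-word distribution of the second.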
In the LDA generative process depicted in Fig. 1, $\alpha$, $\beta$, and $K$ represent the initial parameter settings. `$D$' signifies the set of documents, `$K$' denotes the number of topics, and `$N$' stands for the number of words within a document. $\theta_d$ represents the Dirichlet-distributed topic proportions of document `$d$,' while $\varphi_k$ signifies the word distribution of topic `$k$.' $Z_{d,n}$ refers to the topic assigned to the $n$-th word position in document `$d$,' and $W_{d,n}$ represents the specific word at the $n$-th position in document `$d$'.
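To illustrate how these quantities map onto a concrete implementation, the following is a minimal sketch using the gensim library [18]; the toy corpus and parameter values are hypothetical placeholders, not the settings used in this study.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Hypothetical toy corpus: each document is a list of (already extracted) keywords.
docs = [
    ["defense", "innovation", "ai", "technology"],
    ["missile", "launch", "north", "korea"],
    ["vaccine", "covid", "soldier", "unit"],
]

dictionary = Dictionary(docs)                   # maps words to integer ids
corpus = [dictionary.doc2bow(d) for d in docs]  # bag-of-words representation

# num_topics, alpha, and eta correspond to K, alpha, and beta in the plate notation of Fig. 1.
lda = LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=2,    # K
    alpha="auto",    # document-topic prior (theta_d ~ Dirichlet(alpha))
    eta="auto",      # topic-word prior (phi_k ~ Dirichlet(beta))
    passes=10,
    random_state=0,
)

# theta_d: topic distribution of a document; phi_k: word distribution of a topic.
print(lda.get_document_topics(corpus[0]))
print(lda.show_topic(0, topn=4))
```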
2.3. BERT
Bidirectional Encoder Representations from Transformers (BERT) has become one of the
most popular pretrained language representation models in natural language processing
[11]. BERT leverages the Transformer architecture [12] and is pretrained on large unlabeled corpora using two objectives: masked language
modeling and next sentence prediction. This pretraining scheme allows BERT to learn
bidirectional representations that incorporate context from both directions.
After pretraining, BERT can be fine-tuned on downstream NLP tasks by adding task-specific
output layers. This transfer learning approach allows BERT to achieve state-of-the-art
results on a wide range of tasks including question answering, sentiment analysis,
natural language inference, and named entity recognition [13]. Key advantages of BERT include its bidirectional training, larger model size, and
fine-tuning based transfer learning.
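As a brief illustration of this fine-tuning pattern, the sketch below (using the Hugging Face transformers library; the checkpoint name and label count are illustrative assumptions) adds a task-specific classification head on top of a pretrained BERT encoder:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical example: a pretrained BERT checkpoint with a new 2-class output layer.
checkpoint = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# The randomly initialized classification head is trained on the downstream task,
# while the pretrained encoder weights are fine-tuned end to end.
inputs = tokenizer("Defense Innovation 4.0 emphasizes AI-based capability.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # (1, 2): one score per class
```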
2.4. Similarity Analysis
Similarity analysis is used to measure and compare similarities between textual data. It aids in understanding the
relevance between sentences, words, or specific words within sentences, and analyzing
how similar two documents are in terms of the topics they cover. Similarity analysis
involves quantifying and comparing textual data using various methods and techniques.
Cosine similarity is a widely used technique for estimating the semantic similarity
between two texts [14]. It measures the cosine of the angle between two vectors representing the texts in
a multi-dimensional space. Cosine similarity is often applied on tf-idf weighted bag-of-words
vectors. With the emergence of dense word embeddings like word2vec [15] and GloVe [16], cosine similarity between the mean embedding vectors of two texts has become a popular
semantic similarity technique.
More recently, contextual word embeddings from pretrained language models like BERT
[11] have shown impressive gains on similarity tasks [17]. BERT-based sentence embeddings can effectively capture semantic closeness via cosine
similarity. Our approach employs cosine similarity between BERT embeddings to assess
similarity. We find BERT captures nuanced contextual signals that are missed by traditional
methods like tf-idf. Fine-tuning BERT for our similarity objective yields further
improvements in performance.
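A minimal sketch of this idea is shown below, using the sentence-transformers library associated with [17]; the model name and example sentences are illustrative assumptions rather than the exact configuration used in this work.

```python
from sentence_transformers import SentenceTransformer, util

# Hypothetical multilingual sentence encoder; any BERT-family sentence encoder works here.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

sentence_a = "Defense innovation accelerates AI-based technology in the military."
sentence_b = "The ministry promotes artificial intelligence for future defense capability."

# Encode each sentence into a fixed-length embedding vector.
emb_a = model.encode(sentence_a, convert_to_tensor=True)
emb_b = model.encode(sentence_b, convert_to_tensor=True)

# Cosine similarity between the two embeddings (1.0 = identical direction).
score = util.cos_sim(emb_a, emb_b).item()
print(f"cosine similarity: {score:.3f}")
```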
In the similarity measurement step, as depicted in Fig. 2, the primary approach involves utilizing cosine similarity to compare the embeddings
of two sentences, quantifying the semantic similarity between sentences.
Fig. 2. BERT-based similarity analysis.
3. Proposed Method
3.1. Dataset Configuration
The dataset of defense-related news articles was collected from the BigKinds database of the Korea Press Foundation, covering May 1, 2021, to May 31, 2023 (roughly one year before and one year after the announcement of the national tasks on May 3, 2022). A total of 17,794 articles containing 'national defense,' 'army,' 'navy,' 'air force,' and 'marine' in their titles or main texts were gathered from 11 national daily newspapers and 5 broadcasting companies. Among these, 1,001 articles were excluded from analysis because they were duplicates or contained insignificant content such as personal notices, condolences, and photo reports. Therefore, 16,793 articles were analyzed, and a summary of the dataset composition is presented in Table 2.
Table 2. News article dataset composition.
| Category | Defense-related news article dataset |
| Collection Period | From May 1, 2021 to May 31, 2023 |
| Media Outlets | 11 national daily newspapers: Kyunghyang Shinmun (1,274), Kukmin Ilbo (1,124), Naeil Shinmun (190), Donga Ilbo (907), Munhwa Ilbo (737), Seoul Shinmun (1,612), Segye Ilbo (2,158), Chosun Ilbo (1,414), JoongAng Ilbo (1,694), Hankyoreh (939), Hankook Ilbo (1,082); 5 broadcasting stations: KBS (971), MBC (455), OBS (287), SBS (401), YTN (2,549) |
| Analysis Target | Analyzed articles (16,793); excluded from analysis (1,001) |
Table 3. Number of collected articles by keyword.
| Date | National defense | Army | Navy | Air Force | Marine |
| 2021.05 | 88 | 139 | 111 | 131 | 21 |
| 2021.06 | 254 | 300 | 192 | 1924 | 59 |
| 2021.07 | 118 | 188 | 143 | 572 | 43 |
| 2021.08 | 150 | 265 | 526 | 566 | 89 |
| 2021.09 | 117 | 138 | 159 | 185 | 29 |
| 2021.10 | 150 | 187 | 236 | 230 | 68 |
| 2021.11 | 70 | 180 | 86 | 174 | 41 |
| 2021.12 | 82 | 106 | 77 | 171 | 27 |
| 2022.01 | 124 | 106 | 124 | 242 | 31 |
| 2022.02 | 109 | 113 | 84 | 103 | 18 |
| 2022.03 | 259 | 157 | 234 | 239 | 93 |
| 2022.04 | 138 | 97 | 265 | 212 | 69 |
| 2022.05 | 125 | 139 | 119 | 161 | 37 |
| 2022.06 | 145 | 98 | 193 | 215 | 50 |
| 2022.07 | 83 | 121 | 261 | 150 | 64 |
| 2022.08 | 101 | 136 | 157 | 437 | 40 |
| 2022.09 | 88 | 128 | 131 | 216 | 279 |
| 2022.10 | 285 | 158 | 258 | 364 | 44 |
| 2022.11 | 119 | 117 | 131 | 295 | 37 |
| 2022.12 | 153 | 151 | 78 | 287 | 24 |
| 2023.01 | 139 | 142 | 85 | 171 | 32 |
| 2023.02 | 130 | 131 | 193 | 196 | 40 |
| 2023.03 | 152 | 111 | 179 | 173 | 60 |
| 2023.04 | 113 | 153 | 154 | 190 | 56 |
| 2023.05 | 116 | 160 | 124 | 172 | 57 |
In June 2021, there was a notably high number of news articles related to the keyword 'air force,' suggesting a concentrated focus on certain events during that period. To allow a substantive examination of defense-related issues, no deliberate effort was made to artificially balance the article counts. However, articles containing duplicate content, personal notices, condolences, or photo reports, which were unlikely to yield meaningful results, were excluded from the analysis.
3.2. LDA based Topic Modeling
For topic modeling, we employed the Latent Dirichlet Allocation (LDA) method to extract the core topics from the collected articles. The news articles related to national defense were collected from the BigKinds database provided by the Korea Press Foundation [4].
To begin, we extracted keywords from the articles as a means of eliminating noisy information, since the articles were obtained via a web-crawling algorithm and thus may contain redundant words. For keyword extraction, we employed the KPF-KeyBERT model, which has been pre-trained on articles from the Korea Press Foundation and exhibits high performance in Korean-language keyword extraction, as illustrated in Table 4 and Fig. 4.
Table 4. Comparison evaluation results of KPF-BERT and existing BERT models.
| Category | KLUE-NLI (Natural Language Inference) | KorQuAD (Machine Reading Comprehension) |
| Evaluation method | Accuracy | Accuracy regression rate |
| BERT base | 73.30% | 90.02% |
| KoBERT (SKT) | 79.53% | 71.36% |
| KoBERT (ETRI) | 80.56% | 82.00% |
| KPF BERT | 87.67% | 94.95% |
Fig. 3. Number of monthly collected articles by keyword.
Fig. 4. Keyword extraction using KeyBERT.
Fig. 5. Process of BERT-based similarity analysis.
The article is encoded into a fixed-length vector representation capturing semantic
meaning using a pretrained BERT model. Subsequently, keywords and n-grams are extracted
from the document. A bag-of-words (BOW) approach represents the text based only on
word frequencies ignoring word order. N-grams divide the text into continuous word
or character sequences of length N during document and sentence analysis. Next, cosine
similarity between vectors identifies the most similar keywords. Finally, the words
and phrases most representative of the full document are extracted.
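A minimal sketch of this keyword extraction step is given below using the open-source KeyBERT package with a generic multilingual encoder; the actual pipeline uses the KPF-KeyBERT model pre-trained on Korea Press Foundation articles, so the model name, example text, and parameters here are illustrative assumptions only.

```python
from keybert import KeyBERT
from sentence_transformers import SentenceTransformer

# Illustrative encoder; the study uses the KPF-KeyBERT model instead.
encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
kw_model = KeyBERT(model=encoder)

article = (
    "The Ministry of National Defense announced the Defense Innovation 4.0 basic plan, "
    "emphasizing AI-based technology, manned-unmanned teaming, and force restructuring."
)

# Extract candidate unigrams/bigrams, embed them, and rank them by cosine similarity
# to the document embedding.
keywords = kw_model.extract_keywords(
    article,
    keyphrase_ngram_range=(1, 2),  # n-grams of length 1-2
    stop_words="english",
    top_n=5,
)
print(keywords)  # list of (keyword, similarity score) pairs
```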
Given the extracted keywords, we performed LDA topic modeling on the keywords from articles published before and after the announcement of Defense Innovation 4.0. We set 10 topics as the optimal number to avoid excessive topic overlap. The inferred topics were visualized using the gensim library [18] and pyLDAvis [19] to present inter-topic distances and representative keywords in a coherent manner. To optimize LDA model performance, hyperparameters including the number of iterations and the number of topics were tuned. Appropriately determining the topic quantity is critical: comparing models with 10 and 11 topics revealed substantially higher topic overlap and closer topic proximity with 11 topics, which made interpreting the distinct characteristics of each topic difficult. Hence, 10 topics was deemed optimal. Fig. 6 shows visualization results on one year of news articles before (top) and after (bottom) the presentation of the national tasks, demonstrating the impact of selecting 10 versus 11 topics.
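A minimal sketch of this tuning and visualization step is shown below, assuming a gensim LdaModel and hypothetical `corpus` and `dictionary` objects prepared from the extracted keywords; the parameter values are placeholders, not our exact settings.

```python
import pyLDAvis
import pyLDAvis.gensim_models
from gensim.models import LdaModel

def fit_and_visualize(corpus, dictionary, num_topics, out_html):
    """Train an LDA model with a given topic count and save an inter-topic distance map."""
    lda = LdaModel(
        corpus=corpus,
        id2word=dictionary,
        num_topics=num_topics,   # compared with 10 vs. 11 topics in our experiments
        passes=20,
        random_state=0,
    )
    panel = pyLDAvis.gensim_models.prepare(lda, corpus, dictionary)
    pyLDAvis.save_html(panel, out_html)
    return lda

# Visual inspection of topic overlap in the saved maps guided the choice of 10 topics.
# lda10 = fit_and_visualize(corpus, dictionary, 10, "lda_10_topics.html")
# lda11 = fit_and_visualize(corpus, dictionary, 11, "lda_11_topics.html")
```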
Fig. 6. Visualization results according to the change in the number of topics ($\pmb{10
\rightarrow 11}$) before (top) and after (bottom) the announcement of national affairs.
3.3. BERT-based Similarity Analysis
The extracted topics are utilized for similarity analysis to examine the effects of
Defense Innovation 4.0 on the news article topics. We employed the RoBERTa [13] model to conduct the similarity analysis. RoBERTa is a model based on BERT but enhanced with a larger dataset, a larger model size, and a longer training duration. The process of conducting the similarity analysis experiments is illustrated in Fig. 5.
Given a query $q$, we use the fine-tuned RoBERTa model to encode the query and a set of candidate documents $D$. The encoding process can be represented as
$$E_q = \mathrm{RoBERTa}(q), \qquad E_d = \mathrm{RoBERTa}(d), \quad d \in D,$$
where $E_q$ and $E_d$ are the embeddings of the query and candidate documents, respectively. We compute the cosine similarity between the query embedding and each candidate document embedding to obtain a ranking score
$$\mathrm{score}(q, d) = \frac{E_q \cdot E_d}{|E_q|\,|E_d|},$$
where $|E_q|$ and $|E_d|$ are the magnitudes of the query and candidate document embeddings, respectively.
For each topic, we employed the pre-trained RoBERTa model to embed the set of 10 representative keywords per topic. Additionally, we computed embeddings for the three topics extracted from the Defense Innovation 4.0 press release. Subsequently, we computed cosine similarity by calculating vector inner products between the vectorized Defense Innovation 4.0 topics and the 10 defense issue topics extracted before and after the presentation of the national tasks. These scores were quantified and visualized as heatmaps for ease of analysis.
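A condensed sketch of this step is given below; it uses a generic multilingual sentence encoder and toy topic-keyword strings as stand-ins for the fine-tuned RoBERTa model and the actual extracted topics, so all names and values are illustrative assumptions.

```python
import matplotlib.pyplot as plt
from sentence_transformers import SentenceTransformer, util

# Stand-in encoder; the study uses a fine-tuned RoBERTa model.
encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

# Toy examples: each topic is a sentence built from its representative keywords.
innovation_topics = [
    "technology system innovation future industry weapon",
    "military AI development structure operation capability",
    "defense science power efficiency data concept",
]
news_topics = [
    "United States China Taiwan security government",
    "North Korea missile launch exercise provocation",
    "incident investigation sergeant victim prosecution",
]

# Encode both topic sets and compute the pairwise cosine-similarity matrix.
emb_innovation = encoder.encode(innovation_topics, convert_to_tensor=True)
emb_news = encoder.encode(news_topics, convert_to_tensor=True)
sim_matrix = util.cos_sim(emb_innovation, emb_news).cpu().numpy()

# Visualize the scores as a heatmap (rows: Defense Innovation 4.0 topics, cols: news topics).
plt.imshow(sim_matrix, cmap="viridis", vmin=0.0, vmax=1.0)
plt.colorbar(label="cosine similarity")
plt.xlabel("defense news topics")
plt.ylabel("Defense Innovation 4.0 topics")
plt.savefig("similarity_heatmap.png", dpi=150)
```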
4. Experimental Results
4.1. Topic Modeling Results
4.1.1 Before Defense Innovation 4.0
We conducted LDA on defense-related news articles published before the presentation of the national tasks (May 1, 2021, to May 2, 2022, comprising 8,895 articles). Table 5 lists the top 10 words best representing each topic. The comprehensive visualization
of the LDA topic modeling results is depicted in Fig. 7.
Fig. 7. Visualization of defense news topic modeling before national task announcement.
Table 5. Topic modeling results before Defense Innovation 4.0.
| Topic | Top 10 keywords |
| Topic 1 | Incident, Air Force, Sergeant, Investigation, Sexual Harassment, Victim, Harm, Ministry of National Defense, Perpetrator, Inquiry |
| Topic 2 | United States, China, Missile, North Korea, South Korea, Rocket, Government, Space, Palestine, Japan |
| Topic 3 | Vaccine, COVID, Vaccination, People, Start, Vehicle, Situation, Thought, Meals, Level |
| Topic 4 | Israel, United States, Hamas, Attack, Gaza Strip, President, United Kingdom, US Military, War, Airstrike |
| Topic 5 | President, Ministry of National Defense, Minister, Incident, Chief, That day, Citizens, Representative, National Assembly, Seo Wook |
| Topic 6 | Thoughts, Citizens, Anchor, Representative, President, Members of Parliament, Democratic Party, Politics, Words, Talk |
| Topic 7 | Women, Military, Men, Society, Education, Sergeant, Illegal, Service, Female Soldiers, Myanmar |
| Topic 8 | Air Force, Accident, Navy, Crash, Fighter Jet, Debris, Occurrence, Helicopter, Flight, Maritime |
| Topic 9 | Army, Unit, Soldier, Officer, Enlisted, Navy, Military Headquarters, Soldiers, Lieutenant Colonel, Mobile Phone |
| Topic 10 | Coast Guard, Crew, Fishing Boat, Missing, Captain, Search, Ulleungdo Island, Fishing Industry, Fishing Operation, Maritime |
Upon analyzing the visualization results, the topics can be broadly categorized into
five groups based on their distribution. The 1st group, represented by Topic 1 in
the first quadrant, is positioned distinctly apart from other topics, indicating lower
relevance compared to the rest. Additionally, with the largest circle size in the
visualization at 18.4%, it can be inferred that this topic holds the highest contribution
among defense-related issue articles before the announcement of national tasks. The
2nd group, situated in the fourth quadrant, includes Topics 2, 3, 4, 6, and 7. These
topics are closely adjacent, indicating a higher level of interrelatedness among them.
Each topic's contribution reveals Topic 2 at 17.8%, Topic 3 at 12.8%, Topic 4 at 11.2%,
Topic 6 at 10%, and Topic 7 at 10%. The 3rd group encompasses Topic 5, positioned
across the first and fourth quadrants. It stands somewhat distant from other topics
without significant overlaps, indicating its relatively lower relevance and independent
nature, with a contribution of 10.5%. The 4th group involves Topics 8 and 9, spanning
the second and third quadrants. These topics, adjacent to each other, demonstrate
higher relevance but are notably distinct from the remaining groups. Topic 8 contributes
5.5%, while Topic 9 contributes 5.4%. Lastly, the 5th group comprises Topic 10 located
in the third quadrant, considerably distant from other groups, indicating its independent
nature. With a contribution of 0.4%, it shows minimal relevance to defense issues
before the announcement of national tasks.
4.1.2 After Defense Innovation 4.0
The results of topic modeling for defense-related news articles after the announcement
of national tasks (from May 3, 2022, to May 31, 2023, comprising 7,898 articles) are
presented in Table 6. The visualization of these results is depicted in Fig. 8.
Fig. 8. Visualization of defense news topic modeling after announcement of national
affairs.
Table 6. Topic modeling results after Defense Innovation 4.0.
| Topic | Top 10 keywords |
| Topic 1 | United States, China, South Korea, Taiwan, Japan, world, security, military, government, US military |
| Topic 2 | Government, Seoul, nation, society, center, candidates, women, COVID-19, support, target |
| Topic 3 | President, citizens, government, lawmakers, conversation, anchor, thoughts, remarks, Yoon Suk-yeol, representative |
| Topic 4 | Ukraine, Russia, Russian army, war, attack, region, Ukrainian army, president, missile, Putin |
| Topic 5 | person, murder, name, South Korea, oneself, think, photo, navy, family, country |
| Topic 6 | North Korea, missile, launch, exercise, provocation, ROK-US, ballistic, response, that day, navy |
| Topic 7 | Incident, investigation, sergeant, charges, Air Force, victim, Ministry of National Defense, inquiry, death, prosecution |
| Topic 8 | That day, area, damage, navy, residents, rescue, nearby, vehicle, sea, deployment |
| Topic 9 | Special investigation, police, accident, suspicion, discovery, confirmation, investigation, occurrence, Marine Corps, army |
| Topic 10 | Air wing, junior grade, Osan, transportation, human rights committee, memo, labor organization, public relations, illegal activities, informant |
The visualization analysis reveals a division of topics into 6 main groups based on
their distribution. The 1st group, represented by Topic 1 located in the 4th quadrant,
stands distinctly independent without overlapping with other groups, indicating relatively
lower relevance with the rest of the topics. With a contribution of 18.2%, Topic 1
can be considered as the primary representative of defense issues post the national
task announcement. Group 2 comprises Topics 2, 3, and 5 positioned in the 1st and
4th quadrants, closely adjacent to each other and distanced from other groups. Topic
2 contributes 15.9%, Topic 3 contributes 15%, and Topic 5 contributes 10.3%. Group
3 consists of Topic 4 positioned in the 4th quadrant, presenting an independent topic
distant from others with a contribution of 12.4%. Group 4 involves Topics 6 and 8
located in the 3rd quadrant, indicating a notably high association between these adjacent
topics. Topic 6 contributes 9.4%, while Topic 8 contributes 6.2%. Group 5, represented
by Topic 7 in the 2nd quadrant, stands apart from other topics, indicating an independent
aspect and contributing 7.3%. Lastly, Group 6 encompasses Topics 9 and 10 in the 2nd
and 3rd quadrants, respectively. These topics show considerable distance from other
groups. While Topic 9 contributes 5.1%, Topic 10 displays a low contribution of 0.2%,
indicating relatively low relevance to defense issues after the national task announcement.
Comparing the visualization analysis before and after the national task announcement,
focusing on the highest-contributing Topic 1, there's a notable shift. Previously,
Topic 1 was positioned in the 1st quadrant, maintaining a considerable distance from
other topics, indicating its independent nature. However, post the national task announcement,
Topic 1 now shares the 4th quadrant with Topics 2 and 4, and notably closer proximity
to the other four topics (Topics 2, 3, 4, and 5).
4.1.3 Topic modeling results on Defense Innovation 4.0
The topic modeling results of the Defense Innovation 4.0 Basic Plan press release
are shown in Table 7, and the corresponding visualization is depicted in Fig. 9.
Table 7. Defense Innovation 4.0 Basic Plan press release topic modeling results.
| Topic | Top 10 keywords |
| Topic 1 | Technology, System, Innovation, Field, Future, System, Industry, Weapon, Battle, Missile |
| Topic 2 | Military, AI, Development, Structure, Ministry of Defense, Organization, Operation, Advanced, Capability, Training |
| Topic 3 | Defense, Science, Base, Power, Stage, Field, Center, Efficiency, Data, Concept |
Fig. 9. Similarity analysis process.
According to the visualization analysis, Topic 1, Topic 2, and Topic 3 are positioned
far apart from each other, indicating their low interrelatedness and independence
as topics. Moreover, in terms of contribution, Topic 1 holds 42.5%, Topic 2 comprises
30.7%, and Topic 3 accounts for 26.8%. All three topics are considered representative
of the Defense Innovation 4.0.
4.2. Similarity Analysis Results
Using the RoBERTa model, we conducted a cosine similarity analysis between topics, visualizing the results with heatmaps to facilitate comprehension. In Figs. 10 and 11, the vertical axis represents the topics derived from the Defense Innovation 4.0 Basic Plan press release (labeled as Topics 1, 2, and 3), while the horizontal axis presents the 10 comparative topics. These comparative topics were generated from sentences containing 10 representative keywords each. The sentences were embedded using the RoBERTa model to transform them into multi-dimensional vectors. We then calculated the cosine similarity by computing the inner product of these vectorized sentences and visualized the scores with heatmaps in Figs. 10 and 11.
Fig. 10. Similarity matrix between topics before Defense Innovation 4.0 announcement.
Fig. 11. Similarity matrix between topics after Defense Innovation 4.0 announcement.
Fig. 10 depicts the cosine similarity comparison between Defense Innovation 4.0 topics and
those from the pre-announcement period, showing that only two topics have a similarity
score of 0.4 or above. Conversely, Fig. 11 illustrates the post-policy announcement similarity analysis, where seven topics
exhibit a significant similarity of 0.4 or higher. This shift suggests an increase
in defense-related topics aligning more closely with the Defense Innovation 4.0 initiative
after the policy announcement, indicating a greater emphasis on these areas in subsequent
discussions.
Furthermore, focusing on the most influential topic, Topic 1, in both pre and post-policy
announcement periods reveals insightful trends. In Fig. 10, the first column shows the pre-announcement similarities for Defense Innovation
4.0 Topic 1 as 0.19, Topic 2 as 0.34, and Topic 3 as 0.26, reflecting lower similarities.
However, post-announcement data in Fig. 11's first column indicates higher similarities: 0.36 for Topic 1, 0.47 for Topic 2,
and 0.34 for Topic 3. This significant increase in similarity scores post-policy announcement
suggests an amplified focus on Defense Innovation 4.0 themes, particularly in the
most influential topic, Topic 1.
These findings imply that the Defense Innovation 4.0 policy announcement has heightened
the focus on defense innovation and technology within media coverage of defense issues.
Private media outlets appear to increasingly align their discussions with the government's
defense reform agenda, underscoring the growing recognition of defense innovation
and technology as critical components in addressing modern defense challenges. This
alignment highlights the media's role in emphasizing the importance of the Defense
Innovation 4.0 initiative and its relevance to contemporary defense discourse.
4.3. Performance Comparison
To further validate our approach, we conducted a performance comparison between our RoBERTa-based method and open-source language models such as GPT-2 [22] and Longformer [23]. Figs. 12 and 13 show the similarity matrices before and after the announcement. Both GPT-2 and Longformer exhibit high similarity scores across all instances, suggesting a reduced sensitivity to contextual variations. This indicates that RoBERTa may be better suited for tasks requiring a nuanced understanding of textual changes over time.
GPT-2 and Longformer struggle with similarity detection due to their designs. GPT-2, trained for next-token prediction, does not capture full-text relationships well, resulting in high, uniform similarity scores that miss contextual changes. Longformer, optimized for long documents with a mixed attention mechanism, also shows consistently high similarity scores, indicating that it cannot pinpoint subtle shifts in shorter texts.
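For reference, a rough sketch of how such models can be pressed into the same comparison is shown below, using mean pooling over hidden states from the Hugging Face transformers library; the checkpoint names and pooling choice are illustrative assumptions rather than our exact setup.

```python
import torch
from transformers import AutoTokenizer, AutoModel

def mean_pooled_embedding(checkpoint, text):
    """Embed a text as the mean of the model's last hidden states (a simple, generic pooling)."""
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModel.from_pretrained(checkpoint)
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)            # (dim,)

text_a = "Defense innovation emphasizes AI-based technology."
text_b = "North Korea launched another ballistic missile."

# Compare how different backbones score the same sentence pair.
for ckpt in ["gpt2", "allenai/longformer-base-4096"]:
    emb_a = mean_pooled_embedding(ckpt, text_a)
    emb_b = mean_pooled_embedding(ckpt, text_b)
    score = torch.nn.functional.cosine_similarity(emb_a, emb_b, dim=0).item()
    print(f"{ckpt}: cosine similarity = {score:.3f}")
```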
Fig. 12. Similarity matrix before (left) and after (right) using GPT-2.
Fig. 13. Similarity matrix before (left) and after (right) using Longformer.
5. Conclusion and Future Work
In this paper, we analyzed the government's current defense reform plan, Defense Innovation 4.0, through advanced language modeling techniques. We performed this research by gathering and analyzing defense-related news articles published before and after the official announcement of this initiative. Employing BERT-based models for keyword extraction and LDA for topic modeling, we were able to identify and examine the defense topics that attracted attention during these periods. Additionally, we utilized RoBERTa to perform a similarity analysis on the main topics extracted from the Defense Innovation 4.0 announcement document, which enabled us to trace shifts in focus on defense-related issues before and after the policy was unveiled. Our findings indicate a notably closer alignment between the core topics of Defense Innovation 4.0 and defense news topics post-announcement, especially for the most significant topic, Topic 1, from our LDA results.
However, this work has limitations, chiefly the temporal scope of our data collection, which was confined to one year before and after the policy's announcement, and its focus on a selected group of media outlets. To build upon our findings, future
research should extend the analysis to cover the entire duration of the previous administration
led by President Moon Jae-in, as well as to examine defense-related publications throughout
the current government's term. This broader temporal analysis could facilitate a more
nuanced understanding of the shifts in defense disclosure. Furthermore, a detailed
examination of the specific tasks within Defense Innovation 4.0 and their respective
levels of media interest could enrich our understanding of the initiative's impact.
Expanding the scope of future investigations in these directions promises to offer
deeper and more comprehensive insights into the dynamics of defense innovation disclosure.
ACKNOWLEDGMENTS
This work was supported partly by the Institute of Information & Communications Technology
Planning & Evaluation (IITP) grant funded by Korea government (MSIT) [2021-0-01341,
Artificial Intelligent Graduate School Program(Chung-Ang University)], and partly
by the Institute of Information & Communications Technology Planning & Evaluation
(IITP) grant funded by the Korea government(MSIT) (No.2022-0-00601, Military AI Development
and Management Program).
REFERENCES
Republic of Korea Policy Briefing, ``110 major national agenda policies of the Yoon
Suk-yeol administration,'' registered on May 2, 2022.

Ministry of National Defense, ``Accelerating the establishment of `Defense Innovation
4.0' basic plan,'' registered on August 10, 2022.

Ministry of National Defense, ``Defense Innovation 4.0 basic plan press release,''
registered on March 3, 2023.

Korea Press Foundation, BigKinds, searched on November 2, 2023.

S. Wu, J. Liu, and L. Liu, ``Modeling method of internet public information data mining
based on probabilistic topic model,'' The Journal of Supercomputing, vol. 75, pp.
5882-5897, 2019.

A. P. C. Addo, S. K. Akpatsa, M. Dorgbefu Jr, J. C. Dagadu, V. D. Tattrah, N. K. Dzoagbe,
D. D. Fiawoo, and J. Nartey, ``Topic Modeling and Sentiment Analysis of US’ Afghan
Exit Twitter Data: A Text Mining Approach,'' International Journal of Information
and Management Sciences, vol. 34, no. 1, pp. 51-64, 2023.

J. Lee, ``Trend analysis in defense policy studies using topic modeling,'' Journal
of The Korean Institute of Defense Technology, vol. 1, no. 2, pp. 1-4, 2019.

D. M. Blei, A. Y. Ng, and M. I. Jordan, ``Latent Dirichlet allocation,'' Journal of
Machine Learning Research, vol. 3, pp. 993-1022, January 2003.

S. T. Dumais, ``Latent semantic analysis,'' Annual Review of Information Science and
Technology (ARIST), vol. 38, pp. 189-230, 2004.

D. Lee and H. S. Seung, ``Algorithms for non-negative matrix factorization,'' Advances
in Neural Information Processing Systems, vol. 13, pp. 535-541, 2000.

J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, ``BERT: Pre-training of deep bidirectional
transformers for language understanding,'' arXiv preprint arXiv:1810.04805, 2018.

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser,
and I. Polosukhin, ``Attention is all you need,'' Advances in Neural Information Processing
Systems, vol. 30, pp. 5998-6008, 2017.

Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer,
and V. Stoyanov, ``RoBERTa: A robustly optimized BERT pretraining approach,'' arXiv
preprint arXiv:1907.11692, 2019.

A. Singhal, ``Modern information retrieval: A brief overview,'' IEEE Data Engineering
Bulletin, vol. 24, no. 4, pp. 35-43, 2001.

T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, ``Distributed representations
of words and phrases and their compositionality,'' Advances in Neural Information
Processing Systems, vol. 26, pp. 3111-3119, 2013.

J. Pennington, R. Socher, and C. D. Manning, ``GloVe: Global vectors for word representation,''
Proc. of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP),
pp. 1532-1543, 2014.

N. Reimers and I. Gurevych, ``Sentence-BERT: Sentence embeddings using siamese BERT-networks,''
arXiv preprint arXiv:1908.10084, 2019.

R. Rehurek and P. Sojka, ``Gensim - Python framework for vector space modelling,''
NLP Centre, Faculty of Informatics, Masaryk University, Brno, Czech Republic, vol.
3, no. 2, 2011.

C. Sievert and K. Shirley, ``LDAvis: A method for visualizing and interpreting topics,''
Proceedings of the Workshop on Interactive Language Learning, Visualization, and Interfaces,
pp. 63-70, 2014.

G.-W. Jeon, I. Kang, and J.-h. Jeon, ``Systematic analysis on the trend of defense
technologies using topic modeling: A case of an armoured fighting vehicle,'' Industrial
Innovation Research, vol. 36, no. 1, pp. 69-94, 2020.

M.-J. Kwon, ``Identifying Seoul city issues based on topic modeling of news article,''
Proc. of the Korean Society of Broadcasting and Media Engineering Conference, pp.
11-13, 2019.

J. Achiam, S. Adler, S. Agarwal, L. Ahmad, A. Akkaya, M. Aleman, et al., ``GPT-4 technical
report,'' arXiv preprint arXiv:2303.08774, 2023.

I. Beltagy, M. E. Peters, and A. Cohan, ``Longformer: The long-document transformer,''
arXiv preprint arXiv:2004.05150, 2020.

Author
Doohong Park received his B.S. degree in computer engineering from Soongsil University
in 2010. In 2023, he was chosen for specialized military AI training by the Ministry
of National Defense and successfully completed the AI Core Talent Development Policy
Course at Chung-Ang University’s National Defense AI School. Also, he received an
M.S. degree in defense information management from the National Defense University,
South Korea, in 2024. Commissioned as an officer in the South Korean Army in 2010,
he has been serving in the army. From 2021 to 2022, he worked in the Army’s AI policy
department, and since 2024, he has been employed at the Defense AI Center of the Agency
for Defense Development under the Ministry of National Defense. His research interests
include classification and prediction through big data analysis, natural language
processing, and, particularly, object detection and recognition in image processing.
Donggoo Kang received his M.S. degree in AI imaging at Chung-Ang University, South
Korea, in 2020. Currently, he is pursuing a Ph.D. degree in AI imaging at Chung-Ang
University. His research interests include computational photography and human-object
interaction discovery.
Joonki Paik completed his B.S. degree in control and instrumentation engineering from
Seoul National University in 1984. He continued his education in the United States,
earning his M.S. and Ph.D. degrees in electrical engineering and computer science
from Northwestern University in 1987 and 1990, respectively. Dr. Paik began his career
at Samsung Electronics from 1990 to 1993, where he played a key role in designing
image stabilization chipsets for consumer camcorders. In 1993, he joined the faculty
at Chung-Ang University in Seoul, Korea. He is currently a professor with the Graduate
School of Advanced Imaging Science, Multimedia, and Film at the university. From 1999
to 2002, he served as a visiting professor in the Department of Electrical and Computer
Engineering at the University of Tennessee, Knoxville. Since 2005, Dr. Paik has been
the director of a national research laboratory in Korea specializing in image processing
and intelligent systems. He held the position of Dean for the Graduate School of Advanced
Imaging Science, Multimedia, and Film from 2005 to 2007 and concurrently served as
the director of the Seoul Future Contents Convergence Cluster. In 2008, Dr. Paik took
on the role of a full-time technical consultant for the Systems LSI Division of Samsung
Electronics. Here, he developed various computational photographic techniques, including
an extended depth of field system. Dr. Paik has had a notable influence in scientific
and governmental circles in Korea. He is a member of the Presidential Advisory Board
for Scientific/Technical Policy with the Korean Government and serves as a technical
consultant for computational forensics with the Korean Supreme Prosecutor’s Office.
His accolades include being a two-time recipient of the Chester-Sall Award from the
IEEE Consumer Electronics Society. He has also received the Academic Award from the
Institute of Electronic Engineers of Korea and the Best Research Professor Award from
Chung-Ang University. He has actively participated in various professional societies.
He served the Consumer Electronics Society of the IEEE in several capacities, including
as a member of the Editorial Board, Vice President of International Affairs, and Director
of Sister and Related Societies Committee. In 2018, he was appointed as the president
of the Institute of Electronics and Information Engineers. Since 2020, Dr. Paik has
held the position of Vice President of Academic Affairs at Chung-Ang University. In
an exceptional move in 2021, he simultaneously assumed the roles of Vice President
of Research and Dean of the Artificial Intelligence Graduate School at Chung-Ang University
for a one-year term. Expanding his scope of responsibilities in 2022, Dr. Paik accepted
a five-year appointment as Project Manager for the Military AI Education Program under
Korea’s Department of Defense. With a career spanning over three decades, Dr. Joonki
Paik has made significant contributions to the fields of image processing, intelligent
systems, and higher education.