Yoola Choi
Young-Woo Kwon
(School of Computer Science and Engineering, Kyungpook National University, Daegu, Korea
{yura0812, ywkwon}@knu.ac.kr)
Copyright © The Institute of Electronics and Information Engineers (IEIE)
Keywords
Vulnerability detection, Graph neural networks, Static program analysis
1. Introduction
A vulnerability is a weakness that attackers can exploit and that may cause serious problems. As the number of software vulnerabilities has increased significantly, many vulnerability detection techniques have been developed. Static analysis is one of the approaches frequently used to discover vulnerabilities before software release. Existing approaches [1-3] require human experts to define the patterns or strategies of each vulnerability.
Moreover, it is difficult to define all such patterns manually, and new vulnerability patterns cannot be detected. According to the National Vulnerability Database (NVD) [4], 11,874 Common Vulnerabilities and Exposures (CVEs) were reported in 2021 alone, with more than 1,400 new vulnerabilities every month (Fig. 1). Because of these limitations of pattern-based vulnerability detection approaches,
recent studies have focused on utilizing machine learning or deep learning techniques
to reduce manual efforts and deal with more vulnerabilities. In particular, some approaches
using various program representations have shown better results than existing static
analysis tools. Program representations can be obtained from the analysis results,
and they include details of the program, e.g., control flow, data flow, and program
dependency.
Both syntactic information and semantic information are needed for effective vulnerability
discovery. Syntactic information is represented by the Abstract Syntax Tree (AST),
and semantic information is represented by many graphs, such as Control Flow Graph
(CFG), Data Flow Graph (DFG), and Program Dependency Graph (PDG). Several representations,
such as Code Property Graph (CPG), need to be combined to utilize both information
[5]. This study compared program representations carrying syntactic information only, semantic information only, and both. Accordingly, representations such as the AST, CFG, PDG, and CPG were extracted using Clang Static Analyzer, Joern, and SVF.
A graph is non-Euclidean data, so it should be treated differently from Euclidean data such as images and text. Many existing models were designed for Euclidean data. Therefore, Graph Neural Networks (GNNs) [6] should be used to prevent the loss of important graph information (e.g., the topological dependency information of nodes) when running models. In this paper, we used GraphSAGE [7], a graph convolutional network model that applies inductive learning to obtain better predictions on unseen data.
This paper suggests which program representation and which static analysis tool for graph extraction are suitable for vulnerability detection with GNNs. The effectiveness of each program representation for discovering vulnerabilities in source code with a GNN was compared experimentally. As a result, the PDG from Joern showed the best performance, with every metric above 99%. The contributions of this paper are as follows:
$\textbf{·}$ We compare the effectiveness of the program representations and the static analysis tools used for graph extraction;
$\textbf{·}$ We describe how important the program representation is for detecting vulnerabilities with graph neural networks;
$\textbf{·}$ The results can serve as a reference for those who want to conduct vulnerability detection with deep learning.
The remainder of this article is organized as follows. The next section describes related work. The experiments section presents the dataset, graph extraction, and preprocessing, which are essential to run the models, as well as the graph neural network model used in the experiments. The performance evaluation section includes the performance evaluations and a discussion of the results. Finally, the last section concludes the paper.
Fig. 1. Number of CVEs in 2021 (From January to July).
2. Related Work
Detecting vulnerabilities with graph neural networks shows better results than traditional static analysis approaches. One proposal [8] extracts the AST, CFG, DFG, and NCS (Natural Code Sequence) with Joern and concatenates these into a single graph. After extraction, it applies the gated graph sequence neural network (GGNN) model to detect vulnerabilities. Another proposal [9] builds a PDG reflecting both control and data dependence with SVF and classifies vulnerable code with a deep graph neural network model.
However, both approaches have limitations. The second proposal does not provide a comparison between representations; it only compares the granularity and three GNN models. The first one compares the performance between representations (AST, CFG, DFG, NCS, and composite), but it uses only Joern, so it cannot show differences between static analysis tools. Therefore, we compare the performance of both the representations and the static analysis tools used to extract them.
3. Experiments
Experiments were conducted to analyze the performance of the program representations
and static analysis tools. Fig. 2 provides an overview of the experiments. First, representations were extracted using
three static analysis tools from the dataset composed of vulnerable source codes.
Second, the representations were preprocessed to fit the GNN. Preprocessing comprises vectorizing the node features and labeling each graph with a class. Last, each preprocessed representation dataset was fed to the GraphSAGE model [7], and the vulnerability detection results were obtained.
Fig. 2. Overview of the classification methodology.
3.1 Dataset
Vulnerable and non-vulnerable code is needed to train a model to classify source code as vulnerable or not. Therefore, we used the Juliet Test Suite v1.3 for C/C++ [10] as the dataset. The Juliet Test Suite is a set of test programs with vulnerabilities. It provides both vulnerable and non-vulnerable code, so both can be obtained from a single dataset. Each test program has a single function involving a vulnerability (bad function) and one or more functions similar in behavior to the bad function but not vulnerable (good functions). Therefore, the dataset was divided into two parts, bad functions and good functions, to ease labeling in the preprocessing step. The suite provides test cases for C/C++ and Java, but only the C/C++ cases were used in the experiments.
3.2 Graph Extraction
The static analysis tool is used to extract the program representations. First, it
analyzes the source code and generates an abstract syntax tree. Other representations
(e.g., control-flow graph) are then constructed from the tree. Using the static analysis
tools, seven graphs of six types were extracted (AST, CFG, PDG, CPG, ICFG, and VFG). Each type carries syntactic information, semantic information, or both. The following are brief descriptions of the graphs.
$\textbf{Abstract Syntax Tree (AST)}$ The AST shows the syntactic structure of a program as a tree. It keeps the essential parts of the code while excluding unnecessary elements such as whitespace, semicolons, and comments. An operator is used as the root of a subtree, and its operands are used as leaf nodes. The shape of the AST depends on the compiler.
$\textbf{Control-Flow Graph (CFG)}$ The CFG represents all paths that can be traversed
during program execution. It describes the inner flow of the procedure in detail.
The graph consists of four components (basic block, directed edge, entry block, and
exit block). One basic block can contain more than one line. The entry and exit blocks
indicate the start and endpoints of the program.
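As an illustration of this structure, the CFG of a function with a single branch can be sketched with networkx; the block names (B1-B3) and the toy shape are purely illustrative, not output from any of the tools used in this study:

```python
import networkx as nx

# Minimal CFG sketch for a function with one if/else branch.
cfg = nx.DiGraph()
cfg.add_edges_from([
    ("ENTRY", "B1"),   # entry block leads to the first basic block
    ("B1", "B2"),      # branch taken
    ("B1", "B3"),      # branch not taken
    ("B2", "EXIT"),
    ("B3", "EXIT"),    # exit block joins both paths
])

# Every execution path runs from the entry block to the exit block.
paths = list(nx.all_simple_paths(cfg, "ENTRY", "EXIT"))
print(len(paths))  # one path per branch: 2
```

Enumerating entry-to-exit paths like this is exactly what "all paths that can be traversed during program execution" means for a CFG.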
$\textbf{Program Dependence Graph (PDG)}$ The PDG specifies both control dependence and data dependence. Control dependence denotes which parts of a program can control the execution of other parts [11], and data dependence denotes which parts refer to the same memory locations as other parts. The PDG thus provides a summary of the dependences in a program.
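A hypothetical three-statement PDG can be sketched with networkx to show how the two dependence kinds coexist in one graph; the statements and the `kind` edge attribute are illustrative, not the actual schema of Joern or SVF:

```python
import networkx as nx

# Toy PDG for:  x = input(); if x > 0: y = x + 1
pdg = nx.MultiDiGraph()
pdg.add_edge("x = input()", "if x > 0", kind="data")    # x flows into the condition
pdg.add_edge("if x > 0", "y = x + 1", kind="control")   # assignment runs only if true
pdg.add_edge("x = input()", "y = x + 1", kind="data")   # x flows into the assignment

# The two dependence kinds can be separated by filtering on the edge attribute.
data_deps = [(u, v) for u, v, k in pdg.edges(data="kind") if k == "data"]
print(len(data_deps))  # 2
```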
$\textbf{Code Property Graph (CPG)}$ The CPG is a combination of three existing representations. Its nodes are based on the AST, and its edges have four types: AST, CFG, CDG, and DDG. Syntactic and semantic information are thus contained in one graph, which is why it has recently been used in vulnerability detection research. Only Joern supports the CPG; generating it with other tools might require additional manual effort.
$\textbf{Inter-procedural Control-Flow Graph (ICFG)}$ While the CFG shows the control flow within a single procedure, the ICFG presents the entire control flow of a program with multiple procedures. It ties multiple CFGs into a single graph: each of its nodes is a CFG, and its edges are the call relationships between procedures.
$\textbf{Value-Flow Graph (VFG)}$ Usually, data-flow analysis uses two types of algorithms, semantic and syntactic, to obtain the corresponding kinds of equivalence. The VFG, however, utilizes syntactic algorithms to obtain semantic equivalence; hence, it indicates the semantic equivalence of terms [12].
In the experiments, three static analysis tools were chosen for graph extraction. Table 1 lists which graphs are extracted by which tools.
Table 1. Program representations extracted from each static analysis tool.
Static analysis tool | Program representations
Clang Static Analyzer | CFG
Joern | AST, CFG, PDG, CPG
SVF | ICFG, VFG
$\textbf{Clang Static Analyzer [1]}$ Clang Static Analyzer is part of the Clang compiler. It implements path-sensitive, inter-procedural analysis of C, C++, and Objective-C and provides the abstract syntax tree (AST), control-flow graph (CFG), and call graph. In this article, however, only the CFG was used, extracted from the LLVM IR (intermediate representation).
$\textbf{Joern [13]}$ Joern is a platform for analyzing C and C++. Similar to Clang
Static Analyzer, it provides AST, CFG, and program dependence graph (PDG). Uniquely,
it produces a new graph representation called the code property graph (CPG), which
is a combination of several existing graphs.
$\textbf{SVF [14]}$ SVF is a static analysis tool specialized in inter-procedural pointer analysis of C and C++. It supports the inter-procedural control-flow graph (ICFG) and value-flow graph (VFG). Its representations contain semantic information but no syntactic information. It also generates graphs from the LLVM IR.
The performance of each graph on vulnerability detection was measured using graph neural networks. However, these graphs cannot be used directly as input to a GNN; preprocessing is required to utilize graphs as input, as explained in detail in the next part.
3.3 Preprocessing
The preprocessing phase consists of two steps. One is the vectorization of node features.
The node features are necessary to train and generate the output with the GNN. Preprocessing
calculates the node embeddings based on the node features, and these embeddings are
used to predict the node label. The other one is labeling graphs with 0 (safe) and
1 (vulnerable). In the previous phase, the dataset was divided into two parts: bad
and good functions. The granularity of data is function-level. Hence, they were labeled
0 or 1 by the graph, not the node. Therefore, every node in one graph has the same
label. Fig. 3 shows how the preprocessing step works.
$\textbf{Step 1: Vectorization of node features.}$ The GNN takes both graphs and node features as input. The node features are usually the statements or conditions composing the nodes, provided as vectors. The GNN computes an embedding for each node, and the final embeddings are used to predict the label of the graph. Doc2Vec [15] was used for vectorization. Doc2Vec is a natural language processing tool that can produce fixed-length, low-dimensional vectors from node features. Vectors of length 10 were used because one node includes only a few lines of code, as the functions used in the experiments are quite small. This may change depending on the size of the functions or programs.
$\textbf{Step 2: Graph labeling.}$ In the experiments, binary classification was used to indicate whether a graph contains a vulnerability. Therefore, all graphs need labels specifying whether they are vulnerable. The dataset was already split into bad and good functions; following this split, good functions were labeled 0 and bad functions 1. If the prediction is 0, the graph is likely safe; otherwise, the graph likely has a vulnerability.
Fig. 3. Preprocessing steps.
3.4 Graph Neural Network Model
Among the various graph neural network models, GraphSAGE [7] was chosen to classify the graphs. The model is based on the Graph Convolutional Network (GCN) [16]. The GCN uses transductive learning, so it cannot generalize to nodes not seen before. To overcome this limitation, GraphSAGE applies inductive learning to produce embeddings for unseen nodes efficiently. It generates embeddings by sampling and aggregating features from the neighbors of a node instead of learning an embedding vector for each individual node. Fig. 4 shows the process of the model.
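The MEAN-aggregation step at the core of GraphSAGE can be sketched in a few lines of numpy. This toy layer omits neighbor sampling and uses random weights; it illustrates the aggregate-concatenate-transform pattern, not the paper's trained model:

```python
import numpy as np

def sage_layer(H, A, W):
    """One GraphSAGE layer with MEAN aggregation (no sampling).

    H: (n, d) node features, A: (n, n) adjacency matrix,
    W: (2*d, d_out) weights applied to [self || neighbor mean].
    """
    agg = np.zeros_like(H)
    for v in range(H.shape[0]):
        nbrs = np.nonzero(A[v])[0]
        if len(nbrs) > 0:
            agg[v] = H[nbrs].mean(axis=0)      # mean of neighbor features
    Z = np.concatenate([H, agg], axis=1) @ W   # concat self with aggregate
    Z = np.maximum(Z, 0.0)                     # ReLU nonlinearity
    norm = np.linalg.norm(Z, axis=1, keepdims=True)
    return Z / np.maximum(norm, 1e-12)         # L2-normalize embeddings

# Toy graph: 3 nodes in a path, 4-dim features, 2-dim output embeddings.
rng = np.random.default_rng(0)
H = rng.normal(size=(3, 4))
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
W = rng.normal(size=(8, 2))
Z = sage_layer(H, A, W)
print(Z.shape)  # (3, 2)
```

Stacking such layers lets each node's embedding summarize a growing neighborhood, which is what allows inductive prediction on unseen graphs.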
Supervised learning was used because all the data have labels for classification; a simple additional test comparing supervised and unsupervised learning showed better results with supervised learning. Moreover, mini-batching was applied to keep computation tractable, and MEAN was used as the aggregation function. The dataset was split into three sets: 50% for training, 15% for validation, and 35% for testing.
Fig. 4. Left: input graph and sampled nodes, Right: process of aggregating features.
4. Performance Evaluation
4.1 Evaluation
The experiments were conducted under the same conditions to evaluate the performance of each program representation. First, the representations were compared by tool, and then all representations were compared together. For the comparison, the numbers of True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN) were counted. Four metrics were used for the evaluation: accuracy (ACC), precision (P), recall (R), and F1-score (F1). Moreover, the total elapsed time was also recorded to check efficiency. Table 2 lists the classification results of the experiments, rounded to two decimal places.
$\textbf{· Accuracy}$
$\text{Accuracy}=\frac{TP+TN}{\textit{Total number of predictions}}$
$\textbf{· Precision}$
$\text{Precision}=\frac{TP}{TP+FP}$
$\textbf{· Recall}$
$\text{Recall}=\frac{TP}{TP+FN}$
$\textbf{· F1 Score}$
$\text{F1 Score}=2\times\frac{\textit{Precision}\times\textit{Recall}}{\textit{Precision}+\textit{Recall}}$
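The four metrics follow directly from the confusion counts; a small helper makes the relationships explicit. The counts below are made up for illustration and are not the paper's results:

```python
def metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from confusion counts."""
    total = tp + tn + fp + fn
    acc = (tp + tn) / total        # correct predictions over all predictions
    p = tp / (tp + fp)             # how many flagged graphs are truly vulnerable
    r = tp / (tp + fn)             # how many vulnerable graphs were caught
    f1 = 2 * (p * r) / (p + r)     # harmonic mean of precision and recall
    return acc, p, r, f1

# Illustrative counts only.
acc, p, r, f1 = metrics(tp=90, tn=85, fp=10, fn=15)
print(f"ACC={acc:.2f} P={p:.2f} R={r:.2f} F1={f1:.2f}")
```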
$\textbf{A. Comparison of representations according to the tool}$
$\textbf{Clang Static Analyzer}$ With the Clang Static Analyzer, only the CFG was extracted. The analyzer also provides an AST, but its shape differs from the general form because it is intended for the tool's internal analysis, not for generating a representation. The performance of this CFG was not very good; in particular, the precision was less than half, which means there were many FP cases. On the other hand, it takes less time than the other methods because its nodes consist of LLVM IR, which has an abbreviated form.
$\textbf{Joern}$ The PDG shows the best performance on all metrics but took the second-longest time. The AST, which contains only syntactic information, performs worst, especially in precision (only 40%). The performance of the CFG and CPG was similar: the CFG performs better than the CPG in precision and F1-score, but the CPG performs better in the other metrics (accuracy and recall). Nevertheless, in terms of total elapsed time, the CPG takes approximately 8.5 times longer.
$\textbf{SVF}$ The VFG presents better performance than the ICFG. The ICFG has huge nodes consisting of the node ID, IR, and other detailed information, so it needs more time to run the model. Even though it contains the control-flow information of each procedure and the call relationships between procedures, its performance is worse than that of the VFG, which holds data-flow information obtained syntactically.
$\textbf{B. Comparison of entire representations}$
The PDG showed better performance than the other representations, not only among the representations from Joern but also among all representations. The CPG also shows good performance: it took the second-longest time to compute but produced the second-best result. The lowest performance came from the ICFG from SVF, which also takes the longest time to produce results; none of its metrics were good, and three of them were below half. The other four representations showed similar accuracy, between 68% and 73%. On the other hand, the CFGs from both the Clang Static Analyzer and Joern cost less than 1,000 seconds; considering the total elapsed time, they can be efficient choices. Furthermore, it can be concluded that Joern is a suitable tool for extracting the program representations required for vulnerability detection.
4.2 Discussion
Syntactic and semantic information are both required to detect vulnerabilities. Therefore, the CPG, which includes both kinds of information, was expected to show the highest performance in the evaluation. However, the PDG showed the best performance even though it contains no syntactic information. Moreover, because control-flow (e.g., CFG, ICFG) and data-flow (e.g., VFG) graphs carry limited dependence information, their detection performance was lower than that of the PDG.
In addition, when a graph has multiple edge types and information is carried by the edges, that information does not fully fit the model used here, which cannot reflect edge-related information. To utilize edge information, a model that accepts various edge types, such as the gated graph sequence neural network [17], is recommended.
The evaluation results could also be affected by how accurately the vulnerabilities are sliced. In this study, a whole program containing both vulnerable and non-vulnerable parts was used as input to the model, so unrelated information (e.g., variable declarations) can be considered vulnerable. Consequently, the overall detection performance would increase if a program slicing technique were used to extract the vulnerable code from a program accurately. Moreover, the size of the graphs would be reduced, so the total execution time could be lower than in the current experiments.
Because the PDG contains both control- and data-dependence information, it might be suitable for learning all possible flows of a program. Based on this result, we conclude that dependence information is the most useful flow information for detecting vulnerabilities.
Table 2. Classification results of the experiments (Training set: 50%, Validation set: 15%, Test set: 35%).
5. Conclusion
Research on program analysis will become more important given the increasing need to detect vulnerabilities before the software release phase, and methods based on machine learning or deep learning will be required to utilize more data. In this regard, our study indicates which program representation is most suitable and most efficient when using GNNs for vulnerability detection. We conducted experiments to compare the effectiveness of each program representation for vulnerability detection. Among the seven graphs obtained from the three tools, the program dependence graph from Joern showed the best performance. From this result, we conclude that Joern is suitable for graph extraction, and that program slicing and model selection should be considered to increase the overall detection performance. Moreover, the results of the experiments can serve as a basis for other vulnerability detection research.
ACKNOWLEDGMENTS
This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2021R1I1A3043889).
REFERENCES
[1] Clang Static Analyzer, accessed July 2021.
[2] Flawfinder, accessed July 2021.
[3] Infer, accessed July 2021.
[4] National Vulnerability Database, Information Technology Laboratory, accessed July 2021.
[5] Yamaguchi, F., Golde, N., Arp, D., Rieck, K., 2014, Modeling and discovering vulnerabilities with code property graphs, 2014 IEEE Symposium on Security and Privacy, IEEE, pp. 590-604.
[6] Zhou, J., et al., 2020, Graph neural networks: A review of methods and applications, AI Open, Vol. 1, pp. 57-81.
[7] Hamilton, W. L., Ying, R., Leskovec, J., 2017, Inductive representation learning on large graphs, Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 1025-1035.
[8] Zhou, Y., et al., 2019, Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks, arXiv preprint arXiv:1909.03496.
[9] Cheng, X., et al., 2021, DeepWukong: Statically detecting software vulnerabilities using deep graph neural network, ACM Transactions on Software Engineering and Methodology (TOSEM), Vol. 30, No. 3, pp. 1-33.
[10] Juliet Test Suite for C/C++ Version 1.3, accessed July 2021.
[11] Ferrante, J., Ottenstein, K. J., Warren, J. D., 1987, The program dependence graph and its use in optimization, ACM Transactions on Programming Languages and Systems (TOPLAS), Vol. 9, No. 3, pp. 319-349.
[12] Steffen, B., Knoop, J., Rüthing, O., 1990, The value flow graph: A program representation for optimal program transformations, European Symposium on Programming, Springer, Berlin, Heidelberg, pp. 389-405.
[13] Joern, accessed July 2021.
[14] Sui, Y., Xue, J., 2016, SVF: Interprocedural static value-flow analysis in LLVM, Proceedings of the 25th International Conference on Compiler Construction.
[15] Le, Q., Mikolov, T., 2014, Distributed representations of sentences and documents, International Conference on Machine Learning (PMLR).
[16] Kipf, T. N., Welling, M., 2016, Semi-supervised classification with graph convolutional networks, arXiv preprint arXiv:1609.02907.
[17] Li, Y., et al., 2015, Gated graph sequence neural networks, arXiv preprint arXiv:1511.05493.
Author
Yoola Choi received her B.S. degree from the School of Computer Science and Engineering, Kyungpook National University (KNU), where she is pursuing an M.S. degree. Her research interests include graph neural networks, vulnerability detection, and static analysis.
Young-Woo Kwon is an associate professor at the School of Computer Science and Engineering, Kyungpook National University. He received his Ph.D. in Computer Science from Virginia Tech. His research interests span distributed systems, big data analytics, disaster ICT, and security.