
School of Computer Science and Engineering, Kyungpook National University, Daegu, Korea ({yura0812, ywkwon}@knu.ac.kr)



Keywords: Vulnerability detection, Graph neural networks, Static program analysis

1. Introduction

A vulnerability is a weakness that attackers can target, and it may cause serious problems. As the number of software vulnerabilities has increased significantly, many vulnerability detection techniques have been developed. Static analysis is one of the most frequently used approaches for discovering vulnerabilities before software release. Existing approaches [1-3] require human experts to define the patterns or strategies for each vulnerability. However, it is difficult to define all patterns this way, and new vulnerability patterns cannot be detected at all. According to the National Vulnerability Database (NVD) [4], 11,874 Common Vulnerabilities and Exposures (CVE) entries were reported in 2021 alone, more than 1,400 per month (Fig. 1). Because of these limitations of pattern-based vulnerability detection, recent studies have focused on machine learning and deep learning techniques to reduce manual effort and cover more vulnerabilities. In particular, some approaches using various program representations have shown better results than existing static analysis tools. Program representations are obtained from analysis results and capture details of a program, e.g., control flow, data flow, and program dependency.

Both syntactic and semantic information are needed for effective vulnerability discovery. Syntactic information is represented by the Abstract Syntax Tree (AST), and semantic information is represented by graphs such as the Control Flow Graph (CFG), Data Flow Graph (DFG), and Program Dependency Graph (PDG). To utilize both kinds of information, several representations must be combined, as in the Code Property Graph (CPG) [5]. This study compared program representations containing syntax only, semantics only, and both syntactic and semantic information. To this end, representations such as the AST, CFG, PDG, and CPG were extracted with the Clang Static Analyzer, Joern, and SVF.

A graph is non-Euclidean data, so it must be treated differently from Euclidean data such as images and text. Many existing models were designed for Euclidean data. Graph Neural Networks (GNN) [6] should therefore be used to prevent the loss of important graph information (e.g., the topological dependency information of the nodes) when running models. In this paper, we used GraphSAGE [7], a graph-convolutional network model that applies inductive learning to obtain better predictions on unseen data.

This paper suggests which program representation and which static analysis tool for graph extraction are suitable for vulnerability detection with a GNN. The effectiveness of each program representation for discovering vulnerabilities in source code with a GNN was compared experimentally. As a result, the PDG from Joern showed the best performance, with every metric above 99%. The contributions of this paper are as follows:

$\textbf{·}$ We compare the effectiveness of the program representations and of the static analysis tools used for graph extraction;

$\textbf{·}$ We describe how important the choice of program representation is for detecting vulnerabilities with Graph Neural Networks;

$\textbf{·}$ The results can be used by those who want to conduct vulnerability detection with deep learning.

The remainder of this article is organized as follows. The next section describes related work. The experiments section presents the dataset, graph extraction, and preprocessing, which are essential for running the models, as well as the graph neural network model used in the experiments. The performance evaluation section includes the evaluations and a discussion of the results. Finally, the last section concludes the paper.

Fig. 1. Number of CVEs in 2021 (From January to July).
../../Resources/ieie/IEIESPC.2021.10.6.477/fig1.png

2. Related Work

Detecting vulnerabilities with graph neural networks shows better results than traditional static analysis approaches. One proposal [8] extracts the AST, CFG, DFG, and NCS (Natural Code Sequence) with Joern and concatenates them into a single graph. After extraction, it applies a gated graph sequence neural network (GGNN) model to detect vulnerabilities. Another proposal [9] builds a PDG reflecting both control and data dependence with SVF and classifies vulnerable code with a deep graph neural network model.

The second proposal [9] does not compare representations; it only compares granularities and three GNN models. The first one [8] compares the performance of representations (AST, CFG, DFG, NCS, and a composite), but it uses only Joern, so it cannot show differences between static analysis tools. Therefore, we compare the performance of both the representations and the static analysis tools used to extract them.

3. Experiments

Experiments were conducted to analyze the performance of the program representations and static analysis tools. Fig. 2 provides an overview of the experiments. First, the representations were extracted from the dataset of vulnerable source code using three static analysis tools. Second, the representations were preprocessed to fit the GNN; preprocessing consists of vectorizing the node features and assigning a class label to each graph. Last, each preprocessed representation dataset was fed to the GraphSAGE model [7], and the vulnerability detection results were obtained.

Fig. 2. Overview of the classification methodology.
../../Resources/ieie/IEIESPC.2021.10.6.477/fig2.png

3.1 Dataset

Vulnerable and non-vulnerable code is needed to train a model that classifies source code as vulnerable or not. Therefore, we used the Juliet Test Suite v1.3 for C/C++ [10] as the dataset. The Juliet Test Suite is a set of test programs with vulnerabilities that provides both vulnerable and non-vulnerable code, so both can be obtained from a single dataset. Each test program has a single function containing the vulnerability (the bad function) and one or more functions similar in behavior but not vulnerable (the good functions). The dataset was therefore divided into two parts, bad functions and good functions, to ease labeling in the preprocessing step. The suite provides test cases for C/C++ and Java, but only the C/C++ cases were used in the experiments.
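As a concrete illustration, the bad/good separation can be automated by scanning for Juliet's function-naming convention (vulnerable functions carry a "bad" suffix, safe counterparts a "good" variant). The regex below is a simplified sketch, not a full C parser, and the function names in the sample are illustrative:

```python
import re

# Sketch: separate Juliet-style functions into "bad" (vulnerable) and
# "good" (patched) sets by name. Juliet marks vulnerable functions with
# a "bad" suffix and safe counterparts with "good" variants; this regex
# is a simplified assumption, not a full C parser.
FUNC_RE = re.compile(r"\bvoid\s+(\w*(?:bad|good)\w*)\s*\(", re.IGNORECASE)

def split_functions(source: str):
    """Return ([bad function names], [good function names])."""
    bad, good = [], []
    for name in FUNC_RE.findall(source):
        (bad if "bad" in name.lower() else good).append(name)
    return bad, good

sample = (
    "void CWE121_example_01_bad() {}\n"
    "static void goodG2B() {}\n"
)
```

In practice, such a script would be run over every `.c`/`.cpp` file in the suite, producing the two function sets used in the experiments.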

3.2 Graph Extraction

A static analysis tool is used to extract the program representations. First, it analyzes the source code and generates an abstract syntax tree; other representations (e.g., the control-flow graph) are then constructed from the tree. Using the static analysis tools, seven graphs of six types were extracted (AST, CFG, PDG, CPG, ICFG, and VFG), covering syntactic and semantic information. The following are brief descriptions of the graphs.

$\textbf{Abstract Syntax Tree (AST)}$ The AST shows the syntactic structure of the program as a tree. It keeps the necessary parts while excluding unnecessary ones, such as whitespace, semicolons, and comments. In an expression subtree, the operator is the root node and the operands are the leaf nodes. The shape of the AST depends on the compiler.

$\textbf{Control-Flow Graph (CFG)}$ The CFG represents all paths that can be traversed during program execution, describing the inner flow of a procedure in detail. The graph consists of four components (basic blocks, directed edges, an entry block, and an exit block). One basic block can contain more than one line of code, and the entry and exit blocks indicate the start and end points of the procedure.

$\textbf{Program Dependence Graph (PDG)}$ The PDG specifies both control dependence and data dependence. Control dependence denotes which parts of a program can control other parts [11], and data dependence denotes which parts of a program refer to the same memory locations as other parts. The PDG thus summarizes the dependences in a program.

$\textbf{Code Property Graph (CPG)}$ The CPG is a combination of three existing representations. Its nodes are based on the AST, and its edges have four types: AST, CFG, CDG, and DDG. Syntactic and semantic information are thus contained in one graph, so the CPG has recently been used in vulnerability detection research. Only Joern supports the CPG; generating it with other tools might require additional manual effort.

$\textbf{Inter-procedural Control-Flow Graph (ICFG)}$ While the CFG shows the control flow within a single procedure, the ICFG presents the entire control flow of a program with multiple procedures. It ties multiple CFGs into a single graph: its subgraphs are the per-procedure CFGs, and its additional edges are the call relationships between procedures.

$\textbf{Value-Flow Graph (VFG)}$ Data-flow analysis usually uses two types of algorithms, semantic and syntactic, to compute the corresponding equivalences. The VFG, in contrast, utilizes a syntactic algorithm to obtain semantic equivalence; that is, it indicates the semantic equivalence of terms [12].
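To make the combined-representation idea (as in the CPG) concrete, such a graph can be modeled as a directed multigraph whose edges carry their source representation as a `type` attribute. This sketch uses networkx with illustrative node IDs and edges; it is not Joern's actual CPG format:

```python
import networkx as nx

# Sketch: a CPG-like structure as a MultiDiGraph whose edges carry a
# "type" attribute (AST / CFG / PDG), so one graph holds both syntactic
# and semantic relations. Node IDs and edges here are illustrative.
cpg = nx.MultiDiGraph()
cpg.add_nodes_from([0, 1, 2, 3])      # e.g., AST nodes / statements
cpg.add_edge(0, 1, type="AST")        # syntactic parent-child
cpg.add_edge(1, 2, type="CFG")        # control flow
cpg.add_edge(1, 3, type="PDG")        # control/data dependence

def edges_of_type(g, kind):
    """Select only the edges belonging to one sub-representation."""
    return [(u, v) for u, v, d in g.edges(data=True) if d["type"] == kind]
```

Filtering by edge type recovers the individual representations from the combined graph, which is also how a model (or a comparison like ours) can isolate one kind of information.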

Three static analysis tools were chosen for graph extraction in the experiments. Table 1 lists which graphs were extracted with which tools.

Table 1. Program representations extracted from each static analysis tool.

    Static analysis tool    | Program representations
    ------------------------|------------------------
    Clang Static Analyzer   | CFG
    Joern                   | AST, CFG, PDG, CPG
    SVF                     | ICFG, VFG

$\textbf{Clang Static Analyzer [1]}$ The Clang Static Analyzer is part of the Clang compiler. It implements path-sensitive, inter-procedural analysis of C, C++, and Objective-C and provides the abstract syntax tree (AST), control-flow graph (CFG), and call graph. In this article, however, only the CFG was used; it was extracted from LLVM IR (intermediate representation).

$\textbf{Joern [13]}$ Joern is a platform for analyzing C and C++. Similar to Clang Static Analyzer, it provides AST, CFG, and program dependence graph (PDG). Uniquely, it produces a new graph representation called the code property graph (CPG), which is a combination of several existing graphs.

$\textbf{SVF [14]}$ SVF is a static analysis tool specialized in inter-procedural pointer analysis of C and C++. It supports the inter-procedural control-flow graph (ICFG) and value-flow graph (VFG). With SVF, representations can be obtained with semantic information but not with syntactic information. It also generates graphs from LLVM IR.

The performance of each graph for vulnerability detection was measured using graph neural networks. However, these graphs cannot be used directly as the input of a GNN; preprocessing is required to utilize them, as explained in detail in the next part.
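Before preprocessing, the exported graphs must first be loaded into a usable form. Assuming the tools export representations as DOT files (Joern, for instance, can emit DOT), the sketch below extracts the edge list; it handles only simple quoted "a" -> "b" edge lines, not the full DOT grammar:

```python
import re

# Minimal sketch: pull node pairs out of a DOT-formatted graph export
# so they can be handed to the preprocessing step. Only simple quoted
# edges of the form "a" -> "b" are recognized.
EDGE_RE = re.compile(r'"([^"]+)"\s*->\s*"([^"]+)"')

def parse_dot_edges(dot_text: str):
    """Return the list of (source, target) node-ID pairs."""
    return EDGE_RE.findall(dot_text)

dot = '''digraph g {
  "1" -> "2";
  "2" -> "3";
}'''
```

For production use, a real DOT parser (e.g., via pydot or networkx) would be preferable; the point here is only that each exported graph becomes an explicit edge list plus per-node feature text.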

3.3 Preprocessing

The preprocessing phase consists of two steps. The first is the vectorization of node features, which are necessary to train the GNN and generate its output; the GNN computes node embeddings from these features, and the embeddings are used for prediction. The second is labeling graphs with 0 (safe) or 1 (vulnerable). In the previous phase, the dataset was divided into bad and good functions, and the granularity of the data is the function level. Hence, labels were assigned per graph, not per node, so every node in a graph has the same label. Fig. 3 shows how the preprocessing step works.

$\textbf{Step 1: Vectorization of node features.}$ The GNN takes both the graphs and the node features as input. The node features are typically the statements or conditions composing each node, provided as vectors. The GNN computes an embedding of each node, and the final embeddings are used to predict the label of the graph. Doc2Vec [15] was used for vectorization; it is a natural language processing tool that produces fixed-length, low-dimensional vectors from the node features. Vectors of length 10 were used, since a node includes only a few lines of code for the small functions used in the experiments; this may change depending on the size of the functions or programs.

$\textbf{Step 2: Graph labeling.}$ In the experiments, binary classification was used to indicate whether a graph contains a vulnerability, so every graph needs a label specifying whether it is vulnerable. Because the dataset was already split into bad and good functions, graphs were labeled 0 for good functions and 1 for bad functions. A prediction of 0 means the graph is considered safe; otherwise, it is considered vulnerable.

Fig. 3. Preprocessing steps.
../../Resources/ieie/IEIESPC.2021.10.6.477/fig3.png

3.4 Graph Neural Network Model

Among the various graph neural network models, GraphSAGE [7] was chosen to classify the graphs. The model is based on the Graph Convolutional Network (GCN) [16]. The GCN uses transductive learning, so it cannot generalize to nodes not seen during training. To overcome this limitation, GraphSAGE applies inductive learning to produce embeddings for unseen nodes efficiently: instead of learning an embedding vector for each individual node, it generates embeddings by sampling and aggregating features from a node's neighbors. Fig. 4 shows the process of the model.
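The aggregation step can be sketched as follows: one GraphSAGE layer with the MEAN aggregator concatenates a node's own features with the mean of its neighbors' features, applies a weight matrix and a nonlinearity, and L2-normalizes the result. The weights and toy graph below are random stand-ins, not trained values:

```python
import numpy as np

# Sketch of one GraphSAGE layer with the MEAN aggregator.
def sage_layer(h, adj, W):
    """h: (n, d) node features; adj: node -> neighbor list; W: (2d, d_out)."""
    out = []
    for v in range(h.shape[0]):
        nbrs = adj.get(v, [])
        # MEAN-aggregate neighbor features (zeros if the node is isolated).
        agg = h[nbrs].mean(axis=0) if nbrs else np.zeros(h.shape[1])
        # Concatenate self and aggregated features, project, apply ReLU.
        z = np.concatenate([h[v], agg]) @ W
        out.append(np.maximum(z, 0.0))
    out = np.array(out)
    # L2-normalize each embedding, as in the GraphSAGE paper.
    norms = np.linalg.norm(out, axis=1, keepdims=True)
    return out / np.clip(norms, 1e-9, None)

rng = np.random.default_rng(0)
h = rng.normal(size=(4, 3))                  # 4 nodes, 3 input features
adj = {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]}
W = rng.normal(size=(6, 5))                  # 2*3 -> 5 output features
h1 = sage_layer(h, adj, W)
```

Stacking such layers and pooling the final node embeddings (e.g., by mean) yields the graph-level embedding used for classification.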

Supervised learning was used because all the data are labeled for classification; an additional simple test comparing supervised and unsupervised learning showed better results for supervised learning. Moreover, mini-batching was applied for efficient computation, and MEAN was used as the aggregation function. The dataset was split into three sets: 50% for training, 15% for validation, and 35% for testing.
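The 50/15/35 split can be sketched with a seeded shuffle (integer arithmetic keeps the set sizes exact); the items here are placeholder IDs rather than preprocessed graphs:

```python
import random

# Sketch of the 50% / 15% / 35% train/validation/test split used in
# the experiments. Items are placeholder IDs, not real graphs.
def split_dataset(items, seed=0):
    items = list(items)
    random.Random(seed).shuffle(items)   # deterministic, seeded shuffle
    n = len(items)
    n_train, n_val = n * 50 // 100, n * 15 // 100
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

train, val, test = split_dataset(range(100))
```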

Fig. 4. Left: input graph and sampled nodes, Right: process of aggregating features.
../../Resources/ieie/IEIESPC.2021.10.6.477/fig4.png

4. Performance Evaluation

4.1 Evaluation

The experiments were conducted under the same conditions to evaluate the performance of each program representation. First, the representations were compared per tool, and then all representations were compared together. For the comparison, the numbers of True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN) were counted. Four metrics were used for the evaluation: accuracy (ACC), precision (P), recall (R), and F1-score (F1). The total elapsed time was also recorded to check efficiency. Table 2 lists the classification results of the experiments; the results were rounded to two decimal places.

$\textbf{· Accuracy}$

      $\text{Accuracy}=\frac{TP+TN}{TP+TN+FP+FN}$

$\textbf{· Precision}$

      $\text{Precision}=\frac{TP}{TP+FP}$

$\textbf{· Recall}$

      $\text{Recall}=\frac{TP}{TP+FN}$

$\textbf{· F1 Score}$

      $\text{F1 Score}=2\times\frac{\text{Precision}\times\text{Recall}}{\text{Precision}+\text{Recall}}$
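The four metrics map directly to code; the confusion counts in the usage line are made-up numbers for illustration:

```python
# Compute the four evaluation metrics from the confusion counts.
def metrics(tp, tn, fp, fn):
    total = tp + tn + fp + fn
    acc = (tp + tn) / total          # accuracy
    p = tp / (tp + fp)               # precision
    r = tp / (tp + fn)               # recall
    f1 = 2 * p * r / (p + r)         # harmonic mean of P and R
    return acc, p, r, f1

# Illustrative counts only, not the paper's results.
acc, p, r, f1 = metrics(tp=90, tn=85, fp=10, fn=15)
```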

$\textbf{A. Comparison of representations according to the tool}$

$\textbf{Clang Static Analyzer}$ Using the Clang Static Analyzer, only the CFG was extracted. The analyzer also provides an AST, but its shape differs from the general form because it is used internally for analysis, not for generating a representation. The performance of the CFG was not very good; in particular, the precision was less than half, which means there were many false positives. On the other hand, it took less time than the other methods because its nodes consist of LLVM IR, which has an abbreviated form.

$\textbf{Joern}$ The PDG showed the best performance on all metrics but took the second most time. The AST, which contains only syntactic information, performed worst, with a precision of only 40%. The performance of the CFG and CPG was similar: the CFG performed better in precision and F1-score, while the CPG performed better in the other metrics (accuracy and recall). Nevertheless, considering the total elapsed time, the CPG takes approximately 8.5 times longer.

$\textbf{SVF}$ The VFG showed better performance than the ICFG. The ICFG has large nodes consisting of the node ID, IR, and other detailed information, so it needs more time to run the model. Even though it contains control-flow information within procedures and the call relationships between procedures, its performance was worse than that of the VFG, whose data-flow information is obtained syntactically.

$\textbf{B. Comparison of entire representations}$

The PDG showed better performance than the other representations, not only among those from Joern but among all representations. The CPG also showed good performance: it took the second most time but produced the second-best results. The lowest performance came from the ICFG from SVF, which also took the longest time to produce results; three of its metrics were below 50%. The other four representations showed similar accuracies, between 68% and 73%. On the other hand, both CFGs, from the Clang Static Analyzer and from Joern, cost less than 1,000 seconds, so considering the total elapsed time, the CFGs can be efficient choices. Furthermore, these results suggest that Joern is a suitable tool for extracting the program representations required for vulnerability detection.

4.2 Discussion

Syntactic and semantic information are both required to detect vulnerabilities, so we expected the CPG, which includes both, to show the highest performance in the evaluation. Instead, the PDG showed the best performance even though it contains no syntactic information. Moreover, because control-flow graphs (e.g., CFG, ICFG) and data-flow graphs (e.g., VFG) carry limited dependence information, their detection performance was lower than that of the PDG.

In addition, when a graph has multiple edge types and information is carried by the edges, that information is not fully exploited by the model used here, which cannot reflect edge information. To utilize edge information, a model that accepts various edge types, such as the gated graph sequence neural network [17], is recommended.

The evaluation results could improve if the vulnerabilities were sliced accurately. In this study, a whole program containing both vulnerable and non-vulnerable parts was used as input to the model, so unrelated information (e.g., variable declarations) can be treated as vulnerable. The overall detection performance would therefore increase if a program slicing technique were used to extract the vulnerable code from a program accurately. Moreover, the size of the graphs would be reduced, so the total execution time could be lower than in the current experiments.

Because the PDG contains both control- and data-dependence information, it may be the most suitable representation for learning all possible flows of a program. Based on this result, we conclude that dependence information is the most useful flow information for detecting vulnerabilities.

Table 2. Classification results of the experiments (Training set: 50%, Validation set: 15%, Test set: 35%).
../../Resources/ieie/IEIESPC.2021.10.6.477/tb1.png

5. Conclusion

Research on vulnerability analysis will become more important with the increasing need to detect vulnerabilities before the software release phase, and methods based on machine learning or deep learning will be required to utilize more data. In this regard, our study indicates which program representation is the most suitable and efficient when using a GNN for vulnerability detection. We conducted experiments to compare the effectiveness of each program representation. Among the seven graphs obtained from the three tools, the program dependency graph from Joern showed the best performance. From this result, we conclude that Joern is suitable for graph extraction, and that program slicing and model selection should be considered to increase the overall detection performance. Moreover, the results of these experiments can serve as a basis for other vulnerability detection research.

ACKNOWLEDGMENTS

This research was supported by Basic Science Research Program through the National Research Foundation of Korea(NRF) funded by the Ministry of Education (NRF-2021R1I1A3043889).

REFERENCES

[1] Clang Static Analyzer, accessed July 2021.
[2] Flawfinder, accessed July 2021.
[3] Infer, accessed July 2021.
[4] National Vulnerability Database, NIST Information Technology Laboratory, accessed July 2021.
[5] F. Yamaguchi, N. Golde, D. Arp, and K. Rieck, "Modeling and discovering vulnerabilities with code property graphs," 2014 IEEE Symposium on Security and Privacy, IEEE, pp. 590-604, 2014.
[6] J. Zhou et al., "Graph neural networks: A review of methods and applications," AI Open, vol. 1, pp. 57-81, 2020.
[7] W. L. Hamilton, R. Ying, and J. Leskovec, "Inductive representation learning on large graphs," Proc. 31st International Conference on Neural Information Processing Systems, pp. 1025-1035, 2017.
[8] Y. Zhou et al., "Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks," arXiv preprint arXiv:1909.03496, 2019.
[9] X. Cheng et al., "DeepWukong: Statically detecting software vulnerabilities using deep graph neural network," ACM Transactions on Software Engineering and Methodology (TOSEM), vol. 30, no. 3, pp. 1-33, 2021.
[10] Juliet Test Suite for C/C++ Version 1.3, accessed July 2021.
[11] J. Ferrante, K. J. Ottenstein, and J. D. Warren, "The program dependence graph and its use in optimization," ACM Transactions on Programming Languages and Systems (TOPLAS), vol. 9, no. 3, pp. 319-349, 1987.
[12] B. Steffen, J. Knoop, and O. Rüthing, "The value flow graph: A program representation for optimal program transformations," European Symposium on Programming, Springer, Berlin, Heidelberg, pp. 389-405, 1990.
[13] Joern, accessed July 2021.
[14] Y. Sui and J. Xue, "SVF: Interprocedural static value-flow analysis in LLVM," Proc. 25th International Conference on Compiler Construction, 2016.
[15] Q. Le and T. Mikolov, "Distributed representations of sentences and documents," International Conference on Machine Learning (PMLR), 2014.
[16] T. N. Kipf and M. Welling, "Semi-supervised classification with graph convolutional networks," arXiv preprint arXiv:1609.02907, 2016.
[17] Y. Li et al., "Gated graph sequence neural networks," arXiv preprint arXiv:1511.05493, 2015.

Author

Yoola Choi
../../Resources/ieie/IEIESPC.2021.10.6.477/au1.png

Yoola Choi received her B.S. degree from the School of Computer Science and Engineering, Kyungpook National University (KNU), where she is pursuing an M.S. degree. Her research interests include graph neural networks, vulnerability detection, and static analysis.

Young-Woo Kwon
../../Resources/ieie/IEIESPC.2021.10.6.477/au2.png

Young-Woo Kwon is an associate professor at the School of Computer Science and Engineering, Kyungpook National University. He received his Ph.D. in Computer Science from Virginia Tech. His research interests span distributed systems, big data analytics, disaster ICT, and security.