PPO-based Join Order Optimization for Spark SQL

  1. Department of Computer Engineering, Chungnam National University, Daejeon, Korea
  2. Department of Computer Engineering, Chungnam National University, Daejeon, Korea
  3. Department of Computer Engineering, Chungnam National University, Daejeon, Korea

Keywords: Join order optimization, deep reinforcement learning, Spark SQL

1. Introduction

Large amounts of data are generated and managed every day in various systems. For example, in a smart grid system, thousands of petabytes of data are generated annually, including data on power generation, transmission, distribution, and electricity usage [1]. Fig. 1 shows a diagram of a Meter Data Management System (MDMS), which is used for smart grids to collect, verify, aggregate, and analyze a large amount of data generated from smart meters. It analyzes power loss, usage, and consumption patterns and provides remote control. Within the system, there is a steady stream of queries for data aggregation, data analysis, or other tasks such as electricity usage inquiries. Such systems for retrieving and analyzing large datasets require responses in real time, so it is necessary to process queries quickly.

However, such queries must search through several input tables and include various operations such as joins, selections, and aggregations. Executing a query that contains several join operations, which are high-cost operations, involves high computational complexity. Therefore, query optimization must be performed to reduce both the computational complexity and the query execution time.

Query optimization algorithms of many database engines select the plan with the lowest cost among the possible plans. They face a trade-off between query optimization complexity and query execution time. To mitigate this issue, recent query optimization studies have proposed using deep reinforcement learning models [2-5], [7], which recognize the current state and select the action that maximizes a reward among the selectable actions. Database tuning improves performance by modifying the database design and adjusting computational resources. Tuning is performed by the DBA or programmatically but is limited by heuristic or manual adjustment. To address this issue, one study applied reinforcement learning to database tuning to adjust parameters such as the write size and memory size [6]. These studies demonstrated that a deep reinforcement learning model can improve query performance compared to traditional database systems. However, such query optimization models have been developed on single-node database systems such as PostgreSQL. As mentioned, many systems such as smart grid systems issue multiple queries over large amounts of data, so processing in a single-node environment is limited.

In our previous study, we proposed optimizing the join orders of queries based on Spark SQL [9] using a Deep Q-Network (DQN) model in order to improve query performance for massive datasets [8]. DQN is a deep reinforcement learning algorithm. However, the DQN algorithm does not work well for a large action space, so in this study, we use the Proximal Policy Optimization (PPO) algorithm [10], which is also a deep reinforcement learning algorithm.

We propose optimizing the join orders of queries on Spark SQL by learning the optimized orders with a PPO algorithm. By learning the optimized join orders using the cost computation method of Spark SQL, the PPO-based join order optimization model follows the behavior of Spark SQL's join order optimization. This has two advantages. First, the model is able to find join plans with lower costs than the plans that Spark SQL finds, because Spark SQL explores only a small search space due to its heuristics. Second, the model does not need to run Spark SQL's own join-order planning for a query: it can derive join plans similar to Spark SQL's without analyzing statistical information about the many input tables. The dataset and queries used in the previous study were limited, so more generalized experiments are presented here with an extended dataset and query set.

This paper is organized as follows. Section 2 describes previous studies that applied deep reinforcement learning to the query optimization problem. Section 3 describes the query optimization of Spark SQL and the PPO-based optimization model for the join orders of queries on Spark SQL. The results of comparing the PPO-based optimization model with Spark SQL are provided in Section 4. Finally, we conclude our study in Section 5.

Fig. 1. MDMS system.

2. Related Work

One optimization method was proposed for the join orders of queries based on the costs of query plans using policy gradient methods, together with a method of expressing queries as the environment of the model [2]. Another study presented a solution to the problems arising from cost-based optimization [3]. A learning method was proposed for the complexity problem that occurs as the number of optimization factors increases. In further work, state representation and cardinality prediction were addressed using deep learning and deep reinforcement learning algorithms [4].

A query optimization study was conducted using policy gradient methods and Q-learning methods, along with a query expression method [5]. A method of optimizing join orders was proposed using deep Q-learning, with a method of expressing a query as a model environment using Tree-LSTM [7]. Furthermore, a database tuning method was proposed to find optimal parameter values such as the memory size, write size, and query workload using policy gradient methods [6]. Deep reinforcement learning has thus been used to improve database tuning and query optimization [2-7]. However, these studies were based on single-node database systems and have limitations for large amounts of data.

We previously proposed a method to optimize the join orders of queries using DQN on a distributed-parallel processing framework, Spark SQL, which is suitable for processing large amounts of data [8]. We applied a previous method [2] to load the information of queries into the deep reinforcement learning model. The method describes the information of queries as vectors. The DQN-based model then learned the optimized join order using the costs derived from Spark SQL.

However, the dataset and queries were too limited for the model to remain stable on other data and query sets. The DQN-based model showed poor performance when we extended the size of the datasets and the number of queries. In this study, the proposed model optimizes the join orders of queries on Spark SQL, a distributed database system based on Apache Spark. We trained the model using extended data and query sets.

3. PPO-based Join Order Optimization

3.1 Query Optimization of Spark SQL

Spark SQL has a built-in query optimization function based on the Catalyst Optimizer. The process of handling queries in Spark SQL can be divided into four steps. The first step is the analysis phase. In this step, the query requested by the user is parsed to verify its validity and converted into a logical plan, which is the data structure Spark SQL uses to process the query. The second step is the logical optimization phase, where the converted query is optimized using pre-defined rules. Abstract optimization is performed through rules such as filter pushdown, join reordering, and column pruning.

During the physical planning step, which is the third step, further optimization is performed using the strategies of Spark SQL on the logically optimized plan from the previous step. Where abstract optimization was performed for an operation, the physical planning stage selects a specific algorithm, such as a table scan method or a join algorithm, to use when the operation is executed. The fourth step is the code generation step. The query plan optimized by the previous three steps is converted into a Resilient Distributed Dataset (RDD), the data structure for job execution and fault tolerance in Spark. The converted RDD is partitioned across multiple nodes, each node processes its partition, and the query results are returned.

Like other database systems, Spark SQL uses a static and heuristic cost model for query optimization. Therefore, it can miss other possible query plans, particularly join plans with lower costs than those derived by Spark SQL. Spark SQL also optimizes other elements besides the join orders, but determining the join orders of a query accounts for the largest part of the entire optimization process because of its high complexity.

The proposed PPO-based join order optimization model follows the behavior of Spark SQL. Because of the limitations mentioned above, the learning model is able to find join plans with lower costs than the plans that Spark SQL actually finds. Therefore, we replace the join order optimization process in Spark SQL with a trained model and use the cost computation method of Spark SQL to evaluate the join plans generated by the model.

Fig. 2. Overall architecture.
Fig. 3. Example of converting the query information into the states of the model.

3.2 PPO-based Join Order Optimization for Spark SQL

The join order optimization model based on the cost computation of Spark SQL is shown in Fig. 2. The overall process of the model is as follows. First, to optimize the join orders of queries using the deep reinforcement learning algorithm, each query to be optimized needs to be initialized as the environment of the model so that it can be used as a network input. We use a method similar to a previous one [2]. The states and actions are defined using clause information in the query. The state is constructed from three characteristics of the query. First, the relation information of the query is defined as a relation state. The relation vector is expressed in the form of N by N, where N is the number of tables in the database schema to be learned.

An example is shown in Fig. 3. The upper portion of Fig. 3(a) shows the state in which no join has occurred, while the lower portion represents the updated state after joining the relations A and C. Second, the join operation information is defined as a join predicate state in Fig. 3(b) and represents a join condition between relations included in the input query. In addition, this state is a vector used to distinguish join actions between relations that are possible in a certain state. Like the relation state, it is expressed in the form of N by N. The operations contained in the query are represented as either 1 or 0.

Third, the information of selection operations is defined as a selection state. The state indicates the selection operations in the ``where'' clause of the input query. The selection state is expressed in the form of 1 by M, where M is the number of attributes in the database schema. This is shown in Fig. 3(c). If an attribute is used in the selection operation in the query, it is represented as 1. Otherwise, it is represented as 0.

An action of the model is a join operation that can be taken in a specific relation state; in this paper, it refers to performing a certain join between relations. There are zero or more actions that can be selected in a given state. When one of the actions is selected, the relation state vector is updated, because this means that a certain join operation has been performed.
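As a concrete illustration of the three state vectors and the action-driven update described above, the following is a minimal sketch for a toy schema. All sizes, variable names, and helper functions are hypothetical and not taken from the paper's implementation:

```python
# Hypothetical sketch of the three state vectors for a toy schema of
# N = 4 tables (A, B, C, D) and M = 6 attributes. Names and helpers
# are illustrative, not from the paper's implementation.

N = 4  # number of tables in the schema
M = 6  # number of attributes in the schema

def empty_state():
    relation = [[0] * N for _ in range(N)]   # N-by-N relation state
    predicate = [[0] * N for _ in range(N)]  # N-by-N join predicate state
    selection = [0] * M                      # 1-by-M selection state
    return relation, predicate, selection

def mark_join_predicate(predicate, i, j):
    # the query contains a join condition between tables i and j
    predicate[i][j] = predicate[j][i] = 1

def apply_join(relation, i, j):
    # selecting the action "join tables i and j" updates the relation state
    relation[i][j] = relation[j][i] = 1

relation, predicate, selection = empty_state()
mark_join_predicate(predicate, 0, 2)  # e.g., a join predicate between A and C
selection[1] = 1                      # e.g., a WHERE filter on attribute 1
apply_join(relation, 0, 2)            # the model chooses to join A and C
```

This mirrors the example of Fig. 3(a): the relation state starts with no joins, and selecting the join between A and C flips the corresponding entries to 1.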

In addition, we define the reward. In order to address the limitation of the previous reward function [8], we redefine the reward by adding one condition. We define the minimum cost $C_{\min }$ generated during the current execution of the model. When a query plan is executed, its cost $\textit{C}$ is returned. If the model generates its first query plan, $C_{\min }$ is set to $\textit{C}$. Otherwise, $\textit{C}$ is compared with $C_{\min }$, and if $\textit{C}$ is lower than $C_{\min }$, $C_{\min }$ is updated to $\textit{C}$. $\textit{C}$ is then converted into a reward. If $\textit{C}$ is 1000 times $C_{\min }$ or more, the plan is judged to have a large cost, and a reward of $-1$ is given. Otherwise, the model takes the reciprocal and uses it as the reward.
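The reward rule above can be sketched as follows. The text does not fully specify which quantity the reciprocal is taken of; this sketch assumes the reward is $C_{\min }/C$, which lies in $(0, 1]$ and is largest for plans near the best cost seen so far:

```python
# A hedged sketch of the reward rule. The exact form of the "reciprocal" is
# an assumption here: we use C_min / C as the reward for acceptable plans.

def reward(cost, c_min):
    """Return (reward, updated c_min) for a plan with cost `cost`."""
    if c_min is None:            # first plan of the run: C_min := C
        c_min = cost
    c_min = min(c_min, cost)     # keep the minimum cost seen so far
    if cost >= 1000 * c_min:     # excessively expensive plan
        return -1.0, c_min
    return c_min / cost, c_min   # assumed reciprocal form, in (0, 1]

r1, c_min = reward(500.0, None)     # first plan: c_min = 500, reward = 1.0
r2, c_min = reward(2000.0, c_min)   # 4x the minimum: reward = 0.25
```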

The join order optimization model proposed in this paper is trained as shown in Fig. 4. First, the PPO-based join order optimization model initializes the state in the form of vectors from the input query. The model considers the initialized state and lists possible actions, which are the join operations, according to the conditions of the current state. After that, the state is updated according to the selected action. Under the condition that the state vector is updated, the model again lists a set of actions. Next, it selects the optimized action and repeats the same process until the model reaches the terminal state where there are no more selectable actions. Finally, a final optimized join order is derived. Then, Spark SQL executes the input query according to the final join plan generated by the model and returns its cost. In this study, we train the model by converting these costs into rewards.
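The episode loop of Fig. 4 can be sketched as follows, with a random stand-in for the PPO policy network. The action-legality rule and all names here are illustrative assumptions, not the paper's implementation:

```python
# Illustrative episode loop: enumerate legal join actions from the current
# state, let the policy pick one, update the state, and stop at the terminal
# state where no actions remain. `policy` is a stand-in for the PPO network.
import random

def legal_actions(joined, predicate, n):
    # assumption: a join (i, j) is legal if the query has a predicate
    # between tables i and j and the pair has not been joined yet
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if predicate[i][j] and not joined[i][j]]

def run_episode(predicate, n, policy=random.choice):
    joined = [[0] * n for _ in range(n)]
    order = []
    while True:
        actions = legal_actions(joined, predicate, n)
        if not actions:            # terminal state: the join order is complete
            return order
        i, j = policy(actions)     # the PPO policy would select an action here
        joined[i][j] = joined[j][i] = 1
        order.append((i, j))
```

In training, the completed join order would be handed to Spark SQL, its cost returned, and the cost converted into the reward described above.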

In our model, we apply the PPO algorithm, not the DQN algorithm [8]. The DQN algorithm-based model is trained to maximize the Q-value. In addition to ordering joins, there are various factors for query optimization, such as physical operations and index selection. If the factors to be considered increase, the optimization problem becomes more complicated, and its action space increases. In general, the DQN algorithm does not work well for such complex problems. Even though we consider the join ordering in this study, if we use the DQN algorithm for this problem, the model will be difficult to extend to more complicated optimization cases. For this reason, we used the PPO algorithm.
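For reference, the defining feature of the PPO algorithm [10] is its clipped surrogate objective, which limits how far each update can move the policy from the previous one:

```latex
L^{\mathrm{CLIP}}(\theta)
  = \hat{\mathbb{E}}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\
    \operatorname{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
```

where $\hat{A}_t$ is the advantage estimate at step $t$ and $\epsilon$ is the clipping range. This clipping keeps policy updates stable, which is part of why PPO scales better than DQN to larger action spaces.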

Fig. 4. Example of the overall training process.
Fig. 5. Cost changes according to the number of training cycles of the model for query 8 with 10 GB.
Fig. 6. Cost changes according to the number of training cycles of the model for query 8 with 100 GB.
Fig. 7. Cost changes according to the number of training cycles of the model for query 8 with 300 GB.

4. Performance Evaluation

We conducted a performance evaluation using the TPC-H benchmark dataset and query set, which are widely used as database performance indicators. We generated 10 GB, 100 GB, and 300 GB datasets of TPC-H, which consists of 8 input tables. For the query set, there are 22 queries in TPC-H. The proposed model optimizes the join orders, so we picked 11 queries that each contain at least two join operations.

A total of 13 servers were set up. One master server was used to train the PPO-based model, and the other 12 slave servers executed queries according to the generated join orders in parallel. Each server has the same hardware specifications (Intel Xeon E5-2620 v4@2.7GHz, 64 GB of RAM, and 1 TB HDD). Spark 2.4.3 (which contains Spark SQL 2.1.1) is installed on each server. The PPO-based model has two hidden layers, an Adam optimizer, and a ReLU activation function. We developed our model for Spark SQL using previous code [11].
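The shape of the policy network described above (two hidden layers with ReLU activations) can be sketched as a plain forward pass. The layer sizes are illustrative assumptions; the paper does not specify them, and a real implementation would train the weights with the Adam optimizer rather than leave them random:

```python
# A minimal forward-pass sketch of a two-hidden-layer ReLU policy network
# that maps a state vector to a probability distribution over join actions.
# All sizes and weights here are illustrative placeholders.
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    e = np.exp(x - x.max())      # shift for numerical stability
    return e / e.sum()

STATE_DIM, HIDDEN, N_ACTIONS = 38, 64, 28  # illustrative sizes
W1 = rng.normal(0.0, 0.1, (STATE_DIM, HIDDEN))
W2 = rng.normal(0.0, 0.1, (HIDDEN, HIDDEN))
W3 = rng.normal(0.0, 0.1, (HIDDEN, N_ACTIONS))

def policy(state):
    h1 = relu(state @ W1)
    h2 = relu(h1 @ W2)
    return softmax(h2 @ W3)      # probabilities over candidate join actions

probs = policy(rng.normal(size=STATE_DIM))
```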

The PPO-based model was trained 500 times for each of the 11 queries using data sizes of 10 GB, 100 GB, and 300 GB. Fig. 5 shows the changes of the costs for the join plans generated by the model as the model is trained 500 times for TPC-H query 8 with the 10 GB dataset. The model mainly selects join plans with various costs until it is trained 250 times. However, after the model is trained 250 times, it mainly selects join plans with low costs, which is similar to the performance of Spark SQL.

Similar training results were observed for the other TPC-H queries. We also conducted the same experiments for query 8 using the larger data sizes of 100 GB and 300 GB, as shown in Figs. 6 and 7. On the extended datasets, the model tends to converge to certain costs after it is trained 300 times and 400 times, respectively. After that, similar to the experiment with the 10 GB dataset, the model mainly selects join plans with low costs.

Fig. 8 shows a comparison of the costs of the join plans generated by Spark SQL and the PPO model for 11 queries. For queries 2, 3, and 5, the join plans generated by the model have lower costs than the join plans of Spark SQL. As we mentioned in the last part of Section 3.1, the reason why several join plans of the model have lower costs than the plans of Spark SQL is that the model finds additional join plans that Spark SQL cannot find because it uses a static and heuristic planning algorithm. For the remaining eight queries, the graph shows that the join plans generated by the model have the same cost as the join plans derived by Spark SQL.

Fig. 8. Performance comparison of Spark SQL and the PPO-based model for 11 queries for 10 GB, 100 GB, and 300 GB.

5. Conclusion

In this paper, we proposed a learned model for optimizing the join orders of queries on Spark SQL using a PPO algorithm to improve the query evaluation performance for large amounts of data. The model follows the join order optimization mechanism of Spark SQL by using the costs computed by Spark SQL to train the model. As a result, the model optimizes the join order of queries without actually executing the optimization algorithm. In addition, the model is able to generate join plans with lower costs than the plans that Spark SQL finds because of Spark SQL’s static and heuristic algorithm.

Experiments demonstrated that the model gradually selected join plans with low costs as the number of training cycles increased. After the model was trained a sufficient number of times, it selected join plans with costs equal to or lower than those generated by Spark SQL. However, the cost computation method of Spark SQL that the model uses estimates join plans from simple statistics of the input tables, so it can be inaccurate. In other words, join plans with low estimated costs can actually have very long execution times. Therefore, the model should be trained using the actual execution times of the generated join plans on Spark SQL as the reward, rather than the estimated costs. In future work, we plan to improve the performance of the model by training it with actual execution times on Spark SQL.


This work was supported by research fund of Chungnam National University.


[1] Jeong S-Y, et al., 2020, Efficient Network Administration for Smart Grid Data Center, in 2020 22nd International Conference on Advanced Communication Technology (ICACT), pp. 48-51.
[2] Marcus R., et al., 2018, Deep reinforcement learning for join order enumeration, in Proceedings of the First International Workshop on Exploiting Artificial Intelligence Techniques for Data Management, pp. 1-4.
[3] Marcus R., et al., 2018, Towards a hands-free query optimizer through deep learning, arXiv preprint arXiv:1809.10212.
[4] Ortiz J., et al., 2018, Learning state representations for query optimization with deep reinforcement learning, in Proceedings of the Second Workshop on Data Management for End-To-End Machine Learning, pp. 1-4.
[5] Heitz, et al., 2019, Join Query Optimization with Deep Reinforcement Learning Algorithms, arXiv preprint arXiv:1911.11689.
[6] Li G., et al., 2019, QTune: A query-aware database tuning system with deep reinforcement learning, Proceedings of the VLDB Endowment, 12(12), pp. 2118-2130.
[7] Yu X., et al., 2020, Reinforcement learning with tree-LSTM for join order selection, in 2020 IEEE 36th International Conference on Data Engineering (ICDE), pp. 1297-1308.
[8] Lee K-M, et al., 2020, DQN-based Join Order Optimization of Spark SQL in Smart Grid, Summer Annual Conference of IEIE, pp. 1919-1821.
[9] Armbrust M., et al., 2015, Spark SQL: Relational data processing in Spark, in Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data.
[10] Schulman J., et al., 2017, Proximal policy optimization algorithms, arXiv preprint arXiv:1707.06347.


Kyeong-Min Lee

Kyeong-Min Lee received his B.S. degree in computer engineering from Chungnam National University, Korea, in 2019. He is now an M.S. candidate at Chungnam National University. His main research interests include reinforcement learning, deep learning, distributed processing, and big data.

InA Kim

InA Kim received her B.S. and M.S. degrees in computer engineering from Chungnam National University, Korea, in 2014 and 2016, respectively. Since 2016, she has worked as a researcher at the Software Research Center (SOREC) at Chungnam National University. Her main research interests include database systems and parallel processing for big data.

Kyu-Chul Lee

Kyu-Chul Lee received his B.S., M.S., and Ph.D. degrees in computer engineering from Seoul National University in 1984, 1986, and 1996, respectively. In 1994, he worked as a visiting researcher at the IBM Almaden Research Center, San Jose, California. From 1995 to 1996, he worked as a visiting professor at the CASE Center at Syracuse University, Syracuse, New York. He is currently a professor in the Department of Computer Engineering at Chungnam National University, Daejeon, Korea. His current research areas include database systems, the semantic web, big data processing, and artificial intelligence. He has published over 100 technical articles in various journals and conferences. He is a member of ACM, the IEEE Computer Society, and the Korea Information Science Society.