Kyeong-Min Lee 1, InA Kim 2, and Kyu-Chul Lee 3

1 Department of Computer Engineering, Chungnam National University, Daejeon, Korea (kminnn@gmail.com)
2 Department of Computer Engineering, Chungnam National University, Daejeon, Korea (dodary0214@gmail.com)
3 Department of Computer Engineering, Chungnam National University, Daejeon, Korea (kclee@cnu.ac.kr)
Copyright © The Institute of Electronics and Information Engineers (IEIE)
Keywords
Join order optimization, Deep reinforcement learning, Spark SQL
1. Introduction
Large amounts of data are generated and managed every day in various systems.
For example, in a smart grid system, thousands of petabytes of data are generated
annually, including data on power generation, transmission, distribution, and electricity
usage [1]. Fig. 1 shows a diagram of a Meter Data Management System (MDMS), which is used for smart
grids to collect, verify, aggregate, and analyze a large amount of data generated
from smart meters. It analyzes power loss, usage, and consumption patterns and provides remote control functions. Within the system, there is a steady stream of queries for data aggregation,
data analysis, or other tasks such as electricity usage inquiries. Such systems for
retrieving and analyzing large datasets require responses in real time, so it is necessary
to process queries quickly.
However, such queries must search through several input tables and involve various operations such as joins, selections, and aggregations. Executing a query that contains several join operations, which are high-cost operations, involves high computational complexity. Therefore, query optimization must be performed to reduce both the computational complexity and the query execution time.
The query optimization algorithms of many database engines select the plan with the lowest cost among the possible plans, which creates a trade-off between optimization complexity and query execution time. To mitigate this issue, recent query optimization studies have proposed using deep reinforcement learning models [2-5], [7], which recognize the current state and select the action that maximizes a reward among the selectable actions. Database tuning improves performance by modifying the database design and adjusting computational resources. Tuning is performed by a DBA or programmatically, but it is limited by heuristic or manual adjustment. To address this issue, one study applied reinforcement learning to database tuning to adjust parameters such as write size and memory size [6]. These studies demonstrated that using a deep reinforcement learning model can improve query performance compared to traditional database systems. However, such query optimization models have been developed on single-node database systems such as PostgreSQL, whereas, as mentioned, many systems such as smart grid systems issue multiple queries over large amounts of data, so a single-node environment is limited.
In our previous study, we proposed optimizing the join orders of queries on Spark SQL [9] using a Deep Q-Network (DQN) model, a deep reinforcement learning algorithm, in order to improve query performance for massive datasets [8]. However, the DQN algorithm does not work well for a large action space, so in this study, we use the Proximal Policy Optimization (PPO) algorithm [10], another deep reinforcement learning technique.
We propose optimizing the join orders of queries on Spark SQL by learning the optimized orders with a PPO algorithm. By learning the optimized join orders using the cost computation method of Spark SQL, the PPO-based join order optimization model follows the behavior of Spark SQL's join order optimization. This approach has two advantages. First, the model is able to find join plans with lower costs than the plans that Spark SQL finds, because Spark SQL's heuristics restrict it to a small search space. Second, the model does not need to run the join order planning performed in Spark SQL: it can derive join plans similar to Spark SQL's without analyzing statistical information about the many input tables. The dataset and queries used in the previous study were limited, so we present more generalized experiments with an extended dataset and query set.
This paper is organized as follows. Section 2 describes previous studies that applied deep reinforcement learning to the query optimization problem. Section 3 describes the query optimization of Spark SQL and the PPO-based optimization model
for the join orders of queries on Spark SQL. The results of comparing the PPO-based
optimization model with Spark SQL are provided in Section 4. Finally, we conclude
our study in Section 5.
2. Related Work
One method optimized the join orders of queries based on the costs of the query plans using policy gradient methods and presented a way of expressing queries as the environment of the model [2]. Another study presented a solution to problems arising from cost-based optimization and proposed a learning method for the complexity problem that occurs as the number of optimization factors increases [3]. State representation and cardinality prediction were addressed using deep learning and deep reinforcement learning algorithms [4].
A query optimization study was conducted using policy gradient and Q-learning methods, and a query representation method was presented [5]. Another method optimized join orders using deep Q-learning and expressed a query as the model environment using Tree-LSTM [7]. Furthermore, a database tuning method was proposed to find optimal parameter values, such as memory size and write size, for a given query workload using policy gradient methods [6]. In all of these studies, deep reinforcement learning was used to improve database tuning and query optimization [2-7]. However, they were based on single-node database systems and have limitations for large amounts of data.
We previously proposed a method to optimize the join orders of queries using DQN
on a distributed-parallel processing framework, Spark SQL, which is suitable for processing
large amounts of data [8]. We applied a previous method [2], which describes query information as vectors, to load the queries into the deep reinforcement learning model. The DQN-based model then learned the optimized join order using the costs derived from Spark SQL.
However, the dataset and queries used were too limited for the model to remain stable on other data and query sets, and the DQN-based model showed poor performance when we extended the size of the datasets and the number of queries. In this study, the proposed model
optimizes the join order of queries on Spark SQL, which is a distributed environment
database system based on Apache Spark. We trained the model using extended data and
query sets.
3. PPO-based Join Order Optimization
3.1 Query Optimization of Spark SQL
Spark SQL has a built-in query optimization function based on the Catalyst optimizer. Query processing in Spark SQL can be divided into four steps. The
first step is the analysis phase. In this step, the query requested by the user is
parsed to verify its validity and converted into a logical plan, which is a data structure
that processes the query in Spark SQL. The second step is the logical optimization
phase, where the converted query is optimized using pre-defined rules. Abstract optimizations are performed, such as filter pushdown, join reordering, and column pruning.
During the third step, physical planning, further optimization is performed using Spark SQL's strategies on the logically optimized plan from the previous step. Where an abstract optimization was applied to an operation, the physical planning stage selects the specific algorithm to use, such as a table scan method or a join algorithm. The fourth step is code generation. The query plan optimized by the previous three steps is converted into a Resilient Distributed Dataset (RDD), which is the data structure for job execution and fault tolerance in Spark. The converted RDD is partitioned across multiple nodes, each node processes its partition, and the query results are returned.
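As a concrete illustration (not taken from the paper), the following minimal PySpark sketch makes the first three stages visible for a toy query: explain(extended=True) prints the parsed and analyzed logical plans, the optimized logical plan, and the selected physical plan. The tables, data, and query are hypothetical.

```python
# Minimal PySpark sketch: inspecting Catalyst's plan stages for a query.
# The table names, data, and query are hypothetical illustrations.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("catalyst-plan-demo")
         .config("spark.sql.cbo.enabled", "true")  # enable cost-based optimization
         .getOrCreate())

# Register two small tables so the query below is valid.
spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"]).createOrReplaceTempView("t1")
spark.createDataFrame([(1, 10), (2, 20)], ["id", "score"]).createOrReplaceTempView("t2")

# explain(extended=True) prints the parsed and analyzed logical plans
# (analysis phase), the optimized logical plan (logical optimization),
# and the physical plan (physical planning).
spark.sql("""
    SELECT t1.val, t2.score
    FROM t1 JOIN t2 ON t1.id = t2.id
    WHERE t2.score > 10
""").explain(extended=True)
```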
Like other database systems, Spark SQL uses a static, heuristic cost model for query optimization. Therefore, it can miss possible query plans, especially join plans with lower costs than those it derives. Spark SQL also optimizes elements other than the join order, but determining the join order of a query occupies the largest part of the entire optimization process because of its high complexity.
The proposed PPO-based join order optimization model follows the behavior of Spark SQL. Because of the limitations mentioned above, the learning model is able to find join plans with lower costs than the plans that Spark SQL actually finds. Therefore, we replace the join order optimization process in Spark SQL with a trained model and use the cost computation method of Spark SQL to estimate the costs of the join plans generated by the model.
Fig. 2. Overall architecture.
Fig. 3. Example of converting the query information into the states of the model.
3.2 PPO-based Join Order Optimization for Spark SQL
The join order optimization model based on the cost computation of Spark SQL
is shown in Fig. 2. The overall process of the model is as follows. First, to optimize the join orders
of queries using the deep reinforcement learning algorithm, each query to be optimized
needs to be initialized as the environment of the model so that it can be used as a network input. We use a method similar to a previous one [2]. The states and actions are defined using clause information in the query. The state
is constructed using three characteristics of the query. First, the relation information
of the query is defined as a relation state. The relation vector is expressed in the
form of N by N, where N is the number of tables in the database schema to be learned.
An example is shown in Fig. 3. The upper portion of Fig. 3(a) shows the state in which no join has occurred, while the lower portion represents
the updated state after joining the relations A and C. Second, the join operation
information is defined as a join predicate state, shown in Fig. 3(b), which represents the join conditions between relations included in the input query. This state also serves as a vector for identifying the join actions between relations that are possible in a given state. Like the relation state, it is expressed in the form of N by N: an entry is 1 if the query contains the corresponding join condition and 0 otherwise.
Third, the information of selection operations is defined as a selection state.
The state indicates the selection operations in the ``where'' clause of the input
query. The selection state is expressed in the form of 1 by M, where M is the number
of attributes in the database schema. This is shown in Fig. 3(c). If an attribute is used in the selection operation in the query, it is represented
as 1. Otherwise, it is represented as 0.
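To make the encoding concrete, the following NumPy sketch builds the three state vectors for a hypothetical three-table schema and an invented query. The exact matrix layout used in [2], [8] may differ, so this is an illustration of the idea rather than the actual implementation.

```python
import numpy as np

# Hypothetical schema: 3 tables (A, B, C) with M = 5 attributes in total;
# the query below is an invented example, not one from the paper.
N, M = 3, 5
tables = {"A": 0, "B": 1, "C": 2}
attrs = {"A.x": 0, "A.y": 1, "B.x": 2, "C.y": 3, "C.z": 4}

relation_state = np.zeros((N, N))    # Fig. 3(a): no join has occurred yet
join_pred_state = np.zeros((N, N))   # Fig. 3(b): join conditions in the query
selection_state = np.zeros((1, M))   # Fig. 3(c): attributes in the WHERE clause

# Example query: SELECT ... FROM A, C WHERE A.y = C.y AND C.z > 10
i, j = tables["A"], tables["C"]
join_pred_state[i, j] = join_pred_state[j, i] = 1  # join condition A-C exists
selection_state[0, attrs["C.z"]] = 1               # C.z used in a selection

# Selecting the action "join A with C" updates the relation state to the
# lower portion of Fig. 3(a): A and C now form one joined relation.
relation_state[i, j] = relation_state[j, i] = 1
```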
An action of the model is one that can be taken in a specific relation state; in this paper, it refers to performing a certain join operation between relations. Zero or more actions may be selectable in a given state. When an action is selected, the relation state vector is updated, because selecting an action means that a certain join operation has been performed.
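Continuing the earlier NumPy sketch, the following illustrative helpers show one way the selectable actions could be enumerated and the relation state updated. The enumeration rule used here is an assumption for illustration, not the paper's exact procedure.

```python
def valid_actions(relation_state, join_pred_state):
    """Enumerate joins (i, j) allowed in the current state (sketch).

    Illustrative rule: a join action exists where the query has a join
    condition that the current plan has not applied yet.
    """
    n = relation_state.shape[0]
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if join_pred_state[i, j] == 1 and relation_state[i, j] == 0]

def apply_action(relation_state, action):
    """Update the relation state after performing the selected join."""
    i, j = action
    relation_state[i, j] = relation_state[j, i] = 1
    return relation_state
```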
In addition, we define the reward. To address the limitation of a previous reward function [8], we redefine the reward by adding one condition. We define $C_{\min}$ as the minimum cost observed during the current execution of the model. When a query plan is executed, its cost $C$ is returned. If the model has generated its first query plan, $C_{\min}$ is set to $C$. Otherwise, $C$ is compared with $C_{\min}$, and if $C$ is lower than $C_{\min}$, $C_{\min}$ is updated to $C$. $C$ is then converted into a reward. If $C$ is 1000 times $C_{\min}$ or more, the plan is judged to have a large cost, and the reward is $-1$. Otherwise, the reciprocal is taken and used as the reward.
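A minimal sketch of this reward rule is shown below. The paper does not state what the reciprocal is taken of; we assume the cost ratio $C/C_{\min}$, so that a plan matching the best cost observed so far receives a reward of 1.

```python
def compute_reward(cost, c_min=None):
    """Reward rule from Section 3.2 (sketch).

    Assumption: the 'reciprocal' is taken of the cost ratio cost / c_min,
    so a plan that matches the best cost seen so far earns reward 1.0;
    the paper does not spell out the denominator.
    """
    if c_min is None or cost < c_min:
        c_min = cost                 # first plan, or a new minimum cost
    if cost >= 1000 * c_min:
        reward = -1.0                # judged to have a large cost
    else:
        reward = c_min / cost        # reciprocal of the cost ratio, in (0.001, 1]
    return reward, c_min
```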
The join order optimization model proposed in this paper is trained as shown
in Fig. 4. First, the PPO-based join order optimization model initializes the state in the
form of vectors from the input query. The model considers the initialized state, lists the possible actions, which are join operations, according to the conditions of the current state, and selects one. The state is then updated according to the selected action. With the updated state vector, the model again lists a set of actions, selects the optimal action, and repeats the same process until it reaches the terminal state, where there are no more selectable actions. Finally, the optimized join order is derived. Spark SQL then executes the input query according to the final join plan generated by the model and returns its cost. In this study, we train the model by converting these costs into rewards.
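The following Python sketch summarizes one training episode of this loop. The environment and agent interfaces (reset, valid_actions, step, select_action, and so on) are hypothetical names introduced for illustration, and compute_reward refers to the earlier reward sketch.

```python
# Hypothetical environment/agent interface mirroring the loop in Fig. 4;
# the method names are illustrative only, not the paper's implementation.
def run_episode(env, agent, spark_cost_fn, c_min=None):
    state = env.reset()                  # vectorize the input query (Fig. 3)
    trajectory = []
    while True:
        actions = env.valid_actions()    # joins allowed in the current state
        if not actions:                  # terminal state: nothing left to join
            break
        action = agent.select_action(state, actions)  # sample from PPO policy
        trajectory.append((state, action))
        state = env.step(action)         # update the relation state vector
    cost = spark_cost_fn(env.join_plan())  # Spark SQL estimates the final plan
    reward, c_min = compute_reward(cost, c_min)  # see the earlier reward sketch
    agent.store(trajectory, reward)      # used later for the PPO update
    return c_min
```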
In our model, we apply the PPO algorithm instead of the DQN algorithm [8]. A DQN-based model is trained to maximize the Q-value. In addition to join ordering, there are various factors in query optimization, such as physical operations and index selection. As the factors to be considered increase, the optimization problem becomes more complicated, and the action space grows. In general, the DQN algorithm does not work well for such complex problems. Although we consider only join ordering in this study, a DQN-based model would be difficult to extend to more complicated optimization cases. For this reason, we used the PPO algorithm.
Fig. 4. Example of the overall training process.
Fig. 5. Cost changes according to the number of training cycles of the model for query 8 with 10 GB.
Fig. 6. Cost changes according to the number of training cycles of the model for query 8 with 100 GB.
Fig. 7. Cost changes according to the number of training cycles of the model for query 8 with 300 GB.
4. Performance Evaluation
We conducted a performance evaluation using the TPC-H Benchmark dataset and query
set, which are widely used as database performance indicators. We generated 10 GB,
100 GB, and 300 GB datasets of TPC-H, which consists of 8 input tables. For the query
set, TPC-H provides 22 queries. Since the proposed model optimizes join orders, we picked the 11 queries that each contain at least two join operations. A total of 13 servers were set up. One master server was used to train the PPO-based
model, and the other 12 slave servers executed queries according to the generated
join orders in parallel. Each server has the same hardware specifications (Intel Xeon
E5-2620 v4 @ 2.7 GHz, 64 GB of RAM, and a 1 TB HDD). Spark 2.4.3 (which contains Spark SQL 2.1.1) was installed on each server. The PPO-based model has two hidden layers,
an Adam optimizer, and a ReLU activation function. We developed our model for Spark
SQL using previous code [11].
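As a rough sketch of the stated configuration, the following PyTorch code defines a policy network with two hidden layers, ReLU activations, and an Adam optimizer. The hidden width, learning rate, action-space layout, and the attribute count M are assumptions made for illustration; the actual model was built on the ReJOIN code [11].

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Two hidden layers with ReLU, as stated; the width is an assumption."""
    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        # Returns a categorical distribution over candidate join actions.
        return torch.distributions.Categorical(logits=self.net(state))

# State: two N x N matrices plus one 1 x M vector, concatenated (Sec. 3.2).
# TPC-H has N = 8 tables; M = 61 is our count of its attributes (assumption).
N, M = 8, 61
policy = PolicyNet(state_dim=2 * N * N + M, n_actions=N * N)
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)  # lr is an assumption
```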
The PPO-based model was trained 500 times for each of the 11 queries using data
sizes of 10 GB, 100 GB, and 300 GB. Fig. 5 shows the changes in the costs of the join plans generated by the model over 500 training cycles for TPC-H query 8 with the 10 GB dataset. The model mainly selects join plans with widely varying costs until it has been trained 250 times. After that, it mainly selects join plans with low costs, similar to the performance of Spark SQL.
Similar training results were observed for the other TPC-H queries. We also conducted
the same experiments for query 8 using different data sizes of 100 GB and 300 GB,
as shown in Figs. 6 and 7. On the extended datasets, the model tends to converge to
certain costs after it has been trained 300 times and 400 times, respectively. After that, similar to the experiment with the 10 GB dataset, it mainly selects join plans with low costs.
Fig. 8 shows a comparison of the costs of the join plans generated by Spark SQL and the
PPO model for 11 queries. For queries 2, 3, and 5, the join plans generated by the
model have lower costs than the join plans of Spark SQL. As mentioned at the end of Section 3.1, this is because the model finds additional join plans that Spark SQL cannot find due to its static, heuristic planning algorithm. For the
remaining eight queries, the graph shows that the join plans generated by the model
have the same cost as the join plans derived by Spark SQL.
Fig. 8. Performance comparison of Spark SQL and the PPO-based model for 11 queries for 10 GB, 100 GB, and 300 GB.
5. Conclusion
In this paper, we proposed a learned model for optimizing the join orders of queries
on Spark SQL using a PPO algorithm to improve the query evaluation performance for
large amounts of data. The model follows the join order optimization mechanism of
Spark SQL because it is trained with the costs computed by Spark SQL. As a result,
the model optimizes the join order of queries without actually executing the optimization
algorithm. In addition, the model is able to generate join plans with lower costs
than the plans that Spark SQL finds because of Spark SQL’s static and heuristic algorithm.
Experiments demonstrated that the model gradually selected the join plans with
low costs as the number of training cycles increased. After the model was trained
a sufficient number of times, it selected join plans with costs equal to or lower
than the costs generated by Spark SQL. However, the cost computation method of Spark SQL that the model uses estimates join plans from simple statistics of the input tables, so it can be inaccurate. In other words, join plans with low estimated costs can actually have very long execution times. Therefore, the model should be trained using the actual execution times of the generated join plans on Spark SQL as the reward, rather than the estimated costs. In future work, we plan to improve the performance of the model by training it with actual execution times on Spark SQL.
ACKNOWLEDGMENTS
This work was supported by a research fund of Chungnam National University.
REFERENCES
[1] S.-Y. Jeong et al., "Efficient Network Administration for Smart Grid Data Center," in 2020 22nd International Conference on Advanced Communication Technology (ICACT), 2020, pp. 48-51.
[2] R. Marcus et al., "Deep reinforcement learning for join order enumeration," in Proceedings of the First International Workshop on Exploiting Artificial Intelligence Techniques for Data Management, 2018, pp. 1-4.
[3] R. Marcus et al., "Towards a hands-free query optimizer through deep learning," arXiv preprint arXiv:1809.10212, 2018.
[4] J. Ortiz et al., "Learning state representations for query optimization with deep reinforcement learning," in Proceedings of the Second Workshop on Data Management for End-to-End Machine Learning, 2018, pp. 1-4.
[5] J. Heitz et al., "Join Query Optimization with Deep Reinforcement Learning Algorithms," arXiv preprint arXiv:1911.11689, 2019.
[6] G. Li et al., "QTune: A query-aware database tuning system with deep reinforcement learning," Proceedings of the VLDB Endowment, vol. 12, no. 12, pp. 2118-2130, 2019.
[7] X. Yu et al., "Reinforcement learning with Tree-LSTM for join order selection," in 2020 IEEE 36th International Conference on Data Engineering (ICDE), 2020, pp. 1297-1308.
[8] K.-M. Lee et al., "DQN-based Join Order Optimization of Spark SQL in Smart Grid," in Summer Annual Conference of IEIE, 2020, pp. 1919-1821.
[9] M. Armbrust et al., "Spark SQL: Relational data processing in Spark," in Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, 2015.
[10] J. Schulman et al., "Proximal policy optimization algorithms," arXiv preprint arXiv:1707.06347, 2017.
[11] ReJOIN implementation, https://github.com/antonismand/ReJOIN
Author
Kyeong-Min Lee received his B.S. degree in computer engineering from Chungnam National
University, Korea, in 2019. He is now an M.S. candidate at Chungnam National University.
His main research interests include reinforcement learning, deep learning, distributed
processing, and big data.
InA Kim received her B.S. and M.S. degrees in computer engineering from Chungnam
National University, Korea, in 2014 and 2016, respectively. Since 2016, she has worked as a researcher at the Software Research Center (SOREC) at Chungnam National University. Her main
research interests include database systems and parallel processing for big data.
Kyu-Chul Lee received his B.S., M.S., and Ph.D. degrees in computer engineering from Seoul National University in 1984, 1986, and 1996, respectively. In 1994, he worked
as a visiting researcher at the IBM Almaden Research Center, San Jose, California.
From 1995 to 1996, he worked as a Visiting Professor at the CASE Center at Syracuse
University, Syracuse, New York. He is currently a professor in the Department of Computer
Engineering at Chungnam National University, Daejeon, Korea. His current research areas include database systems, the semantic web, big data processing, and artificial intelligence.
He has published over 100 technical articles in various journals and conferences.
He is a member of ACM, the IEEE Computer Society, and the Korea Information Science
Society.