Mobile QR Code

1. (IoT and Big Data Research Center, Incheon National University, Yeonsu-gu, Incheon, 22012, Korea {ranjannavin07, sovit198}@gmail.com )
2. (Department of Electronics Engineering, Incheon National University, Yeonsu-gu, Incheon, 22012, Korea {kyc0288, hoon}@inu.ac.kr )

Q-learning, Mobile routing, Traveling salesman problem, Dijkstra algorithm, Illegal parking, Edge, Computing, Networks

## 1. Introduction

Route planning in a network is an optimization problem that has been studied in applied mathematics and computer science for decades. This problem is computationally expensive and has been solved using many different algorithms proposed, but providing fast and reliable solutions to this optimization problem is still a challenging task [1]. Route planning plays a critical role in a variety of applications, such as packet transfer in wireless sensor networks [2-6], routing in 5G and 6G networks [7-9], and vehicle routing [10-14]. Several researchers have developed and proposed many exact, heuristic, and metaheuristic algorithms to solve the route-planning optimization problem. In particular, these algorithms have their respective pros and cons. For example, an exact algorithm, such as the Dijkstra algorithm, computes every possible solution and chooses the best among them. However, even though this algorithm provides the exact solution, it tends to suffer from high time-complexity even while route planning in a fairly small network [14-16]. However, since time complexity is a major factor for real-time route planning in a network, it is much better to look for an approximate solution instead of this exact solution. Therefore, the heuristic and metaheuristic algorithms, such as the local search algorithm, ant colony optimization, particle swarm optimization, etc., are preferred in solving the route-planning optimization problem. However, these algorithms also suffer from limitations. For instance, the heuristic algorithms, in many cases, are unable to generate the best solution to this optimization problem.

Recently, machine learning (ML) algorithms have become popular in various fields due to these algorithms’ ability to handle large-data and data non-linearities and the availability of cheap and high computational power [17-19]. Reinforcement learning (RL) is the major class of machine learning and has been inspired by the behavior learning of animals. The RL algorithms seem to perform well in partially observed environments or high-dimensional spaces, when an approximate solution to any complex problem may be of greater value than an exact/best solution, which is possible only to any simple problem, in these environments/spaces. The RL-based algorithms can solve the route-planning optimization problem in applications like packet transfer in a wireless network, vehicle route planning in a transportation network, signal routing in a communication network, navigation through a maze, etc., which involve large networks, and the RL algorithms like Q-learning give the best solution to this problem regardless of the environment.

The advantages of Q-learning, such as its off-policy and temporal-difference learnings, inspired us to propose the DQSR algorithms (one for each stage). The DQSR is based on two Q-learning agents: the first stage’s Q-learning agent develops the shortest path between each customer and this customer’s nearest neighbor customers in the network and finds this path’s length and the second stage’s Q-learning agent finds the shortest route from any customer to the other customers by virtually visiting each other customer only once and returning to the starting customer in a route that has the shortest distance. Hence, both Q-learning agents share the responsibility of achieving the shortest route, and hence the DQSR achieves this route relatively fast compared to the exact, some heuristic, or the single-stage, Q-learning based algorithms. In addition, the concept of nearest neighbor customers cannot be used in the exact and heuristic algorithms for the shortest-route planning/generation. In essence, the main contributions of this paper are summarized as follows:

· We developed a dual-stage, Q-learning based, shortest-route generation (DQSR) and its two algorithms.

· We implemented the idea of calculating the shortest path and distance between each customer and this customer’s nearest neighbor customers for the shortest-route generation.

· Our extensive experiments on the DQSR based, illegal- parking surveillance demonstrate the efficiency of the DQSR compared to two other algorithms used for the same purpose, in terms of time complexity and shortest-route generation.

We described the background of our research in this section. The rest of the paper is structured as follows. Section 2 presents the works related to our research. Section 3 presents the methodology of our research, including the background on reinforcement learning, problem statement, and the DQSR algorithms. Section 4 presents the data set used in the experiment, algorithms compared with the DQSR algorithms, simulation parameters, the case study results, and discussion. Finally, Section 5 concludes and provides the future direction for this research.

## 2. Related Work

Route planning is one of the most important functions of an intelligent transportation system. It is widely used in various areas, such as daily travel route planning [20-22], package delivery service or indoor warehouse management [23-25], and traffic congestion reduction [26-28]. The route planning also saves a lot of time and money for the transport management agencies. In recent years, many academic works have been devoted to solving the route-planning optimization problem with multiple locations to be connected by the route, in a relatively low time. Some of the recent academic works have also been handling route planning under multiple objectives.

Nowadays, many algorithms find the shortest path between nodes/locations in a network, with widespread applications. For example, an exact algorithm based on traditional graph-based methods, such as the Dijkstra algorithm, is used to find this shortest path in a network under multiple destinations [29]. Notably, the multi-destination Dijkstra algorithm (MDDA) is the recent algorithm to find the multi-destination shortest path in a network. In particular, the MDDA uses a Fibonacci heap [30,31]. Meanwhile, Lu et al. [32] proposed a node combination method to implement the Dijkstra algorithm in finding the shortest path in a network by following three simple iterative steps: find the nearest neighbor node of the source node, combine the nearest neighbor node with the source node, and modify the weights on the edge connecting the source and nearest neighbor nodes. However, the node combination method is not efficient or faster than the MDDA. Likewise, Orlin et al. [33] proposed an algorithm to efficiently find the shortest path in a network, which is applicable even to a very sparse network or under a small number of positive integer weights. Further, Jin et al. [34] presented the modified MDDA to find the shortest path in a time-schedule network. In particular, they considered the network constraints on its edges to find this shortest path. Also, Asaduzzaman et al. [35] proposed a modified Dijkstra algorithm to find the shortest path to multiple destinations in a network by using a minimum number of weights. Interestingly, Hart et al. [36] proposed an A* algorithm, which is an exact algorithm for incorporating information from an optimization problem to a graphical analysis performed to find a minimum-cost path in a network. Similarly, Ticha et al. [37] found the multi-destination, bi-objective shortest path using an A* search strategy. In particular, their search strategy involves choosing some part of the path for a non-dominated path, in each iteration, to arrive at one of the destination nodes. However, the last two exact and heuristic algorithms and alike have a serious issue of time complexity and have a polynomial run-time according to the graph size. Meanwhile, Bast et al. [38] developed an algorithmic approach for fast routing in a road network having transit nodes, reducing the quickest-path queries in this road network into a small number of table lookups reducing the query time significantly. Their algorithm is more than one million times faster than the best-known algorithm for routing in a general network. Finally, Shen et al. [39] improved the shortest-path databases of a network by (i) contracting the input graph before preprocessing and (ii) limiting the preprocessing step to only a selected subset of the graph nodes.

Vehicle routing or vehicle-route planning is the generalization of the traveling salesman problem (TSP). Meanwhile, involving only the nearest neighbors, a greedy and very simple strategy, allows the best node/vertex to join the solution. But, this strategy is inefficient in solving the TSP. Bentley [40] achieved an improved run-time in finding the approximate solutions to large-geometric TSPs. In addition, Clarke et al. [41] presented an iterative procedure that enables the rapid selection of an optimum or near optimum route for a fleet of trucks (with varying capacities) from a central depot to several delivery points. Recently, new features are being brought by the ML approaches, and researchers have been coupling ML features with the well-known heuristic concepts, introducing new hybrid algorithms. For example, Mele et al. [42] presented a new, constructive heuristics driven by the ML for solving the TSP. This solution’s algorithm enables a reasonable generalization and unveils an efficient balance between the ML and optimization techniques. Likewise, Zhang et al. [43] proposed deep reinforcement learning for the TSP, with time windows and rejections. Their deep RL showed superior performance compared to the Tabu search heuristics, in terms of the solution quality and inference computation time. Further, Reinelt [44] provided the TSP library (TSPLIB) for researchers, with a broad set of test problems from various sources and with various properties. Finally, the arc routing or Chinese Postman problem [45,46] presents a process for finding a least-cost way to traverse each arc of a network at least once and return to the node/vertex from which the agent starts.

## 3. Methodology

### 3.1 Reinforcement Learning

Reinforcement learning (RL) is a computational approach to understanding and automating goal-directed learning and decision-making. It is about taking suitable action to maximize the reward in a particular situation. The RL differs from other computational approaches, such as supervised learning. In particular, the RL agent learns by interacting with its environment with/without prior knowledge of this environment or without relying on examples. Also, the RL is based on the Markov decision processes (MDPs).

An MDP is a discrete-time stochastic control process providing the mathematical framework to model the decision-making in a situation where the outcomes are partly random and partly under the decision-maker’s control. The MDPs are structured from a finite set containing the state-space ($S$), action-space ($A$), state transition probability matrix ($P$), and reinforcement or reward ($R$). In particular, it is the tuple$\left(S,A,P_{a},R_{a}\right)$. In addition, a state $\left(s\in S\right)~$is the representation of the environment at a specific time step. Further, an action $\left(a\in A\right)$ is one in a set of the possible actions in the state$s\,.$ And, the state transition probability, $P_{a}\left(s,s'\right)=$ $P\left(s_{t+1}=s'~ |s_{t}=s,a_{t}=a\right)$, gives the probability that the action$~ a~$in the state$~ s~$at the time$~ t~$will lead to the state $s'$at the time$~ t+1$. Finally, the reinforcement or reward, $R_{a}\left(s,s'\right),$ is the immediate reward received after transitioning from the state$~ s$ to the state $s'$.

The RL agent interacts with the environment in a sequence of steps at the time t: the agent (i) receives a representation of the environment, i.e., the state$~ s$ (ii) selects the action$~ a$ by using the ${\varepsilon}$-greedy$~$selection (explained below) and executes this action (iii) receives the immediate reward signal $R_{a}\left(s,s'\right)$ (iv) updates the learning matrix (v) observes the new state$~ s'$of the environment.

The goal of the RL agent is to learn a policy$~ \left(\pi \right)$, i.e., to learn to map from the state space ($S$) to the action space ($A$) in a way to maximize the numerical reward signal. Therefore, the RL agent is not explicitly programmed of its action in a state. Instead, the RL agent learns through a trial and error procedure the action yielding the most reward in a state. In addition, the RL agent also follows the exploration and exploitation strategies during its learning to find the most rewarding action in a state. The exploration strategy allows the RL agent to improve the agent’s current knowledge about each state-action pair by taking a random action from the action list, instead of taking the best possible action, in a state. On the other hand, the exploitation strategy makes the RL agent choose the best action from the actions list to get the most reward by exploiting the agent’s current state-action value (knowledge). However, a balance between the exploration and exploitation strategies is essential and accomplished by the ${\varepsilon}$-greedy selection or the Boltzmann selection. The ${\varepsilon}$-greedy selection’s parameter $\varepsilon \left(0<\varepsilon <1\right)$ is defined and the decision policy $\pi \left(s\right)~$is applied according to the following equation.

##### (1)
$$$\pi \left(s\right)=\left\{\begin{array}{ll} a^{*}, & with\,\,the\,\,probability1-\varepsilon \\ a^{r}, & with\,\,the\,\,probability\,\,\varepsilon \end{array}\right.$$$

where $~ a^{\mathrm{*}}~$and $a^{r}~$refer to the best and random actions selected (under exploitation and exploration), respectively, in the current state$~ s$.

The Q-learning is one of the off-policy RL algorithms in which the agent learns the best state-action value function independent of the policy being followed. In particular, the Q-learning is based on the temporal-difference learning (TD), i.e., the value estimates are updated incrementally as the agent interacts with the environment, without waiting for the final outcome. In particular, the Q-learning’s state-action value is updated as,

##### (2)
$Q_{t+1}\left(s,a\right)\leftarrow Q_{t}\left(s,a\right)+\alpha \left[r\left(s,a\right)+\gamma \max _{a'}Q\left(s',a'\right)-Q_{t}\left(s,a\right)\right]$,

where $Q_{t}\left(s,a\right)$ and$Q_{t+1}\left(s,a\right)$ are the state-action value (at the time $t$) and the updated state-action value (at the time $t+1)$ of the state-action pair $\left(s,a\right),$respectively. In addition, $r\left(s,a\right)$ is the immediate reward for executing the action $a~$ in the state$s,\alpha \in \left[0,1\right]$ is the learning rate, $\gamma \in \left[0,1\right]$ is the discount factor, and $\max _{a'}Q\left(s',a'\right)$ is the best Q-value (state-action value) in the state$s'$for the action $a'$.

### 3.2 Problem Statement

Let the road network be represented by a directed graph, $G=\left(V,E\right)$, where $V$ is the set of nodes, $E$ is the set of edges (road segments), and each edge connects two nodes in the road network. The length of the edge $e\in E$ is denoted by $l_{e}\in \mathrm{~ \mathbb{R}}.$ Let us consider that the RL agent has to visit ‘n’ customers’ respective locations or edges, $D=\left\{d_{1},d_{2},\ldots ,d_{n}\right\}\in E$, exactly once before returning to the starting node$~ \left(d\in D\right)$. Here, an edge containing a customer is considered a node to perform the DQSR. Notably, the second stage of the DQSR is a classic example of the TSP.

### 3.3 The Dual-stage, Q-learning based, Shortest-route generation (DQSR) Algorithms

#### 3.3.1 The Q-learning based, Shortest-path Generation Algorithm

The calculation of the shortest path between two nodes is one of the classic problems in graph theory. The DQSR uses the Global Positioning System (GPS) location, i.e., the latitude and the longitude, of all the customers to calculate each customer’s all ‘$x$’ nearest neighbor customers’ respective locations. Then, it uses the Q-learning-based, shortest-path generation algorithm (Algorithm 1) to generate the shortest path between each customer and this customer’s nearest neighbor customers and find this path’s length.

The inputs to the Algorithm 1 are given in line 1 of the algorithm. In particular, the RL agent requires the road network represented by a directed graph, $G\left(V,E\right)$, road network length graph, i.e., the road network with the each of its edge having the corresponding length, list of customers to visit (in terms of their corresponding edges), each customer’s nearest neighbor customers list, ‘n’ Q shortest-path tables \{$Q_{\colon 1},Q_{\colon 2},\ldots ,Q_{\colon n}\}$to store the updated state-action value for each of the ‘n’ customers, respectively, distance table $\left(D_{Route}\right)~$to store the length of the shortest path between each customer and this customer’s nearest neighbor customers, path table $\left(P_{Route}\right)~$to store the shortest path between each customer and this customer’s nearest neighbor customers, and maximum number of random walks the RL agent can take before terminating the ‘while’ loop. Notably, the Q shortest-path tables each have the same state transition probability matrix as that of the road network. In contrast, the distance and path tables each have the same state transition probability matrix as that of each customer and this customer’s nearest neighbor customers, i.e., with size $n\times x.$ The $i^{th}$ Q shortest-path table, $Q_{\colon i}\,,$ stores the Q-value update for the path from the $i^{th}$ customer’s all ‘$x$’ nearest neighbor customers to the $i^{th}$ customer. In particular, the $Q_{\colon i}$ is used to calculate the shortest path and distance from the $i^{th}$ customer’s all ‘$x$’ nearest neighbor customers to the $i^{th}$ customer; here ‘$\colon$’ represents the list of the $i^{th}$ customer’s nearest neighbor customers. Also, the output of the algorithm includes the updated path and distance tables. In addition, the algorithm requires specifying the learning rate, discount factor, initial epsilon value, epsilon decay rate, and minimum epsilon.

The working of Algorithm 1 is shown in lines 4 to 24 of the algorithm. The first ‘for loop’ loops through all the customers to visit, as mentioned in line 4 of the algorithm. The RL agent copies the $i^{th}$ customer’s Q shortest-path table and the road network length graph as the reward table with the sign of its each edge length changed to negative, in lines 5 and 6 of the algorithm, respectively. The second ‘for loop’ loops through the $i^{th}$ customer’s all ‘$x$’ nearest neighbor customers, as mentioned in line 7 of the algorithm. Since the goal of the RL agent is to reach the $i^{th}$ customer’s nodes, we manipulate the reward table by adding extra positive reward to each edge that is connected to any of these nodes, as in line 8 of the algorithm. This extra reward enables the RL agent to learn effectively. Q shortest-path table update for the path from the $i^{th}$ customer’s all ‘x’ nearest neighbor customers to the $i^{th}$ customer is presented in lines 7 to 20 of the algorithm. First, the RL agent starts its functioning from the assigned current state and random-walks one step at a time until its termination according to the ‘while’ condition given in line 11 of the algorithm. In other words, the RL agent’s random walking is terminated either when the number of agent random walks completed exceeds the maximum number of random walks the agent can take (max_steps) or when the agent reaches the end state. Meanwhile, the RL agent chooses the action ($a$) at the current state ($s$) (as mentioned in line 12 of the algorithm), based on the $\varepsilon$-greedy selection, which enables the agent to either explore or exploit depending on the epsilon value. The RL agent then performs the chosen action and moves to the next state ($s')$, as in line 13 of the algorithm. Subsequently, the RL agent updates the previous Q-value, according to Eq. (2), as in line 14 of the algorithm. Further, the RL agent checks if the state $s'$ is the end state and changes the termination parameter (goal)’s value accordingly, as in lines 15 and 16 of the algorithm, respectively. Next, the RL agent replaces the current state with the next state and decrements the max_steps. The epsilon value is then updated after the termination of the RL agent’s path search, as in lines 18 and 19 of the algorithm. The entire process of Q-value update, as in lines 10 to 19 of the algorithm, is repeated $T$ (an user-defined number) times by the ‘for loop’ (in line 10 of the algorithm). These $T~$repetitions allow the RL agent to learn the correct shortest-path from the $i^{th}$ customer’s all ‘$x$’nearest neighbor customers to the $i^{th}$ customer. The RL agent then calculates the path and distance from the $i^{th}$ customer’s all ‘$x$’nearest neighbor customers to the $i^{th}$ customer, in line 20 of the algorithm, based on the Algorithm 3 (given subsequently in this paper). Finally, the RL agent updates the $i^{th}$ component of the distance and path tables (in lines 22 and 23 of the algorithm, respectively).

An example of the Q-learning based generation of the shortest path from each customer’s nearest neighbor customers to this customer is shown in Figs. 1(a)-(d). In particular, Fig. 1(a) shows an arbitrary road network with five customers. Each customer’s starting and end nodes are represented by ‘green’ and ‘light blue’ color dots, respectively. The RL agent then updates the shortest path from each customer’s nearest neighbor customers to this customer by visiting each customer once and returing to the starting customer.

The process of updating the Q shortest-path table for each customer is shown in Fig. 1(b). Fig. 1(b) also shows the list of the respective Q shortest-path table updates for all the five customers. In particular, the first image in Fig. 1(b) shows the learned Q-value for the path from the respective end node of all the customer $d_{1}$’s nearest neighbor customers to $d_{1}.$ In addition, each edge’s width represents this edge’s Q-value compared to other edges. Since, the Q-value grows when reaching the destination, keeping the Q shortest-path table update for the path from each customer’s nearest neighbor customers to this customer in one Q shortest-path table is possible and doesn’t contradict when exploiting the Q shortest-path table to generate the shortest path. Fig. 1(c) shows each customer to this customer’s each nearest neighbor customer transitional distance-table. Similarly, Fig. 1(d) shows each customer to this customer’s each nearest neighbor customer transitional path-table. Notably, the paths in this table are reversed when the states are reversed.

#### 3.3.2 The Q-learning based, Shortest-route Generation Algorithm

Algorithm 2 is for the Q-learning based generation of the shortest route from any customer to other customers by visiting each other customer once and returning back to the starting customer. The inputs to the algorithm are given in line 1 of the algorithm. The RL agent requires the Q-routing table having each customer and this customer’s nearest neighbor customers’ state transition, the distance routing and path routing tables from the Algorithm 1, the reward table having its state transitions same as those of the Q-routing table with the initial reward value as -1, the list of customers to visit, the maximum number of random walks for the RL agent, the total number of episodes, and the starting customer. The output of the algorithm includes the shortest route and distance. The algorithm also requires specifying the learning rate, discount factor, initial epsilon value, epsilon decay rate, and minimum epsilon.

The working of Algorithm 2 is shown in lines 4 to 38 of the algorithm. The algorithm consists of five subparts: the global variable setup, episode variable setup, Q-value update, the global variable’s update by comparing the episode variables, and route and route distance generation. The Q-routing and reward tables are copied as the best Q-routing and reward tables, as in line 4 of the algorithm. Then, the global variable for the number of visited customers and the distance covered, respectively, is initialized, as in line 5 of the algorithm.

##### Algorithm 3. Path/route generation and the corresponding distance calculation.

The ‘for loop’ loops through all the available episodes, as shown in line 8 of the algorithm. The episode variables are updated in each episode, as in lines 7 to 11 of the algorithm. The Q-value update during each loop is similar to the Q-value update of Algorithm 1 and is shown in lines 12 to 26 of the algorithm. The RL agent starts its functioning from the assigned current state and random-walks one step at a time until its termination according to the ‘while’ condition given in line 12 of the algorithm. The RL agent is terminated either when its completed number of random walks exceeds the maximum number of random walks the agent can take (max_steps) or when the agent reaches the end state. The agent then chooses an action ($a$) at the current state ($s$), based on the $\varepsilon$-greedy selection which enables the agent to either explore or exploit based on the epsilon value, as in line 13, and selects the next state, as given in line 14 of the algorithm. Subsequently, the episode reward table is updated, as in lines 15 to 22 of the algorithm. A negative reward is assigned between the current and next states if the RL agent visits all the customers and ends at a different customer than the starting customer, as in lines 15 and 16 of the algorithm. In addition, the reward table is updated with a negative reward if the RL agent visits the same customer more than once, as in lines 19 and 20 of the algorithm. On the other hand, a positive reward is assigned if the RL agent visits the starting customer’s each other customer once and returns to the starting customer, as in lines 17 and 18 of the algorithm. Then, the next to visit customer is removed from the to-visit customers list, as in lines 21 and 22 of the algorithm. The RL agent subsequently updates the previous Q-value, based on the current and next customer states and using Eq. (2), as in line 23 of the algorithm. Also, the RL agent replaces the current state with the next state and decrements the max_steps, as in line 26 of the algorithm. Finally, the number of customer locations visited (VD) by the RL agent is counted after the Q-learning process is terminated. Further, the global route and global route distance variables are calculated based on the VD, as in lines 28 to 35 of the algorithm. If the VD is equal to the previous global_visited_path, no changes in the best reward and the Q-routing tables are made. Otherwise, if the VD is greater than the previous global_visited_path, the length of the global route distance is set to negative infinity, and the best reward and Q-routing tables are replaced by the episode reward and Q-routing tables, respectively, as in lines 29 to 31 of the algorithm. This step is followed by assigning a global_visited_path equal to the VD. Further, if the episode route distance (episode_distance) is greater than the global route distance, then the global route distance is replaced by the episode route distance, and the best reward and Q-routing tables are replaced by the episode reward and Q-routing tables, respectively, as in lines 33 to 35 of the algorithm. The epsilon value is then updated after the termination of the RL agent’s route search, as in lines 36 and 37 of the algorithm. Finally, the RL agent generates the shortest route and distance after the Q-value update for all the possible episodes is completed, based on Algorithm 3. An example of the shortest route generation is shown in Figs. 1(e) and 1(f). Fig. 1(e) shows the Q-value transition between each customer and this customer’s nearest neighbor customers. Subsequently, the shortest route from any customer to other customers is generated based on Algorithm 3 andthe final Q-routing table from Fig. 1(e).

Algorithm 3 is for the path/route generation and the corresponding distance calculation. The inputs to the algorithm are the updated Q-value table, starting and ending nodes, and distance table (the road length graph for the path generation and the distance routing table for the route generation), as in line 1 of the algorithm. The output of the algorithm is the path/route generated and its corresponding distance, as in line 2 of the algorithm. The working of the algorithm is shown in lines 3 to 14 of the algorithm. The current state, path, and distance are assigned their initial values in lines 3 to 5 of the algorithm. In addition, the variable ‘goal’ is set to False initially, as in line 6 of the algorithm. Notably, the RL agent calculates the path/route according to lines 7 to 13 of the algorithm. In particular, the RL agent goes through the ‘while loop’ until the terminating condition is met. The RL agent then calculates the next state from the current state by using the Q-value table through exploitation, as in lines 8 and 9 of the algorithm. Subsequently, the path and distance are updated, as in lines 10 and 11 of the algorithm. In addition, the RL agent checks if the next state is the end state during each ‘while loop’, as in line 12 of the algorithm. If the next state is the end state, the variable ‘goal’ is set to ‘True’, as in line 13 of the algorithm, and the ‘while loop’ terminates. Finally, the path/route and its corresponding distance are returned by the algorithm.

## 4. Experiments

### 4.1 Dataset

We train the RL agent under the DQSR algorithms and evaluate the DQSR’s effectiveness by conducting a case study on the DQSR-based, illegal-parking surveillance in Yeonsu-gu, Incheon, South Korea. Two types of data were required for this case study: (i) the road network with the road information of the Yeonsu-gu region and (ii) the illegal-parking locations in this region. So, the road network was generated using the Quantum Geographical Information System (QGIS), a free and open-source cross-platform desktop GIS application that supports the viewing, editing, and analysis of geospatial data. Further, we collected the details of parking violations in the Yeonsu-gu region for the period of 2017-2019, specifically between 07:00-09:00 on weekdays. The dataset used in this case study is shown in Fig. 2. The ‘pink’ line shows the region of concern, i.e., the Yeonsu-gu region, and the ‘black’ lines show the road network of this region. Also, the ‘green’ dots in Fig. 2 show the respective locations of illegal parking (by the customers) that need to be visited by the RL agent to investigate the parking violation, and the ‘blue’ lines show the road links with illegal parking reported in these road links. In addition, a numerical value adjacent to a road link indicates this road link’s length. Fig. 2 also shows that most parking violations were near a road link and not on a road link. Therefore, we selected the road link nearest to each respective customer (with a corresponding illegal-parking location) for surveillance. The detailed feature representation of the dataset is shown in Table 1. In particular, the dataset consists of the road network information, such as the link id, link starting and ending nodes, road link length, longitude and latitude of each link’s starting node, and road link (i.e., customer) nearest to the illegal-parking location. In addition, Table 2 shows the basic information on the dataset. Meanwhile, Yeonsu-gu, Incheon, has an area of 50.8 km$^{2}$ and includes 2153 road links with a total road network length of 557.5~km.

##### Table 1. The feature representation of the dataset used to evaluate the DQSR’s effectiveness.
 S. No. LINK_ID S_NODE E_NODE Length (m) Longitude Latitude IP Customer index 1. 1610309000 1610019100 1610119400 442.095 123.623 37.4323 - - 2. 1610067301 1610019100 1610090800 426.306 123.623 37.4323 - - 3. 1610067100 1610019100 1610019300 261.743 123.623 37.4323 13 0 4. 1640269200 1610118900 1640085900 88.833 126.599 37.4187 - - 5. 1610294600 1610118900 1610118700 197.158 126.599 37.4187 - - 6. 1640001700 1630002600 1640016500 230.725 126.697 37.4361 - - 7. 1630012200 1630002600 1630003000 329.310 126.697 37.4361 24 1 8. 1630012000 1630002600 1630003900 575.696 126.697 37.4361 - - 9. 1640002000 1630003100 1640016600 361.970 126.681 37.4161 - - 10. 1640002100 1630003100 1640016700 243.891 126.681 37.4161 18 2 … … … … … … … … …
##### Table 2. The basic information of the dataset used to evaluate the DQSR’s effectiveness.
 S. No. Parameter Value 1. Region Yeonsu-gu, Incheon, South Korea 2. Total area 50.8 km2 3. Total road length 557498.47 m 4. Number of road links 2153 5. Number of illegal parking NA 6. Number of customers 135

### 4.2 Algorithms Compared and Simulation Parameters

The effectiveness and superiority of the DQSR algorithms are verified by comparing these algorithms with two other exact and heuristic algorithms, respectively. The first algorithm compared (Q+TPS) is based on a Q-learning algorithm that generates the shortest path between each customer to all other customers and a TSP algorithm that solves for the shortest route covering all the customers (without revisiting any customer) and returning to the starting customer. The second algorithm compared (Dijikstra+TSP) is based on the Dijkstra and Travelling Salesman Problem algorithms, where the Dijkstra algorithm generates the meta-graph (the shortest path from each customer to all other customers), and the TSP algorithm finds the shortest route covering all the costumers (without revisiting any customer) and returning to the starting customer.

Table 3 summarizes the simulation parameters and their values for implementing the DQSR algorithms and the two algorithms compared with the DQSR algorithms. The simulation parameters for the DQSR are mentioned in Table 3’s lines 1 to 9. Notably, the DQSR’s two Q-learning-based algorithms have different simulation parametric values. In particular, the difference between their respective epsilon decay rates and the number of training episodes is very large.

The RL agent learns the Q-value table for 50 pairs of source and destination customers in the DQSR’s Algorithm 1 and goes in a loop of 30 episodes per pair. Therefore, although the epsilon decay rate and total episodes are set to 0.9 and 30, respectively, the RL agent has to go through 1500 cycles before calculating the shortest path and distance. Whereas the DQSR’s Algorithm 2 is tasked only with calculating the shortest route covering all the 135 customers. Since this algorithm searches for the shortest route without repeating with the smallest distance, the epsilon decay rate and total episodes are set to 0.9999 and 40000, respectively, for the better learning of the RL agent.

The Q+TSP consists of two algorithms. The first algorithm is Q-learning based and used for the shortest path generation between all 134 other source customers to one particular destination customer. Since some of the customers are far from the destination customer, the epsilon decay rate and episodes are kept larger than the DQSR’s Algorithm 1. In particular, the epsilon decay rate and episodes are set to 0.95 and 40, respectively. Also, this process is repeated for all 135 customers. In addition, we used the TwoOpt_solver algorithm, a python’s inbuilt function that solves the TSP, as the Q+TSP’s second algorithm. Similarly, the Dijkstra+TSP consists of two algorithms. The first algorithm is the Dijkstra algorithm which calculates the shortest path and distance between each customer and other customers. The second algorithm is the TSP algorithm that solves for the shortest route connecting all the customers (by visiting each customer once and returning the source customer). In particular, we used python’s inbuilt function for the Dijkstra algorithm based on the network graph and the TwoOpt_solver algorithm for solving the TSP. Notably, we implemented all the DQSR algorithms and the two algorithms compared, using python 3.7 on an Ubuntu 18.04.4 machine with 1 NVIDIA TITAN Xp Graphics Card and a GPU memory of 11 GB. Notably, the performance of each model was unaffected by the use of the GPU.

##### Table 3. Simulation parameters.
 S. No. Parameter Shortest path between customers Best route covering all the customers Dual-stage, Q-learning based, Shortest-route generation (DQSR) 1. Algorithm Q-learning Q-learning 2. Learning rate ($\alpha$) 0.8 0.8 3. Epsilon ($\varepsilon$) 1.0 1.0 4. Discount factor ($\gamma$) 0.9 0.9 5. Epsilon decay rate 0.9 0.9999 6. Minimum epsilon 0.05 0.05 7. Episodes 30 40000 8. Number of customers 135 135 9. Number of nearest neighbor customers 50 - Q-learning for the shortest-path generation + Travelling Salesman Problem based, shortest-route-generation (Q+TSP) 10. Algorithm Q-learning TSP: TwoOpt_solver 11. Learning rate ($\alpha$) 0.8 - 12. Epsilon ($\varepsilon$) 1.0 - 13. Discount factor ($\gamma$) 0.9 - 14. Epsilon decay rate 0.95 - 15. Minimum epsilon 0.05 - 16. Episodes 40 - 17. Number of customers 135 135 18. Number of nearest neighbor customers 134 - Dijkstra algorithm for the shortest-path generation + Travelling Salesman Problem based, shortest-route-generation (Dijkstra+TSP) 19. Algorithm Dijkstra algorithm TSP: TwoOpt_solver 20. Number of customers 135 135

### 4.3 Results and Discussion

We present and discuss the results of the case study that used the DQSR, under various parameters, and analyze the DQSR’s performance by comparing this performance to those of the respective other two algorithms (Q+TSP and Dijkstra+TSP).

#### 4.3.1 Route Visualization

The road network and the route information are the two data required for the route visualization on the QGIS application. The road network is generated through the QGIS application or can be found in the government portal. Notably, we generated the road network from scratch for the Yeonsu-gu, Incheon, South Korea, and stored it as a shape (.shp) file. The route result is generated using a python program. The route result contains three data fields (i) a list of road links or edges through which the RL agent should travel through, (ii) the sequence number denoting the order in which the agent should travel, and (iii) a flag to separate the agent-selected road links from the total road links list. The route information is finally stored as a comma-separated value (CSV) file.

##### Fig. 3. Route visualization on QGIS: (a) The DQSR algorithms based shortest-route; (b) The zoomed-in visualization of the region within the ‘purple circle’ ofFig. 3(a); (c) The Q+TSP algorithm based shortest-route; (d) The zoomed-in visualization of the region within the ‘purple circle’ ofFig. 3(c). The dark purple arrows inFig. 3(d)highlight some of the differences in the shortest-route generation by the DQSR and Q+TSP algorithms, respectively.

Table 4 summarizes the algorithms’ respective performances. The table includes the physical distance of the shortest route generated by each of the algorithms and also provides the algorithms’ respective time complexities in generating the shortest routes. Hence, the Q+TSP algorithm provides the shortest route in terms of travel distance which is approximately 55.7 km for this route, whereas the DQSR’s shortest route overshot the Q+TSP algorithm’s travel distance by approximately 3 km. But, a 3 km overshoot is merely 0.5 % of the Yeonsu-gu region’s total road length of 557.49 km. Therefore, we can argue that both the DQSR and Q+TSP algorithms respectively generate the shortest route. However, DQSR is more preferred compared to the other two algorithms when the time complexity is taken into account. Table 4 shows that the DQSR required 237 s to calculate the shortest path between each customer and this customer’s nearest neighbor customers and required another 209 seconds to calculate the shortest route connecting all the customers, i.e., a total time of 446 s. In contrast, the Q+TSP algorithm required 1412 s to calculate the shortest path between each customer and all other customers and 0.23 s to calculate the shortest route connecting all the customers, i.e., a total time of 1412.23 s. Since the road network for our case study is very large with very high network depth, the Dijkstra+TSP’s Dijkstra algorithm couldn’t generate the shortest path in 6 h. Therefore, we chose to terminate the algorithm.

##### Table 4. The comparison of the respective shortest-routes generated by the DQSR, Q+TSP, and Dijkstra+TSP algorithms, in terms of shortest-route distance and time complexity.
 S. No. Algorithm(s) Parameter Value Time complexity Algorithm 1 Algorithm 2 1. DQSR Route length 58.15 km 237 s 209 s 2. Number of visited links 600 3. Q+TSP Route length 55.70 km 1412 s 0.23 s 4. Number of visited links 597 5. Dijkstra+TSP Route length - >6 h - 6. Number of visited links -

## 5. Conclusion

This paper elaborated on all the aspects of the proposed dual-stage, Q-learning based, shortest-route generation and presented its training process. In particular, this shortest-route generation was used for illegal-parking surveillance in Yeonsu-gu, Incheon, South Korea. The analysis of this application, based on experiments, shows that our DQSR algorithms perform better compared to two other shortest-route generation algorithms, in terms of time complexity, and has achieved almost the same shortest-route distance as these two algorithms compared. In particular, this comparison shows that our DQSR was three times faster than the Q+TSP algorithm and had produced the shortest route with a 5 % higher distance than that of the Q+TSP. In addition, the time complexity in our model can further be reduced by lowering the number of nearest neighbor customers, which we fixed to 50 in this study, provided there is enough intersection of customers in each nearest neighbor customer group to generate the shortest route. Further, we plan to increase and decrease the road network size and check the performance of the DQSR in terms of time complexity and shortest-route distance, in the future.

### ACKNOWLEDGMENTS

This work was supported by Incheon National University Research Grant in 2017.

### REFERENCES

1
Nazari M., Oroojlooy A., Takáč M., Snyder L. V., 2018, Reinforcement Learning for Solving the Vehicle Routing Problem, In Proceedings of the 32nd International Conference on Neural Information Processing Systems (NIPS'18), Curran Associates Inc., Red Hook, NY, USA, pp. 9861-9871
2
Singh A. K., Pamula R., 2021, An Efficient and Intelligent Routing Strategy for Vehicular Delay Tolerant Networks, Wireless Network, Vol. 27, pp. 383-400
3
Geetha M, Ganesan R., 2021, CEPRAN-Cooperative Energy Efficient and Priority Based Reliable Routing Protocol with Network Coding for WBAN, Wireless Pers Commun, Vol. 117, pp. 3153-3171
4
Liu D., Cui J., Zhang J., Yang C., Hanzo L., 2021, Deep Reinforcement Learning Aided Packet-Routing for Aeronautical Ad-Hoc Networks Formed by Passenger Planes, IEEE Transactions on Vehicular Technology, Vol. 70, No. 5, pp. 5166-5171
5
Yao H., Mai T., Jiang C., Kuang L., Guo S., 2019, AI Routers & Network Mind: A Hybrid Machine Learning Paradigm for Packet Routing, IEEE Computational Intelligence Magazine, Vol. 14, No. 4, pp. 21-30
6
Roig P. J., Alcaraz S., Gilly K., Bernad C., Juiz C., 2022, Arithmetic Framework to Optimize Packet Forwarding among End Devices in Generic Edge Computing Environments, Sensors, Vol. 22, No. 421
7
Waqas A., Saeed N., Mahmood H., Almutiry M., 2022, Distributed Destination Search Routing for 5G and beyond Networks, Sensors, Vol. 22, No. 472
8
Alhussein O., et al., 2018, Joint VNF Placement and Multicast Traffic Routing in 5G Core Networks, 2018 IEEE Global Communications Conference (GLOBECOM), pp. 1-6
9
Celik A., Tetzner J., Sinha K., et al., 2019, 5G Device-To-Device Communication Security and Multipath Routing Solutions, Appl Netw Sci, Vol. 4, No. 102
10
Feng L, et al., 2021, Explicit Evolutionary Multitasking for Combinatorial Optimization: A Case Study on Capacitated Vehicle Routing Problem, IEEE Transactions on Cybernetics, Vol. 51, No. 6, pp. 3143-3156
11
Zhang J., Yang F., Weng X., 2019, An Evolutionary Scatter Search Particle Swarm Optimization Algorithm for the Vehicle Routing Problem With Time Windows, IEEE Access, Vol. 6, pp. 63468-63485
12
Li J., Wang F., He Y., 2020, Electric Vehicle Routing Problem with Battery Swapping Considering Energy Consumption and Carbon Emissions, Sustainability, Vol. 12, No. 10537
13
Konstantakopoulos G. D., Gayialis S. P., Kechagias E. P., Papadopoulos G. A., Tatsiopoulos I. P. A., 2020, Multiobjective Large Neighborhood Search Metaheuristic for the Vehicle Routing Problem with Time Windows, Algorithms, Vol. 13, No. 243
14
Ngo T. G., Dao T. K., Thandapani J., Nguyen T. T., Pham D. T., Vu V. D., 2021, Analysis Urban Traffic Vehicle Routing Based on Dijkstra Algorithm Optimization, In: Sharma H., Gupta M. K., Tomar G. S., Lipo W. (eds) Communication and Intelligent Systems. Lecture Notes in Networks and Systems, Springer, Singapore, Vol. 204
15
Zheng Y.-L., Song T.-T., Chai J.-X., Yang X.-P., Yu M.-M., Zhu Y.-C., Liu Y., Xie Y.-Y., 2021, Exploring a New Adaptive Routing Based on the Dijkstra Algorithm in Optical Networks-on-Chip, Micromachines, Vol. 12, No. 54
16
El-Sherbeny N. A., 2010, Vehicle Routing with Time Windows: An Overview of Exact, Heuristic and Metaheuristic Methods. J. King Saud Univ. Sci., Vol. 22, pp. 123-131
17
Neupane D., Seok J. A., 2020, Review on Deep Learning-Based Approaches for Automatic Sonar Target Recognition, Electronics, Vol. 9, No. 1972
18
Bhandari S., Ranjan N., Khan P., Kim H., Hong Y.-S., 2021, Deep Learning-Based Content Caching in the Fog Access Points, Electronics, Vol. 10, No. 512
19
Bhandari S., Kim H., Ranjan N., Zhao H. P., Khan P., 2020, Best Cache Resource Based on Deep Neural Network for Fog Radio Access Networks, Journal of Internet Technology, Vol. 21, No. 4, pp. 967-975
20
Lazakis I., Khan S., 2021, An Optimization Framework for Daily Route Planning and Scheduling of Maintenance Vessel Activities in Offshore Wind Farms, Ocean Engineering, Vol. 225, No. 108752
21
Chen C., Jiang J., Lv N., Li S., 2020, An Intelligent Path Planning Scheme of Autonomous Vehicles Platoon Using Deep Reinforcement Learning on Network Edge, IEEE Access, Vol. 8, pp. 99059-99069
22
Prianto E., Kim M., Park J.-H., Bae J.-H., Kim J.-S., 2020, Path Planning for Multi-Arm Manipulators Using Deep Reinforcement Learning: Soft Actor-Critic with Hindsight Experience Replay, Sensors, Vol. 20, No. 5911
23
Lu E. H-C., Ya-Wen Y., 2019, A Hybrid Route Planning Approach for Logistics with Pickup and Delivery. Expert Syst, Appl, Vol. 118, pp. 482-492
24
Tong B., Wang J., Wang X., Zhou F., Mao X., Zheng W., 2022, Best Route Planning for Truck-Drone Delivery Using Variable Neighborhood Tabu Search Algorithm, Appl. Sci., Vol. 12, No. 529
25
Vahdanjoo M., Zhou K., Sørensen C. A. G., 2020, Route Planning for Agricultural Machines with Multiple Customers: Manure Application Case Study, Agronomy, Vol. 10, No. 1608
26
Ranjan N., Bhandari S., Khan P., Hong Y.-S., Kim H., 2021, Large-Scale Road Network Congestion Pattern Analysis and Prediction Using Deep Convolutional Autoencoder, Sustainability, Vol. 13, No. 5108
27
Ranjan N., Bhandari S., Zhao H. P., Kim H., Khan P., 2020, City-Wide Traffic Congestion Prediction Based on CNN, LSTM and Transpose CNN, IEEE Access, Vol. 8, pp. 81606-81620
28
Li Z., Huang J., 2019, How to Mitigate Traffic Congestion Based on Improved Ant Colony Algorithm: A Case Study of a Congested Old Area of a Metropolis, Sustainability, Vol. 11, No. 1140
29
Balakrishnan A., Banciu M., Glowacka K., Mirchandani P., 2013, Hierarchical Approach for Survivable Network Design, Eur. J. Oper. Res., Vol. 225, pp. 223-235
30
Fredman M. L., Tarjan R. E., 1987, Fibonacci Heaps and Their Uses in Improved Network Optimization Algorithms, J. ACM, Vol. 34, pp. 596-615
31
Qu T., Cai Z. A., 2015, Fast Isomap Algorithm Based on Fibonacci Heap, In International Conference in Swarm Intelligence; Springer: Cham, Switzerland, pp. 225-231
32
Lu X., Camitz M., 2011, Finding the Shortest Paths by Node Combination, Appl. Math. Comput., Vol. 217, pp. 6401-6408
33
Orlin J. B., Madduri K., Subramani K., Williamson M. A., 2010, Faster Algorithm for the Single Source Shortest Path Problem with Few Distinct Positive Lengths, J. Discret. Algorithms, Vol. 8, pp. 189-198
34
Jin W., Chen S., Jiang H., 2013, Finding the K Shortest Paths in a Time-schedule Network with Constraints on Arcs, Comput. Oper. Res., Vol. 40, pp. 2975-2982
35
Asaduzzaman M., Geok T. K., Hossain F., Sayeed S., Abdaziz A., Wong H.-Y., Tso C. P., Ahmed S., Bari M. A., 2021, An Efficient Shortest Path Algorithm: Multi-Destinations in an Indoor Environment, Symmetry, Vol. 13, No. 421
36
Hart P. E., Nilsson N. J., Raphael B. A., 1968, Formal Basis for the Heuristic Determination of Minimum Cost Paths, IEEE Transactions on Systems Science and Cybernetics, Vol. 4, No. 2, pp. 100-107
37
Ben Ticha H., Absi N. A., 2017, Solution Method for the Multi-Destination Bi-Objectives Shortest Path Problem, Elsevier: Amsterdam, The Netherlands
38
Bast H., Funke S., Sanders P., Schultes D., 2007, Fast Routing in Road Networks with Transit Nodes, Science, Vol. 316, No. 5824:566
39
Shen B., Cheema M. A., Harabor D. D., Stuckey P. J., 2021, Contracting and Compressing Shortest Path Databases, Proceedings of the International Conference on Automated Planning and Scheduling, Vol. 31, No. 1, pp. 322-330
40
Bentley J. L., 22-24 January, 1990, Experiments on Traveling Salesman Heuristics, In Proceedings of the First Annual ACM-SIAM Symposium on Discrete Algorithms, San Francisco, CA, USA, Vol. 91-99
41
Clarke G., Wright J. W., 1964, Scheduling of Vehicles from a Central Customer to a Number of Delivery Points, Oper. Res., Vol. 12, pp. 568-581
42
Mele U. J., Gambardella L. M., Montemanni R., 2021, Machine Learning Approaches for the Traveling Salesman Problem: A Survey, Proceedings of the 8$^{th}$ International Conference on Industrial Engineering and Applications 2021 (Europe), pp. 182-186
43
Zhang R., Prokhorchuk A., Dauwels J., 2020, Deep Reinforcement Learning for Traveling Salesman Problem with Time Windows and Rejections, 2020 International Joint Conference on Neural Networks (IJCNN), pp. 1-8
44
Reinelt G., 1991, TSPLIB-A Traveling Salesman Problem Library, INFORMS Journal on Computing, Vol. 3, No. 4, pp. 376-384
45
Usberti F. L., Franca P. M., France A. L. M., 2011, The Open Capacitated Arc Routing Problem, Computers & Operations Research, Vol. 38, No. 11, pp. 1543-1555
46
Minieka E., 1979, The Chinese Postman Problem for Mixed Networks, Management Science, Vol. 25, No. 7, pp. 609-707

## Author

##### Navin Ranjan

Navin Ranjan received his Bachelor's degree from Kathmandu University, Nepal, in 2016 and his Master's degree from Incheon National University, South Korea, in 2021. He worked as a Research Assistant at Industry-University Cooperation Foundation, Incheon National University, South Korea (2021~2022). His research interests include image processing, computer vision, scene understanding, and reinforcement learning.

##### Sovit Bhandari

Sovit Bhandari received a B.E. degree from Kathmandu University, Nepal, and an M.S. degree from Incheon National University, South Korea, in 2016 and 2020, respectively. He worked as Core Network Engineer at Huawei Technologies Nepal Pvt. Ltd. From Jan. 2018 to Aug. 2018, and since Sept. 2020, he has been working as a Research Assistant at IoT and Big-Data Research Center, Incheon National University, South Korea. His research interests include radio resource management, 5G and beyond mobile communications, machine learning, the Internet of Things, image processing, and computer vision.

##### Yeong-Chan Kim

Yeong-Chan Kim received his Bachelor’s degree from Daelim University, South Korea, in 2021. He is currently doing his M.S. at Incheon National University, South Korea. His research interests includes data science, deep learning, next-generation mobile communication systems, image processing, and computer vision.

##### Hoon Kim

Hoon Kim has been working as an Associate Professor at the Department of Electronics Engineering, Incheon National University, South Korea, since 2008. His research interest include radio resource management, optimization techniques, 5G mobile communication systems, machine learning and the internet of things. He is a Member of KICS, IEIE, IEEE, and IEICE.