This paper explores a reinforcement learning (with Q-learning as the algorithm) for vehicle routing in a city-wide road network with a large customer (a generic name used for each of the locations (to denote a person possibly associated with this location) through which the vehicle needs to be routed) counts. Most reported research on vehicle routing had mainly focused on calculating the shortest route (while neglecting the shortest path between certain customer(s) and certain/all other customer(s)) between the customers. Hence, we proposed a dual-stage, Q-learning based, shortest-route generation (DQSR) in a road network, considering the shortest path between each customer and certain other customers. Notably, the first stage’s Q-learning agent develops the meta-graph of the shortest path between each customer and this customer’s nearest neighbor customers (chosen differently in number), instead of each other customer, and finds this path’s length. This development is motivated by the fact that any two adjacent customers in the shortest route are located closer than far apart, reducing the time complexity of the first stage’s Q-learning. Subsequently, the second stage’s Q-learning agent finds the shortest route from any customer to other customers from the meta-graph, connecting each other customer only once and returning to the starting customer. Hence, the DQSR generates the shortest route in a relatively shorter duration due to the small size of the state-action pairs in the meta-graph. Further, we conducted a case study on the DQSR-based, illegal-parking surveillance in Yeonsu-gu, Incheon, South Korea, to check the DQSR’s effectiveness. The case study demonstrates that the DQSR is more efficient than two other algorithms used for the same purpose in the case study, in terms of time complexity and shortest-route generation.

※ The user interface design of www.jsts.org has been recently revised and updated. Please contact inter@theieie.org for any inquiries regarding paper submission.

### Journal Search

## 1. Introduction

Route planning in a network is an optimization problem that has been studied in applied
mathematics and computer science for decades. This problem is computationally expensive
and has been solved using many different algorithms proposed, but providing fast and
reliable solutions to this optimization problem is still a challenging task ^{[1]}. Route planning plays a critical role in a variety of applications, such as packet
transfer in wireless sensor networks ^{[2-}^{6]}, routing in 5G and 6G networks ^{[7-}^{9]}, and vehicle routing ^{[10-}^{14]}. Several researchers have developed and proposed many exact, heuristic, and metaheuristic
algorithms to solve the route-planning optimization problem. In particular, these
algorithms have their respective pros and cons. For example, an exact algorithm, such
as the Dijkstra algorithm, computes every possible solution and chooses the best among
them. However, even though this algorithm provides the exact solution, it tends to
suffer from high time-complexity even while route planning in a fairly small network
^{[14-}^{16]}. However, since time complexity is a major factor for real-time route planning in
a network, it is much better to look for an approximate solution instead of this exact
solution. Therefore, the heuristic and metaheuristic algorithms, such as the local
search algorithm, ant colony optimization, particle swarm optimization, etc., are
preferred in solving the route-planning optimization problem. However, these algorithms
also suffer from limitations. For instance, the heuristic algorithms, in many cases,
are unable to generate the best solution to this optimization problem.

Recently, machine learning (ML) algorithms have become popular in various fields due
to these algorithms’ ability to handle large-data and data non-linearities and the
availability of cheap and high computational power ^{[17-}^{19]}. Reinforcement learning (RL) is the major class of machine learning and has been
inspired by the behavior learning of animals. The RL algorithms seem to perform well
in partially observed environments or high-dimensional spaces, when an approximate
solution to any complex problem may be of greater value than an exact/best solution,
which is possible only to any simple problem, in these environments/spaces. The RL-based
algorithms can solve the route-planning optimization problem in applications like
packet transfer in a wireless network, vehicle route planning in a transportation
network, signal routing in a communication network, navigation through a maze, etc.,
which involve large networks, and the RL algorithms like Q-learning give the best
solution to this problem regardless of the environment.

The advantages of Q-learning, such as its off-policy and temporal-difference learnings, inspired us to propose the DQSR algorithms (one for each stage). The DQSR is based on two Q-learning agents: the first stage’s Q-learning agent develops the shortest path between each customer and this customer’s nearest neighbor customers in the network and finds this path’s length and the second stage’s Q-learning agent finds the shortest route from any customer to the other customers by virtually visiting each other customer only once and returning to the starting customer in a route that has the shortest distance. Hence, both Q-learning agents share the responsibility of achieving the shortest route, and hence the DQSR achieves this route relatively fast compared to the exact, some heuristic, or the single-stage, Q-learning based algorithms. In addition, the concept of nearest neighbor customers cannot be used in the exact and heuristic algorithms for the shortest-route planning/generation. In essence, the main contributions of this paper are summarized as follows:

· We developed a dual-stage, Q-learning based, shortest-route generation (DQSR) and its two algorithms.

· We implemented the idea of calculating the shortest path and distance between each customer and this customer’s nearest neighbor customers for the shortest-route generation.

· Our extensive experiments on the DQSR based, illegal- parking surveillance demonstrate the efficiency of the DQSR compared to two other algorithms used for the same purpose, in terms of time complexity and shortest-route generation.

We described the background of our research in this section. The rest of the paper is structured as follows. Section 2 presents the works related to our research. Section 3 presents the methodology of our research, including the background on reinforcement learning, problem statement, and the DQSR algorithms. Section 4 presents the data set used in the experiment, algorithms compared with the DQSR algorithms, simulation parameters, the case study results, and discussion. Finally, Section 5 concludes and provides the future direction for this research.

## 2. Related Work

Route planning is one of the most important functions of an intelligent transportation
system. It is widely used in various areas, such as daily travel route planning ^{[20-}^{22]}, package delivery service or indoor warehouse management ^{[23-}^{25]}, and traffic congestion reduction ^{[26-}^{28]}. The route planning also saves a lot of time and money for the transport management
agencies. In recent years, many academic works have been devoted to solving the route-planning
optimization problem with multiple locations to be connected by the route, in a relatively
low time. Some of the recent academic works have also been handling route planning
under multiple objectives.

Nowadays, many algorithms find the shortest path between nodes/locations in a network,
with widespread applications. For example, an exact algorithm based on traditional
graph-based methods, such as the Dijkstra algorithm, is used to find this shortest
path in a network under multiple destinations ^{[29]}. Notably, the multi-destination Dijkstra algorithm (MDDA) is the recent algorithm
to find the multi-destination shortest path in a network. In particular, the MDDA
uses a Fibonacci heap ^{[30,}^{31]}. Meanwhile, Lu et al. ^{[32]} proposed a node combination method to implement the Dijkstra algorithm in finding
the shortest path in a network by following three simple iterative steps: find the
nearest neighbor node of the source node, combine the nearest neighbor node with the
source node, and modify the weights on the edge connecting the source and nearest
neighbor nodes. However, the node combination method is not efficient or faster than
the MDDA. Likewise, Orlin et al. ^{[33]} proposed an algorithm to efficiently find the shortest path in a network, which is
applicable even to a very sparse network or under a small number of positive integer
weights. Further, Jin et al. ^{[34]} presented the modified MDDA to find the shortest path in a time-schedule network.
In particular, they considered the network constraints on its edges to find this shortest
path. Also, Asaduzzaman et al. ^{[35]} proposed a modified Dijkstra algorithm to find the shortest path to multiple destinations
in a network by using a minimum number of weights. Interestingly, Hart et al. ^{[36]} proposed an A* algorithm, which is an exact algorithm for incorporating information
from an optimization problem to a graphical analysis performed to find a minimum-cost
path in a network. Similarly, Ticha et al. ^{[37]} found the multi-destination, bi-objective shortest path using an A* search strategy.
In particular, their search strategy involves choosing some part of the path for a
non-dominated path, in each iteration, to arrive at one of the destination nodes.
However, the last two exact and heuristic algorithms and alike have a serious issue
of time complexity and have a polynomial run-time according to the graph size. Meanwhile,
Bast et al. ^{[38]} developed an algorithmic approach for fast routing in a road network having transit
nodes, reducing the quickest-path queries in this road network into a small number
of table lookups reducing the query time significantly. Their algorithm is more than
one million times faster than the best-known algorithm for routing in a general network.
Finally, Shen et al. ^{[39]} improved the shortest-path databases of a network by (i) contracting the input graph
before preprocessing and (ii) limiting the preprocessing step to only a selected subset
of the graph nodes.

Vehicle routing or vehicle-route planning is the generalization of the traveling salesman
problem (TSP). Meanwhile, involving only the nearest neighbors, a greedy and very
simple strategy, allows the best node/vertex to join the solution. But, this strategy
is inefficient in solving the TSP. Bentley ^{[40]} achieved an improved run-time in finding the approximate solutions to large-geometric
TSPs. In addition, Clarke et al. ^{[41]} presented an iterative procedure that enables the rapid selection of an optimum or
near optimum route for a fleet of trucks (with varying capacities) from a central
depot to several delivery points. Recently, new features are being brought by the
ML approaches, and researchers have been coupling ML features with the well-known
heuristic concepts, introducing new hybrid algorithms. For example, Mele et al. ^{[42]} presented a new, constructive heuristics driven by the ML for solving the TSP. This
solution’s algorithm enables a reasonable generalization and unveils an efficient
balance between the ML and optimization techniques. Likewise, Zhang et al. ^{[43]} proposed deep reinforcement learning for the TSP, with time windows and rejections.
Their deep RL showed superior performance compared to the Tabu search heuristics,
in terms of the solution quality and inference computation time. Further, Reinelt
^{[44]} provided the TSP library (TSPLIB) for researchers, with a broad set of test problems
from various sources and with various properties. Finally, the arc routing or Chinese
Postman problem ^{[45,}^{46]} presents a process for finding a least-cost way to traverse each arc of a network
at least once and return to the node/vertex from which the agent starts.

## 3. Methodology

### 3.1 Reinforcement Learning

Reinforcement learning (RL) is a computational approach to understanding and automating goal-directed learning and decision-making. It is about taking suitable action to maximize the reward in a particular situation. The RL differs from other computational approaches, such as supervised learning. In particular, the RL agent learns by interacting with its environment with/without prior knowledge of this environment or without relying on examples. Also, the RL is based on the Markov decision processes (MDPs).

An MDP is a discrete-time stochastic control process providing the mathematical framework to model the decision-making in a situation where the outcomes are partly random and partly under the decision-maker’s control. The MDPs are structured from a finite set containing the state-space ($S$), action-space ($A$), state transition probability matrix ($P$), and reinforcement or reward ($R$). In particular, it is the tuple$\left(S,A,P_{a},R_{a}\right)$. In addition, a state $\left(s\in S\right)~ $is the representation of the environment at a specific time step. Further, an action $\left(a\in A\right)$ is one in a set of the possible actions in the state$s\,.$ And, the state transition probability, $P_{a}\left(s,s'\right)=$ $P\left(s_{t+1}=s'~ |s_{t}=s,a_{t}=a\right)$, gives the probability that the action$~ a~ $in the state$~ s~ $at the time$~ t~ $will lead to the state $s'$at the time$~ t+1$. Finally, the reinforcement or reward, $R_{a}\left(s,s'\right),$ is the immediate reward received after transitioning from the state$~ s$ to the state $s'$.

The RL agent interacts with the environment in a sequence of steps at the time t: the agent (i) receives a representation of the environment, i.e., the state$~ s$ (ii) selects the action$~ a$ by using the ${\varepsilon}$-greedy$~ $selection (explained below) and executes this action (iii) receives the immediate reward signal $R_{a}\left(s,s'\right)$ (iv) updates the learning matrix (v) observes the new state$~ s'$of the environment.

The goal of the RL agent is to learn a policy$~ \left(\pi \right)$, i.e., to learn to map from the state space ($S$) to the action space ($A$) in a way to maximize the numerical reward signal. Therefore, the RL agent is not explicitly programmed of its action in a state. Instead, the RL agent learns through a trial and error procedure the action yielding the most reward in a state. In addition, the RL agent also follows the exploration and exploitation strategies during its learning to find the most rewarding action in a state. The exploration strategy allows the RL agent to improve the agent’s current knowledge about each state-action pair by taking a random action from the action list, instead of taking the best possible action, in a state. On the other hand, the exploitation strategy makes the RL agent choose the best action from the actions list to get the most reward by exploiting the agent’s current state-action value (knowledge). However, a balance between the exploration and exploitation strategies is essential and accomplished by the ${\varepsilon}$-greedy selection or the Boltzmann selection. The ${\varepsilon}$-greedy selection’s parameter $\varepsilon \left(0<\varepsilon <1\right)$ is defined and the decision policy $\pi \left(s\right)~ $is applied according to the following equation.

##### (1)

$ \begin{equation} \pi \left(s\right)=\left\{\begin{array}{ll} a^{*}, & with\,\,the\,\,probability1-\varepsilon \\ a^{r}, & with\,\,the\,\,probability\,\,\varepsilon \end{array}\right. \end{equation} $where $~ a^{\mathrm{*}}~ $and $a^{r}~ $refer to the best and random actions selected (under exploitation and exploration), respectively, in the current state$~ s$.

The Q-learning is one of the off-policy RL algorithms in which the agent learns the best state-action value function independent of the policy being followed. In particular, the Q-learning is based on the temporal-difference learning (TD), i.e., the value estimates are updated incrementally as the agent interacts with the environment, without waiting for the final outcome. In particular, the Q-learning’s state-action value is updated as,

##### (2)

$Q_{t+1}\left(s,a\right)\leftarrow Q_{t}\left(s,a\right)+\alpha \left[r\left(s,a\right)+\gamma \max _{a'}Q\left(s',a'\right)-Q_{t}\left(s,a\right)\right]$,where $Q_{t}\left(s,a\right)$ and$Q_{t+1}\left(s,a\right)$ are the state-action value (at the time $t$) and the updated state-action value (at the time $t+1)$ of the state-action pair $\left(s,a\right),$respectively. In addition, $r\left(s,a\right)$ is the immediate reward for executing the action $a~ $ in the state$s,\alpha \in \left[0,1\right]$ is the learning rate, $\gamma \in \left[0,1\right]$ is the discount factor, and $\max _{a'}Q\left(s',a'\right)$ is the best Q-value (state-action value) in the state$s'$for the action $a'$.

### 3.2 Problem Statement

Let the road network be represented by a directed graph, $G=\left(V,E\right)$, where $V$ is the set of nodes, $E$ is the set of edges (road segments), and each edge connects two nodes in the road network. The length of the edge $e\in E$ is denoted by $l_{e}\in \mathrm{~ \mathbb{R}}.$ Let us consider that the RL agent has to visit ‘n’ customers’ respective locations or edges, $D=\left\{d_{1},d_{2},\ldots ,d_{n}\right\}\in E$, exactly once before returning to the starting node$~ \left(d\in D\right)$. Here, an edge containing a customer is considered a node to perform the DQSR. Notably, the second stage of the DQSR is a classic example of the TSP.

### 3.3 The Dual-stage, Q-learning based, Shortest-route generation (DQSR) Algorithms

#### 3.3.1 The Q-learning based, Shortest-path Generation Algorithm

The calculation of the shortest path between two nodes is one of the classic problems in graph theory. The DQSR uses the Global Positioning System (GPS) location, i.e., the latitude and the longitude, of all the customers to calculate each customer’s all ‘$x$’ nearest neighbor customers’ respective locations. Then, it uses the Q-learning-based, shortest-path generation algorithm (Algorithm 1) to generate the shortest path between each customer and this customer’s nearest neighbor customers and find this path’s length.

The inputs to the Algorithm 1 are given in line 1 of the algorithm. In particular, the RL agent requires the road network represented by a directed graph, $G\left(V,E\right)$, road network length graph, i.e., the road network with the each of its edge having the corresponding length, list of customers to visit (in terms of their corresponding edges), each customer’s nearest neighbor customers list, ‘n’ Q shortest-path tables \{$Q_{\colon 1},Q_{\colon 2},\ldots ,Q_{\colon n}\}$to store the updated state-action value for each of the ‘n’ customers, respectively, distance table $\left(D_{Route}\right)~ $to store the length of the shortest path between each customer and this customer’s nearest neighbor customers, path table $\left(P_{Route}\right)~ $to store the shortest path between each customer and this customer’s nearest neighbor customers, and maximum number of random walks the RL agent can take before terminating the ‘while’ loop. Notably, the Q shortest-path tables each have the same state transition probability matrix as that of the road network. In contrast, the distance and path tables each have the same state transition probability matrix as that of each customer and this customer’s nearest neighbor customers, i.e., with size $n\times x.$ The $i^{th}$ Q shortest-path table, $Q_{\colon i}\,,$ stores the Q-value update for the path from the $i^{th}$ customer’s all ‘$x$’ nearest neighbor customers to the $i^{th}$ customer. In particular, the $Q_{\colon i}$ is used to calculate the shortest path and distance from the $i^{th}$ customer’s all ‘$x$’ nearest neighbor customers to the $i^{th}$ customer; here ‘$\colon $’ represents the list of the $i^{th}$ customer’s nearest neighbor customers. Also, the output of the algorithm includes the updated path and distance tables. In addition, the algorithm requires specifying the learning rate, discount factor, initial epsilon value, epsilon decay rate, and minimum epsilon.

The working of Algorithm 1 is shown in lines 4 to 24 of the algorithm. The first ‘for loop’ loops through all the customers to visit, as mentioned in line 4 of the algorithm. The RL agent copies the $i^{th}$ customer’s Q shortest-path table and the road network length graph as the reward table with the sign of its each edge length changed to negative, in lines 5 and 6 of the algorithm, respectively. The second ‘for loop’ loops through the $i^{th}$ customer’s all ‘$x$’ nearest neighbor customers, as mentioned in line 7 of the algorithm. Since the goal of the RL agent is to reach the $i^{th}$ customer’s nodes, we manipulate the reward table by adding extra positive reward to each edge that is connected to any of these nodes, as in line 8 of the algorithm. This extra reward enables the RL agent to learn effectively. Q shortest-path table update for the path from the $i^{th}$ customer’s all ‘x’ nearest neighbor customers to the $i^{th}$ customer is presented in lines 7 to 20 of the algorithm. First, the RL agent starts its functioning from the assigned current state and random-walks one step at a time until its termination according to the ‘while’ condition given in line 11 of the algorithm. In other words, the RL agent’s random walking is terminated either when the number of agent random walks completed exceeds the maximum number of random walks the agent can take (max_steps) or when the agent reaches the end state. Meanwhile, the RL agent chooses the action ($a$) at the current state ($s$) (as mentioned in line 12 of the algorithm), based on the $\varepsilon $-greedy selection, which enables the agent to either explore or exploit depending on the epsilon value. The RL agent then performs the chosen action and moves to the next state ($s')$, as in line 13 of the algorithm. Subsequently, the RL agent updates the previous Q-value, according to Eq. (2), as in line 14 of the algorithm. Further, the RL agent checks if the state $s'$ is the end state and changes the termination parameter (goal)’s value accordingly, as in lines 15 and 16 of the algorithm, respectively. Next, the RL agent replaces the current state with the next state and decrements the max_steps. The epsilon value is then updated after the termination of the RL agent’s path search, as in lines 18 and 19 of the algorithm. The entire process of Q-value update, as in lines 10 to 19 of the algorithm, is repeated $T$ (an user-defined number) times by the ‘for loop’ (in line 10 of the algorithm). These $T~ $repetitions allow the RL agent to learn the correct shortest-path from the $i^{th}$ customer’s all ‘$x$’nearest neighbor customers to the $i^{th}$ customer. The RL agent then calculates the path and distance from the $i^{th}$ customer’s all ‘$x$’nearest neighbor customers to the $i^{th}$ customer, in line 20 of the algorithm, based on the Algorithm 3 (given subsequently in this paper). Finally, the RL agent updates the $i^{th}$ component of the distance and path tables (in lines 22 and 23 of the algorithm, respectively).

An example of the Q-learning based generation of the shortest path from each customer’s nearest neighbor customers to this customer is shown in Figs. 1(a)-(d). In particular, Fig. 1(a) shows an arbitrary road network with five customers. Each customer’s starting and end nodes are represented by ‘green’ and ‘light blue’ color dots, respectively. The RL agent then updates the shortest path from each customer’s nearest neighbor customers to this customer by visiting each customer once and returing to the starting customer.

The process of updating the Q shortest-path table for each customer is shown in Fig. 1(b). Fig. 1(b) also shows the list of the respective Q shortest-path table updates for all the five customers. In particular, the first image in Fig. 1(b) shows the learned Q-value for the path from the respective end node of all the customer $d_{1}$’s nearest neighbor customers to $d_{1}.$ In addition, each edge’s width represents this edge’s Q-value compared to other edges. Since, the Q-value grows when reaching the destination, keeping the Q shortest-path table update for the path from each customer’s nearest neighbor customers to this customer in one Q shortest-path table is possible and doesn’t contradict when exploiting the Q shortest-path table to generate the shortest path. Fig. 1(c) shows each customer to this customer’s each nearest neighbor customer transitional distance-table. Similarly, Fig. 1(d) shows each customer to this customer’s each nearest neighbor customer transitional path-table. Notably, the paths in this table are reversed when the states are reversed.

##### Fig. 1. An example of the Q-learning based generation of the shortest path from each customer’s nearest neighbor customers to this customer and the shortest route connecting all the customers.Fig. 1(a)is an arbitrary road network with five customers.Fig. 1(b)shows the list of the respective Q shortest-table updates for all the five customers and each edge’s width represents this edge’s Q-value compared to other edges.Fig. 1(c)shows each customer to this customer’s each nearest neighbor customer transitional distance-table.Fig. 1(d)shows each customer to this customer’s each nearest neighbor customer transitional path-table.Fig. 1(e)shows the Q-value transition between each customer and this customer’s nearest neighbor customers.Fig. 1(f)shows the shortest route from the customer $d_{1}$, generated based on theAlgorithm 3and the final Q-routing table fromFig. 1(e).

#### 3.3.2 The Q-learning based, Shortest-route Generation Algorithm

Algorithm 2 is for the Q-learning based generation of the shortest route from any customer to other customers by visiting each other customer once and returning back to the starting customer. The inputs to the algorithm are given in line 1 of the algorithm. The RL agent requires the Q-routing table having each customer and this customer’s nearest neighbor customers’ state transition, the distance routing and path routing tables from the Algorithm 1, the reward table having its state transitions same as those of the Q-routing table with the initial reward value as -1, the list of customers to visit, the maximum number of random walks for the RL agent, the total number of episodes, and the starting customer. The output of the algorithm includes the shortest route and distance. The algorithm also requires specifying the learning rate, discount factor, initial epsilon value, epsilon decay rate, and minimum epsilon.

The working of Algorithm 2 is shown in lines 4 to 38 of the algorithm. The algorithm consists of five subparts: the global variable setup, episode variable setup, Q-value update, the global variable’s update by comparing the episode variables, and route and route distance generation. The Q-routing and reward tables are copied as the best Q-routing and reward tables, as in line 4 of the algorithm. Then, the global variable for the number of visited customers and the distance covered, respectively, is initialized, as in line 5 of the algorithm.

The ‘for loop’ loops through all the available episodes, as shown in line 8 of the algorithm. The episode variables are updated in each episode, as in lines 7 to 11 of the algorithm. The Q-value update during each loop is similar to the Q-value update of Algorithm 1 and is shown in lines 12 to 26 of the algorithm. The RL agent starts its functioning from the assigned current state and random-walks one step at a time until its termination according to the ‘while’ condition given in line 12 of the algorithm. The RL agent is terminated either when its completed number of random walks exceeds the maximum number of random walks the agent can take (max_steps) or when the agent reaches the end state. The agent then chooses an action ($a$) at the current state ($s$), based on the $\varepsilon $-greedy selection which enables the agent to either explore or exploit based on the epsilon value, as in line 13, and selects the next state, as given in line 14 of the algorithm. Subsequently, the episode reward table is updated, as in lines 15 to 22 of the algorithm. A negative reward is assigned between the current and next states if the RL agent visits all the customers and ends at a different customer than the starting customer, as in lines 15 and 16 of the algorithm. In addition, the reward table is updated with a negative reward if the RL agent visits the same customer more than once, as in lines 19 and 20 of the algorithm. On the other hand, a positive reward is assigned if the RL agent visits the starting customer’s each other customer once and returns to the starting customer, as in lines 17 and 18 of the algorithm. Then, the next to visit customer is removed from the to-visit customers list, as in lines 21 and 22 of the algorithm. The RL agent subsequently updates the previous Q-value, based on the current and next customer states and using Eq. (2), as in line 23 of the algorithm. Also, the RL agent replaces the current state with the next state and decrements the max_steps, as in line 26 of the algorithm. Finally, the number of customer locations visited (VD) by the RL agent is counted after the Q-learning process is terminated. Further, the global route and global route distance variables are calculated based on the VD, as in lines 28 to 35 of the algorithm. If the VD is equal to the previous global_visited_path, no changes in the best reward and the Q-routing tables are made. Otherwise, if the VD is greater than the previous global_visited_path, the length of the global route distance is set to negative infinity, and the best reward and Q-routing tables are replaced by the episode reward and Q-routing tables, respectively, as in lines 29 to 31 of the algorithm. This step is followed by assigning a global_visited_path equal to the VD. Further, if the episode route distance (episode_distance) is greater than the global route distance, then the global route distance is replaced by the episode route distance, and the best reward and Q-routing tables are replaced by the episode reward and Q-routing tables, respectively, as in lines 33 to 35 of the algorithm. The epsilon value is then updated after the termination of the RL agent’s route search, as in lines 36 and 37 of the algorithm. Finally, the RL agent generates the shortest route and distance after the Q-value update for all the possible episodes is completed, based on Algorithm 3. An example of the shortest route generation is shown in Figs. 1(e) and 1(f). Fig. 1(e) shows the Q-value transition between each customer and this customer’s nearest neighbor customers. Subsequently, the shortest route from any customer to other customers is generated based on Algorithm 3 andthe final Q-routing table from Fig. 1(e).

Algorithm 3 is for the path/route generation and the corresponding distance calculation. The inputs to the algorithm are the updated Q-value table, starting and ending nodes, and distance table (the road length graph for the path generation and the distance routing table for the route generation), as in line 1 of the algorithm. The output of the algorithm is the path/route generated and its corresponding distance, as in line 2 of the algorithm. The working of the algorithm is shown in lines 3 to 14 of the algorithm. The current state, path, and distance are assigned their initial values in lines 3 to 5 of the algorithm. In addition, the variable ‘goal’ is set to False initially, as in line 6 of the algorithm. Notably, the RL agent calculates the path/route according to lines 7 to 13 of the algorithm. In particular, the RL agent goes through the ‘while loop’ until the terminating condition is met. The RL agent then calculates the next state from the current state by using the Q-value table through exploitation, as in lines 8 and 9 of the algorithm. Subsequently, the path and distance are updated, as in lines 10 and 11 of the algorithm. In addition, the RL agent checks if the next state is the end state during each ‘while loop’, as in line 12 of the algorithm. If the next state is the end state, the variable ‘goal’ is set to ‘True’, as in line 13 of the algorithm, and the ‘while loop’ terminates. Finally, the path/route and its corresponding distance are returned by the algorithm.

## 4. Experiments

### 4.1 Dataset

We train the RL agent under the DQSR algorithms and evaluate the DQSR’s effectiveness by conducting a case study on the DQSR-based, illegal-parking surveillance in Yeonsu-gu, Incheon, South Korea. Two types of data were required for this case study: (i) the road network with the road information of the Yeonsu-gu region and (ii) the illegal-parking locations in this region. So, the road network was generated using the Quantum Geographical Information System (QGIS), a free and open-source cross-platform desktop GIS application that supports the viewing, editing, and analysis of geospatial data. Further, we collected the details of parking violations in the Yeonsu-gu region for the period of 2017-2019, specifically between 07:00-09:00 on weekdays. The dataset used in this case study is shown in Fig. 2. The ‘pink’ line shows the region of concern, i.e., the Yeonsu-gu region, and the ‘black’ lines show the road network of this region. Also, the ‘green’ dots in Fig. 2 show the respective locations of illegal parking (by the customers) that need to be visited by the RL agent to investigate the parking violation, and the ‘blue’ lines show the road links with illegal parking reported in these road links. In addition, a numerical value adjacent to a road link indicates this road link’s length. Fig. 2 also shows that most parking violations were near a road link and not on a road link. Therefore, we selected the road link nearest to each respective customer (with a corresponding illegal-parking location) for surveillance. The detailed feature representation of the dataset is shown in Table 1. In particular, the dataset consists of the road network information, such as the link id, link starting and ending nodes, road link length, longitude and latitude of each link’s starting node, and road link (i.e., customer) nearest to the illegal-parking location. In addition, Table 2 shows the basic information on the dataset. Meanwhile, Yeonsu-gu, Incheon, has an area of 50.8 km$^{2}$ and includes 2153 road links with a total road network length of 557.5~km.

##### Fig. 2. Dataset Visualization. The ‘pink’ line shows the Yeonsu-gu region’s boundary, the ‘black’ lines show the road network with road length information, the ‘green’ dots represent the respective illegal-parking (by the customers) locations that need to be included in the route, and the ‘blue’ lines show the road links with illegal parking reported in these links.

##### Table 1. The feature representation of the dataset used to evaluate the DQSR’s effectiveness.

### 4.2 Algorithms Compared and Simulation Parameters

The effectiveness and superiority of the DQSR algorithms are verified by comparing these algorithms with two other exact and heuristic algorithms, respectively. The first algorithm compared (Q+TPS) is based on a Q-learning algorithm that generates the shortest path between each customer to all other customers and a TSP algorithm that solves for the shortest route covering all the customers (without revisiting any customer) and returning to the starting customer. The second algorithm compared (Dijikstra+TSP) is based on the Dijkstra and Travelling Salesman Problem algorithms, where the Dijkstra algorithm generates the meta-graph (the shortest path from each customer to all other customers), and the TSP algorithm finds the shortest route covering all the costumers (without revisiting any customer) and returning to the starting customer.

Table 3 summarizes the simulation parameters and their values for implementing the DQSR algorithms and the two algorithms compared with the DQSR algorithms. The simulation parameters for the DQSR are mentioned in Table 3’s lines 1 to 9. Notably, the DQSR’s two Q-learning-based algorithms have different simulation parametric values. In particular, the difference between their respective epsilon decay rates and the number of training episodes is very large.

The RL agent learns the Q-value table for 50 pairs of source and destination customers in the DQSR’s Algorithm 1 and goes in a loop of 30 episodes per pair. Therefore, although the epsilon decay rate and total episodes are set to 0.9 and 30, respectively, the RL agent has to go through 1500 cycles before calculating the shortest path and distance. Whereas the DQSR’s Algorithm 2 is tasked only with calculating the shortest route covering all the 135 customers. Since this algorithm searches for the shortest route without repeating with the smallest distance, the epsilon decay rate and total episodes are set to 0.9999 and 40000, respectively, for the better learning of the RL agent.

The Q+TSP consists of two algorithms. The first algorithm is Q-learning based and used for the shortest path generation between all 134 other source customers to one particular destination customer. Since some of the customers are far from the destination customer, the epsilon decay rate and episodes are kept larger than the DQSR’s Algorithm 1. In particular, the epsilon decay rate and episodes are set to 0.95 and 40, respectively. Also, this process is repeated for all 135 customers. In addition, we used the TwoOpt_solver algorithm, a python’s inbuilt function that solves the TSP, as the Q+TSP’s second algorithm. Similarly, the Dijkstra+TSP consists of two algorithms. The first algorithm is the Dijkstra algorithm which calculates the shortest path and distance between each customer and other customers. The second algorithm is the TSP algorithm that solves for the shortest route connecting all the customers (by visiting each customer once and returning the source customer). In particular, we used python’s inbuilt function for the Dijkstra algorithm based on the network graph and the TwoOpt_solver algorithm for solving the TSP. Notably, we implemented all the DQSR algorithms and the two algorithms compared, using python 3.7 on an Ubuntu 18.04.4 machine with 1 NVIDIA TITAN Xp Graphics Card and a GPU memory of 11 GB. Notably, the performance of each model was unaffected by the use of the GPU.

##### Table 3. Simulation parameters.

### 4.3 Results and Discussion

We present and discuss the results of the case study that used the DQSR, under various parameters, and analyze the DQSR’s performance by comparing this performance to those of the respective other two algorithms (Q+TSP and Dijkstra+TSP).

#### 4.3.1 Route Visualization

The road network and the route information are the two data required for the route visualization on the QGIS application. The road network is generated through the QGIS application or can be found in the government portal. Notably, we generated the road network from scratch for the Yeonsu-gu, Incheon, South Korea, and stored it as a shape (.shp) file. The route result is generated using a python program. The route result contains three data fields (i) a list of road links or edges through which the RL agent should travel through, (ii) the sequence number denoting the order in which the agent should travel, and (iii) a flag to separate the agent-selected road links from the total road links list. The route information is finally stored as a comma-separated value (CSV) file.

Meanwhile, the python program and QGIS have not been embedded together in this work. Rather, the route result from the python program is manually uploaded to the QGIS application for route visualization. First, the shapefile containing the road network information is loaded to the QGIS through the ‘Add Vector Layer’ function. In particular, the road network layer consists of fields, such as the road link id, node id, location information, etc. Similarly, the route information file is uploaded to QGIS through the ‘Add Vector Layer’ function. It consists of fields like road link id, sequence number, and flag. Next, we join the route layer to the road network layer based on a common field. Consequently, two new data fields, namely sequence number and flag, are added to the road network layer. Also, the road link id in the road network layer is categorized into two groups based on the flag data field: (i) the one selected for the agent to travel through and (ii) the one not included in the route planning. This categorization of the road link is performed through the QGIS Symbology properties, and the road sequence level’s categorization is performed through the QGIS Label properties.

Fig. 3 shows the visualization of the shortest routes from the DQSR and Q+TSP algorithms, respectively. In particular, these results are visualized on a QGIS map. The ‘black’ lines represent the road network, ‘green’ dots represent the illegal-parking locations, and dotted ‘blue’ lines represent the customer links or the customers for the DQSR-based RL agent to visit. Fig. 3(a) shows the shortest route covering all the customers and returning to the starting customer (‘red’ line). In addition, the numbers mentioned in the road network show the order in which the DQSR-based RL agent had to follow the road links to visit all the mentioned customers and return to the starting customer. Correspondingly, the DQSR-based RL agent’s shortest route is 58155.14 m long. In addition, the DQSR-based RL agent had to visit 600 out of the total 2152 road links, for the shortest route. Importantly, the DQSR-based RL agent could choose to travel through previously traveled road links without concerning that the agent would visit a previously visited customer. The agent was allowed to make such a choice if and only if the path through a customer revisited is the shortest path connecting two other customers. Fig. 3(b) shows the zoomed-in visualization of the region within the ‘purple circle’ of Fig. 3(a). This zoomed-in image of Fig. 3 (b) is more intuitive and easier to visualize. Since the DQSR-based RL agent visited the road links multiple times, only the last sequence numbers of the road links in the route are displayed in the image.

##### Fig. 3. Route visualization on QGIS: (a) The DQSR algorithms based shortest-route; (b) The zoomed-in visualization of the region within the ‘purple circle’ ofFig. 3(a); (c) The Q+TSP algorithm based shortest-route; (d) The zoomed-in visualization of the region within the ‘purple circle’ ofFig. 3(c). The dark purple arrows inFig. 3(d)highlight some of the differences in the shortest-route generation by the DQSR and Q+TSP algorithms, respectively.

Fig. 3(c) shows the shortest route generated by the Q+TSP algorithm. In particular, this shortest route is represented by the ‘purple’ line of Fig. 3(c). The numbers in the road network show the sequence in which the agent had to travel the road links to complete the shortest route. Notably, the Q+TSP algorithm required 55703.39 m for the route connecting all the customers. In addition, the Q+TSP agent visited 597 out of the total 2153 road links in generating the shortest route. Meanwhile, the Q+TSP agent was allowed to go through the same road links without violating the condition of visiting an already visited customer. Fig. 3(d) is the zoomed-in image of the region within the ‘purple circle’ of Fig. 3(c), which is more intuitive for the reader to understand the shortest route. Further, Fig. 3(d)’s dark purple arrows highlight some of the differences in the shortest-route generation by the DQSR and Q+TSP algorithms, respectively.

Table 4 summarizes the algorithms’ respective performances. The table includes the physical distance of the shortest route generated by each of the algorithms and also provides the algorithms’ respective time complexities in generating the shortest routes. Hence, the Q+TSP algorithm provides the shortest route in terms of travel distance which is approximately 55.7 km for this route, whereas the DQSR’s shortest route overshot the Q+TSP algorithm’s travel distance by approximately 3 km. But, a 3 km overshoot is merely 0.5 % of the Yeonsu-gu region’s total road length of 557.49 km. Therefore, we can argue that both the DQSR and Q+TSP algorithms respectively generate the shortest route. However, DQSR is more preferred compared to the other two algorithms when the time complexity is taken into account. Table 4 shows that the DQSR required 237 s to calculate the shortest path between each customer and this customer’s nearest neighbor customers and required another 209 seconds to calculate the shortest route connecting all the customers, i.e., a total time of 446 s. In contrast, the Q+TSP algorithm required 1412 s to calculate the shortest path between each customer and all other customers and 0.23 s to calculate the shortest route connecting all the customers, i.e., a total time of 1412.23 s. Since the road network for our case study is very large with very high network depth, the Dijkstra+TSP’s Dijkstra algorithm couldn’t generate the shortest path in 6 h. Therefore, we chose to terminate the algorithm.

##### Table 4. The comparison of the respective shortest-routes generated by the DQSR, Q+TSP, and Dijkstra+TSP algorithms, in terms of shortest-route distance and time complexity.

## 5. Conclusion

This paper elaborated on all the aspects of the proposed dual-stage, Q-learning based, shortest-route generation and presented its training process. In particular, this shortest-route generation was used for illegal-parking surveillance in Yeonsu-gu, Incheon, South Korea. The analysis of this application, based on experiments, shows that our DQSR algorithms perform better compared to two other shortest-route generation algorithms, in terms of time complexity, and has achieved almost the same shortest-route distance as these two algorithms compared. In particular, this comparison shows that our DQSR was three times faster than the Q+TSP algorithm and had produced the shortest route with a 5 % higher distance than that of the Q+TSP. In addition, the time complexity in our model can further be reduced by lowering the number of nearest neighbor customers, which we fixed to 50 in this study, provided there is enough intersection of customers in each nearest neighbor customer group to generate the shortest route. Further, we plan to increase and decrease the road network size and check the performance of the DQSR in terms of time complexity and shortest-route distance, in the future.

### REFERENCES

## Author

Navin Ranjan received his Bachelor's degree from Kathmandu University, Nepal, in 2016 and his Master's degree from Incheon National University, South Korea, in 2021. He worked as a Research Assistant at Industry-University Cooperation Foundation, Incheon National University, South Korea (2021~2022). His research interests include image processing, computer vision, scene understanding, and reinforcement learning.

Sovit Bhandari received a B.E. degree from Kathmandu University, Nepal, and an M.S. degree from Incheon National University, South Korea, in 2016 and 2020, respectively. He worked as Core Network Engineer at Huawei Technologies Nepal Pvt. Ltd. From Jan. 2018 to Aug. 2018, and since Sept. 2020, he has been working as a Research Assistant at IoT and Big-Data Research Center, Incheon National University, South Korea. His research interests include radio resource management, 5G and beyond mobile communications, machine learning, the Internet of Things, image processing, and computer vision.

Yeong-Chan Kim received his Bachelor’s degree from Daelim University, South Korea, in 2021. He is currently doing his M.S. at Incheon National University, South Korea. His research interests includes data science, deep learning, next-generation mobile communication systems, image processing, and computer vision.

Hoon Kim has been working as an Associate Professor at the Department of Electronics Engineering, Incheon National University, South Korea, since 2008. His research interest include radio resource management, optimization techniques, 5G mobile communication systems, machine learning and the internet of things. He is a Member of KICS, IEIE, IEEE, and IEICE.