The proposal of this study emphasizes the importance of adhering to a rigorous methodology to enable experiment reproducibility and result comparison under the traffic conditions in Korea. In addition, this study applied a traffic simulation environment, implemented in Eclipse, that uses tools from graph theory and Markov chains. The basic concepts of MDPs and RL methods were also applied. The methodology is a slightly adapted version of that of Varela, a reinforcement learning-based adaptive traffic signal control methodology for multi-agent coordination. Whereas the existing methodologies for independent learners consist of four steps, this study extended them with two distinct components: the MDP formulation and the RL method. The resulting five steps are simulation setup, MDP formulation, RL method, training, and evaluation, as shown in Fig. 1.
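A minimal, runnable sketch of how these five steps fit together is given below. Every function name, argument, and return value is a hypothetical stub introduced only for illustration; it is not the implementation used in this study.

from statistics import mean

def simulation_setup(network_file):
    """Step 1: load the network and demand into the simulator (stub)."""
    return {"network": network_file}

def formulate_mdp(env):
    """Step 2: fix state features, reward, actions, and observation scope (stub)."""
    return {"features": ["phase", "phase_time", "delay"], "actions": [0, 1, 2]}

def build_rl_method(mdp):
    """Step 3: choose the learning algorithm and function approximation (stub)."""
    return {"algorithm": "deep_q_learning", "mdp": mdp}

def train(agents, env, seed):
    """Step 4: one independent training run per random seed (stub)."""
    return {"seed": seed, "policy": "greedy"}

def evaluate(policy, env):
    """Step 5: score the trained policy with the chosen metrics (stub)."""
    return 0.0

def run_experiment(network_file, seeds):
    env = simulation_setup(network_file)
    mdp = formulate_mdp(env)
    agents = build_rl_method(mdp)
    return mean(evaluate(train(agents, env, s), env) for s in seeds)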
Because the MDP defines the optimization problem, meaningful comparisons between different
reinforcement learning methods require the same underlying MDP. Moreover, the MDP
formulation can have a decisive impact on the performance of the reinforcement learning
model. This has been demonstrated by keeping the learning algorithm fixed and altering
the underlying MDP formulation. In this study, the underlying MDP was fixed, and different baselines and RL-based methods were tested by evaluating different function approximations, adjustment methods, and observation scopes.
4.1 Motorway Network Topology-based MDP Formulation
The network can be extracted from real-world locations. Parts of urban areas can be exported through available open-source services, and by preparing this information during the simulation setup phase, it can be supplied to the simulator, opening up the possibility of simulating a rich set of networks relevant to real traffic signal control. Real-world data can also generate realistic traffic demands that match actual observations, reducing the gap between the simulator and the traffic controllers deployed in the real world. On the other hand, such data must be validated before use, and the setup process can be complex because it is often specific to the network. Acquiring such data can be challenging, and it may be noisy or even unavailable. Therefore, data-driven traffic demands fall outside the scope of this research.
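As an illustration of the simulation-setup phase, the sketch below parses an OpenStreetMap export into a directed road graph. The file name, the highway-tag filter, and the use of networkx are assumptions made for this example only and are not part of the study's toolchain.

import xml.etree.ElementTree as ET
import networkx as nx

def osm_to_graph(osm_path="map.osm"):
    """Build a directed road graph from an exported OSM extract (illustrative only)."""
    root = ET.parse(osm_path).getroot()
    nodes = {n.get("id"): (float(n.get("lat")), float(n.get("lon")))
             for n in root.findall("node")}
    graph = nx.DiGraph()
    for way in root.findall("way"):
        tags = {t.get("k"): t.get("v") for t in way.findall("tag")}
        if "highway" not in tags:          # keep only road segments
            continue
        refs = [nd.get("ref") for nd in way.findall("nd")]
        for u, v in zip(refs, refs[1:]):   # consecutive node pairs form directed edges
            graph.add_edge(u, v, highway=tags["highway"])
    for node_id, (lat, lon) in nodes.items():
        if node_id in graph:               # attach coordinates to nodes that appear in roads
            graph.nodes[node_id].update(lat=lat, lon=lon)
    return graph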
The MDP consists of state features, reward signals, action schemes, and an observation scope. A group of collaborating multi-agent DRL controllers is defined by an MDP that accounts for partial observability and agent interactions. The MDP is defined as a tuple, expressed as Eq. (1).
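Although Eq. (1) is not reproduced here, a conventional form of such a tuple, written only as a hedged sketch, is
$$\mathcal{M}=\left\langle S,A,P,R,\gamma \right\rangle ,$$
where S is the state space, A the action space, P the transition function, R the reward function, and ${\gamma}$ the discount factor; the tuple used in this study may additionally encode each agent's restricted observation scope.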
The state space S (s${\in}$S) represents the state at time t and is composed of features of the incoming approaches at the intersections. In this research, the state was described by a feature map ${\phi}$(s) composed of data on the internal state and the incoming approaches, expressed as Eq. (2).
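As a hedged illustration of Eq. (2), one plausible arrangement of the feature map, assuming one delay feature per phase, is
$$\phi \left(s\right)=\left(x_{g},x_{t},x_{1},\ldots ,x_{P}\right),$$
combining the internal state (the current green phase $x_{g}$ and its elapsed time $x_{t}$) with the delay features of the incoming approaches.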
The internal state is defined by the index of the current green phase, $x_{g}\in \left\{0,1,\ldots ,P-1\right\}$, where P is the number of phases, and by the time for which this phase has been active, $x_{t}\in \left\{10,20,\ldots ,90\right\}$. The feature $x_{p}$ on the incoming approaches of any given agent n at phase p is defined by the cumulative delay in Eq. (3), where $v$ is the speed of a vehicle on the incoming approach of phase p for the agent and $v_{p}$ is the corresponding speed limit. No delay occurs if all vehicles travel at the speed limit or if no vehicles are present on the approach. If a vehicle travels slower than the speed limit, the delay becomes positive and reaches its maximum value of 1 when the vehicle is fully stopped (v = 0).
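A plausible form of the cumulative delay in Eq. (3), consistent with this description but given here only as a sketch, is
$$x_{p}=\sum_{i\in V_{p}}\left(1-\frac{v_{i}}{v_{p}}\right),$$
where $V_{p}$ is the set of vehicles on the incoming approaches of phase p, $v_{i}$ is the speed of vehicle i, and $v_{p}$ is the speed limit; each term equals 0 at the speed limit and reaches its maximum of 1 for a stopped vehicle.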
4.2 Deep Reinforcement Learning Approaches
The DRL method consists of learning algorithms with different function approximation methods, adjustment methods, and observation scopes. In this task, agent coordination is achieved using the QL (Q-learning) algorithm for the domain and some of its variations. (i) The QL algorithm requires a discrete state space, so the state defined in the previous MDP formulation must be discretized. (ii) In this algorithm, each intersection must share its state and actions during training and share its state during execution.
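A minimal sketch of step (i), discretizing the continuous MDP features for a tabular Q-learning agent, is shown below; the bin edges and feature layout are assumptions made for illustration, not the values used in the study.

import numpy as np

DELAY_BINS = np.array([0.1, 0.3, 0.6, 0.9])       # assumed edges for the cumulative-delay feature
PHASE_TIME_BINS = np.array([20, 40, 60, 80])      # assumed edges for the elapsed green time (s)

def discretize_state(phase_index, phase_time, delays):
    """Map the continuous MDP features onto a discrete state tuple."""
    time_bucket = int(np.digitize(phase_time, PHASE_TIME_BINS))
    delay_buckets = tuple(int(np.digitize(d, DELAY_BINS)) for d in delays)
    return (phase_index, time_bucket) + delay_buckets

# Example: current green phase 1, active for 35 s, delays on two approaches.
state = discretize_state(1, 35, [0.25, 0.72])      # -> (1, 1, 1, 3)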
Deep QL (Deep Q-Learning) is a type of reinforcement learning in which an agent explores a non-deterministic environment and selects the best action based on experience. Deep QL learns based on the concepts of state, action, and reward. At time t, the situation of the environment is defined as a state ($s_{t}$). When an action ($a_{t}$) is taken in that state, a reward ($r_{t+1}$) is given, and the system transitions to the next state ($s_{t+1}$), as in Eq. (4). The set of n states is expressed as Eq. (5), and the set of m actions is expressed as Eq. (6). The Q-function over states and actions is denoted by Eq. (7).
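Written out in the notation above, and only as a sketch of what Eqs. (5)-(7) express (the exact indexing follows the original equations), these objects take the form
$$S=\left\{s_{0},s_{1},\ldots ,s_{n}\right\},\quad A=\left\{a_{0},a_{1},\ldots ,a_{m}\right\},\quad Q:S\times A\rightarrow \mathbb{R}.$$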
The learning values in Deep Q-Learning are stored in a Q-table. The updated value is obtained from the current state, the action, the reward ($r_{t+1}$), and the maximum Q-value of the new state ($\max_{a}$ Q($s_{t+1}$, a)). The update uses the learning rate (lr, ${\alpha}$) and the discount factor (df, ${\gamma}$), as expressed in Eq. (8).
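The standard Q-learning update consistent with this description, given here as a sketch of what Eq. (8) expresses, is
$$Q\left(s_{t},a_{t}\right)\leftarrow Q\left(s_{t},a_{t}\right)+\alpha \left[r_{t+1}+\gamma \max_{a}Q\left(s_{t+1},a\right)-Q\left(s_{t},a_{t}\right)\right].$$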
In general, Deep Q-Learning involves exploration, where actions are chosen based on the state and reward. When selecting actions, occasionally attempting new actions can lead to better results than relying solely on the actions that yield the highest immediate rewards. Therefore, the concept of exploration with randomness, known as epsilon-greedy selection, is applied, as sketched below. This research proposes a traffic signal control system using Deep Q-Learning in a multi-intersection setting. Each intersection is equipped with a local agent ($L_{agent}$), and each agent performs Deep Q-Learning independently based on the waiting-time information of vehicles at neighboring intersections that aim to enter the respective intersection.
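The sketch below illustrates epsilon-greedy selection together with a seeded random generator, which also anticipates the role of random seeds discussed next; the epsilon value, Q-values, and seed are illustrative assumptions.

import numpy as np

def epsilon_greedy(q_row, epsilon, rng):
    """Pick a random action with probability epsilon, otherwise the greedy action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_row)))   # exploration: try a random action
    return int(np.argmax(q_row))               # exploitation: highest Q-value

rng = np.random.default_rng(seed=42)           # a fixed seed makes the run replicable
q_values = np.array([0.2, 0.8, 0.5])           # Q(s, a) for three actions in one state
action = epsilon_greedy(q_values, epsilon=0.1, rng=rng)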
Accordingly, during training, specific procedures of the simulations and algorithms rely on random number generators. Simply changing the seed of these generators can induce significant differences in the performance of the implemented traffic controllers. Owing to this variance, multiple independently seeded training runs are performed for each controller, and the results are averaged to obtain performance outcomes that reflect how the traffic controller actually performs. These random seeds also allow all experiments to be fully replicated.
The DRL process involves exploration and exploitation phases, and congestion can occur in the network during simulations, preventing vehicles from moving through the road network. This occurs more frequently during the exploration phase, in which actions are selected randomly by the agent. When congestion occurs, the agent halts learning, and the simulation halts. To avoid this, the reinforcement learning task is episodic: the simulator is reset after a set time to prevent unfavorable outcomes from persisting indefinitely.
Two main performance metrics and two auxiliary performance metrics are used. The reward increases during training as the agent learns to make better decisions, indicating that the generated policy, as in deep reinforcement learning models (e.g., DQN), is approaching a stable state. The other two auxiliary metrics are the average number of vehicles in the road network and the average speed. As training progresses, the agent makes better decisions, which is reflected in a decrease in the average number of vehicles in the network, because traffic becomes more dispersed, and in an increase in the average speed (Fig. 2).
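The following sketch shows such an episodic loop with a fixed horizon and a cumulative-reward trace; the environment class, horizon, and rewards are placeholder stubs, not the simulator used in this study.

import random

class DummyTrafficEnv:
    """Stand-in for the simulator; reset() starts a fresh episode."""
    def __init__(self, horizon=3600):
        self.horizon = horizon
    def reset(self):
        self.t = 0
        return 0                                   # initial state (placeholder)
    def step(self, action):
        self.t += 1
        reward = random.random()                   # placeholder reward
        done = self.t >= self.horizon              # episode ends after the set time
        return 0, reward, done

env = DummyTrafficEnv(horizon=100)
for episode in range(3):
    state, done, total_reward = env.reset(), False, 0.0
    while not done:
        action = random.randrange(3)               # exploration: random action
        state, reward, done = env.step(action)
        total_reward += reward
    print(f"episode {episode}: cumulative reward {total_reward:.1f}")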
The state of DQL (Deep Q-Learning) is defined by the number of available directions in which vehicles can move at a given intersection. For example, Fig. 3 shows a four-way intersection with four adjacent directions. Each direction at a four-way intersection allows left turns and straight movements. Therefore, the state of a four-way intersection can be classified into eight categories (S = {$s_{0},s_{1},s_{2},\ldots ,s_{n}$}). The actions in the proposed DQL consisted of the possible actions to take at the intersection, and there were three actions in the action set (A = {$a_{0},a_{1},a_{2},\ldots ,a_{m}$}).
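The coding below enumerates the eight state categories and the three actions just described; the movement and action labels are assumptions introduced for illustration, since the text does not name them.

from itertools import product

DIRECTIONS = ["north", "east", "south", "west"]
MOVEMENTS = ["straight", "left"]

# Eight state categories s_0 .. s_7: one per (direction, movement) pair.
STATES = {i: dm for i, dm in enumerate(product(DIRECTIONS, MOVEMENTS))}

# Three actions a_0 .. a_2 available to the local agent (illustrative labels only).
ACTIONS = {0: "keep_current_phase", 1: "switch_to_next_phase", 2: "extend_green"}

print(len(STATES), len(ACTIONS))   # -> 8 3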
At time t, the reward (${\left(r\right)}_{t}^{i}$) of the local agent at intersection i was composed of the throughput ($t_{p}$) and the average waiting time (wt) of the adjacent intersections, as shown in Eq. (9). The throughput represents the number of vehicles processed at intersection i within a unit time, while the waiting time is the average waiting time of vehicles at intersection i and its adjacent intersections. The weights w and ${\xi}$ adjust the relative importance of the throughput and the waiting time, with w greater than one and ${\xi}$ between zero and one.
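One plausible weighted combination consistent with this description, given here only as a hedged sketch of Eq. (9), is
$$r_{t}^{i}=w\cdot tp_{t}^{i}-\xi \cdot wt_{t}^{i},$$
where $tp_{t}^{i}$ is the throughput of intersection i and $wt_{t}^{i}$ is the average waiting time at intersection i and its adjacent intersections.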
Fig. 2. Training metrics of City Traffic Flow by Deep Q-Learning.
Fig. 3. Four Adjacent Intersections of the Motorway of a City.