Quantum computers are superior to existing classical supercomputers in terms of run time for specific tasks, and various studies on quantum computers are in progress. Despite this, there is a significant gap in technology between available technologies and actual quantum computers on the noisy intermediate-scale quantum-scale. Currently, various quantum algorithms have been implemented and verified using quantum simulations. Therefore, the runtime performance of a quantum circuit can be verified in the same environment as a real quantum machine. A quantum compiler compiles quantum circuits. When a quantum circuit is compiled, logical qubits in a quantum circuit should be mapped to physical qubits, called qubit mapping. As qubit mapping is an NP-Complete problem, the compilation time increases exponentially according to the number of qubits. Therefore, various studies on effective qubit mapping techniques have been developed to accelerate the execution time. This paper proposes a new parallelized Qubit mapping algorithm. The proposed method consists of three-step processes: one circuit is divided into several circuits; the SABRE enhanced algorithm is executed on each circuit in parallel and then merged. The performance was evaluated using 10 benchmarks. The experimental results revealed an acceleration of up to 6.42x and an average speed up of 4.8x, with an average overhead of 1.03x in terms of the number of gates obtained. The proposed method outperformed the well-known modern qubit-mapping algorithm in terms of the execution time with exponential speedup improvement.

※ The user interface design of www.jsts.org has been recently revised and updated. Please contact inter@theieie.org for any inquiries regarding paper submission.

### Journal Search

## 1. Introduction

Quantum computers are the next-generation computers that store and calculate data
based on quantum mechanics ^{[1,}^{2]}. Quantum devices can process computationally intensive operations in a significantly
shorter time compared to classical computers by utilizing quantum mechanics, such
as superposition and entanglement. The D-Wave company ^{[3]} released quantum systems that are being commercialized and surpass classical supercomputers
using the quantum-annealing algorithm ^{[4,}^{5]}. On the other hand, since the D-wave was designed only for the quantum-annealing
algorithm, the current technology still has many problems developing an actual quantum
machine on the noisy intermediate-scale quantum (NISQ)-scale to be commercialized
universally soon ^{[6]}. Qubit, the basic unit constituting a quantum computer, and the quantum gates constituting
a quantum algorithm all have inherent errors. Therefore, two major problems should
be solved to produce a fault-tolerant and universal quantum computer on the NISQ-scale.
The first to be addressed is quantum error correction, which solves the errors inherent
in qubit and quantum gate ^{[7,}^{8]}. Currently, various studies on quantum error-correction codes have been conducted
^{[9-}^{13]}. Nevertheless, thousands of qubits are required to make a fault-tolerant qubit in
a real quantum machine ^{[14]}. A considerable number of qubits are required for a meaningful quantum algorithm
on the NISQ-scale. Second, the quantum circuit must be executed quickly within the
quantum coherence time ^{[15]}. The quantum coherence time refers to the time at which the information of the qubits
of an actual quantum machine can be preserved without change. In particular, the same
result as the simulation can be obtained if the quantum algorithm is executed successfully
within the quantum coherence time on an actual quantum machine. Currently, there is
no real quantum machine capable of supporting a general-purpose quantum circuit on
the NISQ-scale that can satisfy the two major problems. Thus, quantum computer simulations
must be utilized. Quantum simulation platforms, such as Qiskit ^{[16]}, QuTiP ^{[17]}, and t|ket> ^{[18]}, provide several useful tools to implement, verify quantum algorithms, and evaluate
their performance. In the case of Qiskit, the NISQ-scale quantum algorithm implemented
by the programmer can be operated using a simulator with the same characteristics
as the actual quantum machine. Consequently, it allows a researcher to implement and
analyze quantum circuits quickly.

With the recent developments of quantum computer technology, various quantum algorithms
have been proposed. Quantum algorithms are implemented as quantum circuits based on
quantum gates, and a quantum compiler is required to run these quantum circuits on
a quantum machine. Quantum qubit mapping is the core function of a quantum compiler
that maps logical qubits to physical qubits. The main difference between these two
qubits is the qubit connectivity, and the states in which the qubits are connected
are different. In a logical qubit, a random qubit is connected to all other qubits.
Thus, all the logical qubits are connected to each other. In a physical qubit, however,
only specific qubits are connected to each other. Qubit connectivity is important
because a two-quantum gate cannot be executed directly if a two-quantum gate to be
executed in a real quantum machine is applied to qubits that are not connected. Qubit
mapping is required to solve this problem. On the other hand, qubit mapping is an
NP-complete problem ^{[26]}. Thus, the compilation time increases exponentially with increasing number of qubits.
Therefore, the execution time of qubit mapping should be reduced to decrease the compilation
time.

This paper proposes a novel qubit mapping algorithm that utilizes the existing qubit
mapping algorithm called SABRE ^{[19]}. The SABRE algorithm minimizes the number of quantum gates added due to qubit mapping.
By running the SABRE algorithm in parallel, this study focused on improving the speedup
of the execution time of qubit mapping. For performance verification, 10 NISQ-scale
quantum circuits were used as benchmarks to measure the execution time of qubit mapping
compared to the SABRE algorithm. According to the experimental results, compared to
SABRE, the proposed method has an overhead of up to 1.05x and an average of 1.03x
in terms of the gate, resulting in improvements of up to 6.24 times with an average
of 4.8 times in terms of speedup.

The major contributions of this paper can be summarized as follows:

• This study analyzed the characteristics of previous studies for the qubit-mapping algorithm and identified the goals considered essential for qubit mapping.

• A new method was proposed to run the SABRE algorithm in parallel to reduce the compilation time of quantum compilers.

• The scalability of the proposed method was shown; increasing the number of cores during parallelization will outperform the well-known qubit-mapping algorithm for NISQ-scale quantum circuits.

The remainder of this paper is organized as follows. Section 2 describes the Qubit mapping problem. In Section 3, the proposed qubit-mapping algorithm is explained, and the performance of the method is verified in Section 4. Section 5 discusses the issues that should be resolved in future research, and Section 6 concludes the paper.

## 2. Background and Related Works

This section explains Qubit mapping, which is the focus of the study. In addition, the contents necessary for each paper are mentioned, and the work performed in this study is outlined by explaining recent studies to solve the Qubit mapping problem effectively.

### 2.1 Qubit Mapping Problem

Running a quantum circuit using simulation is not the same as running in a real quantum machine. Qubits used in quantum simulations are called logical qubits, while those used in actual quantum machines are called physical qubits. These two qubits run the quantum circuit under different conditions, and the condition considered in this study is the Qubit connectivity.

Qubit connectivity contains information on how different qubits are connected and can be expressed in a graph called a coupling map. In the case of logical qubits, because they are all connected to each other, the coupling map is not considered when executing the quantum circuit on a quantum simulator. By contrast, because not all physical qubits are connected to each other, the quantum circuit is executed in consideration of the coupling map. Fig. 1 shows the coupling map of a real quantum machine, IBM Q Tokyo. In the coupling map of IBM Q Tokyo, the circles indicate physical qubits, and straight arrows indicate that physical qubits are connected to each other. For example, qubits 6 and 7 are connected to each other, but qubits 6 and 12 are not. Hence, there are limitations in operating the quantum circuit because the physical qubits of the actual quantum machine are only connected to specific qubits.

A quantum circuit consists of several qubits and quantum gates. Quantum gates can be divided into single-qubit gates and multi-qubit gates. When executing a quantum circuit composed of these quantum gates in an actual quantum machine, it is necessary to check whether multi-qubit gates can be executed without additional work. That is, if the physical qubits to which the multi-qubit gates are applied are not connected to each other, these multi-qubit gates cannot be executed directly on an actual quantum machine. To solve this problem, it is necessary to connect the physical qubits that are separated. On the other hand, because the location of the physical qubit cannot be changed, the logical qubit must first be mapped to the physical qubit on the coupling map. Some of the logical qubits are separated from each other and some are connected according to the coupling map. To connect the separated physical qubits, it is necessary to connect the logical qubits mapped to the corresponding physical qubits. To this end, many SWAP gates that change the positions of the logical qubits are added to the original quantum circuit. Accordingly, the number of gates in the quantum circuit increases, which increases the execution time of qubit mapping. That is, to alleviate this problem, it is necessary to minimize the total number of gates using only a minimum number of SWAP gates.

Fig. 2 shows two different quantum circuits in which arbitrary quantum circuits can be produced through qubit mapping: (a) is a coupling map composed of four qubits, and (b) is a quantum circuit composed of four qubits and six quantum gates. If qubit mapping is performed considering the coupling map of (a) provided in (b), a quantum circuit is constructed, as shown in (c). The CNOT gate, which is indicated by the red dotted line in circuit (b), is applied to the first and third qubits. In the coupling graph of (a), the first and third qubits are not connected to each other. Thus, they cannot be executed directly on the quantum machine. The CNOT gate represented by the blue dotted line is a CNOT gate that can be affected by the SWAP gates added for the CNOT, which are represented by the red dotted line to be executed. When Qubit mapping is performed, two possible results can be observed: (c) and (d). In (c), two swap gates were added, and in (d), only one swap was added. Therefore, for a NISQ-scale quantum circuit, depending on the qubit mapping algorithm, the number of swap gates added, and the execution time of qubit mapping may be affected.

### 2.2 Qubit-mapping Algorithms

Recent studies related to Qubit mapping have been conducted to implement NISQ-scale
quantum circuits in real quantum machines. Cowtan et al. ^{[20]} performed Qubit mapping using the t|ket> quantum compiler in the architecture of
IBM QX5, a real quantum machine with 16 qubits, and IBM Tokyo with 20 qubits. Cowtan
et al. ^{[20]} aimed to minimize the depth of the added quantum gate and quantum circuit. For performance
verification, from a significantly small-scale quantum circuit with five quantum gates
to a NISQ-scale quantum circuit as a benchmark, the benchmarks were run together in
the compiler systems of Qiskit, Project Q, and Qulic to compare the performance. In
the performance comparison, however, it is difficult to accurately compare the performance
with t|ket> because the number and depth of quantum gates and depth of Qiskit and
Qulic for approximately 3000 or more quantum circuits in the IBM Q Tokyo machine used
by Cowtan et al. ^{[20]} were not measured. In addition, the execution time required for qubit mapping in
the three compilers was not analyzed.

A previous study ^{[21]} used a heuristic algorithm using the gate commutation rule and the gate transformation
rule to reduce the number of additional gates. In particular, the number of additional
gates required for qubit mapping was minimized using the Bridge gate and the SWAP
gate. On the other hand, the quantum circuit was used as a benchmark for performance
verification contained up to 38577 quantum gates, which is not sufficiently large.
Comparison analysis of the circuit execution time was not performed sufficiently.
Zulehner et al. ^{[22]} performed qubit mapping that minimizes the number of gates added to the IBM QX architecture
and reported superior performance compared to the default qubit-mapping algorithm
provided by Qiskit at the time the thesis was written. Cowtan et al. ^{[20]} developed and verified the superiority of the SABRE algorithm by comparing its performance
with that by Zulehner et al. ^{[22]} in terms of the number of additional quantum gates, total number of gates, and quantum
qubit mapping execution time. On the other hand, the benchmarks used for performance
verification included a maximum of 34881 quantum gates, which is significantly smaller
than the number of quantum gates on the NISQ-scale to be addressed in this study.
Zhang et al. ^{[23]} performed qubit mapping to minimize the circuit depth using a depth-aware SWAP insertion
scheme. The reported performance of the method ^{[23]} was verified by comparing it with the SABRE algorithm and the method reported by
Zulehner et al. ^{[22]}.

Compared to the two methods, the method reported by Zhang et al. ^{[23]} showed superior performance in terms of additional gates and depth. On the other
hand, the benchmarks used by Zhang et al. ^{[23]} are also not NISQ-scale quantum circuits, and a detailed analysis of the execution
time was not performed except for the statement that the expected execution time can
be equated with the depth. Niu et al. ^{[24]} reported the performance of reducing the number of additional gates by 28\% on average
compared to SABRE on the ibmq\_almaden machine and 14\% on the ibmq\_tokyo machine
by proposing a hardware-aware mapping transition algorithm. On the other hand, because
Niu et al. ^{[24]} also benchmarked the quantum circuit used in SABRE, the execution time of the NISQ-scale
quantum circuit was not compared and analyzed. Therefore, the present study used the
NISQ-scale quantum circuit as a benchmark to analyze the execution time, which was
insufficient in previous papers, and reduce it compared to the existing methodology.
The circuit consists of three sub-processes using the SABRE algorithm, and the execution
time of SABRE was compared with the proposed method for performance verification.

## 3. Parallelized Qubit Mapping Algorithm

This section describes the novel Qubit mapping algorithm proposed in this paper in more detail. Fig. 3 illustrates the overall process of the proposed method. It is a parallelization of the existing SABRE algorithm through a multiprocessing module in Python and consists of three-step processes. Each process is Circuit Split, Parallelized SABRE and Reversed SWAP and Circuit Merge.

##### Fig. 3. Overall process of the proposed method. Q-sc indicate the quantum subcircuit produced after the Circuit Split.

### 3.1 Proposed Method

First, the quantum circuit was split into several quantum subcircuits. By doing this, each quantum subcircuit could be computed in parallel. Circuit splitting processes on quantum circuit objects of the Qiskit library. In this version, every quantum circuit that contains more than 300 gates must be split. There is a trade-off between the number of gates per quantum subcircuit and the number of core processes used. This trade-off is mentioned in section 4.2.

Second, one quantum circuit was divided into quantum subcircuits with the same number
of quantum gates, and the SABRE algorithm was applied to the quantum subcircuits to
perform qubit mapping. At this time, because SABRE mapping is performed on a relatively
small quantum sub-circuit compared to the original circuit, the size of a Directed
Acyclic Graph (DAG) ^{[19]} of each quantum sub-circuit is also relatively small. The size of the DAG has a significant
impact on the execution time of the qubit mapping because the DAG is checked multiple
times during qubit mapping. In addition, because SABRE mapping is applied to each
quantum subcircuit in parallel, the execution time of qubit mapping is faster than
that of one-by-one. After SABRE mapping, the position of the last logical qubit of
each quantum subcircuit is different from the initial position because the SWAP gates
are added to each quantum subcircuit. The quantum gate is not applied to the correct
qubit if all quantum subcircuits are combined under these conditions. Thus, the measurement
result will be incorrect. To address the position changes, the logical qubit position
must be moved back to the initial position. To achieve this, when performing SABRE
mapping for each quantum subcircuit, the initial logical qubit position is arranged
from the first qubit to the last qubit in order. When SABRE mapping is finished, the
position of the changed logical qubit is rearranged from the first qubit to the last
qubit sequentially. Different permutations are added to each circuit to rearrange
the physical qubits. The permutation describes the current position of the logical
qubit that allows the addition of reversed swap gates to move the qubit back to the
initial position. The permutation pattern is extracted by identifying which qubits
are swapped when the SABRE mapping operates. This pattern is different for each quantum
subcircuit, and the positions of the last logical qubits of all quantum subcircuits
are rearranged sequentially through permutation. In addition, this reversed swap process
is also executed in parallel to accelerate the execution time.

Finally, each quantum subcircuit on which SABRE mapping and permutation have been performed must be combined into one quantum circuit again. Since all quantum subcircuits have physical qubits arranged in order by permutation at the end of the circuit, they can be converted quickly into a single quantum circuit with a simple '+' operation without additional work. The final quantum circuit has a different configuration from the circuit mapped by the SABRE algorithm; however, the result after the measurement is the same. Therefore, the accuracy of the quantum circuit mapped through the proposed method is reliable.

### 3.2 Example of Proposed Method

Fig. 4 is an arbitrary quantum circuit with four qubits and a total of 19 quantum gates. Assume that a given quantum circuit is operated in a quantum machine with a coupling map, as shown in Fig. 2(a). First, as shown in Fig. 5, a quantum circuit is divided into three quantum subcircuits with six quantum gates through the circuit split process. Subsequently, the red dotted line of each quantum subcircuit cannot be executed directly. Hence, it becomes a quantum gate that requires an additional swap, and the blue dotted line becomes a quantum gate that can be changed owing to the additional swap.

Fig. 6 presents three quantum subcircuits to which SABRE mapping is applied. Each swap gate is applied at a different location. Hence, the location of the final physical qubit is different from the first location. Therefore, before combining them as one quantum circuit, each permutation is applied in Fig. 7. The square brackets in permutation are the pattern of permutation, and each quantum subcircuit has a different pattern. Consequently, the position of the final qubit becomes the same as the position of the first qubit, which avoids applying a quantum gate to the wrong qubit when merging. Fig. 8 shows that it has become one quantum circuit again after the Circuit Merge process.

## 4. Performance Evaluation

This section reports the performance of the proposed qubit-mapping algorithm by comparing it with the SABRE algorithm. Ten NISQ-scale quantum circuits are used as benchmarks, and the execution time of qubit mapping, the number of added quantum gates, and the circuit depth are analyzed.

### 4.1 Speedup

Fig. 9 shows that the execution time of the proposed method is improved compared to the SABRE execution time. All experiments in this study were executed on a server with an AMD EPYC 7R32 CPU up to 3.4GHz (32 physical cores, two threads per core) and 128GB memory. The operating system was Ubuntu 20.04.2 LTS. The Python and Qiskit versions were 3.8.10 and 0.30.1, respectively. A Python multiprocessing module was used to perform the SABRE algorithm in parallel, and the number of processes depends on the number of CPU cores. Ten NISQ-scale quantum circuits were used as benchmarks, and the number of CPU cores was doubled from two to use up to 32 cores.

The Qiskit simulator was used to verify the accuracy of the experimental results and confirm that the results of the original quantum circuit, the circuit to which SABER was applied, and the circuit to which the proposed method was applied were all the same.

Because the proposed method executes SABRE mapping in parallel, the execution time speedup is improved approximately twofold when the number of cores is doubled. The proposed method in most quantum circuits outperformed SABRE. In the case of $pm\_4096$ with full name $plus63mod4096\_163$, using 32 cores, it was possible to obtain a speedup improvement of 6.4x compared to SABRE. On average, the proposed method obtained 1.5x, 2.5x, 3.5x, 4.5x, and 4.8x faster for 2, 4, 8, 16, and 32 cores, respectively, compared to SABRE.

The proposed method ran the SABER algorithm in parallel while increasing the number of CPU cores from 2 to 32. In some benchmarks, the execution time of the proposed method did not increase linearly when the number of cores was doubled. The proposed algorithm consisted of three parts. The circuit split and circuit merge process take time because only the part that performs qubit mapping was in parallel. Therefore, the execution time of the proposed method did not improve the performance completely linearly according to the number of cores.

### 4.2 Overhead Estimation

Table 1 lists the total number of quantum gates and the circuit depth generated by the SABRE and proposed method. The comparison normalized the results of SABRE and presented the results for the number of SWAP gates added using the proposed method, total number of gates, and depth. Because the proposed method included the permutations, there was a possibility that the number of swap gates to be added would be larger than that of the SABRE algorithm. On average, 1.25x swap gates were added compared to SABRE. On the other hand, in terms of the total gates, compared to SABRE, the maximum value was 1.05x, and the average was 1.03x, indicating a relatively small number compared to the added swap gates.

The depth of a quantum circuit is the total number of layers in which quantum gates
can be executed simultaneously ^{[25]}. That is, quantum gates included in one layer can be executed in parallel; hence,
the depth can significantly affect the overall execution time of the circuit. The
depth of the proposed method increased up to 1.06x with an average of 1.03x compared
to SABRE.

This study analyzed the methods to improve the execution time performance of the proposed method for future studies. The SABRE algorithm was executed in parallel by dividing one quantum circuit into multiple small-scale quantum subcircuits, doubling from at least two CPU cores to 32 for each quantum subcircuit. The results are presented in Fig. 9. Some specific benchmarks, such as $misex1\_241$, i.e., the number of cores, were doubled, but the speedup improvement was decreased. This is because the quantum subcircuits are assigned randomly to each core without considering the computational load of each quantum subcircuit when executing this method. That is, the load-balancing issue was not considered. Therefore, the consistent speedup improvement of the method according to the number of cores will need to be determined by adding a load-balancing algorithm optimized to the proposed method in future research.

## 5. Conclusion

This paper reported the parallelized qubit mapping algorithm that utilizes SABRE to accelerate the execution time of qubit mapping. The proposed method takes a quantum circuit as an input, and then splits it into multiple quantum subcircuits. Next, it calculates the SABRE mapping to each of the quantum subcircuits simultaneously powered by multiple cores. Finally, it accumulates results from SABRE and merges back into a single quantum circuit. The proposed method was tested based on the number of gates, depth and execution time with 10 NISQ-scale benchmarks. Because this method used additional swap gates to swap back the location of qubits, it has 1.03x more gates than SABRE on average. On the other hand, with the advantage of parallelization, in terms of the execution time, a speedup improvement of up to 6.42x with an average of 4.8x was obtained with 32 cores compared to SABRE. Future work will extend this method to manage the assigned quantum subcircuits to each core more efficiently.

### ACKNOWLEDGMENTS

This work was supported by Institute for Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) [No. 2020-0-00014, A Technology Development of Quantum OS for Fault-tolerant Logical Qubit Computing Environment]

### REFERENCES

## Author

Dongmin Kim received a Bachelor's degree in Computer Engineering and Applied Mathematics, Pukyong National University, Busan, South Korea, in 2021. He is currently working toward the MS degree in the Department of AI Convergence, Pukyong National University, Busan, South Korea. His research interests include compiler, memory systems, and quantum computing. Particularly, quantum annealing and quantum machine learning.

Sengthai Heng received his Bachelor's degree in Information Technology from the University of Cambodia, Phnom Penh, Cambodia, in 2019. He presently is a master's degree student of AI convergence at Pukyong National University (PKNU), Pusan, Republic of Korea. His current research interests include quantum computing, high-performance computing, and AI.

Youngsun Han received his B.E. and Ph.D. degrees in electrical and computer engineering from Korea University, Seoul, Republic of Korea, in 2003 and 2009, respectively. He was a senior engineer from 2009 to 2011 with System LSI, Samsung Electronics, Suwon, Republic of Korea. From 2011 to 2019, He was an assistant/associate professor with the Department of Electronic Engineering at Kyungil University, Gyeongsan-si, Republic of Korea. Since 2019, he has been an associate professor with the Department of Computer Engineering, Pukyong National University, Pusan, Republic of Korea. His research interests include high-performance computing, emerging memory systems, compiler construction, and SoC design.