1. Introduction
A multiplier is an essential component of most digital systems, such as microprocessors,
digital signal processors, graphic processors, etc. Therefore, various efforts have
been made to develop highspeed multipliers ^{[1}^{6]}. The most basic form of multiplication process follows the traditional algorithm
taught in primary school and simplifies it to base 2. Multiplication of two $\textit{N}$digit
numbers is composed of three steps. First, $\textit{N}$ partial products are produced
by multiplying the multiplicand with each digit of the multiplier. The partial products
are then shifted to their proper positions and added by carry save adders (CSA) until
only two binary numbers are left in the second step. In the third step, the two numbers
are added with a carry propagation adder (CPA) to obtain the final result.
The research on highspeed multipliers focuses on the first and second steps because
many highspeed CPAs have been designed and are still researched by adder designers.
Booth encoding was developed to reduce the number of partial products. Traditional
$\textit{N}$${\times}$$\textit{N}$ multiplication generates $\textit{N}$ partial products,
but the radix$\textit{r}$ ($\textit{r}$=2$^{k}$) Booth multiplier ^{[5}^{10]} can generate one partial product per $\textit{k}$ digits of a multiplier by using
encoding logics. Therefore, only $\textit{N}$/$\textit{k}$+1 partial products are
generated in the multiplier.
The most critical issue in multiplier design is adding the partial products generated
in the first step. Generally, summations with CSAs are performed until two partial
sums are left, which should be added by a CPA. The speed of the multiplier depends
mostly on the way of reducing the many partial products to two partial sums.
The most widely used reduction scheme is the Wallacetree ^{[2]}, which adds all bits of a digit in parallel with multiple adders. Since tens of bits
exist in a digit, several stages of parallel addition are required to obtain the final
result. For example, the reduction process for 32 partial products requires 8 stages
of adder layers with the Wallacetree structure. Besides the Wallacetree architecture,
many modified architectures have been researched for fast addition of partial products.
Although many reduction schemes have been developed so far, all multipliers are designed
with (3, 2)adders. Some wideinput adders called addercompressors ^{[10]} are used for some multiplier designs, but most addercompressors are also built with
(3, 2)adders. Recently, a (7, 3)adder based multiplier design was proposed ^{[11]}. An ($\textit{m}$, 3)adder (4 ${\leq}$ $\textit{m}$ ${\leq}$ 7) adds $\textit{m}$
bits in the same digit at a time and produces 3 outputs: S (sum), C (carry), and N
(next carry), such that the binary number (N C S) is the result of the addition. A
(7, 3)adder based multiplier uses a (7, 3)adder as the basic building unit of the
multiplier with (6, 3), (5, 3), (4, 3), (3, 2), and (2, 2) adders as auxiliary units.
The advantage of a (7, 3)adder based multiplier is that the reduction process with
CSA can be built with fewer stages. Although a (7, 3)adder circuit is slower and
more complex than a (3, 2)adder circuit, the reduced number of stages could compensate
for this. In this paper, newly designed fast ($\textit{m}$, 3)adder (4 ${\leq}$ $\textit{m}$
${\leq}$ 7) circuits are presented. The new ($\textit{m}$, 3)adders are smaller,
faster and consume less power than previous ($\textit{m}$, 3)adder circuits.
2. (7, 3)adder based Multiplier Design
All multipliers designed so far use a (3, 2)adder, which is usually called a full
adder for adding partial products. For decades, a number of full adders have been
designed, so it may not be possible to design a new faster full adder circuit any
more. Therefore, designers of faster multipliers have turned their attention to architectural
design of adder layers or efficient parallel processing and pipelining.
However, a new challenge is attempted to speed up multipliers by the design of a fast
(7, 3)adder circuit. The advantage is that the (7, 3)adders can reduce partial products
more rapidly than (3, 2)adders. Therefore, a small number of adder stages is required
in the reduction process. Since the latency of the reduction process is the summation
of the delay of all stages, the increased delay from the (7, 3)adder stage can be
compensated by the decreased number of stages. Although a (7, 3)adder circuit is
slower, bigger and uses more power than a (3, 2)adder, a (7, 3)adder based multiplier
requires fewer adders and stages than a (3, 2)adder based multiplier, so that there
are cross points in speed, area, and power. If it is possible to design a (7, 3)adder
circuit within the cross point\textbf{,} a (7, 3)adder based multiplier can operate
faster than (3, 2)adder based multipliers.
To obtain the speed limit of a (7, 3)adder circuit, it is necessary to compare the
number of adder stages in both multipliers. A (3, 2)adder produces two outputs, S
(sum) and C (carry), by adding three bits in the same digit. Since a (3, 2)adder
removes 3 bits by an operation but introduces 2 new bits to the next layer, 1/3 of
the number of bits is reduced by a stage. Likewise, an ($\textit{m}$, 3)adder (4
${\leq}$ $\textit{m}$ ${\leq}$ 7) adds $\textit{m}$ bits at a time and produces 3
outputs: S, C, and N, so that 4/7 of the number of bits is reduced by a stage of (7,
3)adder.
Fig. 1 shows the critical path of the reduction process in a 64bit radix4 Boothencoded
Wallacetree multiplier designed with (3, 2)adders and (7, 3)adders. 33 partial
products can be reduced by three (7, 3)adder stages and one (3, 2)adder stage, while
8 stages are required when only (3, 2)adders are used. Table 1 shows the number of layers required for the reduction step for $\textit{N}$ (=2$^{k}$)bit
Wallacetree multipliers. To reduce 2$^{k}$(or 2$^{k}$+1) partial products to two
partial sums, the (7, 3)based multiplier needs $\textit{k}$2 (7, 3)adder stages
and one (3, 2)adder stage, while the other one requires 2($\textit{k}$1) stages
of (3, 2)adders. Therefore, the (7, 3)adders reduce the number of stages by half.
Table 1. The Number of Adder Stages Required in the Reduction Process with WallaceTree Structure.

(3, 2)adderbased

(7, 3)adderbased

$\textit{N}$ (2$^{k}$)

32

64

128

32

64

128

$\textit{k}$

5

6

7

5

6

7

# of partial products

32

64

128

32

64

128

# of (7, 3) stages







3

4

5

# of (3, 2) stages

8

10

12

1

1

1

Fig. 1. Critical path for the reduction process of a 64bit radix4 Booth multiplier with Wallacetree architecture: (a) (3, 2)adderbased multiplier; (b) (7, 3)adderbased multiplier.
For a speed advantage, the delay of a (7, 3)adder should not exceed 2 times the latency
of a (3, 2)adder. While speed is the most important feature in most circuit designs,
power and area are significant features as well.
The number of ($\textit{m}$, 3)adders required in the reduction step is about 1/3
of the (3, 2)adders, which could be another criterion for the design of a (7, 3)adder
circuit.
A Booth multiplier with a high radix needs complex encoders and odd multiples of the
multiplicand, which requires much hardware resources and extra processing time. One
stage of parallel addition with (7, 3)adders reduces the number of partial products
by half, so its effect is equivalent to squaring the radix of the Booth encoder. For
example, the number of partial products generated by radix16 Booth multipliers is
the same as the number of partial sums left after passing the first stage of a (7,
3)adder in radix4 Booth multipliers. Likewise, those left after passing the second
stage of the (7, 3)adder layer in the radix4 multiplier is the same as the number
of partial products in a radix256 Booth multiplier. Therefore, a (7, 3)adder based
multiplier can provide more options in optimizing Booth multipliers.
3. The Fast (m, 3)adder Circuits
As the number of inputs increases, the adder circuit becomes slower and more complicated.
Therefore, among ($\textit{m}$, 3)adders, a fast (7, 3)adder circuit is the bottleneck
of the design. It is obvious that the circuits of a (6, 3)adder, (5, 3)adder, and
(4, 3)adder will be faster than a (7, 3)adder circuit if they are designed with
the same technique.
For convenience, the newly designed ($\textit{m}$, 3)adder circuits can be divided
into two parts: the body and leaves, as shown in Fig. 2. The circuits for S, C, and N share the same body, and each circuit is generated
in a leaf separately by using the outputs of the body. For fast operations, the body
is designed with NMOS only. Due to the properties of NMOS, it is strong in pulldown
but weak in pullup. Therefore, there is a large difference between rising and falling
delay: the rising delay is very large, whereas the falling delay is small.
The main idea for the circuit design is that the outputs S, C, and N should respond
only to negative transitions. For example, the sum of 7 inputs, S, is 1 when X1 or
X3 is 0, but it is 0 otherwise. Assume that the inputs change from (0000000) to (0000111)
and then to (0000101). (X0, X1, X2, X3) will be changed from (0111) to (1110) and
to (1101). Initially, S is 0 and is then changed to 1 since X3 becomes 0. S is changed
quickly since the falling delay of X3 is small. Next, X3 becomes 1, so S should be
changed to 0. If S is controlled by X1 and X3 only, S should be changed after X3 becomes
1, which is very slow compared to negative transition. To avoid the influence of positive
transitions, let S be controlled by not only X1 and X3, but also X0 and X2. In the
second transition, S is controlled by the negative transition of X2 instead of the
positive transition of X3. The circuit in Fig. 2(c) makes the output respond to changing signals such that S is changed by X2 even if
both X2 and X3 are 0 for a while.
C and N circuits work in the same manner. The same design technique is applied for
other ($\textit{m}$, 3)adder circuits. The circuits of (4, 3), (5, 3), (6, 3), and
(7, 3)adders are presented in Figs. 25. The leaf circuits of S and C are the same for all ($\textit{m}$, 3)adder circuits.
Fig. 2. The circuit of a (4, 3)adder: (a) body part; (b) leaf for N circuit; (c) leaf for S circuit; (d) leaf for C circuit.
Fig. 3. The circuit of a (5, 3)adder: (a) body part; (b) leaf for N circuit.
Fig. 4. The circuit of a (6, 3)adder: (a) body part; (b) leaf for N circuit.
Fig. 5. The circuit of a (7, 3)adder: (a) body part; (b) leaf for N circuit.
4. Experiments
The characteristics of the newly designed ($\textit{m}$, 3)adders were estimated
by simulations and compared with those of existing ($\textit{m}$, 3)adder circuits
^{[11]}. The simulation was performed by HSPICE with 1.2V0.13${\mathrm{\mu}}$m model parameter.
Table 2 summarizes the simulation results comparing the speed of the newly designed circuits
with the existing circuits ^{[11]} and the reference (3, 2)adder circuit. The wellknown full adder circuit as in Fig. 6 ^{[12]} is chosen as the reference (3, 2)adder circuit. The delay of an adder is measured
by cascading 10 identical adders (fanout=1). The delays and power of the circuits
are measured in various combinations of input changes. The speed of the (7, 3)adder
varies widely with the change of inputs. The worst case is when input A changes, while
the best case is when only input G changes. The delay and the power in Table 2 are taken as the worstcase values in the simulations.
Fig. 6. The reference (3, 2)adder circuit.
To be adopted for a highspeed multiplier, the delay of the (7, 3)adder should be
less than 2 times the delay of the (3, 2)adder. The delay of the previous (7, 3)adder
^{[11]} is about 1.9 times longer than that of the (3, 2)adder. Although it is within the
boundary, it is so close to the limit that the advantage of the (7, 3)adder based
multiplier is not clear. The delay of the proposed (7, 3)adder circuit is 1.45 times
longer than that of the (3, 2)adder, which is faster than the previous (7, 3)adder.
This verifies that the (7, 3)adder based multiplier can operate faster than the (3,
2)adder based multipliers by upgrading the circuit of the (7, 3)adder.
Table 2. Simulation Results.

Adder

(3, 2)

(4, 3)

(5, 3)

(6, 3)

(7, 3)

Ref.

280





In ^{[11]}


315

381

422

528

New


312

319

352

406

Ref.

5.66





New


8.20

9.31

10.81

13.43

* Power is measured at 100MHz frequency
To estimate the speed enhancement of the (7, 3)adder based multiplier, 64bit radix4
Booth multipliers were designed based on the (3, 2)adder and the (7, 3)adder. Since
the (7, 3)adder is used only in the reduction step, the two multipliers differ only
in the reduction circuit. The same radix4 Booth encoder circuit and 128bit carry
selection adder are used in both multipliers. The delays of the multipliers are simulated
with HSPICE, and the result is shown in Table 3.
By using the new (7, 3)adder circuit, the (7, 3)adder based multiplier completes
the reduction step 35% faster than the (3, 2)adder based Wallacetree reduction circuits.
Enhancement of the total delay depends largely on the 128bit adder used in the multiplier
because many adders have been developed, and their delays differ widely. In the simulation,
a progressively partitioned carry selection adder ^{[13]} was used in the multiplier. The latency of the adder is 3.2 ns for addition of two
128bit binary numbers. With this 128bit adder, the (7, 3)adderbased multiplier
is 13% faster than the (3, 2)adder based multiplier. Recently, a 128bit modified
carry lookahead adder ^{[14]} was proposed, and its delay was 0.42 ns with 1.8V0.18${\mu}$m technology. If this
adder is used, the speed is increased by 25%. For a speedoriented design, in general,
a highspeed adder is employed in the CPA step, so the latency of the reduction step
would be the dominant component of the total delay. Therefore, the efficiency of the
(7, 3)adder increases when a highspeed 128bit adder is used.
Table 3. The Delay Components of a 64bit Radix4 Booth Multiplier.

Delay (ns)

(3,2)based

(7,3)based

Partial product generation

0.54

0.54

Reduction (CSA)

2.24

1.45

Carry propagation addition

3.20

3.20

Total

5.98

5.19

Table 4. Comparison of the Number of CSA Adders to Build a 64bit Radix4 Booth Multiplier.

Adder

WallaceTree

(7, 3) based

# of adders

(2, 2)

67

26

(3, 2)

2021

218

(4, 3)



95

(5, 3)



100

(6, 3)



102

(7, 3)



378

# of stages

(7, 3)


3

(3, 2)

8

1

The worstcase power of the (7, 3)adder is about 2.4 times that of the (3, 2)adder.
Although it consumes much more power than the (3, 2)adder, the reduced number of
adders can compensate for this increase. Table 4 shows the number of adders required to build the reduction layers for the 64bit
radix4 Booth multiplier with Wallacetree structure. If we combine the values in
Tables 2 and 4, the worstcase power in the reduction process was 11.4 mW for the
(3, 2)adder based multiplier and 9.1 mW for the (7, 3)adder based multiplier. This
result shows that the (7, 3)adder based multiplier consumes less power than the (3,
2)adder based multiplier.
The simulation results verify that a faster and lowerpower multiplier can be implemented
by employing the (7, 3)adder. The design of widerinput adders could serve as an
alternative approach for highspeed multipliers.
5. Conclusion
A set of ($\textit{m}$, 3)adder circuits was presented in this paper for use in the
design of a highspeed multiplier. Simulations proved that the (7, 3)adder based
multiplier with the proposed circuits is faster and uses less power than the (3, 2)adder
based multiplier.
The circuit of a (3,2)adder is relatively simple, and various circuits have been
developed for decades, so it is almost impossible to design a new faster circuit.
However, the circuit of a (7, 3)adder is complex and rarely developed. The proposed
adder circuits verified that highperformance multipliers can be achieved through
the design of a highperformance (7, 3)adder circuit.
REFERENCES
Park M. C., Lee B. W., Kim G. M., Kim D. H., 1993, Compact and Fast Multiplier Using
Dual Array Tree Structure, IEEE Int. Symp. on Circuit and Systems, Chigago, Vol. 3,
pp. 18171820
Wallace C.S., 1964, A Suggestion for a Fast Multiplier, IEEE Trans. Electron. Compt.,
Vol. 13, No. 1, pp. 1417
Dadda L., 1965, Some Schemes for Parallel Multipliers, Alta Frequenza
Mehta P., Gawali D., 2009, Conventional versus Vedicmathematical method for Hardware
implementation of a multiplier, IEEE Int. Conf. on Advances in Computing, Control,
Telecommun.Technologies, pp. 640642
Booth A. D., 1951, A Signed Binary Multiplication Technique, Jour. ofMech. Appl. Math.,
Vol. 4, pp. 236240
Yeh W. C., Jen C. W., 2000, HighSpeed Booth Encoded Parallel Multiplier Design, IEEE
Trans. on Compt., Vol. 49, No. 7, pp. 692701
Schwarz E. M., III R. M. A., Sigal L. J., 1997, A radix8 CMOS S/390 multiplier, in
Proc. 13th IEEE Symp. Comput. Arithmetic (ARITH), pp. 29
ColonBonet G., Winterrowd P., 2008, Multiplier evolution: A familyof multiplier VLSI
implementations, Comput. J., Vol. 51, No. 5, pp. 585594
Riedlinger et al. R., 2012, A 32 nm, 3.1 billion transistor, 12 wide issue itanium
processor for missioncritical servers, IEEE J. SolidState Circuits, Vol. 47, No.
1, pp. 177193
Rao M., Dubey S., 2012, A high speed and area efficient Booth recoded Wallace tree
multiplier for fast arithmetic circuits, Microelectronics and Electronics (Prime Asia),
2012 Asia Pacific Conference onPostgraduate Research, pp. 220223
Yoon M., 2019, Design of A Fast Multiplier with (m, 3)Adders, IJERT, Vol. 12, No.
10, pp. 17571763
Neil H.E. Weste , David M. Harris , Kamran , 2010, Principles of CMOS VLSI Design:
A systems perspective, Pearson, Fourth Edition
Pilato L., Saponara S., Fanucci L., 2016, Performance of digital adder architectures
in 180nm CMOS standardcell technology, 2016 International Conference on Applied Electronics
(AE), pp. 211214
Ghafari S., Mousazadeh M., Khoei A., Dadashi A., 2019, A New Highspeed and Low area
Efficient Pipelined 128bit Adder Based on Modified Carry Lookahead Merging with
HanCarlson Tree Method, 2019 MIXDES  26th International Conference "Mixed Design
of Integrated Circuits and Systems", pp. 157162
Author
Myungchul Yoon received the BS and MS degrees in Electronics Engineering from Seoul
National University, Korea, in 1986 and in 1988 respectively, and the Ph.D. degree
in Electrical and Computer Engineering from the University of Texas at Austin in 1998.
From 1988 to 2002, he was with Hynix Inc. Icheon, Korea as a technical research staff
at Semiconductor R&D Lab., and Mobile Communication R&D Lab. From 2005 to 2006, he
was with DGIST, Korea as a technical staff at the Information Technology R&D Division.
Since 2006, he has been with the Department of Electronics and Electrical Engineering,
Dankook University, YongIn, Korea, where he is a professor. His research interests
are in lowpower VLSI design, embedded systems, and mobile communication..