Approximate computing has been widely used in image processing applications to significantly reduce the hardware cost of circuits; however, this comes at the cost of computing accuracy. The trade-off between accuracy and hardware cost in approximate multipliers has not been fully explored. To address this issue, this paper proposes a set of approximate 8×8 Dadda multipliers built on an efficient imprecise 4-2 compressor. The compressor introduces symmetrical errors into the truth table of the exact design to reach a simpler structure. Furthermore, image multiplication, an important image processing application, is implemented with the proposed multipliers. Synthesis and simulation results show that the overall performance of the multipliers varies with the assessment criteria. Utilizing the modified compressor in the multipliers reduces area, delay, and power by 38%-72%, 14%-33%, and 39%-77%, respectively, compared to the exact design, while maintaining acceptable computing accuracy in image multiplication. According to the results, the proposed multipliers achieve a better trade-off between energy efficiency and computing accuracy than existing designs, making them viable alternatives to exact multipliers in image processing.


## 1. Introduction

Approximate computing is an attractive paradigm in circuit design: by relaxing the demand for fully accurate operations, it reduces power consumption, delay, and area at the expense of computing accuracy. This trade-off between hardware cost and computing accuracy is especially relevant to error-resilient applications, such as machine learning and multimedia processing.

Multipliers are the basic blocks of digital systems, and usually consist of three
steps: 1) generating partial products, 2) reducing the partial products, and 3) summing
the final results. Among them, the second step accounts for the dominant hardware
cost. Using efficient compressors can significantly reduce the complexity of this
step and thus improves the performance of multipliers ^{[1]}. In particular, 4-2 compressors are widely applied in multipliers to accelerate the reduction
of partial products. In ^{[2]}, a compressor was designed that ignores the input signal cin and the output signal cout to improve multiplier performance
in terms of power and delay. The multiplier utilizing this compressor shows a great reduction in hardware requirements and transistor count,
compared to the existing designs. Three 4-2 compressors were proposed in ^{[3]} by modifying the truth table of an exact compressor. However, the multipliers using
these compressors were inferior in overall performance. In ^{[4]}, the partial-product-altering method was applied to a 4-2 compressor, and realized
a balance between hardware cost and multiplier accuracy. A compressor using a majority
gate was designed in ^{[5]} by ignoring input signal x$_{2}$, cin, and the cout signal to achieve excellent power
and delay performance. The stacking circuit technique was adopted in ^{[6]} to design approximate multipliers with high computing accuracy while leading to high
hardware costs. In ^{[7]}, a new compressor was designed using only simple AND-OR gates, and the multiplier
utilizing this compressor provided a good error-electrical performance trade-off.
The dual-quality 4-2 compressors introduced in ^{[8]} can be flexibly switched between precise and approximate operating modes. Therefore,
multipliers using these compressors can realize dynamic change in accuracy at runtime.

To improve the trade-off between hardware cost and computing accuracy in approximate circuits, this paper proposes a set of approximate 8${\times}$8 Dadda multipliers. To that end, an imprecise 4-2 compressor using only OR and XNOR gates is designed by introducing symmetrical errors into the truth table of the exact compressor; these errors can counteract each other within a multiplier. This method reduces the design complexity of multipliers in area, power, and delay while producing satisfactory results. The main contributions of this paper are summarized as follows.

1) An approximate 4-2 compressor is proposed to reduce the design complexity of the partial product reduction step in multipliers.

2) A set of approximate Dadda multipliers is built from the compressors to find a better structure with a lower hardware cost and higher computing accuracy.

3) The image multiplication operation is realized through these multipliers to evaluate computing accuracy in real applications.

4) The trade-off between hardware cost and accuracy in the multipliers is comprehensively analyzed through various evaluation criteria as an example in approximate computing.

This paper proceeds as follows. In Section 2, the previous approximate 4-2 compressors are reviewed. Section 3 presents the proposed approximate compressor and multipliers. The synthesis results and their application to image processing are presented in Section 4. Section 5 concludes this paper.

## 2. Related Work

In this paper, we look to 4-2 compressors to build 8${\times}$8 Dadda multipliers owing to their simplified structure and high efficiency in transistor-level implementations. In recent years, several methods have been proposed to design imprecise 4-2 compressors, and they were utilized to design approximate multipliers. Some previous approximate designs that ignored cin and cout are summarized and compared in this section.

In the approximate 4-2 compressor presented in ^{[2]}, the critical path delay is lower than in the previous design, and the number
of gates is further reduced. Three approximate 4-2 compressors were proposed in ^{[3]}; they use a K-map to obtain simplified logical expressions that reduce errors while
providing a significant performance improvement over previous 4-2 compressors. The
first and the second designs in ^{[3]} only have four gates, which greatly simplifies the structural complexity. The third
design is the most accurate while having a more complex structure compared with other
designs. In ^{[4]}, to simplify the circuit of the 4-2 compressor, an OR gate replaces an XOR gate to
compute a sum, thus introducing additional errors. An ultra-efficient compressor proposed
in ^{[5]} consists of one majority gate, which is different from conventional designs. Since
input x$_{2}$ is omitted, and output sum is always equal to 1, this approximate compressor
reaches a simpler logic implementation. The compressors in ^{[6]} have high accuracy, using the stacking circuit technique. A hardware-efficient approximate
compressor proposed in ^{[9]} was obtained by modifying the truth table of the exact compressor, and consists of
only three NOR gates and one NAND gate. In ^{[10]}, an ultra-compact 4-2 compressor was proposed based on simple AND-OR logic, which
leads to a trade-off between hardware cost and precision. In ^{[11]}, the proposed compressor was obtained by modifying an approximate compressor, and
the performance of the applied multiplier improved. Three approximate compressors
were presented in ^{[12]}, and they all innovatively reduced the number of outputs to one, thus significantly
reducing the hardware cost.

## 3. The Proposed Compressor and Multipliers

### 3.1 The Compressor

As shown in Fig. 1, an exact 4-2 compressor generally consists of two full adders with five inputs (x$_{1}$, x$_{2}$, x$_{3}$, x$_{4}$, and cin) and three outputs (sum, carry, and cout) ^{[13]}. The outputs encode the number of logic 1s among the five inputs according to (1), (2), and (3):

##### (1)

$ sum=x_{1}\oplus x_{2}\oplus x_{3}\oplus x_{4}\oplus c_{in} $

##### (2)

$ c_{out}=\left(x_{1}\oplus x_{2}\right)x_{3}+\overline{\left(x_{1}\oplus x_{2}\right)}x_{1} $

##### (3)

$ carry=\left(x_{1}\oplus x_{2}\oplus x_{3}\oplus x_{4}\right)c_{in}+\overline{\left(x_{1}\oplus x_{2}\oplus x_{3}\oplus x_{4}\right)}x_{4} $

The four inputs, x$_{1}$, x$_{2}$, x$_{3}$, and x$_{4}$, and the output sum have the same weight, whereas the weights of cout and carry are one binary bit order higher ^{[12,}^{14]}. Therefore, cout and carry are delivered to the next module of higher significance.
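A short script can verify this counting behavior exhaustively. The gate-level expressions below follow the standard exact 4-2 compressor design; they are written from that well-known structure rather than taken verbatim from the paper, so treat this as an illustrative sketch.

```python
from itertools import product

def exact_compressor(x1, x2, x3, x4, cin):
    """Exact 4-2 compressor: outputs encode the count of 1s among the inputs."""
    s = x1 ^ x2 ^ x3 ^ x4 ^ cin                  # sum, weight 1
    cout = x3 if (x1 ^ x2) else x1               # (x1 xor x2)x3 + not(x1 xor x2)x1
    carry = cin if (x1 ^ x2 ^ x3 ^ x4) else x4   # weight 2, like cout
    return s, carry, cout

# sum + 2*(carry + cout) must equal the number of 1s among the five inputs.
for bits in product([0, 1], repeat=5):
    s, carry, cout = exact_compressor(*bits)
    assert s + 2 * (carry + cout) == sum(bits)
print("verified for all 32 input combinations")
```

The assertion passing for all 32 input patterns confirms that the two weighted outputs carry and cout, together with sum, exactly count the input 1s.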

In this work, the proposed 4-2 compressor (Fig. 2) is derived by modifying the truth table of the exact compressor to obtain the simpler logic expressions in (4) and (5), and by ignoring signals cin and cout for design efficiency, as in previous work ^{[2]}:

##### (4)

$ sum=\overline{x_{3}\oplus x_{4}} $

##### (5)

$ carry=x_{3}+x_{4} $

Inputs x$_{1}$ and x$_{2}$ are also omitted to further simplify the compressor and reduce the energy and critical path delay. Thus, the design consists of only an OR gate and an XNOR gate. Although the omission of x$_{1}$ and x$_{2}$ introduces certain errors, the proposed compressor is only used for the approximate part of the multipliers, so it has little impact on computing accuracy. Attention is therefore paid to the hardware/accuracy trade-off of the multipliers, rather than to any single indicator.

As seen in the truth table in Table 1, the proposed design has eight erroneous outputs out of 16 outputs. Error is defined
as the arithmetic distance between the exact and approximate values ^{[15]}. For example, when all inputs are 1, the exact output is 4, and the proposed compressor
produces a 1 for both sum and carry. In this case, the decimal output is 3, so the
error distance is 1. The maximum error magnitude generated by this design is 1 (errors of both -1 and +1 occur), which avoids unacceptable results when the compressor is applied to approximate multipliers. Moreover, within the structure of a multiplier, error distances with the opposite signs -1 and +1 counteract each other ^{[5]}.
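The symmetry of these errors is easy to check in a few lines. The expressions below (sum as the XNOR of x$_{3}$ and x$_{4}$, carry as their OR) are inferred from the gate count and error pattern described above, not copied from the paper:

```python
from itertools import product

def approx_compressor(x3, x4):
    """Proposed approximate 4-2 compressor (inferred): x1, x2, cin, cout ignored."""
    s = 1 - (x3 ^ x4)   # XNOR gate
    carry = x3 | x4     # OR gate
    return s, carry

# Error distance vs. the exact count of 1s over all 16 (x1, x2, x3, x4) patterns.
errors = []
for x1, x2, x3, x4 in product([0, 1], repeat=4):
    s, carry = approx_compressor(x3, x4)
    errors.append((s + 2 * carry) - (x1 + x2 + x3 + x4))

print(sum(e != 0 for e in errors))   # 8 erroneous outputs out of 16
print(sorted(set(errors)))           # [-1, 0, 1]: symmetrical, magnitude at most 1
```

Running this reproduces the behavior claimed for Table 1: exactly half of the 16 input patterns are erroneous, and every error has magnitude 1 with both signs represented.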

##### Table 1. Truth table of the proposed 4-2 compressor.

### 3.2 The Approximate Multipliers

To investigate the impact of the proposed compressor on multiplication, 8${\times}$8
Dadda multipliers with various levels of accuracy are designed. The basic structure
of the approximate Dadda multiplier was described in ^{[2]} where the multiplier uses AND gates to generate all partial products in the first
step, and then uses approximate compressors to compress them into, at most, two rows.
In the last step, an exact ripple carry adder computes the results.

In designing multipliers, the second step plays a critical role in terms of delay, power consumption, and area. The proposed multipliers are denoted M${\alpha}$${\beta}$${\gamma}$, where ${\alpha}$, ${\beta}$, and ${\gamma}$, respectively, represent the number of columns compressed using exact compressors, approximate compressors, and truncation. To reduce hardware cost, the least significant columns of the partial products are truncated: in some applications, such as image processing, accuracy beyond a certain level brings no perceptible benefit, while the corresponding exact operations consume relatively large amounts of energy. Exact compressors are therefore applied to the most significant columns to compensate for the loss of computing accuracy, while the proposed approximate compressors are applied to the middle columns of the partial products to reduce the hardware cost. To investigate the trade-off between hardware cost and accuracy, a set of multipliers was designed. Obviously, M7${\beta}$${\gamma}$ and M6${\beta}$${\gamma}$ aim at improving computing accuracy, while M5${\beta}$${\gamma}$ aims at reducing the hardware cost.
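As a sanity check on the naming scheme, the three digits of every design mentioned in this paper (M654, M717, M519, and so on) sum to 15, the number of partial-product columns in an 8×8 multiplier. The small helper below encodes that inferred invariant; it is an illustration, not code from the paper:

```python
def partition(name, width=8):
    """Split the 2*width - 1 partial-product columns of an M-alpha-beta-gamma design."""
    alpha, beta, gamma = (int(d) for d in name[1:])
    assert alpha + beta + gamma == 2 * width - 1, "columns must cover the whole product"
    return {"exact_msb": alpha, "approx_mid": beta, "truncated_lsb": gamma}

print(partition("M654"))  # {'exact_msb': 6, 'approx_mid': 5, 'truncated_lsb': 4}
```

For example, `partition("M519")` describes the most aggressive design: five exact columns, one approximate column, and nine truncated columns.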

For example, the partial product reduction step of the proposed M654 is shown in Fig. 3, where each dot represents a partial product bit. In the first two stages, three half adders, three full adders, 10 of the proposed imprecise 4-2 compressors, and six exact 4-2 compressors are utilized. In the last stage, a half adder and nine full adders are applied to compute the results.

## 4. Simulation Results and Application

In this section, all designs were described in Verilog HDL and synthesized with the Synopsys Design Compiler NXT using a TSMC 65 nm standard cell library at 100 MHz to evaluate performance. Note that the standard CMOS cell library does not include a special module, so all circuits were synthesized using the compile\_ultra command to provide a fair comparison, and the logic functions of the existing designs were optimized under the same conditions. The reported power data come from the Synopsys PrimePower tool using the vector-free power analysis model. In addition, the error metrics and an image processing application of the multipliers were programmed in MATLAB.

### 4.1 The Approximate Compressor

A comparison of the proposed compressor and the existing exact and approximate compressors
in terms of area, power, and delay is shown in Table 2. For clarity, the three designs proposed in ^{[3]} are represented by ^{[3]}1, ^{[3]}2, and ^{[3]}3, and the three methods in ^{[6]} are denoted ^{[6]}1, ^{[6]}2, and ^{[6]}3. To comprehensively evaluate the efficiency of the proposed design, the power-delay product (PDP) and energy-delay product (EDP) are also listed ^{[9,}^{16]}.

As can be seen from Table 2, the proposed approximate compressor has a 74% reduction in area, a 27% reduction
in delay, and a 91% reduction in PDP, compared to the exact 4-2 compressor. Besides,
it is noteworthy that the proposed compressor has the lowest area and power, compared
to state-of-the-art 4-2 compressors. Although its PDP is slightly higher than that of ^{[5]}, the EDPs are equal. In summary, the proposed approximate 4-2 compressor has an advantage
in hardware overhead, owing to the optimized structure using only one OR gate and
one XNOR gate. Although the compressor in ^{[5]} has better delay and power than the one proposed here, the approximate multiplier
in ^{[5]} is inferior to the multipliers proposed here, as is explained later.

##### Table 2. Hardware comparison of 4-2 compressors.

### 4.2 The Approximate Multipliers

#### 4.2.1 Hardware Cost

The area, power, delay, PDP, and EDP of the approximate and exact multipliers are listed in Table 3. The proposed multipliers are divided into three types (M7${\beta}$${\gamma}$, M6${\beta}$${\gamma}$, and M5${\beta}$${\gamma}$) to get the trade-off between hardware cost and computing accuracy.

##### Table 3. Hardware comparison of 8${\times}$8 multipliers.

As seen from the results in Table 3, M5${\beta}$${\gamma}$ has the smallest area, power, and delay of the three types of multipliers, M7${\beta}$${\gamma}$ has the largest, and M6${\beta}$${\gamma}$ lies in between, reflecting the influence of ${\alpha}$. For each type of multiplier (such as M7${\beta}$${\gamma}$), when ${\gamma}$ increases, ${\beta}$ decreases, and the hardware cost is further reduced by the impact of ${\gamma}$. PDP and EDP are reported to further assess the performance of these multipliers, and they follow the same trend.

The data show that the proposed multipliers greatly outperform the exact design, reducing area, delay, and power by 38%-72%, 14%-33%, and 39%-77%, respectively, compared to the exact multiplier. Besides, most of the M5${\beta}$${\gamma}$ multipliers achieved significant hardware improvements over previous designs, especially M519, which had the best hardware performance of all designs, reducing PDP and EDP on average by 67% and 75%, respectively.

#### 4.2.2 Computing Accuracy

To evaluate the output quality of the approximate multipliers, the error rate (ER), mean error distance (MED), and normalized mean error distance (NMED) were computed by applying all 65,536 possible input combinations ^{[16]}. ER is the probability of producing an erroneous result, and MED is calculated with (6):

##### (6)

$ MED=\frac{1}{2^{2N}}\sum _{i=1}^{2^{2N}}\left| ED_{i}\right| $

where N is the bit width of a multiplier, and ED$_{i}$ represents the arithmetic difference between the approximate and exact results. NMED, which normalizes MED by the maximum output of the exact multiplier, is expressed in (7):

##### (7)

$ NMED=\frac{1}{\left(2^{N}-1\right)^{2}}\sum _{i=1}^{2^{2N}}\frac{\left| ED_{i}\right| }{2^{2N}} $

The accuracy metrics of the proposed multipliers are listed in Table 4. Among the three types of multiplier, M7${\beta}$${\gamma}$ has relatively small ER, MED, and NMED values. Besides, all the multipliers have a high ER, mainly due to the truncated structure, and ER decreases as the number of truncated columns increases. As for MED and NMED, they decrease as ${\gamma}$ increases, drop to a minimum when ${\beta}$ is 2, and then increase again. When the number of truncated columns reached the highest level, the multipliers had the worst computing accuracy; however, M717 was more accurate than M663, and M618 more accurate than M573, owing to the exact part in the most significant bits.
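These error metrics can be reproduced exhaustively in a few lines. Since the paper's M${\alpha}$${\beta}$${\gamma}$ netlists are not modeled here, the sketch below uses a simple truncation-only 8×8 multiplier as an illustrative stand-in:

```python
def truncated_mult(a, b, n=8, t=3):
    """8x8 multiplier that drops the t least significant partial-product columns."""
    total = 0
    for i in range(n):
        for j in range(n):
            if i + j >= t:  # partial products in columns 0..t-1 are truncated
                total += (((a >> i) & 1) * ((b >> j) & 1)) << (i + j)
    return total

N = 8
errs = [abs(truncated_mult(a, b) - a * b) for a in range(2**N) for b in range(2**N)]
ER = sum(e != 0 for e in errs) / len(errs)   # probability of an erroneous result
MED = sum(errs) / len(errs)                  # mean error distance, as in the text
NMED = MED / (2**N - 1) ** 2                 # normalized by the maximum exact output
print(f"ER={ER:.4f}  MED={MED:.2f}  NMED={NMED:.2e}")
```

Swapping `truncated_mult` for a bit-accurate model of any proposed design would yield the corresponding row of Table 4; the structure of the exhaustive sweep is the same.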

##### Table 4. ER, MED, and NMED of approximate 8${\times}$8 multipliers.

Compared to previous work, the NMED of the proposed multipliers was not the lowest; however, it is acceptable for most image processing applications ^{[17]}. M528 had the best accuracy of all designs except those in ^{[6]}. Although the multipliers in ^{[6]} have advantages in the accuracy metrics, they carry the highest hardware cost,
as shown in Table 3. Therefore, all performance evaluation metrics should be taken into account.

The error distribution of the proposed multipliers, including M7${\beta}$${\gamma}$, M6${\beta}$${\gamma}$, and M5${\beta}$${\gamma}$, is shown in Fig. 4, where the errors were mainly in the ranges [-600, 600], [-1000, 1000], and [-2000, 1000], respectively, accounting on average for about 83%, 84%, and 84% of the whole range. Thus, reserving an appropriate number of exact most significant bits preserves the accuracy of a multiplier.

##### Fig. 4. Error distance from the multipliers: (a) M5${\beta}$${\gamma}$; (b) M6${\beta}$${\gamma}$; (c) M7${\beta}$${\gamma}$.

As seen from the results above, M5${\beta}$${\gamma}$ has better hardware metrics but a worse NMED, while M7${\beta}$${\gamma}$ has a better NMED but a higher hardware cost. Thus, to reconcile the trade-off between accuracy and hardware cost, a figure of merit (FOM) was suggested in ^{[8]}. Because the delay of the proposed multipliers is relatively small, delay was removed from the metric for a fair comparison, giving the modified FOM1 in (8) ^{[5]}:

Fig. 5 shows FOM1 for the proposed and existing approximate 8${\times}$8 multipliers. The smaller the value of FOM1, the better the trade-off between accuracy and hardware. Thus, M627, M618, M564, M555, M546, M537, M528, and M519 have a lower FOM1 compared with other designs, indicating that most of the proposed multipliers offer a better trade-off than previous designs.

### 4.3 Image Multiplication

To assess the practicality of approximate multipliers in real applications, they were
applied to image multiplication as a widely used operation in image processing. The
discussed multipliers handled two images, pixel by pixel, thereby combining two images
into a single image ^{[18-}^{21]}.
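A behavioral sketch of this operation is given below, with the product of two 8-bit pixels scaled back to 8 bits by keeping the most significant byte (a common convention, assumed here since the exact scaling is not spelled out above). The `mult` argument can be swapped for a model of any of the approximate multipliers:

```python
def image_multiply(img_a, img_b, mult=lambda a, b: a * b):
    """Combine two 8-bit grayscale images (lists of pixel rows) pixel by pixel."""
    return [[min(255, mult(pa, pb) >> 8) for pa, pb in zip(row_a, row_b)]
            for row_a, row_b in zip(img_a, img_b)]

a = [[200, 100], [50, 255]]
print(image_multiply(a, a))  # [[156, 39], [9, 254]]
```

Because each output pixel keeps only the upper byte of the 16-bit product, errors confined to the least significant columns of a multiplier are largely invisible in the final image, which is why the truncated designs remain usable here.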

The peak signal-to-noise ratio (PSNR) and the mean structural similarity index metric
(MSSIM) ^{[22]} were computed to evaluate the quality of the processed images. PSNR is expressed
in (9):

##### (9)

$ PSNR=10\log _{10}\left(\frac{w\times r\times MAX^{2}}{\sum _{i=0}^{w-1}\sum _{j=0}^{r-1}\left[S'\left(i,j\right)-S\left(i,j\right)\right]^{2}}\right) $

where w and r are the width and height of the image, $\textit{S'(i, j)}$ and S(i, j) represent the exact and approximate value of each pixel, respectively, and MAX is the maximum pixel value. The larger the PSNR, the better the image. MSSIM is expressed in (10):
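Equation (9) translates directly into code; the image layout (lists of pixel rows) and MAX = 255 for 8-bit pixels are assumptions of this sketch:

```python
import math

def psnr(exact, approx, max_val=255):
    """PSNR per (9) for two same-sized images given as lists of pixel rows."""
    w, r = len(exact[0]), len(exact)
    sq_err = sum((s - sp) ** 2
                 for row_s, row_sp in zip(exact, approx)
                 for s, sp in zip(row_s, row_sp))
    if sq_err == 0:
        return float("inf")  # identical images
    return 10 * math.log10(w * r * max_val**2 / sq_err)

print(f"{psnr([[100, 150], [200, 250]], [[101, 149], [202, 248]]):.2f} dB")  # 44.15 dB
```

Note that PSNR diverges for identical images, which is why the zero-error case is returned as infinity rather than fed to the logarithm.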

##### (10)

$ MSSIM\left(X,Y\right)=\frac{1}{k}\sum _{i=1}^{k}\frac{\left(2\mu _{x}\mu _{y}+C_{1}\right)\left(2\sigma _{xy}+C_{2}\right)}{\left(\mu _{x}^{2}+\mu _{y}^{2}+C_{1}\right)\left(\sigma _{x}^{2}+\sigma _{y}^{2}+C_{2}\right)} $

where X and Y represent two images. Other parameters can be found in detail in ^{[22]}. MSSIM reaches 1 when the two processed images are the same.
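Equation (10) can be sketched as follows. For simplicity, this version averages SSIM over k non-overlapping blocks of a flattened image rather than the sliding windows of ^{[22]}; the block size and the constants C$_{1}$ and C$_{2}$ (from the common choices K$_{1}$ = 0.01, K$_{2}$ = 0.03, L = 255) are assumptions of the sketch:

```python
from statistics import fmean

def ssim_block(x, y, c1=6.5025, c2=58.5225):
    """SSIM of one block of pixels, per (10)."""
    mx, my = fmean(x), fmean(y)
    vx = fmean((xi - mx) ** 2 for xi in x)
    vy = fmean((yi - my) ** 2 for yi in y)
    cov = fmean((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx**2 + my**2 + c1) * (vx + vy + c2))

def mssim(img_x, img_y, block=64):
    """Mean SSIM over k non-overlapping blocks of two flattened images."""
    return fmean(ssim_block(img_x[i:i + block], img_y[i:i + block])
                 for i in range(0, len(img_x), block))

x = list(range(256))
print(mssim(x, x))  # identical images give exactly 1.0
```

When the two inputs are identical, every block's numerator and denominator coincide, so MSSIM evaluates to exactly 1, matching the property stated above.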

Table 5 shows PSNR and MSSIM values for five image multiplication examples. All the proposed multipliers achieved PSNR values higher than 30 dB for the various images, and a PSNR above 30 dB is generally considered good enough ^{[23]}. Besides, the MSSIM results for all approximate multipliers are very close to that of the exact design (MSSIM = 1). Moreover, both PSNR and MSSIM increase as the number of exact columns increases.

##### Table 5. PSNR and MSSIM of multiplied images using the 8${\times}$8 multipliers.

To visualize the effect of approximate multiplication on image quality, multiplied images LenaRGB and Lena (using the considered multipliers) are shown in Fig. 6. The results indicate no obvious differences between the proposed designs and the exact design.

To comprehensively evaluate the efficiency of the discussed approximate designs in image processing, hardware cost and image quality should be considered simultaneously, rather than through any single assessment. To further quantify the practicability of the approximate multipliers, FOM2 is expressed in (11) ^{[24]}:

A smaller FOM2 value indicates a better compromise between hardware efficiency and accuracy. Fig. 5 also shows FOM2 for the discussed multipliers (plotted alongside FOM1 to save space). The results indicate a decreasing trend. Among them, M627, M618, M537, M528, and M519 provided a better FOM2 than the other designs. Specifically, M528 takes first place in this regard, with a 63% reduction on average compared to the existing designs, followed by M519 and M537.

## 5. Conclusion

In this work, an ultra-efficient approximate 4-2 compressor was proposed by introducing symmetrical errors into the truth table of the exact compressor. A set of Dadda multipliers, denoted as M${\alpha}$${\beta}$${\gamma}$, was designed to investigate the hardware/accuracy trade-off. Image multiplication was considered as an example to evaluate computing accuracy. Experimental results showed that the accuracy of a multiplier is mainly dominated by the exact part, while the hardware cost is affected by the approximate and truncated parts. Furthermore, two figures of merit show that a comprehensive indicator should be considered to reach a compromise between hardware and accuracy, because a multiplier having high accuracy will consume high amounts of energy. In addition, several proposed multipliers surpassed their counterparts under the considered criteria.

### ACKNOWLEDGMENTS

This work was supported by the Fundamental Research Funds for the Central Universities of China (Grant No. JZ2020HGQA0162, Grant No. JZ2020HGTA0085).

### REFERENCES

## Author

Yongqiang Zhang received the B.S. degree in electronic science and technology from Anhui Jianzhu University, Hefei, China, in 2013, and the Ph.D. degree in integrated circuits and systems from the Hefei University of Technology, Hefei, in 2018. He was a Visiting Student with the Department of Electrical and Computer Engineering, University of Alberta, for one year. He is currently with the School of Microelectronics, Hefei University of Technology. His research interests include approximate computing, stochastic computing, VLSI design, and nanoelectronics circuits and systems.

Cong He received her B.S. degree in Electronic Information and Engineering from Anhui Jianzhu University, Hefei, China, in 2019. She is currently pursuing the M.S. degree in Microelectronics with the Hefei University of Technology. Her research interests include approximate computing and emerging technologies in computing systems.

Xiaoyue Chen received her B.S. degree in Electronic and Information Engineering from the Liaoning University of Engineering and Technology, Huludao, China, in 2021. She is currently pursuing the M.S. degree in Microelectronics with the Hefei University of Technology. Her research interests include approximate computing and stochastic computing.

Guangjun Xie received the B.S. and M.S. degrees in microelectronics from the Hefei University of Technology, Hefei, China, in 1992 and 1995, respectively, and the Ph.D. degree in signal and information processing from the University of Science and Technology of China, Hefei, in 2002. He worked as a Post-Doctoral Researcher in optics with the University of Science and Technology of China from 2003 to 2005. He was a Senior Visitor with IMEC in 2007 and ASIC in 2011. He is currently a Professor with the School of Microelectronics, Hefei University of Technology. His research interests include integrated circuit design and nanoelectronics. Dr. Xie is a Senior Member of the Chinese Institute of Electronics.