
  1. School of Microelectronics, Hefei University of Technology, Hefei, China (ahzhangyq@hfut.edu.cn, 2191158315@qq.com, 1617090911@qq.com, gjxie8005@hfut.edu.cn)



Approximate computing, Multiplier, Compressor, Energy consumption, Image multiplication

1. Introduction

Approximate computing is an attractive paradigm in circuit design: by relaxing the requirement for fully accurate operations, it reduces power, delay, and area at the expense of some loss in computing accuracy. This trade-off between hardware cost and computing accuracy is especially relevant to error-resilient applications, such as machine learning and multimedia processing.

Multipliers are basic blocks of digital systems and usually operate in three steps: 1) generating the partial products, 2) reducing the partial products, and 3) summing the final result. The second step accounts for the dominant hardware cost, so efficient compressors can significantly reduce its complexity and thus improve multiplier performance [1]; 4-2 compressors in particular are widely used to accelerate partial product reduction. In [2], a compressor ignored the input signal cin and the output signal cout to improve the power and delay of multipliers; the multiplier using this compressor shows a large reduction in hardware requirements and transistor count compared to existing designs. Three 4-2 compressors were proposed in [3] by modifying the truth table of an exact compressor, but the multipliers using them were inferior in overall performance. In [4], a partial-product-altering method was applied to a 4-2 compressor to balance hardware cost against multiplier accuracy. A compressor using a majority gate was designed in [5] by ignoring input x$_{2}$, cin, and cout to achieve excellent power and delay performance. The stacking-circuit technique was adopted in [6] to design approximate multipliers with high computing accuracy, although at a high hardware cost. In [7], a new compressor was built from simple AND-OR gates, and the multiplier using it provided a good trade-off between error and electrical performance. The dual-quality 4-2 compressors introduced in [8] can be switched flexibly between precise and approximate operating modes, so multipliers using them can change accuracy dynamically at runtime.

To improve the trade-off between hardware cost and computing accuracy in approximate circuits, this paper proposes a set of approximate 8${\times}$8 Dadda multipliers. To that end, an imprecise 4-2 compressor using only OR and XNOR gates is designed by introducing symmetrical errors into the truth table of the exact compressor; in a multiplier, these errors can counteract each other. This approach reduces the area, power, and delay of the multipliers while still producing satisfactory results. The main contributions of this paper are summarized as follows.

1) An approximate 4-2 compressor is proposed to simplify the partial product reduction step in multipliers.

2) A set of approximate Dadda multipliers is built from the compressors to find a better structure with a lower hardware cost and higher computing accuracy.

3) The image multiplication operation is realized through these multipliers to evaluate computing accuracy in real applications.

4) The trade-off between hardware cost and accuracy in the multipliers is comprehensively analyzed through various evaluation criteria as an example in approximate computing.

This paper proceeds as follows. In Section 2, the previous approximate 4-2 compressors are reviewed. Section 3 presents the proposed approximate compressor and multipliers. The synthesis results and their application to image processing are presented in Section 4. Section 5 concludes this paper.

2. Related Work

In this paper, we look to 4-2 compressors to build 8${\times}$8 Dadda multipliers owing to their simplified structure and high efficiency in transistor-level implementations. In recent years, several methods have been proposed to design imprecise 4-2 compressors, and they were utilized to design approximate multipliers. Some previous approximate designs that ignored cin and cout are summarized and compared in this section.

In the approximate 4-2 compressor presented in [2], the critical-path delay is lower than in the previous design, and the number of gates is further reduced. Three approximate 4-2 compressors were proposed in [3]; they use Karnaugh maps to obtain simplified logical expressions that reduce errors while providing a significant performance improvement over previous 4-2 compressors. The first and second designs in [3] have only four gates, which greatly simplifies the structure, while the third design is the most accurate but has a more complex structure than the other designs. In [4], to simplify the circuit of the 4-2 compressor, an OR gate replaces an XOR gate to compute the sum, thus introducing additional errors. The ultra-efficient compressor proposed in [5] consists of one majority gate, which differs from conventional designs; since input x$_{2}$ is omitted and output sum is always 1, this approximate compressor has a simpler logic implementation. The compressors in [6] achieve high accuracy by using the stacking-circuit technique. A hardware-efficient approximate compressor proposed in [9] was obtained by modifying the truth table of the exact compressor and consists of only three NOR gates and one NAND gate. In [10], an ultra-compact 4-2 compressor was proposed based on simple AND-OR logic, which leads to a trade-off between hardware cost and precision. In [11], the proposed compressor was obtained by modifying an existing approximate compressor, and the performance of the resulting multiplier improved. Three approximate compressors were presented in [12]; all of them innovatively reduce the number of outputs to one, thus significantly reducing the hardware cost.

3. The Proposed Compressor and Multipliers

3.1 The Compressor

As shown in Fig. 1, an exact 4-2 compressor generally consists of two full adders with five inputs (x$_{1}$, x$_{2}$, x$_{3}$, x$_{4}$, and cin) and three outputs (sum, carry, and cout) [13]. The outputs encode the number of logic 1s among the five inputs according to (1), (2), and (3):

(1)
$ sum=x_{1}\oplus x_{2}\oplus x_{3}\oplus x_{4}\oplus c_{in} \\ $
(2)
$ cout=\left(x_{1}\oplus x_{2}\right)x_{3}+\overline{\left(x_{1}\oplus x_{2}\right)}x_{1} \\ $
(3)
$ carry=\left(x_{1}\oplus x_{2}\oplus x_{3}\oplus x_{4}\right)c_{in}+\overline{\left(x_{1}\oplus x_{2}\oplus x_{3}\oplus x_{4}\right)}x_{4} $

The four inputs, x$_{1}$, x$_{2}$, x$_{3}$, and x$_{4}$, and the output sum have the same weight, whereas the weights of cout and carry are one binary bit order higher [12,14]. Therefore, cout and carry are delivered to the next module of higher significance.
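The value-preserving property of (1)-(3) can be checked exhaustively; the following sketch (an assumed Python model written for illustration, not the authors' HDL) confirms that sum + 2(carry + cout) equals the number of 1s among the five inputs for all 32 cases.

    # Assumed Python model of the exact 4-2 compressor of Eqs. (1)-(3).
    from itertools import product

    def exact_compressor(x1, x2, x3, x4, cin):
        s = x1 ^ x2 ^ x3 ^ x4 ^ cin                                          # Eq. (1)
        cout = (x1 ^ x2) & x3 | (1 - (x1 ^ x2)) & x1                         # Eq. (2)
        carry = (x1 ^ x2 ^ x3 ^ x4) & cin | (1 - (x1 ^ x2 ^ x3 ^ x4)) & x4   # Eq. (3)
        return s, carry, cout

    for bits in product((0, 1), repeat=5):
        s, carry, cout = exact_compressor(*bits)
        # the weighted outputs reproduce the number of 1s among the inputs
        assert s + 2 * (carry + cout) == sum(bits)
    print("Eqs. (1)-(3) reproduce the input count for all 32 cases")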

In this work, the proposed 4-2 compressor (Fig. 2) is derived by modifying the truth table of the exact compressor to obtain the simpler logic expressions in (4) and (5), while ignoring signals cin and cout for design efficiency, as in previous work [2]. Inputs x$_{1}$ and x$_{2}$ are also omitted to simplify the compressor and further reduce the energy and critical-path delay, so the design uses only one OR gate and one XNOR gate. Although omitting x$_{1}$ and x$_{2}$ introduces some errors, the proposed compressors are only used for the approximate part of the multipliers, which has little impact on computing accuracy. Accordingly, attention is paid to the hardware/accuracy trade-off of the multipliers rather than to any single metric.

(4)
$ carry=x_{3}+x_{4} $
(5)
$ sum=x_{3}\odot x_{4} $

As seen in the truth table in Table 1, the proposed design produces erroneous outputs for eight of the 16 input combinations. Error is defined as the arithmetic distance between the exact and approximate values [15]. For example, when all four inputs are 1, the exact output is 4, while the proposed compressor produces 1 for both sum and carry, giving a decimal output of 3 and thus an error distance of 1. The maximum error magnitude generated by this design is 1 (errors of -1 and +1), which avoids unacceptable results when the compressor is applied to approximate multipliers. Moreover, within the structure of a multiplier, error distances with opposite signs (-1 and +1) tend to counteract each other [5].
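The error column of Table 1 follows directly from (4) and (5); a small enumeration sketch (assumed Python, for illustration only) reproduces the eight erroneous cases and the bounded error of ±1.

    # Assumed Python model: enumerate the 16 inputs of the proposed compressor
    # (Eqs. (4) and (5), cin/cout ignored, x1/x2 unused) and compute exact - approximate.
    from itertools import product

    errors = []
    for x4, x3, x2, x1 in product((0, 1), repeat=4):
        carry = x3 | x4              # Eq. (4)
        s = 1 - (x3 ^ x4)            # Eq. (5): XNOR
        exact = x1 + x2 + x3 + x4    # value a 4-2 compressor should represent (cin = 0)
        approx = 2 * carry + s
        errors.append(exact - approx)

    print(errors.count(0), "exact cases,", sum(1 for e in errors if e != 0), "erroneous cases")
    print("error values:", sorted(set(errors)))   # expected: [-1, 0, 1]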

Fig. 1. The conventional 4-2 compressor.
../../Resources/ieie/IEIESPC.2022.11.3.174/fig1.png
Fig. 2. The proposed 4-2 compressor.
../../Resources/ieie/IEIESPC.2022.11.3.174/fig2.png
Table 1. Truth table of the proposed 4-2 compressor.

x4 | x3 | x2 | x1 | exact | carry | sum | approximate | error
0  | 0  | 0  | 0  | 0     | 0     | 1   | 1           | -1
0  | 0  | 0  | 1  | 1     | 0     | 1   | 1           | 0
0  | 0  | 1  | 0  | 1     | 0     | 1   | 1           | 0
0  | 0  | 1  | 1  | 2     | 0     | 1   | 1           | 1
0  | 1  | 0  | 0  | 1     | 1     | 0   | 2           | -1
0  | 1  | 0  | 1  | 2     | 1     | 0   | 2           | 0
0  | 1  | 1  | 0  | 2     | 1     | 0   | 2           | 0
0  | 1  | 1  | 1  | 3     | 1     | 0   | 2           | 1
1  | 0  | 0  | 0  | 1     | 1     | 0   | 2           | -1
1  | 0  | 0  | 1  | 2     | 1     | 0   | 2           | 0
1  | 0  | 1  | 0  | 2     | 1     | 0   | 2           | 0
1  | 0  | 1  | 1  | 3     | 1     | 0   | 2           | 1
1  | 1  | 0  | 0  | 2     | 1     | 1   | 3           | -1
1  | 1  | 0  | 1  | 3     | 1     | 1   | 3           | 0
1  | 1  | 1  | 0  | 3     | 1     | 1   | 3           | 0
1  | 1  | 1  | 1  | 4     | 1     | 1   | 3           | 1

3.2 The Approximate Multipliers

To investigate the impact of the proposed compressor on multiplication, 8${\times}$8 Dadda multipliers with various levels of accuracy are designed. The basic structure of the approximate Dadda multiplier was described in [2]: the multiplier uses AND gates to generate all partial products in the first step, then uses approximate compressors to compress them into at most two rows, and in the last step an exact ripple-carry adder computes the result.
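For reference, the first step and the exactness of the overall flow can be modeled at the bit level; the sketch below (an assumed Python reference model, not the hardware) generates the partial products with ANDs and sums them column by column, which is what the compressor tree and final adder realize in hardware.

    # Assumed Python reference model: step 1 (AND partial products) plus a direct
    # column-weighted accumulation standing in for steps 2 and 3.
    def partial_products(a: int, b: int, n: int = 8):
        """pp[j][i] is the bit a_i AND b_j, carrying weight 2**(i + j)."""
        return [[((a >> i) & 1) & ((b >> j) & 1) for i in range(n)] for j in range(n)]

    def exact_product(a: int, b: int, n: int = 8) -> int:
        pp = partial_products(a, b, n)
        return sum(pp[j][i] << (i + j) for j in range(n) for i in range(n))

    # Exhaustive check for 8-bit operands: the flow reproduces the exact product.
    assert all(exact_product(a, b) == a * b for a in range(256) for b in range(256))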

In multiplier design, the second step plays a critical role in delay, power consumption, and area. The proposed multipliers are denoted M${\alpha}$${\beta}$${\gamma}$, where ${\alpha}$, ${\beta}$, and ${\gamma}$ respectively denote the number of partial-product columns reduced with exact compressors, with the proposed approximate compressors, and by truncation. To reduce the hardware cost effectively, the least significant columns of the partial products are truncated; in applications such as image processing, accuracy beyond a certain level brings no perceptible benefit, while the corresponding exact operations consume a relatively large amount of energy. Exact compressors are therefore used for the most significant columns to compensate for the loss of computing accuracy, while the proposed approximate compressors are applied to the middle columns to reduce the hardware cost. To investigate the trade-off between hardware cost and accuracy, a set of such multipliers was designed: M7${\beta}$${\gamma}$ and M6${\beta}$${\gamma}$ aim at higher computing accuracy, while M5${\beta}$${\gamma}$ targets lower hardware cost.
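The M${\alpha}$${\beta}$${\gamma}$ naming can be read as a partition of the 15 product columns of an 8${\times}$8 multiplier; the sketch below (assumed Python, with the column indexing chosen here purely for illustration) lists how each column is reduced.

    # Hedged sketch of the M{alpha}{beta}{gamma} naming: gamma least significant
    # columns are truncated, beta middle columns use the proposed approximate
    # compressor, and the alpha most significant columns use exact compressors.
    def column_plan(alpha: int, beta: int, gamma: int, n: int = 8):
        total = 2 * n - 1                      # 15 product columns for an 8x8 multiplier
        assert alpha + beta + gamma == total
        # columns listed from least to most significant
        return ["truncated"] * gamma + ["approximate"] * beta + ["exact"] * alpha

    print(column_plan(6, 5, 4))   # M654: 4 truncated, 5 approximate, 6 exact columns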

For example, the partial product reduction step of the proposed M654 is shown in Fig. 3, where each dot represents a partial product bit. In the first two stages, three half adders, three full adders, 10 of the proposed imprecise 4-2 compressors, and six exact 4-2 compressors are utilized. In the last stage, a half adder and nine full adders are applied to compute the results.

Fig. 3. Partial product reduction of the proposed M654.
../../Resources/ieie/IEIESPC.2022.11.3.174/fig3.png

4. Simulation Results and Application

In this section, all designs were described in Verilog HDL and synthesized with Synopsys Design Compiler NXT using a TSMC 65-nm standard-cell library at 100 MHz to evaluate performance. Since the standard CMOS cell library does not include a dedicated module for these designs, all circuits were synthesized with the compile_ultra command to provide a fair comparison, and the logic functions of the existing designs were optimized under the same conditions. The reported power data were obtained from the Synopsys PrimePower tool using the vector-free power analysis model. In addition, the error metrics and an image processing application of the multipliers were evaluated in Matlab.

4.1 The Approximate Compressor

A comparison of the proposed compressor with existing exact and approximate compressors in terms of area, power, and delay is shown in Table 2. For clarity, the three designs proposed in [3] are denoted [3]1, [3]2, and [3]3, and the three designs in [6] are denoted [6]1, [6]2, and [6]3. To comprehensively evaluate the efficiency of the proposed design, the power-delay product (PDP) and energy-delay product (EDP) are also listed [9,16].

As can be seen from Table 2, the proposed approximate compressor achieves a 74% reduction in area, a 27% reduction in delay, and a 91% reduction in PDP compared to the exact 4-2 compressor. It also has the lowest area and power among the state-of-the-art 4-2 compressors. Although its PDP is slightly higher than that of [5], the EDP of the two designs is equal. In summary, the proposed approximate 4-2 compressor has an advantage in hardware overhead owing to its optimized structure of only one OR gate and one XNOR gate. Although the compressor in [5] has better delay and power than the one proposed here, the approximate multiplier in [5] is inferior to the multipliers proposed here, as explained later.

Table 2. Hardware comparison of 4-2 compressors.

Design   | Area (${\mu}$m$^{2}$) | Power (mW)     | Delay (ns) | PDP (fJ) | EDP (fJ∙ns)
Proposed | 4.68                  | 4.93×10$^{-4}$ | 0.30       | 0.15     | 0.04
[2]      | 6.84                  | 1.26×10$^{-3}$ | 0.46       | 0.58     | 0.27
[3]1     | 14.04                 | 1.36×10$^{-3}$ | 0.35       | 0.48     | 0.17
[3]2     | 13.32                 | 1.66×10$^{-3}$ | 0.34       | 0.56     | 0.19
[3]3     | 14.40                 | 1.40×10$^{-3}$ | 0.32       | 0.45     | 0.14
[4]      | 11.52                 | 1.27×10$^{-3}$ | 0.36       | 0.46     | 0.17
[5]      | 5.04                  | 5.46×10$^{-4}$ | 0.25       | 0.14     | 0.04
[6]1     | 11.16                 | 2.00×10$^{-3}$ | 0.33       | 0.66     | 0.22
[6]2     | 15.84                 | 2.29×10$^{-3}$ | 0.43       | 0.98     | 0.42
[6]3     | 17.28                 | 2.42×10$^{-3}$ | 0.45       | 1.09     | 0.49
Exact    | 18.00                 | 3.95×10$^{-3}$ | 0.41       | 1.62     | 0.66

4.2 The Approximate Multipliers

4.2.1 Hardware Cost

The area, power, delay, PDP, and EDP of the approximate and exact multipliers are listed in Table 3. The proposed multipliers are divided into three types (M7${\beta}$${\gamma}$, M6${\beta}$${\gamma}$, and M5${\beta}$${\gamma}$) to explore the trade-off between hardware cost and computing accuracy.

Table 3. Hardware comparison of 8${\times}$8 multipliers.

Design | Area (${\mu}$m$^{2}$) | Power (mW)     | Delay (ns) | PDP (fJ) | EDP (fJ∙ns)
M753   | 360.00                | 4.76×10$^{-2}$ | 1.55       | 73.78    | 114.36
M744   | 342.36                | 4.51×10$^{-2}$ | 1.56       | 70.36    | 109.76
M735   | 331.92                | 4.36×10$^{-2}$ | 1.54       | 67.14    | 103.40
M726   | 329.76                | 4.13×10$^{-2}$ | 1.56       | 64.43    | 100.51
M717   | 292.68                | 3.69×10$^{-2}$ | 1.63       | 60.15    | 98.04
M663   | 314.64                | 4.03×10$^{-2}$ | 1.46       | 58.84    | 85.90
M654   | 298.80                | 3.83×10$^{-2}$ | 1.44       | 55.15    | 79.42
M645   | 285.84                | 3.61×10$^{-2}$ | 1.42       | 51.26    | 72.79
M636   | 267.84                | 3.42×10$^{-2}$ | 1.42       | 48.56    | 68.96
M627   | 246.24                | 3.04×10$^{-2}$ | 1.32       | 40.13    | 52.97
M618   | 227.16                | 2.71×10$^{-2}$ | 1.35       | 36.59    | 49.39
M573   | 275.40                | 3.38×10$^{-2}$ | 1.38       | 46.64    | 64.37
M564   | 258.84                | 3.16×10$^{-2}$ | 1.26       | 39.82    | 50.17
M555   | 245.88                | 2.99×10$^{-2}$ | 1.29       | 38.57    | 49.76
M546   | 226.08                | 2.78×10$^{-2}$ | 1.27       | 35.31    | 44.84
M537   | 207.36                | 2.47×10$^{-2}$ | 1.27       | 31.37    | 39.84
M528   | 185.40                | 2.18×10$^{-2}$ | 1.21       | 26.38    | 31.92
M519   | 160.56                | 1.83×10$^{-2}$ | 1.26       | 23.06    | 29.05
[2]    | 389.52                | 3.73×10$^{-2}$ | 1.71       | 63.78    | 109.07
[3]1   | 398.52                | 3.52×10$^{-2}$ | 1.58       | 55.62    | 87.87
[3]2   | 423.36                | 3.72×10$^{-2}$ | 1.85       | 68.82    | 127.32
[3]3   | 420.12                | 3.36×10$^{-2}$ | 1.89       | 63.50    | 120.02
[4]    | 325.44                | 3.13×10$^{-2}$ | 1.52       | 47.58    | 72.32
[5]    | 264.24                | 2.76×10$^{-2}$ | 1.35       | 37.26    | 50.30
[6]1   | 498.96                | 6.4×10$^{-2}$  | 1.66       | 106.24   | 176.36
[6]2   | 510.84                | 6.9×10$^{-2}$  | 1.73       | 119.37   | 206.51
[6]3   | 567.72                | 7.35×10$^{-2}$ | 1.77       | 130.10   | 230.27
Exact  | 577.80                | 7.81×10$^{-2}$ | 1.81       | 141.36   | 255.86

As seen from the results in Table 3, M5${\beta}$${\gamma}$ has the smallest area, power, and delay of the three types of multipliers, M7${\beta}$${\gamma}$ has the largest, and M6${\beta}$${\gamma}$ lies in between, reflecting the influence of ${\alpha}$. For each type (e.g., M7${\beta}$${\gamma}$), as ${\gamma}$ increases, ${\beta}$ decreases and the hardware cost is further reduced by the additional truncation. PDP and EDP are reported to further assess the performance of these multipliers, and they follow the same trend.

The data show that the proposed multipliers greatly outperform the exact design, reducing area, delay, and power by 38%-72%, 14%-33%, and 39%-77%, respectively. Moreover, most of the M5${\beta}$${\gamma}$ multipliers achieve significant hardware improvements over previous designs, especially M519, which has the best hardware performance of all the designs, reducing PDP and EDP on average by 67% and 75%, respectively.

4.2.2 Computing Accuracy

To evaluate the output quality of the approximate multipliers, the error rate (ER), mean error distance (MED), and normalized mean error distance (NMED) were computed over all 65,536 possible input combinations [16]. ER is the probability of producing an erroneous result, and MED is calculated with (6):

(6)
$ MED=\frac{1}{2^{2N}}\sum _{i=1}^{2^{2N}}\left| ED_{i}\right| $

where N is the bit width of the multiplier, and ED$_{i}$ is the arithmetic difference between the approximate and exact results. NMED, which normalizes MED by the maximum output of the exact multiplier, is expressed in (7):

(7)
$ NMED=\frac{1}{\left(2^{N}-1\right)^{2}}\sum _{i=1}^{2^{2N}}\frac{\left| ED_{i}\right| }{2^{2N}} $
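Both metrics, together with ER, can be reproduced by exhaustive simulation; the sketch below (assumed Python, with approx_mul as a placeholder for any behavioral model of an approximate 8${\times}$8 multiplier) applies (6) and (7) over all 65,536 input pairs.

    # Assumed Python sketch: exhaustive ER / MED / NMED following Eqs. (6) and (7).
    def error_metrics(approx_mul, n_bits: int = 8):
        max_out = (2 ** n_bits - 1) ** 2          # maximum exact product, used by NMED
        total = err_cnt = abs_ed_sum = 0
        for a in range(2 ** n_bits):
            for b in range(2 ** n_bits):
                ed = approx_mul(a, b) - a * b     # error distance ED_i
                total += 1
                err_cnt += ed != 0
                abs_ed_sum += abs(ed)
        er = err_cnt / total
        med = abs_ed_sum / total                  # Eq. (6)
        nmed = med / max_out                      # Eq. (7)
        return er, med, nmed

    # Sanity check with a toy multiplier that drops the two least significant bits
    # (illustrative placeholder only, not one of the proposed designs):
    print(error_metrics(lambda a, b: (a * b) & ~0x3))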

The accuracy metrics of the proposed multipliers are listed in Table 4. Among the three types of multipliers, M7${\beta}$${\gamma}$ has relatively small ER, MED, and NMED. All the multipliers have a high ER, mainly due to the truncated structure, and ER decreases as the number of truncated columns increases. MED and NMED decrease as ${\gamma}$ increases, drop to a minimum when ${\beta}$ is 2, and then increase again. When the number of truncated columns reaches its highest level, the multipliers have the worst computing accuracy; however, M717 is more accurate than M663, and M618 is more accurate than M573, thanks to the exact part in the most significant bits.

Table 4. ER, MED, and NMED of approximate 8${\times}$8 multipliers.

Design | ER (%) | MED           | NMED
M753   | 99.77  | 1.96×10$^{2}$ | 3.01×10$^{-3}$
M744   | 99.83  | 1.88×10$^{2}$ | 2.89×10$^{-3}$
M735   | 99.80  | 1.68×10$^{2}$ | 2.58×10$^{-3}$
M726   | 99.51  | 1.31×10$^{2}$ | 2.01×10$^{-3}$
M717   | 99.22  | 1.72×10$^{2}$ | 2.65×10$^{-3}$
M663   | 99.89  | 3.49×10$^{2}$ | 5.36×10$^{-3}$
M654   | 99.91  | 3.41×10$^{2}$ | 5.25×10$^{-3}$
M645   | 99.91  | 3.22×10$^{2}$ | 4.95×10$^{-3}$
M636   | 99.83  | 2.81×10$^{2}$ | 4.33×10$^{-3}$
M627   | 99.66  | 2.63×10$^{2}$ | 4.04×10$^{-3}$
M618   | 99.51  | 4.29×10$^{2}$ | 6.60×10$^{-3}$
M573   | 99.95  | 6.78×10$^{2}$ | 10.42×10$^{-3}$
M564   | 99.95  | 6.71×10$^{2}$ | 10.32×10$^{-3}$
M555   | 99.95  | 6.55×10$^{2}$ | 10.08×10$^{-3}$
M546   | 99.92  | 6.11×10$^{2}$ | 9.40×10$^{-3}$
M537   | 99.85  | 5.64×10$^{2}$ | 8.67×10$^{-3}$
M528   | 99.83  | 4.79×10$^{2}$ | 7.36×10$^{-3}$
M519   | 99.80  | 8.01×10$^{2}$ | 12.33×10$^{-3}$
[2]    | 99.10  | 3.15×10$^{3}$ | 48.46×10$^{-3}$
[3]1   | 87.19  | 3.62×10$^{3}$ | 55.73×10$^{-3}$
[3]2   | 87.19  | 4.17×10$^{3}$ | 64.2×10$^{-3}$
[3]3   | 97.26  | 5.91×10$^{3}$ | 90.92×10$^{-3}$
[4]    | 85.73  | 2.24×10$^{3}$ | 34.41×10$^{-3}$
[5]    | 99.82  | 4.94×10$^{2}$ | 7.60×10$^{-3}$
[6]1   | 55.34  | 0.70×10$^{2}$ | 1.07×10$^{-3}$
[6]2   | 17.96  | 0.17×10$^{2}$ | 0.26×10$^{-3}$
[6]3   | 3.59   | 0.03×10$^{2}$ | 0.04×10$^{-3}$

Compared to previous work, the NMED of the proposed multipliers is not the lowest; however, it is acceptable for most image processing applications [17]. M528 has the best accuracy among all the designs except those in [6]. Although the multipliers in [6] have an advantage in the accuracy metrics, they carry the highest hardware cost, as shown in Table 3. Therefore, all performance evaluation metrics should be taken into account.

The error distributions of the proposed multipliers M7${\beta}$${\gamma}$, M6${\beta}$${\gamma}$, and M5${\beta}$${\gamma}$ are shown in Fig. 4, where the errors lie mainly in the ranges [-600, 600], [-1000, 1000], and [-2000, 1000], respectively, accounting on average for about 83%, 84%, and 84% of the errors. Thus, reserving an appropriate number of the most significant bits preserves the accuracy of a multiplier.

Fig. 4. Error distance from the multipliers: (a) M5${\beta}$${\gamma}$; (b) M6${\beta}$${\gamma}$; (c) M7${\beta}$${\gamma}$.
../../Resources/ieie/IEIESPC.2022.11.3.174/fig4.png

As seen from the results above, M5${\beta}$${\gamma}$ has better hardware metrics but a worse NMED, while M7${\beta}$${\gamma}$ has a better NMED but a higher hardware cost. To reconcile this trade-off between accuracy and hardware cost, a figure of merit (FOM) was suggested in [8]. Because the proposed multipliers have relatively small delays, delay is removed from the FOM for a fair comparison, and the metric is modified as in (8) [5]:

(8)
$ FOM1=PDP\times Area/\left(1-NMED\right) $
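As a worked example (assumed Python, using the M654 entries of Tables 3 and 4 and the exact design of Table 3), FOM1 can be evaluated directly from the reported PDP, area, and NMED values.

    # Eq. (8) applied to the reported data (PDP in fJ, area in um^2; NMED of the
    # exact design is 0 by definition).
    def fom1(pdp, area, nmed):
        return pdp * area / (1 - nmed)

    print(fom1(55.15, 298.80, 5.25e-3))   # M654  -> about 1.66e4
    print(fom1(141.36, 577.80, 0.0))      # exact -> about 8.17e4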

Fig. 5 shows FOM1 for the proposed and existing approximate 8${\times}$8 multipliers. The smaller the value of FOM1, the better the trade-off between accuracy and hardware. Thus, M627, M618, M564, M555, M546, M537, M528, and M519 have a lower FOM1 compared with other designs, indicating that most of the proposed multipliers offer a better trade-off than previous designs.

Fig. 5. FOM of approximate 8${\times}$8 multipliers.
../../Resources/ieie/IEIESPC.2022.11.3.174/fig5.png

4.3 Image Multiplication

To assess the practicality of the approximate multipliers in real applications, they were applied to image multiplication, a widely used operation in image processing. The discussed multipliers multiply two images pixel by pixel, thereby combining them into a single image [18-21].
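A behavioral sketch of this operation is shown below (assumed Python/NumPy; rescaling the 16-bit product back to 8 bits by dividing by 256 is our assumption of the usual blending convention, and mul is a placeholder that can be swapped for an approximate multiplier model).

    # Assumed Python/NumPy sketch of pixel-wise image multiplication.
    import numpy as np

    def image_multiply(img_a, img_b, mul=lambda a, b: a * b):
        a = np.asarray(img_a, dtype=np.uint16)
        b = np.asarray(img_b, dtype=np.uint16)
        prod = np.vectorize(mul)(a, b)              # replace mul with an approximate model
        return np.clip(prod // 256, 0, 255).astype(np.uint8)   # assumed 8-bit rescaling

    # Example with two random 8-bit grayscale "images":
    rng = np.random.default_rng(0)
    out = image_multiply(rng.integers(0, 256, (64, 64)), rng.integers(0, 256, (64, 64)))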

The peak signal-to-noise ratio (PSNR) and the mean structural similarity index metric (MSSIM) [22] were computed to evaluate the quality of the processed images. PSNR is expressed in (9):

(9)
$ PSNR=10\log _{10}\left(\frac{w\times r\times MAX^{2}}{\sum _{i=0}^{w-1}\sum _{j=0}^{r-1}\left[S'\left(i,j\right)-S\left(i,j\right)\right]^{2}}\right) $

where w and r are the width and height of the image, S'(i, j) and S(i, j) represent the exact and approximate values of each pixel, respectively, and MAX is the maximum pixel value. The larger the PSNR, the better the image quality. MSSIM is expressed in (10):

(10)
$ MSSIM\left(X,Y\right)=\frac{1}{k}\sum _{i=1}^{k}\frac{\left(2\mu _{x}\mu _{y}+C_{1}\right)\left(2\sigma _{xy}+C_{2}\right)}{\left(\mu _{x}^{2}+\mu _{y}^{2}+C_{1}\right)\left(\sigma _{x}^{2}+\sigma _{y}^{2}+C_{2}\right)} $

where X and Y represent two images. Other parameters can be found in detail in [22]. MSSIM reaches 1 when the two processed images are the same.
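Both quality metrics can be computed directly from the image arrays; the sketch below (assumed Python/NumPy, grayscale images assumed) implements PSNR per (9) and uses scikit-image's structural_similarity as a convenient stand-in for the MSSIM of [22].

    # Assumed Python/NumPy sketch of the two image-quality metrics.
    import numpy as np
    from skimage.metrics import structural_similarity

    def psnr(exact_img, approx_img, max_val=255):
        exact = np.asarray(exact_img, dtype=np.float64)
        approx = np.asarray(approx_img, dtype=np.float64)
        mse = np.mean((exact - approx) ** 2)          # Eq. (9) rewritten with the MSE
        return 10 * np.log10(max_val ** 2 / mse)      # assumes the two images differ

    def mssim(exact_img, approx_img):
        # stand-in for the metric of [22]
        return structural_similarity(np.asarray(exact_img), np.asarray(approx_img),
                                     data_range=255)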

Table 5 shows the PSNR and MSSIM values for five image multiplication examples. All the proposed multipliers achieve PSNR values higher than 30 dB for the various images, and a PSNR above 30 dB is generally regarded as good enough [23]. In addition, the MSSIM results for all the approximate multipliers are very close to that of the exact design (MSSIM = 1). Moreover, both PSNR and MSSIM increase as the number of exact columns increases.

Table 5. PSNR and MSSIM of multiplied images using the 8${\times}$8 multipliers.

PSNR (dB)
Design | Lena×LenaRGB | Baboon×BaboonRGB | Goldhill×Goldhill | Goldhill×LenaRGB | Goldhill×BaboonRGB
M753   | 46.03        | 45.13            | 46.20             | 45.97            | 45.72
M744   | 46.33        | 45.43            | 46.50             | 46.25            | 46.02
M735   | 47.15        | 46.17            | 47.26             | 46.97            | 46.72
M726   | 48.56        | 48.24            | 48.46             | 48.89            | 48.80
M717   | 46.66        | 47.30            | 45.68             | 46.60            | 46.73
M663   | 41.55        | 40.19            | 38.99             | 41.55            | 41.25
M654   | 41.70        | 40.32            | 39.08             | 41.70            | 41.41
M645   | 42.12        | 40.74            | 39.44             | 42.11            | 41.82
M636   | 43.02        | 41.79            | 40.33             | 43.25            | 43.12
M627   | 43.64        | 42.99            | 41.64             | 43.60            | 43.65
M618   | 39.72        | 39.55            | 36.71             | 39.51            | 39.42
M573   | 34.54        | 34.98            | 34.36             | 36.07            | 35.65
M564   | 34.61        | 35.05            | 34.39             | 36.15            | 35.73
M555   | 34.79        | 35.22            | 34.43             | 36.29            | 35.90
M546   | 35.13        | 35.73            | 34.91             | 36.83            | 36.52
M537   | 35.87        | 36.45            | 35.50             | 37.52            | 37.32
M528   | 38.76        | 37.94            | 35.27             | 38.48            | 38.48
M519   | 33.77        | 33.86            | 31.04             | 33.98            | 34.07
[2]    | 22.77        | 23.44            | 21.61             | 24.03            | 23.68
[3]1   | 13.72        | 13.85            | 12.48             | 13.84            | 13.67
[3]2   | 13.71        | 13.85            | 12.48             | 13.86            | 13.68
[3]3   | 14.09        | 14.19            | 12.72             | 14.35            | 14.16
[4]    | 28.17        | 27.83            | 25.35             | 28.59            | 28.94
[5]    | 38.73        | 39.09            | 36.70             | 38.73            | 38.61
[6]1   | 51.35        | 52.64            | 49.11             | 51.78            | 51.99
[6]2   | 59.41        | 59.47            | 54.20             | 58.56            | 58.80
[6]3   | 68.77        | 68.78            | 62.52             | 67.65            | 67.70

MSSIM
Design | Lena×LenaRGB | Baboon×BaboonRGB | Goldhill×Goldhill | Goldhill×LenaRGB | Goldhill×BaboonRGB
M753   | 0.9985       | 0.9989           | 0.9966            | 0.9984           | 0.9980
M744   | 0.9985       | 0.9990           | 0.9965            | 0.9984           | 0.9980
M735   | 0.9986       | 0.9990           | 0.9966            | 0.9984           | 0.9980
M726   | 0.9988       | 0.9992           | 0.9960            | 0.9987           | 0.9983
M717   | 0.9987       | 0.9990           | 0.9943            | 0.9984           | 0.9980
M663   | 0.9957       | 0.9967           | 0.9855            | 0.9953           | 0.9943
M654   | 0.9957       | 0.9968           | 0.9851            | 0.9953           | 0.9943
M645   | 0.9958       | 0.9968           | 0.9855            | 0.9953           | 0.9943
M636   | 0.9960       | 0.9971           | 0.9858            | 0.9957           | 0.9947
M627   | 0.9962       | 0.9972           | 0.9846            | 0.9956           | 0.9944
M618   | 0.9955       | 0.9964           | 0.9742            | 0.9950           | 0.9929
M573   | 0.9813       | 0.9896           | 0.9631            | 0.9847           | 0.9827
M564   | 0.9814       | 0.9896           | 0.9629            | 0.9848           | 0.9827
M555   | 0.9814       | 0.9896           | 0.9614            | 0.9844           | 0.9822
M546   | 0.9815       | 0.9897           | 0.9616            | 0.9848           | 0.9826
M537   | 0.9825       | 0.9900           | 0.9577            | 0.9849           | 0.9823
M528   | 0.9902       | 0.9913           | 0.9444            | 0.9884           | 0.9848
M519   | 0.9846       | 0.9872           | 0.9226            | 0.9827           | 0.9778
[2]    | 0.8630       | 0.8600           | 0.7214            | 0.7864           | 0.7994
[3]1   | 0.6534       | 0.7018           | 0.5411            | 0.6542           | 0.6626
[3]2   | 0.6550       | 0.7015           | 0.5416            | 0.6342           | 0.6507
[3]3   | 0.6239       | 0.6753           | 0.4938            | 0.6049           | 0.6035
[4]    | 0.9367       | 0.9534           | 0.9464            | 0.9533           | 0.9478
[5]    | 0.9897       | 0.9916           | 0.9645            | 0.9873           | 0.9827
[6]1   | 0.9995       | 0.9997           | 0.9982            | 0.9995           | 0.9994
[6]2   | 0.9999       | 0.9999           | 0.9990            | 0.9999           | 0.9998
[6]3   | 1.0000       | 1.0000           | 0.9998            | 1.0000           | 1.0000

To visualize the effect of approximate multiplication on image quality, multiplied images LenaRGB and Lena (using the considered multipliers) are shown in Fig. 6. The results indicate no obvious differences between the proposed designs and the exact design.

To comprehensively evaluate the efficiency of the discussed approximate designs in image processing, hardware cost and image quality should be considered simultaneously rather than separately. To capture the practicality of the approximate multipliers, FOM2 is defined in (11) [24]:

(11)
$ FOM2=PDP/\left(MSSIM\times PSNR\right) $
Fig. 6. The multiplied images for LenaRGB and Lena using 8${\times}$8 multipliers.
../../Resources/ieie/IEIESPC.2022.11.3.174/fig6.png

A smaller FOM2 value indicates a better compromise between hardware efficiency and accuracy. To save space, FOM2 for the discussed multipliers is also shown in Fig. 5, and the results follow a decreasing trend. Among them, M627, M618, M537, M528, and M519 provide a better FOM2 than the other designs. Specifically, M528 takes first place in this regard, with a 63% reduction on average compared to the existing designs, followed by M519 and M537.

5. Conclusion

In this work, an ultra-efficient approximate 4-2 compressor was proposed by introducing symmetrical errors into the truth table of the exact compressor. A set of Dadda multipliers, denoted M${\alpha}$${\beta}$${\gamma}$, was designed to investigate the hardware/accuracy trade-off, and image multiplication was used as an example to evaluate computing accuracy in a real application. Experimental results show that the accuracy of a multiplier is mainly determined by its exact part, while the hardware cost is driven by the approximate and truncated parts. Furthermore, the two figures of merit show that a comprehensive indicator should be used to reach a compromise between hardware and accuracy, because a highly accurate multiplier consumes a large amount of energy. Several of the proposed multipliers surpass their counterparts under the considered criteria.

ACKNOWLEDGMENTS

This work was supported by the Fundamental Research Funds for the Central Universities of China (Grant No. JZ2020HGQA0162, Grant No. JZ2020HGTA0085).

REFERENCES

[1] Angizi S., Jiang H., DeMara R. F., Han J., Fan D., 2018, Majority-Based Spin-CMOS Primitives for Approximate Computing, IEEE Transactions on Nanotechnology, Vol. 17, No. 4, pp. 795-806.
[2] Momeni A., Han J., Montuschi P., Lombardi F., 2015, Design and Analysis of Approximate Compressors for Multiplication, IEEE Transactions on Computers, Vol. 64, No. 4, pp. 984-994.
[3] Gorantla A., P D., 2017, Design of Approximate Compressors for Multiplication, ACM J. Emerg. Technol. Comput. Syst., Vol. 13, No. 3, Article 44.
[4] Venkatachalam S., Ko S., 2017, Design of Power and Area Efficient Approximate Multipliers, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 25, No. 5, pp. 1782-1786.
[5] Sabetzadeh F., Moaiyeri M., Ahmadinejad M., 2019, A Majority-Based Imprecise Multiplier for Ultra-Efficient Approximate Image Multiplication, IEEE Transactions on Circuits and Systems I: Regular Papers, Vol. 66, No. 11, pp. 4200-4208.
[6] Strollo A., Napoli E., Caro D., Petra N., Meo G., 2020, Comparison and Extension of Approximate 4-2 Compressors for Low-Power Approximate Multipliers, IEEE Transactions on Circuits and Systems I: Regular Papers, Vol. 67, No. 9, pp. 3021-3034.
[7] Esposito D., Strollo A. G. M., Napoli E., Caro D. D., Petra N., 2018, Approximate Multipliers Based on New Approximate Compressors, IEEE Transactions on Circuits and Systems I: Regular Papers, Vol. 65, No. 12, pp. 4169-4182.
[8] Akbari O., Kamal M., Afzali-Kusha A., Pedram M., 2017, Dual-Quality 4:2 Compressors for Utilizing in Dynamic Accuracy Configurable Multipliers, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 25, No. 4, pp. 1352-1361.
[9] Ahmadinejad M., Moaiyeri M. H., Sabetzadeh F., 2019, Energy and area efficient imprecise compressors for approximate multiplication at nanoscale, AEU - International Journal of Electronics and Communications, Vol. 110.
[10] Salmanpour F., Moaiyeri M. H., Sabetzadeh F., 2021, Ultra-Compact Imprecise 4:2 Compressor and Multiplier Circuits for Approximate Computing in Deep Nanoscale, Circuits, Systems, and Signal Processing.
[11] Ha M., Lee S., 2018, Multipliers With Approximate 4-2 Compressors and Error Recovery Modules, IEEE Embedded Systems Letters, Vol. 10, No. 1, pp. 6-9.
[12] Pei H., Yi X., Zhou H., He Y., 2021, Design of Ultra-Low Power Consumption Approximate 4-2 Compressors Based on the Compensation Characteristic, IEEE Transactions on Circuits and Systems II: Express Briefs, Vol. 68, No. 1, pp. 461-465.
[13] Chiphong C., Jiangmin G., Mingyan Z., 2004, Ultra low-voltage low-power CMOS 4-2 and 5-2 compressors for fast arithmetic circuits, IEEE Transactions on Circuits and Systems I: Regular Papers, Vol. 51, No. 10, pp. 1985-1997.
[14] Yi X., Pei H., Zhang Z., Zhou H., He Y., 2019, Design of an Energy-Efficient Approximate Compressor for Error-Resilient Multiplications, in 2019 IEEE International Symposium on Circuits and Systems (ISCAS), pp. 1-5.
[15] Liang J., Han J., Lombardi F., 2013, New Metrics for the Reliability of Approximate and Probabilistic Adders, IEEE Transactions on Computers, Vol. 62, No. 9, pp. 1760-1771.
[16] Guo W., Li S., 2021, Fast Binary Counters and Compressors Generated by Sorting Network, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 29, No. 6, pp. 1220-1230.
[17] Jiang H., Santiago F. J. H., Mo H., Liu L., Han J., 2020, Approximate Arithmetic Circuits: A Survey, Characterization, and Recent Applications, Proceedings of the IEEE, Vol. 108, No. 12, pp. 2108-2135.
[18] Strollo A. G. M., Caro D. D., Napoli E., Petra N., Meo G. D., 2020, Low-Power Approximate Multiplier with Error Recovery using a New Approximate 4-2 Compressor, in 2020 IEEE International Symposium on Circuits and Systems (ISCAS), pp. 1-4.
[19] Toan N. V., Lee J., 2019, Energy-Area-Efficient Approximate Multipliers for Error-Tolerant Applications on FPGAs, in 2019 32nd IEEE International System-on-Chip Conference (SOCC), pp. 336-341.
[20] Savithaa N., Poornima A., 2019, A High Speed Area Efficient Compression Technique of Dadda Multiplier for Image Blending Application, in 2019 Third International Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud) (I-SMAC), pp. 426-430.
[21] Savio M. M. D., Deepa T., 2020, Design of Higher Order Multiplier with Approximate Compressor, in 2020 IEEE International Conference on Electronics, Computing and Communication Technologies (CONECCT), pp. 1-6.
[22] Zhou W., Bovik A. C., Sheikh H. R., Simoncelli E. P., 2004, Image quality assessment: from error visibility to structural similarity, IEEE Transactions on Image Processing, Vol. 13, No. 4, pp. 600-612.
[23] Ansari M. S., Jiang H., Cockburn B. F., Han J., 2018, Low-Power Approximate Multipliers Using Encoded Partial Products and Approximate Compressors, IEEE Journal on Emerging and Selected Topics in Circuits and Systems, Vol. 8, No. 3, pp. 404-416.
[24] Ahmadinejad M., Moaiyeri M. H., 2021, Energy- and Quality-Efficient Approximate Multipliers for Neural Network and Image Processing Applications, IEEE Transactions on Emerging Topics in Computing, pp. 1-1.

Author

Yongqiang Zhang
../../Resources/ieie/IEIESPC.2022.11.3.174/au1.png

Yongqiang Zhang received the B.S. degree in electronic science and technology from Anhui Jianzhu University, Hefei, China, in 2013, and the Ph.D. degree in integrated circuits and systems from the Hefei University of Technology, Hefei, in 2018. He was a Visiting Student with the Department of Electrical and Computer Engineering, University of Alberta, for one year. He is currently with the School of Microelectronics, Hefei University of Technology. His research interests include approximate computing, stochastic computing, VLSI design, and nanoelectronics circuits and systems.

Cong He
../../Resources/ieie/IEIESPC.2022.11.3.174/au2.png

Cong He received her B.S. degree in Electronic Information and Engineering from Anhui Jianzhu University, Hefei, China, in 2019. She is currently pursuing an M.S. degree in Microelectronics at the Hefei University of Technology. Her research interests include approximate computing and emerging technologies in computing systems.

Xiaoyue Chen
../../Resources/ieie/IEIESPC.2022.11.3.174/au3.png

Xiaoyue Chen received her B.S. degree in Electronic and Information Engineering from the Liaoning University of Engineering and Technology, Huludao, China, in 2021. She is currently pursuing the M.S. degree in Microelectronics with the Hefei University of Technology. Her research interests include approximate computing and stochastic computing.

Guangjun Xie
../../Resources/ieie/IEIESPC.2022.11.3.174/au4.png

Guangjun Xie received the B.S. degree and M.S. degrees in microelectronics from the Hefei University of Technology, Hefei, China, in 1992 and 1995, respectively, and the Ph.D. degree in signal and information processing from the University of Science and Technology of China, Hefei, in 2002. He worked as a Post-Doctoral Researcher in optics with the University of Science and Technology of China from 2003 to 2005. He was a Senior Visitor with IMEC in 2007 and ASIC in 2011. He is currently a Professor with the School of Microelectronics, Hefei University of Technology. His research interests include integrated circuit design and nanoelectronics. Dr. Xie is a Senior Member of the Chinese Institute of Electronics.