
  1. (School of Computer Science and Engineering, Kyungpook National University, Daegu, Korea )



Approximate computing, Approximate multiplier, Approximate compressor, 4-2 compressor, Design optimization, Low-cost

1. Introduction

As battery-powered devices have become more common in recent years, the volume of data they need to handle has also grown quickly. This has increased the energy consumption of computing devices, while battery capacity remains limited. Hence, energy efficiency has become a primary design consideration. Approximate computing can offer energy efficiency by sacrificing computation accuracy [1]. Its goal is to reduce hardware costs, such as area, latency, and power, while maintaining adequate accuracy. Since this technique degrades accuracy and processing quality, it is applicable to error-tolerant applications. Many of these applications involve human or application resilience [2]. For example, even if there is some noise or pixel loss in an image, humans may not detect the errors because the brain reconstructs the image as closely as possible to the original. Taking advantage of this fact, we can build resource-saving hardware designs by sacrificing accuracy [3].

Approximate computing techniques are applicable to basic arithmetic circuits such as adders and multipliers. One of the most efficient ways to design an approximate arithmetic circuit is to split it into an accurate part and an inaccurate part. The higher-order bits are calculated in the accurate part because errors in those bits have a more significant impact on the result. In an approximate adder, for example, the accurate part is implemented with a conventional adder, such as a ripple carry adder (RCA) or a carry lookahead adder (CLA), and many approximation techniques for the lower-order part have been presented in the literature [16-23].
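The accurate/inaccurate split can be illustrated with a minimal Python sketch. It adds the upper bits exactly and approximates the lower `k` bits with a bitwise OR. The OR-based lower part and the names `approx_add`, `width`, and `k` are assumptions chosen for illustration; they are not the specific techniques of [16-23], and no carry is passed from the lower part to the upper part.

```python
def approx_add(a: int, b: int, width: int = 8, k: int = 4) -> int:
    """Approximate adder sketch: exact upper part, OR-based lower part."""
    mask_all = (1 << width) - 1
    mask_low = (1 << k) - 1
    a &= mask_all
    b &= mask_all
    # Inaccurate part: a bitwise OR replaces addition on the k low-order bits.
    low = (a | b) & mask_low
    # Accurate part: ordinary addition on the remaining high-order bits
    # (no carry-in from the lower part in this simplified sketch).
    high = (a >> k) + (b >> k)
    return (high << k) | low


if __name__ == "__main__":
    a, b = 0b00110011, 0b00010001
    print("approximate:", approx_add(a, b), "exact:", a + b)
```

In this sketch the error equals the bitwise AND of the two lower parts, so it is always smaller than 2^k and never touches the higher-order result bits directly.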

Multipliers can be implemented using a parallel configuration of compressors [4]. Compressor-based multiplication is composed of three steps: 1) generating partial products, 2) reducing partial products, and 3) adding the final partial products. The second step is the computation-intensive one, where compressors are used to compress the partial products into the final two terms. The hardware cost of a multiplier is therefore largely determined by the complexity and compression rate of its compressors. For example, an accurate multiplier requires significant resources because it consists of a large number of full adders (3-2 compressors) and half adders (2-2 compressors). In contrast, an approximate multiplier is composed of approximate compressors, which enables a more efficient design. Among compressors of different sizes, 4-2 compressors are widely used to build approximate multipliers.
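The three steps can be sketched in Python for an exact unsigned multiplier. This is a simplified illustration rather than the reduction scheme used in this paper: it uses only full adders (3-2 compressors) in the reduction step, and the names `full_adder` and `multiply` are hypothetical.

```python
from collections import defaultdict


def full_adder(a, b, c):
    """3-2 compressor: returns (sum, carry) with a + b + c == sum + 2 * carry."""
    return a ^ b ^ c, (a & b) | (b & c) | (a & c)


def multiply(x: int, y: int, n: int = 8) -> int:
    # Step 1: generate the n*n partial-product bits, grouped by column weight.
    cols = defaultdict(list)
    for i in range(n):
        for j in range(n):
            cols[i + j].append(((x >> i) & 1) & ((y >> j) & 1))

    # Step 2: reduce until every column holds at most two bits.
    while any(len(bits) > 2 for bits in cols.values()):
        nxt = defaultdict(list)
        for w, bits in cols.items():
            while len(bits) >= 3:
                s, c = full_adder(bits.pop(), bits.pop(), bits.pop())
                nxt[w].append(s)       # the sum bit stays in the same column
                nxt[w + 1].append(c)   # the carry bit moves one column up
            nxt[w].extend(bits)        # leftover bits pass through unchanged
        cols = nxt

    # Step 3: add the final two rows with an ordinary carry-propagate addition.
    row0 = sum(bits[0] << w for w, bits in cols.items() if len(bits) >= 1)
    row1 = sum(bits[1] << w for w, bits in cols.items() if len(bits) == 2)
    return row0 + row1


if __name__ == "__main__":
    print(multiply(13, 11))  # 143
```

An approximate multiplier would replace some of the exact compressors in step 2 with approximate ones, which is where the hardware savings discussed in this paper arise.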

In this paper, we propose an optimized 4-2 approximate compressor based on an existing compressor. The original compressor was implemented using OR, NOR, and XOR gates, but in our design, we elaborate the Boolean expressions of the compressor to obtain optimizable forms. Then, we replace the gates of the compressor with compound gates, such as AO and OA gates, to reduce hardware costs. As a result, the proposed approximate design’s area and power are improved by 62.5% and 65.7%, respectively. The main contributions of this paper are as follows:

· We systematically analyze an existing approximate 4-2 compressor and optimize it by exploiting compound gates to reduce hardware costs using logic optimization techniques.

· We compare the hardware performance of the traditional and the proposed compressors to show the improvement and use them in approximate multiplier designs to demonstrate the efficacy of the proposed compressor.

· We compare the proposed compressor with other compressors in hardware aspects and show the benefit of the proposed design.

2. Related Work

Various types of compressors have been presented in the literature, such as 4-2, 5-2, and 7-3 compressors [5-15]. 4-2 approximate compressors are commonly used, and some are illustrated in Fig. 1. All the compressors except for the exact one belong to the low-accuracy compressor category, which offers considerable hardware cost benefits [9]. The exact compressor is composed of two full adders, as illustrated in Fig. 1(a). With five bits of input (X$_{4}$-X$_{1}$ and C$_{IN}$), two full adders are connected to generate three output bits (C$_{OUT}$, Carry, and Sum). Each input has equal binary weight, and Sum has the same weight as the inputs, while the weights of C$_{OUT}$ and Carry are one bit higher. C$_{OUT}$ is independent from C$_{IN}$. Each output is derived through the following equations.

(1)
$C_{OUT}=\left(X_1 \oplus X_2\right) X_3+\overline{\left(X_1 \oplus X_2\right)} X_1 .$
(2)
$\text{Carry}=\left(X_1 \oplus X_2 \oplus X_3 \oplus X_4\right) C_{IN}+\overline{\left(X_1 \oplus X_2 \oplus X_3 \oplus X_4\right)} X_4 .$
(3)
$\text{Sum}=X_1 \oplus X_2 \oplus X_3 \oplus X_4 \oplus C_{IN} .$
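As a quick sanity check of Eqs. (1)-(3), the following Python sketch evaluates the exact compressor for all 32 input combinations and confirms that the outputs encode the number of ones at the input, with C$_{OUT}$ and Carry weighted one bit higher than Sum. The function name `exact_compressor` is hypothetical.

```python
from itertools import product


def exact_compressor(x1, x2, x3, x4, cin):
    """Exact 4-2 compressor following Eqs. (1)-(3); all arguments are 0 or 1."""
    x12 = x1 ^ x2
    x1234 = x12 ^ x3 ^ x4
    cout = (x12 & x3) | ((1 - x12) & x1)         # Eq. (1)
    carry = (x1234 & cin) | ((1 - x1234) & x4)   # Eq. (2)
    s = x1234 ^ cin                              # Eq. (3)
    return cout, carry, s


# Exhaustive check: Sum has weight 1, Carry and C_OUT have weight 2.
for bits in product((0, 1), repeat=5):
    cout, carry, s = exact_compressor(*bits)
    assert sum(bits) == s + 2 * (carry + cout)
```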

As one of the earliest approximate compressor designs, Momeni et al. presented an approximate 4-2 compressor (referred to as ``Momeni''), which is shown in Fig. 1(b) [5]. In contrast to the exact compressor, the Momeni design removes the C$_{IN}$ and C$_{OUT}$ signals, which simplifies carry propagation between compressors so that partial products are reduced effectively. It also reduces the hardware design complexity. Akbari et al. proposed dual-quality approximate 4-2 compressors [6]. These designs can switch between exact and approximate operating modes: each is composed of an approximate part and a supplementary part, and each part is activated according to the mode. Two of these designs are illustrated in Figs. 1(c) and (d), respectively. The carry prediction in the first design (referred to here as ``Akbar1'') is enhanced in the second design (referred to as ``Akbar2''). Approximate 4-2 compressors referred to here as ``Venka'' [7] and ``Ahma'' [8] were proposed later. Like the Momeni compressor, these compressors were designed based on truth tables, which are a representative way to check the error distance for each input combination. The Venka compressor in Fig. 1(e) replaces several XOR gates with OR gates because XOR gates have a significant impact on hardware costs. The Ahma compressor has reasonable accuracy and is hardware-efficient since it consists of only three NOR gates and one NAND gate, as illustrated in Fig. 1(f).

Fig. 1. Schematic of 4-2 compressors: (a) Exact 4-2; (b) Momeni et al. [5]; (c)-(d) Akbari et al. [6]; (e) Venkatachalam et al. [7]; (f) Ahmadinejad et al. [8].

3. Design Optimization of Approximate Compressor

Among the approximate 4-2 compressors presented in Section 2, we focus on the Momeni design. In this section, we briefly review the Momeni compressor and optimize it. We use Boolean equations and De Morgan’s law for optimization.

3.1 Momeni Compressor

The Momeni compressor was designed to enhance design efficiency by eliminating the carry input and output signals (C$_{IN}$ and C$_{OUT}$) of the exact 4-2 compressor. Fig. 1(b) shows the circuit of the Momeni approximate 4-2 compressor, which consists of one OR gate, two XNOR gates, and three NOR gates. The expressions for the Carry and Sum outputs can be written as follows.

(4)
$\text{Carry}=\overline{\overline{X_1+X_2}+\overline{X_3+X_4}} .$
(5)
$\text{Sum}=\overline{X_1 \oplus X_2}+\overline{X_3 \oplus X_4} .$

Table 1 gives the truth table of the Momeni design for all possible input combinations, where Difference is the approximate output value (2·Carry + Sum) minus the exact sum of the four inputs. Errors occur for four input combinations (0000, 0011, 1100, and 1111), and the error distance is bounded by ±1.

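Table 1. Truth table of the Momeni compressor.

| X$_{4}$ | X$_{3}$ | X$_{2}$ | X$_{1}$ | Carry | Sum | Difference |
|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 1 | +1 |
| 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 0 | 0 | 1 | 1 | 0 | 1 | -1 |
| 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 0 | 1 | 0 | 1 | 1 | 0 | 0 |
| 0 | 1 | 1 | 0 | 1 | 0 | 0 |
| 0 | 1 | 1 | 1 | 1 | 1 | 0 |
| 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 0 | 0 | 1 | 1 | 0 | 0 |
| 1 | 0 | 1 | 0 | 1 | 0 | 0 |
| 1 | 0 | 1 | 1 | 1 | 1 | 0 |
| 1 | 1 | 0 | 0 | 0 | 1 | -1 |
| 1 | 1 | 0 | 1 | 1 | 1 | 0 |
| 1 | 1 | 1 | 0 | 1 | 1 | 0 |
| 1 | 1 | 1 | 1 | 1 | 1 | -1 |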
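The error pattern in Table 1 can be reproduced with a short Python sketch. It models Carry as $(X_1+X_2)(X_3+X_4)$, which matches the Carry column of the table, and Sum as in Eq. (5); the function name `momeni` and the dictionary `errors` are hypothetical.

```python
from itertools import product


def momeni(x1, x2, x3, x4):
    """Behavioral model of the Momeni compressor; returns (carry, sum)."""
    carry = (x1 | x2) & (x3 | x4)
    s = (1 - (x1 ^ x2)) | (1 - (x3 ^ x4))   # XNOR(x1, x2) OR XNOR(x3, x4)
    return carry, s


errors = {}
for x4, x3, x2, x1 in product((0, 1), repeat=4):
    carry, s = momeni(x1, x2, x3, x4)
    # Difference = approximate value minus the exact number of ones.
    diff = (2 * carry + s) - (x1 + x2 + x3 + x4)
    if diff:
        errors[f"{x4}{x3}{x2}{x1}"] = diff

print(errors)  # {'0000': 1, '0011': -1, '1100': -1, '1111': -1}
```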

3.2 Proposed Optimized Compressor

Digital logic circuits can be expressed as Boolean equations, and equations in certain forms can be mapped directly to compound gates. We therefore examine the Boolean form of the existing design and rewrite it so that the compressor can be built from compound gates.

Compound gates are mainly composed of AND and OR gates. If a circuit is in the sum-of-products (SOP) or product-of-sums (POS) form, then the gates can be changed to a compound gate: the AO type or OA type, respectively. The exact types of compound gates usually depend on the order of gates and the number of inputs. For example, consider the following expression:

(6)
$Y=\left(A_1+A_2\right) \cdot\left(A_3+A_4\right) . $

This expression is made up of two OR gates and one AND gate, and it is in POS form. Since each OR gate has two inputs, this circuit can be replaced with one compound gate, OA22. Compound gates can be implemented very efficiently at the transistor level, and most CMOS standard-cell libraries include them. Therefore, the hardware cost of OA22 is lower than that of the original circuit, while the result of the logic operation is identical.
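The relation between the gate name and its logic function can be sketched behaviorally in Python. The helper `compound_gate` is hypothetical and assumes the common standard-cell naming convention in which the digits after AO or OA give the number of inputs in each first-level group; it models only the logic function, not the transistor-level implementation.

```python
from itertools import product


def compound_gate(kind: str, *groups) -> int:
    """Evaluate an AO or OA compound gate; each group is a tuple of input bits."""
    if kind == "AO":   # AND inside each group, OR across groups (SOP form)
        return int(any(all(g) for g in groups))
    if kind == "OA":   # OR inside each group, AND across groups (POS form)
        return int(all(any(g) for g in groups))
    raise ValueError(f"unknown compound gate kind: {kind}")


# OA22 takes two groups of two inputs and reproduces Eq. (6).
for a1, a2, a3, a4 in product((0, 1), repeat=4):
    assert compound_gate("OA", (a1, a2), (a3, a4)) == (a1 | a2) & (a3 | a4)
```

Under the same convention, an AO221 gate would be called as `compound_gate("AO", (a, b), (c, d), (e,))`, with two two-input AND groups and one single input feeding the OR.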

The signals of the Momeni compressor can be optimized in the same way. First, applying De Morgan’s law to the Carry signal yields a POS form. The derivation is as follows.

(7)
$\text{Carry}=\overline{\overline{X_1+X_2}+\overline{X_3+X_4}}=\left(X_1+X_2\right) \cdot\left(X_3+X_4\right) .$

Because it is in POS form, the Carry signal can be implemented as a single compound gate. In addition, the generated Carry signal can be reused when generating the Sum signal. The Boolean equation of the Sum signal is derived as follows:

(8)
$\begin{aligned} \text{Sum} &=\overline{X_1 \oplus X_2}+\overline{X_3 \oplus X_4} \\ &=\overline{\left(X_1+X_2\right) \cdot\left(\overline{X_1}+\overline{X_2}\right)}+\overline{\left(X_3+X_4\right) \cdot\left(\overline{X_3}+\overline{X_4}\right)} \\ &=\overline{\left(X_1+X_2\right)}+\overline{\left(\overline{X_1}+\overline{X_2}\right)}+\overline{\left(X_3+X_4\right)}+\overline{\left(\overline{X_3}+\overline{X_4}\right)} \\ &=\overline{X_1}\,\overline{X_2}+X_1 X_2+\overline{X_3}\,\overline{X_4}+X_3 X_4 \\ &=X_1 X_2+X_3 X_4+\overline{X_1+X_2}+\overline{X_3+X_4} \\ &=X_1 X_2+X_3 X_4+\overline{\left(X_1+X_2\right) \cdot\left(X_3+X_4\right)} \\ &=X_1 X_2+X_3 X_4+\overline{\text{Carry}} . \end{aligned}$

First, each XOR is rewritten with AND and OR operations. De Morgan’s law then splits the expression into four terms, and reapplying De Morgan's law to each term yields a sum of four products. In $\overline{X_{1}}\,\overline{X_{2}}+\overline{X_{3}}\,\overline{X_{4}}$, moving the NOT operations outward exposes the complement of Carry. Therefore, the Sum signal is a sum of three terms: two products and the inverted Carry signal.

Consequently, the Carry and Sum signals are expressed in POS and SOP forms, respectively, so we can apply compound gates to both signals. Fig. 2 shows the proposed optimized compressor. The Carry signal is composed of two OR gates and one AND gate, so its compound gate is OA22. The Sum signal consists of two AND gates and one three-input OR gate: the two AND outputs and the inverted Carry signal are the inputs of the final OR gate. Therefore, the compound gate for the Sum signal is AO221.
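The equivalence between the original and optimized forms can be checked exhaustively with a minimal Python sketch. The gate-level structure assumed for the original design (three NOR gates for Carry; two XNOR gates and one OR gate for Sum) is inferred from the gate count in Section 3.1, and the function names are hypothetical.

```python
from itertools import product


def momeni_original(x1, x2, x3, x4):
    """Original form: three NOR gates for Carry, two XNOR plus one OR for Sum."""
    def nor(a, b):
        return 1 - (a | b)

    def xnor(a, b):
        return 1 - (a ^ b)

    carry = nor(nor(x1, x2), nor(x3, x4))
    s = xnor(x1, x2) | xnor(x3, x4)
    return carry, s


def momeni_optimized(x1, x2, x3, x4):
    """Optimized form: OA22 for Carry, AO221 for Sum with inverted Carry."""
    carry = (x1 | x2) & (x3 | x4)               # OA22
    s = (x1 & x2) | (x3 & x4) | (1 - carry)     # AO221
    return carry, s


# Both forms must agree on all 16 input combinations.
assert all(momeni_original(*x) == momeni_optimized(*x)
           for x in product((0, 1), repeat=4))
```

Because the two forms are logically identical, the optimization changes only the hardware cost and leaves the error behavior in Table 1 untouched.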

Fig. 2. Diagram of the proposed optimized compressor.

4. Experimental Result

In this section, we evaluate the hardware performance of the proposed and existing compressors. To evaluate and compare the designs, we implemented them in Verilog HDL and synthesized them with 32-nm CMOS technology using Synopsys Design Compiler.

4.1 Performance of 4-2 Compressors

Table 2 shows the simulation results of the compressors in terms of area, power, delay, and energy. The proposed design’s area is reduced from 18.30 µm$^{2}$ to 6.86 µm$^{2}$, which is about 2.7 times smaller than the original design. The power consumption is likewise about 2.9 times lower than that of the original, while the delay is unchanged at 0.14 ns. As a result, the proposed design achieves an energy reduction of 63.5% compared to the original one. This significant hardware cost reduction is the result of the optimization using compound gates.

Table 2. Hardware performance summary of compressors.

| Design | Area (µm$^{2}$) | Power (µW) | Delay (ns) | Energy (fJ) |
|---|---|---|---|---|
| Original | 18.30 | 3.85 | 0.14 | 0.52 |
| Optimized | 6.86 | 1.32 | 0.14 | 0.19 |
| Improvement | 62.5% | 65.7% | - | 63.5% |
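The improvement figures follow directly from the Original and Optimized rows of Table 2. The Python sketch below recomputes them as relative reductions and as ratios; the helper name `improvement` and the dictionary `table2` are hypothetical.

```python
def improvement(original: float, optimized: float) -> float:
    """Relative reduction in percent."""
    return (original - optimized) / original * 100


# (original, optimized) values taken from Table 2.
table2 = {
    "area (um^2)": (18.30, 6.86),
    "power (uW)": (3.85, 1.32),
    "energy (fJ)": (0.52, 0.19),
}

for metric, (orig, opt) in table2.items():
    print(f"{metric}: {improvement(orig, opt):.1f}% lower, {orig / opt:.1f}x")
```

The ratios of about 2.7 for area and 2.9 for power correspond to the reductions of 62.5% and 65.7% reported in the table.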

4.2 Performance of Multipliers using 4-2 Compressors

We also simulated the hardware performance of approximate multipliers built with these compressors. We used both the C-N and C-FULL multiplier configurations. In the C-N configuration, approximate compressors are used only for the less significant half of the columns of the partial product matrix, whereas in the C-FULL configuration, they are used for every column. Fig. 3 shows the applied reduction scheme for the unsigned $8\times 8$ multiplier using the C-N configuration. Tables 3 and 4 show the hardware costs of the $8\times 8$ multipliers for the C-N and C-FULL configurations, respectively [10].

In the C-N configuration, the area and power of the multiplier with the optimized compressor decreased by 13.7% and 13.9%, respectively, compared to the original one, and the energy decreased by 13.8%. In the C-FULL configuration, the area and power of the multiplier with the optimized compressor decreased by 29.3% and 34.0%, respectively. For both configurations, the multiplier’s delay is not improved because the optimized compressor has the same delay as the original, as shown in Table 2. The hardware cost improves more in the C-FULL configuration than in the C-N configuration because the approximate compressor is used in every column.

Fig. 3. C-N approximate multiplier configuration.

Table 3. Hardware performance summary of C-N multipliers.

| Design | Area (µm$^{2}$) | Power (µW) | Delay (ns) | Energy (fJ) |
|---|---|---|---|---|
| Original | 752.01 | 203.99 | 1.16 | 237.27 |
| Optimized | 649.08 | 175.73 | 1.16 | 204.42 |
| Improvement | 13.7% | 13.9% | - | 13.8% |

Table 4. Hardware performance summary of C-FULL multipliers.

| Design | Area (µm$^{2}$) | Power (µW) | Delay (ns) | Energy (fJ) |
|---|---|---|---|---|
| Original | 662.55 | 160.42 | 1.02 | 163.18 |
| Optimized | 468.13 | 105.88 | 1.02 | 107.62 |
| Improvement | 29.3% | 34.0% | - | 34.0% |
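One rough way to see why the C-FULL configuration benefits more is to express each multiplier's area saving in units of the per-compressor saving from Table 2. The Python sketch below does this arithmetic with the values from Tables 2-4. The interpretation as a number of "compressor equivalents" is an assumption: it holds only if the saving comes solely from swapping compressors, which synthesis does not guarantee. The helper name `compressor_equivalents` is hypothetical.

```python
# Area saved by one optimized compressor, from Table 2 (um^2).
COMPRESSOR_SAVING = 18.30 - 6.86


def compressor_equivalents(original_area: float, optimized_area: float) -> float:
    """Multiplier area saving expressed in per-compressor savings."""
    return (original_area - optimized_area) / COMPRESSOR_SAVING


# (original, optimized) multiplier areas in um^2, from Tables 3 and 4.
multipliers = {"C-N": (752.01, 649.08), "C-FULL": (662.55, 468.13)}

for name, (orig, opt) in multipliers.items():
    saved = orig - opt
    print(f"{name}: {saved:.2f} um^2 saved ({saved / orig * 100:.1f}%), "
          f"about {compressor_equivalents(orig, opt):.1f} compressor equivalents")
```

Under that assumption, the C-FULL saving corresponds to roughly twice as many compressor equivalents as the C-N saving, which is in line with approximate compressors being used in every column rather than in half of them.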

4.3 Comparison with Other Compressors

We compare the proposed compressor with eight other compressors, which we divide into two groups according to their characteristics. The first group includes the four compressors mentioned in Section 2 (Akbar1, Akbar2, Venka, and Ahma), and the second group contains four other compressors that exploit compound gates (Yang1, Yang2, Yang3 [13], and Ha [14]). The Yang1 compressor utilizes OAI212, the Yang2 and Yang3 compressors use AO222, and the Ha compressor uses AO22.

Fig. 4 summarizes the implementation results of the compressors in terms of area, power, delay, and energy. The blue and yellow bars correspond to the first and second groups, respectively, and the red bars indicate the proposed design. Before optimization, every compressor in the first group except Venka has better hardware performance than the original Momeni compressor. After optimization, the proposed compressor is approximately 43% smaller and dissipates approximately 41% less power than the Akbar1 and Akbar2 compressors. Furthermore, the proposed compressor requires more than 60% less area, power, and energy than the Venka compressor, which consumes the most resources in the group. Our design is also about 23% lower in area and power consumption than the Ahma compressor. Although the delay of the proposed design is not improved, its energy is 0.19 fJ, which outperforms the other compressors because the power is greatly reduced.

Next, we compare the proposed compressor with the second group. The area and power of the original Momeni compressor are worse than those of the Ha compressor, which has the best hardware performance among the compressors in the second group. After the optimization, however, the proposed compressor outperforms the second group in all hardware aspects. Specifically, the proposed compressor improves on the Ha compressor by 48%, 58%, 22%, and 66% in area, power, delay, and energy, respectively. Compared to the Yang1 compressor, which requires the most hardware resources, the proposed compressor's area, power, and delay are approximately 70%, 73%, and 41% smaller, respectively. In particular, the proposed design consumes about 84% less energy than the Yang1 compressor. Overall, our design optimization allows the proposed compressor to outperform the others in hardware cost.

Fig. 4. Comparison with other approximate 4-2 compressors in terms of area, power, delay, and energy.

5. Conclusion

In this paper, we have presented an optimized Momeni 4-2 approximate compressor that reduces hardware resource consumption considerably. When implemented using 32-nm CMOS technology, the area and power of the proposed compressor are improved by 62.5% and 65.7%, respectively. In addition, our design allows the reduction of the area by 13.7% and 29.3% in C-N and C-FULL multiplier configurations, respectively. In particular, this optimized design reduced energy consumption by 34% in the C-FULL multiplier configuration. Also, the proposed compressor outperforms the other compressors considered here in terms of area, power, and energy.

ACKNOWLEDGMENTS

This work was supported in part by the Basic Science Research Program through National Research Foundation of Korea (NRF) funded by the Korean Government (MSIT) (NRF-2020R1A4A1019628) and the Ministry of Education (NRF-2019R1I1A3A01061266) and in part by the BK21 FOUR project (AI-driven Convergence Software Education Research Program) funded by the Ministry of Education, School of Computer Science and Engineering, Kyungpook National University, Korea (4199990214394).

REFERENCES

[1] Moreau T., Sampson A., Ceze L., 2015, Approximate Computing: Making Mobile Systems More Efficient, IEEE Pervasive Computing, Vol. 14, No. 2, pp. 9-13.
[2] Chippa V. K., Chakradhar S. T., Roy K., Raghunathan A., 2013, Analysis and characterization of inherent application resilience for approximate computing, ACM/EDAC/IEEE Design Automation Conference (DAC), Article 113, pp. 1-9.
[3] Gupta V., Mohapatra D., Raghunathan A., Roy K., 2013, Low-Power Digital Signal Processing Using Approximate Adders, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 32, No. 1, pp. 124-137.
[4] Wallace C. S., 1964, A Suggestion for a Fast Multiplier, IEEE Transactions on Electronic Computers, Vol. EC-13, No. 1, pp. 14-17.
[5] Momeni A., Han J., Montuschi P., Lombardi F., 2015, Design and Analysis of Approximate Compressors for Multiplication, IEEE Transactions on Computers, Vol. 64, No. 4, pp. 984-994.
[6] Akbari O., Kamal M., Afzali-Kusha A., Pedram M., 2017, Dual-Quality 4:2 Compressors for Utilizing in Dynamic Accuracy Configurable Multipliers, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 25, No. 4, pp. 1352-1361.
[7] Venkatachalam S., Ko S., 2017, Design of Power and Area Efficient Approximate Multipliers, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 25, No. 5, pp. 1782-1786.
[8] Ahmadinejad M., Moaiyeri M. H., Sabetzadeh F., 2019, Energy and area efficient imprecise compressors for approximate multiplication at nanoscale, AEU - International Journal of Electronics and Communications, Vol. 110.
[9] Kong T., Li S., 2021, Design and Analysis of Approximate 4-2 Compressors for High-Accuracy Multipliers, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 29, No. 10, pp. 1771-1781.
[10] Strollo A. G. M., Napoli E., De Caro D., Petra N., Meo G. D., 2020, Comparison and Extension of Approximate 4-2 Compressors for Low-Power Approximate Multipliers, IEEE Transactions on Circuits and Systems I: Regular Papers, Vol. 67, No. 9, pp. 3021-3034.
[11] Chang C.-H., Gu J., Zhang M., 2004, Ultra low-voltage low-power CMOS 4-2 and 5-2 compressors for fast arithmetic circuits, IEEE Trans. Circuits Syst. I Reg. Papers, Vol. 51, No. 10, pp. 1985-1997.
[12] Saha A., Pal R., Naik A. G., Pal D., 2018, Novel CMOS multi-bit counter for speed-power optimization in multiplier design, AEU - International Journal of Electronics and Communications, Vol. 95, pp. 189-198.
[13] Yang Z., Han J., Lombardi F., 2015, Approximate compressors for error-resilient multiplier design, IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFTS), pp. 183-186.
[14] Ha M., Lee S., 2018, Multipliers With Approximate 4-2 Compressors and Error Recovery Modules, IEEE Embedded Systems Letters, Vol. 10, No. 1, pp. 6-9.
[15] Lin C.-H., Lin I.-C., 2013, High accuracy approximate multiplier with error correction, IEEE International Conference on Computer Design (ICCD), pp. 33-38.
[16] Kim Y., 2019, An Accuracy Enhanced Error Tolerant Adder with Carry Prediction for Approximate Computing, IEIE Transactions on Smart Computing and Processing, Vol. 8, No. 4, pp. 324-330.
[17] Kim Y., 2019, A Novel Approximate Adder with Enhanced Low-Cost Carry Prediction for Error Tolerant Computing, IEIE Transactions on Smart Computing and Processing, Vol. 8, No. 4, pp. 506-510.
[18] Seo H., Yang Y. S., Kim Y., 2020, Design and Analysis of an Approximate Adder with Hybrid Error Reduction, Electronics, Vol. 9, No. 3, pp. 471:1-13.
[19] Seo H., Lee J., Lee D., Kim B., Kim Y., 2021, Design and Analysis of a Low-Cost Approximate Adder with OR and Zero Truncation, IEIE Transactions on Smart Computing and Processing, Vol. 10, No. 4, pp. 309-314.
[20] Lee J., Seo H., Seok H., Kim Y., 2021, A Novel Approximate Adder Design using Error Reduced Carry Prediction and Constant Truncation, IEEE Access, Vol. 9, pp. 119939-119953.
[21] Seok H., Seo H., Lee J., Kim Y., 2021, COREA: Delay- and Energy-Efficient Approximate Adder Using Effective Carry Speculation, Electronics, Vol. 10, No. 18, pp. 2234:1-12.
[22] Lee J., Seo H., Kim Y., Kim Y., 2020, Approximate adder design with simplified lower-part approximation, IEICE Electronics Express, Vol. 17, No. 15, pp. 20200218.
[23] Choi W., Shim M., Seok H., Kim Y., 2021, DCPA: approximate adder design exploiting dual carry prediction, IEICE Electronics Express, Vol. 18, No. 23, pp. 20210431.

Author

Hyelin Seok

Hyelin Seok received a B.S. degree from the School of Computer Science and Engineering, Kyungpook National University, Daegu, Republic of Korea in 2022, where she is pursuing an M.S. degree. Her research interests include computer architecture, approximate arithmetic, and new computing systems.

Hyoju Seo

Hyoju Seo received B.S. and M.S. degrees from the School of Computer Science and Engineering, Kyungpook National University, Daegu, Republic of Korea, in 2020 and 2022, respectively, where she is currently pursuing a Ph.D. Her research interests include approximate computing, neuromorphic computing, deep learning accelerators, and image processing.

Jungwon Lee

Jungwon Lee received a B.S. degree from the School of Computer Science and Engineering, Kyungpook National University, Daegu, Republic of Korea in 2021, where she is pursuing an M.S. degree. Her research interests include deep learning, approximate arithmetic, and approximate DRAM.

Yongtae Kim

Yongtae Kim received B.S. and M.S. degrees in electrical engineering from Korea University, Seoul, Republic of Korea, in 2007 and 2009, respectively, and a Ph.D. degree from the Department of Electrical and Computer Engineering at Texas A&M University, College Station, TX, in 2013. From 2013 to 2018, he was a software engineer with the Intel Corporation, Santa Clara, CA. Since 2018, he has been with the School of Computer Science and Engineering at Kyungpook National University, Daegu, Republic of Korea, where he is currently an assistant professor. His research interests are energy-efficient integrated circuits and systems, particularly neuromorphic computing and approximate computing, and new memory devices and architectures.