IEIESPC(IEIE Transactions on Smart Processing and Computing)

IEIESPC Vol. 12, No. 01, p.38-47

ISSN (online) :

2287-5255

Received : 28 November 2022, Revised : 29 December 2022, Accepted : 31 December 2022

Article Id (etc) :

https://doi.org/10.5573/IEIESPC.2023.12.1.38

Regular Paper

This paper presents the digital circuit design of an architecture for computing the DTCWT and its implementation on an FPGA.

Optimized Distributive Arithmetic-based Hardware Accelerator for Dual Tree Complex Wavelet Transform Computation

Yashavanthakumar T. R.¹ Sampathrao L. Pinjare² Cyril Prasanna Raj P.³

(Research Scholar, Reva University; Asst. Professor, Dept. of ECE, Govt. Engineering College, Devagiri, Haveri-581110 yashvanth3@gmail.com)
(Dept. of ECE, NMIT, Bangalore slpinjare@gmail.com)
(Professor, Cambridge Institute of Technology, Bangalore cyrilyahoo@gmail.com)

*Corresponding Author: Yashavanthakumar T. R.

License :

This is an Open-Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited (www.theieie.org).

Abstract

Hardware architectures for fast computation of complex wavelet transforms for image processing require optimized design approaches. The Dual Tree Complex Wavelet Transform (DTCWT) is twice as complex as the Discrete Wavelet Transform (DWT). In this work, a DTCWT architecture was designed using the distributive arithmetic (DA) algorithm, customized for a 10-tap filter architecture. Redundancy in the filter coefficients was exploited to optimize the DA partial products, reducing the look-up table memory by 97.65% per filter. The reduced architecture was modeled in Verilog HDL and implemented on a Xilinx FPGA. The operating frequency is 312 MHz, and the power dissipation is approximately 1 W. The proposed model is suitable for high-speed computation of DTCWT sub-bands on an FPGA platform.

Keywords

Distributive arithmetic algorithm, Memory efficient, Wavelet transform, Image processing, FPGA

1. Introduction

Wavelet-based image processing algorithms provide edge information in a given image, localized in sub-bands at different resolutions, along with the intensity component of the image. The Discrete Wavelet Transform (DWT) of an image generates sub-bands that capture localized intensity and directional features. Directional orientations other than $0^{\circ}$, $90^{\circ}$, and $45^{\circ}$ are captured by the Dual Tree Complex Wavelet Transform (DTCWT). Shift invariance and additional directional orientations are the major advantages of DTCWT over DWT. DTCWT uses twice as many filters as DWT, and the computation complexity is $2^{n}$:1.

The use of DTCWT over DWT for image processing applications is widely reported in the literature [1-3]. Hardware implementation methods for DTCWT-based image processing provide a choice for real-time applications. The filter structure in DTCWT is similar to that in DWT, and most of the hardware architectures reported in the literature for DTCWT are extensions of DWT architectures.

Customized hardware architectures therefore need to be developed for DTCWT. Hardware models based on the Systolic Array (SA) algorithm and the Distributive Arithmetic (DA) algorithm are widely used to develop fast architectures for DWT computation. Mohanty and Meher reported multiplier-less structures for 9/7 filters and designed a DA structure with serial and parallel data-processing modules [4]. Mahajan and Mohanty designed arithmetic modules for a DA-based DWT structure to minimize the critical path and improve processing speed [5]. Mohanty and Meher developed a multi-level decomposition structure for computing DWT sub-bands for image processing applications by optimizing memory requirements and improving operating speed [6].

Aziz et al. developed a DA-based DWT architecture based on 5/3 filters [7]. Similarly, Gardezi et al. used a canonical signed digit number system to reduce the number of arithmetic operations for 2D DWT computation [8]. Naik et al. developed a reconfigurable DWT IP and implemented it on a Xilinx FPGA, demonstrating a delay of 11.577 ns and a total power dissipation of less than 23.8 mW [9]. Anirban and Ayan developed a DA-based architecture for DWT with improved processing speed and a memory-efficient structure for 1D and 2D computation [10]. The DWT architectures reported in the literature are for either a 9/7 filter or a 5/3 filter. DTCWT computation uses a 10-tap filter, so the architectures developed for 9/7 and 5/3 filter structures must be redesigned.

Poornima et al. developed high-speed architectures based on systolic array logic for computing DTCWT and reported their implementation on an FPGA, limiting power dissipation to less than 10 mW [11]. Divakara et al. reported a hybrid DA structure for 2D DTCWT computation [12]. The methods presented in their work combine SA and DA logic to develop a DTCWT architecture and were implemented on an FPGA, demonstrating optimization in area, power, and speed performance [13].

In DTCWT, each stage has four filters, and each filter has 10 taps. The resulting redundancies can be exploited by combining the filter operations and eliminating repeated computations. In this work, new methods were developed to identify redundancies among the filters and processing elements. Data movement and reuse of arithmetic operations were considered in the design of the DTCWT computation. The developed method was implemented on an FPGA, and its performance was compared with other methods.

2. Related Work

The most popular DTCWT structure is based on the Kingsbury 10-tap filter. The 2D DTCWT structure is presented in Fig. 1. The input X(n1, n2) represents 2D data and is processed by two stages of filters. In the first stage, there are four filters: $L^{1}_{a}$, $H^{1}_{a}$, $L^{1}_{b}$, and $H^{1}_{b}$, which are the low pass real, high pass real, low pass imaginary, and high pass imaginary filters, respectively. The input X is processed along the rows to generate the first-stage filter outputs, which are further processed by the second-stage filter banks.

In the second stage, the processing is carried out along the columns of the first-stage output. The filter coefficients for the second stage are represented as $L^{2}_{a}$, $H^{2}_{a}$, $L^{2}_{b}$, and $H^{2}_{b}$. The second-stage filter generates eight complex sub-bands represented as XLL$_{1}$, XLL$_{2}$, XLH$_{1}$, XLH$_{2}$, XHL$_{1}$, XHL$_{2}$, XHH$_{1}$, and XHH$_{2}$. Considering real sub-bands, there are 16 sub-bands, of which 4 are low pass and 12 are high pass. The filter coefficients for the first stage are presented in Table 1.
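The sub-band count can be checked with a short enumeration. This is only a sketch: it assumes that each of the four row filters is paired with each of the four column filters, and the labels `La`, `Ha`, `Lb`, and `Hb` are illustrative names, not identifiers from the design.

```python
from itertools import product

# Illustrative labels: trees "a" (real) and "b" (imaginary), bands L (low) and H (high).
FILTERS = ["La", "Ha", "Lb", "Hb"]


def enumerate_subbands():
    """Pair every row filter (stage 1) with every column filter (stage 2)."""
    return list(product(FILTERS, FILTERS))


def is_low_pass(subband):
    """A sub-band is low pass only if both its row and column filters are low pass."""
    row_filter, column_filter = subband
    return row_filter.startswith("L") and column_filter.startswith("L")


subbands = enumerate_subbands()
low_pass = [s for s in subbands if is_low_pass(s)]
high_pass = [s for s in subbands if not is_low_pass(s)]
print(len(subbands), len(low_pass), len(high_pass))
```

Under this pairing assumption the enumeration gives 16 real sub-bands, 4 of them low pass and 12 high pass, in agreement with the counts stated above.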

The number of sub-bands generated by DTCWT is 16, which is four times the number of DWT sub-bands. The filter coefficients in the second stage differ from those in the first stage. Each filter has 10 coefficients; the input data are processed along the rows in the first stage and along the columns in the second stage. If there are N inputs in each of M rows, processing one row requires 10N multiplications and 9N additions.

Considering M rows with each row having N elements, the total numbers of multiplications and additions are 10MN and 9MN, respectively. For the first stage, as there are four filters, the total numbers of multiplications and additions are 40MN and 36MN, respectively. For the second stage, the total numbers of multiplications and additions are 80MN and 72MN. The total numbers of multiplication and addition operations for level-1 decomposition are 120MN and 108MN, respectively. In addition to arithmetic operations, the intermediate memory elements required are 4NM for the first stage and 16NM for the second stage. The propagation delay is one of the major challenges in DTCWT computation as the processing structure needs to process the entire image.
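The operation counts above can be tabulated for a concrete image size. The following is a minimal sketch that simply encodes the figures quoted in the text (10 taps, four first-stage filters, and a second stage costing twice the first); the function name and the 256 x 256 example size are assumptions made for illustration.

```python
def level1_costs(m_rows, n_cols, taps=10):
    """Return stage-1, stage-2, and total counts for level-1 decomposition."""
    samples = m_rows * n_cols
    mul_per_filter = taps * samples        # 10MN for one filter over the image
    add_per_filter = (taps - 1) * samples  # 9MN for one filter over the image
    stage1 = {"mul": 4 * mul_per_filter, "add": 4 * add_per_filter, "mem": 4 * samples}
    # The text gives 80MN and 72MN for stage 2, i.e. eight filter passes' worth of work.
    stage2 = {"mul": 8 * mul_per_filter, "add": 8 * add_per_filter, "mem": 16 * samples}
    total = {key: stage1[key] + stage2[key] for key in ("mul", "add", "mem")}
    return stage1, stage2, total


stage1, stage2, total = level1_costs(256, 256)
print(total)
```

For any M and N the totals reduce to 120MN multiplications and 108MN additions, which is the workload the DA architecture is meant to replace with table look-ups and accumulations.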

Both the arithmetic and memory operations need to be reduced to optimize area and speed. The filter coefficients require a 16-bit signed representation, which further increases the computation complexity of the arithmetic operations. Scaling the filter coefficients by 256, rounding to the nearest integer, and using 2’s complement signed representation reduces the number of bits from 16 to 9 and hence the computation complexity.
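The scaling step can be reproduced numerically. The sketch below quantizes the $L^{1}_{a}$ coefficients listed in Table 1 and checks that every result fits in a 9-bit 2’s complement word; the helper names `quantize` and `fits_signed` are illustrative.

```python
# First-stage low pass real filter coefficients, copied from Table 1.
L1A = [0, -0.08838, 0.08838, 0.69587, 0.69587,
       0.08838, -0.08838, 0.01122, 0.01122, 0]


def quantize(coefficients, scale=256):
    """Scale each coefficient and round it to the nearest integer."""
    return [int(round(c * scale)) for c in coefficients]


def fits_signed(value, bits=9):
    """Check that a value is representable in `bits`-bit 2's complement."""
    return -(1 << (bits - 1)) <= value <= (1 << (bits - 1)) - 1


scaled = quantize(L1A)
print(scaled)
print(all(fits_signed(v) for v in scaled))
```

The scaled values match the integer column of Table 1, and the largest magnitude, 178, needs 8 magnitude bits plus a sign bit, which is where the 9-bit width comes from.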

The level-1 decomposition generates 16 sub-bands, each of size N/2 x M/2. The low pass sub-band is further decomposed into level-2 and level-3 sub-bands if three-level decomposition is considered. To reduce the number of arithmetic operations and the propagation delay, a DA algorithm was considered and modified for DTCWT computation.

Fig. 1. DTCWT level-1 decomposition structure for 2D data.

Table 1. DTCWT filter coefficients for level-1 decomposition.

L¹a	L¹a x 256 (rounded)	H¹a	H¹a x 256 (rounded)
0	0	0	0
-0.08838	-23	-0.01122	-3
0.08838	23	0.01122	3
0.69587	178	0.08838	23
0.69587	178	0.08838	23
0.08838	23	-0.69587	-178
-0.08838	-23	0.69587	178
0.01122	3	-0.08838	-23
0.01122	3	-0.08838	-23
0	0	0	0
L¹b	L¹b x 256 (rounded)	H¹b	H¹b x 256 (rounded)
0.01122	3	0	0
0.01122	3	0	0
-0.08838	-23	-0.08838	-23
0.08838	23	-0.08838	-23
0.69587	178	0.69587	178
0.69587	178	-0.69587	-178
0.08838	23	0.08838	23
-0.08838	-23	0.08838	23
0	0	0.01122	3
0	0	-0.01122	-3

3. DA Architecture for DTCWT

In the DA algorithm, the input data are represented as 2’s complement signed numbers. The filter coefficients are grouped together according to the binary weighting of the input data bits. As the filter coefficients are constant, this grouping leads to partial products that are predefined for each input bit combination. In the proposed DA algorithm, the number of partial products is reduced by exploiting the redundancies in the DTCWT filters. The proposed DA algorithm is presented in this section. A generic convolution operation for an FIR filter is represented mathematically as Eq. (1):

(1)

$ Y=\sum_{k=1}^{K}A_{k}X_{k} $

$Y$ is the filter output, $X_{k}$ is the input, $A_{k}$ is the filter coefficient, and $K$ is the number of filter taps. The input $X_{k}$ is represented in N-bit 2’s complement form in Eq. (2):

(2)

$ X_{k}=-b_{k0}+\sum_{n=1}^{N-1}b_{kn}2^{-n} $

$b_{k0}$ is the sign bit, and $b_{kn}$ represents the remaining binary bits. Substituting Eq. (2) into Eq. (1), the FIR filter output is given as Eq. (3):

(3)

$ Y=\sum_{k=1}^{K}A_{k}\left[-b_{k0}+\sum_{n=1}^{N-1}b_{kn}2^{-n}\right] $

Expanding the terms in Eq. (3) and reorganizing the summation gives Eq. (4).

(4)

$ Y=-\sum_{k=1}^{K}b_{k0}A_{k}+\sum_{k=1}^{K}\sum_{n=1}^{N-1}\left(A_{k}b_{kn}\right)2^{-n} $

In Eq. (4), the first term is the sign-bit term, and the second term is the partial product term. For each bit position n, the binary bits $b_{kn}$ select which of the fixed coefficients $A_{k}$ are summed, so the partial products can be pre-computed for every bit combination. These pre-computed partial products are stored in memory.
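The look-up and accumulate idea behind Eq. (4) can be illustrated in software. The sketch below is not the hardware design; it assumes integer 2’s complement inputs (bit weights $2^{j}$ rather than the fractional weights $2^{-n}$, which differs only by a constant scale factor), a 9-bit input width, and the scaled $L^{1}_{a}$ coefficients from Table 1. The names `build_lut` and `da_filter` are illustrative.

```python
# Scaled first-stage low pass real coefficients (integer column of Table 1).
COEFFS = [0, -23, 23, 178, 178, 23, -23, 3, 3, 0]


def build_lut(coeffs):
    """Precompute the coefficient sum selected by every address bit pattern."""
    return [sum(c for k, c in enumerate(coeffs) if (address >> k) & 1)
            for address in range(1 << len(coeffs))]


def da_filter(samples, coeffs, width=9):
    """Compute sum(coeffs[k] * samples[k]) bit-serially using one LUT."""
    lut = build_lut(coeffs)
    accumulator = 0
    for bit in range(width):
        # Bit `bit` of every sample forms the LUT address for this cycle.
        address = 0
        for k, x in enumerate(samples):
            address |= ((x >> bit) & 1) << k
        partial = lut[address]
        if bit == width - 1:
            accumulator -= partial << bit  # sign bit carries negative weight
        else:
            accumulator += partial << bit
    return accumulator


samples = [12, -7, 255, -256, 0, 1, -1, 100, -100, 37]
direct = sum(c * x for c, x in zip(COEFFS, samples))
print(da_filter(samples, COEFFS), direct)
```

No multiplier appears in `da_filter`: each cycle is one table read, one shift, and one add or subtract, at the cost of a table with $2^{10}$ entries for a 10-tap filter.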

The memory is addressed by the input data bits, and the partial products read out are accumulated to generate the filter output. In DTCWT, there are four filters in the first stage. From Eq. (4), the outputs of the low pass and high pass filters are expressed in Eq. (5). The suffix x in Eq. (5) is either a or b, denoting the real and imaginary terms, which gives the expressions for all four filters.

(5a)

$Y_{Lx}=-\sum_{k=1}^{10}b_{k0}L_{xk}+\sum_{k=1}^{10}\left[\sum_{n=1}^{N-1}(b_{kn}L_{xk})2^{-n}\right]$

(5b)

$Y_{Hx}=-\sum_{k=1}^{10}b_{k0}H_{xk}+\sum_{k=1}^{10}\left[\sum_{n=1}^{N-1}(b_{kn}H_{xk})2^{-n}\right]$

With 10 filter coefficients, Eq. (5) is suitable for computing the DTCWT filter outputs, and there are $2^{10}$ possible partial products per filter. With four filters, four memory units of 1024 x 10 bits each are required (40960 bits in total). To reduce the number of partial products and increase the processing speed, a modified algorithm was developed. Splitting the second term of Eq. (5a) into two equal parts, as in Eq. (6), reduces the number of partial products and allows them to be reused.

(6)

$ Y_{La}=\sum_{k=1}^{5}\sum_{n=1}^{N-1}\left[\left(L_{ak}b_{kn}\right)2^{-n}\right]+\sum_{k=6}^{10}\sum_{n=1}^{N-1}\left[\left(L_{ak}b_{kn}\right)2^{-n}\right] $

In the reorganized DA expression, each of the two terms covers 5 filter coefficients, and hence the number of partial products per term is $2^{5}$ = 32. With 10-bit entries, the storage required per filter is 2 x 32 x 10 = 640 bits. Compared with the 1024 x 10 = 10240 bits of the direct form, this is a reduction of 93.75% per filter; a reduction of 98.43% is obtained when the 640 bits are compared with the 40960 bits of all four direct-form memories.
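The effect of the split can be checked by generating the two 32-entry tables for the YLa filter and comparing them with the YLa columns of Table 2. This is a sketch under the assumption that address bit A0 selects the first coefficient of each group, which is the ordering that reproduces the tabulated values; `build_lut` is an illustrative helper name.

```python
# Scaled L1a coefficients (integer column of Table 1).
L1A = [0, -23, 23, 178, 178, 23, -23, 3, 3, 0]


def build_lut(coeffs):
    """Sum the coefficients selected by each address; bit 0 selects coeffs[0]."""
    return [sum(c for k, c in enumerate(coeffs) if (address >> k) & 1)
            for address in range(1 << len(coeffs))]


lut1 = build_lut(L1A[:5])   # term 1: first five coefficients
lut2 = build_lut(L1A[5:])   # term 2: last five coefficients
full = build_lut(L1A)       # direct form: one 1024-entry table

# Every direct-form entry is recovered as the sum of one entry from each half.
split_matches_direct = all(
    full[address] == lut1[address & 0x1F] + lut2[address >> 5]
    for address in range(1024)
)
print(len(lut1) + len(lut2), len(full), split_matches_direct)
print(lut1[0b11100], lut2[0b01101])
```

The two half tables hold 64 entries in total instead of 1024, and the price is one extra adder to combine their outputs.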

Table 2 presents the partial products for the four filters based on the modified DA expression in Eq. (6). The block diagram of the modified DA structure is presented in Fig. 2. The input is stored in a parallel-in serial-out (PISO) register bank of depth 10. Once the data are loaded into the PISO registers, the LSBs of the 10 registers are read out to form the address for the memory unit. As the memory unit is split into two sections, the LSBs of the top five registers are used as the address for memory unit 1, and the LSBs of the bottom five registers are used as the address for memory unit 2.

Each partial product read out from the look-up tables (LUTs) is accumulated in the accumulator section, and the final output is generated at the output of the summer. As the input data width is 9, the partial products are read out 9 times for different combinations of address bits, and the accumulator performs 9 accumulations. In the 10th clock, the final output is generated at the output of the summer. The latency of the modified DA structure is 20 clocks (10 clocks for loading data into the PISO registers and 10 clocks for data reading and accumulation).
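The read-out sequence can be mimicked with a small behavioural model. The sketch below is a software approximation only: it assumes 9-bit 2’s complement inputs, shifts ten registers right once per clock, and uses the register LSBs as two 5-bit addresses. It does not model the loading phase or the exact clock counts quoted above, and all names are illustrative.

```python
# Scaled L1a coefficients (integer column of Table 1).
L1A = [0, -23, 23, 178, 178, 23, -23, 3, 3, 0]


def lut_for(coeffs):
    """Precompute the coefficient sum for every 5-bit address."""
    return [sum(c for k, c in enumerate(coeffs) if (a >> k) & 1)
            for a in range(1 << len(coeffs))]


def split_da(samples, coeffs, width=9):
    """Bit-serial DA with two 32-entry LUTs addressed by register LSBs."""
    mask = (1 << width) - 1
    registers = [x & mask for x in samples]  # 2's complement bit patterns
    lut1, lut2 = lut_for(coeffs[:5]), lut_for(coeffs[5:])
    accumulator = 0
    for clock in range(width):
        address1 = sum((registers[k] & 1) << k for k in range(5))
        address2 = sum((registers[k + 5] & 1) << k for k in range(5))
        partial = lut1[address1] + lut2[address2]
        weight = 1 << clock
        if clock == width - 1:
            accumulator -= partial * weight  # last bit read is the sign bit
        else:
            accumulator += partial * weight
        registers = [r >> 1 for r in registers]  # serial shift towards the LSB
    return accumulator


samples = [12, -7, 255, -256, 0, 1, -1, 100, -100, 37]
expected = sum(c * x for c, x in zip(L1A, samples))
print(split_da(samples, L1A), expected)
```

One accumulation is performed per input bit, which is why the number of read-out cycles tracks the input width rather than the number of taps.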

It was observed that the contents of YLa$_{\mathrm{LUT1}}$, YHa$_{\mathrm{LUT1}}$, YLb$_{\mathrm{LUT1}}$, and YHb$_{\mathrm{LUT1}}$ in memory locations 10000 to 11111 are 178, 23, 23, and -178 higher, respectively, than the contents in locations 00000 to 01111. It was also observed that the contents of YLa$_{\mathrm{LUT2}}$, YHa$_{\mathrm{LUT2}}$, YLb$_{\mathrm{LUT2}}$, and YHb$_{\mathrm{LUT2}}$ in memory locations 00000 to 01111 are identical to the contents in locations 10000 to 11111. Based on these observations, the reduced memory DA algorithm was designed.

Fig. 2. Modified DA structure for YLa filter.

Table 2. LUT contents of DTCWT filter.

Address bits (A4/9 A3/8 A2/7 A1/6 A0/5)	YLa_LUT1	YLa_LUT2	YHa_LUT1	YHa_LUT2	YLb_LUT1	YLb_LUT2	YHb_LUT1	YHb_LUT2
00000	0	0	0	0	0	0	0	0
00001	0	23	0	-178	0	0	0	23
00010	-23	-23	-3	178	3	178	-23	23
00011	-23	0	-3	0	3	178	-23	46
00100	23	3	3	-23	3	23	-23	3
00101	23	26	3	-201	3	210	-23	26
00110	0	-20	0	155	6	210	46	26
00111	0	3	0	-23	6	379	-46	49
01000	178	3	23	-23	23	-23	178	-3
01001	178	26	23	-201	23	-23	178	20
01010	155	-20	20	155	26	155	155	20
01011	155	3	20	-23	26	155	155	43
01100	201	6	26	-46	26	0	155	0
01101	201	29	26	-224	26	178	155	23
01110	178	-17	20	132	29	178	132	23
01111	178	6	20	-46	29	356	132	46
10000	178	0	23	0	23	0	-178	0
10001	178	23	23	-178	23	0	-178	23
10010	155	-23	20	178	26	178	-201	23
10011	155	0	20	0	26	178	-201	46
10100	201	3	26	-23	26	23	-201	3
10101	201	26	26	-201	26	210	-201	26
10110	178	-20	20	155	29	210	-224	26
10111	178	3	20	-23	29	379	-224	49
11000	356	3	46	-23	46	-23	0	-3
11001	356	26	46	-201	46	-23	0	20
11010	333	-20	43	155	49	155	-23	20
11011	333	3	43	-23	49	155	-23	43
11100	379	6	49	-46	49	0	-23	0
11101	379	29	49	-224	49	178	-23	23
11110	356	-17	43	132	52	178	-46	23
11111	356	6	43	-46	52	356	-46	46

3.1 Memory Efficient DA Architecture

Table 3 presents the LUT contents considering only term 1 of the expression presented in Eq. (6). The LUT contents are regrouped into two components according to the MSB address bit A$_{4}$. If A$_{4}$ is 0, the LUT outputs are YLa$_{\mathrm{LUT1}}$, YHa$_{\mathrm{LUT1}}$, YLb$_{\mathrm{LUT1}}$, and YHb$_{\mathrm{LUT1}}$. If A$_{4}$ is 1, the outputs are 178 + YLa$_{\mathrm{LUT1}}$, 23 + YHa$_{\mathrm{LUT1}}$, 23 + YLb$_{\mathrm{LUT1}}$, and -178 + YHb$_{\mathrm{LUT1}}$. The address bits A$_{3}$, A$_{2}$, A$_{1}$, and A$_{0}$ are used to access the LUT contents, and the LUT depth is 16.

The DA structure for term 1 of the YLa filter is presented in Fig. 3. Five input registers store the input data, and the LSBs of these registers provide the address bits. The LSB of the top register is address bit A$_{4}$, which is connected to the multiplexer select line. At every clock, the LSBs of the remaining four registers are used to read out the LUT content. The output of the LUT is accumulated according to the combinations of address bits, and the accumulated content is sent to the output summer.

The summer circuit adds the accumulated output to a constant read from the output of a 2:1 multiplexer. For the YLa filter, the constants are 0 and 178, and the final output of the summer is stored in output register Reg1. The DA structure for term 1 of the YHa filter is presented in Fig. 4, and its constants are 0 and 23. The DA structures for the YLb and YHb filters were designed similarly. The DA structure for term 2 of all four filters was designed considering the LUT contents presented in Table 4. For the address bits A$_{3}$, A$_{2}$, A$_{1}$, and A$_{0}$, the LUT contents remain the same for both values of the MSB address bit A$_{4}$, so the LUT size is 16. Inspection of the LUT data shows that the partial products for address bits 1000 to 1111 repeat those for 0000 to 0111 with a constant offset. The contents of Tables 4 and 5 were arrived at by considering the optimum number of entries required for computing the DTCWT partial products and for efficient implementation of the architecture on an FPGA platform.

Considering the redundancy in the LUT content, a further modification was carried out, and the contents of the LUT were reduced as presented in Table 5. The address bits A$_{2}$, A$_{1}$, and A$_{0}$ are used to access the LUT contents, and the address bits A$_{4}$ and A$_{3}$ are used to select a constant that is added to the accumulated output at the summer circuit. The LUT contents for term 2 of the YLa filter are shown in Table 5. Similarly, the LUT contents for YHa, YLb, and YHb can be identified. Fig. 5 presents the reduced memory DA structure. The LSBs of the top two registers, which form address bits A$_{4}$ and A$_{3}$, are not connected to the LUT address bits. The LSBs of the bottom three registers form the address of the LUT.

The depth of the LUT is 8, and the LUT content is read out every clock and accumulated in the accumulator module. The accumulated data are given to the summer, which adds the corresponding constant depending on the status of address bit A$_{3}$. The summer output is stored in the register Reg3. The final output of the filter is generated by summing the outputs of Reg1 and Reg3. Direct implementation of a DTCWT filter using the DA algorithm requires 10 input registers, a LUT of size 1024 x 10, and an 11-bit accumulator. The total propagation delay is T$_{\mathrm{LUT}}$ + T$_{\mathrm{ACC}}$, which represents the delay of LUT data readout and the delay in the accumulator.
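The table reduction can be verified address by address for the YLa filter. The sketch below checks only that a 16-entry table plus a multiplexed constant (term 1) and an 8-entry table plus a constant (term 2) return the same value as the full 32-entry tables for every address. It does not model the accumulator and summer timing, and the function names are illustrative.

```python
# Scaled L1a coefficients split into the two terms of Eq. (6);
# index 0 of each list corresponds to address bit A0.
TERM1 = [0, -23, 23, 178, 178]
TERM2 = [23, -23, 3, 3, 0]


def table(coeffs):
    """Precompute the coefficient sum for every address over `coeffs`."""
    return [sum(c for k, c in enumerate(coeffs) if (a >> k) & 1)
            for a in range(1 << len(coeffs))]


LUT16 = table(TERM1[:4])  # addressed by A3..A0
LUT8 = table(TERM2[:3])   # addressed by A2..A0


def term1_lookup(address):
    """16-entry LUT plus a constant (0 or 178) selected by A4."""
    constant = TERM1[4] if (address >> 4) & 1 else 0
    return LUT16[address & 0xF] + constant


def term2_lookup(address):
    """8-entry LUT plus a constant (0 or 3) selected by A3; A4 contributes 0."""
    constant = TERM2[3] if (address >> 3) & 1 else 0
    return LUT8[address & 0x7] + constant


full1, full2 = table(TERM1), table(TERM2)
term1_ok = all(term1_lookup(a) == full1[a] for a in range(32))
term2_ok = all(term2_lookup(a) == full2[a] for a in range(32))
print(len(LUT16), len(LUT8), term1_ok, term2_ok)
```

The term 2 table can shrink to 8 entries because the last coefficient of the group is zero and the fourth contributes a fixed offset of 3, which matches the constants listed in the Table 5 header.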

With four filters in the first stage, the total number of sub-modules required is presented in Table 6 and compared with the proposed optimized structure. The advantage of the proposed method lies in the memory size of the LUTs. The proposed method requires two LUTs per filter, of size 16 x 10 and 8 x 10 (10 is the bit width of the LUT contents). Compared with the direct method of implementation, the savings in memory size are 97.65% per filter. The numbers of adders and accumulators increase by 3 and 1, respectively, compared with direct implementation. The critical path increases by 2T$_{\mathrm{ADD}}$, and the latency and throughput figures each increase by 2 clock cycles.
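The memory figures can be reproduced with a few lines of arithmetic. This sketch assumes 10-bit LUT entries throughout, as stated above, and the helper name `saving_percent` is illustrative.

```python
ENTRY_BITS = 10  # bit width of each LUT entry, as stated in the text

direct_bits = 1024 * ENTRY_BITS         # one direct-form LUT per filter
split_bits = 2 * 32 * ENTRY_BITS        # two 32-entry LUTs per filter
proposed_bits = (16 + 8) * ENTRY_BITS   # one 16-entry and one 8-entry LUT per filter


def saving_percent(reduced, reference):
    """Percentage of `reference` bits saved by using `reduced` bits."""
    return 100.0 * (reference - reduced) / reference


print(direct_bits, split_bits, proposed_bits)
print(saving_percent(proposed_bits, direct_bits))    # per filter, proposed vs direct
print(saving_percent(split_bits, direct_bits))       # per filter, split vs direct
print(saving_percent(split_bits, 4 * direct_bits))   # 640 bits vs all four direct LUTs
```

The proposed structure needs 240 bits per filter against 10240 bits, a saving of 97.65625%, which the text quotes as 97.65%. The 98.43% figure quoted for the split structure corresponds to the last line, where 640 bits are compared with the 40960 bits of all four direct-form memories.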

Fig. 3. Memory efficient DA structure for YLa term 1 filter.

Fig. 4. Memory efficient DA structure for YHa term 1 filter.

Fig. 5. Reduced memory DA structure for term 2 filter.

Table 3. Modified LUT contents for memory efficient DA (Term 1).

	L¹_a Filter Term 1		H¹_a Filter Term 1
Address bits	A₄=0	A₄=1	A₄=0	A₄=1
A₃ A₂ A₁ A₀	YLa_LUT1	178 + YLa_LUT1	YHa_LUT1	23 + YHa_LUT1
0000	0	0	0	0
0001	0	0	0	0
0010	-23	-23	-3	-3
0011	-23	-23	-3	-3
0100	23	23	3	3
0101	23	23	3	3
0110	0	0	0	0
0111	0	0	0	0
1000	178	178	23	23
1001	178	178	23	23
1010	155	155	20	20
1011	155	155	20	20
1100	201	201	26	26
1101	201	201	26	26
1110	178	178	20	20
1111	178	178	20	20

Table 4. Modified LUT contents for memory efficient DA (Term 2).

	L¹_a Filter Term 2		H¹_a Filter Term 2
Address bits	A₄=0	A₄=1	A₄=0	A₄=1
A₃ A₂ A₁ A₀	YLa_LUT2	YLa_LUT2	YHa_LUT2	YHa_LUT2
0000	0	0	0	0
0001	23	23	-178	-178
0010	-23	-23	178	178
0011	0	0	0	0
0100	3	3	-23	-23
0101	26	26	-201	-201
0110	-20	-20	155	155
0111	3	3	-23	-23
1000	3	3	-23	-23
1001	26	26	-201	-201
1010	-20	-20	155	155
1011	3	3	-23	-23
1100	6	6	-46	-46
1101	29	29	-224	-224
1110	-17	-17	132	132
1111	6	6	-46	-46

Table 5. Reduced memory DA contents for term 2 filters.

	L¹_a Filter Term 2		L¹_a Filter Term 2
Address bits	A₄ A₃=00	A₄ A₃=01	A₄ A₃=10	A₄ A₃=11
A₂ A₁ A₀	YLa_LUT2	3 + YLa_LUT2	YLa_LUT2	3 + YLa_LUT2
000	0	0	0	0
001	23	23	23	23
010	-23	-23	-23	-23
011	0	0	0	0
100	3	3	3	3
101	26	26	26	26
110	-20	-20	-20	-20
111	3	3	3	3

Table 6. Comparison of DA methods.

Implementation method	Filter	LUT Size	No. of adders	No. of accumulators	Critical path (T)	Latency (clocks)	Throughput (clocks)
Direct DA	YLa	1024	0	1	T_LUT +T_ACC	21	12
	YHa	1024	0	1		21	12
	YLb	1024	0	1		21	12
	YHb	1024	0	1		21	12
Split DA	YLa	32	1	2	T_LUT +T_ACC + T_ADD	22	13
	YHa	32	1	2		22	13
	YLb	32	1	2		22	13
	YHb	32	1	2		22	13
Proposed DA	YLa	24	3	2	T_LUT +T_ACC + 2T_ADD	23	14
	YHa	24	3	2		23	14
	YLb	24	3	2		23	14
	YHb	24	3	2		23	14

4. FPGA Implementation

The top level module for DTCWT computation comprises four filters: YLa, YHa, YLb, and YHb. They are organized as two pairs of filter banks denoted as tree "a" and tree "b". The tree "a" filter bank consists of YLa and YHa, and the tree "b" filter bank consists of YLb and YHb. Each of these filter banks is modeled using Verilog HDL. The behavioral model for term 1 and term 2 of each filter was developed and verified for its functionality. The verified HDL model was integrated into a higher hierarchical model to form the tree "a" and tree "b" filter banks.

The input data are loaded into the 10 registers connected to the filter banks. The output of each filter is stored in an output register and transferred to a memory unit for further processing. To verify the functionality of the DTCWT unit, an impulse data sample was applied, and the output of each filter was observed to match its filter coefficients. In addition, 8-bit input data generated using random number logic were processed by the DTCWT model, and the outputs were verified against MATLAB results.
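The impulse test relies on a basic property of FIR filters: a unit impulse at the input reproduces the coefficient sequence at the output. A minimal software sketch of that check, using the scaled $L^{1}_{a}$ coefficients from Table 1 and an illustrative `fir` helper (not the HDL test bench), is:

```python
# Scaled L1a coefficients (integer column of Table 1).
L1A = [0, -23, 23, 178, 178, 23, -23, 3, 3, 0]


def fir(samples, coeffs):
    """Direct-form FIR filter: y[n] = sum over k of coeffs[k] * samples[n - k]."""
    output = []
    for n in range(len(samples)):
        acc = 0
        for k, c in enumerate(coeffs):
            if n - k >= 0:
                acc += c * samples[n - k]
        output.append(acc)
    return output


impulse = [1] + [0] * (len(L1A) - 1)
response = fir(impulse, L1A)
print(response == L1A)
```

Any mismatch between the response and the coefficient list would point to a wrong LUT entry or a wrong address-bit ordering, which makes the impulse a convenient first test before random data are applied.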

The functionally verified HDL code was synthesized, and a netlist was generated in Xilinx ISE. Implementation of 2D DTCWT was carried out by cascading the stage 1 DTCWT unit with the stage 2 DTCWT unit. In stage 1, there are two filter banks, and the outputs generated are stored in intermediate memory. In the second stage or the column processing unit, there are four filter banks. Each of these filter banks processes the outputs of stage 1 to generate 8 outputs.

The top level module of the 2D DTCWT is presented in Fig. 6. A Verilog HDL model was developed for the 2D DTCWT structure and verified for its functionality. The hardware resources reported by synthesis were used to identify the total resources occupied and to plan further optimization.

The resources estimated from the synthesis report were equivalent to the actual resources required for implementation. Table 7 compares the hardware resources of the 1D-DTCWT for the two methods presented in this work: the split DA method and the reduced memory DA method. The operating frequency of the proposed method is 382 MHz, which is close to the operating frequency of the split DA method.

The proposed architecture uses 37% fewer slice LUTs than the split DA method. Table 8 presents the FPGA implementation results of the 2D-DTCWT structure proposed in this work. The proposed structure requires 3564 slice registers and 3912 slice LUTs for implementation, which are 52.80% and 49.07% fewer, respectively, than in the split DA architecture. The total power dissipation was estimated at 1.01 W, and the operating frequency was 8.7% lower than that of the split DA structure. Table 9 compares the hardware metrics of the proposed architecture with existing work.
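The percentages quoted above follow directly from Table 8. The sketch below recomputes them from the tabulated values; the dictionary layout and the helper name `percent_less` are illustrative.

```python
# Values copied from Table 8 (2D DTCWT): (proposed method, split DA method).
TABLE8 = {
    "slice_registers": (3564, 7552),
    "slice_luts": (3912, 7682),
    "max_frequency_mhz": (312, 342),
    "power_w": (1.01, 2.04),
}


def percent_less(proposed, baseline):
    """Percentage by which `proposed` is below `baseline`."""
    return 100.0 * (baseline - proposed) / baseline


reductions = {name: percent_less(*values) for name, values in TABLE8.items()}
for name, value in reductions.items():
    print(f"{name}: {value:.2f}% lower than split DA")
```

The register and LUT reductions come out just above 52.80% and 49.07%, and the frequency penalty just above 8.7%, so the figures in the text appear to be truncated rather than rounded.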

The hybrid DA architecture in [12] optimizes area utilization on CLBs, and its DTCWT structure was implemented on a Virtex-5 FPGA for four filters. The algorithm proposed in [12] was modeled and extended for DTCWT computation of a 256 x 256 image. The results were compared with the proposed SAA architecture based on multiplexed DA logic. In the systolic array architecture [11], a pipelined method is used for DTCWT implementation.

Compared with the hybrid structure and the systolic array structure, the proposed method achieves the best operating frequency and area utilization. From the results obtained, the proposed architecture consumes 49.55% less power than existing methods for DTCWT implementation. On the Virtex-5 platform, the power dissipation is 9% lower and the operating frequency is 31.8% higher than in the existing DTCWT structures.

Fig. 6. Top level block diagram of 2D DTCWT.

Table 7. Implementation results of 1D DTCWT.

Slice Logic utilization	Proposed method	Split DA method
Number of Slice Registers	1204	2235
Number of Slice LUTs	1334	2123
Max. freq. of operation	382 MHz	395 MHz
Total power dissipation (W)	0.87	1.1

Table 8. Implementation results of 2D DTCWT.

FPGA utilization	Proposed method	Split DA method
Number of Slice Registers	3564	7552
Number of Slice LUTs	3912	7682
Max. freq. of operation	312 MHz	342 MHz
Total power dissipation (W)	1.01	2.04

Table 9. Comparison of DTCWT implementations.

FPGA Utilization	Proposed method	Ref. [12]	Ref. [13]	Ref. [14]
Number of Slice Registers	3564	4112	7482	8799
Number of Slice LUTs	3912	4091	7224	8624
Total power (W)	1.01	1.71111	2.00207	2.192
Max. frequency (MHz)	312	289.12	212.67	222.46

5. Conclusion

A DA algorithm was selected for implementing the 2D DTCWT architecture on an FPGA. The DA algorithm was modified by considering the DTCWT filters, and improved methods were designed and implemented on an FPGA. The performance metrics of the DTCWT architectures developed on an FPGA were compared in terms of area, timing, and power. The proposed high-speed DA architecture is recommended in terms of throughput, latency, and area utilization.

ACKNOWLEDGMENTS

This research was carried out at Reva University, Bangalore, and Cambridge Institute of Technology, Bangalore, and we acknowledge them. We also acknowledge MATLAB for providing access to the libraries.

REFERENCES

R. T. Furbank, M. Tester, “Phenomics - Technologies to relieve the phenotyping bottleneck,” Trends in Plant Science, vol. 16, No. 12, pp. 635-644, 2011.

S. Paulus, J. Behmann, A.-K. Mahlein, L. Plumer, H. Kuhlmann, “Low-cost 3D systems: Suitable tools for plant phenotyping,” Sensors, vol. 14, No. 2, pp. 3001-3018, 2014.

J. L. Araus, J. E. Cairns, “Field high-throughput phenotyping: the new crop breeding frontier,” Trends in Plant Science, vol. 9, No. 1, pp. 52-61, 2013.

Massimo Minervini, Hanno Scharr, Sotirios Tsaftaris, “The Significance of Image Compression in Plant Phenotyping Applications,” Functional Plant Biology, vol. 42, No. 10, pp. 971-988, 2015.

Massimo Minervini, “Application-Aware Image Compression and Sensing Platform for Plant Phenotyping,” Computer Science and Engineering, IMT School for Advanced Studies, Italy, PhD Thesis, 2015.

A. Skodras, C. Christopoulos, T. Ebrahimi, “The JPEG 2000 still image compression standard,” IEEE Signal Processing Magazine, vol. 18, No. 5, pp. 36-58, 2001.

N. Kingsbury, “The Dual-Tree Complex Wavelet Transform: A New Efficient Tool for Image Restoration and Enhancement,” Proceedings of 9th European Signal Processing Conference (EUSIPCO 1998), Rhodes, Greece, pp. 1-4, 8-11 Sept. 1998.

I. W. Selesnick, R. G. Baraniuk, “The Dual-Tree Complex Wavelet Transform,” IEEE Signal Processing Magazine, vol. 22, No. 6, pp. 123-151, 2004.

Joseph B. Boettcher, James E. Fowler, “Video Coding Using a Complex Wavelet Transform and Set Partitioning,” IEEE Signal Processing Letters, vol. 14, pp. 633-636, 2007.

Li Hui Fang, Miao Guo Feng, Xu Hou Jie, “Images Compression Using Dual Tree Complex Wavelet Transform,” in Proceedings of the 2010 International Conference of Information Science and Management Engineering, IEEE Computer Society, USA, vol. 1, pp. 559-56, 2010.

H. Naimi, A. B. H. Adamou-Mitiche, L. Mitiche, “Medical image denoising using dual tree complex thresholding wavelet transform and Wiener filter,” Journal of King Saud University-Computer and Information Sciences, vol. 27, No. 1, pp. 40-45, 2015.

S. S. Divakara, Sudarshan Patilkulkarni, Cyril Prasanna Raj, “High Speed Area Optimized Hybrid DA Architecture for 2D-DTCWT,” International Journal of Image and Graphics, vol. 18, No. 1, 2018.

Poornima B, Sumathi A, Cyril Prasanna Raj Premkumar, “Memory efficient high speed systolic array architecture design with multiplexed distributive arithmetic for 2D DTCWT computation on FPGA,” Journal of Microelectronics, Electronic Components and Materials, vol. 49, No. 3, pp. 119-132, 2019.

S. S. Divakara, Sudarshan Patilkulkarni, Cyril Prasanna Raj, “High Speed Modular Systolic array based DTCWT with Parallel Processing Architecture for 2D Image Transformation on FPGA,” International Journal of Wavelets, Multiresolution and Information Processing, vol. 15, No. 05, 1750047, 2017.

Author

Yashavanthakumar T. R.

Yashavanthakumar T. R. obtained his B.E. and M.Tech. degrees from S.J.C.E, Mysore. He has 5 years of industrial experience and 10 years of teaching experience. He has worked at multinational companies such as IBM. He joined the government sector in 2010 and served as HOD of the Department of ECE at Government Engineering College, Karwar. He is currently working as an assistant professor in the Department of ECE at Government Engineering College, Haveri. He is pursuing his PhD at Reva University, Bangalore.

S. L. Pinjare

S. L. Pinjare received his PhD from the Indian Institute of Technology, Chennai, in 1981 and has more than 40 years of experience in industry, academia, and research. He worked as a professor in the Department of ECE, NMIT, Bangalore, for 15 years and has also worked at Reva University. He has worked in the VLSI department of ITI Limited, Bangalore. He is currently supervising PhD scholars at Reva University, Bangalore, in the areas of signal processing, image processing, VLSI design, MEMS technology, and FPGA design. He has more than 40 publications in his research areas and has filed four patents. He is an IEEE senior member.

Cyril Prasanna Raj P.

Cyril Prasanna Raj P. is working as a Director at CCCIR, Cambridge Institute of Technology, Bangalore. Prior to this, he was at MS Engineering College, Bangalore, as a Research Dean for 10 years, and at MS Ramaiah University of Applied Sciences, Bangalore, as HOD of the ECE Department for 12 years. He has more than 25 years of experience in teaching and research. His DWT architectures for multicore platforms and a novel DWT architecture for image compression have been awarded 4 US patents. He also has 40 patents and has commercialized three products. He has more than 110 journal publications, has authored 14 books, and is supervising 8 research scholars under VTU.
