IEIESPC(IEIE Transactions on Smart Processing and Computing)

IEIESPC Vol. 15, No. 3, p.452-465

ISSN (online): 2287-5255

Received: 6 January 2025; Revised: 25 February 2025; Accepted: 29 April 2025

DOI: 10.5573/IEIESPC.2026.15.3.452

Regular Paper

High Speed Accelerators Hardware Implementation for Fully Connected Neural Network Model Using 3D Systolic Array Architecture

(Pottipati Dileep Kumar Reddy) ^1,^* (Kota Venakata Ramanaih) ²

(Department of Electronics and Communication Engineering, Research Scholar, Yogi Vemana University/YSR Engineering College of YVU, Proddatur, Andhra Pradesh, India. dkr.pottipati@gmail.com)
(Department of Electronics and Communication Engineering, Dean Faculty of Engineering, Yogi Vemana University/YSR Engineering College of YVU, Proddatur, Andhra Pradesh, India. ramanaiahkota@gmail.com)

^*Corresponding Author : Pottipati Dileep Kumar Reddy, dkr.pottipati@gmail.com

License :

This is an Open-Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.(www.theieie.org).

Abstract

A Convolutional Neural Network (CNN) is a primary building block for image processing applications, with subsystems such as the Convolution Layer (CL), Max Pooling Layer (MPL) and Fully Connected Neural Network (FCNN) layer. To address the computation complexity of the FCNN model in terms of processing speed, its hardware implementation on FPGA needs to be assessed and optimized. In this study, a systolic array algorithm-based 3D structure is developed to implement the FCNN model. The 3D structure processes multiple frames of input data with three filters to simultaneously generate the FCNN output using a multistage FCNN model. The processing elements that form the primary building block of the systolic array model are designed using basic arithmetic elements and a control circuit for data synchronization. A Verilog HDL model is developed for the proposed design along with a test bench to verify the functionality, and the 3D structure with pipelined logic is implemented on a Virtex-5 FPGA. From the synthesis report, the operating frequency is estimated to be 277 MHz, which is 27% faster than direct implementation; power dissipation increases by 6% as a tradeoff for computation speed. The 3D CNN structure is suitable for high-speed image processing applications.

Keywords

CNN, FCNN, MPL, Systolic array

1. Introduction

Convolutional Neural Network (CNN) models are trained with large data sets to learn the attributes of the input image and generate the required output as per the training process, and hence have the potential to perform image processing applications such as object detection, classification and pattern recognition with superior performance. CNN models have predominantly been executed in the cloud, as they require high computation power and large memory ^[1]. For emerging applications such as robotics, drones, automotive and medical devices deployed for edge computing, CNN models need to be implemented on hardware platforms. Hardware platforms considered for CNN model implementation are designed to optimize area resources and power dissipation and to have minimum processing delay. Design and development of CNN models on hardware platforms drives a large number of time-critical applications. A CNN model comprises three major sub blocks: convolution operation, maximum pooling and fully connected neural network. CNN models such as SqueezeNet ^[2], Xception ^[3], MobileNet ^[4], VGG-16 ^[5] and AlexNet ^[6] are a few of the most popular models that have been extensively used for image processing applications. The convolutional and fully connected subsystems in a CNN model require multiplication and addition operations. The data movement in these subsystems is handled by control modules that synchronize data movement involving memory modules. The CNN model is trained, and the trained model is used for inference, which involves millions of arithmetic operations, of which 90% are carried out in the convolutional layer ^[7]. In order to reduce latency and improve processing speed, parallel structures are required to implement the subsystems of the CNN model ^[8] ^[9].

Parallelism in CPU and GPU platforms is achieved through the use of multilevel caches, single instruction multiple data, multiple threads, multiple ALUs, and shared controllers and memories. The convolutional and fully connected modules of a CNN are mapped to matrix multiplication processes on CPU and GPU platforms. In a CNN model, data processing requires MPL logic operations, storage and complex control operations for implementing the convolutional and fully connected layers. A GPU has a large number of cores that are efficient at performing parallel operations and is suitable for CNN model implementation. However, the power requirement and area occupancy for subsystem implementation are high in GPUs. Sankaradas et al. have developed coprocessor modules for CNN subsystem implementation on an FPGA platform. In their work, 20-bit and 16-bit fixed point number systems are used for weight matrix representation and feature representation, respectively ^[10]. Gokhale et al. have demonstrated a CNN coprocessor implementation for mobile embedded applications with a processing speed of 200 GOP/s on a Xilinx platform ^[11]. Eyeriss is an efficient reconfigurable module for CNN implementation supporting different sizes of feature maps, convolutional kernel sizes and data reuse ^[12]. Changnei Qiu et al. ^[13] have developed a CNN coprocessor with a 1D processor for the convolutional row operation and a 3D processor for the overall computation with a pulsating array structure. The VGG16 CNN logic based convolutional and fully connected subsystem is implemented on FPGA using a 16-bit fixed point number system, with an operation speed of 316 GOP/s at an operating frequency of 200 MHz and power dissipation limited to less than 9.25 W. Tomyslav Sledevic et al. ^[14] have developed a CNN accelerator with four convolutional layers and a pre-trained CNN model, implemented on FPGA using a 16-bit fixed point number system. The developed model is observed to process data in less than 8.8 ms for classification of an image of size 512 x 128.
Xu Yang et al. ^[15] propose a deep learning accelerator for FPGA implementation with a parallelization model optimizing power and area. The model is capable of processing 14 frames per second with 5 GOP/s and is demonstrated to be 28% higher than DSP operating frequency. Zang et al. ^[16] have developed a deep learning architecture with adder logic based on a bespoke method. The proposed architecture is implemented on FPGA, improving computing efficiency by considering a shorter bit-width for data representation and a post quantization process. Sui X et al. ^[17] have proposed a hardware efficient and accurate CNN model with pruning logic, which has been implemented on a ZYNQ FPGA. The resource utilization on FPGA for CNN implementation is optimized and is demonstrated to be superior to DSP modules. From the literature it is observed that a CNN model comprising a convolutional layer and a fully connected layer is computationally intensive for hardware implementation. In order to reduce computation complexity, the convolutional layer is implemented with optimization methods such as setting bit widths for data representation, pruning logic, pipelining and parallelism. The fully connected layer processes the data after rearrangement and will need to process 256 data inputs with a minimum of three stages of neurons. Thus the complexity of the fully connected network will be on par with the convolution layer complexity ^[1]. Reducing the computation complexity of the fully connected network will achieve a 50% reduction in the computation metrics of the CNN in terms of area, timing and power ^[1]. The trained CNN model will use predefined masks in the convolutional layer, and the fully connected layer will require optimum trained weights for data processing. Design of a customized architecture considering predefined masks and trained weights will further reduce computation complexity. In this paper, new methods are developed considering the correlation between weights and convolutional weights.
The developed model is HDL coded and verified for logic correctness. The CNN model developed is implemented on FPGA, optimizing area, timing and power. A detailed discussion of the hardware architecture design is presented, with a comparison against existing models.

2. Convolutional Neural Network

A convolutional neural network consists of a Convolutional Layer (CL), Maximum Pooling Layer (MPL) and Fully Connected Neural Network Layer (FCNNL). The CL and MPL modules are repeated in sequence N times (N depends on the input image size and is decided considering the complexity of implementation; N is usually set to 3). CL and MPL together reduce the dimensionality of the input image and also capture its significant features at different levels: low level, mid level and high level. Fig. 1 presents the CNN structure for image processing applications. The input image is processed by the CL and MPL layers, reducing dimensionality, and the final 2D output of the MPL layer is flattened to 1D. The flattened 1D data is processed by the trained FCNNL model to generate the output.

Fig. 1. CNN structure.

The convolutional layer uses filters to extract features from the input. The number of features extracted depends on the number of filters used in the convolutional layer. If three feature maps are to be extracted from the input in the first stage, three filters are used and the CL model generates three frames from the input image representing the feature maps. The CL module is computationally intensive and occupies more resources. The CNN model processes an input of size $h \times w \times c_{in}$ using a $k \times k \times c_{in} \times c_{out}$ kernel to generate output data of size $h \times w \times c_{out}$. The parameters $h$ and $w$ are the height and width of the input data, $k$ is the size of the kernel, $c_{in}$ is the number of input channels or frames, and $c_{out}$ is the number of output frames. Eq. (1) represents the convolution operation,

(1)

$ G(y, x, j) = \sum_{u=1}^k \sum_{v=1}^k k(u, v, j) \times l(y+u-1, x+v-1, j). $

The parameter $k$ is the convolutional kernel of size $k \times k \times c_{in}$. The computation complexity in CNN is measured by considering the number of multiplication and accumulation operations (MAC) and is expressed as in Eq. (2).

(2)

$ C_{ds} = k^2 \cdot c_{in} \cdot h \cdot w + c_{in} \cdot c_{out} \cdot h \cdot w. $
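As a quick check of Eq. (2), the MAC count can be evaluated for concrete dimensions. The sketch below is only an illustration; the function name `mac_count` and the example dimensions (a $6 \times 6$ single-channel input, $3 \times 3$ kernel, three output frames) are assumptions chosen for the example, not values reported in this work.

```python
def mac_count(k: int, c_in: int, c_out: int, h: int, w: int) -> int:
    """Evaluate Eq. (2): C_ds = k^2*c_in*h*w + c_in*c_out*h*w."""
    first_term = k * k * c_in * h * w      # k^2 * c_in * h * w
    second_term = c_in * c_out * h * w     # c_in * c_out * h * w
    return first_term + second_term


# Hypothetical example: 3 x 3 kernel, 1 input frame, 3 output frames, 6 x 6 data.
example = mac_count(k=3, c_in=1, c_out=3, h=6, w=6)
print(example)
```

Because both terms scale with $h \cdot w$, the count grows linearly with the number of pixels for a fixed kernel size and channel configuration.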

Fig. 2 presents the demonstration of convolution operation of a $6 \times 6$ image processed to generate three feature maps using filters of size $3 \times 3$.

Fig. 2. Convolution operation.
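The sliding-window multiply-accumulate of Eq. (1) can be sketched in a few lines of plain Python. This is a minimal illustration assuming stride 1 and no padding (so a $6 \times 6$ input gives $4 \times 4$ feature maps); the pixel values and the three filter masks are hypothetical and are not the filters used in the proposed design.

```python
def conv2d_valid(image, kernel):
    """Slide a k x k kernel over a 2D image (stride 1, no padding)."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    out = [[0] * out_w for _ in range(out_h)]
    for y in range(out_h):
        for x in range(out_w):
            acc = 0
            for u in range(k):
                for v in range(k):
                    acc += kernel[u][v] * image[y + u][x + v]  # multiply-accumulate
            out[y][x] = acc
    return out


# Hypothetical 6 x 6 input: pixel value = 6 * row + column.
image = [[6 * r + c for c in range(6)] for r in range(6)]

# Three hypothetical 3 x 3 filters, one per extracted feature map.
filters = [
    [[1, 0, -1], [1, 0, -1], [1, 0, -1]],    # column-difference mask
    [[1, 1, 1], [0, 0, 0], [-1, -1, -1]],    # row-difference mask
    [[0, 0, 0], [0, 1, 0], [0, 0, 0]],       # identity mask
]
feature_maps = [conv2d_valid(image, f) for f in filters]
```

Each filter yields one frame, so three filters produce three feature maps, matching the arrangement shown in Fig. 2.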

The three filters of size $3 \times 3$ process the input data using the convolution operation and generate three frames that capture three features as per the predefined filter property. The filters slide across the input data, and multiply-accumulate operations are carried out to compute the feature map. In the MPL module the feature map output of the convolution operation is reduced by identifying the pixels with maximum intensity and discarding all other pixels. Fig. 3 presents the structure of the FCNN layer. Fig. 4 presents the single neuron structure and its mathematical model. The inputs are represented as $x_n$, the weights as $w_{kn}$, the bias as $b_k$, and the network activation function as $\phi(\cdot)$. The variable $n$ is the size of the input vector and $k$ is the neuron number. The output of the max pooling layer is flattened into a 1D structure and is processed by the FCNN layer.

Fig. 3. Structure of fully connected neural network layer.

Fig. 4. Single neuron structure.
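The max pooling, flattening and single-neuron steps described above can be sketched as follows. This is an illustrative example under stated assumptions: a $2 \times 2$ non-overlapping pooling window, `math.tanh` as the activation $\phi(\cdot)$ (tansig is mathematically equivalent to tanh), and hypothetical feature map values, weights and bias.

```python
import math


def max_pool_2x2(fmap):
    """Keep the maximum of each non-overlapping 2 x 2 block."""
    return [
        [max(fmap[y][x], fmap[y][x + 1], fmap[y + 1][x], fmap[y + 1][x + 1])
         for x in range(0, len(fmap[0]) - 1, 2)]
        for y in range(0, len(fmap) - 1, 2)
    ]


def flatten(fmap):
    """Convert a 2D feature map into a 1D vector, row by row."""
    return [value for row in fmap for value in row]


def neuron(x, w, b, phi=math.tanh):
    """Single neuron: phi(sum_n x_n * w_n + b)."""
    return phi(sum(xi * wi for xi, wi in zip(x, w)) + b)


# Hypothetical 4 x 4 feature map.
fmap = [[1, 3, 2, 0],
        [4, 2, 1, 5],
        [0, 1, 7, 2],
        [3, 2, 1, 1]]
pooled = max_pool_2x2(fmap)   # 2 x 2 map of block maxima
vector = flatten(pooled)      # 1D input for the fully connected layer
y = neuron(vector, w=[0.1, -0.2, 0.05, 0.0], b=0.5)
```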

The CNN model must first be trained. The generic block diagram of the CNN module is presented in Fig. 5. It comprises a CNN model that is trained, with the trained weights and the data set stored in the computer. During inference, the trained weights for the CNN are loaded and the input data from the camera is loaded into the CNN module for processing. The output generated is stored in external memory. Fig. 6 presents the accelerator architecture for the CNN model, which comprises weight memory, input memory, DMA and FSM modules. The convolutional layer with max pooling is the core of the accelerator. The off-chip memory and CPU are used for data control and input data storage.

Fig. 5. Training and inference of CNN module.

Fig. 6. CNN accelerator architecture.

3. Design of Proposed CNN Architecture

As discussed in the previous section, the CNN architecture uses a neural network structure as the fully connected layer, which processes the input data or the reordered data into a suitable output. The CNN model comprises a convolutional layer, a maximum pooling layer and a fully connected neural network layer, the last of which is the final stage of the CNN model. In this work the neural network architecture is designed to have an input layer of $22 \times 1$ vectors. Fig. 7 presents the fully connected layer structure. The hidden layer 1 outputs are denoted as $\{a_1^1, a_2^1, a_3^1, ..., a_{s1}^1\}$. The intermediate output of each neuron of the intermediate layer is denoted as $\{n_1^1, n_2^1, n_3^1, ..., n_{s1}^1\}$. Similarly, the intermediate output and the output of the second layer are denoted as $n_2^{s2}$ and $a_2^{s2}$. The network activation function for all the hidden layers and the output layer is selected to be tansig or purelin. The intermediate output ($n_m^1$) of hidden layer 1 is mathematically represented as in Eq. (3), and the output of the hidden layer is denoted as in Eq. (4).

(3)

$ n_m^1 = \sum_{l=1}^{22} E_l w_{m,l}^1 + b_m^1, $

(4)

$ a_m^1 = \text{tansig}(n_m^1),~m = 1,~2,~3,~4,~...,~15. $

Fig. 7. Fully connected neural network model for CNN.

Table 1. Size of frames for the proposed CNN model.

Layer	Size	Frames
Input	$N \times N$	1
CL1	$(N - 1) \times (N - 1)$	4
MP1	$[(N - 1)/2] \times [(N - 1)/2]$	4
CL2	$[(N - 3)/2] \times [(N - 3)/2]$	16
MP2	$[(N - 3)/4] \times [(N - 3)/4]$	16
CL3	$[(N - 7)/4] \times [(N - 7)/4]$	$16 \times 4$
MP3	$[(N - 7)/8] \times [(N - 7)/8]$	$16 \times 4$
OL1	$[(N - 7)/8] \times [(N - 7)/8]$	$8 \times 4$
OL2	$[(N - 7)/8] \times [(N - 7)/8]$	$8 \times 2$
FCNN	1D Network

The weights are denoted by $w_{m,l}$ and the bias is denoted as $b_m^1$. For the first hidden layer, if the number of neurons is set to $N$, the number of weights is $22N$, as every neuron has 22 weights, one for each of the 22 energy levels. The hidden layer has one bias per neuron. Similarly, all other hidden layer outputs are denoted as $n_k^p$, where $k$ represents the number of neurons in hidden layer $p$. Eqs. (5) and (6) present the intermediate output and final output of hidden layer 2.

(5)

$ n_k^2 = \sum_{l=1}^{15} a_l^1 w_{k,l}^2 + b_k^2, $

(6)

$ a_k^2 = \text{tansig}(n_k^2),~\text{where}~k = 1,~2,~3,~...,~40. $
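The two hidden layers described by Eqs. (3) to (6) can be sketched as a pair of dense layers with a tansig activation. The snippet below is a minimal illustration, assuming 22 inputs, 15 neurons in hidden layer 1 and 40 neurons in hidden layer 2 as read from the summation limits and index ranges; the weights, biases and inputs are random placeholders, not trained values, and `math.tanh` stands in for tansig.

```python
import math
import random


def dense_tansig(inputs, weights, biases):
    """Per neuron k: n_k = sum_l inputs[l] * weights[k][l] + biases[k]; a_k = tansig(n_k)."""
    return [
        math.tanh(sum(x * w for x, w in zip(inputs, row)) + b)
        for row, b in zip(weights, biases)
    ]


rng = random.Random(0)  # fixed seed so the placeholder values are repeatable


def rand_matrix(rows, cols):
    """Hypothetical weight matrix with entries in [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


E = [rng.uniform(0.0, 1.0) for _ in range(22)]   # placeholder 22 x 1 input vector
W1, b1 = rand_matrix(15, 22), [0.0] * 15         # hidden layer 1: 15 neurons, 22 weights each
W2, b2 = rand_matrix(40, 15), [0.0] * 40         # hidden layer 2: 40 neurons, 15 weights each

a1 = dense_tansig(E, W1, b1)    # Eqs. (3) and (4)
a2 = dense_tansig(a1, W2, b2)   # Eqs. (5) and (6)
```

With these sizes, hidden layer 1 holds $15 \times 22 = 330$ weights and hidden layer 2 holds $40 \times 15 = 600$ weights.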

In the proposed architecture the input data is processed by three stages of the CL + MPL module and one Fully Connected Neural Network (FCNN) layer. The FCNN model in this work comprises two hidden layers. The CL module has four filters to extract features in all directions of orthogonality. The proposed architecture supports processing of high-definition input images of size $2048 \times 2048$ pixels. The first stage generates a $1024 \times 1024$ image, which is down sampled by 2 to obtain $512 \times 512$ without loss of data. The second and third stages together process the output of the down sampling module to generate a $32 \times 32$ image. The flattening module converts the 2D ($32 \times 32$) data to a 1D vector of size ($1024 \times 1$). The FCNN layer processes the 1024 vectors using four stages of the neural network module to generate M outputs.

An input image of size $N \times N$ is processed by three stages of convolutional layer and max pooling layer. Every convolutional stage reduces the image size from $N \times N$ to $(N - 1) \times (N - 1)$. The output of the convolution layer is processed by the maximum pooling layer to generate an output of $(N - 1)/2 \times (N - 1)/2$. In this process an input image of size $N \times N$ is used to generate four frames of feature map of size $(N - 1)/2 \times (N - 1)/2$. In the second stage of CL+MPL, the four-frame input of size $(N - 1)/2 \times (N - 1)/2$ is processed by the stage 2 module to generate four Groups of Frames (GOPs), each of four frames of size $[(N - 3)/4 \times (N - 3)/4]$. Further, the third stage of CL+MPL processes the 16 frames to generate 64 frames; these 64 frames are grouped into four groups, each of 16 frames with a frame size of $(N - 7)/8 \times (N - 7)/8$. Fig. 8 illustrates the representation of GOPs after third stage processing. A similar structure is used in the first and second stages of the CNN model. Based on this discussion, the next section presents the systolic array architecture for processing GOPs.

Fig. 8. 3rd stage of proposed CNN architecture generating 4 groups each of 16 frames with size [(N - 7)/8 x (N - 7)/8].
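The frame sizes in Table 1 follow a simple recurrence: each CL stage reduces the edge length by one and multiplies the frame count by four, and each MPL stage halves the edge length. The sketch below reproduces the CL and MP rows of Table 1 under that reading; the function name `frame_sizes` is hypothetical, and exact fractions are used because the table expressions are not integers for every $N$.

```python
from fractions import Fraction


def frame_sizes(n: int):
    """Return (layer, edge length, frame count) for the Input, CL and MP rows of Table 1."""
    size = Fraction(n)
    frames = 1
    rows = [("Input", size, frames)]
    for stage in (1, 2, 3):
        size -= 1        # CL: edge length reduced by one
        frames *= 4      # four filters per convolutional stage
        rows.append((f"CL{stage}", size, frames))
        size /= 2        # MPL: edge length halved
        rows.append((f"MP{stage}", size, frames))
    return rows


for layer, size, frames in frame_sizes(15):
    print(layer, size, frames)
```

For example, with $N = 15$ the third pooling stage gives $(N - 7)/8 = 1$, with 64 frames.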

4. Systolic Array Architecture for FCNN

The systolic array architecture design for the first stage FCNN comprises weight filters whose coefficients, represented as $\{W_{0a}, W_{0b}, W_{1a}, W_{1b}\}$, are arranged in matrix form as in Eq. (7); the input data considered for processing is expressed as in Eq. (8),

(7)

$ H = \begin{bmatrix} w_{0a}^0 & w_{0a}^1 & w_{0a}^2 & w_{0a}^3 & w_{0a}^4 & w_{0a}^5 & w_{0a}^6 & w_{0a}^7 & w_{0a}^8 \\ w_{0b}^0 & w_{0b}^1 & w_{0b}^2 & w_{0b}^3 & w_{0b}^4 & w_{0b}^5 & w_{0b}^6 & w_{0b}^7 & w_{0b}^8 \\ w_{1a}^0 & w_{1a}^1 & w_{1a}^2 & w_{1a}^3 & w_{1a}^4 & w_{1a}^5 & w_{1a}^6 & w_{1a}^7 & w_{1a}^8 \\ w_{1b}^0 & w_{1b}^1 & w_{1b}^2 & w_{1b}^3 & w_{1b}^4 & w_{1b}^5 & w_{1b}^6 & w_{1b}^7 & w_{1b}^8 \end{bmatrix}, $

(8)

$ X = \begin{bmatrix} X_0^0 & X_0^1 & X_0^2 & X_0^3 & X_0^4 & X_0^5 & X_0^6 & X_0^7 & X_0^8 & X_0^9 & X_0^{10} & X_0^{11} \\ X_1^0 & X_1^1 & X_1^2 & X_1^3 & X_1^4 & X_1^5 & X_1^6 & X_1^7 & X_1^8 & X_1^9 & X_1^{10} & X_1^{11} \\ X_2^0 & X_2^1 & X_2^2 & X_2^3 & X_2^4 & X_2^5 & X_2^6 & X_2^7 & X_2^8 & X_2^9 & X_2^{10} & X_2^{11} \\ X_3^0 & X_3^1 & X_3^2 & X_3^3 & X_3^4 & X_3^5 & X_3^6 & X_3^7 & X_3^8 & X_3^9 & X_3^{10} & X_3^{11} \\ X_4^0 & X_4^1 & X_4^2 & X_4^3 & X_4^4 & X_4^5 & X_4^6 & X_4^7 & X_4^8 & X_4^9 & X_4^{10} & X_4^{11} \\ X_5^0 & X_5^1 & X_5^2 & X_5^3 & X_5^4 & X_5^5 & X_5^6 & X_5^7 & X_5^8 & X_5^9 & X_5^{10} & X_5^{11} \\ X_6^0 & X_6^1 & X_6^2 & X_6^3 & X_6^4 & X_6^5 & X_6^6 & X_6^7 & X_6^8 & X_6^9 & X_6^{10} & X_6^{11} \\ X_7^0 & X_7^1 & X_7^2 & X_7^3 & X_7^4 & X_7^5 & X_7^6 & X_7^7 & X_7^8 & X_7^9 & X_7^{10} & X_7^{11} \\ X_8^0 & X_8^1 & X_8^2 & X_8^3 & X_8^4 & X_8^5 & X_8^6 & X_8^7 & X_8^8 & X_8^9 & X_8^{10} & X_8^{11} \\ X_9^0 & X_9^1 & X_9^2 & X_9^3 & X_9^4 & X_9^5 & X_9^6 & X_9^7 & X_9^8 & X_9^9 & X_9^{10} & X_9^{11} \end{bmatrix}. $

Considering an input image ($X$) of size $10 \times 10$, the systolic array algorithm performs the data processing operation to generate the output $Y = [W] \cdot [X]$ of size ($4 \times 10$). The 10 filter coefficients of every filter in $\{W_{0a}, W_{0b}, W_{1a}, W_{1b}\}$ are multiplied with the 10 input samples, which requires 10 multipliers and 9 adders for every output sample. Performing the matrix multiplication $Y = [W] \cdot [X]$ therefore requires 400 multiplication and 360 addition operations. Generating the $Y$ data that represents the FCNN output from the input data $X$ thus amounts to a matrix multiplication over the $10 \times 10$ input image. The input image of size $N \times N$ is processed by considering four rows together, and this can be extended to process ‘n’ rows together by setting the processing elements appropriately. Each row or column of the input data is arranged as in Eq. (9) to process multiple matrix elements and generate successive outputs.
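The operation counts quoted above follow directly from the matrix dimensions: a $4 \times 10$ weight matrix times a $10 \times 10$ input gives 40 output samples, each needing 10 multiplications and 9 additions. A small sketch of this arithmetic (the function name `matmul_op_count` is hypothetical):

```python
def matmul_op_count(rows_w: int, inner: int, cols_x: int):
    """Multiplications and additions for a (rows_w x inner) by (inner x cols_x) product."""
    outputs = rows_w * cols_x                # number of output samples
    multiplications = outputs * inner        # one multiply per coefficient per output
    additions = outputs * (inner - 1)        # inner - 1 adds to sum the products
    return multiplications, additions


mults, adds = matmul_op_count(rows_w=4, inner=10, cols_x=10)
print(mults, adds)
```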

(9)

$ \begin{bmatrix} X_0^0 & X_1^0 & X_2^0 & X_3^0 & X_4^0 & X_5^0 & X_6^0 & X_7^0 & X_8^0 & X_9^0 \\ X_1^0 & X_2^0 & X_3^0 & X_4^0 & X_5^0 & X_6^0 & X_7^0 & X_8^0 & X_9^0 & X_{10}^0 \\ X_2^0 & X_3^0 & X_4^0 & X_5^0 & X_6^0 & X_7^0 & X_8^0 & X_9^0 & X_{10}^0 & X_{11}^0 \\ X_3^0 & X_4^0 & X_5^0 & X_6^0 & X_7^0 & X_8^0 & X_9^0 & X_{10}^0 & X_{11}^0 & X_{12}^0 \\ X_4^0 & X_5^0 & X_6^0 & X_7^0 & X_8^0 & X_9^0 & X_{10}^0 & X_{11}^0 & X_{12}^0 & X_{13}^0 \\ X_5^0 & X_6^0 & X_7^0 & X_8^0 & X_9^0 & X_{10}^0 & X_{11}^0 & X_{12}^0 & X_{13}^0 & X_{14}^0 \\ X_6^0 & X_7^0 & X_8^0 & X_9^0 & X_{10}^0 & X_{11}^0 & X_{12}^0 & X_{13}^0 & X_{14}^0 & X_{15}^0 \\ X_7^0 & X_8^0 & X_9^0 & X_{10}^0 & X_{11}^0 & X_{12}^0 & X_{13}^0 & X_{14}^0 & X_{15}^0 & X_{16}^0 \\ X_8^0 & X_9^0 & X_{10}^0 & X_{11}^0 & X_{12}^0 & X_{13}^0 & X_{14}^0 & X_{15}^0 & X_{16}^0 & X_{17}^0 \\ X_9^0 & X_{10}^0 & X_{11}^0 & X_{12}^0 & X_{13}^0 & X_{14}^0 & X_{15}^0 & X_{16}^0 & X_{17}^0 & X_{18}^0 \end{bmatrix}. $

In Eq. (9) the first row of input data, represented as $X_0^0$ to $X_{18}^0$, is arranged into a 2D matrix of size $10 \times 10$ elements. Similarly, the second to fourth rows of elements are each arranged as $10 \times 10$ elements, and the four matrices are cascaded into a 3D matrix of size $10 \times 10 \times 4$. If ‘n’ rows are considered for processing, the 3D matrix size will be $10 \times 10 \times n$. Processing of the 3D input data is carried out using the 3D systolic array structure shown in Fig. 9. The processing elements $P(x, y, z)$ are arranged in three dimensions starting from $P(0,0,0)$ at the bottom left position. The input data from the 3D matrix is fed into the structure from the bottom, and data moves upwards at every clock. The processing elements $P(0,0,0)$, $P(0,0,1)$, $P(0,0,2)$, and $P(0,0,3)$ are connected to the inputs from the first layer of the 3D matrix (denoted as $X_0$), the second frame is fed into the processing elements $P(1,0,0)$, $P(1,0,1)$, $P(1,0,2)$, and $P(1,0,3)$, and so on, until the ‘n’ frame is fed into the processing elements $P(n,0,0)$, $P(n,0,1)$, $P(n,0,2)$, and $P(n,0,3)$.

Fig. 9. 3D systolic array architecture for FCNN.
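The data arrangement of Eq. (9) places sample $X_{i+j}$ at row $i$, column $j$, so each row of the $10 \times 10$ matrix is the previous row shifted by one sample. The sketch below builds this arrangement for one input row and stacks four rows into the 3D input; the function names and the sample values are hypothetical, and the frame index is stored as the first list dimension purely for convenience.

```python
def sliding_arrangement(row, size=10):
    """Arrange 2*size - 1 samples so that entry (i, j) holds row[i + j], as in Eq. (9)."""
    assert len(row) >= 2 * size - 1, "need X_0 ... X_18 for a 10 x 10 arrangement"
    return [[row[i + j] for j in range(size)] for i in range(size)]


def stack_rows(rows, size=10):
    """Cascade one arrangement per input row into a 3D structure (frame, i, j)."""
    return [sliding_arrangement(r, size) for r in rows]


# Hypothetical data: 4 input rows of 19 samples each, value = 100 * row + index.
rows = [[100 * r + i for i in range(19)] for r in range(4)]
cube = stack_rows(rows)   # 4 frames of 10 x 10 elements
```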

The filter coefficients $\{W_{0a}, W_{0b}, W_{1a}, W_{1b}\}$ are fed into the 3D structure from the Z-X plane as shown in Fig. 9. For each layer of processing elements (PEs) in the Z-X plane, a different set of filter coefficients is fed into the structure for data processing. The data arrangement for inputs and filter coefficients entering the 3D structure is designed to achieve high throughput. The primary building block of the 3D structure is the processing element, which takes inputs $X$ and $W$ and generates output $Y$; its internal structure is shown in Fig. 10. The building blocks of the processing element are an adder, a multiplier, a delay register, a $2 : 1$ multiplexer and storage registers. The primary function of the processing element is to multiply two operands and accumulate the intermediate products. The inputs that enter the processing element are shifted out after one clock, and the accumulated output is shifted out after 10 clocks into the output register. The control input S manages the flow of data into the PE and enables the internal modules to perform the operation with data synchronization.

Fig. 10. Processing element of systolic array.
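A behavioural sketch of the processing element can help make the timing concrete. The model below is an assumption-laden illustration, not the Verilog design: it models only the multiply-accumulate path, the one-clock delay registers and the output register, and it replaces the control input S and the $2 : 1$ multiplexer with a simple internal counter.

```python
class ProcessingElement:
    """Behavioural model: multiply-accumulate with one-clock operand forwarding."""

    def __init__(self, depth=10):
        self.depth = depth   # products accumulated per output (10 in the text)
        self.acc = 0         # running sum of intermediate products
        self.count = 0       # stand-in for the control logic (assumption)
        self.x_reg = 0       # delay register forwarding X to the next PE
        self.w_reg = 0       # delay register forwarding W to the next PE
        self.y_reg = None    # output register, updated every `depth` clocks

    def clock(self, x, w):
        """Advance one clock; return the operands registered on the previous clock."""
        x_out, w_out = self.x_reg, self.w_reg
        self.x_reg, self.w_reg = x, w
        self.acc += x * w
        self.count += 1
        if self.count == self.depth:
            self.y_reg = self.acc    # shift the accumulated result out
            self.acc = 0
            self.count = 0
        return x_out, w_out


pe = ProcessingElement()
forwarded = [pe.clock(x, 1) for x in range(1, 11)]   # ten clocks with W = 1
```

After ten clocks the output register holds the sum of the ten products, while each operand pair appears at the forwarding outputs one clock after it entered.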

Along the X-Y plane of the 3D systolic array structure shown in Fig. 9, 10 processing elements are arranged in every column and input data is fed into the bottommost processing element. Fig. 11 presents the X-Y plane for processing two rows simultaneously (Row 1 and Row 2); this is extended for processing 4 rows or ‘n’ rows. The input data is fed from the bottom and moves to the next processing element every clock cycle. The filter coefficients $W_{0a}$ are fed into the processing elements from left to right. In order to perform the multiplication operation as in Eq. (7), the $W_{0a}$ coefficients are fed into each row of the processing elements with a delay. The $W_{0a}$ coefficients entering the first row of PEs $(0,0,0)$ and $(1,0,0)$ are fed into the array without any preceding zero. The $W_{0a}$ coefficients for the second row PE $(0,0,0)$, $(0,1,1)$ are delayed by 2 clocks, that is, preceded by two zeros, which is represented as $-2w_{0a}^0$.

Similarly, the $W_{0a}$ coefficients entering each of the rows in the X-Y plane are represented as $0w_{0a}^0$, $-2w_{0a}^0$, $-4w_{0a}^0$, ..., $-18w_{0a}^0$.
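The delay pattern $0, -2, -4, ..., -18$ can be produced by prepending zeros to the coefficient stream of each PE row, two zeros per row index. The sketch below illustrates this skewing under the assumption of 10 PE rows and 9 coefficients per filter (the nine columns of Eq. (7)); the function name and coefficient values are hypothetical.

```python
def delayed_streams(coeffs, rows=10, step=2):
    """Coefficient stream per PE row: row r is preceded by step * r zeros."""
    return [[0] * (step * r) + list(coeffs) for r in range(rows)]


# Hypothetical nonzero coefficients w^0 ... w^8 of one filter.
coeffs = [1, 2, 3, 4, 5, 6, 7, 8, 9]
streams = delayed_streams(coeffs)

for r, stream in enumerate(streams[:3]):
    print(r, stream)
```

Row 0 receives the coefficients immediately, row 1 two clocks later, and row 9 eighteen clocks later.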

The first output of $P(0,0,0)$, represented as $y_0^0$, is generated after 10 clock cycles, and the first output of $P(1,0,0)$, represented as $y_0^1$, is generated after 11 clock cycles. The first output of $P(0,1,0)$, represented as $y_0^1$, is generated after 12 clock cycles. Similarly, the outputs of each of the processing elements are generated. The Z-Y plane of the 3D systolic array structure is designed to compute the outputs of four filters simultaneously. The PEs are arranged as shown in Fig. 12, and the data $X$ enters the array from the bottom into all the columns. There are 10 rows of PEs, and in each row there are four PEs. The filter coefficients $W_{0a}$, $W_{0b}$, $W_{1a}$ and $W_{1b}$ are fed into the array as shown in Fig. 12, with preceding zeros as discussed previously, to meet the matrix operations of Eq. (7). The outputs of PEs $(0,0,0)$, $(0,0,1)$, $(0,0,2)$ and $(0,0,3)$ are generated at the 10th clock cycle, and the outputs of all other PEs are generated with delays of 2 clock cycles from the reference PE $(0,0,0)$. The systolic array architecture model is developed as a 3D structure to process data simultaneously and generate the required output for the CNN model. The design is hierarchically modelled from basic elements to processing elements to the 3D structure. A Verilog model for the structure is developed along with a test bench to verify the functionality.

Fig. 11. Data movement in 3D systolic array along X-Y plane.

Fig. 12. Data processing along Z-Y plane in 3D systolic array structure.

5. Results and Discussion

A random number generator is designed to generate integers between 0 and 255, and the data is stored in the internal memory. The control unit of the FCNN processor loads the data into the processor unit for data decomposition. The simulation results are obtained and compared with simulation results from the MATLAB environment. In order to identify the hardware resource utilization, synthesis is carried out in the Xilinx ISE environment. In the synthesized netlist, the first stage comprises the row processor and the second stage comprises the column processor. The column processor processes the outputs of the row processor and generates 16 sub bands. The input and output data widths are set to 20 bits so that any overflow in the output data is avoided. The ‘rst’ and ‘clk’ ports are also included in the design to ensure data is synchronized. The ‘load’ pin is enabled to load the data into the FCNN processor and is synchronized to the clock input. Post synthesis simulation results of the FCNN processor are verified to check functionality. Data_in is the input data, each sample of 20 bits; the input data is randomly selected to be between 0 and 32. The row processing outputs of FF1 and FF2 are presented in the simulation results, and the L output is further processed by the column processor to generate the four sub bands. Similarly, the next group of four outputs, computed by considering only the FF2 outputs from the first stage, is also presented in the simulation results.

The functionally correct Verilog HDL model is synthesized targeting a Virtex-5 FPGA. The synthesis report generated in the Xilinx ISE environment is used to estimate the hardware resources on the FPGA. From the synthesis report the total resources occupied are identified, and further optimization can be planned. The resources estimated from the synthesis report are equivalent to the actual resources required for implementation.

Table 2 presents the comparison of hardware resources of the FCNN considering two methods: with pipelined structure and without pipelined structure. The operating frequency of the pipelined architecture is 70 MHz higher than that of the non-pipelined approach.

Table 2. Implementation results of FCNN (with pipeline).

Slice logic utilization	Systolic array	Direct implementation
Number of slice registers	2,876	2,235
Number of slice LUTs	2,764	2,123
Max. freq. of operation	277.2 MHz	201.3 MHz

The systolic array architecture designed in this work occupies 22% more slice register resources than direct implementation of the FCNN. However, the operating frequency of the systolic array architecture is 27% faster than direct implementation. The processing time of the FCNN is reduced, and hence the CNN operating speed improves and can be developed to operate at 277 MHz. Table 3 presents the FPGA implementation results of the FCNN structure proposed in this work (without pipeline). The parallel processing structure requires 7122 registers and 8734 slices and operates at a maximum frequency of 155.4 MHz. Table 4 presents the power dissipation report. The total power dissipation is estimated to be 1.922 W, an increase of 0.1 W compared with the architecture without parallel and pipelined processing.

Table 3. Implementation results of FCNN (without pipeline).

FPGA utilization	Systolic array	Direct implementation
Number of slice registers	7,123	5,534
Number of slice LUTs	8,733	6,700
Max. freq. of operation	155.4 MHz	128.2 MHz

Table 4. Power report of FCNN (with pipeline).

Parameter	Systolic array	Direct implementation
Total quiescent power	0.820 W	0.800 W
Total dynamic power	1.190 W	1.11 W

The operating frequency of the systolic array architecture based FCNN implementation is 17% faster than direct implementation. From the implementation results, the systolic array architecture is faster and is more useful for the real time image registration process.

In order to estimate the contribution of the high speed FCNN model and its impact on the processing speed of the CNN, a hardware-software co-simulation model is developed in this work. In this model, the CNN model is developed in MATLAB, and the FCNN model is developed as both a hardware and a software model. The hardware model is the FPGA implementation of the FCNN and the software model is the MATLAB model. The software model of the CNN is modelled in MATLAB and an input image is processed to estimate the computation time. The computing platform considered is an Intel i5 processor with quad cores. The estimated time is a few milliseconds for the inference model for an image of size $2048 \times 2048$, of which the CNN model requires 50% of the processing time. In the co-simulation environment, the FCNN model working on FPGA operates at 277 MHz and the processing delay is estimated to be nearly 4 milliseconds. The inference time of the developed model with hardware implementation of the FCNN is reduced by 48%.

In the FCNN structure the power dissipation is estimated to be 3 W, which would need to be reduced to 1.5 W by adopting low power methods for implementation. The power dissipation in the systolic array architecture is higher than in direct implementation, and it increases further with operating frequency. Low power methods need to be developed to reduce power dissipation in the systolic array architecture design.
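The percentages quoted in this section can be reproduced from Tables 2 and 3 if the difference is taken relative to the systolic array value; that normalization is an inference from the numbers, not something stated in the text. The sketch below recomputes them (the function name `pct_of_systolic` is hypothetical).

```python
def pct_of_systolic(systolic: float, direct: float) -> float:
    """Difference between the two designs as a percentage of the systolic value."""
    return (systolic - direct) / systolic * 100.0


# Values copied from Table 2 (with pipeline) and Table 3 (without pipeline).
freq_pipelined = pct_of_systolic(277.2, 201.3)       # operating frequency, MHz
regs_pipelined = pct_of_systolic(2876, 2235)         # slice registers
freq_non_pipelined = pct_of_systolic(155.4, 128.2)   # operating frequency, MHz

print(round(freq_pipelined, 1), round(regs_pipelined, 1), round(freq_non_pipelined, 1))
```

Normalizing by the direct implementation value instead would give larger percentages, so the choice of baseline matters when comparing these figures with other work.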

Table 5. Power report of FCNN (without pipeline).

Parameter	Systolic array	Direct implementation
Total quiescent power	1.94 W	1.93 W
Total dynamic power	1.14 W	1.018 W
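The totals implied by the two power reports can be checked by adding the quiescent and dynamic components. The sketch below uses only the values listed in Tables 4 and 5; the `PowerReport` class and `overhead_percent` helper are illustrative names, not part of the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PowerReport:
    quiescent_w: float
    dynamic_w: float

    @property
    def total_w(self) -> float:
        """Total power is the sum of quiescent and dynamic power."""
        return self.quiescent_w + self.dynamic_w


# Values copied from Table 4 (with pipeline) and Table 5 (without pipeline).
TABLE_4 = {"systolic": PowerReport(0.820, 1.190), "direct": PowerReport(0.800, 1.11)}
TABLE_5 = {"systolic": PowerReport(1.94, 1.14), "direct": PowerReport(1.93, 1.018)}


def overhead_percent(report: dict) -> float:
    """Extra total power of the systolic array relative to the direct design."""
    systolic = report["systolic"].total_w
    direct = report["direct"].total_w
    return 100.0 * (systolic - direct) / direct


for name, report in (("Table 4", TABLE_4), ("Table 5", TABLE_5)):
    print(
        f"{name}: systolic {report['systolic'].total_w:.3f} W, "
        f"direct {report['direct'].total_w:.3f} W, "
        f"overhead {overhead_percent(report):.1f}%"
    )
```

The Table 5 total for the systolic array comes to about 3.08 W, which matches the 3 W figure quoted in the text.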

6. Conclusion

In this work, a systolic-array-based FCNN architecture for image decomposition using two systolic processors is designed and implemented on an FPGA. The systolic array architecture is designed with parallel processing modules and a pipelined structure to improve throughput. With additional arithmetic modules, the computation complexity is improved by 4%. The operating speed is improved to 156 MHz, and power dissipation is limited to 3 W. The pipelined structure requires additional resources and consumes more power; however, its processing speed is higher than that of the direct implementation. Having been validated on the FPGA and verified against the MATLAB model, the developed FCNN is suitable for ASIC implementation and IP development. To further improve the operating speed, dedicated architectures need to be designed.

Acknowledgement

I would like to acknowledge Dr. Cyril Prasanna Raj P for his valuable inputs and efforts.

References

A. A. Elngar, M. Arafa, A. Fathy, B. Moustafa, O. Mahmoud, M. Shaban, N. Fawzy, Image classification based on CNN: a survey, Journal of Cybersecurity and Information Management, Vol. 6, No. 1, pp. 18-50, 2021

F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, K. Keutzer, SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size, arXiv preprint arXiv:1602.07360, 2016

F. Chollet, Xception: deep learning with depthwise separable convolutions, Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1800-1807, 2017

A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, H. Adam, MobileNets: Efficient convolutional neural networks for mobile vision applications, arXiv preprint arXiv:1704.04861, 2017

K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556, 2014

A. Krizhevsky, I. Sutskever, G. Hinton, ImageNet classification with deep convolutional neural networks, Advances in Neural Information Processing Systems, Vol. 25, 2012

J. Cong, B. Xiao, Minimizing computation in convolutional neural networks, Proc. of International Conference on Artificial Neural Networks, 2014

H. Kim, K. Choi, Low power FPGA-SoC design techniques for CNN-based object detection accelerator, Proc. of 2019 IEEE 10th Annual Ubiquitous Computing, Electronics and Mobile Communication Conference, pp. 1130-1134, 2019

V. Sze, Y.-H. Chen, J. Emer, A. Suleiman, Z. Zhang, Hardware for machine learning: challenges and opportunities, arXiv preprint arXiv:1612.07625, 2016

M. Sankaradas, V. Jakkula, S. Cadambi, S. Chakradhar, I. Durdanovic, E. Cosatto, H. P. Graf, A massively parallel coprocessor for convolutional neural networks, Proc. of 20th IEEE International Conference on Application-specific Systems, Architectures and Processors, pp. 53-60, 2009

V. Gokhale, J. Jin, A. Dundar, B. Martini, E. Culurciello, A 240 G-ops/s mobile coprocessor for deep neural networks, Proc. of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 682-687, 2014

Y.-H. Chen, T. Krishna, J. S. Emer, V. Sze, Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks, IEEE Journal of Solid-State Circuits, Vol. 52, No. 1, pp. 127-138, 2017

C. Qiu, X. Wang, T. Zhao, Q. Li, B. Wang, H. Wang, An FPGA-based convolutional neural network coprocessor, Machine Learning in Mobile Computing: Methods and Applications, Vol. 2021, 2021

T. Sledevic, A. Serackis, D. Plonis, FPGA implementation of a convolutional neural network and its application for pollen detection upon entrance to the beehive, Agriculture, Vol. 12, No. 11, pp. 1849, 2022

X. Yang, C. Zhuang, W. Feng, Z. Yang, Q. Wang, FPGA implementation of a deep learning acceleration core architecture for image target detection, Applied Sciences, Vol. 13, No. 7, pp. 4144, 2023

Z. Zang, D. Xiao, Q. Wang, Z. Jiao, Y. Chen, D. D. Li, Compact and robust deep learning architecture for fluorescence lifetime imaging and FPGA implementation, Methods and Applications in Fluorescence, Vol. 11, No. 2, pp. 025002, 2023

X. Sui, Q. Lv, L. Zhi, B. Zhu, Y. Yang, Y. Zhang, Z. Tan, A hardware-friendly high-precision CNN pruning method and its FPGA implementation, Sensors, Vol. 23, No. 2, pp. 824, 2023

Pottipati Dileep Kumar Reddy

Pottipati Dileep Kumar Reddy is currently working as a research scholar in the Electronics & Communication Engineering Department at YSR Engineering College of Yogi Vemana University. He received his M.Tech. degree in VLSI design from SRM University, India, in 2014, and completed his four-year B.Tech. degree in electronics & communication engineering at JNTUA, Ananthapur, India, in 2012. His research interests include quantum computing, image processing, neural networks, and VLSI design. He can be contacted by email at dkr.pottipati@gmail.com.

Kota Venakata Ramanaih

Kota Venakata Ramanaih is a professor in the Electronics and Communication Engineering Department and is currently working as Dean, Faculty of Engineering, Y.S.R. Engineering College of Yogi Vemana University, Proddatur. He has more than 28 years of teaching experience. His research interests include low-power VLSI design architectures, image processing, and neural-network-based image compression. He has published more than 115 papers in international and national journals. He received the Adarsh Vidya Saraswati Rashtriya Puraskar National Award. Under his guidance, nine students have received Ph.D. degrees from Yogi Vemana University Kadapa, JNTUA Ananthapuram, and JNTUK Kakinada. He obtained his Ph.D. from JNT University of Hyderabad in 2009, his M.Tech. from JNTU College of Engineering Kukatpally, Hyderabad, in 1998, and his B.E. from KBNCE, Gulbarga University, Gulbarga, in 1992.