1. Introduction
Convolutional Neural Network (CNN) models are trained with large data sets to learn
the attributes of the input image and generate the required output, and hence have
the potential to perform image processing tasks such as object detection, classification
and pattern recognition with superior performance. CNN models have predominantly been
executed in the cloud, as they require high computation power and large memory [1].
For emerging edge computing applications such as robotics, drones, automotive and
medical devices, CNN models need to be implemented on hardware platforms. Hardware
platforms considered for CNN model implementation are designed to optimize area
resources and power dissipation and to minimize processing delay. Design and development
of CNN models on hardware platforms drives many time-critical applications.
A CNN model comprises three major sub blocks: convolution operation, maximum pooling
and fully connected neural network. CNN models such as SqueezeNet [2], Xception [3],
MobileNet [4], VGG-16 [5] and AlexNet [6] are a few of the most popular models that
have been extensively used for image processing applications. The convolutional and
fully connected sub systems in a CNN model require multiplication and addition operations.
Data movement in these sub systems is handled by control modules that synchronize
transfers involving the memory modules. Once a CNN model is trained, it is used for
inference, which involves millions of arithmetic operations, of which 90% are carried
out in the convolutional layer [7]. In order to reduce latency and improve processing
speed, parallel structures are required to implement the sub systems of a CNN model [8],
[9].
Parallelism in CPU and GPU platforms is achieved through the use of multilevel caches,
single instruction multiple data execution, multiple threads, multiple ALUs, and shared
controllers and memories. The convolutional and fully connected modules of a CNN are
mapped to matrix multiplication in CPU and GPU platforms. In a CNN model, data processing
requires MPL logic operation, storage and complex control operations for implementing
the convolutional and fully connected layers. A GPU has a large number of cores that
are efficient at performing parallel operations and is suitable for CNN model implementation.
However, power requirement and area occupancy for sub system implementation are complex
to manage in GPUs. Sankaradas et al. have developed coprocessor modules for CNN sub
system implementation on an FPGA platform. In their work, 20-bit and 16-bit fixed point
number systems are used for weight matrix representation and feature representation,
respectively [10]. Gokhale et al. have demonstrated a CNN coprocessor implementation for mobile embedded
applications with processing speed of 200 GOP/s on Xilinx platform [11]. Eyeriss is an efficient reconfigurable module for CNN implementation supporting
different sizes of feature maps, convolutional kernel size and data reuse [12]. Changnei Qiu et al. [13] have developed CNN coprocessor with 1D processor for convolutional row operation
and 3D processor for overall computation with pulsating array structure. VGG16 CNN
logic based convolutional and fully connected subsystem is implemented on FPGA using
16-bit fixed point number system with operation speed of 316 GOP/s at operating frequency
of 200 MHz with power dissipation limited to less than 9.25 W. Tomyslav Sledevic et
al. [14] have developed a CNN accelerator with four convolutional layers using a
pre-trained CNN model, implemented on FPGA with a 16-bit fixed point number system.
The developed model is observed to process data in less than 8.8 ms for classification
of an image of size $512 \times 128$. Xu Yang et al. [15] propose a deep learning accelerator for FPGA implementation with a parallelization model
optimizing power and area. The model is capable of processing 14 frames per second
with 5 GOP/s and is demonstrated to be 28% higher than DSP operating frequency. Zang
et al. [16] have developed deep learning architecture with adder logic based on bespoke method.
The proposed architecture is implemented on FPGA improving computing efficiency considering
shorter bit-width for data representation and post quantization process. Sui et
al. [17] have proposed a hardware efficient and accurate CNN model with pruning logic that
has been implemented on a ZYNQ FPGA. The resource utilization on FPGA for CNN implementation
is optimized and is demonstrated to be superior to DSP modules. From the literature
it is observed that CNN models comprising convolutional layers and fully
connected layers are computationally intensive for hardware implementation. In order
to reduce computation complexity, the convolutional layer is implemented with optimization
methods such as reduced bit widths for data representation, pruning logic,
pipelining and parallelism. The fully connected layer processes the data after rearrangement
and needs to process 256 data inputs with a minimum of three stages of neurons.
Thus the complexity of the fully connected network is on par with the convolution layer
complexity [1]. Reducing the computation complexity of the fully connected network can achieve a 50% reduction
in the computation metrics of the CNN in terms of area, timing and power [1]. The trained CNN model uses predefined masks in the convolutional layer, and the
fully connected layer requires optimum trained weights for data processing. Design
of a customized architecture considering predefined masks and trained weights will further
reduce computation complexity. In this paper, new methods are developed considering
the correlation between weights and convolutional weights. The developed model is
HDL coded and verified for logic correctness. The CNN model developed is implemented
on FPGA, optimizing area, timing and power. A detailed discussion of the hardware architecture
design is presented, with comparison against existing models.
2. Convolutional Neural Network
A convolutional neural network consists of a Convolutional Layer (CL), a Maximum Pooling
Layer (MPL) and a Fully Connected Neural Network Layer (FCNNL). The CL and MPL modules
are repeated in sequence N times (N depends on the input image size and is decided considering the complexity of implementation;
N is usually set to 3). CL and MPL together reduce the dimensionality of the input image and
also capture its significant features at different levels such
as low level, mid level and high level. Fig. 1 presents the CNN structure for image processing applications. The input image is
processed by the CL and MPL layers, reducing dimensionality, and the final 2D output of the MPL
layer is flattened to 1D. The flattened 1D data is processed
by the trained FCNNL model to generate the output.
The convolutional layer uses filters to extract features from the input. The number
of features extracted depends on the number of filters used in the convolutional
layer. If three feature maps are required from the input in the first
stage, three filters are used and the CL module generates three frames from the input
image representing the feature maps. The CL module is computationally intensive and occupies
more resources. The CNN model processes an input of size $h \times w \times c_{in}$
using a $k \times k \times c_{in} \times c_{out}$ kernel to generate output data
of size $h \times w \times c_{out}$. The parameters $h$ and $w$ are the height and width
of the input data, $k$ is the size of the kernel, $c_{in}$ is the number of input channels
or frames, and $c_{out}$ is the number of output frames. Eq. (1) represents the convolution operation.
Each convolutional kernel is of size $k \times k \times c_{in}$.
The computation complexity of a CNN is measured by the number of multiplication
and accumulation (MAC) operations and is expressed as in Eq. (2).
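Eqs. (1) and (2) are not reproduced in this text. As a hedged reconstruction from the definitions above, assuming unit stride and a zero-padded input so that the output keeps the $h \times w$ size, they would take the standard form below. The symbols $X$ (zero-padded input), $K$ (kernel), $Y$ (output) and the indices $i, j, u, v, c, o$ are assumptions introduced for illustration.

$$Y(i, j, o) = \sum_{c=1}^{c_{in}} \sum_{u=1}^{k} \sum_{v=1}^{k} K(u, v, c, o) \, X(i + u - 1, j + v - 1, c)$$

$$\mathrm{MAC} = h \times w \times k \times k \times c_{in} \times c_{out}$$

For example, with $h = w = 6$, $k = 3$, $c_{in} = 1$ and $c_{out} = 3$, this count is $6 \times 6 \times 9 \times 3 = 972$ MAC operations.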
Fig. 2 presents the demonstration of convolution operation of a $6 \times 6$ image processed
to generate three feature maps using filters of size $3 \times 3$.
Fig. 2. Convolution operation.
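The operation in Fig. 2 can be sketched in a few lines of Python. This is a minimal illustration, not the hardware implementation: it assumes stride 1 and no padding (so a $6 \times 6$ input yields $4 \times 4$ feature maps), and the three filter kernels are hypothetical examples chosen only to make the result easy to check.

```python
def conv2d_valid(image, kernels):
    """Slide each k x k kernel over a 2D image (stride 1, no padding).

    Returns the list of feature maps and the number of MAC operations used.
    """
    h, w = len(image), len(image[0])
    k = len(kernels[0])
    out_h, out_w = h - k + 1, w - k + 1
    feature_maps, macs = [], 0
    for kernel in kernels:
        fmap = [[0] * out_w for _ in range(out_h)]
        for i in range(out_h):
            for j in range(out_w):
                acc = 0
                for u in range(k):
                    for v in range(k):
                        acc += image[i + u][j + v] * kernel[u][v]  # one MAC
                        macs += 1
                fmap[i][j] = acc
        feature_maps.append(fmap)
    return feature_maps, macs


# Hypothetical 6 x 6 test image with pixel value 6*row + col.
image = [[6 * r + c for c in range(6)] for r in range(6)]
# Hypothetical 3 x 3 filters: box sum, horizontal gradient, vertical gradient.
kernels = [
    [[1, 1, 1], [1, 1, 1], [1, 1, 1]],
    [[1, 0, -1], [1, 0, -1], [1, 0, -1]],
    [[1, 1, 1], [0, 0, 0], [-1, -1, -1]],
]
feature_maps, mac_count = conv2d_valid(image, kernels)
print(len(feature_maps), len(feature_maps[0]), len(feature_maps[0][0]), mac_count)
# 3 4 4 432
```

Without padding the MAC count is $4 \times 4 \times 9 \times 3 = 432$; with padding that preserves the $6 \times 6$ size it would be 972.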
The three filters of size $3 \times 3$ process the input data using the convolution operation
and generate three frames that capture three features as per the predefined filter
properties. The filters slide across the input data, and multiply-accumulate
operations are carried out to compute each feature map. In the MPL module the feature
map output of the convolution operation is reduced by retaining the pixel with maximum
intensity in each pooling window and discarding all other pixels. Fig. 3 presents the structure of the FCNN layer. Fig. 4 presents the single neuron structure and its mathematical model. The inputs are represented
as $x_n$, the weights as $w_{kn}$, the bias as $b_k$ and the network activation
function as $\phi(\cdot)$. The variable $n$ is the size of the input vector and $k$ is
the neuron number. The output of the max pooling layer is flattened into a 1D structure
and is processed by the FCNN layer.
Fig. 3. Structure of fully connected neural network layer.
Fig. 4. Single neuron structure.
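The path from a feature map to a neuron output can be illustrated as follows. This is a sketch under stated assumptions: the pooling window is taken to be $2 \times 2$ with stride 2, the activation $\phi$ is taken to be the hyperbolic tangent, and the feature map, weights and bias are hypothetical values.

```python
import math


def max_pool_2x2(fmap):
    """Keep the maximum of each non-overlapping 2 x 2 window."""
    rows, cols = len(fmap) // 2, len(fmap[0]) // 2
    return [
        [
            max(fmap[2 * r][2 * c], fmap[2 * r][2 * c + 1],
                fmap[2 * r + 1][2 * c], fmap[2 * r + 1][2 * c + 1])
            for c in range(cols)
        ]
        for r in range(rows)
    ]


def flatten(fmap):
    """Convert a 2D feature map to a 1D list, row by row."""
    return [v for row in fmap for v in row]


def neuron(x, w, b, phi=math.tanh):
    """Single neuron: phi(sum_n w_n * x_n + b)."""
    return phi(sum(wi * xi for wi, xi in zip(w, x)) + b)


fmap = [[1, 3, 2, 0],
        [4, 2, 1, 5],
        [0, 1, 7, 2],
        [3, 2, 4, 6]]
pooled = max_pool_2x2(fmap)   # [[4, 5], [3, 7]]
x = flatten(pooled)           # [4, 5, 3, 7]
y = neuron(x, w=[0.1, -0.2, 0.3, 0.0], b=0.5)
```

Here the weighted sum is approximately 0.8, so `y` is approximately `tanh(0.8)`.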
Training is required for the CNN model. The generic block diagram of
the CNN module is presented in Fig. 5. It comprises a CNN model which is trained, and the trained weights along with
the data set are stored in the computer. During inference, the trained weights
for the CNN are loaded and the input data from the camera is loaded into the CNN module
for processing. The output generated is stored in external memory. Fig. 6 presents the accelerator architecture for the CNN model, which comprises weight memory,
input memory, DMA and FSM modules. The convolutional layer with max pooling is the
core of the accelerator. The off-chip memory and CPU are used for data control and
input data storage.
Fig. 5. Training and inference of CNN module.
Fig. 6. CNN accelerator architecture.
3. Design of Proposed CNN Architecture
As discussed in the previous section, the CNN architecture uses
a neural network structure as the fully connected layer, which maps the input
data or the reordered data to a suitable output. The CNN model comprises a convolutional
layer, a maximum pooling layer and a fully connected neural network layer. The fully connected
neural network layer is the last stage of the CNN model. In this work the neural network
architecture is designed to have an input layer of $22 \times 1$ vectors. Fig. 7 presents the fully connected layer structure. The hidden layer 1 outputs are denoted
as $\{a_1^1, a_2^1, a_3^1, ..., a_{s1}^1\}$. The intermediate output of each neuron
of this layer is denoted as $\{n_1^1, n_2^1, n_3^1, ..., n_{s1}^1\}$.
Similarly, the intermediate output and the output of the second layer are denoted as $n_{s2}^2$ and
$a_{s2}^2$. The network activation function for all the hidden layers and output layer
is selected to be tansig or purelin. The intermediate output ($n_m^1$) of hidden layer
1 is mathematically represented as in Eq. (3), and the output of the hidden layer is denoted as in Eq. (4).
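Eqs. (3) and (4) are not reproduced in this text. A form consistent with the notation used in this section, with weights $w_{m,l}$, bias $b_m^1$ and activation function $\phi(\cdot)$, and assuming the 22 inputs are written as $x_l$ (a symbol introduced here for illustration), would be:

$$n_m^1 = \sum_{l=1}^{22} w_{m,l} \, x_l + b_m^1, \qquad a_m^1 = \phi(n_m^1), \qquad m = 1, 2, ..., s1$$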
Fig. 7. Fully connected neural network model for CNN.
Table 1. Size of frames for the proposed CNN model.
| Layer | Size ($N \times N$) | Frames |
|---|---|---|
| Input | $N \times N$ | 1 |
| CL1 | $(N - 1) \times (N - 1)$ | 4 |
| MP1 | $[(N - 1)/2] \times [(N - 1)/2]$ | 4 |
| CL2 | $[(N - 3)/2] \times [(N - 3)/2]$ | 16 |
| MP2 | $[(N - 3)/4] \times [(N - 3)/4]$ | 16 |
| CL3 | $[(N - 7)/4] \times [(N - 7)/4]$ | $16 \times 4$ |
| MP3 | $[(N - 7)/8] \times [(N - 7)/8]$ | $16 \times 4$ |
| OL1 | $[(N - 7)/8] \times [(N - 7)/8]$ | $8 \times 4$ |
| OL2 | $[(N - 7)/8] \times [(N - 7)/8]$ | $8 \times 2$ |
| FCNN | 1D Network | |
The weights are denoted by $w_{m,l}$ and the bias is denoted as $b_m^1$. For the first
hidden layer, if the number of neurons is set to N, the number of weights is 22N,
as every neuron has 22 weights, one for each of the 22 energy levels. The number of
biases in the hidden layer equals the number of neurons. Similarly, all other hidden layer outputs are
denoted as $n_k^p$, where $k$ indexes the neurons in
hidden layer $p$. Eqs. (5) and (6) present the intermediate output and final output of hidden layer 2.
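The layer-by-layer computation described above can be sketched as a forward pass. This is an illustrative example only: `tansig` and `purelin` are taken to be the MATLAB-style transfer functions (hyperbolic tangent sigmoid and linear), the layer sizes `s1 = 4` and `s2 = 2` are hypothetical, and the weight and bias values are placeholders rather than trained values.

```python
import math


def tansig(n):
    """Hyperbolic tangent sigmoid; numerically equal to tanh(n)."""
    return 2.0 / (1.0 + math.exp(-2.0 * n)) - 1.0


def purelin(n):
    """Linear transfer function."""
    return n


def layer(inputs, weights, biases, phi):
    """Return (n, a): pre-activation sums and activated outputs of one layer."""
    n = [sum(w * x for w, x in zip(row, inputs)) + b
         for row, b in zip(weights, biases)]
    a = [phi(v) for v in n]
    return n, a


x = [0.1 * i for i in range(22)]                  # 22 x 1 input vector
s1, s2 = 4, 2                                     # hypothetical layer sizes
W1 = [[0.01 * (m + 1)] * 22 for m in range(s1)]   # s1 x 22 placeholder weights
b1 = [0.0] * s1
W2 = [[0.5] * s1 for _ in range(s2)]              # s2 x s1 placeholder weights
b2 = [0.1] * s2

n1, a1 = layer(x, W1, b1, tansig)    # hidden layer 1
n2, a2 = layer(a1, W2, b2, purelin)  # hidden layer 2
weights_in_layer1 = sum(len(row) for row in W1)   # 22 weights per neuron
```

With four neurons in the first hidden layer, `weights_in_layer1` is $22 \times 4 = 88$ and there are four biases, one per neuron.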
In the proposed architecture the input data is processed by 3 stages of CL + MPL modules
and one Fully Connected Neural Network (FCNN) layer. The FCNN model in this
work comprises two hidden layers. The CL module has 4 filters to extract features
in all directions of orthogonality. The proposed architecture supports processing
of high-definition input images of size $2048 \times 2048$ pixels. The 1st stage generates
a $1024 \times 1024$ image, which is down sampled by 2 to obtain $512 \times 512$
without loss of data. The 2nd and 3rd stages together process the output of the down
sampling module to generate a $32 \times 32$ image. The flattening module converts the 2D
($32 \times 32$) data to a 1D vector of size ($1024 \times 1$). The FCNN layer processes
the 1024 vector elements using 4 stages of the neural network module to generate M outputs.
An input image of size $N \times N$ is processed by three stages of convolutional layer
and max pooling layer. Every convolutional layer reduces each dimension of its input by
one, for example from $N \times N$ to $(N - 1) \times (N - 1)$. The output of the convolution layer is processed
by the maximum pooling layer to generate an output of $(N - 1)/2 \times (N - 1)/2$. In this
process an input image of size $N \times N$ is used to generate four frames of feature
maps of size $(N - 1)/2 \times (N - 1)/2$. In the second stage of CL+MPL, the $(N - 1)/2 \times (N -
1)/2$ input of four frames is processed to generate 4 Groups of Frames (GOPs), each
of four frames of
size $(N - 3)/4 \times (N - 3)/4$. Further, the 3rd stage of CL+MPL processes
16 frames to generate 64 frames; these 64 frames are grouped into four groups each
of 16 frames with frame size of $(N - 7)/8 \times (N - 7)/8$. Fig. 8 illustrates the representation of GOPs after 3rd stage processing. A similar structure
is used in the first and second stages of the CNN model. Based on this discussion,
the next section presents the systolic array architecture for processing GOPs.
Fig. 8. 3rd stage of proposed CNN architecture generating 4 groups each of 16 frames
with size $(N - 7)/8 \times (N - 7)/8$.
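The frame sizes and frame counts listed in Table 1 follow a simple recurrence, sketched below. The sketch assumes what the table implies: each CL stage reduces each dimension by one and multiplies the frame count by four, and each MP stage halves each dimension. Exact fractions are used because the text does not state how non-integer sizes are rounded or padded.

```python
from fractions import Fraction


def frame_sizes(n, stages=3, filters=4):
    """Return (layer, side length, frame count) per Table 1 for an n x n input."""
    size, frames = Fraction(n), 1
    table = [("Input", size, frames)]
    for s in range(1, stages + 1):
        size, frames = size - 1, frames * filters   # CL: side - 1, frames x 4
        table.append((f"CL{s}", size, frames))
        size = size / 2                             # MP: halve each dimension
        table.append((f"MP{s}", size, frames))
    return table


for name, size, frames in frame_sizes(15):
    print(name, size, frames)
```

For a hypothetical $15 \times 15$ input this gives sides 15, 14, 7, 6, 3, 2, 1 with 1, 4, 4, 16, 16, 64, 64 frames, matching the $(N - 7)/8$ entry for MP3.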
4. Systolic Array Architecture for FCNN
The systolic array architecture design for the first stage FCNN comprises weight filters
whose coefficients, denoted $\{W_{0a}, W_{0b}, W_{1a}, W_{1b}\}$, are represented
in matrix form as in Eq. (7), and the input data considered for processing is expressed as in Eq. (8).
Considering an input image ($X$) of size $10 \times 10$, the systolic array algorithm
performs the data processing operation to generate the output $Y = [W] \cdot [X]$ of
size ($4 \times 10$). The 10 coefficients of each filter in $\{W_{0a},
W_{0b}, W_{1a}, W_{1b}\}$ are multiplied with 10 input samples, which requires
10 multipliers and 9 adders for every output sample. The matrix multiplication
$Y = [W] \cdot [X]$ therefore requires 400 multiplication and 360 addition operations. To
process the input data $X$ and generate the data $Y$ representing the FCNN
output, the matrix multiplication is performed on input
image blocks of size $10 \times 10$. The input image of size $N \times N$ is processed
by considering four rows together, and this can be extended to process ‘n’ rows together
by setting the processing elements appropriately. Each row or column of the input
data is arranged as in Eq. (9) to perform processing of multiple matrix elements and generate successive outputs.
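The operation counts quoted above can be checked with a direct matrix multiplication that tallies each multiply and add. The matrix contents below are hypothetical; only the $4 \times 10$ and $10 \times 10$ shapes come from the text.

```python
def matmul_with_counts(W, X):
    """Compute Y = W . X and count scalar multiplications and additions."""
    rows, inner, cols = len(W), len(X), len(X[0])
    Y = [[0] * cols for _ in range(rows)]
    mults = adds = 0
    for i in range(rows):
        for j in range(cols):
            acc = W[i][0] * X[0][j]          # first product needs no addition
            mults += 1
            for t in range(1, inner):
                acc += W[i][t] * X[t][j]     # one multiply and one add
                mults += 1
                adds += 1
            Y[i][j] = acc
    return Y, mults, adds


W = [[r + c for c in range(10)] for r in range(4)]          # 4 x 10 filter matrix
X = [[(r * c) % 7 for c in range(10)] for r in range(10)]   # 10 x 10 input block
Y, mults, adds = matmul_with_counts(W, X)
print(len(Y), len(Y[0]), mults, adds)  # 4 10 400 360
```

Each of the 40 output samples uses 10 multiplications and 9 additions, which gives the 400 and 360 figures.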
In Eq. (9) the first row of input data, represented as $X_0^0$ to $X_{18}^0$, is arranged into
a 2D matrix of $10 \times 10$ elements. Similarly, the second to fourth rows
are each arranged as $10 \times 10$ elements, and the four matrices are cascaded into a 3D matrix
of size $10 \times 10 \times 4$. If ‘n’ rows are considered for processing then the
3D matrix size will be $10 \times 10 \times n$. Processing of the 3D input data is
carried out using the 3D systolic array structure shown in Fig. 9. The processing elements $P(x, y, z)$ are arranged in three dimensions starting from
$P(0,0,0)$ at the bottom left position. The input data from the 3D matrix is fed into
the structure from the bottom and data moves upwards at every clock. The processing
elements $P(0,0,0)$, $P(0,0,1)$, $P(0,0,2)$, and $P(0,0,3)$ are connected to the inputs
from the first layer of the 3D matrix (denoted as $X_0$), the second frame is fed into
the processing elements $P(1,0,0)$, $P(1,0,1)$, $P(1,0,2)$, and $P(1,0,3)$, and so
on, until the ‘n’th frame is fed into the processing elements $P(n,0,0)$, $P(n,0,1)$, $P(n,0,2)$,
and $P(n,0,3)$.
Fig. 9. 3D systolic array architecture for FCNN.
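Eq. (9) is not reproduced in this text, so the exact data arrangement is not known. One arrangement consistent with 19 samples ($X_0$ to $X_{18}$) filling a $10 \times 10$ matrix is a sliding-window (Hankel) layout in which entry $(i, j)$ holds $X_{i+j}$. The sketch below uses that layout purely as an assumption, with hypothetical sample values, to show how four image rows could be stacked into a 3D block.

```python
def hankel_10x10(row):
    """Arrange 19 samples into a 10 x 10 matrix with entry (i, j) = row[i + j].

    The Hankel layout is an assumption; the source does not show Eq. (9).
    """
    if len(row) != 19:
        raise ValueError("expected 19 samples")
    return [[row[i + j] for j in range(10)] for i in range(10)]


def stack_rows(rows):
    """Stack one 10 x 10 matrix per image row into an n x 10 x 10 block."""
    return [hankel_10x10(r) for r in rows]


# Hypothetical data: sample s of image row r is encoded as 100*r + s.
rows = [[100 * r + s for s in range(19)] for r in range(4)]
cube = stack_rows(rows)
print(len(cube), len(cube[0]), len(cube[0][0]))  # 4 10 10
```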
The filter coefficients $\{W_{0a}, W_{0b}, W_{1a}, W_{1b}\}$ are fed into the 3D structure
from the Z-X plane as shown in Fig. 9. For each layer of processing elements (PEs) in the Z-X plane, a different set
of filter coefficients is fed into the structure for data processing. The data arrangement
for inputs and filter coefficients entering the 3D structure is designed to achieve
high throughput. The primary building block of the 3D structure is the processing
element, represented in Fig. 10 with inputs $X$ and $W$ and output $Y$. The internal structure of the processing
element is also shown in Fig. 10. The building blocks of the processing element are an adder, a multiplier, a delay register,
a $2 : 1$ multiplexer and storage registers. The primary function of the processing element
is to multiply two operands and accumulate the intermediate products.
The inputs that enter the processing element are shifted out after one clock,
and the accumulated output is shifted out after 10 clocks into the output register.
The control input S manages the flow of data into the PE and enables the internal
modules to operate with data synchronization.
Fig. 10. Processing element of systolic array.
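The behaviour of the processing element can be modelled at the clock-cycle level as below. This is a simplified behavioural sketch rather than the Verilog design: the class and attribute names are hypothetical, and the control input S and the multiplexer are not modelled.

```python
class ProcessingElement:
    """Behavioural model of a multiply-accumulate PE with operand forwarding."""

    def __init__(self, depth=10):
        self.depth = depth   # number of products accumulated per output
        self.acc = 0         # running sum of products
        self.count = 0       # products accumulated so far
        self.x_reg = 0       # delay register forwarding X to the next PE
        self.w_reg = 0       # delay register forwarding W to the next PE
        self.y_out = None    # output register, valid after `depth` clocks

    def clock(self, x, w):
        """Advance one clock; return the operands shifted out to neighbours."""
        x_fwd, w_fwd = self.x_reg, self.w_reg   # inputs from the previous clock
        self.x_reg, self.w_reg = x, w
        self.acc += x * w
        self.count += 1
        if self.count == self.depth:            # shift the sum to the output
            self.y_out = self.acc
            self.acc, self.count = 0, 0
        return x_fwd, w_fwd


pe = ProcessingElement()
xs = list(range(1, 11))   # hypothetical input samples 1..10
ws = [2] * 10             # hypothetical filter coefficients
forwarded = [pe.clock(x, w) for x, w in zip(xs, ws)]
print(pe.y_out)  # 110
```

The operands presented on one clock appear on the forwarding outputs on the next clock, and the accumulated sum reaches the output register after 10 clocks.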
Along the X-Y plane of the 3D systolic array structure shown in Fig. 9, 10 processing elements are arranged in every column and input data is fed into the
bottommost processing element. Fig. 11 presents the X-Y plane for processing two rows simultaneously (Row 1 and Row 2); this
is extended for processing 4 rows or ‘n’ rows. The input data is fed from the bottom
and moves to the next processing element every clock cycle. The filter coefficients
$W_{0a}$ are fed into the processing elements from left to right. In order to perform the
multiplication operation as in Eq. (7), the $W_{0a}$ coefficients are fed into each row of the processing elements with delay.
The $W_{0a}$ coefficients entering the first row of PEs $(0,0,0)$ and $(1,0,0)$ are fed
into the array without any preceding zero. The $W_{0a}$ coefficients for the second row
of PEs $(0,1,0)$ and $(1,1,0)$ are delayed by 2 clocks, that is, preceded by two zeros, and are represented
as $-2w_{0a}^0$.
Similarly, the $W_{0a}$ coefficients entering each of the rows in the X-Y plane are represented
as $0w_{0a}^0$, $-2w_{0a}^0$, $-4w_{0a}^0$, ..., $-18w_{0a}^0$.
The first output of $P(0,0,0)$, represented as $y_0^0$, is generated after 10 clock
cycles, and the first output of $P(1,0,0)$, represented as $y_0^1$, is generated after 11
clock cycles. The first output of $P(0,1,0)$, represented as $y_0^1$, is generated after
12 clock cycles. The outputs of the remaining processing elements are generated similarly.
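The delayed coefficient feed can be illustrated by building the stream that enters each PE row: row $r$ receives the coefficients preceded by $2r$ zeros, giving delays of 0, 2, 4, ..., 18 clocks over 10 rows. The function name and the coefficient values below are hypothetical.

```python
def skewed_streams(coeffs, n_rows=10, step=2):
    """Return one coefficient stream per PE row, delayed by step * row zeros."""
    return [[0] * (step * r) + list(coeffs) for r in range(n_rows)]


coeffs = list(range(1, 11))   # hypothetical non-zero filter coefficients
streams = skewed_streams(coeffs)
# Number of zeros preceding the first coefficient in each row.
leading_zeros = [s.index(coeffs[0]) for s in streams]
print(leading_zeros)  # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```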
The Z-Y plane of the 3D systolic array structure is designed to compute the outputs of
four filters simultaneously. The PEs are arranged as shown in Fig. 12 and the data $X$ enters the array from the bottom into all the columns. There
are 10 rows of PEs and in each row there are four PEs. The filter coefficients $W_{0a}$,
$W_{0b}$, $W_{1a}$ and $W_{1b}$ are fed into the array as shown in Fig. 12 with preceding zeros as discussed previously to meet the matrix operations of Eq.
(7). The outputs of PEs $(0,0,0)$, $(0,0,1)$, $(0,0,2)$ and $(0,0,3)$ are generated at the
10th clock cycle and the outputs of all other PEs are generated after 2 clock cycle
delays from the reference PE $(0,0,0)$. The systolic array architecture model is developed
as a 3D structure to process data simultaneously and generate the required output
for the CNN model. The design is hierarchically modelled from basic elements to processing
elements to the 3D structure. A Verilog model of the structure is developed along with
a test bench to verify the functionality.
Fig. 11. Data movement in 3D systolic array along X-Y plane.
Fig. 12. Data processing along Z-Y plane in 3D systolic array structure.
5. Results and Discussion
A random number generator is designed to generate
integer numbers between 0 and 255, and the data is stored in the internal memory. The
control unit of the FCNN processor loads the data into the processor unit for data
decomposition. The simulation results are obtained and compared with simulation results
from the MATLAB environment. In order to identify the hardware resource utilization, synthesis
is carried out in the Xilinx ISE environment. In the synthesized netlist generated,
the first stage comprises the row processor and the second stage comprises the column
processor. The column processor processes the outputs of the row processor and generates
16 sub bands. The input data width and output data width are set to 20 bits so that
overflow in the output data is avoided. The ‘rst’ and ‘clk’ ports are also included
in the design to ensure data is synchronized. The ‘load’ pin is enabled to load the
data into the FCNN processor and is synchronized with the clock input. Post
synthesis simulation results of the FCNN processor are verified to check functionality.
Data_in is the input data, each sample being 20 bits wide; the input data is randomly
selected to be between 0 and 32. The row processing outputs of FF1 and FF2 are presented
in the simulation results and the L output is further processed by the column processor
to generate the four sub bands. Similarly, the next group of four outputs is also
presented in the simulation results; these are computed by considering only the FF2
outputs from the first stage.
The functionally correct Verilog HDL model is synthesized targeting a Virtex-5 FPGA.
The synthesis report generated in the Xilinx ISE environment is used for estimation
of hardware resources on the FPGA. From the synthesis report,
the total resources occupied are identified so that further optimization
can be planned. The resources estimated from the synthesis report are equivalent
to the actual resources required for implementation.
Table 2 presents the comparison of hardware resources of FCNN considering two methods: with
pipelined structure and without pipelined structure. The operating frequency in the pipelined
architecture is 70 MHz higher than in the approach without pipelining.
Table 2. Implementation results of FCNN (with pipeline).
| Slice logic utilization | Systolic array | Direct implementation |
|---|---|---|
| Number of slice registers | 2,876 | 2,235 |
| Number of slice LUTs | 2,764 | 2,123 |
| Max. freq. of operation | 277.2 MHz | 201.3 MHz |
The systolic array architecture designed in this work occupies 22% more slice register
resources than the direct implementation of FCNN. However, the operating frequency of the
systolic array architecture is higher than that of the direct implementation by 27%.
The processing time of FCNN is reduced, and hence there is an improvement in CNN operating
speed, which can be developed to operate at 277 MHz. Table 3 presents the FPGA implementation results of the FCNN structure proposed in this work
(without pipeline). The parallel processing structure requires 7122 registers and 8734
slices and operates at a maximum frequency of 155.4 MHz. Table 4 presents the power dissipation report. The total power dissipation is estimated to
be 1.922 W, and the power is increased by 0.1 W as compared with the architecture without
parallel and pipelined processing.
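The percentage figures quoted in this section can be cross-checked against the tabulated values. The short calculation below assumes, as the numbers suggest, that each percentage is the difference between the two implementations divided by the systolic array value; the function name is hypothetical.

```python
def relative_gap(systolic, direct):
    """Difference between the two designs as a percentage of the systolic value."""
    return 100.0 * (systolic - direct) / systolic


reg_gap = relative_gap(2876, 2235)            # slice registers, with pipeline
freq_gap_pipe = relative_gap(277.2, 201.3)    # max. frequency, with pipeline
freq_gap_nopipe = relative_gap(155.4, 128.2)  # max. frequency, without pipeline
print(round(reg_gap, 1), round(freq_gap_pipe, 1), round(freq_gap_nopipe, 1))
# 22.3 27.4 17.5
```

Under this convention the values agree with the quoted 22%, 27% and 17%. Measured relative to the direct implementation instead, the same differences would be about 28.7% for slice registers and 37.7% for the pipelined operating frequency.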
Table 3. Implementation results of FCNN (without pipeline).
| FPGA utilization | Systolic array | Direct implementation |
|---|---|---|
| Number of slice registers | 7,123 | 5,534 |
| Number of slice LUTs | 8,733 | 6,700 |
| Max. freq. of operation | 155.4 MHz | 128.2 MHz |
Table 4. Power report of FCNN (with pipeline).
| Parameter | Systolic array | Direct implementation |
|---|---|---|
| Total quiescent power | 0.820 W | 0.800 W |
| Total dynamic power | 1.190 W | 1.11 W |
The operating frequency of the systolic array architecture based FCNN implementation is
higher by 17% as compared with the direct implementation. From the implementation
results, the systolic array architecture is faster and is more useful for real time
image registration.
In order to estimate the contribution of the high speed FCNN model and its impact on the processing
speed of the CNN, a hardware-software co-simulation model is developed in this work. In
this model, the CNN model is developed in MATLAB, and the FCNN model is developed
as both a hardware and a software model. The hardware model is the FPGA implementation
of FCNN and the software model is the MATLAB model. The software model of the CNN is
run in MATLAB and an input image is processed to estimate the computation time. The computing
platform considered is an Intel i5 processor with four cores. The estimated time is a
few milliseconds for the inference model for an image of size $2048 \times 2048$,
of which the CNN model requires 50% of the processing time. In the co-simulation environment,
the FCNN model working on FPGA operates at 277 MHz and the processing delay is estimated
to be nearly 4 milliseconds. The inference time of the developed model with hardware
implementation of FCNN is reduced by 48%.
In the FCNN structure the power dissipation is estimated to be 3 W, which needs
to be reduced to 1.5 W by adopting low power methods for implementation. The power
dissipation in the systolic array architecture is higher than in the direct implementation, since
power dissipation increases with operating frequency. It is required
to develop low power methods to reduce power dissipation in the systolic array architecture
design.
Table 5. Power report of FCNN (without pipeline).
| Parameter | Systolic array | Direct implementation |
|---|---|---|
| Total quiescent power | 1.94 W | 1.93 W |
| Total dynamic power | 1.14 W | 1.018 W |
6. Conclusion
In this work a systolic array based FCNN architecture for image decomposition using
two systolic processors is designed and implemented on FPGA. The systolic array architecture
is designed with parallel processing modules and a pipelined structure to improve throughput.
With additional modules for arithmetic operations, the computation complexity is improved
by 4%. The operating speed is improved to 156 MHz and power dissipation is limited
to 3 W. The pipelined structure requires additional resources and power; however,
the processing speed is faster than direct implementation. Having validated the model
on FPGA and verified the results against the MATLAB model, the developed FCNN is suitable
for ASIC implementation and IP development. In order to further improve the operating
speed, dedicated architectures need to be designed.
Acknowledgement
I would like to acknowledge Dr. Cyril Prasanna Raj P for his valuable inputs and efforts.
References
[1] A. A. Elngar, M. Arafa, A. Fathy, B. Moustafa, O. Mahmoud, M. Shaban,
N. Fawzy, Image classification based on CNN: a survey, Journal of Cybersecurity
and Information Management, Vol. 6, No. 1, pp. 18-50, 2021

[2] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, K.
Keutzer, SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB
model size, arXiv preprint arXiv:1602.07360, 2016

[3] F. Chollet, Xception: deep learning with depthwise separable convolutions, Proc.
of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1800-1807,
2017

[4] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand,
M. Andreetto, H. Adam, MobileNets: Efficient convolutional neural networks
for mobile vision applications, arXiv preprint arXiv:1704.04861, 2017

[5] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale
image recognition, arXiv preprint arXiv:1409.1556, 2014

[6] A. Krizhevsky, I. Sutskever, G. Hinton, ImageNet classification with deep
convolutional neural networks, Advances in Neural Information Processing Systems,
Vol. 25, 2012

[7] J. Cong, B. Xiao, Minimizing computation in convolutional neural networks,
Proc. of International Conference on Artificial Neural Networks, 2014

[8] H. Kim, K. Choi, Low power FPGA-SoC design techniques for CNN-based object
detection accelerator, Proc. of 2019 IEEE 10th Annual Ubiquitous Computing, Electronics
and Mobile Communication Conference, pp. 1130-1134, 2019

[9] V. Sze, Y.-H. Chen, J. Emer, A. Suleiman, Z. Zhang, Hardware for machine
learning: challenges and opportunities, arXiv preprint arXiv:1612.07625, 2016

[10] M. Sankaradas, V. Jakkula, S. Cadambi, S. Chakradhar, I. Durdanovic,
E. Cosatto, H. P. Graf, A massively parallel coprocessor for convolutional
neural networks, Proc. of 20th IEEE International Conference on Application-specific
Systems, Architectures and Processors, pp. 53-60, 2009

[11] V. Gokhale, J. Jin, A. Dundar, B. Martini, E. Culurciello, A 240 G-ops/s
mobile coprocessor for deep neural networks, Proc. of the IEEE Conference on Computer
Vision and Pattern Recognition Workshops, pp. 682-687, 2014

[12] Y.-H. Chen, T. Krishna, J. S. Emer, V. Sze, Eyeriss: An energy-efficient
reconfigurable accelerator for deep convolutional neural networks, IEEE Journal of
Solid-State Circuits, Vol. 52, No. 1, pp. 127-138, 2017

[13] C. Qiu, X. Wang, T. Zhao, Q. Li, B. Wang, H. Wang, An FPGA-based
convolutional neural network coprocessor, Machine Learning in Mobile Computing: Methods
and Applications, Vol. 2021, 2021

[14] T. Sledevic, A. Serackis, D. Plonis, FPGA implementation of a convolutional
neural network and its application for pollen detection upon entrance to the beehive,
Agriculture, Vol. 12, No. 11, pp. 1849, 2022

[15] X. Yang, C. Zhuang, W. Feng, Z. Yang, Q. Wang, FPGA implementation
of a deep learning acceleration core architecture for image target detection, Applied
Sciences, Vol. 13, No. 7, pp. 4144, 2023

[16] Z. Zang, D. Xiao, Q. Wang, Z. Jiao, Y. Chen, D. D. Li, Compact
and robust deep learning architecture for fluorescence lifetime imaging and FPGA implementation,
Methods and Applications in Fluorescence, Vol. 11, No. 2, pp. 025002, 2023

[17] X. Sui, Q. Lv, L. Zhi, B. Zhu, Y. Yang, Y. Zhang, Z. Tan, A
hardware-friendly high-precision CNN pruning method and its FPGA implementation, Sensors,
Vol. 23, No. 2, pp. 824, 2023

Pottipati Dileep Kumar Reddy
Pottipati Dileep Kumar Reddy is currently working as a research scholar at YSR Engineering
College of Yogi Vemana University in the Electronics & Communication Engineering Department.
He received his M.Tech. degree in VLSI design from SRM University India, in 2014.
In 2012, he completed his B.Tech degree in electronics & communication engineering
from JNTUA, Ananthapur, India through the 4-year program. His research interests include
quantum computing, image processing, neural networks, and VLSI design. He can also
be contacted through email: dkr.pottipati@gmail.com.
Kota Venakata Ramanaih is a professor in the Electronics and Communication Engineering
Department. He is currently working as Dean, Faculty of Engineering, Y.S.R. Engineering College
of Yogi Vemana University, Proddatur. He has more than 28 years of experience in teaching.
His areas of research interests include low-power VLSI design architectures, image
processing, neural network based image compression, etc. He published papers in more
than 115 international and national journals. He received the Adarsh Vidya Saraswati
Rashtriya Puraskar National Award. Under his guidance, nine students received Ph.D.
degrees from Yogi Vemana University Kadapa, JNTUA Ananthapuram, and JNTUK Kakinada.
He obtained his Ph.D. from JNT University of Hyderabad in 2009, M.Tech from JNTU College
of Engineering Kukatpally Hyderabad in 1998, and B.E. from KBNCE Gulbarga University
Gulbarga in 1992.