
  1. Department of Electrical and Computer Engineering, Seoul National University, Seoul 08826, Korea ({lukpcm, jckim, silverhj, truongnx, hyuk_jae_lee}@capp.snu.ac.kr)



Keywords: CNN accelerator, Processing element, Hardware utilization, FPGA, YOLO-v3

1. Introduction

Object detection has been actively studied for a broad range of applications across various domains, such as traffic monitoring [2], unmanned stores [3], and autonomous driving [4]. Many of these applications require low-latency, real-time responses (i.e., at least 30 frames per second). Owing to the rapid evolution of deep neural networks (DNNs), object detection models are improving quickly in both accuracy and execution time. In particular, single-stage object detectors, such as YOLOv3 [5] and EfficientDet [6], achieve a good tradeoff between model accuracy and real-time execution on graphics processing units (GPUs). Unfortunately, GPUs consume considerable power, making them unsuitable for many energy/power-constrained applications.

In recent years, alternative solutions such as field-programmable gate arrays (FPGAs) have received increasing attention for DNN acceleration because of their low latency, good power efficiency, high configurability, and rapid prototyping. In particular, many FPGA implementations of YOLO accelerators [7-9] have been proposed. The authors of [7] proposed a streaming architecture for YOLO-v2 and its tiny version in which each layer has its own processing elements (PEs). Despite the fast inference speed due to the high degree of parallelism and pipelining among multiple layers, the architecture requires a huge buffer (BRAM) capacity. As a result, it is only suitable for highly customized networks, such as those with binarized weights or activations, which generally suffer from a large accuracy drop. ShortcutFusion [1] proposes a generic CNN accelerator that effectively supports various networks, including MobileNet-v2, EfficientNet-B0, ResNet-50, and YOLO-v3. The accelerator consists of 4096 eight-bit multipliers and adder trees, which work in parallel, to achieve high accuracy and performance. Unfortunately, its processing elements are not fully utilized, leading to relatively low hardware utilization (e.g., 68.42% for YOLO-v3).

To address this problem, this paper proposes ShortcutFusion++, an improved version of ShortcutFusion that specifically optimizes the PE utilization of the baseline. In particular, two common and high-impact PE under-utilization cases were observed, and methods to solve them are proposed. The contributions of this paper are as follows:

1) Under-utilization: two common cases of low PE utilization were observed when mapping YOLO-v3 onto ShortcutFusion. Specifically, the baseline dataflow showed relatively low utilization for stride = 2 convolutions (34.01%) and for the row-based weight reuse scheme (33.89%).

2) Proposed method: this paper proposes a flexible prefetching scheme and a redesigned output buffer to address the abovementioned cases. With the proposed approaches, ShortcutFusion++ avoids unnecessary stall cycles when feeding data to the PEs and when writing results to external memory.

3) Experiments: the experimental results show that ShortcutFusion++ achieves 80.95% hardware utilization for YOLO-v3, outperforming its baseline by 12.53 percentage points.

The remainder of this paper is organized as follows. Section 2 introduces the background related to ShortcutFusion. In Section 3, the optimization methods are described. Section 4 presents the evaluation method and experimental results, and Section 5 concludes the paper.

2. Related Works

2.1 CNN Accelerators and Processing Elements

With millions of multiply-accumulate (MAC) operations, a convolutional (CONV) layer can be expressed as six or seven nested loops [10]. Alternatively, a CONV layer can be transformed into a general matrix-matrix or matrix-vector multiplication using the im2col transform [11]. As a result, two typical PE designs for generic CNN accelerators are systolic arrays [12,13] and inner-product multipliers with an adder tree [14,15]. ShortcutFusion [1] consists of $T_{o}$ CONV kernels, each of which consists of $T_{i}$ multipliers and an adder tree. In particular, $T_{i}$ and $T_{o}$ were both set to 64, resulting in 4096 multipliers in total.

Assume that $C_{i}$ and $C_{o}$ are the number of input feature channels and output filters, respectively, for a given CONV layer. The number of computing cycles is as follows:

(1)
$ cycle=H\times W\times \left\lceil \frac{K\times K\times C_{i}}{T_{i}}\right\rceil \times \left\lceil \frac{C_{o}}{T_{o}}\right\rceil $

where K is the filter size, and H and W are the height and width of the output feature maps, respectively.
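As a quick sanity check, Eq. (1) is easy to evaluate in code. The sketch below (plain Python; the $T_{i}=T_{o}=64$ defaults match the configuration reported for ShortcutFusion, while the example layer shape is an illustrative assumption, not a specific layer from [1]) counts the compute cycles of a CONV layer.

```python
from math import ceil

def conv_cycles(K, Ci, Co, H, W, Ti=64, To=64):
    """Cycle count per Eq. (1): H * W * ceil(K*K*Ci / Ti) * ceil(Co / To)."""
    return H * W * ceil(K * K * Ci / Ti) * ceil(Co / To)

# Illustrative 3x3 layer with Ci = 256, Co = 512 and 52x52 OFMs:
# ceil(9*256/64) = 36 weight-block steps, ceil(512/64) = 8 filter tiles.
cycles = conv_cycles(3, 256, 512, 52, 52)
```

Note that the ceilings round partial tiles up to full tiles, which is exactly where PE idling appears when $K^{2}C_{i}$ or $C_{o}$ is not a multiple of $T_{i}$ or $T_{o}$.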

2.2 ShortcutFusion

Fig. 1 shows the architecture of the CNN accelerator in [1]. In particular, it consists of a controller, two DMA modules for loading the weights and the input feature maps (IFMs) of the model, and one DMA module for writing the output feature maps (OFMs). The controller selects either a row-based weight reuse scheme or a frame-based weight reuse scheme. Notably, although there are many loop interchange or tiling options for six nested loops, ShortcutFusion utilizes only two weight reuse schemes, based on the observation that a CONV layer typically has either (1) large IFMs with a small number of weights or (2) small IFMs with a large number of weights.

The two reuse schemes are described in Fig. 2. The frame-based reuse scheme is utilized when the IFMs are small enough to be stored in the on-chip buffer. In particular, it reuses the weight blocks (i.e., $K\times K\times T_{i}$ weights each) while the input sliding cube (i.e., $K\times K\times T_{i}$ pixels) passes through a single frame of the IFM, as shown in Fig. 2(a). The input data in the sliding cube are convolved with $T_{o}$ weight blocks, generating the partial sum of the OFM (i.e., $H\times W\times T_{o}$ pixels). When the input sliding cube hits the end of the frame, it moves in the channel direction to generate the next partial sum, which is accumulated onto the previous result.

The row-based weight reuse scheme is utilized when the IFMs are relatively large, and the number of weights is small. In particular, weights are preloaded and reused while the input sliding cube passes through a single row, as shown in Fig. 2(b). This generates the partial sum of the OFM (i.e., $1\times W\times T_{o}$ pixels). When the sliding cube hits the end of the row, it moves towards the channel direction, remaining in the same row. By remaining in the same row, the generated output can be accumulated to the previous output, which becomes the final output if the input sliding cube hits the end of the channel.
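The two reuse schemes amount to different loop orders over the same convolution. The toy model below (pure Python with made-up tensor shapes; an illustrative sketch, not the accelerator's RTL) contrasts the frame-based and row-based orders and checks both against a direct convolution. `Ti` plays the role of the channel tile the PEs consume per step; stride 1, no padding.

```python
def conv_direct(ifm, wts, K):
    # reference: out[co][y][x] = sum over ci, ky, kx
    Ci, H, W = len(ifm), len(ifm[0]), len(ifm[0][0])
    Co, Ho, Wo = len(wts), H - K + 1, W - K + 1
    return [[[sum(ifm[ci][y + ky][x + kx] * wts[co][ci][ky][kx]
                  for ci in range(Ci) for ky in range(K) for kx in range(K))
              for x in range(Wo)] for y in range(Ho)] for co in range(Co)]

def conv_frame_reuse(ifm, wts, K, Ti):
    # frame-based: a weight block stays resident while the sliding cube covers
    # the whole frame; partial sums accumulate in a full-frame on-chip OFM buffer
    Ci, H, W = len(ifm), len(ifm[0]), len(ifm[0][0])
    Co, Ho, Wo = len(wts), H - K + 1, W - K + 1
    ofm = [[[0] * Wo for _ in range(Ho)] for _ in range(Co)]
    for c0 in range(0, Ci, Ti):              # move the cube in the channel direction last
        for y in range(Ho):
            for x in range(Wo):              # cube sweeps the entire frame
                for co in range(Co):
                    ofm[co][y][x] += sum(
                        ifm[ci][y + ky][x + kx] * wts[co][ci][ky][kx]
                        for ci in range(c0, min(c0 + Ti, Ci))
                        for ky in range(K) for kx in range(K))
    return ofm

def conv_row_reuse(ifm, wts, K, Ti):
    # row-based: weights are reused only across one output row; all channel
    # tiles of a row accumulate into a 1 x Wo row buffer before the final row
    # result is written out (to off-chip memory, in the real accelerator)
    Ci, H, W = len(ifm), len(ifm[0]), len(ifm[0][0])
    Co, Ho, Wo = len(wts), H - K + 1, W - K + 1
    rows = []
    for y in range(Ho):
        row = [[0] * Wo for _ in range(Co)]
        for c0 in range(0, Ci, Ti):          # stay in the row, step channel tiles
            for x in range(Wo):
                for co in range(Co):
                    row[co][x] += sum(
                        ifm[ci][y + ky][x + kx] * wts[co][ci][ky][kx]
                        for ci in range(c0, min(c0 + Ti, Ci))
                        for ky in range(K) for kx in range(K))
        rows.append(row)                     # row complete -> write out
    return [[rows[y][co] for y in range(Ho)] for co in range(Co)]
```

The only difference between the two functions is which loop encloses the channel-tile loop, i.e., whether partial sums live for a whole frame or only for one row.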

The main controller selects either the frame-based or the row-based dataflow depending on the weight reuse scheme. In the frame-based dataflow (red arrow in Fig. 1), the IFMs are fetched from the on-chip buffer and pass through the line buffer and the CONV window module; when the CONV kernel generates the OFMs, it writes them back to the on-chip buffer. In the row-based dataflow (blue arrow in Fig. 1), however, the input loader module fetches the IFMs from off-chip memory using DMA; when the CONV kernel generates the OFMs, it writes them to the output buffer, and the output writer module then stores them to off-chip memory using DMA.

Fig. 1. Block diagram of the CNN accelerator in [1].
Fig. 2. Weight reuse schemes in ShortcutFusion: (a) frame-based weight reuse; (b) row-based weight reuse. Borrowed from Fig. 3 in [1].

2.3 Motivations

YOLO-v3 [5] is a well-known object detector that achieves a good tradeoff between model accuracy and real-time execution. In particular, the numbers of input and output channels $C_{i}$ and $C_{o}$ in YOLO-v3 are 64, 128, 256, 512, and 1024, i.e., multiples of $T_{i}$ and $T_{o}$. Therefore, according to Eq. (1), the PEs of ShortcutFusion are expected to be almost fully utilized. Unfortunately, as reported in [1], ShortcutFusion achieves a utilization of only 68.42% for YOLO-v3. This observation prompted this study to identify the sources of under-utilization, in which the PEs are forced to stay idle during data movement.

3. Proposed Work

3.1 Under-utilization of PEs

This subsection quantifies the layer-wise PE utilization of ShortcutFusion for YOLO-v3. Although the frame-based weight reuse scheme generally achieves higher utilization than the row-based one, it may require a huge on-chip buffer to store the IFMs. Therefore, following [1], the cutpoint is set to 9 to meet the on-chip buffer size constraint. As a result, the row-based weight reuse scheme is applied to CONV layers 0-8, while the frame-based weight reuse scheme is applied to the remaining CONV layers 9-76. Table 1 lists the profiling results. The ‘Frame-based’ category (i.e., layers with the frame-based dataflow) shows a relatively high PE utilization of 81.15%, whereas the ‘Row-based’ category (i.e., layers with the row-based dataflow) shows a poor PE utilization of 33.89%. In particular, the ‘Stride = 2’ category (i.e., layers 1, 4, 9, 26, and 43 with stride = 2) suffers from severe under-utilization, with a PE utilization of 34.01%. The phenomenon occurs for both the row-based scheme (e.g., layers 1 and 4) and the frame-based scheme (e.g., layers 9, 26, and 43). Because these stride = 2 layers account for 24.35% of the overall execution time, their low utilization drags down the utilization of the entire network.

The following subsections analyze the source of low utilization on those layers and propose methods to enhance the utilization.

Table 1. Average PE utilization of CNN accelerator in ShortcutFusion.

Category      Layer #            Runtime Ratio   Avg. util
Row-based     0–8                26.95%          33.89%
Frame-based   9–76               73.05%          81.15%
Stride = 2    1, 4, 9, 26, 43    24.35%          34.01%
Overall       0–76               100%            68.42%

3.2 Optimization on Stride = 2 Convolution

Fig. 3 shows the dataflow during a 3×3 convolution. The CONV kernel needs nine cycles to consume one window of data (i.e., $3\times 3\times T_{i}$ pixels). Therefore, to keep pace with the CONV kernel, the controller must fetch one window of data every nine cycles.

In the case of stride = 1 convolution, however, two columns of data (i.e., $3\times 2\times T_{i}$ pixels) can be reused from the previous window, as shown in Fig. 3(a). As a result, only one column of data (i.e., $3\times 1\times T_{i}$ pixels) needs to be fetched by the controller every nine cycles.

Fig. 3(b) shows the dataflow during stride = 2 convolution, in which only one column of data can be reused from the previous window. Therefore, two columns of data must be fetched every nine cycles. However, because the fetching speed of the original accelerator is fixed, fetching two columns takes 18 cycles. This leaves the CONV kernel idle for nine cycles, which sharply decreases PE utilization.

To resolve this issue, a flexible prefetching scheme is proposed (Fig. 3(c)). When conducting stride = 2 convolutions, the controller increases the data fetching speed, fetching two columns of data every nine cycles. As a result, the window data are ready every nine cycles, and the CONV kernel does not need to wait for the next window data, which avoids unnecessary stall cycles.
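A back-of-the-envelope model of this timing (assuming the nine-cycle window consumption and the column-fetch rates described above; the function name is ours) shows where the stall comes from and why raising the fetch rate removes it.

```python
CYCLES_PER_WINDOW = 9  # the CONV kernel consumes one 3x3xTi window in 9 cycles

def stall_cycles_per_window(stride, columns_per_9_cycles):
    # stride 1 reuses two of the three window columns (1 new column needed);
    # stride 2 reuses only one column (2 new columns needed)
    new_columns = stride
    fetch_cycles = new_columns * CYCLES_PER_WINDOW / columns_per_9_cycles
    return max(0, fetch_cycles - CYCLES_PER_WINDOW)
```

With the baseline's fixed rate of one column per nine cycles, each stride = 2 window stalls the kernel for nine of every 18 cycles (roughly 50% utilization), while the flexible scheme's rate of two columns per nine cycles brings the stall to zero.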

One more optimization point exists to increase PE utilization during stride = 2 convolution. To fetch the column data, the controller requires three rows of data (i.e., $3\times W\times T_{i}$ pixels) in the line buffer. Therefore, three rows of data should be prefetched to the line buffer before the computation starts. After the first three rows have been prefetched, however, the amount of data that needs to be prefetched becomes smaller because the dataflow can reuse row data already in the line buffer.

In the case of stride = 1 convolution, prefetching only a single row of data (i.e., $1\times W\times T_{i}$ pixels) makes all three rows ready because two rows can be reused from the line buffer. For stride = 2 convolution, however, two rows of data are required because only one row can be reused. The original accelerator always prefetches the same amount of row data (i.e., a single row) regardless of the layer type; therefore, it suffers from under-utilization during stride = 2 convolution because of insufficient prefetching.

This issue can also be resolved by the flexible prefetching scheme, which flexibly chooses the amount of prefetching. Using the proposed method, the amount of prefetched row data is increased for stride = 2 layers; hence, the under-utilization is eliminated.
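The row-prefetch rule implied by the two cases above can be summarized in one line (an illustrative sketch; the constants assume the three-row line buffer of a 3×3 kernel).

```python
def rows_to_prefetch(stride, first_window=False):
    # the very first window needs all 3 rows in the line buffer; afterwards,
    # rows reused from the line buffer leave only `stride` fresh rows to fetch
    return 3 if first_window else stride
```

The baseline's fixed single-row prefetch matches this rule only for stride = 1; at stride = 2 it falls one row short every step, which is the insufficient-prefetching stall described above.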

Fig. 3. Dataflow during a 3×3 convolution: (a) stride = 1; (b) stride = 2; (c) stride = 2 with the optimization method.

3.3 Optimization on Row-based Dataflow

This subsection analyzes the source of under-utilization in the row-based dataflow and presents an optimization method to increase its PE utilization.

During the row-based dataflow, PE utilization is lower than during the frame-based dataflow because the feature maps reside in a different location. In the frame-based dataflow, the accelerator reads the IFMs from the on-chip buffer and writes the OFMs back to it. The on-chip buffer provides very high bandwidth, which supports the rapid movement of input and output data. This makes it easy to pipeline the whole computation and keep the PEs busy.

In the row-based dataflow, however, the accelerator reads the IFMs from off-chip memory and writes the OFMs back to off-chip memory. Because this data movement is slow, it is difficult to fully pipeline the computation, so PE utilization is lower than in the frame-based dataflow.

Fig. 4 shows a timing diagram of the CONV operation and the DMA operation in the row-based dataflow. As shown in Fig. 4(a), the CONV and DMA operations are not pipelined, leading to low PE utilization. The two operations cannot be pipelined because of a data hazard caused by concurrent access to the output buffer. During the CONV operation, the CONV kernel writes OFM data to the output buffer; during the DMA operation, the output writer reads OFM data and transfers it to off-chip memory. If the two operations were pipelined, the OFM data could be overwritten by the next CONV operation before the DMA transfer completes. Although the next CONV operation starts later than the DMA, the data hazard can still occur because of the low bandwidth of off-chip memory.

In the proposed method, the output buffer is reconstructed to enable pipelining of the two operations. The reconstructed output buffer consists of two separate buffers, and the two operations alternate between them (Fig. 4(b)). While the output writer reads OFM data from one buffer, the CONV kernel writes the following OFM data to the other. This removes the data hazard, and the two operations become well-pipelined.
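The ping-pong behavior of the two buffers can be sketched as a small simulation (a hypothetical Python model, not the RTL): in each step, the DMA drains the bank the CONV kernel filled in the previous step while the kernel fills the other bank, and the assertion checks that no bank is overwritten before it is drained.

```python
def run_pipeline(ofm_rows):
    """Simulate the double-buffered output path; returns rows as written to DRAM."""
    banks = [None, None]  # two separate output buffers (ping-pong)
    dram = []
    for i, row in enumerate(ofm_rows + [None]):  # one extra step to drain
        if i > 0:  # DMA: drain the bank the CONV kernel filled last step
            dram.append(banks[(i - 1) % 2])
            banks[(i - 1) % 2] = None
        if row is not None:  # CONV: fill the other bank concurrently
            assert banks[i % 2] is None, "data hazard: bank not yet drained"
            banks[i % 2] = row
    return dram
```

With a single buffer, the CONV write at step i would target the same bank the DMA is still draining; alternating the bank index is what removes the hazard.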

Fig. 4. Timing diagram of CONV operation and DMA operation in row-based dataflow: (a) Before optimization, both operations are not pipelined; (b) After optimization, the operations are well-pipelined.

4. Performance Evaluation

4.1 Evaluation Method

This subsection presents the evaluation method for PE utilization. PE utilization is measured as the total operation count divided by the maximum number of operations achievable with a given number of PEs. Since each PE can execute two operations (i.e., one multiplication and one addition) every cycle, the maximum number of operations is $2\times \left(PE\,\,count\right)\times \left(cycle\,\,count\right)$. Therefore, PE utilization can be formulated as follows:

(2)
$ PE\,utilization=\frac{OP_{MUL}+OP_{ADD}}{2\times T_{i}\times T_{o}\times cycle} $

where $OP_{MUL}$ is the number of multiplications; $OP_{ADD}$ is the number of additions; $cycle$ is the number of cycles.

In addition, $OP_{MUL}$ and $OP_{ADD}$ are formulated as follows:

(3)
$ OP_{MUL}=K^{2}\times C_{i}\times C_{o}\times H\times W $

(4)
$ OP_{ADD}=OP_{MUL}\times \frac{K^{2}\times C_{i}-1}{K^{2}\times C_{i}} $

where $K$ is the width of the convolution kernel; $C_{i}$ is the number of input feature channels; $C_{o}$ is the number of output filters; $H$ is the height of OFMs; $W$ is the width of the OFMs.

The number of cycles can be obtained from RTL simulation. Therefore, PE utilization can be measured using these equations.
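The measurement above is straightforward to reproduce in code. The sketch below (plain Python; $T_{i}=T_{o}=64$ as in ShortcutFusion, layer shapes chosen for illustration) combines the operation-count formulas with the ideal cycle count of Eq. (1): a layer whose $K^{2}C_{i}$ and $C_{o}$ divide evenly by $T_{i}$ and $T_{o}$ approaches full utilization, while a thin first layer (e.g., $C_{i}=3$) cannot fill the 64-wide adder trees even with ideal cycles.

```python
from math import ceil

def cycles_eq1(K, Ci, Co, H, W, Ti=64, To=64):
    # ideal cycle count from Eq. (1)
    return H * W * ceil(K * K * Ci / Ti) * ceil(Co / To)

def pe_utilization(K, Ci, Co, H, W, cycles, Ti=64, To=64):
    op_mul = K * K * Ci * Co * H * W                    # multiplications
    op_add = op_mul * (K * K * Ci - 1) / (K * K * Ci)   # adder-tree additions
    return (op_mul + op_add) / (2 * Ti * To * cycles)

# channels divisible by Ti/To: utilization is nearly 1 with ideal cycles
u_mid = pe_utilization(3, 256, 512, 52, 52, cycles_eq1(3, 256, 512, 52, 52))
# a 3-channel layer leaves most of each 64-wide adder tree empty
u_first = pe_utilization(3, 3, 32, 208, 208, cycles_eq1(3, 3, 32, 208, 208))
```

In practice, the cycle count comes from RTL simulation rather than from Eq. (1), so the measured utilization also captures stall cycles such as those analyzed in Section 3.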

4.2 Evaluation Result

This section shows the results of the two optimization methods compared to the original CNN accelerator [1]. All results are based on the YOLO-v3 network. As shown in Table 2, with the stride = 2 convolution optimization, the average PE utilization of the stride = 2 convolutions reaches 67.21%, an improvement of 33.20 percentage points over the baseline. This is because the proposed flexible data prefetching scheme removes the unnecessary stall cycles. In addition, with the row-based dataflow optimization, the row-based layers achieve a PE utilization of 52.53%, an improvement of 8.44 percentage points (19.1% relatively). This shows that the reconstructed output buffer successfully increases PE utilization by pipelining the PE operations and the writing of OFMs.

Fig. 5 shows the layer-wise PE utilization at the three different optimization steps. Layers 1, 4, 9, 26, and 43 perform stride = 2 convolutions, so the stride = 2 convolution optimization (yellow in Fig. 5) shows improvement in those layers. Layers 0 to 8 use the row-based dataflow, so the row-based dataflow optimization (blue in Fig. 5) shows improvement there.

Table 3 lists the overall PE utilization as the optimizations are applied. The results indicate that ShortcutFusion++ achieves a PE utilization of 80.95%, a 12.53-percentage-point improvement over the baseline.

Fig. 5. Layer-wise PE utilization with the optimization method.
Table 2. PE utilization of target layer with optimization method.

Optimization   Before Opt.   After Opt.   Improvement
Stride = 2     34.01%        67.21%       +33.20%
Row-based      44.09%        52.53%       +8.44%

Table 3. Overall PE utilization with optimization method.

Optimization                                                  PE utilization
Baseline                                                      68.42%
Stride = 2 Conv. Optimization                                 77.75%
Stride = 2 Conv. Optimization + Row-based Dataflow Opt.       80.95%

5. Conclusion

This paper reported the under-utilization of the processing elements in ShortcutFusion and proposed two optimization methods to increase PE utilization. By applying both optimizations, ShortcutFusion++ achieves a PE utilization of 80.95% for YOLO-v3, outperforming the baseline by 12.53 percentage points.

ACKNOWLEDGMENTS

This work was supported in part by the R&D Program of MOTIE/KEIT (No. 20010582, Development of deep learning based low power HW IP design technology for image processing of CMOS image sensors) and in part by the Technology Innovation Program (or Industrial Strategic Technology Development Program – No. 20014490, Development of Technology for Commercializing Lv.4 Self-driving Computing Platform Based on Centralized Architecture) funded by the Ministry of Trade, Industry & Energy (MOTIE, Korea).

REFERENCES

[1] D. T. Nguyen, H. Je, T. N. Nguyen, S. Ryu, K. Lee, and H.-J. Lee, "ShortcutFusion: From Tensorflow to FPGA-based accelerator with reuse-aware memory allocation for shortcut data," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 69, no. 6, pp. 2477-2489, 2022.
[2] S. P. Yadav, "Vision-based detection, tracking, and classification of vehicles," IEIE Transactions on Smart Processing & Computing, vol. 9, no. 6, pp. 427-434, 2020.
[3] H. Zhang et al., "Toward new retail: A benchmark dataset for smart unmanned vending machines," IEEE Transactions on Industrial Informatics, vol. 16, no. 12, pp. 7722-7731, 2019.
[4] J. Choi et al., "Efficient object detection acceleration methods for autonomous-driving embedded platforms," IEIE Transactions on Smart Processing & Computing, vol. 11, no. 4, pp. 255-261, 2022.
[5] J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement," arXiv preprint arXiv:1804.02767, 2018.
[6] M. Tan and Q. V. Le, "EfficientNet: Rethinking model scaling for convolutional neural networks," in Proc. International Conference on Machine Learning (ICML), 2019.
[7] D. T. Nguyen et al., "A high-throughput and power-efficient FPGA implementation of YOLO CNN for object detection," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 27, no. 8, pp. 1861-1873, 2019.
[8] D. T. Nguyen, H. Kim, and H.-J. Lee, "Layer-specific optimization for mixed data flow with mixed precision in FPGA design for CNN-based object detectors," IEEE Transactions on Circuits and Systems for Video Technology, vol. 31, no. 6, pp. 2450-2464, 2020.
[9] X. Zhang et al., "DNNBuilder: An automated tool for building high-performance DNN hardware accelerators for FPGAs," in Proc. IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2018.
[10] S. Dave et al., "dMazeRunner: Executing perfectly nested loops on dataflow accelerators," ACM Transactions on Embedded Computing Systems (TECS), vol. 18, no. 5s, pp. 1-27, 2019.
[11] L. Lai, N. Suda, and V. Chandra, "CMSIS-NN: Efficient neural network kernels for Arm Cortex-M CPUs," arXiv preprint arXiv:1801.06601, 2018.
[12] Y.-H. Chen et al., "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 127-138, 2016.
[13] H. T. Kung, "Algorithms for VLSI processor arrays," in Introduction to VLSI Systems, pp. 271-292, 1980.
[14] L. Bai, Y. Zhao, and X. Huang, "A CNN accelerator on FPGA using depthwise separable convolution," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 65, no. 10, pp. 1415-1419, 2018.
[15] Y. Ma et al., "Scalable and modularized RTL compilation of convolutional neural networks onto FPGA," in Proc. 26th International Conference on Field Programmable Logic and Applications (FPL), 2016.

Author

Chunmyung Park

Chunmyung Park received his B.S. degree in electrical and computer engineering from Seoul National University, Seoul, South Korea, in 2020. He is currently working toward an integrated M.S. and Ph.D. degree in electrical and computer engineering at Seoul National University, Seoul, South Korea. His current research interests include computer architecture and SoC for neural network processing.

Jicheon Kim

Jicheon Kim received his B.S. degree in electrical and computer engineering from the University of Seoul in 2011, and M.S. degree from Seoul National University, Seoul, South Korea, in 2013. From 2013 to 2017, he was with the SoC Division, GCT Semiconductor, Seoul, South Korea. In 2017, he joined the S. LSI Division, Samsung Electronics Corporation. His current research interests include computer architecture and SoC for machine learning.

Eunjae Hyun

Eunjae Hyun received his B.S. degree in biosystems engineering and M.S. degree in bioengineering from Seoul National University, Seoul, South Korea, in 2010 and 2014, respectively. He is currently working toward a Ph.D. degree in electrical and computer engineering at Seoul National University, Seoul, South Korea. He has participated in various projects at Samsung Electronics' DMC Research Center and S.LSI Division since 2014 and participated in the development of image signal processing algorithms integrated into commercial image sensors for the four years until 2021. His current research interests include computer architecture and SoC for neural network processing.

Xuan Truong Nguyen

Xuan Truong Nguyen received his B.S. degree in Electrical Engineering from Hanoi University of Science and Technology, Hanoi, Vietnam, in 2011, and his M.S. and Ph.D. degrees in Electrical Engineering and Computer Science from Seoul National University, Seoul, Korea, in 2015 and 2019, respectively. He is a postdoctoral fellow with the BK21+ program of the Department of Electrical and Computer Engineering, Seoul National University. His research interests include algorithm and SoC design for low-complexity computer vision and multimedia applications.

Hyuk-Jae Lee

Hyuk-Jae Lee received his B.S. and M.S. degrees in electronics engineering from Seoul National University, Seoul, South Korea, in 1987 and 1989, respectively, and his Ph.D. degree in electrical and computer engineering from Purdue University, West Lafayette, IN, USA, in 1996. From 1996 to 1998, he was a Faculty Member at the Department of Computer Science, Louisiana Tech University, Ruston, LA, USA. From 1998 to 2001, he was a Senior Component Design Engineer at the Server and Workstation Chipset Division, Intel Corporation, Hillsboro, OR, USA. In 2001, he joined the School of Electrical Engineering and Computer Science at Seoul National University, where he is a Professor. He is the Founder of Mamurian Design, Inc., Seoul, a fabless SoC design house for multimedia applications. His current research interests include computer architecture and SoC for multimedia applications.