Title ShortcutFusion++: Optimizing an End-to-End CNN Accelerator for High PE Utilization
Authors Chunmyung Park; Jicheon Kim; Eunjae Hyun; Xuan Truong Nguyen; Hyuk-Jae Lee
DOI https://doi.org/10.5573/IEIESPC.2022.11.6.474
Page pp.474-480
ISSN 2287-5255
Keywords CNN accelerator; Processing element; Hardware utilization; FPGA; YOLO-v3
Abstract ShortcutFusion [1] is an end-to-end framework that effectively maps many well-known deep neural networks (DNNs), such as MobileNet-v2, EfficientNet-B0, ResNet-50, and YOLO-v3, to a generic CNN accelerator on FPGA. Nevertheless, its processing elements (PEs) are not fully utilized when supporting various networks, leading to relatively low hardware utilization (e.g., 68.42% for YOLO-v3). This study aimed to enhance the performance of ShortcutFusion and introduce ShortcutFusion++ by proposing two simple but effective techniques for eliminating unnecessary stalls in the conventional design. First, the prefetching scheme was redesigned to avoid bubble cycles when feeding data to the PE array. Second, the output buffer was reconstructed to pipeline the operations of the PEs with the process of writing output feature maps to off-chip memory. The experimental results show that ShortcutFusion++ achieves a PE utilization of 80.95% for the well-known object detection network YOLO-v3, outperforming its baseline by 12.53 percentage points.
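To illustrate why overlapping data movement with computation raises PE utilization, the following is a minimal toy model, not the accelerator's actual timing model. It assumes every tile takes the same number of cycles to load, compute, and store, and that the pipelined case behaves like an ideal three-stage pipeline. The function name `pe_utilization` and all cycle counts are hypothetical and chosen only for illustration.

```python
def pe_utilization(num_tiles, load, compute, store, pipelined):
    """Return the fraction of total cycles in which the PE array is busy.

    Toy model (an assumption, not the paper's design): each tile needs
    `load` cycles to fetch inputs, `compute` cycles on the PE array, and
    `store` cycles to write outputs to off-chip memory.
    """
    busy = num_tiles * compute
    if pipelined:
        # Ideal 3-stage pipeline: the first tile pays the full latency,
        # each later tile adds only the slowest stage.
        total = (load + compute + store) + (num_tiles - 1) * max(load, compute, store)
    else:
        # No overlap: the PE array stalls during every load and store.
        total = num_tiles * (load + compute + store)
    return busy / total


if __name__ == "__main__":
    # Hypothetical cycle counts for illustration only.
    serial = pe_utilization(10, load=2, compute=10, store=3, pipelined=False)
    overlapped = pe_utilization(10, load=2, compute=10, store=3, pipelined=True)
    print(f"serial: {serial:.2%}, pipelined: {overlapped:.2%}")
```

In this sketch the PE array idles during every load and store in the serial case, whereas the pipelined case hides those cycles behind computation whenever computation is the slowest stage. The numbers it produces are not comparable to the 68.42% and 80.95% figures reported in the abstract.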