||Optimizing TensorFlow Performance by Reconstructing the Convolution Routine
||(Minseong Kim) ; (Kyu Hyun Choi) ; (Yoonah Paik) ; (Seon Wook Kim)
|| TensorFlow; Profiling; Optimization; Batch
||Using deep learning, we can currently build computational models composed of multiple processing layers to learn representations of data. Convolutional neural networks (CNNs) have been widely adopted to achieve significant performance in image recognition and classification.
TensorFlow, an open-source deep learning framework from Google, uses profiling to select one convolution algorithm, from among several available, as the core of a CNN to deliver the best performance in terms of execution time and memory usage. However, the overhead from profiling is considerably significant, because TensorFlow executes and profiles all the available algorithms for the best selection whenever an application is launched. We observe that memory usage overshoots during profiling, which limits data parallelism, and thus, fails to deliver maximum performance. In this paper, we present a novel profiling method to reduce overhead by storing the profile result from the first run and reusing it from the second run on. Using Inception-V3, we achieved up to 1.12 times and 1.11 times higher throughput, compared to the vanilla TensorFlow and TensorFlow with XLA JIT compilation, respectively, without losing accuracy.