
1. (School of Electrical and Computer Engineering, Korea University / 145 Anam-ro, Seongbuk-gu, Seoul 02841, Korea {minseong, nurlonn, yoonpaik, seon}@korea.ac.kr )
2. (SK Hynix Inc. / 2091 Gyeongchung-daero, Bubaleup, Icheon-si, Gyeonggi-do 17336, Korea. This work was done when Minseong Kim and Kyu Hyun Choi were with Korea University )

Keywords: TensorFlow, Profiling, Optimization, Batch

## 1. Introduction

TensorFlow [1] is one of the most well-known deep learning frameworks, providing high-level APIs for defining, training, and running inference with deep learning models on various machines, such as multicores and GPUs. TensorFlow performs numerical computations based on data flow graphs composed of TensorFlow operations. In the past few years, image recognition and classification studies have made tremendous progress due to the collaborative development of frameworks such as TensorFlow, Caffe [2], Theano [3,4], Keras [5], CNTK [6], Torch [7], and PyTorch [8]. In addition, deep learning techniques have been actively studied and developed in various application fields, such as image recognition [9,10], speech recognition [11,12], public health [13,14], load monitoring [15,16], and so on. In particular, the convolutional neural network (CNN), one of the deep learning models, achieves higher performance on image recognition tasks than humans in some domains [17-19].

Among various CNN operations, the convolution operation is the key component, many variants of which are currently available [20-22]. Because the performance of these algorithms depends on hardware configuration, input characteristics, and model behavior, TensorFlow utilizes various algorithms, i.e., it selects the best algorithm for each convolution operation, using profiling to achieve optimal performance. However, the profiling overhead is considerably large, because all available algorithms are examined for each convolution node, and the examination incurs overhead in execution time and memory usage. In addition, the current profiling process is performed whenever the application is launched; thus, redundant profiling processes might be executed.

In this paper, we present a novel profiling method that reduces overhead by storing and reusing the profiling results. We conducted a comprehensive study on TensorFlow and analyzed memory usage over time, which limits data parallelism and prevents maximum performance. We observed that memory capacity overshoots during the profiling phase, and we found an opportunity to eliminate this overshoot by recording profiling results and reusing them without losing accuracy. Thus, we constructed a novel profiling framework that stores profiling results during the first run and reuses them from the second run on, regardless of the detailed TensorFlow configuration, on NVidia GPUs using cuDNN [23], a GPU-accelerated library of primitives for deep neural networks. Our measurements show up to 1.12 times and 1.11 times higher throughput than vanilla TensorFlow and TensorFlow with XLA JIT compilation, respectively, using Inception-V3, a widely used image classification model trained with ImageNet datasets [24].

The rest of the paper is organized as follows. In Section 2, we describe the background and motivation for our study. In Section 3, we introduce the overall architecture of our proposed profiling method. In Section 4, we evaluate the performance of the method in terms of throughput and present memory usage over time. Section 5 concludes the paper.

## 2. TensorFlow and Its Profiling Execution

### 2.1 TensorFlow

TensorFlow is an open-source software library developed by Google [1] that performs numerical computations based on data flow graphs to conduct machine learning and deep neural network research. TensorFlow provides high-level APIs that make it easy to construct a neural network with high flexibility, deploying computations on CPUs or GPUs across desktops, servers, and mobile devices. The basic unit of TensorFlow is a computational graph in which each node represents an operation and each edge, called a tensor, represents data. The data comprise a set of primitive values shaped into an array of any dimension. A computation node takes zero or more tensors as input and produces a tensor as output.

The CNN is the current state-of-the-art image classification architecture, and it applies a series of convolution filters to raw images in order to extract and learn higher-level features. The CNN receives input and transforms it through a series of hidden layers composed of convolutional, pooling, and dense fully connected layers. The convolutional layer is the core of the CNN and consumes most of the computational capacity from among all the layers. TensorFlow employs a few convolution algorithms and selects one of them to achieve optimal performance in terms of memory usage and execution time with respect to the input parameters. The selection is performed in the profiling phase.
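To illustrate why the convolution algorithms discussed below are largely variants of a matrix product, here is a minimal pure-Python sketch of the explicit "im2col + GEMM" lowering (single channel, stride 1, no padding; purely illustrative, not TensorFlow code):

```python
# Illustrative sketch: lowering a 2D convolution to a matrix product
# (the "im2col + GEMM" idea). Pure Python, single channel, stride 1,
# no padding; not TensorFlow's implementation.

def im2col(image, kh, kw):
    """Unroll each kh x kw patch of `image` into one row of a matrix."""
    h, w = len(image), len(image[0])
    rows = []
    for i in range(h - kh + 1):
        for j in range(w - kw + 1):
            rows.append([image[i + di][j + dj]
                         for di in range(kh) for dj in range(kw)])
    return rows

def conv2d_gemm(image, kernel):
    """2D convolution computed as (patch matrix) x (flattened kernel)."""
    kh, kw = len(kernel), len(kernel[0])
    flat_k = [kernel[di][dj] for di in range(kh) for dj in range(kw)]
    patches = im2col(image, kh, kw)
    out_w = len(image[0]) - kw + 1
    flat = [sum(p * k for p, k in zip(row, flat_k)) for row in patches]
    # Reshape the flat result back into the output feature map.
    return [flat[r * out_w:(r + 1) * out_w]
            for r in range(len(flat) // out_w)]

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, 1]]  # adds each pixel to its lower-right neighbor
print(conv2d_gemm(image, kernel))  # [[6, 8], [12, 14]]
```

The explicitly materialized patch matrix is the "memory workspace" that distinguishes the GEMM-style algorithms in Table 1: implicit variants form the patches on the fly, saving memory at the cost of execution time.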

### 2.2 Profiling

Algorithm 1 presents TensorFlow's convolution profiling algorithm. When a convolution node is triggered, TensorFlow examines whether the node was already profiled. If it was, the previously selected algorithms are used. Otherwise, profiling is performed to find the best algorithm for the node.
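A simplified model of this per-node flow is sketched below; the structure and the hypothetical `measure()` callback, which returns (execution time, workspace size) for one candidate, are our own illustrative choices, not TensorFlow's actual source:

```python
# Simplified model of TensorFlow's existing per-node profiling
# (Algorithm 1). Names and structure are illustrative only.

profiled_nodes = {}  # node id -> (min-time algorithm, min-memory algorithm)

def profile_node(node_id, candidates, measure):
    """measure(algo) -> (exec_time_ms, workspace_bytes) for one candidate."""
    if node_id in profiled_nodes:          # already profiled: reuse selection
        return profiled_nodes[node_id]
    results = {a: measure(a) for a in candidates}  # run every algorithm once
    best = (min(results, key=lambda a: results[a][0]),   # fastest
            min(results, key=lambda a: results[a][1]))   # smallest workspace
    profiled_nodes[node_id] = best
    return best

# Hypothetical measurements, mirroring Fig. 1: algorithm 3 is fastest,
# algorithm 0 needs the least memory.
timings = {"ALGO_0": (1.20, 0), "ALGO_1": (0.82, 100), "ALGO_3": (0.81, 120)}
print(profile_node("conv2d_1", timings, timings.get))  # ('ALGO_3', 'ALGO_0')
```

Note that in this model the expensive `measure` loop runs every time the process starts, which is exactly the redundancy the proposed method removes.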

Profiling serves two purposes in enhancing performance: identifying the algorithm that minimizes memory usage and the one that minimizes execution time. To find these algorithms, all available convolution algorithms are executed, depending on the convolution type: forward, backward input, or backward filter. Table 1 shows the algorithms available in TensorFlow with NVidia GPUs. For the forward convolution type, the IMPLICIT_GEMM, IMPLICIT_PRECOMP_GEMM, GEMM, and FFT algorithms are available. All of them perform a matrix-to-matrix product, but their memory usage and execution times differ. An implicit algorithm like IMPLICIT_GEMM requires less memory but executes for a longer time than explicit ones, such as GEMM and FFT [23,25].

Fig. 1 shows an example record from profiling. The first four lines represent the convolution node's ID (name: starting address), an applied algorithm number, and its execution time in ms. Line 5 shows the convolution type; in this example, the type is *backward_filter*, and the algorithm numbers on lines 1 to 4 represent the algorithms in Table 1, in order. In other words, algorithm numbers 0, 1, 2, and 3 denote *ALGO_0*, *ALGO_1*, *FFT*, and *ALGO_3*, respectively. The fields of the input and output shapes on lines 6 and 8 present the data type (F32: 32-bit floating point), batch size, image width, image height, and image channel. For the filter shape on line 7, the fields are the filter width, filter height, input channel, and output channel. Finally, the last line shows the two algorithms selected from the profile: the first (3) achieves the minimum execution time, and the second (0) requires the least memory.

Basically, the profiled information includes convolution type, input shape, filter shape, and output shape. We show in Section 3.1 that the batch size has no significant effect on algorithm selection, so we can use the profiled information from the first run for future runs.

##### Table 1. Applicable convolution algorithms according to convolution type [23].
**(a) Forward**

| Algorithm | Meaning | Memory usage |
|---|---|---|
| IMPLICIT_GEMM | Matrix product | Low (no extra memory workspace) |
| IMPLICIT_PRECOMP_GEMM | Matrix product | Medium (needs some memory workspace) |
| GEMM | Explicit matrix product | High (needs significant memory workspace) |
| FFT | Fast Fourier transform approach | High (stores intermediate results of FFT) |
| FFT_TILING | Fast Fourier transform approach, but splits the input into 32x32 tiles | Medium (needs significantly less than FFT for large images) |

**(b) Backward input**

| Algorithm | Meaning | Memory usage |
|---|---|---|
| ALGO_0 | Sum of matrix product (non-deterministic) | Medium |
| ALGO_1 | Sum of matrix product (deterministic) | Medium |
| FFT | Fast Fourier transform approach (deterministic) | High (stores intermediate results of FFT) |
| FFT_TILING | Fast Fourier transform approach, but splits the input into 32x32 tiles (deterministic) | Medium (needs significantly less than FFT for large images) |

**(c) Backward filter**

| Algorithm | Meaning | Memory usage |
|---|---|---|
| ALGO_0 | Sum of matrix product (non-deterministic) | Medium |
| ALGO_1 | Sum of matrix product (deterministic) | Medium |
| FFT | Fast Fourier transform approach (deterministic) | High (stores intermediate results of FFT) |
| ALGO_3 | Sum of matrix product (precomputation, non-deterministic) | Medium |

### 2.3 Profiling Drawbacks

Fig. 2 presents the memory usage over time on our GPU platform (detailed in Section 4) when running TensorFlow with batch sizes (the number of images processed in one execution) of 16 and 32 for the Inception-V3 application, a widely used image classification model trained with the ImageNet Large Scale Visual Recognition Challenge 2012 datasets. From this experiment, we identified a drawback in the profiling execution: memory usage overshoots approximately 10 seconds after the start of execution, exceeding 70% of the total memory, after which it remains constant at less than 40%.

We found that the overshoot resulted from memory allocation at the beginning of program execution, i.e., during the profiling phase. The memory required for profiling all the algorithms is allocated up front; the memory for the selected algorithm is kept for subsequent execution, and the other regions are deallocated. Because the peak allocation occurs while every candidate is being profiled, additional memory cannot be devoted to processing larger batches with the selected algorithm. Therefore, profiling limits the exploitation of higher data-level parallelism.
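To make the effect concrete, the following toy model (with hypothetical workspace sizes, not measured values) contrasts the peak memory of the profiling phase, where every candidate's workspace is live at once, with the steady-state memory of the selected algorithm:

```python
# Toy model of the memory overshoot. Workspace sizes are invented for
# illustration; the point is the gap between peak and steady state.
workspace_mb = {"IMPLICIT_GEMM": 0,
                "IMPLICIT_PRECOMP_GEMM": 200,
                "GEMM": 800,
                "FFT": 1000}

peak = sum(workspace_mb.values())               # profiling: all allocated
steady = workspace_mb["IMPLICIT_PRECOMP_GEMM"]  # after selection, say
print(peak, steady)  # 2000 200
```

The batch size must be chosen so the peak (2000 MB here) fits in GPU memory, even though only 200 MB is used afterward; this is why eliminating the profiling-phase allocation permits larger batches.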

## 3. Architecture Proposal

### 3.1 Algorithm Selection

Table 2 compares the profiling information for minimum execution time when running Inception-V3 with batch sizes of 16 and 32. We categorize the results into three cases: *Case 1*, where the same algorithm was selected for both batch sizes; *Case 2*, where different algorithms were selected the same number of times; and *Case 3*, where the same algorithms were selected different numbers of times. In this example, *Case 1* covers 90% of the total selections, and the other two cases cover the remaining 10%. In *Case 1*, the results are identical, so there is no obstacle to storing and reusing them; *Case 2* and *Case 3*, however, require a careful approach. In the profile information of Fig. 1, algorithm 3 was selected for minimum execution time, but the execution times of algorithm 3 and algorithm 1 are 0.809984 ms and 0.816864 ms, respectively, a difference of only 0.8%. Likewise, in *Case 2* and *Case 3*, the execution time difference between the candidate algorithms is negligible, so overall performance does not change significantly regardless of which algorithm is used. For the algorithms with minimum memory usage, the profiles are always identical. Table 3 presents comparison results of the profile information obtained by running the application twice with a batch size of 32, which we collected to confirm that repeated execution under the same conditions poses no obstacle to storing and reusing the information. Most of the selections belonged to *Case 1* and the remainder to *Case 3*, but since the performance of the two algorithms differs negligibly, as mentioned above, storing and reusing the results is safe.
Therefore, the profile information can be saved to a profile record, and the record can be reused, indexed by four classification criteria: convolution type, input shape, filter shape, and output shape (excluding batch size).
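A minimal sketch of such a record key, assuming the batch dimension is the leading dimension of the input and output shapes (illustrative only, not TensorFlow code):

```python
# Sketch of the profile-record key: convolution type, input shape,
# filter shape, and output shape, with the batch dimension excluded.

def record_key(conv_type, input_shape, filter_shape, output_shape):
    # Replace the batch (leading) dimension with a wildcard so one
    # record matches runs with different batch sizes.
    strip = lambda s: ("N",) + tuple(s[1:])
    return (conv_type, strip(input_shape),
            tuple(filter_shape), strip(output_shape))

k16 = record_key("forward", (16, 73, 73, 80), (1, 1, 64, 80), (16, 73, 73, 64))
k32 = record_key("forward", (32, 73, 73, 80), (1, 1, 64, 80), (32, 73, 73, 64))
print(k16 == k32)  # True: one record serves both batch sizes
```

Being a tuple of hashable values, the key can index a dictionary in memory or be serialized to the on-disk record file.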

##### Table 2. Comparison of the profiling results for different batch sizes with minimum execution times.

**(a) Case 1**

| CT | Input shape | Filter shape | Output shape | Algorithm (16 batch) | # | Algorithm (32 batch) | # |
|---|---|---|---|---|---|---|---|
| F | Nx73x73x80 | 1x1x64x80 | Nx73x73x64 | IMP_P_GEMM | 2 | IMP_P_GEMM | 2 |
| F | Nx73x73x64 | 1x1x64x80 | Nx73x73x80 | IMP_P_GEMM | 4 | IMP_P_GEMM | 4 |
| F | Nx5x5x768 | 1x1x768x128 | Nx5x5x128 | IMP_GEMM | 4 | IMP_GEMM | 4 |
| F | Nx5x5x128 | 5x5x128x768 | Nx1x1x128 | GEMM | 4 | GEMM | 4 |
| BI | Nx8x8x448 | 3x3x448x384 | Nx8x8x384 | FFT_TILING | 4 | FFT_TILING | 4 |
| BI | Nx8x8x384 | 3x1x384x384 | Nx8x8x384 | ALGO_1 | 8 | ALGO_1 | 8 |
| BI | Nx8x8x384 | 1x3x384x384 | Nx8x8x384 | ALGO_1 | 8 | ALGO_1 | 8 |
| BI | Nx73x73x80 | 3x3x80x192 | Nx71x71x192 | FFT_TILING | 2 | FFT_TILING | 2 |
| BF | Nx35x35x48 | 5x5x48x64 | Nx35x35x64 | ALGO_3 | 6 | ALGO_3 | 6 |
| BF | Nx35x35x288 | 3x3x288x384 | Nx17x17x384 | ALGO_1 | 2 | ALGO_1 | 2 |

**(b) Case 2**

| CT | Input shape | Filter shape | Output shape | Algorithm (16 batch) | # | Algorithm (32 batch) | # |
|---|---|---|---|---|---|---|---|
| F | Nx8x8x2048 | 1x1x2048x192 | Nx8x8x192 | IMP_GEMM | 4 | IMP_P_GEMM | 4 |

**(c) Case 3**

| CT | Input shape | Filter shape | Output shape | Algorithm (16 batch) | # | Algorithm (32 batch) | # |
|---|---|---|---|---|---|---|---|
| BF | Nx35x35x96 | 3x3x96x96 | Nx35x35x96 | ALGO_0 | 15 | ALGO_0 | 9 |
| | | | | ALGO_1 | 5 | ALGO_1 | 0 |
| | | | | ALGO_3 | 4 | ALGO_3 | 15 |
| BF | Nx35x35x64 | 3x3x64x96 | Nx35x35x96 | ALGO_1 | 5 | ALGO_1 | 7 |
| | | | | ALGO_3 | 3 | ALGO_3 | 1 |
CT: Convolution type; F: Forward; BI: Backward Input; BF: Backward Filter; #: the number of selections.
##### Table 3. Comparison results of profile information when running twice (Case 3, backward filter).

| Input shape | Filter shape | Output shape | Algorithm (1st run) | # | Algorithm (2nd run) | # |
|---|---|---|---|---|---|---|
| 32x8x8x448 | 3x3x448x384 | 32x8x8x384 | ALGO_1 | 3 | ALGO_1 | 4 |
| | | | ALGO_3 | 1 | ALGO_3 | 0 |
| 32x17x17x768 | 1x1x768x192 | 32x17x17x192 | ALGO_0 | 9 | ALGO_0 | 8 |
| | | | ALGO_3 | 15 | ALGO_3 | 16 |
#: the number of selections.

### 3.2 Overall Architecture

Algorithm 2 shows our proposed convolution profiling algorithm. For each convolution node, the algorithm examines whether the convolution node was profiled. If the node was not profiled, it checks whether a profile record is available from the previous run. If a profile record is available, we select the algorithm from the profile record. Otherwise, we perform profiling to find the best algorithm, and save the profile information in the profile record file for future runs.
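The flow of Algorithm 2 can be sketched as follows; the file name, JSON format, and function names are our own illustrative choices, not TensorFlow's implementation:

```python
import json
import os

# Hedged sketch of the proposed flow (Algorithm 2): consult an in-memory
# table first, then the on-disk profile record from a previous run, and
# profile only when neither has an entry.

RECORD_FILE = "conv_profile_record.json"   # hypothetical file name
profiled = {}  # in-memory: key -> selected algorithm (this run)

def load_record():
    """Load the profile record written by a previous run, if any."""
    if os.path.exists(RECORD_FILE):
        with open(RECORD_FILE) as f:
            return json.load(f)
    return {}

def select_algorithm(key, profile_fn):
    if key in profiled:                    # already profiled this run
        return profiled[key]
    record = load_record()
    if key in record:                      # reuse the previous run's result
        profiled[key] = record[key]
    else:                                  # first encounter: profile and save
        profiled[key] = profile_fn()
        record[key] = profiled[key]
        with open(RECORD_FILE, "w") as f:  # persist for future runs
            json.dump(record, f)
    return profiled[key]
```

Because the record file is keyed by the four criteria of Section 3.1, the expensive `profile_fn` (and its memory overshoot) runs only on the very first encounter of a convolution configuration across all runs.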

## 4. Experimental Evaluation

The performance evaluation of our proposed architecture was done on the platform detailed in Table 4. For the evaluation, TensorFlow r1.4 and the Inception-V3 application were used. We trained the application with ImageNet datasets, and used batch sizes of 32, 48, and 56.

Fig. 3 compares the performance of vanilla TensorFlow, TensorFlow with XLA JIT compilation, and our approach with XLA in terms of seconds per execution, and Fig. 4 does so in terms of images per second, i.e., throughput.

The execution time of our approach is similar to that of XLA for batch sizes 32 and 48. With a batch size of 56, our technique requires 9.8% less execution time than the others. The throughput of our approach is likewise similar to that of XLA for batch sizes 32 and 48, and with a batch size of 56, it outperforms vanilla TensorFlow by 1.12 times. For vanilla TensorFlow, throughput increased to 16457, 17658, and 18397 for batch sizes of 32, 48, and 56, respectively, but did not reach the maximum throughput of the proposed architecture. With XLA, throughput increased from 18120 to 20147 as the batch size grew from 32 to 48, but dropped to 18438 at batch size 56. The drop occurs because a larger batch size implies higher memory usage, and the resulting memory insufficiency forces the selection of the algorithm with minimum memory usage rather than the one with minimum execution time. Our proposed architecture eliminates this high memory usage, and its throughput increased steadily: 18179, 19820, and 20511 for batch sizes of 32, 48, and 56, respectively.

To quantify the reduced memory usage, Figs. 5 and 6 present GPU memory usage over time for the Inception-V3 application on TensorFlow with XLA JIT compilation and on our architecture, with batch sizes of 32, 48, and 56. TensorFlow with XLA JIT incurs a memory usage overshoot at the beginning of execution, i.e., during the profiling phase. In contrast, the memory usage of the proposed architecture remains constant over time; that is, our architecture eliminates the memory overshoot of the profiling phase. Thus, we confirmed that our approach achieves higher performance than the existing schemes, and we expect it to achieve even higher performance for larger batch sizes.

##### Table 4. Server specifications for the experiment.

| Feature | Description |
|---|---|
| CPU | Intel Xeon CPU E5-1620 v4 @ 3.50 GHz |
| CPU memory | 32 GB |
| GPU | NVidia Titan XP (2 ea) |
| GPU memory | 12 GB per GPU |

## 5. Conclusion

In this paper, we proposed a novel profiling method that reduces overhead by storing profile information from the first run and reusing it on subsequent runs. During the first run, the algorithms that are efficient in terms of memory or runtime in a specific execution environment are determined from among the available candidates. In the same execution environment, the same selection results can therefore be expected, so profiling at every execution is unnecessary. Moreover, in the remaining cases, the performance difference between the memory- or runtime-efficient candidates is minimal. Therefore, our method eliminates the profiling overhead by reusing the previous results whenever the execution environment does not change. We analyzed memory usage over time in detail and observed that memory overshoots during profiling; this overshoot limits data-level parallelism and thus degrades execution throughput.

Based on this knowledge, we devised a new profiling method to eliminate the overshoot and improve performance. Using Inception-V3 trained with ImageNet datasets, we achieved higher throughput than vanilla TensorFlow and TensorFlow with XLA JIT compilation by up to 1.12 times and 1.11 times, respectively, without losing accuracy.

### ACKNOWLEDGMENTS

This work was partially supported by SK Telecom Co., LTD.

### REFERENCES

1. Abadi M., et al., 2016, TensorFlow: A System for Large-scale Machine Learning, in Proc. of OSDI'16, Berkeley, CA, USA, pp. 265-283
2. Jia Y., et al., 2014, Caffe: Convolutional Architecture for Fast Feature Embedding, in Proc. of MM'14, New York, NY, USA, ACM, pp. 675-678
3. Bastien F., et al., 2012, Theano: New Features and Speed Improvements, CoRR
4. Al-Rfou R., et al., 2016, Theano: A Python Framework for Fast Computation of Mathematical Expressions, CoRR
5. 2015, Keras
6. Seide F., Agarwal A., 2016, CNTK: Microsoft's Open-source Deep-learning Toolkit, in Proc. of KDD'16, New York, NY, USA, ACM, pp. 2135-2135
7. Collobert R., Bengio S., Mariéthoz J., 2002, Torch: A Modular Machine Learning Software Library, Technical Report, IDIAP
8. 2017, PyTorch
9. Salazar N. D., et al., 2018, Application of Transfer Learning for Object Recognition Using Convolutional Neural Networks, in Proc. of 2018 IEEE Colombian Conference on Applications in Computational Intelligence, Medellin, Colombia, pp. 14-25
10. Spanhol F. A., et al., 2016, Breast Cancer Histopathological Image Classification Using Convolutional Neural Networks, in Proc. of IJCNN'16, Vancouver, BC, Canada, pp. 2560-2567
11. Siniscalchi S. M., Salerno V. M., 2016, Adaptation to New Microphones Using Artificial Neural Networks with Trainable Activation Functions, IEEE Transactions on Neural Networks and Learning Systems, Vol. 28, No. 8, pp. 1959-1965
12. Hautamäki V., et al., 2015, Boosting Universal Speech Attributes Classification with Deep Neural Network for Foreign Accent Characterization, in Proc. of INTERSPEECH'15, Dresden, Germany, pp. 408-412
13. Garzón-Alfonso C. C., Rodríguez-Martínez M., 2018, Twitter Health Surveillance (THS) System, in Proc. of 2018 IEEE International Conference on Big Data, Seattle, WA, USA, pp. 1647-1654
14. Ravì D., et al., 2016, Deep Learning for Health Informatics, IEEE Journal of Biomedical and Health Informatics, Vol. 21, No. 1, pp. 4-21
15. Salerno V., Rabbeni G., 2018, An Extreme Learning Machine Approach to Effective Energy Disaggregation, Electronics, Vol. 7, No. 10, p. 235
16. Zhang C., et al., 2018, Sequence-to-point Learning with Neural Networks for Non-intrusive Load Monitoring, in Proc. of AAAI'18, New Orleans, LA, USA, pp. 2604-2611
17. He K., Zhang X., Ren S., Sun J., 2015, Delving Deep into Rectifiers: Surpassing Human-level Performance on ImageNet Classification, in Proc. of ICCV'15, Santiago, Chile, pp. 1026-1034
18. Dodge S. F., Karam L. J., 2017, A Study and Comparison of Human and Deep Learning Recognition Performance Under Visual Distortions, in Proc. of ICCCN'17, Vancouver, BC, Canada, pp. 1-7
19. He K., Zhang X., Ren S., Sun J., 2016, Deep Residual Learning for Image Recognition, in Proc. of CVPR'16, Las Vegas, NV, USA, pp. 770-778
20. Vasilache N., et al., 2014, Fast Convolutional Nets with fbfft: A GPU Performance Evaluation, CoRR
21. Winograd S., 1980, Arithmetic Complexity of Computations, Society for Industrial and Applied Mathematics
22. Lavin A., Gray S., 2016, Fast Algorithms for Convolutional Neural Networks, in Proc. of CVPR'16, Las Vegas, NV, USA, pp. 4013-4021
23. Chetlur S., et al., 2014, cuDNN: Efficient Primitives for Deep Learning, CoRR
24. Deng J., et al., 2009, ImageNet: A Large-scale Hierarchical Image Database, in Proc. of CVPR'09, Miami, FL, USA, pp. 248-255
25. Rhu M., et al., 2016, vDNN: Virtualized Deep Neural Networks for Scalable, Memory-efficient Neural Network Design, in Proc. of MICRO'16, Taipei, Taiwan, pp. 1-13

## Author

##### Minseong Kim

Minseong Kim received a BSc in electrical engineering and a PhD in electronics, electrical and computer engineering from Korea University in 2011 and 2018, respectively. He is currently with SK hynix, South Korea. His research interests include compiler construction, microarchitectures, memory systems, and parallel and distributed architectural simulator designs.

##### Kyu Hyun Choi

Kyu Hyun Choi received a PhD in electronics, electrical and computer engineering from Korea University in 2018. He is currently a senior engineer with SK hynix, South Korea. His research interests include microarchitectures, GPUs, parallel computing, memory system design, and processing in memory.

##### Yoonah Paik

Yoonah Paik received her B.Eng. in Electrical Engineering from Korea University, Seoul, Korea, in 2015, and she is currently working toward an integrated MSc and PhD in the Compiler and Microarchitecture Lab at Korea University. Her current research interests include microarchitectures, memory system design, and system software.

##### Seon Wook Kim

Seon Wook Kim received his BSc in Electronics and Computer Engineering from Korea University in 1988, an MSc in Electrical Engineering from Ohio State University in 1990, and a PhD in Electrical and Computer Engineering from Purdue University in 2001. He was a senior researcher at the Agency for Defense Development (1990-1995) and was a staff software engineer at Intel/KSL (2001-2002). Currently, he is a professor for the School of Electrical and Computer Engineering at Korea University. His research interests include compiler construction, microarchitectures, system optimization, and SoC design. He is a senior member of ACM and IEEE.