
  1. (School of Electrical and Computer Engineering, Korea University / 145 Anam-ro, Seongbuk-gu, Seoul 02841, Korea {minseong, nurlonn, yoonpaik, seon}@korea.ac.kr )
  2. (SK Hynix Inc. / 2091 Gyeongchung-daero, Bubaleup, Icheon-si, Gyeonggi-do 17336, Korea. This work was done when Minseong Kim and Kyu Hyun Choi were with Korea University )



TensorFlow, Profiling, Optimization, Batch

1. Introduction

TensorFlow [1] is one of the most well-known deep learning frameworks, providing high-level APIs for defining, training, and performing inference with deep learning models on various machines, such as multicores and GPUs. TensorFlow performs numerical computations based on data flow graphs whose nodes are TensorFlow operations. In the past few years, image recognition and classification studies have made tremendous progress due to the collaborative development of frameworks such as TensorFlow, Caffe [2], Theano [3,4], Keras [5], CNTK [6], Torch [7], and PyTorch [8]. In addition, deep learning techniques have been actively studied and deployed in various application fields, such as image recognition [9,10], speech recognition [11,12], public health [13,14], load monitoring [15,16], and so on. In particular, the convolutional neural network (CNN), one of the most successful deep learning models, achieves higher performance on image recognition tasks than humans in some domains [17-19].

Among various CNN operations, the convolution operation is the key component, many variants of which are currently available [20-22]. Because the performance of these algorithms depends on hardware configuration, input characteristics, and model behavior, TensorFlow utilizes various algorithms, i.e., it selects the best algorithm for each convolution operation, using profiling to achieve optimal performance. However, the profiling overhead is considerably large, because all available algorithms are examined for each convolution node, and the examination incurs overhead in execution time and memory usage. In addition, the current profiling process is performed whenever the application is launched; thus, redundant profiling processes might be executed.

In this paper, we present a novel profiling method that reduces overhead by storing and reusing the profiling results. We conducted a comprehensive study of TensorFlow and analyzed memory usage over time, which limits data parallelism and prevents maximum performance. We observed that memory capacity overshoots during the profiling phase, and we found an opportunity to eliminate this memory overshoot by recording profiling results and reusing them without losing accuracy. Thus, we constructed a novel profiling framework that stores profiling results during the first run and reuses them from the second run on, regardless of the detailed TensorFlow configuration, on NVidia GPUs using cuDNN [23], a GPU-accelerated library of primitives for deep neural networks. In our measurements, we achieved up to 1.12 times and 1.11 times higher throughput than vanilla TensorFlow and TensorFlow with XLA JIT compilation, respectively, using Inception-V3, a widely used image classification model trained with ImageNet datasets [24].

The rest of the paper is organized as follows. In Section 2, we describe the background to and motivation for our study. In Section 3, we introduce the overall architecture of our proposed profiling method. We evaluate the performance of the method (in terms of throughput) in Section 4, and present memory usage over time. The paper concludes in Section 5.

2. TensorFlow and Its Profiling Execution

2.1 TensorFlow

TensorFlow is an open-source software library developed by Google [1] that performs numerical computations based on data flow graphs to conduct machine learning and deep neural network research. TensorFlow provides high-level APIs that make it easy to construct a neural network with high flexibility and to deploy computations on CPUs or GPUs across desktops, servers, or mobile devices. The basic unit of TensorFlow is a computational graph, where each node represents an operation, and each edge, called a tensor, represents data. The data comprise a set of primitive values shaped into an array of any dimension. A computation node takes zero or more tensors as input and produces a tensor as output.

The CNN is the current state-of-the-art image classification architecture, and it applies a series of convolution filters to raw images in order to extract and learn higher-level features. The CNN receives input and transforms it through a series of hidden layers composed of convolutional, pooling, and dense fully connected layers. The convolutional layer is the core of the CNN and accounts for most of the computation among all the layers. TensorFlow employs several convolution algorithms and selects one of them to achieve optimal performance in terms of memory usage and execution time with respect to the input parameters. The selection is performed in the profiling phase.
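As an illustration of the trade-off TensorFlow navigates, the explicit GEMM family of convolution algorithms first expands the input into a patch matrix ("im2col") and then performs a single matrix product, buying speed with extra workspace memory. The following pure-Python sketch (our own illustration, not TensorFlow or cuDNN code) shows this for a single-channel 2-D input:

```python
# Explicit GEMM convolution sketch: the im2col matrix is the extra memory
# workspace that implicit variants avoid at the cost of execution time.

def im2col(image, kh, kw):
    """Flatten every kh x kw patch of a 2-D image into one row."""
    h, w = len(image), len(image[0])
    rows = []
    for i in range(h - kh + 1):
        for j in range(w - kw + 1):
            rows.append([image[i + di][j + dj]
                         for di in range(kh) for dj in range(kw)])
    return rows  # (out_h * out_w) x (kh * kw): the workspace matrix

def conv2d_gemm(image, kernel):
    """2-D valid cross-correlation computed as im2col followed by a matrix product."""
    kh, kw = len(kernel), len(kernel[0])
    flat_kernel = [kernel[di][dj] for di in range(kh) for dj in range(kw)]
    cols = im2col(image, kh, kw)
    out_w = len(image[0]) - kw + 1
    flat = [sum(a * b for a, b in zip(row, flat_kernel)) for row in cols]
    return [flat[r * out_w:(r + 1) * out_w] for r in range(len(flat) // out_w)]
```

For example, `conv2d_gemm([[1, 2, 3], [4, 5, 6], [7, 8, 9]], [[1, 0], [0, 1]])` yields `[[6, 8], [12, 14]]`, while materializing a 4x4 patch matrix along the way.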

2.2 Profiling

Algorithm 1 presents TensorFlow's convolution profiling algorithm. When a convolution node is triggered, TensorFlow examines whether the node was already profiled. If it was, the previously selected algorithms are used. Otherwise, profiling is performed to find the best algorithm for the node.

Profiling enhances performance in two ways: it identifies the algorithm that minimizes memory usage and the algorithm that minimizes execution time. To find these algorithms, all available convolution algorithms are executed for the given convolution type, such as forward, backward input, or backward filter. Table 1 shows the algorithms available in TensorFlow with NVidia GPUs. For the forward convolution type, the IMPLICIT_GEMM, IMPLICIT_PRECOMP_GEMM, GEMM, and FFT algorithms are available. All of these algorithms perform a matrix-to-matrix product, but their memory usage and execution times differ. An implicit algorithm like IMPLICIT_GEMM requires less memory but executes for a longer time than explicit ones, such as GEMM and FFT [23,25].
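The selection step can be sketched as follows. This is our own illustrative pseudocode, not cuDNN's API: the candidate list, the workspace figures, and the `profile_candidates`/`run_conv` names are hypothetical.

```python
# Hypothetical sketch: run every candidate algorithm once and record two
# winners -- the one with minimum execution time and the one with minimum
# memory requirement (the two entries on the last line of Fig. 1).
import time

def profile_candidates(candidates, run_conv):
    results = []
    for algo_id, (name, workspace_bytes) in enumerate(candidates):
        start = time.perf_counter()
        run_conv(name)                      # execute this algorithm once
        elapsed = time.perf_counter() - start
        results.append((algo_id, elapsed, workspace_bytes))
    best_time = min(results, key=lambda r: r[1])[0]
    best_memory = min(results, key=lambda r: r[2])[0]
    return best_time, best_memory           # the two selected algorithm ids
```

Every candidate must actually be executed (and its workspace allocated), which is the source of both the time and the memory overhead discussed in Section 2.3.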

Fig. 1 shows an example record from profiling. The first four lines represent the convolution node's ID (name: starting address), an applied algorithm number, and its execution time in ms. Line 5 shows the convolution type. In this example, the type is backward_filter, and the algorithm numbers on lines 1 to 4 represent the algorithms in Table 1, in order; that is, algorithm numbers 0, 1, 2, and 3 denote ALGO_0, ALGO_1, FFT, and ALGO_3, respectively. The fields of the input and output shapes in lines 6 and 8, respectively, are the data type (F32: 32-bit floating point), batch size, image width, image height, and image channels. For the filter shape in line 7, the fields are the filter width, filter height, input channels, and output channels. Finally, the last line shows the two algorithms selected from the profile: the first (3) achieves the minimum execution time, and the second (0) requires the least memory.

Basically, the profiled information includes convolution type, input shape, filter shape, and output shape. We show in Section 3.1 that the batch size has no significant effect on algorithm selection, so we can use the profiled information from the first run for future runs.

Algorithm 1. TensorFlow’s convolution profiling.
../../Resources/ieie/IEIESPC.2021.10.2.128/al1.png
Fig. 1. Example record from TensorFlow profiling.
../../Resources/ieie/IEIESPC.2021.10.2.128/fig1.png
Table 1. Applicable convolution algorithms according to convolution type [23].

| Algorithm | Meaning | Memory usage |
|---|---|---|
| IMPLICIT_GEMM | Matrix product | Low (no extra memory workspace) |
| IMPLICIT_PRECOMP_GEMM | Matrix product | Medium (needs some memory workspace) |
| GEMM | Explicit matrix product | High (needs significant memory workspace) |
| FFT | Fast Fourier transform approach | High (stores intermediate results of FFT) |
| FFT_TILING | Fast Fourier transform approach, but splits the input into 32x32 tiles | Medium (needs significantly less than FFT for large images) |

(a) Forward

| Algorithm | Meaning | Memory usage |
|---|---|---|
| ALGO_0 | Sum of matrix product (non-deterministic) | Medium |
| ALGO_1 | Sum of matrix product (deterministic) | Medium |
| FFT | Fast Fourier transform approach (deterministic) | High (stores intermediate results of FFT) |
| FFT_TILING | Fast Fourier transform approach, but splits the input into 32x32 tiles (deterministic) | Medium (needs significantly less than FFT for large images) |

(b) Backward input

| Algorithm | Meaning | Memory usage |
|---|---|---|
| ALGO_0 | Sum of matrix product (non-deterministic) | Medium |
| ALGO_1 | Sum of matrix product (deterministic) | Medium |
| FFT | Fast Fourier transform approach (deterministic) | High (stores intermediate results of FFT) |
| ALGO_3 | Sum of matrix product (precomputation, non-deterministic) | Medium |

(c) Backward filter

2.3 Profiling Drawbacks

Fig. 2 presents memory usage over time on our GPU platform (detailed in Section 4) in TensorFlow when the batch sizes (the numbers of images processed in one execution) are 16 and 32 for the Inception-V3 application, a widely used image classification model trained with the ImageNet Large Scale Visual Recognition Challenge 2012 datasets. From this experiment, we identified drawbacks in the profiling execution. Memory usage overshoots approximately 10 seconds after the start of execution, exceeding 70% of the total memory, after which memory usage stays constant at less than 40%.

We found that the overshoot resulted from memory allocation at the beginning of program execution, i.e., during the profiling phase. The memory required for profiling all the algorithms is allocated up front; after profiling, only the region for the selected algorithm is kept, and the other regions are deallocated. Because the peak allocation must fit in GPU memory, additional memory for processing larger batches cannot be allocated even though the selected algorithm needs far less in steady state. Therefore, profiling limits the exploitation of higher data-level parallelism.

Fig. 2. Memory usage in GPUs over time for the Inception-V3 application in TensorFlow.
../../Resources/ieie/IEIESPC.2021.10.2.128/fig2.png

3. Architecture Proposal

3.1 Algorithm Selection

Table 2 compares profiling information for minimum execution time while running Inception-V3 with batch sizes of 16 and 32. We categorize the results into three cases: Case 1 covers nodes for which both batch sizes selected the same algorithm; Case 2 covers nodes for which different algorithms were selected the same number of times; Case 3 covers nodes for which the same algorithms were selected different numbers of times. In this example, Case 1 covers 90% of the total selections, and the other cases cover the remaining 10%.

In Case 1, the results are identical, so there is no obstacle to storing and reusing them, but Case 2 and Case 3 require a careful approach. In the profile information in Fig. 1, algorithm 3 was selected for minimum execution time. However, the execution times of algorithm 3 and algorithm 1 are 0.809984 ms and 0.816864 ms, respectively, a difference of only 0.8%. In both Case 2 and Case 3, the execution time difference is similarly negligible, and therefore, there is no significant difference in overall performance when using either algorithm. For the algorithms selected for minimum memory usage, the profiles are always identical.

Table 3 presents comparison results of profile information obtained by running the application twice with a batch size of 32. We compared the results to confirm that there were no obstacles to storing and reusing the information under repeated execution in the same conditions. Most of the nodes belonged to Case 1, and the remainder belonged to Case 3, but this poses no problem for storing and reusing the records, because the performance of the competing algorithms does not differ noticeably, as mentioned above. Therefore, the profile information can be saved to a profile record, and the record can be reused based on four classification criteria: convolution type, input shape, filter shape, and output shape (excluding the batch size).
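A minimal sketch of the resulting lookup key, assuming shapes are recorded as strings such as 32x35x35x96 with the batch size in the leading dimension (the `profile_key` helper is ours, not TensorFlow's):

```python
# Build a batch-size-independent record key: the batch dimension of the input
# and output shapes is replaced by the wildcard "N", so a record taken at one
# batch size matches future runs at any batch size.

def profile_key(conv_type, input_shape, filter_shape, output_shape):
    def drop_batch(shape):
        # "32x35x35x96" -> "Nx35x35x96"
        return "x".join(["N"] + shape.split("x")[1:])
    # The filter shape has no batch dimension, so it is kept verbatim.
    return (conv_type, drop_batch(input_shape), filter_shape, drop_batch(output_shape))
```

With this key, profiles recorded at batch size 16 and batch size 32 collide onto the same record, which is exactly the reuse that Section 3.1 argues is safe.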

Table 2. Comparison of the profiling results for different batch sizes with minimum execution times.

| CT | Input shape | Filter shape | Output shape | Size-16 batch: selected algorithm (#) | Size-32 batch: selected algorithm (#) |
|---|---|---|---|---|---|
| F | Nx73x73x80 | 1x1x64x80 | Nx73x73x64 | IMP_P_GEMM (2) | IMP_P_GEMM (2) |
| F | Nx73x73x64 | 1x1x64x80 | Nx73x73x80 | IMP_P_GEMM (4) | IMP_P_GEMM (4) |
| F | Nx5x5x768 | 1x1x768x128 | Nx5x5x128 | IMP_GEMM (4) | IMP_GEMM (4) |
| F | Nx5x5x128 | 5x5x128x768 | Nx1x1x128 | GEMM (4) | GEMM (4) |
| BI | Nx8x8x448 | 3x3x448x384 | Nx8x8x384 | FFT_TILING (4) | FFT_TILING (4) |
| BI | Nx8x8x384 | 3x1x384x384 | Nx8x8x384 | ALGO_1 (8) | ALGO_1 (8) |
| BI | Nx8x8x384 | 1x3x384x384 | Nx8x8x384 | ALGO_1 (8) | ALGO_1 (8) |
| BI | Nx73x73x80 | 3x3x80x192 | Nx71x71x192 | FFT_TILING (2) | FFT_TILING (2) |
| BF | Nx35x35x48 | 5x5x48x64 | Nx35x35x64 | ALGO_3 (6) | ALGO_3 (6) |
| BF | Nx35x35x288 | 3x3x288x384 | Nx17x17x384 | ALGO_1 (2) | ALGO_1 (2) |

(a) Case 1

| CT | Input shape | Filter shape | Output shape | Size-16 batch: selected algorithm (#) | Size-32 batch: selected algorithm (#) |
|---|---|---|---|---|---|
| F | Nx8x8x2048 | 1x1x2048x192 | Nx8x8x192 | IMP_GEMM (4) | IMP_P_GEMM (4) |

(b) Case 2

| CT | Input shape | Filter shape | Output shape | Size-16 batch: selected algorithm (#) | Size-32 batch: selected algorithm (#) |
|---|---|---|---|---|---|
| BF | Nx35x35x96 | 3x3x96x96 | Nx35x35x96 | ALGO_0 (15) | ALGO_0 (9) |
| | | | | ALGO_1 (5) | ALGO_1 (0) |
| | | | | ALGO_3 (4) | ALGO_3 (15) |
| BF | Nx35x35x64 | 3x3x64x96 | Nx35x35x96 | ALGO_1 (5) | ALGO_1 (7) |
| | | | | ALGO_3 (3) | ALGO_3 (1) |

(c) Case 3

CT: Convolution type; F: Forward; BI: Backward Input; BF: Backward Filter; #: the number of selections.
Table 3. Some comparison results of profile information when running twice in Case 3 and with a backward filter.

| Input shape | Filter shape | Output shape | 1st run: selected algorithm (#) | 2nd run: selected algorithm (#) |
|---|---|---|---|---|
| 32x8x8x448 | 3x3x448x384 | 32x8x8x384 | ALGO_1 (3) | ALGO_1 (4) |
| | | | ALGO_3 (1) | ALGO_3 (0) |
| 32x17x17x768 | 1x1x768x192 | 32x17x17x192 | ALGO_0 (9) | ALGO_0 (8) |
| | | | ALGO_3 (15) | ALGO_3 (16) |

#: the number of selections.

3.2 Overall Architecture

Algorithm 2 shows our proposed convolution profiling algorithm. For each convolution node, the algorithm examines whether the convolution node was profiled. If the node was not profiled, it checks whether a profile record is available from the previous run. If a profile record is available, we select the algorithm from the profile record. Otherwise, we perform profiling to find the best algorithm, and save the profile information in the profile record file for future runs.
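The algorithm above can be sketched as follows, assuming a JSON file serves as the profile record and the key is built from the four classification criteria; the `select_algorithm` and `profile_fn` names are our own illustrative choices, not the paper's implementation.

```python
# Sketch of the proposed profiling flow: consult an on-disk profile record
# before falling back to full profiling, and persist any new result so that
# subsequent runs skip profiling (and its memory overshoot) entirely.
import json
import os

def select_algorithm(key, record_path, profile_fn):
    record = {}
    if os.path.exists(record_path):
        with open(record_path) as f:
            record = json.load(f)       # profile record from a previous run
    if key in record:
        return record[key]              # reuse: no profiling needed
    algo = profile_fn()                 # full profiling: first run only
    record[key] = algo
    with open(record_path, "w") as f:
        json.dump(record, f)            # save for future runs
    return algo
```

On the first run every convolution node pays the profiling cost once; on every later run the record file answers the lookup and the exhaustive algorithm sweep never executes.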

Algorithm 2. Our proposed convolution profiling.
../../Resources/ieie/IEIESPC.2021.10.2.128/al2.png

4. Experimental Evaluation

The performance evaluation of our proposed architecture was done on the platform detailed in Table 4. For the evaluation, TensorFlow r1.4 and the Inception-V3 application were used. We trained the application with ImageNet datasets, and used batch sizes of 32, 48, and 56.

Fig. 3 compares performance in terms of seconds per execution, and Fig. 4 in terms of images per second (i.e., throughput), across vanilla TensorFlow, TensorFlow with XLA JIT compilation, and our approach with XLA.

The execution time of our approach is similar to that of XLA for batch sizes 32 and 48. With a batch size of 56, our technique requires 9.8% less execution time than the others. The throughput of our approach is also similar to that of XLA for batch sizes 32 and 48, and it outperformed vanilla TensorFlow by up to 1.12 times. For vanilla TensorFlow, throughput increased to 16457, 17658, and 18397 for batch sizes of 32, 48, and 56, respectively, but it did not reach the maximum throughput of the proposed architecture. With XLA, throughput increased from 18120 to 20147 as the batch size increased from 32 to 48, but decreased to 18438 for batch size 56. The throughput decreases because a larger batch size implies higher memory usage; under such memory pressure, the algorithm requiring minimum memory is preferred over the one with minimum execution time. Our proposed architecture eliminates the high peak memory usage, and its throughput increased steadily: 18179, 19820, and 20511 for batch sizes of 32, 48, and 56, respectively.
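As a sanity check, the reported throughput figures reproduce the headline speedups: the largest gain over vanilla TensorFlow (about 1.12 times) occurs at batch size 48, and the largest gain over XLA (about 1.11 times) at batch size 56.

```python
# Throughput figures as reported in Section 4, indexed by batch size.
vanilla = {32: 16457, 48: 17658, 56: 18397}
xla     = {32: 18120, 48: 20147, 56: 18438}
ours    = {32: 18179, 48: 19820, 56: 20511}

# Best speedup of the proposed architecture over each baseline.
max_vs_vanilla = max(ours[b] / vanilla[b] for b in ours)  # ~1.12 (batch 48)
max_vs_xla     = max(ours[b] / xla[b] for b in ours)      # ~1.11 (batch 56)
```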

To quantitatively observe the reduced memory usage, Figs. 5 and 6 present GPU memory usage over time for the Inception-V3 application on TensorFlow with XLA JIT compilation and on our architecture, with batch sizes of 32, 48, and 56. TensorFlow with XLA JIT incurs a memory usage overshoot at the start of execution, i.e., during the profiling phase. In contrast, the memory usage of the proposed architecture remains constant over time; that is, our architecture eliminates the memory usage overshoot of the profiling phase. Thus, we confirmed that our approach outperforms the existing schemes, and we can expect it to achieve even higher performance for larger batch sizes.

Fig. 3. Performance comparison in terms of seconds per execution. Lower is better.
../../Resources/ieie/IEIESPC.2021.10.2.128/fig3.png
Fig. 4. Throughput comparison in terms of images per second. Higher is better.
../../Resources/ieie/IEIESPC.2021.10.2.128/fig4.png
Fig. 5. The memory usage of GPUs over time for the Inception-V3 application on TensorFlow with XLA JIT compilation.
../../Resources/ieie/IEIESPC.2021.10.2.128/fig5.png
Fig. 6. The memory usage of GPUs over time for the Inception-V3 application on the proposed architecture.
../../Resources/ieie/IEIESPC.2021.10.2.128/fig6.png
Table 4. Server specifications for the experiment.

| Feature | Description |
|---|---|
| CPU | Intel Xeon CPU E5-1620 v4 @ 3.50 GHz |
| CPU memory | 32 GB |
| GPU | NVidia Titan XP (2 ea) |
| GPU memory | 12 GB per GPU |

5. Conclusion

In this paper, we proposed a novel profiling method that reduces overhead by storing profile information from the first run and reusing it on subsequent runs. During the first run, the algorithms that are most efficient in terms of memory or runtime in a specific execution environment are determined from among the available candidates. In the same execution environment, the same selection results can be expected, so profiling at every execution is redundant. Moreover, in some cases, the performance difference between memory- and runtime-efficient algorithms is minimal. Therefore, our method eliminates the related overhead by reusing the previous profiling results whenever the execution environment does not change. We analyzed memory usage over time in detail and observed that memory overshoots during profiling. This memory overshoot limits data-level parallelism, thus degrading execution throughput.

Based on this knowledge, we devised a new profiling method to eliminate the overshoot and improve performance. Using Inception-V3 trained with ImageNet datasets, we achieved higher throughput than vanilla TensorFlow and TensorFlow with XLA JIT compilation by up to 1.12 times and 1.11 times, respectively, without losing accuracy.

ACKNOWLEDGMENTS

This work was partially supported by SK Telecom Co., LTD.

REFERENCES

1. Abadi M., et al., 2016, TensorFlow: A System for Large-scale Machine Learning, in Proc. of OSDI'16, Berkeley, CA, USA, pp. 265-283
2. Jia Y., et al., 2014, Caffe: Convolutional Architecture for Fast Feature Embedding, in Proc. of MM'14, New York, NY, USA, ACM, pp. 675-678
3. Bastien F., et al., 2012, Theano: New Features and Speed Improvements, CoRR
4. Al-Rfou R., et al., 2016, Theano: A Python Framework for Fast Computation of Mathematical Expressions, CoRR
5. 2015, Keras
6. Seide F., Agarwal A., 2016, CNTK: Microsoft's Open-source Deep-learning Toolkit, in Proc. of KDD'16, New York, NY, USA, ACM, pp. 2135-2135
7. Collobert R., Bengio S., Mariéthoz J., 2002, Torch: A Modular Machine Learning Software Library, Technical Report, IDIAP
8. 2017, PyTorch
9. Salazar N. D., et al., 2018, Application of Transfer Learning for Object Recognition Using Convolutional Neural Networks, in Proc. of 2018 IEEE Colombian Conference on Applications in Computational Intelligence, Medellin, Colombia, pp. 14-25
10. Spanhol F. A., et al., 2016, Breast Cancer Histopathological Image Classification Using Convolutional Neural Networks, in Proc. of IJCNN'16, Vancouver, BC, Canada, pp. 2560-2567
11. Siniscalchi S. M., Salerno V. M., 2016, Adaptation to New Microphones Using Artificial Neural Networks with Trainable Activation Functions, IEEE Transactions on Neural Networks and Learning Systems, Vol. 28, No. 8, pp. 1959-1965
12. Hautamäki V., et al., 2015, Boosting Universal Speech Attributes Classification with Deep Neural Network for Foreign Accent Characterization, in Proc. of INTERSPEECH'15, Dresden, Germany, pp. 408-412
13. Garzón-Alfonso C. C., Rodríguez-Martínez M., 2018, Twitter Health Surveillance (THS) System, in Proc. of 2018 IEEE International Conference on Big Data, Seattle, WA, USA, pp. 1647-1654
14. Ravì D., et al., 2016, Deep Learning for Health Informatics, IEEE Journal of Biomedical and Health Informatics, Vol. 21, No. 1, pp. 4-21
15. Salerno V., Rabbeni G., 2018, An Extreme Learning Machine Approach to Effective Energy Disaggregation, Electronics, Vol. 7, No. 10, pp. 235
16. Zhang C., et al., 2018, Sequence-to-point Learning with Neural Networks for Non-intrusive Load Monitoring, in Proc. of AAAI'18, New Orleans, LA, USA, pp. 2604-2611
17. He K., Zhang X., Ren S., Sun J., 2015, Delving Deep into Rectifiers: Surpassing Human-level Performance on ImageNet Classification, in Proc. of ICCV'15, Santiago, Chile, pp. 1026-1034
18. Dodge S. F., Karam L. J., 2017, A Study and Comparison of Human and Deep Learning Recognition Performance Under Visual Distortions, in Proc. of ICCCN'17, Vancouver, BC, Canada, pp. 1-7
19. He K., Zhang X., Ren S., Sun J., 2016, Deep Residual Learning for Image Recognition, in Proc. of CVPR'16, Las Vegas, NV, USA, pp. 770-778
20. Vasilache N., et al., 2014, Fast Convolutional Nets with fbfft: A GPU Performance Evaluation, CoRR
21. Winograd S., 1980, Arithmetic Complexity of Computations, Society for Industrial and Applied Mathematics
22. Lavin A., Gray S., 2016, Fast Algorithms for Convolutional Neural Networks, in Proc. of CVPR'16, Las Vegas, NV, USA, pp. 4013-4021
23. Chetlur S., et al., 2014, cuDNN: Efficient Primitives for Deep Learning, CoRR
24. Deng J., et al., 2009, ImageNet: A Large-scale Hierarchical Image Database, in Proc. of CVPR'09, Miami, FL, USA, pp. 248-255
25. Rhu M., et al., 2016, vDNN: Virtualized Deep Neural Networks for Scalable, Memory-efficient Neural Network Design, in Proc. of MICRO'16, Taipei, Taiwan, pp. 1-13

Author

Minseong Kim
../../Resources/ieie/IEIESPC.2021.10.2.128/au1.png

Minseong Kim received a BSc in electrical engineering and a PhD in electronics, electrical and computer engineering from Korea University in 2011 and 2018, respectively. He is currently with SK hynix, South Korea. His research interests include compiler construction, microarchitectures, memory systems, and parallel and distributed architectural simulator designs.

Kyu Hyun Choi
../../Resources/ieie/IEIESPC.2021.10.2.128/au2.png

Kyu Hyun Choi received a PhD in electronics, electrical and computer engineering from Korea University in 2018. He is currently a senior engineer with SK hynix, South Korea. His research interests include microarchitectures, GPUs, parallel computing, memory system design, and processing in memory.

Yoonah Paik
../../Resources/ieie/IEIESPC.2021.10.2.128/au3.png

Yoonah Paik received her B.Eng. in Electrical Engineering from Korea University, Seoul, Korea, in 2015, and she is currently working toward an integrated MSc and PhD in the Compiler and Microarchitecture Lab at Korea University. Her current research interests include microarchitectures, memory system design, and system software.

Seon Wook Kim
../../Resources/ieie/IEIESPC.2021.10.2.128/au4.png

Seon Wook Kim received his BSc in Electronics and Computer Engineering from Korea University in 1988, an MSc in Electrical Engineering from Ohio State University in 1990, and a PhD in Electrical and Computer Engineering from Purdue University in 2001. He was a senior researcher at the Agency for Defense Development (1990-1995) and was a staff software engineer at Intel/KSL (2001-2002). Currently, he is a professor for the School of Electrical and Computer Engineering at Korea University. His research interests include compiler construction, microarchitectures, system optimization, and SoC design. He is a senior member of ACM and IEEE.