
1. (Intelligent Image Processing Research Center, Korea Electronics Technology Institute / Seongnam, Korea hcmoon23@keti.re.kr )
2. (School of Electronics and Information Engineering, Korea Aerospace University / Goyang, Korea jgkim@kau.ac.kr )

Neural network compression, NNR, NCTM, CNN, Non-linear quantization

## 1. Introduction

In recent years, deep Convolutional Neural Networks (CNNs) have shown excellent performance in various applications, such as object detection, image and video classification, and image quality enhancement. However, these CNNs require huge amounts of memory and computational resources, which hinders their deployment in resource-limited equipment such as mobile devices, and on the Internet of Things (IoT). Therefore, compression of network models that reduces the amount of weight parameters while maintaining the task performance from the trained model is being studied [1].

In general, a neural network model is composed of a number of layers, and the weight parameters of each layer can be represented in matrix form. Therefore, neural network compression aims to reduce the data size for model parameter representation. Neural network compression typically includes the following methods: pruning that eliminates the relationship between weights and nodes but does not significantly affect model performance [2]; quantization and entropy coding that reduce the number of bits to represent the weights by reducing the precision and statistical redundancy of parameters in each layer [3]; and matrix decomposition that reduces the number of weights by decomposing them into more than two matrices [4].

Recently, the Moving Picture Experts Group (MPEG) developed Neural Network coding and Representation (NNR) as the first standard for efficient compression of neural networks (NNs) [5]. NNR aims to define the compressed representation of trained neural networks in an interoperable form. In the standardization process for NNR, reference software called the Neural network Compression Test Model (NCTM) [6] was developed and released for verification of the proposed technologies. In addition, the NNR standard supports the most common neural network formats, such as PyTorch [7], TensorFlow [8], ONNX [9], and NNEF [10].

In the NNR in particular, pruning and matrix decomposition methods are applied as preprocessing methods for parameter reduction, and quantization and entropy coding are the applied parameter coding methods.

In this paper, we propose a Local Non-linear Quantization (LNQ) method for neural network compression. The proposed method divides the overall weight matrix into LNQ units (LUs), and then quantizes the weights in each LU in ternary format.

The rest of the paper is organized as follows. Section 2 introduces the NN compression methods of NCTM. The proposed local non-linear quantization method is presented in Section 3. In Section 4, a performance analysis of the proposed method, based on experimental results, is presented. Finally, we conclude this paper in Section 5.

## 2. Overview of NCTM

In the development of MPEG NNR, to efficiently represent a large-volume neural network model, various compression methods were studied [2,16]. Since most portions of the model data are weight parameters, compression of a neural network model in NNR is focused on reduction of the total amount of weight representation to be stored and/or transmitted.

Fig. 1 shows the overall architecture of the NCTM. An original input model is compressed with the encoding pipeline of the NCTM, which consists of parameter reduction, parameter approximation, and entropy coding. In advance of parameter encoding, the parameters of the input model are reduced by parameter reduction methods (such as pruning) for efficient weight encoding. Then, the weights are compressed into the bitstream by parameter quantization and entropy coding. On the NCTM decoder side, the network model is reconstructed from the compressed bitstream by entropy decoding and dequantization.

First, as shown in Fig. 1, pruning is applied to the input model to reduce weight parameters in the NCTM encoder [4]. To reduce the number of weights included in the model, the weights of the connections between nodes of layers are eliminated, as shown in Fig. 2 [11]. The criterion for pruning a connection is determined from the importance of each weight in terms of performance on the intended task. For instance, a very small weight is considered to have a negligible impact on the task performance of the model, so the corresponding connection between the nodes is eliminated by setting the weight value to 0. In addition, to achieve the model's target compression ratio, a threshold value that satisfies the required sparsity is derived, and the threshold is then applied to prune the weights.
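The threshold derivation described above can be illustrated with a minimal sketch of magnitude-based pruning. It assumes the importance of a weight is simply its absolute value and that the threshold is taken as the k-th smallest magnitude for a target sparsity; the function name `magnitude_prune` and the toy layer are hypothetical and are not part of NCTM.

```python
def magnitude_prune(weights, sparsity):
    """Zero the smallest-magnitude weights so that about `sparsity` of them become 0.

    `weights` is a list of rows (a 2D weight matrix); returns (pruned, threshold).
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0, 1]")
    magnitudes = sorted(abs(w) for row in weights for w in row)
    k = int(round(sparsity * len(magnitudes)))  # number of weights to remove
    if k == 0:
        return [row[:] for row in weights], 0.0
    threshold = magnitudes[k - 1]  # k-th smallest magnitude
    # Weights at or below the threshold are treated as unimportant and set to 0.
    pruned = [[0.0 if abs(w) <= threshold else w for w in row] for row in weights]
    return pruned, threshold


# Toy 2x3 layer (hypothetical values), pruned to 50% sparsity.
layer = [[0.8, -0.05, 0.3], [0.01, -0.6, 0.02]]
pruned, thr = magnitude_prune(layer, sparsity=0.5)
print(pruned, thr)
```

If several weights share the threshold magnitude, this sketch prunes all of them, so the resulting sparsity can slightly exceed the target.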

Fig. 3 shows a general example of applying quantization and entropy coding to the weight matrix of a neural network model that was pruned during preprocessing. In quantization, the precision of the parameter representation is reduced. For instance, weights expressed as 32-bit floating point values in the weight matrix of a layer are mapped to integer values. Subsequently, the quantized weights are converted to a bitstream representation by applying entropy coding, which is lossless compression based on a probability model.

Quantization of NCTM consists of three methods: uniform quantization, codebook-based quantization that reduces quantization error with variable step sizes, and dependent scalar quantization that adaptively uses two scalar quantizers.
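As an illustration of the simplest of these three methods, the following sketch maps floating point weights to integer levels with a single step size and reconstructs them again. The step size of 0.25 and the function names are assumptions made for this example; codebook-based and dependent scalar quantization are not shown.

```python
def quantize_uniform(weights, step):
    """Map each weight to the nearest integer multiple of `step` (the integer level)."""
    return [[int(round(w / step)) for w in row] for row in weights]


def dequantize_uniform(levels, step):
    """Reconstruct approximate weights from integer levels."""
    return [[q * step for q in row] for row in levels]


# Hypothetical pruned layer and step size.
layer = [[0.8, 0.0, 0.3], [0.0, -0.6, 0.0]]
step = 0.25
levels = quantize_uniform(layer, step)
recon = dequantize_uniform(levels, step)
print(levels)  # integer levels passed on to entropy coding
print(recon)   # decoder-side reconstruction
```

The reconstruction error of each weight is bounded by half the step size, which is the trade-off between precision and bit rate that the step size controls.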

An entropy coding method in NCTM called DeepCABAC [12] is context-adaptive binary arithmetic coding (CABAC) based on a context model [15].

## 3. Local Non-linear Quantization

The Local Binary Pattern (LBP) and Local Ternary Pattern (LTP) methods, which are often used in conventional pattern recognition, define the central value in the local area as a threshold, and transmit the neighboring coefficient values in binary or ternary format, respectively.
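For readers unfamiliar with LTP, a minimal sketch of the ternary coding of a 3x3 neighborhood follows. It assumes the common formulation in which a tolerance `t` around the center value separates the three codes; the tolerance, the function name, and the sample block are illustrative assumptions rather than details given in this paper.

```python
def local_ternary_pattern(block, t):
    """Return the center value and the ternary codes of its 8 neighbors (row-major order)."""
    center = block[1][1]
    codes = []
    for i in range(3):
        for j in range(3):
            if (i, j) == (1, 1):
                continue  # the center itself is the threshold, not a coded neighbor
            p = block[i][j]
            if p >= center + t:
                codes.append(1)
            elif p <= center - t:
                codes.append(-1)
            else:
                codes.append(0)
    return center, codes


block = [[5, 9, 1], [4, 5, 6], [7, 2, 5]]
center, codes = local_ternary_pattern(block, t=1)
print(center, codes)
```

Note that the decoder needs the center value in addition to the codes, which is one of the drawbacks that motivates the codebook-based approach below.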

In this paper, we propose a Local Non-linear Quantization method to compress the weight parameters of a neural network [13]. Since LBP and LTP define the threshold as the value at the center position, the threshold may not be appropriate for the local distribution, and the center value must also be transmitted.

To address these problems, the proposed LNQ performs non-linear quantization based on k-means clustering in a local block unit and transmits the resulting codebook accordingly. Fig. 4 shows an example of the LNQ method where k-means clustering is performed on coefficient values in each LU. Then, after determining LNQ fitness (see next paragraph) for each LU, the LNQ-applied block can transmit an indicator (LNQ_flag in Fig. 4), a codebook, and binary coefficients. On the other hand, a block in which LNQ is not applied transmits the indicator and the existing quantized coefficients. Fig. 5(b) shows the processing pipeline in NCTM 3.0 where the proposed method is implemented. In applying the proposed method, it is assumed that input for LNQ is the entire weight matrix to which uniform quantization was applied in advance.

Fig. 6 is a flow chart of the proposed LNQ encoding process that illustrates the details on the proposed algorithm. For application of LNQ, a quantized weight matrix for each layer is given, and the value of lambda for the rate-distortion cost calculation, the number of nodes to be clustered, and the size of the LU to which LNQ is applied must be set. LNQ is applied only to the fully connected (FC) and convolutional layers in a neural network model to which uniform quantization was applied globally. In general, the weights of the pre-trained model are stored in 32-bit floating point values. On the other hand, quantized weights are stored as integers. In most cases, the convergence of k-means-based clustering is faster when integers are input. Therefore, we apply LNQ to uniformly quantized weights.

The detailed process of LNQ is as follows. First, the type of the input layer is checked. If the layer is neither a fully connected nor a convolutional layer, LNQ is not applied, which is signaled by setting LNQ_Layer_Flag = 0. In a fully connected or convolutional layer, the weight matrix of the layer is partitioned into non-overlapping LUs. In this paper, the LU is set to an 8x8 block in a fully connected layer and to each filter in a convolutional layer. After partitioning, LNQ is selectively applied to each LU based on rate-distortion (RD) cost competition: LNQ is selected for an LU if its RD cost is less than that of uniform quantization. Whether LNQ is applied is signaled with LNQ_Flag. When LNQ is applied, the results, including the clustered coefficients and the generated codebook, are transmitted for that LU. Otherwise, the uniformly quantized coefficients are transmitted. This process is repeated until the last LU in the layer is reached.
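The per-LU decision can be sketched as follows. This is a simplified illustration under several assumptions that are not taken from NCTM: the rate is approximated by the zero-order empirical entropy of the symbols, the codebook costs a fixed `codebook_bits` per entry, distortion is measured against the uniformly quantized input (so the uniform path has zero distortion), and k-means is initialized with evenly spaced centers. All function and parameter names are hypothetical.

```python
import math
from collections import Counter


def entropy_bits(symbols):
    """Zero-order empirical entropy of a symbol list, in total bits."""
    n = len(symbols)
    return -sum(c * math.log2(c / n) for c in Counter(symbols).values())


def assign(values, centers):
    """Index of the nearest center for each value."""
    return [min(range(len(centers)), key=lambda c: abs(v - centers[c])) for v in values]


def kmeans_1d(values, k=3, iters=20):
    """Plain Lloyd's k-means on scalars (k >= 2); returns (codebook, labels)."""
    lo, hi = min(values), max(values)
    centers = [lo + (hi - lo) * i / (k - 1) for i in range(k)]  # evenly spaced start
    for _ in range(iters):
        labels = assign(values, centers)
        updated = []
        for c in range(k):
            members = [v for v, l in zip(values, labels) if l == c]
            updated.append(sum(members) / len(members) if members else centers[c])
        if updated == centers:
            break
        centers = updated
    return centers, assign(values, centers)


def lnq_encode_lu(lu, lam, k=3, codebook_bits=8):
    """Choose between LNQ and the existing uniform levels for one flattened LU."""
    codebook, labels = kmeans_1d(lu, k)
    recon = [codebook[l] for l in labels]
    d_lnq = sum((v - r) ** 2 for v, r in zip(lu, recon))
    r_lnq = entropy_bits(labels) + k * codebook_bits
    j_lnq = d_lnq + lam * r_lnq
    j_uni = lam * entropy_bits(lu)  # zero distortion w.r.t. the quantized input
    if j_lnq < j_uni:
        # For k = 3 the cluster indices map to the ternary symbols -1, 0, 1.
        return {"LNQ_flag": 1, "codebook": codebook, "symbols": [l - k // 2 for l in labels]}
    return {"LNQ_flag": 0, "symbols": list(lu)}


# Hypothetical 8x8 LU (flattened) of uniformly quantized integer levels.
lu = [-7, -8, -7, 0, 0, 1, 0, 7, 8, 7, 0, -8, 0, 7, 0, 0] * 4
chosen = lnq_encode_lu(lu, lam=1.0)
skipped = lnq_encode_lu(lu, lam=0.1)
print(chosen["LNQ_flag"], chosen["codebook"], skipped["LNQ_flag"])
```

With a larger lambda the rate saving of the ternary symbols outweighs the clustering distortion and the codebook overhead, so LNQ is selected; with a small lambda the distortion term dominates and the LU keeps its uniform levels.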

In addition, in order to further enhance compression efficiency, a method for grouping LUs to be transmitted (depending on whether LNQ is applied or not) in each layer was also devised. In this paper, entropy coding is performed separately depending on whether LNQ is applied within the layer. Since each layer group has a distinct range of coefficients, context modeling is updated in sequence. For instance, the blocks to which LNQ was applied only have coefficient values of -1, 0, and 1, so context modeling can be configured accordingly. Fig. 7 shows an example of entropy coding for the proposed method in which the symbols for LUs quantized by LNQ are clustered. The proposed method could provide better performance because entropy coding gain varies according to the configuration of the symbols to be coded.
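The benefit of grouping can be illustrated with a small sketch that compares a rate estimate for grouped and ungrouped symbols. Zero-order empirical entropy is used here only as a rough stand-in for the context-adaptive coder; the helper names and the toy LU payloads are hypothetical.

```python
import math
from collections import Counter


def entropy_bits(symbols):
    """Zero-order empirical entropy of a symbol list, in total bits."""
    n = len(symbols)
    return -sum(c * math.log2(c / n) for c in Counter(symbols).values())


def group_lus(encoded_lus):
    """Split LU payloads into an LNQ group and a non-LNQ group, keeping LU order in each."""
    lnq_symbols, plain_symbols = [], []
    for flag, symbols in encoded_lus:
        (lnq_symbols if flag else plain_symbols).extend(symbols)
    return lnq_symbols, plain_symbols


# (LNQ flag, symbols) per LU; LNQ blocks only contain -1, 0, and 1.
lus = [(1, [-1, 0, 0, 1]), (0, [5, -6, 0, 7]), (1, [0, 0, -1, 1]), (0, [-5, 6, 7, 0])]
lnq_symbols, plain_symbols = group_lus(lus)
mixed = [s for _, symbols in lus for s in symbols]

separate_bits = entropy_bits(lnq_symbols) + entropy_bits(plain_symbols)
mixed_bits = entropy_bits(mixed)
print(separate_bits, mixed_bits)
```

In this toy case the separately modeled groups need fewer estimated bits than the interleaved sequence, because the ternary symbols are no longer mixed with the wider range of uniformly quantized levels.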

## 4. Experimental Results

To evaluate the proposed LNQ, it was implemented on NCTM 3.0. In the experiment, three trained neural network models included in the NNR common test conditions (CTCs) [14] for image classification (VGG-16, ResNet50, and MobileNetV2) and one model for audio genre classification (DCase) were used as input. The pruning method in NCTM 3.0 was applied as a preprocessing method. For a fair comparison, the compression ratios of NCTM 3.0 and the proposed method were compared with the loss in task performance kept within 2% of the original model. Two of the models (VGG-16 and ResNet50) were also compared with the approach of Han et al. [1].

Table 1 shows the experimental conditions for the input models. In Test 1, quantization or LNQ was applied to the original input model. Test 2 applied LNQ to the pruned models after uniform quantization, as shown in Fig. 5 [3].

Table 2 shows the parameter distribution by layer type in each model. Unlike NCTM 3.0, in which compression is applied regardless of the layer type, the proposed LNQ was applied only to 2D convolutional layers and fully connected layers to minimize task performance loss. Therefore, the proposed method can be applied effectively to the VGG-16 and DCase models, but could be inefficient for MobileNetV2, in which LNQ is applicable to only 37% of the parameters.

##### Table 1. Experimental cases.

| Test | Input model |
|---|---|
| Test 1 | Original model |
| Test 2 | Pruning model [3] |

##### Table 2. Parameter distributions by layer type.

| Model | 2D Convolution | PW & DW Convolution | FC |
|---|---|---|---|
| VGG-16 | 11% | - | 89% |
| ResNet50 | 44% | 47% | 9% |
| MobileNetV2 | 1% | 63% | 36% |
| DCase | 72% | - | 28% |

Coding gain was evaluated based on the compression ratio ($CR$), which is defined as the ratio of the compressed bitstream size to the original model size, expressed as a percentage: $CR=\frac{\text{compressed model size}}{\text{original model size}} \times 100\ (\%)$. In addition, for more detailed analysis, the LNQ selection ratio $SL$, given by (1), is presented in the experimental results in Tables 3 and 4:

##### (1)
$SL=\frac{W_{LNQ}}{W_{all}}$

where $W_{LNQ}$ is the total number of weights in the LUs for which LNQ is selected, and $W_{all}$ is the total number of weights in the overall model.
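The two metrics can be checked with a short sketch. The helper names, the byte sizes, and the LU sizes below are hypothetical; only the CR values 2.43 and 2.28 are taken from the VGG-16 rows of Table 3.

```python
def compression_ratio(compressed_size, original_size):
    """CR in percent: compressed size relative to the original size (lower is better)."""
    return 100.0 * compressed_size / original_size


def compression_factor(cr_percent):
    """Original-to-compressed size ratio, i.e., the 'Nx' figure given in parentheses."""
    return 100.0 / cr_percent


def lnq_selection_ratio(lu_sizes, lnq_flags):
    """SL: fraction of weights that lie in LUs for which LNQ was selected."""
    w_lnq = sum(n for n, flag in zip(lu_sizes, lnq_flags) if flag)
    return w_lnq / sum(lu_sizes)


# Hypothetical sizes in bytes.
cr = compression_ratio(compressed_size=25, original_size=1000)

# Table 3, VGG-16: NCTM 3.0 at 2.43% versus the proposed method at 2.28%.
factor_proposed = round(compression_factor(2.28), 1)
gain_over_nctm = round(2.43 / 2.28, 2)

# Four hypothetical 8x8 LUs, three of which selected LNQ.
sl = lnq_selection_ratio([64, 64, 64, 64], [1, 0, 1, 1])
print(cr, factor_proposed, gain_over_nctm, sl)
```

The gain of the proposed method over NCTM 3.0 reported in the tables corresponds to the ratio of the two CR values, as the VGG-16 example shows.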

Tables 3 and 4 show the compression results for the neural network models, comparing the proposed method with NCTM 3.0 in terms of compression ratio. The compression ratio is the size of the compressed model as a percentage of the original model size, so lower values are better. In addition, the ratio of the original model size to that of the compressed model is shown in parentheses. As shown in Tables 3 and 4, the proposed LNQ method improved performance compared to NCTM 3.0 for all input models. That is, the LNQ method compressed the models 1.08 and 1.29 times more than NCTM 3.0 overall in Test 1 and Test 2, respectively, while staying within the same task performance loss of 2%.

In the summary of the experimental results in Table 5, note that the proposed method improved compression performance by 8% and 29% in Test 1 and Test 2, respectively, compared to NCTM 3.0. In Test 2, VGG-16 and DCase showed improvements of about 22% and 46%, respectively. Although smaller, a meaningful gain of 1% to 8% was obtained even in the relatively small-sized models (ResNet50 and MobileNetV2). The reason the small-sized models improved less is that MobileNetV2 and ResNet50 are mainly composed of point-wise (PW) and depth-wise (DW) convolution layers, to which the proposed method is not applied. Note that when the portion of PW and DW convolutional layers in each model is too small, there is almost no effect on compression performance. In addition, the performance robustness in the PW and DW layers was lower than in convolutional and FC layers.

##### Table 3. Results from compression of neural network models with the proposed method (Test 1, Anchor NCTM 3.0).

| Task | Input Model | Original Accuracy (top-1) [%] | Compression Method | Compression Ratio CR [%] | SL [%] | Proposed Method over NCTM 3.0 |
|---|---|---|---|---|---|---|
| ImageNet Classification | VGG-16 | 69.70 | Han et al. [1] | 3.71 (26.9x) | - | |
| | | | NCTM 3.0 | 2.43 (41.1x) | - | |
| | | | Proposed method | 2.28 (43.9x) | 52.4 | 1.07x |
| | ResNet50 | 76.13 | NCTM 3.0 | 6.21 (16.1x) | - | |
| | | | Proposed method | 6.19 (16.2x) | 24.3 | 1.01x |
| | MobileNetV2 | 70.69 | NCTM 3.0 | 14.75 (7.1x) | - | |
| | | | Proposed method | 14.42 (7.5x) | 14.7 | 1.02x |
| DCase Classification | DCase 2017 | 58.31 | NCTM 3.0 | 7.31 (13.6x) | - | |
| | | | Proposed method | 5.92 (16.9x) | 66.4 | 1.24x |
| Overall | | | | | 40.2 | 1.08x |

##### Table 4. Results from compression of neural network models with the proposed method (Test 2, Anchor NCTM 3.0).

| Task | Input Model | Original Accuracy (top-1) [%] | Compression Method | Compression Ratio CR [%] | SL [%] | Proposed Method over NCTM 3.0 |
|---|---|---|---|---|---|---|
| ImageNet Classification | VGG-16 | 69.43 | Han et al. [1] | 2.05 (48.7x) | - | |
| | | | NCTM 3.0 | 1.57 (63.7x) | - | |
| | | | Proposed method | 1.28 (78.1x) | 66.8 | 1.22x |
| | ResNet50 | 76.13 | Han et al. [1] | 6.01 (16.8x) | - | |
| | | | NCTM 3.0 | 5.95 (16.8x) | - | |
| | | | Proposed method | 5.89 (17.0x) | 30.2 | 1.01x |
| | MobileNetV2 | 70.69 | NCTM 3.0 | 14.31 (7.0x) | - | |
| | | | Proposed method | 13.32 (7.5x) | 15.6 | 1.08x |
| DCase Classification | DCase 2017 | 58.31 | NCTM 3.0 | 4.12 (24.3x) | - | |
| | | | Proposed method | 2.82 (35.5x) | 70.3 | 1.46x |
| Overall | | | | | 45.7 | 1.29x |

Table 6 shows the relative compression performance and SL of the proposed method in Test 2 compared to Test 1. As shown in Table 6, the compression performance in Test 2 (which used the pruned model as input) improved by about 50% compared to Test 1 (which used the original model as input). In addition, we confirmed that SL in Test 2 improved by about 16% compared to Test 1. The pruned model of NCTM 3.0 [3,17] was derived so that the distribution of weights consists of values close to zero, as well as negative and positive values. Therefore, the distortion performance and SL of LNQ blocks, which undergo non-linear quantization in ternary form, can be improved. Furthermore, since even the parameters of blocks to which LNQ was not applied had near-zero values, both types of blocks can obtain entropy coding gains when LNQ is applied.

Fig. 8 shows the classification performance of the compressed VGG-16 in Test 2. As shown in Fig. 8, the compression performance of the proposed method was better than that of NCTM 3.0.

##### Table 5. Summary of results from the proposed method.

| Task | Input Model | Over NCTM 3.0, Test 1 (original model) | Over NCTM 3.0, Test 2 (pruned model) |
|---|---|---|---|
| ImageNet Classification | VGG-16 | 1.07x | 1.22x |
| | ResNet50 | 1.01x | 1.01x |
| | MobileNetV2 | 1.02x | 1.08x |
| DCase Classification | DCase | 1.24x | 1.46x |
| Overall | | 1.08x | 1.29x |

##### Table 6. Results from Test 2 over Test 1 from the proposed method.

| Task | Input Model | Test 2 over Test 1: Performance | Test 2 over Test 1: SL |
|---|---|---|---|
| ImageNet Classification | VGG-16 | 1.78x | 1.27x |
| | ResNet50 | 1.05x | 1.24x |
| | MobileNetV2 | 1.08x | 1.06x |
| DCase Classification | DCase | 2.09x | 1.06x |
| Overall | | 1.50x | 1.16x |

## 5. Conclusion

In this paper, we proposed a Local Non-linear Quantization method based on k-means clustering to compress the weight parameters of neural network models. The proposed method partitions the weight matrix of a convolutional or fully connected layer into non-overlapping block-shaped units called LNQ units (LUs), and the weights in each LU are then quantized into binary or ternary values using k-means clustering. The experimental results show that the proposed method achieved a compression gain of approximately 29% on the pruned network models, compared to NCTM 3.0, with the same negligible loss in task performance.

### ACKNOWLEDGMENTS

This work was supported in part by a National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2020R1F1A1068106), and in part by an Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-01351, Development of Ultra Low-Power Mobile Deep Learning Semiconductor with Compression/Decompression of Activation/Kernel Data).

### REFERENCES

1. Han S., et al., Feb. 2016, Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, in Proc. Int. Conf. Learning Represent. (ICLR)
2. Aytekin C., Cricri F., Wang T., Aksu E., Mar. 2019, Response to the Call for Proposals on Neural Network Compression: Training Highly Compressible Neural Networks, ISO/IEC JTC1/SC29/WG11, m47379
3. Jung S., Son C., Lee S., Han J., Kwak Y., Hwang S., Jun. 2019, Learning to Quantize Deep Networks by Optimizing Quantization Intervals with Task Loss, in Proc. Int. Conf. Comput. Vis. Pattern Recognit. (CVPR)
4. Lin S., Ji R., Chen C., Tao D., Luo J., Dec. 2019, Holistic CNN Compression via Low-Rank Decomposition with Knowledge Transfer, IEEE Trans. Pattern Anal. Mach. Intell., Vol. 41, No. 12
5. Bailer W., Apr. 2021, DoC on ISO/IEC DIS 15938-17 Compression of Neural Networks for Multimedia Content Description and Analysis, ISO/IEC JTC1/SC29/WG04, N00079
6. Bailer W., et al., Jan. 2020, Test Model 3 of Compression of Neural Networks for Multimedia Content Description and Analysis, ISO/IEC JTC1/SC29/WG11, N18993
7. PyTorch, [Online]. Available: https://pytorch.org
8. TensorFlow, [Online]. Available: https://www.tensorflow.org/
9. Open Neural Network Exchange
10. Neural Network Exchange Format (The Khronos NNEF Working Group)
11. TensorFlow Model Optimization Toolkit - Pruning API, [Online]. Available: https://blog.tensorflow.org/2019/05/tf-model-optimization-toolkit-pruning-API.html
12. Wiedemann S., et al., May 2019, DeepCABAC: Context-adaptive Binary Arithmetic Coding for Deep Neural Network Compression, in Proc. Int. Conf. Mach. Learning (ICML)
13. Moon H., Kim J.-G., Kim S., Jang S., Choi B., Jun. 2020, [NNR] CE-4 Report on Neural Network Compression: Local Non-linear Quantization (Method 12), ISO/IEC JTC1/SC29/WG11, m54386
14. Bailer W., et al., Jul. 2019, Evaluation Framework for Compression of Neural Networks for Multimedia Content Description and Analysis, ISO/IEC JTC1/SC29/WG11, N18575
15. Kirchhoffer H., et al., Overview of the Neural Network Compression and Representation (NNR) Standard, IEEE Trans. Circuits Syst. Video Technol., to be published
16. Wiedemann S., et al., Apr. 2020, [NNR] CE2-CE3-related: Local parameter scaling, ISO/IEC JTC1/SC29/WG11, m53517
17. Aytekin C., et al., Feb. 2019, Compressibility Loss for Network Weights, arXiv:1905.01044

## Author

##### HyeonCheol Moon

HyeonCheol Moon received a BSc and an MSc in electrical engineering from Korea Aerospace University, Korea, in 2018 and 2020, respectively. He is currently a researcher with the Korea Electronics Technology Institute (KETI). His research interests include image/video processing, and neural network compression.

##### Jae-Gon Kim

Jae-Gon Kim received a BSc in electronics engineering from Kyungpook National University, Korea, in 1990, and an MSc and a PhD in electrical engineering from the Korea Advanced Institute of Science and Technology (KAIST), Korea, in 1992 and 2005, respectively. From 1992 to 2007, he was with the Electronics and Telecommunications Research Institute (ETRI), where he was involved in the development of digital technologies. From 2001 to 2002, he was a staff associate in the Department of Electrical Engineering at Columbia University in New York. He is currently a professor in the School of Electronics and Information Engineering at Korea Aerospace University, Korea. His research interests include image/video coding, video signal processing, and immersive video.