1. Introduction
In recent years, deep Convolutional Neural Networks (CNNs) have shown excellent performance
in various applications, such as object detection, image and video classification,
and image quality enhancement. However, these CNNs require huge amounts of memory and computational resources, which hinders their deployment on resource-limited equipment such as mobile devices and Internet of Things (IoT) devices. Therefore, compression of network models, which reduces the number of weight parameters while maintaining the task performance of the trained model, is being studied [1].
In general, a neural network model is composed of a number of layers, and the weight
parameters of each layer can be represented in matrix form. Therefore, neural network
compression aims to reduce the data size for model parameter representation. Neural
network compression typically includes the following methods: pruning, which eliminates connections between weights and nodes that do not significantly affect model performance [2]; quantization and entropy coding, which reduce the number of bits needed to represent the weights by reducing the precision and statistical redundancy of the parameters in each layer [3]; and matrix decomposition, which reduces the number of weights by decomposing a weight matrix into two or more smaller matrices [4].
Recently, the Moving Picture Experts Group (MPEG) developed Neural Network coding
and Representation (NNR) as the first standard for efficient compression of neural
networks (NNs) [5]. NNR aims to define the compressed representation of trained neural networks in an
interoperable form. In the standardization process for NNR, reference software called
the Neural network Compression Test Model (NCTM) [6] was developed and released for verification of the proposed technologies. In addition,
the NNR standard supports the most common neural network formats, such as PyTorch [7], TensorFlow [8], ONNX [9], and NNEF [10].
In NNR in particular, pruning and matrix decomposition are applied as preprocessing methods for parameter reduction, while quantization and entropy coding are applied as parameter coding methods.
In this paper, we propose a Local Non-linear Quantization (LNQ) method for neural
network compression. The proposed method divides the overall weight matrix into LNQ units (LUs) and then quantizes the weights in each LU into ternary format.
The rest of the paper is organized as follows. Section 2 introduces the NN compression
methods of NCTM. The proposed local non-linear quantization method is presented in
Section 3. In Section 4, the performance analysis of the proposed method, based on experimental results, is presented. Finally, we conclude this paper in Section 5.
2. Overview of NCTM
In the development of MPEG NNR, various compression methods were studied to efficiently represent large-volume neural network models [2,16]. Since most of the model data consists of weight parameters, compression of a neural network model in NNR focuses on reducing the total amount of weight representation to be stored and/or transmitted.
Fig. 1 shows the overall architecture of the NCTM. An original input model is compressed
with the encoding pipeline of the NCTM, which consists of parameter reduction, parameter
approximation, and entropy coding. In advance of parameter encoding, the parameters
of the input model are reduced by parameter reduction methods (such as pruning) for
efficient weight encoding. Then, the weights are compressed into the bitstream by
parameter quantization and entropy coding. On the NCTM decoder side, the network model
is reconstructed from the compressed bitstream by entropy decoding and dequantization.
First of all, as shown in Fig. 1, pruning is applied to the input model to reduce weight parameters in the NCTM encoder
[4]. To reduce the number of weights included in the model, the weights of the connections
between nodes of layers are eliminated, as shown in Fig. 2 [11]. The criterion for pruning a connection is determined by the importance of each weight to the performance of the intended task. For instance, a very small weight is considered to have a negligible impact on the task performance of the model, so the corresponding connection between the nodes is eliminated by setting the weight value to 0. In addition, to achieve the model’s target compression ratio, a threshold value that satisfies the required sparsity is derived, and the threshold is then applied to prune the weights.
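As an illustration of this thresholding step, the following sketch (our own simplification, not the NCTM pruning routine) derives the magnitude threshold from a required sparsity and zeroes out the corresponding weights; the 80% sparsity used in the example is only a hypothetical value.

```python
# Illustrative sketch of magnitude-based pruning with a sparsity-derived threshold.
# This is a simplified example, not the NCTM pruning implementation.
import numpy as np

def prune_by_sparsity(weights: np.ndarray, target_sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude weights until `target_sparsity`
    (fraction of zero weights) is reached."""
    magnitudes = np.abs(weights).ravel()
    k = int(target_sparsity * magnitudes.size)            # number of weights to remove
    if k == 0:
        return weights.copy()
    threshold = np.partition(magnitudes, k - 1)[k - 1]    # k-th smallest magnitude
    return np.where(np.abs(weights) <= threshold, 0.0, weights)

# Hypothetical example: prune 80% of a random 8x8 weight matrix
w = np.random.randn(8, 8).astype(np.float32)
w_pruned = prune_by_sparsity(w, target_sparsity=0.8)
print("achieved sparsity:", np.mean(w_pruned == 0))
```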
Fig. 3 shows an example of applying quantization and entropy coding to the weight matrix of a neural network model that has been pruned during preprocessing. In quantization, the precision of the parameter representation is reduced. For instance, weights expressed
as 32-bit floating point values in the weight matrix of a layer are mapped to integer
values. Subsequently, quantized weights are converted to bitstream representations
by applying entropy coding, which is lossless compression based on a probability model.
Quantization of NCTM consists of three methods: uniform quantization, codebook-based
quantization that reduces quantization error with variable step sizes, and dependent
scalar quantization that adaptively uses two scalar quantizers.
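As a rough illustration of the uniform case, the sketch below maps 32-bit floating-point weights to integer levels with a single step size per matrix; the max-absolute-value step-size rule and 8-bit width are common illustrative choices, not necessarily the ones used in NCTM.

```python
# Minimal sketch of uniform scalar quantization of floating-point weights.
# The step-size rule here is a simple illustrative choice.
import numpy as np

def uniform_quantize(weights: np.ndarray, num_bits: int = 8):
    """Map float weights to signed integers using one step size per matrix."""
    max_abs = float(np.max(np.abs(weights)))
    levels = 2 ** (num_bits - 1) - 1                 # e.g. 127 for 8 bits
    step = max_abs / levels if max_abs > 0 else 1.0
    q = np.round(weights / step).astype(np.int32)    # integer representation
    return q, step

def dequantize(q: np.ndarray, step: float) -> np.ndarray:
    """Decoder-side reconstruction of the weights."""
    return q.astype(np.float32) * step

w = np.random.randn(4, 4).astype(np.float32)
q, step = uniform_quantize(w, num_bits=8)
w_hat = dequantize(q, step)
print("max quantization error:", float(np.max(np.abs(w - w_hat))))
```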
The entropy coding method in NCTM, called DeepCABAC [12], is context-adaptive binary arithmetic coding (CABAC) based on a context model [15].
Fig. 1. Overall architecture of NCTM.
Fig. 2. Illustration of a pruning method.
Fig. 3. A conceptual example of quantization and entropy coding in NCTM [16].
3. Local Non-linear Quantization
The Local Binary Pattern (LBP) and Local Ternary Pattern (LTP) methods, which are
often used in conventional pattern recognition, define the central value in the local
area as a threshold, and transmit the neighboring coefficient values in binary or
ternary format, respectively.
In this paper, we propose a Local Non-linear Quantization method to compress weight
parameters of a neural network [13]. Since LBP and LTP define the threshold as the value at the center position, the threshold may not be appropriate, and the center value itself must also be transmitted.
To address these problems, the proposed LNQ performs non-linear quantization based
on k-means clustering in a local block unit and transmits the resulting codebook accordingly.
Fig. 4 shows an example of the LNQ method where k-means clustering is performed on coefficient
values in each LU. Then, after determining LNQ fitness (see next paragraph) for each
LU, the LNQ-applied block can transmit an indicator (LNQ_flag in Fig. 4), a codebook, and binary coefficients. On the other hand, a block in which LNQ is
not applied transmits the indicator and the existing quantized coefficients. Fig. 5(b) shows the processing pipeline in NCTM 3.0 where the proposed method is implemented.
In applying the proposed method, it is assumed that input for LNQ is the entire weight
matrix to which uniform quantization was applied in advance.
Fig. 6 is a flow chart of the proposed LNQ encoding process that illustrates the details
on the proposed algorithm. For application of LNQ, a quantized weight matrix for each
layer is given, and the value of lambda for the rate-distortion cost calculation,
the number of nodes to be clustered, and the size of the LU to which LNQ is applied
must be set. LNQ is applied only to the fully connected (FC) and convolutional layers
in a neural network model to which uniform quantization was applied globally. In general,
the weights of the pre-trained model are stored in 32-bit floating point values. On
the other hand, quantized weights are stored as integers. In most cases, the convergence
of k-means-based clustering is faster when integers are input. Therefore, we apply
LNQ to uniformly quantized weights.
The detailed process of LNQ is as follows. First, the type of input layer is checked.
If the corresponding layer is not a fully connected or a convolutional layer, LNQ
is not applied, which is indicated and signaled by setting LNQ_Layer_Flag = 0. In
a fully connected or convolutional layer, the weight matrix of the corresponding layer is partitioned into LUs without overlap. In this paper, the LU size is set to 8x8 for fully connected layers and to the size of each filter for convolutional layers. After partitioning,
LNQ is selectively applied in the LU based on the RD cost competition. In other words,
LNQ is selected if its RD cost is less than that of uniform quantization in each LU.
Whether LNQ is applied or not is indicated and signaled by using LNQ_Flag. When LNQ
is applied, the results, including clustering coefficients and the generated codebook,
are transmitted in each LU. On the other hand, the uniform quantized coefficients
are transmitted if LNQ is not applied. This process is repeated until reaching the
last LU in the corresponding layer.
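A rough sketch of this per-layer decision loop is given below; it reuses the lnq_block helper from the previous sketch. The rate-distortion cost model (squared error plus lambda times an estimated rate) and the bit-count estimates are simplifying assumptions of ours; the actual implementation measures the rate with its entropy coder.

```python
# Sketch of the per-layer LNQ encoding decision described above: partition the
# uniformly quantized weight matrix into 8x8 LUs and choose, per LU, between
# LNQ and the existing uniform quantization by comparing RD costs.
import numpy as np

LU_SIZE = 8        # 8x8 LUs for fully connected layers, as in the paper
LAMBDA = 0.1       # hypothetical rate-distortion trade-off parameter

def rd_cost(distortion: float, bits: float, lam: float = LAMBDA) -> float:
    return distortion + lam * bits

def encode_fc_layer(qweights: np.ndarray):
    """Return a per-LU list of (row, col, LNQ_Flag) decisions."""
    decisions = []
    rows, cols = qweights.shape
    for r in range(0, rows, LU_SIZE):
        for c in range(0, cols, LU_SIZE):
            lu = qweights[r:r + LU_SIZE, c:c + LU_SIZE]
            codebook, idx = lnq_block(lu, k=3)                # from the previous sketch
            d_lnq = float(np.sum((lu - codebook[idx]) ** 2))  # extra distortion of LNQ
            bits_lnq = 32 * len(codebook) + np.log2(3) * lu.size  # codebook + ternary symbols
            d_uni = 0.0                                       # LU kept as-is
            bits_uni = 8.0 * lu.size                          # assumed bits for quantized values
            lnq_flag = rd_cost(d_lnq, bits_lnq) < rd_cost(d_uni, bits_uni)
            decisions.append((r, c, lnq_flag))
    return decisions
```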
In addition, in order to further enhance compression efficiency, a method for grouping
LUs to be transmitted (depending on whether LNQ is applied or not) in each layer was
also devised. In this paper, entropy coding is performed separately within each layer depending on whether LNQ is applied. Since each group has a distinct range of coefficients, context modeling is updated in sequence. For instance, the blocks to which LNQ was applied have only coefficient values of -1, 0, and 1, so context modeling can be configured
accordingly. Fig. 7 shows an example of entropy coding for the proposed method in which the symbols for
LUs quantized by LNQ are clustered. The proposed method could provide better performance
because entropy coding gain varies according to the configuration of the symbols to
be coded.
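Conceptually, the grouping can be pictured as collecting the symbols of LNQ LUs and non-LNQ LUs into two separate streams, each of which is then entropy coded with its own context model; the sketch below only illustrates that layout and is not the DeepCABAC interface.

```python
# Conceptual sketch: gather symbols of LNQ LUs (ternary values in {-1, 0, 1})
# and of non-LNQ LUs (full-range quantized values) into two separate streams,
# so each stream can be coded with a context model matched to its statistics.
def group_symbols(lus, flags):
    """lus: list of 2-D integer arrays; flags: list of booleans (True = LNQ applied)."""
    lnq_stream, uniform_stream = [], []
    for lu, is_lnq in zip(lus, flags):
        target = lnq_stream if is_lnq else uniform_stream
        target.extend(int(v) for v in lu.ravel())
    return lnq_stream, uniform_stream
```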
Fig. 4. An example of LNQ.
Fig. 5. Processing pipeline: (a) NCTM 3.0; (b) proposed method (LNQ) in NCTM 3.0.
Fig. 6. A flow chart for LNQ.
Fig. 7. An example of entropy coding for the proposed method.
4. Experimental Results
To evaluate the proposed LNQ, it was implemented on NCTM 3.0. In the experiment,
three trained neural network models included in the NNR common test conditions (CTCs)
[14], VGG-16, ResNet50, and MobileNetV2 for image classification, and one model, DCase,
for audio genre classification, were used as input. The pruning method in NCTM 3.0
was applied as a preprocessing method. For fair comparisons, the compression ratios of NCTM 3.0 and the proposed method were compared under a task performance loss within 2% of the original model. Two of the models (VGG-16 and ResNet50) were also compared with Han et al.’s approach [1].
Table 1 shows the experimental conditions for the input models. In Test 1, quantization or
LNQ was applied to the original input model. Test 2 applied LNQ to the pruned models
after uniform quantization, as shown in Fig. 5 [3].
Table 2 shows the parameter distribution by layer type in each model. Unlike NCTM 3.0, in
which compression processing was applied regardless of the layer type, the proposed
LNQ was applied only to 2D convolutional layers and fully connected layers to minimize
task performance loss. Therefore, the proposed method can be effectively applied to the VGG-16 and DCase models, but could be inefficient for MobileNetV2, in which LNQ is applicable to only 37% of the parameters.
Table 1. Experimental cases.
| | Input model |
| Test 1 | Original model |
| Test 2 | Pruned model [3] |
Table 2. Parameter distributions by layer type.
| Model | 2D Convolution | PW & DW Convolution | FC |
| VGG-16 | 11% | - | 89% |
| ResNet50 | 44% | 47% | 9% |
| MobileNetV2 | 1% | 63% | 36% |
| DCase | 72% | - | 28% |
Coding gain was evaluated based on the compression ratio ($CR$), which is defined as the ratio of the compressed bitstream size to the original model size:

$CR = \frac{\text{compressed\_model\_size}}{\text{original\_model\_size}} \times 100\,(\%)$

In addition, for more detailed analysis, the LNQ selection ratio $SL$, given by (1), is presented in the experimental results in Tables 3 and 4:

$SL = \frac{W_{LNQ}}{W_{all}} \times 100\,(\%) \qquad (1)$

where $W_{LNQ}$ is the total number of weights in the LUs for which LNQ is selected, and $W_{all}$ is the total number of weights in the overall model.
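As a worked check on the notation (using values from Table 3, Test 1), the VGG-16 entry gives $CR = 2.28\%$, which corresponds to the bracketed factor $\frac{\text{original\_model\_size}}{\text{compressed\_model\_size}} = \frac{100}{2.28} \approx 43.9\times$.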
Tables 3 and 4 show the results from compressing the neural network models with the proposed method, compared with NCTM 3.0, in terms of compression ratio. The compression ratio is the percentage of the compressed model’s size relative to the original model size; in addition, the ratio of the original model size to that of the compressed model is shown in brackets. As shown in Tables 3 and 4, the proposed LNQ method improved performance over NCTM 3.0 for all input models. That is, the LNQ method improved the overall compression ratio by factors of 1.08 and 1.29 over NCTM in Test 1 and Test 2, respectively, while staying within the same task performance loss of 2%.
In the summary of the experimental results in Table 5, note that the proposed method improved performance by 8% and 29% in Test 1 and Test 2, respectively, compared to NCTM 3.0. In Test 2, VGG-16 and DCase showed about 22% and 46% performance improvement, respectively. On the other hand, although not large, a meaningful gain of 1~8% was obtained even for the relatively small models (ResNet50 and MobileNetV2). The reason the small models showed relatively less improvement is that MobileNetV2 and ResNet50 are mainly composed of point-wise (PW) and depth-wise (DW) convolution layers, to which the proposed method is not applied. Note that when the portion of 2D convolutional and FC layers in a model is small, LNQ has almost no effect on compression performance. In addition, the robustness of task performance in the PW and DW layers was lower than in the 2D convolutional and FC layers.
Table 3. Results from compression of neural network models with the proposed method (Test 1, Anchor NCTM 3.0).
| Task | Input Model | Original Accuracy (top 1) [%] | Compression Method | Compression Ratio CR [%] | SL [%] | Proposed Method over NCTM 3.0 |
| ImageNet Classification | VGG-16 | 69.70 | Han et al. [1] | 3.71 (26.9x) | - | 1.07x |
| | | | NCTM 3.0 | 2.43 (41.1x) | | |
| | | | Proposed method | 2.28 (43.9x) | 52.4 | |
| | ResNet50 | 76.13 | NCTM 3.0 | 6.21 (16.1x) | - | 1.01x |
| | | | Proposed method | 6.19 (16.2x) | 24.3 | |
| | MobileNetV2 | 70.69 | NCTM 3.0 | 14.75 (7.1x) | - | 1.02x |
| | | | Proposed method | 14.42 (7.5x) | 14.7 | |
| DCase Classification | DCase 2017 | 58.31 | NCTM 3.0 | 7.31 (13.6x) | - | 1.24x |
| | | | Proposed method | 5.92 (16.9x) | 66.4 | |
| Overall | | | | | 40.2 | 1.08x |
Table 4. Results from compression of neural network models with the proposed method (Test 2, Anchor NCTM 3.0).
| Task | Input Model | Original Accuracy (top 1) [%] | Compression Method | Compression Ratio CR [%] | SL [%] | Proposed Method over NCTM 3.0 |
| ImageNet Classification | VGG-16 | 69.43 | Han et al. [1] | 2.05 (48.7x) | - | 1.22x |
| | | | NCTM 3.0 | 1.57 (63.7x) | | |
| | | | Proposed method | 1.28 (78.1x) | 66.8 | |
| | ResNet50 | 76.13 | Han et al. [1] | 6.01 (16.8x) | - | 1.01x |
| | | | NCTM 3.0 | 5.95 (16.8x) | | |
| | | | Proposed method | 5.89 (17.0x) | 30.2 | |
| | MobileNetV2 | 70.69 | NCTM 3.0 | 14.31 (7.0x) | - | 1.08x |
| | | | Proposed method | 13.32 (7.5x) | 15.6 | |
| DCase Classification | DCase 2017 | 58.31 | NCTM 3.0 | 4.12 (24.3x) | - | 1.46x |
| | | | Proposed method | 2.82 (35.5x) | 70.3 | |
| Overall | | | | | 45.7 | 1.29x |
Table 6 shows the relative compression performance and SL of the proposed method in Test 2 compared to Test 1. As shown in Table 6, the compression performance in Test 2 (which used the pruned model as input) was improved by about 50% compared to Test 1 (which used the original model as input). In addition, we confirmed that SL in Test 2 was improved by about 16% compared to Test 1. The pruning model of NCTM 3.0 [3,17] was derived so that the distribution of weights consists of values close to zero, as well as negative and positive values. Therefore, the distortion performance and SL in LNQ blocks, which undergo non-linear quantization in ternary form, can be improved. Furthermore, since even the parameters of blocks to which LNQ was not applied had near-zero values, both types of blocks can obtain entropy coding gains when LNQ is applied.
Fig. 8 shows the classification performance of the compressed VGG-16 in Test 2. As shown in Fig. 8, the compression performance of the proposed method was better than that of NCTM 3.0.
Fig. 8. Classification performance of VGG-16 in Test 2.
Table 5. Summary of results from the proposed method.
| Task | Input Model | Test 1 (original model) over NCTM 3.0 | Test 2 (pruned model) over NCTM 3.0 |
| ImageNet Classification | VGG-16 | 1.07x | 1.22x |
| | ResNet50 | 1.01x | 1.01x |
| | MobileNetV2 | 1.02x | 1.08x |
| DCase Classification | DCase | 1.24x | 1.46x |
| Overall | | 1.08x | 1.29x |
Table 6. Results from Test 2 over Test 1 from the proposed method.
| Task | Input Model | Performance (Test 2 over Test 1) | SL (Test 2 over Test 1) |
| ImageNet Classification | VGG-16 | 1.78x | 1.27x |
| | ResNet50 | 1.05x | 1.24x |
| | MobileNetV2 | 1.08x | 1.06x |
| DCase Classification | DCase | 2.09x | 1.06x |
| Overall | | 1.50x | 1.16x |
5. Conclusion
In this paper, we proposed a Local Non-linear Quantization method based on k-means
clustering to compress the weight parameters of neural network models. The proposed
method partitions the overall weight matrix of a fully connected or convolutional layer into non-overlapping block-shaped units called LNQ units (LUs), and the weights are then quantized into binary or ternary values using k-means clustering within the LUs. The experimental results show
that the proposed method achieved an approximate 29% compression gain in the network
models, compared to NCTM 3.0, with the same negligible loss in task performance.
ACKNOWLEDGMENTS
This work was supported in part by a National Research Foundation of Korea (NRF)
grant funded by the Korea government (MSIT) (No. 2020R1F1A1068106), and in part by
an Institute for Information & communications Technology Promotion (IITP) grant funded
by the Korea government (MSIT) (No. 2019-0-01351, Development of Ultra Low-Power Mobile
Deep Learning Semiconductor with Compression/Decompression of Activation/Kernel Data).
REFERENCES
[1] Han S., et al., "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding," in Proc. Int. Conf. Learning Represent. (ICLR), Feb. 2016.
[2] Aytekin C., Cricri F., Wang T., Aksu E., "Response to the Call for Proposals on Neural Network Compression: Training Highly Compressible Neural Networks," ISO/IEC JTC1/SC29/WG11, m47379, Mar. 2019.
[3] Jung S., Son C., Lee S., Han J., Kwak Y., Hwang S., "Learning to Quantize Deep Networks by Optimizing Quantization Intervals with Task Loss," in Proc. Int. Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019.
[4] Lin S., Ji R., Chen C., Tao D., Luo J., "Holistic CNN Compression via Low-Rank Decomposition with Knowledge Transfer," IEEE Trans. Pattern Anal. Mach. Intell., Vol. 41, No. 12, Dec. 2019.
[5] Bailer W., "DoC on ISO/IEC DIS 15938-17 Compression of Neural Networks for Multimedia Content Description and Analysis," ISO/IEC JTC1/SC29/WG04, N00079, Apr. 2021.
[6] Bailer W., et al., "Test Model 3 of Compression of Neural Networks for Multimedia Content Description and Analysis," ISO/IEC JTC1/SC29/WG11, N18993, Jan. 2020.
[7] PyTorch. [Online]. Available: https://pytorch.org
[8] TensorFlow. [Online]. Available: https://www.tensorflow.org/
[9] Open Neural Network Exchange (ONNX).
[10] Neural Network Exchange Format (NNEF), The Khronos NNEF Working Group.
[11] TensorFlow Model Optimization Toolkit - Pruning API. [Online]. Available: https://blog.tensorflow.org/2019/05/tf-model-optimization-toolkit-pruning-API.html
[12] Wiedemann S., et al., "DeepCABAC: Context-adaptive Binary Arithmetic Coding for Deep Neural Network Compression," in Proc. Int. Conf. Mach. Learning (ICML), May 2019.
[13] Moon H., Kim J.-G., Kim S., Jang S., Choi B., "[NNR] CE-4 Report on Neural Network Compression: Local Non-linear Quantization (Method 12)," ISO/IEC JTC1/SC29/WG11, m54386, Jun. 2020.
[14] Bailer W., et al., "Evaluation Framework for Compression of Neural Networks for Multimedia Content Description and Analysis," ISO/IEC JTC1/SC29/WG11, N18575, Jul. 2019.
[15] Kirchhoffer H., et al., "Overview of the Neural Network Compression and Representation (NNR) Standard," IEEE Trans. Circuits Syst. Video Technol., to be published.
[16] Wiedemann S., et al., "[NNR] CE2-CE3-related: Local Parameter Scaling," ISO/IEC JTC1/SC29/WG11, m53517, Apr. 2020.
[17] Aytekin C., et al., "Compressibility Loss for Network Weights," arXiv:1905.01044, 2019.
Authors
HyeonCheol Moon received a BSc and an MSc in electrical engineering from Korea
Aerospace University, Korea, in 2018 and 2020, respectively. He is currently a researcher
with the Korea Electronics Technology Institute (KETI). His research interests include
image/video processing, and neural network compression.
Jae-Gon Kim received a BSc in electronics engineering from Kyungpook National University,
Korea, in 1990, and an MSc and a PhD in electrical engineering from the Korea Advanced
Institute of Science and Technology (KAIST), Korea, in 1992 and 2005, respectively.
From 1992 to 2007, he was with the Electronics and Telecommunications Research Institute
(ETRI), where he was involved in the development of digital technologies. From 2001
to 2002, he was a staff associate in the Department of Electrical Engineering at Columbia
University in New York. He is currently a professor in the School of Electronics and
Information Engineering at Korea Aerospace University, Korea. His research interests
include image/video coding, video signaling processing, and immersive video.