3.1 Model Construction
Because a CNN can automatically extract image features, it can learn from and train on data in a way that imitates brain tissue, and can accomplish tasks such as image classification. The CNN has been widely used in image processing. The basic components of a CNN are the data input layer, the convolution calculation layer, the activation (excitation) layer, the pooling layer, the fully connected layer, and the output layer [16]. The convolutional layer is the core of a CNN. Its function is to extract the features of the image and strengthen the expressive ability of the network through the convolution operation [17]. The calculation of the convolutional layer is divided into two steps: first, image position information is captured, and then feature extraction is performed on the captured image. The change in image size after the first convolution operation is calculated as seen in (1):
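Eq. (1) itself is not reproduced legibly in the source; a plausible reconstruction, using the standard output-size formula for a convolution and consistent with the symbols defined below (here $H_{in}$ and $W_{in}$ are the input height and width, $K$ the kernel size, and $P$ the padding, all assumed symbols not named in the text), is:

$H_{out}=\frac{H_{in}-K+2P}{Stride}+1,\quad W_{out}=\frac{W_{in}-K+2P}{Stride}+1$ (1)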
In (1), $H_{out}$ is the height after convolution, $W_{out}$ is the width after convolution,
and $Stride$ is the step size. Rapid development of the CNN has enabled ordinary people
to imitate the style of a famous artist’s paintings to create works of art in that
style, which is called style migration. The traditional style migration network mainly
uses the VGG-19 model to extract the texture features and content of the image. The network defines a content LOSS function and a style LOSS function, and the final LOSS function is obtained by weighting these two. The final LOSS function is minimized through continuous iterative training to obtain the image after style rendering.
A common style migration model is shown in Fig. 1.
As seen in Fig. 1, common style migration algorithms must separately optimize a blank noise image for every input, which is computationally expensive. Therefore, this study improves on the traditional style rendering technique by fusing a CNN algorithm with a VGG-19 network and using TensorFlow functions to implement the convolution operations, building a fast style-rendering model. The fast style-rendering model based on the improved CNN is shown in Fig. 2.
Fig. 1. Traditional style-migration model.
Fig. 2. Fast style-rendering model with the improved CNN algorithm.
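As a minimal sketch of the generative part of such a model (the convolution, residual, and deconvolution layers described below), assuming standard TensorFlow/Keras layers; the layer counts and filter sizes here are illustrative assumptions rather than the exact configuration used in this study:

import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters=128):
    # Two 3x3 convolutions with a skip connection (illustrative sizes).
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    return layers.add([x, y])

def build_generator():
    inputs = layers.Input(shape=(None, None, 3))
    # Convolutional (downsampling) layers.
    x = layers.Conv2D(32, 9, strides=1, padding="same", activation="relu")(inputs)
    x = layers.Conv2D(64, 3, strides=2, padding="same", activation="relu")(x)
    x = layers.Conv2D(128, 3, strides=2, padding="same", activation="relu")(x)
    # Residual layers preserve the content structure of the input image.
    for _ in range(5):
        x = residual_block(x, 128)
    # Deconvolution (transposed convolution) layers restore the original resolution.
    x = layers.Conv2DTranspose(64, 3, strides=2, padding="same", activation="relu")(x)
    x = layers.Conv2DTranspose(32, 3, strides=2, padding="same", activation="relu")(x)
    outputs = layers.Conv2D(3, 9, padding="same", activation="sigmoid")(x)
    return tf.keras.Model(inputs, outputs)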
As can be seen in Fig. 2, the fast style-rendering model is divided into two main parts: the generative model
and the loss model. In the generative model, the original image is input, and after
a series of operations, the final output is a similarly styled image. The generative
model is essentially a convolutional neural network structure consisting of a convolutional
layer, a residual layer, and a deconvolution layer. The loss model is essentially
a pre-trained VGG-19 network structure that does not require weight updates during
training, but is only used to calculate the loss values for content and style, and
then to update the weights of the generative model through back-propagation. During the training phase, the fast style-rendering model selects a style image, Ys, and a content image, Yc; different combinations of style and content images are trained into different network models. In order to calculate the
difference between resulting image Y and the sample image, the LOSS model is used
to extract the information of these images in different convolutional layers and compare
them. Then, the weights are changed by back-propagation so that the resulting image
Y is close to Ys in terms of style, and close to Yc in terms of content. The weights
are then recorded to obtain a fast style-rendering model for that style. The LOSS
function for the fast style-rendering model is defined as follows:
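Eq. (2) is not reproduced legibly in the source; a plausible reconstruction of the per-layer content (feature) loss, consistent with the symbol definitions that follow, is:

$l_{content}^{\phi ,i}\left(\hat{M},M\right)=\frac{1}{C_{i}H_{i}W_{i}}\left\| \phi _{i}\left(\hat{M}\right)-\phi _{i}\left(M\right)\right\| _{2}^{2}$ (2)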
In Eq. (2), $\phi $ is the pre-trained VGG-19 model, $i$ is the index of the convolutional layer, $\phi _{i}\left(M\right)$ represents the activation values of image $M$ at layer $i$ of the $\phi $ model, $\hat{M}$ represents the generated image after the model update, and $M$ is the original input image. In $C_{i}H_{i}W_{i}$, $C_{i}$ represents the number of channels of the feature image at layer $i$, $H_{i}$ represents the height of the feature image at layer $i$, and $W_{i}$ represents the width of the feature image at layer $i$. In addition, the Gram matrix is also used, and is given in (3):
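Eq. (3) is likewise not legible in the source; a plausible reconstruction of the Gram-matrix style loss, consistent with the definitions below, is:

$l_{style}^{\phi ,i}\left(\hat{M},M\right)=\left\| G_{i}^{\phi }\left(\hat{M}\right)-G_{i}^{\phi }\left(M\right)\right\| _{F}^{2}$ (3)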
In Eq. (3), $F$ denotes the Frobenius norm of the matrix, and $G_{i}^{\phi }\left(M\right)$ is the Gram matrix of the activation values of image $M$ at layer $i$ in the $\phi $ model, which is defined in (4):
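A plausible reconstruction of the Gram matrix of Eq. (4), consistent with the symbols described next, is:

$G_{i}^{\phi }\left(M\right)_{c,c'}=\frac{1}{C_{i}H_{i}W_{i}}\sum _{h=1}^{H_{i}}\sum _{w=1}^{W_{i}}\phi _{i}\left(M\right)_{h,w,c}\,\phi _{i}\left(M\right)_{h,w,c'}$ (4)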
In Eq. (4), $G_{i}^{\phi }\left(M\right)_{c,c'}$ is the correlation between channels $c$ and $c'$ of image $M$, and $\phi _{i}\left(M\right)_{h,w,c}$ is the activation value of image $M$ at layer $i$ in the $\phi $ model at height coordinate $h$, width coordinate $w$, and channel $c$. The total LOSS of the fast style-rendering model is defined in Eq. (5):
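Eq. (5) is not reproduced legibly in the source; a plausible form, in which the content and style losses are summed with weights (the weights $\alpha $ and $\beta $ are assumed symbols), is:

$L_{total}=\alpha \sum _{i}l_{content}^{\phi ,i}\left(\hat{M},Y_{c}\right)+\beta \sum _{i}l_{style}^{\phi ,i}\left(\hat{M},Y_{s}\right)$ (5)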
In Eq. (5), the total loss value of the model is obtained by weighting the style and content
loss values. To avoid subjective evaluation, which is too strongly influenced by personal preference and emotion and can lead to inaccurate results, the final rendered images are assessed mainly with objective evaluation methods: indicators such as information entropy, mean squared error (MSE), peak signal-to-noise ratio (PSNR), and average gradient are used to make a comprehensive evaluation of the model.
Eq. (6) is an expression for information entropy (IE), where $j$ is the grey value, $p_{j}$ represents the proportion of pixels with a grey value of $j$ in the image, and $L$ is the total number of grey levels. The higher the IE, the higher the quality of the rendered image.
Eq. (7) is an expression for MSE. $M\times N$ indicates the size of the image, and a smaller
value indicates a higher-quality rendered image.
Eq. (8) is an expression for PSNR, in which $k$ is the number of binary bits per pixel, with a default of 8. A higher PSNR means less distortion and a better visual appearance of the image.
Eq. (9) is the expression for average gradient in which $\frac{\partial f}{\partial X}$
and $\frac{\partial f}{\partial Y}$ represent the horizontal and vertical gradients,
respectively. The higher the G value, the clearer the image.
Eq. (10) is an expression for the correlation coefficient (R), where a higher R value indicates
higher correlation between the rendered image and the sample.
Eq. (11) is an expression for the mutual information (MI) between the rendered image and the sample image, where $P_{\hat{M}M}\left(a,b\right)$ is their joint grey-level distribution. A larger value for MI indicates higher correlation between the images.
Eq. (12) is an expression for spatial frequency; a higher SF indicates a more spatially active
image, i.e. a clearer image.
Eq. (13) is an expression for the horizontal direction frequency in the spatial frequency.
Eq. (14) is an expression for the frequency in the vertical direction of the spatial frequency.
Eq. (15) is an expression for the structural similarity index (SSIM), where $\mu _{A}$ and $\mu _{B}$ are the mean values of the rendered image and the sample image, respectively; $\sigma _{A}^{2}$ and $\sigma _{B}^{2}$ denote the variances of the rendered image and the sample image, respectively; $\sigma _{AB}$ is their covariance; and $k_{1}$ and $k_{2}$ are constants. A higher SSIM indicates a higher degree of similarity between the two images.
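As a minimal sketch of how several of these objective indicators can be computed with NumPy for grey-scale images (the formulas follow their standard definitions rather than the paper's exact equations, which are not reproduced here):

import numpy as np

def information_entropy(img, levels=256):
    # Shannon entropy of the grey-level histogram (Eq. (6)-style indicator).
    hist, _ = np.histogram(img, bins=levels, range=(0, levels))
    p = hist / hist.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mse(a, b):
    # Mean squared error over an M x N image (Eq. (7)-style indicator).
    return np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)

def psnr(a, b, k=8):
    # Peak signal-to-noise ratio with k bits per pixel (Eq. (8)-style indicator).
    peak = 2 ** k - 1
    return 10 * np.log10(peak ** 2 / mse(a, b))

def average_gradient(img):
    # Mean per-pixel gradient magnitude (Eq. (9)-style indicator).
    gy, gx = np.gradient(img.astype(np.float64))
    return np.mean(np.sqrt((gx ** 2 + gy ** 2) / 2))

def spatial_frequency(img):
    # Horizontal and vertical frequencies combined (Eqs. (12)-(14)-style indicator).
    img = img.astype(np.float64)
    rf = np.sqrt(np.mean(np.diff(img, axis=1) ** 2))
    cf = np.sqrt(np.mean(np.diff(img, axis=0) ** 2))
    return np.sqrt(rf ** 2 + cf ** 2)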
3.2 Front- and Back-end Network Construction
Based on the fast style-rendering model constructed in the previous section, this
section combines Python algorithms with a Python Web framework to build the server-side
back end of the system, allowing users to access the style rendering system via the
web, upload their own images, and complete real-time rendering of the images in the
selected style [18]. The system's server front end uses the Bootstrap development framework to improve
adaptability to different browsers. The system server back end is divided into three
main parts. The first part is the Uniform Resource Locator (URL) module, which receives
URL requests from the front end and feeds them into the target function for execution.
The second part is the logic processing module, which mainly performs image processing,
including functions such as transcoding and biasing [19]. The third part is the fast style-rendering algorithm module, which completes the
style conversion of the image so that the input image can be rendered into the target
style according to instructions and can be presented to the user smoothly, as shown
in Fig. 3.
Fig. 3 is a flow chart of the style rendering system. The URL module makes the whole system
more stable, and all instructions from the front end need to be filtered by the URL
module first. To add a new function, one only needs to write the function and forward it through the URL module, which greatly reduces development effort and lowers the threshold for using the algorithm. A flow chart of the entire rendering request is shown in Fig. 4.
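The paper does not name the specific Python web framework used; as a minimal sketch of the URL module forwarding a request to the logic processing module, Flask is used here purely as a stand-in, and the route names and field names are illustrative assumptions rather than the system's actual API:

import base64
import uuid
from flask import Flask, request, jsonify

app = Flask(__name__)
TASKS = {}  # in-memory task table; a stand-in for the process scheduling pool

# URL module: each front-end request is filtered through a registered URL rule
# and forwarded to the target function.
@app.route("/render", methods=["POST"])
def render_request():
    encoded = request.form.get("image", "")
    try:
        # Logic processing module: transcode the image content sent by the client.
        image_bytes = base64.b64decode(encoded, validate=True)
    except Exception:
        return jsonify({"status": "error", "message": "image could not be decoded"}), 400
    task_id = str(uuid.uuid4())
    TASKS[task_id] = {"image": image_bytes, "style": request.form.get("style"), "done": False}
    # In the full system the task would be handed to the fast style-rendering module here.
    return jsonify({"status": "accepted", "task": task_id})

@app.route("/progress/<task_id>")
def progress(task_id):
    # The front end polls this route until rendering is complete.
    task = TASKS.get(task_id)
    if task is None:
        return jsonify({"status": "error"}), 404
    return jsonify({"status": "done" if task["done"] else "rendering"})

if __name__ == "__main__":
    app.run()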
As seen in Fig. 4, since the format of the image content transmitted in the network is relatively special,
it is generally necessary to first encode and decode the image content sent by the
client. If it can be successfully decoded, the logic processing module will pass the
image to the fast style-rendering model to execute the rendering algorithm until completion,
and will then present the rendered image to the user. Multiple computations during
the course of operations are executed concurrently, and they potentially interact.
In addition, there are quite a few operating paths that the system can take, and results
may be uncertain. Therefore, after receiving the rendering request in the background,
the system hands over the request to the process scheduling function, which arranges
the rendering tasks according to the situation in the scheduling pool. The process
scheduling pool is shown in Fig. 5.
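As a minimal sketch of such a process scheduling pool, assuming Python's standard multiprocessing module; the rendering function here is only a placeholder for the fast style-rendering model:

from multiprocessing import Pool

def render_task(args):
    # Placeholder for the fast style-rendering model; returns a fake result.
    image_id, style = args
    return image_id, f"{style}-rendered"

if __name__ == "__main__":
    requests = [(1, "oil"), (2, "ink"), (3, "watercolour")]
    # The scheduling pool assigns rendering tasks to worker processes
    # according to how many workers are currently free.
    with Pool(processes=2) as pool:
        for image_id, result in pool.imap_unordered(render_task, requests):
            print(f"image {image_id}: {result}")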
The front-end page of the rendering system adopts the Bootstrap framework, which is not only simple and efficient but also performs very well. Bootstrap lowers the threshold for user access from mobile devices by rewriting most of the HTML controls
[20]. The main functions of the front end are as follows. The user uploads photos first,
then selects the style they want rendered. The system performs rendering operations
with the selected photos and styles, and displays the finished image. If the image
is uploaded successfully, the back end signals the front end that the image has been received, and it issues an instruction to render the image. After the
front end receives the instruction, it will continuously ask the back end whether
the rendering operation is complete. Upon completion, it retrieves the rendered image
and displays it. The entire front-end page-rendering process is shown in Fig. 6.
As seen in Fig. 6, the entire front-end operation is divided into four steps. First, a picture is uploaded and rendering is requested; the request then waits for background processing. When the back end responds, rendering progress is queried according to the returned instructions. Finally, the rendered picture, produced with a reference picture from the style picture library, is returned to complete the front-end rendering operation.
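The browser side of this flow is implemented in the Bootstrap front end; for illustration only, the same upload-and-poll protocol can be sketched from the client's perspective in Python with the requests library (the URL paths and field names match the illustrative back-end sketch above and are assumptions, not the system's actual interface):

import base64
import time
import requests

BASE = "http://localhost:5000"  # assumed address of the rendering server

with open("photo.jpg", "rb") as f:
    payload = {"image": base64.b64encode(f.read()).decode("ascii"), "style": "oil"}

# Steps 1-2: upload the picture and request rendering.
task = requests.post(f"{BASE}/render", data=payload).json()["task"]

# Step 3: keep asking the back end whether rendering is complete.
while requests.get(f"{BASE}/progress/{task}").json()["status"] != "done":
    time.sleep(1)

# Step 4: retrieve the rendered image for display (result endpoint assumed).
rendered = requests.get(f"{BASE}/result/{task}").content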
Fig. 3. Flow chart of the style rendering system model.
Fig. 4. Flowchart of the style rendering request module.
Fig. 5. Flow chart of the process pool operation.
Fig. 6. Flow chart of front-end rendering operations.