To evaluate image aesthetics in advertising design, an aesthetic evaluation model
based on CNN and data aggregation is proposed. First, a dual-path CNN is constructed:
the first sub-network extracts image features, and the second is a multi-scale
information sub-network. Visual Geometry Group and ResNet-50 are selected as the
feature extraction networks, and several small convolutional blocks together with
ResNet are used as the multi-scale information fusion network. After the aesthetic
evaluation model was established, the research found that the model's accuracy during
training was not ideal, so a new training optimization method based on data aggregation
was established to further improve the performance of the aesthetic evaluation model.
3.1. CNN-based Image Aesthetic Assessment for Advertising Design
Aesthetic evaluation of graphic advertising design images can further improve the
aesthetic quality of advertising design. Therefore, this paper proposes an aesthetic
evaluation model for advertising design images based on CNN and data aggregation.
A dual-path image aesthetic evaluation network based on convolutional networks is
proposed, with Visual Geometry Group and ResNet-50 selected as the feature extraction
networks, and a multi-scale convolution layer together with ResNet as the multi-scale
information fusion network. The aesthetics of advertising design are evaluated through
these two networks. After the model is established, a model training optimization
method based on data aggregation is proposed to further determine the model's
optimization direction and improve the aesthetic evaluation accuracy. CNN is a deep
learning model particularly suited to image data: it extracts features through multiple
convolution and pooling operations and performs classification or regression tasks
through fully connected layers [19, 20]. Therefore, the study proposes a CNN-based
image aesthetic assessment (IAA) method for advertising design (AD), as shown in Fig. 1.
Fig. 1. Aesthetic evaluation method of advertising design image based on CNN.
The CNN-based IAA method for AD employs two sub-network paths that extract image
features and model them for aesthetic assessment of AD images. The first sub-network
is the region-of-interest sub-network. To extract the features of local regions more
flexibly, the study determines the degree of interest of a local region from its
information density, since a region with higher information density tends to be more
visually appealing. A network is designed to extract and model multiple local regions
with high information density. This network can be trained end-to-end without manual
labeling, thus avoiding the interference of subjective noise.
The second path sub-network is a multi-scale information sub-network designed to provide
rich and diverse global descriptive features to further enhance the performance of
the model. The study combines shallow and deep features through a multi-layer information
fusion network structure to support the decision-making process. Finally, the decision
results from the two networks are fused to produce a final decision judgment. This
method can effectively assess the aesthetic quality of AD images with high accuracy
and reliability. The prediction function for AD image quality assessment in the region
of interest sub-network is shown in Eq. (1).
In Eq. (1), the prediction function is $\phi$, and the output conditional probability distribution
is $P(\hat{y}^{(i)} | Z^{(i)})$. The prediction result is $\hat{y}^{(i)}$, the output
region of interest variable feature is $Z^{(i)}$, and the deep learning feature vector
of the image object region is $F^{(i)}$. The region of interest subnetwork structure
is shown in Fig. 2.
Fig. 2. Region of interest subnetwork structure.
In the design of the region-of-interest sub-network, the study chooses Visual Geometry
Group (VGG) and Residual Network-50 (ResNet-50), which perform excellently in image
recognition, as the feature extraction networks. The VGG network captures fine image
features through its deep convolutional layers, while ResNet-50 solves the vanishing
gradient problem in deep network training through residual connections, allowing
the network to go deeper and extract richer features. The local features and global
features of images are extracted through these
two networks, respectively. Meanwhile, the study employs a multilayer perceptron
as the predictive classifier to achieve high-precision evaluation of image aesthetics.
The center point of each image object is calculated as shown in Eq. (2).
In Eq. (2), the centroids of the object regions are $\{c_j\}$, the centroids produced
by object detection are $\{c_{BBi}\}$, and the centroids of the connected subgraphs
computed from the binarized saliency map are $\{c_{SALi}\}$; $n_1 + n_2$ is the total
number of region points, where $n_1$ and $n_2$ are the numbers of region points input
by the two convolutional networks, respectively. The final set of subject regions
obtained after the introduction of the anchor point mechanism is shown in Eq. (3).
In Eq. (3), the set of subject regions is $s\{c_j\}$, the number of anchor boundaries
is $k$, and the number of regions in the subject region set is $k \times (n_1 + n_2)$.
The cost function for semantic evaluation is shown in Eq. (4).
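As a minimal illustration of the centroid pooling of Eq. (2) and the anchor expansion of Eq. (3), the steps can be sketched as follows; the function names, sample coordinates, anchor sizes, and the (x, y, w, h) box format are illustrative assumptions, not the paper's implementation:

```python
# Hedged sketch: centroids from the detection branch (c_BB) and the
# saliency branch (c_SAL) are pooled, and each centroid is expanded into
# k candidate regions using a set of anchor sizes, giving k*(n1+n2) boxes.

def merge_centroids(c_bb, c_sal):
    """Pool the n1 detection centroids and n2 saliency centroids."""
    return list(c_bb) + list(c_sal)          # n1 + n2 centroids

def anchor_regions(centroids, anchor_sizes):
    """Expand each centroid into one box per anchor size."""
    regions = []
    for (cx, cy) in centroids:
        for (w, h) in anchor_sizes:          # k anchor sizes
            regions.append((cx - w / 2, cy - h / 2, w, h))  # (x, y, w, h)
    return regions

centroids = merge_centroids([(40, 60)], [(120, 80), (200, 150)])  # n1=1, n2=2
anchors = [(32, 32), (64, 64), (64, 32)]                          # k = 3
boxes = anchor_regions(centroids, anchors)                        # 9 boxes
```

Each of the resulting boxes would then be scored by the cost function of Eq. (4).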
In Eq. (4), the number of regions contained in the subject region set is $M$, one of the regions
is $p_i$, and the corresponding height and width of the region are $h_i$ and $\omega_i$.
The sum of the saliency of all pixel points in the region is $S(p_i)$, the evaluation
function is $H$, and the weight coefficients of the three terms are $\alpha$,
$\beta$, and $\gamma$, respectively. The feature expression of the original image
extracted by the first four convolution layers is shown in Eq. (5).
In Eq. (5), the feature extraction function is $Ext$, the original image input is $I^{(i)}$,
and the corresponding deep convolutional features of the region are $P^{(i)}$. The
transformation between a given set of feature vectors and the region-of-interest
features is shown in Eq. (6).
Fig. 3. Multi-scale Information subnetwork structure.
In Eq. (6), the given set of feature vectors is $F$, the feature vector weights are $W$, the
region of interest features are $Z$, the softmax function is $\sigma$, the activation
function is $\psi$, and the multiplication operation is $\otimes$. The structure of
the multi-scale information sub-network is shown in Fig. 3.
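One plausible reading of Eq. (6) can be sketched in plain Python: softmax-normalized scores $\sigma(WF)$ gate activated feature vectors $\psi(F)$, which are then aggregated into the region-of-interest feature $Z$. The scoring weights and the choice of ReLU as the activation are assumptions for illustration only:

```python
import math

# Hedged sketch of Eq. (6): attention-style weighting over a set of
# region feature vectors F, gated by an activation psi (ReLU here).

def softmax(xs):
    m = max(xs)                               # shift for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def relu(x):                                  # stand-in for psi
    return max(0.0, x)

def roi_feature(F, w):
    """F: list of region feature vectors; w: scoring weights (one per dim)."""
    scores = [sum(wi * fi for wi, fi in zip(w, f)) for f in F]
    alpha = softmax(scores)                   # sigma(W F)
    dim = len(F[0])
    # weighted aggregation: Z = sum_i alpha_i * psi(F_i), element-wise
    return [sum(a * relu(f[d]) for a, f in zip(alpha, F)) for d in range(dim)]

F = [[1.0, 0.0], [0.0, 1.0]]                  # two toy region vectors
Z = roi_feature(F, w=[1.0, 1.0])
```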
In the multiscale information subnetwork structure, different scale features are adjusted
by small convolutional layers, and the transverse connection output is shown in Eq.
(7).
In Eq. (7), the output features of the small convolutional blocks are $f_{b2}$, $f_{b3}$, and
$f_{b4}$, with feature weights $W_{b2}$, $W_{b3}$, and $W_{b4}$, respectively.
The laterally connected outputs are $f'_{b2}$, $f'_{b3}$, and $f'_{b4}$, with feature
weights $W'_{b2}$, $W'_{b3}$, and $W'_{b4}$, respectively. The shallow features
are generated as shown in Eq. (8).
In Eq. (8), the shallow feature is $f_{shallow}$. The deep feature output from the last layer
of the network is shown in Eq. (9).
In Eq. (9), the output of the last layer of the network is $f_{b5}$. During feature encoding,
the study uses two stacked $3 \times 3$ convolutional layers to extract the middle-layer
features of the image, and controls the output feature size and number of channels
by adjusting the convolutional stride and adding $1 \times 1$ convolutional layers,
realizing shallow-feature noise reduction and image downsampling. In the feature
fusion stage, the three low-level features are merged into shallow features through
global average pooling of the feature layers, dimensional compression, and concatenation,
while the original ResNet model is used to output the deep features.
The overall loss of model training is shown in Eq. (10).
In Eq. (10), the overall loss is $L_M$, the shallow and deep feature losses are $L_s$ and $L_d$,
respectively, and the weights of the two parts of the loss are $\lambda_s$ and $\lambda_d$,
respectively.
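The fusion and weighting just described can be sketched minimally as follows; the channel layouts, toy values, and the loss weights $\lambda_s = \lambda_d = 0.5$ are illustrative assumptions (the paper gives no numeric values), and plain nested lists stand in for tensors:

```python
# Hedged sketch of Eqs. (8)-(10): three lateral feature maps are reduced
# by global average pooling, concatenated into the shallow feature, and
# the shallow and deep losses are combined with weights lambda_s, lambda_d.

def global_avg_pool(fmap):
    """fmap: list of channels, each a 2-D list -> one value per channel."""
    pooled = []
    for ch in fmap:
        vals = [v for row in ch for v in row]
        pooled.append(sum(vals) / len(vals))
    return pooled

def shallow_feature(f_b2, f_b3, f_b4):
    """Concatenate the pooled lateral outputs into f_shallow (Eq. (8))."""
    return global_avg_pool(f_b2) + global_avg_pool(f_b3) + global_avg_pool(f_b4)

def total_loss(L_s, L_d, lam_s=0.5, lam_d=0.5):
    """Eq. (10): L_M = lambda_s * L_s + lambda_d * L_d."""
    return lam_s * L_s + lam_d * L_d

f_b2 = [[[1.0, 3.0], [5.0, 7.0]]]            # 1 channel, 2x2 map
f_b3 = [[[2.0, 2.0], [2.0, 2.0]]]
f_b4 = [[[0.0, 4.0], [4.0, 0.0]]]
f_shallow = shallow_feature(f_b2, f_b3, f_b4)
```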
3.2. Training Strategies Based on Data Aggregation
After the CNN-based aesthetic evaluation model is established, a training strategy
based on data aggregation (DA) is proposed. Through this training strategy, the model
performance is further optimized and the aesthetic evaluation accuracy is improved.
To better mine the sparsely distributed samples in the dataset, the study uses feature
similarity as the classification basis and proposes a DA-based dataset partitioning
method. Subsequently, a training method that combines sparsely distributed data with
compactly distributed samples is devised to enhance the model's generalization ability
and give the model an optimization direction. The DA-based training strategy is
shown in Fig. 4.
Fig. 4. Training strategy based on data aggregation.
In the DA-based training strategy, a CNN model is first trained using advertisement
image aesthetic data to generate high-level deep features. This approach captures
the semantic and abstract information in the images and is therefore more suitable
for assessing the semantic similarity of the images. Next, a density-clustering-based
segmentation method is used to partition the dataset by semantic similarity. This
method calculates the local densities and distances of the samples in a high-dimensional
space to achieve an aggregated representation of the dataset and thus a better
understanding of the semantic associations between the samples. On this basis, the
study further adopts the Compact-to-Sparse training strategy to divide the learning
process into a start-up phase and a retraining phase. In the start-up phase, an initialization
model is first trained using the entire dataset, and then the model is used to extract
features and divide the dataset into three subsets. Subsequently, learning starts
from the compact subset to obtain the start-up model. Upon entering the retraining
phase, the features are re-extracted and the three subsets are re-divided. In this
phase, the sparse subsets are added to the learning to achieve further improvement of the
model performance. Such a training approach makes it possible to use DA to enhance
the model's capability and to increase its accuracy and reliability when handling
tasks related to advertising aesthetics. The Euclidean distance between features
in the dataset division is calculated as shown in Eq. (11).
In Eq. (11), the Euclidean distance between features $f(P_i)$ and $f(P_j)$ is $D_{ij}$. The local
density of each image is calculated as shown in Eq. (12).
In Eq. (12), the local density of the image is $\rho_i$, $j$ indexes the other images,
and the cutoff distance constant is $d_c$. The density indicator
function $X(d)$ is shown in Eq. (13).
In Eq. (13), the density function takes the value of 1 when the unit distance $d$ is less than
1, and in other cases it takes the value of 0. The distance $\theta_i$ for each image
is defined as shown in Eq. (14).
Fig. 5. Compact to sparse training strategy.
In Eq. (14), the clustering centers are selected based on the distance between images and the
local density; the selection principle is to find the images that have both high
local density and large distance from other high-density images. Such images make
good clustering centers because they are points of high local density surrounded
by points of lower local density. Meanwhile, points with large distances but low
local densities can be considered anomalies. The Compact-to-Sparse training strategy
is shown in Fig. 5.
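The density-peak quantities of Eqs. (11)-(14) can be sketched directly: pairwise Euclidean distances $D_{ij}$, local density $\rho_i$ counted with the cutoff indicator $X(d)$, and $\theta_i$, the distance to the nearest higher-density sample. The cutoff value $d_c$, the toy features, and the handling of the maximum-density sample are illustrative assumptions:

```python
import math

# Hedged sketch: cluster centers are the samples where both rho_i and
# theta_i are large; high-theta, low-rho samples are potential anomalies.

def euclid(a, b):                                    # Eq. (11)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def local_density(feats, d_c):                       # Eqs. (12)-(13)
    """rho_i counts neighbors whose scaled distance D_ij / d_c is below 1."""
    rho = []
    for i, fi in enumerate(feats):
        rho.append(sum(1 for j, fj in enumerate(feats)
                       if j != i and euclid(fi, fj) / d_c < 1))
    return rho

def theta(feats, rho):                               # Eq. (14)
    """Distance to the nearest sample of strictly higher local density."""
    out = []
    for i, fi in enumerate(feats):
        higher = [euclid(fi, fj) for j, fj in enumerate(feats) if rho[j] > rho[i]]
        # samples with no denser neighbor take the maximum distance instead
        out.append(min(higher) if higher else max(euclid(fi, fj) for fj in feats))
    return out

feats = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]  # last point is isolated
rho = local_density(feats, d_c=1.0)
th = theta(feats, rho)
```

The isolated point gets zero density but a large $\theta$, matching the anomaly description above.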
The goal of the training process for aesthetic assessment of advertising images is
to discover potential image features and aesthetic rules. Due to the huge parameter
search space of the CNN model, the initial learning direction is crucial for the model
to converge to a better local optimum. The study designed a Compact-to-Sparse training
strategy to learn the segmented dataset in stages. The aggregation of samples in the
dataset is partitioned using a density clustering-based algorithm before training
begins. A CNN model is first trained on the full training data, then high-level features
are extracted and the dataset is divided. In the startup phase, a compactly distributed
subset from the delineated dataset is taken and a new model is trained to learn the
regular aesthetic features and rules. After the model has converged, the degree of
aggregation of the samples in the training set is re-evaluated and re-partitioned
using the new model, and the model is fine-tuned on this basis. The study also added
sparsely distributed images to the dataset but gave them a lower weight, allowing
the model to learn more unusual and complex aesthetic rules. Compared to standard
neural network training, the Compact-to-Sparse
training strategy learns relatively simple but effective decision boundaries. The
overall loss function during data training is shown in Eq. (15).
In Eq. (15), the overall loss is $L$, and the losses for each stage of data are $L_0$, $L_1$,
and $L_2$, whose corresponding weights are $\omega_0$, $\omega_1$, and $\omega_2$,
respectively.
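The staged objective of Eq. (15) and the two-phase schedule can be sketched as follows; the stage weights $\omega_0, \omega_1, \omega_2$, the subset names, and the `train_step` callback are illustrative placeholders, since the paper does not specify their values:

```python
# Hedged sketch of the Compact-to-Sparse schedule: the start-up phase
# trains on the compact subset only; the retraining phase adds the
# remaining subsets, combining their losses with per-stage weights so
# the sparse subset refines rather than dominates training.

def staged_loss(L0, L1, L2, w0=1.0, w1=0.7, w2=0.3):
    """Eq. (15): L = w0*L0 + w1*L1 + w2*L2 (weights are illustrative)."""
    return w0 * L0 + w1 * L1 + w2 * L2

def compact_to_sparse(subsets, train_step):
    """subsets: dict of the three aggregation subsets;
    train_step: callback returning the loss on one subset."""
    losses = {"startup": train_step(subsets["compact"])}
    L0 = train_step(subsets["compact"])       # retraining phase losses
    L1 = train_step(subsets["intermediate"])
    L2 = train_step(subsets["sparse"])
    losses["retrain"] = staged_loss(L0, L1, L2)
    return losses

# toy run with subset size standing in for the per-subset loss
demo = compact_to_sparse(
    {"compact": [0, 1], "intermediate": [0], "sparse": [0]}, len)
```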