Xianfeng Zeng
(College of Art and Creativity, Anhui University of Applied Technology, Hefei, 230011, China)
Copyright © The Institute of Electronics and Information Engineers(IEIE)
Keywords
Text detection, Text recognition, Semantic segmentation, Convolutional neural network, Sequence encoding
1. Introduction
Poster text has proliferated alongside online e-commerce and internet applications. The
sheer volume of poster copy now being released strains the review process and the orderly
development of the market, and manual review cannot keep pace with it. A system that
reviews poster text automatically can therefore reduce review pressure, and the precision
of text detection in images is a key research direction for completing this audit task.
Scholars have explored different methods for scene text detection and recognition, including
differentiable binarization combined with adaptive scale fusion [1] and feature pyramid networks [2]. Transformer-based models have also been used [3]. Commercial products exist as well: Tencent Cloud’s intelligent content security audit solution,
for example, can quickly establish an intelligent content security audit platform covering
multimedia scenes. However, existing research often relies on a single deep learning
technique for text detection and recognition, which makes it difficult to achieve the
expected performance on complex tasks. To this end, this study combines semantic
segmentation (SS), a convolutional neural network (CNN), an attention mechanism (AM),
and a multi-scale sequence encoding algorithm. The aim is to construct a multi-scale
sequence encoding model with attention that handles complex situations in text detection
and recognition more comprehensively, addressing the limitations of existing research.
The model can also provide a technical reference for developing poster copywriting and
for recognizing and reviewing printed text information.
The main contributions of this research are as follows: (1) A multi-scale sequence
encoding model that combines multiple technologies and algorithms is proposed, providing
new technical ideas and methods for text detection and recognition tasks. (2) The character
detection, text recognition, and keyword extraction processes in the optical character
recognition system are optimized, reducing audit costs and improving work efficiency.
(3) The work provides a reference for research on the intelligent review of poster
text information, which helps to promote the development of this field. (4) Convolutional
SS networks are used to detect text images, and SS channels are established to fuse
semantic information, which improves the accuracy of image segmentation.
This study is organized in four parts. First, current detection and recognition models
and systems are reviewed. Second, a text detection model based on SS and CNN is constructed,
and its performance is compared on a dataset. Third, a multi-scale sequence encoding
recognition algorithm that integrates AM is presented and combined with Optical Character
Recognition (OCR) to test recognition performance across text scenes and types. Finally,
the entire study is summarized.
2. Related Work
The extraction of poster text information is an important product of the development
of information technology. Accurately locating, detecting, and recognizing text is
currently a research hotspot in text review systems, and scholars have studied it
extensively. Phan et al. combined edge detection algorithms with a CNN to build a
classification model for Vietnamese character recognition, improving the effectiveness
of the model [4]. Liu et al. proposed a CNN-based model for visa and passport recognition that extracts
passport image information with high detection and recognition rates [5]. Liu et al. proposed an adaptive Bezier-curve network for end-to-end text spotting,
improving recognition accuracy [6]. Hu et al. combined retrieval methods to construct an adaptive language
model for handwritten text recognition, improving recognition performance
[7]. Ghazal proposed a handwritten document recognition system that combines CNN
training with image processing and character segmentation, and verified the high accuracy
of the system [8]. Ma et al. applied a CNN with a multi-channel, multi-scale structure to text localization
in character recognition and demonstrated a high recognition rate [9]. Oluwasammi et al. reviewed deep learning image segmentation for obtaining the semantic
information of text features, identifying excellent semantic image segmentation
methods [10]. In these studies, different models and systems were trained for text detection and
recognition, and each achieved strong performance indicators for its own application
object.
Network models and systems have also been built for detection and recognition in other
fields, providing technical support for practical applications in those fields. Wang
et al. improved the pyramid vision transformer to raise transformer performance in
vision tasks [11]. Diwan et al. examined You Only Look Once (YOLO) and its architectural successors
for improving object detectors and detection accuracy
[12]. Karthika et al. combined CNN and YOLO to detect traffic signs in road
scenes, improving the system’s detection accuracy [13]. Biswas et al. combined deep learning methods with object recognition
to extract information and obtain accurate text image recognition [14]. Jia et al. used AM and multimodal named entity recognition to improve the performance
of visual grounding models for information extraction [15]. Guo et al. used bidirectional transformers and machine learning classifiers
to improve spam detection [16]. Wu et al. proposed a feature pyramid aggregation network that fuses features
at different levels for SS, obtaining high accuracy [17]. Gao et al. examined 3D SS and deep learning technology for autonomous
driving, analyzing datasets and exploring future research directions [18]. Zhao et al. proposed a dilated CNN for the SS of sonar images in ocean exploration,
improving model accuracy [19].
In summary, previous scholars have established many models and systems for extracting
text information and achieved good results in specific application scenarios.
However, advertising text recognition still lacks extensive data, and most existing
poster text extraction methods rely heavily on training data, requiring
a large amount of annotated data to train the model and achieve good generalization.
Accuracy in complex real-world scenarios also needs to be improved. Therefore,
a multi-scale sequence encoding recognition model that combines SS with an attention-based
CNN has important practical value, because it can help improve the text
recognition rate in complex application scenarios.
3. Optimization of Algorithm Technology in OCR System
The recognition and detection of poster text information includes three parts: character
detection, text recognition, and text keyword extraction. The text detection part
uses a convolutional SS network to detect text in images; when its segmentation information
is unclear, the recognition of text image information suffers. A wide range of text
detection and recognition techniques is applied, including image preprocessing, CNNs,
and text classification and recognition. These techniques and algorithms are optimized
in combination with OCR.
3.1. Character Detection Network and Model Construction
The character detection module, an important part of the OCR system, mainly locates
text in the input images. First, a convolution-based deep learning method is used
to locate image text. Second, the image text information is processed, and the image
is passed to the network in tensor form. Finally, the segmented image is obtained.
The study adopts a character detection model based on SS so that the network focuses only
on the differences between text and background, which reduces the computational complexity
of the model and the cost of network training, as shown in Fig. 1.
Fig. 1. Character detection network based on semantic segmentation.
Fig. 2. Attention dilated (atrous) convolution module.
As Fig. 1 shows, the network model is a multi-scale segmentation CNN with attention, which
applies multi-scale AM to learn the image segmentation task. The structure includes a
feature extraction stage, a segmentation stage, and a semantic feature fusion stage. The
model removes the fully connected layer of the residual network so that its feature maps
can feed the subsequent segmentation stages. To reduce computational complexity, the
network input size is fixed, and images are cropped so that the text remains complete.
Because both the input and the output of the segmentation network are images, a multi-level
feature fusion pyramid structure is adopted at the output to fuse high-level and low-level
semantic features. The output features are then convolved, superimposed, and reduced in
dimension to produce two channels: one is the center segmentation map of the text area,
and the other is the boundary segmentation map. The two maps are fused and binarized to
obtain the final segmentation feature map, which is then processed to obtain the text
coordinate box. To ensure that the network retains rich global information while still
reflecting rich local features, the feature extraction section is improved as shown in
Fig. 2.
As Fig. 2 shows, the dilated (atrous) convolution module with AM is embedded in the third
and fourth layers of the feature extraction stage. In the third residual block, dilated
convolution replaces the original convolution, and the fourth residual block includes a
mixed dilated convolution module with attention. AM is added to each dilated convolution
branch to assign learned weights, enabling the model to actively learn and filter the
important channel feature maps and thereby enhancing the network’s detection performance
at multiple scales. In the segmentation stage, a multi-scale network structure is adopted
to construct a segmentation channel composed of upsampling and convolution. The channel
includes three convolutional layers that receive feature maps of the same size. After
feature fusion, the SS feature fusion stage outputs rich and complete segmentation images.
Finally, the loss value is calculated to complete the image segmentation task. In
segmentation networks, using Intersection over Union (IoU) as the metric can leave
differences between the predicted foreground segmentation map and the actual label. To
address this shortcoming, a traditional edge detection operator is used to add an edge
penalty, making the predicted image more consistent with the labeled image. Eq. (1) is
the loss function used.
In Eq. (1), $f$ represents the input image, $K_{laplace}$ is the Laplace operator, and $\delta$
is the ReLU activation function. $\Delta Y$ and $\Delta P$ represent the predicted segmentation
map and the label segmentation map, respectively. Cross entropy is computed between the
edge gradient of the label image and that of the predicted segmentation image. The
segmentation boundary is thereby optimized, which prevents adjacent segmented regions from
sticking together at their boundaries.
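The edge penalty described above can be illustrated with a minimal sketch. It is not the paper's Eq. (1): the 3x3 kernel, the zero padding, the clipping of edge responses to 1, and the function names `laplace_edges` and `edge_loss` are all assumptions made for illustration. The sketch convolves a segmentation map with a Laplace kernel, applies ReLU, and takes the cross entropy between the predicted and label edge maps.

```python
import math

# One common 3x3 discrete Laplace kernel (assumed; the source does not state its kernel).
K_LAPLACE = [[0, 1, 0],
             [1, -4, 1],
             [0, 1, 0]]


def laplace_edges(img):
    """Convolve img (list of rows, values in [0, 1]) with K_LAPLACE using zero padding,
    apply ReLU, and clip to 1 so each response can be treated as a probability."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += K_LAPLACE[dy + 1][dx + 1] * img[yy][xx]
            out[y][x] = min(1.0, max(0.0, acc))  # ReLU, then clip (clip is an assumption)
    return out


def edge_loss(pred, label, eps=1e-7):
    """Mean cross entropy between the edge map of the prediction and that of the label."""
    edges_pred, edges_label = laplace_edges(pred), laplace_edges(label)
    total, count = 0.0, 0
    for row_p, row_t in zip(edges_pred, edges_label):
        for p, t in zip(row_p, row_t):
            p = min(1.0 - eps, max(eps, p))  # keep log() finite
            total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
            count += 1
    return total / count
```

With this kernel, the positive responses lie on background pixels that touch the text region, so the loss grows when the predicted outline differs from the labeled outline.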
3.2. Text Detection Algorithms and Evaluation Indicators
Another approach is to use the Binary Cross Entropy (BCE) loss and the Dice loss
to separate foreground from background at the single-pixel level and the class level,
respectively. However, because neither captures the relationship between adjacent
pixels, Structural Similarity (SSIM) is introduced to address the shortcoming of
IoU, with Eq. (2) as its loss function.
In Eq. (2), $\alpha_x$ and $\alpha_y$ represent the mean and standard deviation of the labels,
respectively, and $\beta_x$ and $\beta_y$ are the mean and standard deviation of the prediction,
respectively. To avoid a zero denominator when the mean and standard deviation are zero,
two small constants, $C_1$ and $C_2$, are added. In addition, the BCE loss function in Eq. (3) is used.
In Eq. (3), $\Delta Y$ and $\Delta P$ represent the predicted segmentation map and the label segmentation
map, respectively. Finally, the image segmentation losses and the image edge loss are combined
to calculate the final loss function in Eq. (4).
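As an illustration of the two losses just described, the following sketch uses the standard single-window SSIM formula and the standard BCE definition. It is not a reproduction of Eqs. (2) and (3), which are not available here. The default constants `c1` and `c2` are the commonly used values for data in [0, 1] and are assumptions, as are the function names.

```python
import math


def ssim(x, y, c1=1e-4, c2=9e-4):
    """Standard SSIM of two equally sized patches given as flat lists of floats in [0, 1].
    c1 and c2 are small constants that keep the denominators away from zero."""
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


def ssim_loss(pred, label):
    """SSIM turned into a loss: 0 for identical patches, larger for dissimilar ones."""
    return 1.0 - ssim(pred, label)


def bce_loss(pred, label, eps=1e-7):
    """Mean binary cross entropy between predicted probabilities and 0/1 labels."""
    total = 0.0
    for p, t in zip(pred, label):
        p = min(1.0 - eps, max(eps, p))  # keep log() finite
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(pred)
```

Because SSIM compares means, variances, and covariance over a patch, it responds to the structure among neighboring pixels, which the per-pixel BCE term cannot see.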
In Eq. (4), $L_{center}$ represents the text center loss function, which is the Dice loss.
$A$, $B$, and $O$ are hyperparameters set to 0.7, 0.2, and 0.1, which indicate
the importance of each loss term to the network. To reduce overfitting, a smoothed Dice
loss is often used, as in Eq. (5).
In Eq. (5), $P_{center}(i)$ and $T_{center}(i)$ are the predicted value and the label value
of the $i$-th pixel of the text center segmentation, respectively.
The smoothing avoids an undefined result when both the label and the segmentation map
are all zeros. Combining multiple loss functions accelerates the convergence of
network training, and because different functions are computed on different channels,
the combination alleviates the flattening of gradients in the later stage of training,
thereby promoting network learning. Long Short-Term Memory (LSTM) is a type of Recurrent
Neural Network (RNN), and Fig. 3 shows its structural unit.
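A minimal sketch of a smoothed Dice loss and of a weighted sum of loss terms follows. It uses one common Dice variant (sums of values in the denominator) and a smoothing constant of 1, both assumptions, since Eq. (5) is not reproduced here. The weights 0.7, 0.2, and 0.1 come from the text, but which weight belongs to which loss term is not stated, so `weighted_total` simply pairs them in the order given.

```python
def smooth_dice_loss(pred, target, smooth=1.0):
    """Smoothed Dice loss over flat lists of per-pixel values in [0, 1].
    The smooth term keeps the ratio defined when pred and target are both all zeros."""
    intersection = sum(p * t for p, t in zip(pred, target))
    denominator = sum(pred) + sum(target)
    return 1.0 - (2.0 * intersection + smooth) / (denominator + smooth)


def weighted_total(losses, weights=(0.7, 0.2, 0.1)):
    """Weighted sum of individual loss values; the pairing of weights and terms is assumed."""
    if len(losses) != len(weights):
        raise ValueError("losses and weights must have the same length")
    return sum(w * l for w, l in zip(weights, losses))
```

For example, `weighted_total([l_center, l_second, l_third])` would combine a center Dice loss with two further terms, where `l_second` and `l_third` are placeholder names for the remaining losses.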
As Fig. 3 illustrates, the plain RNN structure is prone to vanishing and exploding gradients.
LSTM therefore adds a forget gate mechanism to ensure that error information propagates
effectively during network training. Network training depends mainly on how the data
labels are produced. Text data is annotated with quadrangles, that is, four coordinate
points that represent the text area. The segmentation-based character detection method
shrinks the text box inward to form the center area of the text.
Binarizing the image to create segmentation labels reduces the error of
manual annotation. To keep neighboring text regions from sticking together after
segmentation, the label-making method in Eq. (6) is adopted.
In Eq. (6), $D(p_i, p_j)$ represents the distance between two vertices $p_i$ and $p_j$ of the label.
The two longer edges are shrunk first, and then the two shorter edges.
Each edge is shrunk by moving its two endpoints inward along the edge, and the longer pair
of opposite edges is determined by comparing the average lengths of the two pairs. In the
improved label production, a fixed ratio is used to shrink each side $(p_i, p_{(i \bmod 4)+1})$.
However, with a fixed ratio, a mix of long and short texts is prone to breakage; Eq. (7) addresses this issue.
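The shrinking procedure described for Eq. (6) can be sketched as follows. This is an illustrative reading, not the paper's exact rule: the reference length per vertex (the shorter of its two adjacent edges), the placeholder ratio of 0.3, zero-based vertex indices, and the function names are assumptions. The ratio adjustment of Eq. (7) is not modeled.

```python
import math


def _dist(a, b):
    """Euclidean distance D(a, b) between two vertices."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def _shrink_edge(quad, i, ref, ratio):
    """Move both endpoints of edge (p_i, p_{i+1}) toward each other along that edge."""
    j = (i + 1) % 4
    (x1, y1), (x2, y2) = quad[i], quad[j]
    length = _dist(quad[i], quad[j])
    if length == 0:
        return
    ux, uy = (x2 - x1) / length, (y2 - y1) / length  # unit vector from p_i to p_{i+1}
    quad[i] = (x1 + ratio * ref[i] * ux, y1 + ratio * ref[i] * uy)
    quad[j] = (x2 - ratio * ref[j] * ux, y2 - ratio * ref[j] * uy)


def shrink_quad(points, ratio=0.3):
    """Shrink a quadrilateral label (four vertices in order) to a text-center region."""
    quad = [(float(x), float(y)) for x, y in points]
    # Reference length of each vertex: the shorter of its two adjacent edges (assumed rule).
    ref = [min(_dist(quad[i], quad[(i + 1) % 4]), _dist(quad[i], quad[(i - 1) % 4]))
           for i in range(4)]
    # Compare the two pairs of opposite edges and shrink the longer pair first.
    pair_a = _dist(quad[0], quad[1]) + _dist(quad[2], quad[3])
    pair_b = _dist(quad[1], quad[2]) + _dist(quad[3], quad[0])
    order = (0, 2, 1, 3) if pair_a > pair_b else (1, 3, 0, 2)
    for i in order:
        _shrink_edge(quad, i, ref, ratio)
    return quad
```

For a 10 by 4 axis-aligned box, each vertex moves 1.2 units inward along both directions, which leaves a 7.6 by 1.6 center region.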
In Eq. (7), $R$ is the scaling ratio of the short side, and $l$ and $s$ represent the long and short
sides of the text. As the length-to-width ratio increases, the change in the long side decreases.
In network training, the two labeling methods are combined and the two kinds of semantic
information are fused to locate the segmented text area, which introduces new information and
improves the effectiveness of training. For the text detection module,
the detection indicators for characters and text are calculated, including Recall
(R), Precision (P), Accuracy (A), and F1 score. R is the proportion of actual positive
samples that the model correctly selects as positive, that is, the rate at which target
text is found, as in Eq. (8).
In Eq. (8), $TP$ is the number of correctly labeled positive samples, that is, text correctly
predicted as text. $FN$ is the number of positive samples mislabeled as negative. Text areas
are the positive samples, and background areas are the negative samples. $P$ is
the proportion of correctly labeled positive samples among all samples predicted as positive,
as in Eq. (9).
In Eq. (9), $FP$ is the number of samples mislabeled as positive, that is, background that the
model predicts as text. $A$ is the average detection accuracy over all test sets, as in Eq. (10).
In Eq. (10), $FN$ is the number of positive samples mislabeled as negative, that is, text
areas that the model predicts as background. The F1 score is the harmonic mean of $P$ and $R$,
as in Eq. (11).
In Eq. (11), $TP$ is the number of correctly labeled positive samples, and $FP$ is the number
of samples mislabeled as positive.
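The four indicators can be computed from the confusion counts as in the sketch below. It uses the standard textbook definitions; Eqs. (8) to (11) are not reproduced in this text, so the exact form the paper uses for accuracy is an assumption, as are the function name and the optional `tn` argument.

```python
def detection_metrics(tp, fp, fn, tn=0):
    """Recall, precision, accuracy, and F1 score from confusion counts.
    tp: text predicted as text, fp: background predicted as text,
    fn: text predicted as background, tn: background predicted as background."""
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of P and R
    else:
        f1 = 0.0
    return {"recall": recall, "precision": precision, "accuracy": accuracy, "f1": f1}
```

For instance, `detection_metrics(tp=8, fp=2, fn=2, tn=8)` gives 0.8 for all four indicators.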
Fig. 3. RNN cell unit diagram.
3.3. Character Recognition Network and Model Construction
The text recognition module is also an important part of the OCR system. It builds
on character detection and converts text images into electronic documents that can
be saved. Text recognition networks usually follow one of three approaches:
image correction based, AM based, and multi-directional encoding. Following existing
research on recognition technology, an attention-based multi-directional character
recognition network is used, as shown in Fig. 4.
In Fig. 4, feature encoding refers to feature extraction. A residual network is used as the
feature extraction network to extract the two-dimensional spatial features of the image. The
learned feature map is then updated with attention parameters and finally
summed element-wise with the features for sequence encoding. Based on the two-dimensional
spatial properties of images and the long-sequence features of text, the model trains a
sequence encoder with attention and a decoder for text recognition. In the recognition
task, the image features are encoded by a CNN, which outputs
a two-dimensional feature map. A two-dimensional attention structure is therefore used to
update the positional parameters of characters, as in Eq. (12).
In Eq. (12), $v_{ij}$ represents the feature vector formed at the same position in the feature
maps of all channels. $N_{ij}$ is the set of eight pixels adjacent to coordinate $(i, j)$.
$h'_i$ is the hidden state of the LSTM. $W_v$ and $W_h$ are learnable weights. Eq.
(13) gives the coordinate weight.
In Eq. (13), $\phi_{ij}$ is the weight of coordinate $(i, j)$. Eq. (14) represents the weight of local features.
In Eq. (14), $c_t$ is the local weight of coordinate $(i, j)$. It is also necessary to meet the
conditions in Eq. (15).
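The weighting step of two-dimensional attention can be sketched as follows. The sketch covers only the normalization of per-position scores and the weighted sum of feature vectors; the learned scoring with $W_v$ and $W_h$ is omitted, and the raw scores are taken as given. Using a softmax over all positions, so that the weights sum to one, is an assumption, as is the function name.

```python
import math


def attention_2d(features, scores):
    """features: H x W grid of feature vectors (lists of floats).
    scores: H x W grid of raw attention scores.
    Returns (weights, glimpse): softmax weights over all positions and the
    weighted sum of the feature vectors."""
    peak = max(s for row in scores for s in row)  # subtract the max for numerical stability
    exps = [[math.exp(s - peak) for s in row] for row in scores]
    norm = sum(sum(row) for row in exps)
    weights = [[e / norm for e in row] for row in exps]
    dim = len(features[0][0])
    glimpse = [0.0] * dim
    for i, row in enumerate(features):
        for j, vec in enumerate(row):
            for k in range(dim):
                glimpse[k] += weights[i][j] * vec[k]
    return weights, glimpse
```

At each decoding step, a position with a high score dominates the glimpse vector, which is how the decoder attends to the image region of the current character.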
In Eq. (15), $i$ and $j$ are the horizontal and vertical values of the coordinates, respectively.
$H$ and $W$ are natural constants. To better extract image features and allow comparison
with other methods, ResNet50 is used in the experiment, with the final
fully connected layer and pooling layer of the residual network removed. An LSTM structure
is used in the sequence feature extraction and decoding stages.
The added AM can select local two-dimensional information of the image during decoding
to improve recognition ability. When the attention module is applied in
feature extraction and decoding, attention is automatically aligned with the image
feature regions and fused with the convolutional feature map, which strengthens the feature
response and promotes feature selection for encoding.
Finally, the overall system is designed based on the above modules. Because text detection
and recognition require more computing resources, both run on the server. To reduce the
pressure on backend computing devices, text auditing runs on the client, so
the recognition results are transmitted to the client for keyword detection.
Fig. 5 shows the interaction between the front and back ends.
As Fig. 5 shows, the operating interfaces at the front and back ends of the system
connect the text detection and recognition processes and allow the better-performing
models to be selected. Fig. 6 shows the system sequence diagram.
In Fig. 6, the user selects and uploads images, and the front end sends a request to the server
for text detection. The detected image regions are passed to text recognition, and the
recognition results are returned to the client, where the keyword extraction module
displays the final result. The poster text used
in this study contains complex and rich text information, which is processed
using deep learning technology to detect Chinese and English content. The text
recognition model can recognize text images of variable length in both Chinese
and English. Finally, a recognition function for some special symbols is added to
this module.
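The client-side keyword step can be illustrated with a small sketch. The paper does not describe how keywords are matched, so the case-insensitive substring match, the watch list, and the function name `find_keywords` are hypothetical.

```python
def find_keywords(recognized_lines, watchlist):
    """Return a dict mapping each watch-list keyword found in the recognized text
    to the indices of the lines that contain it (case-insensitive substring match)."""
    hits = {}
    for index, line in enumerate(recognized_lines):
        lowered = line.lower()
        for keyword in watchlist:
            if keyword.lower() in lowered:
                hits.setdefault(keyword, []).append(index)
    return hits
```

Running this on the client keeps the server responsible only for detection and recognition, which matches the division of work described above.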
Fig. 4. Text recognition network.
Fig. 5. System front and rear interaction diagram.
Fig. 6. System timing diagram.
4. Model Experiments and System Evaluation
In the multi-scale segmentation CNN with attention, the ResNet18 classification network
is used as the feature extraction network, and the pre-training dataset is a synthesized
text detection dataset containing a mixture of Chinese and English. Images are then
normalized before training on real data, and training uses data augmentation, including
random cropping, blurring, filtering, and rotation. During training, the loss is calculated
and the parameters are updated, whereas in validation each sample is passed through the
network with the parameters kept unchanged. Finally, the test set is used to validate A,
and the parameters with the highest A are kept as the final model. The real dataset was
then trained iteratively in the experimental environment. Fig. 7 shows how the loss function and the binarization loss of the model change with
iterations on the ICDAR2015 real dataset.
As Fig. 7 shows, both the training loss and the validation loss on the real dataset decrease
as the number of iterations increases. At 1000 iterations, both have converged; the training
loss of the model is 0.2, and the validation
loss is 0.8. The parameters at this point are retained, and the segmentation
results are binarized. The character detection model uses network segmentation
to reduce model complexity and simplify post-processing. In
post-processing, a built-in function is used to directly obtain the contour of each
connected domain. Then the minimum bounding polygon of the connected domain, usually
a rectangle, is computed to obtain the coordinates of the text
area in the image. The model was trained on the ICDAR2015 dataset, its performance
indicators were measured according to the formulas above, and the results were compared
with the benchmark model on the ICDAR2013 dataset in Fig. 8.
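The post-processing step, binarizing the segmentation map and boxing each connected domain, can be sketched in plain Python. The paper uses an unnamed built-in contour function; the breadth-first flood fill below is a stand-in for it, and the threshold of 0.5, 4-connectivity, axis-aligned boxes, and the function name are assumptions.

```python
from collections import deque


def text_boxes(prob_map, threshold=0.5):
    """Binarize a segmentation map (list of rows of probabilities) and return one
    axis-aligned box (x_min, y_min, x_max, y_max) per 4-connected foreground region."""
    h, w = len(prob_map), len(prob_map[0])
    mask = [[value >= threshold for value in row] for row in prob_map]
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if not mask[y][x] or seen[y][x]:
                continue
            # Flood-fill one connected domain while tracking its extent.
            queue = deque([(y, x)])
            seen[y][x] = True
            x_min = x_max = x
            y_min = y_max = y
            while queue:
                cy, cx = queue.popleft()
                x_min, x_max = min(x_min, cx), max(x_max, cx)
                y_min, y_max = min(y_min, cy), max(y_max, cy)
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((x_min, y_min, x_max, y_max))
    return boxes
```

A rotated minimum-area rectangle would fit slanted text more tightly than the axis-aligned box used in this sketch.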
As Fig. 8 shows, on a test set of 500 images the model achieves P=85.9%, A=83.1%, R=81.4%,
and an F1 score of 81.8%. The performance of the main character detection models is then
compared in Fig. 9.
In Fig. 9, DB Net has the highest P, at 86.3%. PAN has the highest R, at 81.9%, and
the highest F1 score, at 82.9%. Overall, PAN performs well. The multi-scale model was
then examined further, and models with and without the multi-scale
structure were compared visually. The results confirm that
the multi-scale model has good detection performance: it balances
large-scale and small-scale text and combines semantic features at different levels
to represent rich feature information. This further indicates the advantage of multi-scale
models in character detection. For the text recognition task, the model is first built
by pre-training and then trained and adjusted on real data to
obtain the best recognition model. Model training thus includes pre-training and
fine-tuning. The pre-training dataset is artificially synthesized and is only
mean-subtracted per channel, while the fine-tuning stage requires preprocessing of the
real dataset. During pre-training, the losses are monitored over iterations to avoid
overfitting, as shown in Fig. 10.
In Fig. 10, the synthesized dataset includes 1 million images, the learning rate is 0.0008,
and training runs for 2400 iterations. The learning rate decreases
stepwise as iterations increase, the training loss decreases
accordingly, and the validation loss first decreases and then stabilizes. At this
point, the model has converged. The model was then tested for character recognition
accuracy (A) on the full set of datasets. Four datasets are used.
The ICDAR2017RCTW scene text recognition dataset consists mainly of Chinese images,
most with relatively clear backgrounds and text. The Baidu scene
text recognition dataset includes both Chinese and English, and its image quality
varies with the capture equipment. The CUTE80 dataset is mainly in English, with
mostly clear backgrounds and text. The SVTP dataset contains English data captured
from Google Street View, with distorted perspectives. Visualization experiments were
conducted on the ICDAR2017RCTW and Baidu scene text recognition
datasets. The recognition network, which considers both the two-dimensional characteristics
of text images and the sequential characteristics of text, is robust in recognizing curved
and distorted text. The Baidu, CUTE80, and
SVTP datasets were then used to test the recognition model, and the model’s
recognition rates for Chinese and English were compared in Fig. 11.
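The stepwise learning rate decay mentioned for pre-training can be sketched as a step schedule. Only the rate 0.0008 and the 2400 iterations come from the text; treating 0.0008 as the starting rate, the step size of 800 iterations, and the decay factor of 0.1 are assumptions chosen for illustration.

```python
def step_learning_rate(iteration, base_lr=0.0008, step_size=800, gamma=0.1):
    """Learning rate that is multiplied by gamma after every step_size iterations."""
    return base_lr * gamma ** (iteration // step_size)
```

Under these assumed values, the rate stays at 0.0008 for iterations 0 to 799, drops to 0.00008 at iteration 800, and drops again at iteration 1600.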
As Fig. 11 shows, the recognition accuracies on the three datasets are 86%, 89%, and 93%,
respectively. The CUTE80 and SVTP datasets score higher because both are pure English
datasets with relatively clear backgrounds and text. In addition, the CUTE80 dataset
contains a small amount of curved text, while the SVTP dataset contains none.
Finally, the detection performance of other methods is compared
in Table 1.
In Table 1, the detection performance of the methods varies. DB Net achieves the highest
A, R, and F1 score, at 96%, 84%, and 88%, respectively, while the other methods are
spread over a wide range. More data are therefore needed in text detection experiments
to identify the best solution. Based on the deep learning technology adopted in the OCR
system framework, a multi-scale network structure that takes both global and local
information into account is designed. Applying it to the datasets yields good results,
which improves the model’s performance in system detection and recognition.
Fig. 7. Change of Loss function and binarization loss.
Fig. 8. Result graph of performance indicators on the dataset.
Fig. 9. Performance comparison of different models.
Fig. 10. Changes in Learning rate, training loss and verification loss.
Fig. 11. Compare data recognition performance and recognition types.
Table 1. Comparison of Detection Performance of Different Methods.
| Method | Accuracy (%) | Recall (%) | F1 score (%) |
|---|---|---|---|
| DB-Net | 96 | 84 | 88 |
| Deep-Text | 84 | 81 | 88 |
| CTPN | 93 | 83 | 88 |
| Text-FCN | 67 | 75 | 70 |
| Faster-RCNN | 75 | 71 | 73 |
| Baseline | 81 | 71 | 75 |
| EAST | 83 | 78 | 81 |
| CE-Net | 86 | 79 | 83 |
| FTPN | 69 | 78 | 73 |
| FOTS | 89 | 82 | 85 |
5. Conclusion
For the extraction and recognition of poster text information, this research uses
SS technology, CNN, and an OCR system to construct models and algorithms
for character detection, text recognition, and text keyword extraction. In text detection,
SS is used to optimize the label production method, and the combined loss function is used
to train the model. At 1000 iterations, the training loss is 0.2, and the
validation loss is 0.8. On the ICDAR2015 dataset, the model indicators are P=85.9%,
A=83.1%, R=81.4%, and F1 score=81.8%. In text recognition, an improved
sequence encoding model is combined with two-dimensional spatial
features to establish a multi-scale sequence encoding and decoding text recognition
algorithm with attention. The recognition rates for different scenes are 86%, 89%,
and 93%, respectively, and the recognition rates for different types are 97.7%, 98.5%,
98.9%, and 98.6%, respectively. Finally, based on the OCR system framework deployed
across server and client, the text detection and recognition modules
were integrated, keyword extraction results were output to the client, and the performance
of the entire system was tested. The results show that the SS-based multi-scale sequence
encoding detection and recognition model with AM has advantages in extracting
poster text information. However, the research still lacks large-scale experimental
data and multifaceted testing in practical application environments. Sakshi et al.
studied handwritten symbol recognition to obtain features and improve
classification and recognition capabilities [20]. Therefore, further research and improvement are needed in subsequent work.
Acknowledgment
The research is supported by: Anhui Province Higher Education Science Research Project
(Philosophy and Social Sciences): Research on the Protection and Utilization of Intangible
Cultural Heritage in the Huizhou Region from the Perspective of Rural Revitalization:
A Case Study of Shexian Woodcarving Inheritance (No.2022AH052033).
Disclosure statement
The author reports there are no competing interests to declare.
References
[1] M. Liao, Z. Zou, Z. Wan, C. Yao, X. Bai, Real-time scene text detection with differentiable binarization and adaptive scale fusion, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 45, No. 1, pp. 919-931, 2022

[2] G. Wu, Z. Zhang, Y. Xiong, CarveNet: a channel-wise attention-based network for irregular scene text recognition, International Journal on Document Analysis and Recognition, Vol. 25, No. 3, pp. 177-186, 2022

[3] M. Li, T. Lv, J. Chen, L. Cui, Y. Lu, D. Florencio, C. Zhang, Z. Li, F. Wei, Trocr: transformer-based optical character recognition with pre-trained models, Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 37, No. 11, pp. 13094-13102, 2023

[4] T. H. Phan, D. C. Tran, M. F. Hassan, Vietnamese character recognition based on CNN model with reduced character classes, Bulletin of Electrical Engineering and Informatics, Vol. 10, No. 2, pp. 962-969, 2021

[5] Y. C. Liu, H. Joren, O. Gupta, D. Raviv, MRZ code extraction from visa and passport documents using convolutional neural networks, International Journal on Document Analysis and Recognition, Vol. 25, No. 1, pp. 29-39, 2022

[6] Y. Liu, C. Shen, L. Jin, H. Tong, P. Chen, C. Liu, H. Chen, Abcnet v2: adaptive bezier-curve network for real-time end-to-end text spotting, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 44, No. 11, pp. 8048-8064, 2021

[7] S. Hu, Q. Wang, K. Huang, M. Wen, F. Coenen, Retrieval-based language model adaptation for handwritten Chinese text recognition, International Journal on Document Analysis and Recognition, Vol. 26, No. 2, pp. 109-119, 2023

[8] T. M. Ghazal, Convolutional neural network based intelligent handwritten document recognition, Computers, Materials & Continua, Vol. 70, No. 3, pp. 4563-4581, 2022

[9] X. Ma, H. Xu, X. Zhang, H. Wang, An improved deep learning network structure for multitask text implication translation character recognition, Complexity, Vol. 2021, No. 5, pp. 901-911, 2021

[10] A. Oluwasammi, M. U. Aftab, Z. Qin, S. T. Ngo, T. V. Doan, S. B. Nguyen, S. H. Nguyuen, G. H. Nguyen, Features to text: a comprehensive survey of deep learning on semantic segmentation and image captioning, Complexity, Vol. 2021, No. 8, pp. 1-19, 2021

[11] W. Wang, E. Xie, X. Li, D. Fan, K. Song, L. Ding, T. Lu, P. Luo, L. Shao, Pvt v2: improved baselines with pyramid vision transformer, Computational Visual Media, Vol. 8, No. 3, pp. 415-424, 2022

[12] T. Diwan, G. Anirudh, J. V. Tembhurne, Object detection using YOLO: challenges, architectural successors, datasets and applications, Multimedia Tools and Applications, Vol. 82, No. 6, pp. 9243-9275, 2023

[13] R. Karthika, L. Parameswaran, A novel convolutional neural network based architecture for object detection and recognition with an application to traffic sign recognition from road scenes, Pattern Recognition and Image Analysis, Vol. 32, No. 2, pp. 351-362, 2022

[14] S. Biswas, P. Riba, J. Lladós, U. Pal, Beyond document object detection: instance-level segmentation of complex layouts, International Journal on Document Analysis and Recognition, Vol. 24, No. 3, pp. 269-281, 2021

[15] M. Jia, L. Shen, X. Shen, L. Liao, M. Chen, X. He, Z. Chen, J. Li, Mner-qg: an end-to-end mrc framework for multimodal named entity recognition with query grounding, Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 37, No. 7, pp. 8032-8040, 2023

[16] Y. Guo, Z. Mustafaoglu, D. Koundal, Spam detection using bidirectional transformers and machine learning classifier algorithms, Journal of Computational and Cognitive Engineering, Vol. 2, No. 1, pp. 5-9, 2023

[17] Y. Wu, J. Jiang, Z. Huang, Y. Tian, FPANet: feature pyramid aggregation network for real-time semantic segmentation, Applied Intelligence, Vol. 52, No. 3, pp. 3319-3336, 2022

[18] B. Gao, Y. Pan, C. Li, S. Geng, H. Zhao, Are we hungry for 3D LiDAR data for semantic segmentation? A survey of datasets and methods, IEEE Transactions on Intelligent Transportation Systems, Vol. 23, No. 7, pp. 6063-6081, 2022

[19] X. Zhao, R. Qin, Q. Zhang, F. Yu, Q. Wang, B. He, DcNet: dilated convolutional neural networks for side-scan sonar image semantic segmentation, Journal of Ocean University of China, Vol. 20, No. 5, pp. 1089-1096, 2021

[20] Sakshi, V. Kukreja, A retrospective study on handwritten mathematical symbols and expressions: classification and recognition, Engineering Applications of Artificial Intelligence, Vol. 103, 2021

Xianfeng Zeng obtained his master’s degree in Art Design (2015) from Wuhan University
of Technology. Presently, he is working as an Associate Professor and deputy dean
of the College of Art and Creativity, Anhui University of Applied Technology. He was
invited as a peer reviewer for the university’s journal. He has published articles
in more than 10 professional domestic journals, with 12 patents granted. His areas
of interest include visual design, intangible cultural heritage preservation, digital
media, and related fields.