  1. (College of Art and Creativity, Anhui University of Applied Technology, Hefei, 230011, China)



Text detection, Text recognition, Semantic segmentation, Convolutional neural network, Sequence encoding

1. Introduction

The development of poster text is closely tied to e-commerce and internet applications. However, the release of large volumes of poster copy places heavy demands on text review and on the development of the market environment, and manual review cannot keep up with this volume of poster information. Designing a poster text information review system can therefore reduce review pressure. The precision of text detection in images is a key research direction for completing the audit operation. Scholars have explored different methods for scene text detection and recognition, including differentiable binarization modules combined with adaptive scale fusion [1] and feature pyramid networks [2]. There are also novel approaches based on Transformer models [3]. Tencent Cloud's intelligent content security audit solution, for example, can quickly establish an intelligent content security audit platform covering multimedia scenes. However, existing research often uses a single deep learning text detection and recognition technique, which makes it difficult to achieve the expected performance on complex text detection and recognition tasks. To this end, this study combines semantic segmentation (SS) technology, convolutional neural networks (CNN), an attention mechanism (AM), and multi-scale sequence encoding algorithms. The aim is to construct a multi-scale sequence encoding model with attention that handles complex situations in text detection and recognition more comprehensively and addresses the limitations of existing research. It can also provide a technical reference for the development of poster copywriting and for the recognition and review of printed text information.

The main contributions of the research are as follows: (1) A multi-scale sequence encoding model that combines multiple technologies and algorithms is proposed, providing new technical ideas and methods for text detection and recognition tasks. (2) The character detection, text recognition, and keyword extraction processes in the optical character recognition system are optimized, reducing audit costs and improving work efficiency. (3) The work provides a reference for research on the intelligent review of poster text information, which helps to promote the development of this field. (4) Convolutional SS networks are used to detect text images, and SS channels are established to fuse semantic information, which improves the accuracy of image segmentation.

This study is organized in four parts. First, current detection and recognition models and systems are reviewed. Second, a text detection model based on SS and CNN is constructed, and its performance is compared on a dataset. Third, a multi-scale sequence encoding recognition algorithm that integrates AM is combined with an Optical Character Recognition (OCR) system to test performance across text recognition scenes and types. Finally, the entire study is summarized.

2. Related Work

The extraction of poster text information is an important product of the development of information technology. Accurately locating, detecting, and recognizing text information is currently a research hotspot in text review systems, and scholars have studied it extensively. Phan et al. combined edge detection algorithms with CNN to construct a classification model for Vietnamese character recognition, improving the effectiveness of the model [4]. Liu et al. proposed a CNN-based model for visa and passport recognition that extracts passport image information and achieves high detection and recognition rates [5]. Liu et al. proposed an adaptive Bezier curve network for end-to-end text spotting, improving recognition accuracy [6]. Hu et al. combined retrieval methods to construct an adaptive language model for handwritten text recognition, improving recognition performance [7]. Ghazal et al. proposed a handwritten document recognition system that uses CNN training for image processing and character segmentation and verified the high accuracy of the system [8]. Ma et al. applied CNN with multi-channel, multi-scale processing to text localization in character recognition and demonstrated a high recognition rate [9]. Oluwasammi et al. used deep learning image segmentation to obtain semantic information for text features, yielding effective semantic image segmentation methods [10]. In these studies, different models and systems were trained for text detection, recognition, and related tasks, and strong performance indicators were obtained for different application objects.

Building on these methods, network models and systems have also been constructed for detection and recognition in other fields, providing technical support for practical applications. Wang et al. improved the pyramid vision transformer to raise transformer performance in vision tasks [11]. Diwan et al. used You Only Look Once (YOLO) and its architectural successors to improve object detectors and their detection accuracy [12]. Karthika et al. combined CNN and YOLO to detect traffic signs in road scene recognition, improving the system's detection accuracy [13]. Biswass et al. combined deep learning methods with object recognition to extract information and obtain accurate text image recognition [14]. Jia et al. used AM and multimodal named entity recognition to improve the performance of visual base models for information extraction [15]. Guo et al. used bidirectional transformers and machine learning classification methods to improve the model's ability to detect spam [16]. Wu et al. proposed a feature pyramid aggregation network for SS that fuses features at different levels and obtains high accuracy [17]. Gao et al. surveyed 3D SS and deep learning technology for robotics and autonomous driving, analyzing datasets and exploring future research directions [18]. Zhao et al. applied dilated CNN to the SS of sonar images for ocean exploration, improving model accuracy [19].

In summary, previous scholars have established many models and systems for extracting text information and achieved good results in specific application scenarios. However, advertising text recognition still lacks extensive data, and most existing poster text information extraction methods rely heavily on training data, requiring large amounts of annotated data to train the model and achieve good generalization. Their accuracy in complex real-world scenarios also needs to be improved. Therefore, studying a multi-scale sequence encoding recognition model with attention that combines SS and CNN has important practical value, as it can help improve the model's text recognition rate in complex application scenarios.

3. Optimization of Algorithm Technology in OCR System

The recognition and detection of poster text information comprises three parts: character detection, text recognition, and text keyword extraction. The text detection part uses a convolutional SS network for text image detection, where unclear segmentation information affects the recognition of text image information. A rich set of text detection and recognition techniques is applied, including image preprocessing, CNN, and text classification and recognition, and these are combined with OCR to optimize the technology and algorithms.

3.1. Character Detection Network and Model Construction

The character detection module, an important part of the OCR system, mainly locates text in input images. First, a convolution-based deep learning method is used to locate image text. Second, the image text information is processed, and the image is fed to the network in tensor form. Finally, the segmented image is obtained. The study adopts an SS-based character detection model so that the network focuses only on the differences between text and background, which reduces the computational complexity of the model and the cost of network training, as shown in Fig. 1.

Fig. 1. Character detection network based on semantic segmentation.

../../Resources/ieie/IEIESPC.2026.15.3.323/fig1.png

Fig. 2. Attention dilated (hole) convolution module.

../../Resources/ieie/IEIESPC.2026.15.3.323/fig2.png

As Fig. 1 shows, the network model applies multi-scale AM to the image segmentation task; it is a multi-scale segmentation CNN with attention. The structure comprises a feature extraction stage, a segmentation stage, and a semantic feature fusion stage. The model removes the fully connected layer of the residual network so that its features can feed the subsequent segmentation stages. To reduce computational complexity, the network input size is fixed, and images are cropped so that the text remains complete. Because both the input and output of the segmentation network are images, a multi-level feature fusion pyramid structure is adopted on the output to fuse high-level and low-level semantic features. The output features are passed through a convolution, then stacked and reduced in dimension to output two channels: one is the center segmentation map of the text area, and the other is the boundary segmentation map. The two maps are fused and binarized to obtain the final segmentation feature map, which is then processed to obtain the text coordinate box. To ensure that the network retains rich global information while still reflecting rich local features, the feature extraction section is improved as shown in Fig. 2.

As Fig. 2 shows, the dilated (hole) convolution module with AM is embedded in the third and fourth layers of the feature extraction stage. In the third residual block, dilated convolution replaces the original convolution, and the fourth residual block contains a mixed dilated convolution module with attention. AM is added to each dilated convolution branch to assign learned weights, enabling the model to actively learn and select important channel feature maps, which enhances the network's detection performance at multiple scales. In the segmentation stage, a multi-scale network structure is adopted to construct a segmentation channel composed of upsampling and convolution. The channel includes three convolutional layers that receive feature maps of the same size from the feature extraction stage. After feature fusion, the SS fusion stage outputs rich and complete segmentation images. Finally, the loss value is calculated to complete the image segmentation task. In segmentation networks, a metric based on Intersection over Union (IOU) alone can leave differences between the predicted foreground segmentation map and the actual label. To address this shortcoming, a traditional edge detection operator is used to add an edge penalty, making the predicted image more consistent with the labeled image. Eq. (1) is the loss function used.

(1)
$ \begin{cases} \Delta f = \left| \delta\left( conv(f, K_{laplace}) \right) \right|, \\ L_b = \sum_{i=0}^{size(Y)} \left[ \Delta Y_i \log(\Delta P_i) + (1-\Delta Y_i) \times \log(1-\Delta P_i) \right]. \end{cases} $

In Eq. (1), $f$ represents the input image, $K_{laplace}$ is the Laplacian operator kernel, and $\delta$ is the ReLU activation function. $\Delta Y$ and $\Delta P$ are the edge maps of the label segmentation map and the predicted segmentation map, respectively. Cross entropy is computed between the edge gradient of the image label and the edge of the predicted segmentation, which optimizes the segmentation boundary and prevents adjacent text regions from sticking together.
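The edge term of Eq. (1) can be illustrated with a small sketch. It is not the authors' implementation: the 3x3 Laplacian kernel, the zero padding, the clamping of edge values to [0, 1], and the function names `edge_map` and `edge_bce` are all assumptions made for illustration. The cross entropy is written in its conventional negated form so that a lower value means a better match.

```python
import math

# One common 3x3 Laplacian kernel (assumed; the paper does not list its kernel).
K_LAPLACE = [[0, 1, 0],
             [1, -4, 1],
             [0, 1, 0]]

def edge_map(f):
    """Delta f = |ReLU(conv(f, K_laplace))| for a 2D list, with zero padding.

    Values are clamped to [0, 1] (an assumption) so they can be used
    directly as targets/probabilities in the cross entropy below.
    """
    h, w = len(f), len(f[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += f[yy][xx] * K_LAPLACE[dy + 1][dx + 1]
            out[y][x] = min(abs(max(acc, 0.0)), 1.0)
    return out

def edge_bce(label, pred, eps=1e-7):
    """Negated cross entropy between the edge maps of label and prediction."""
    total = 0.0
    for row_y, row_p in zip(edge_map(label), edge_map(pred)):
        for y_i, p_i in zip(row_y, row_p):
            p_i = min(max(p_i, eps), 1.0 - eps)  # keep log() finite
            total += y_i * math.log(p_i) + (1.0 - y_i) * math.log(1.0 - p_i)
    return -total
```

With this kernel and ReLU, the edge map marks the background pixels that directly border a foreground region, so the penalty concentrates on the boundary rather than on the interior of text areas.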

3.2. Text Detection Algorithms and Evaluation Indicators

Another approach uses the Binary Cross Entropy (BCE) loss and the Dice loss for foreground and background segmentation at the single-pixel level and the class level, respectively. However, because neither captures the relationship between adjacent pixels, Structural Similarity (SSIM) is introduced to address this shortcoming of IOU-based training, with Eq. (2) as its loss function.

(2)
$ L_{ssim} = 1 - \frac{(2\alpha_x\alpha_y + C_1) \times (2\beta_{xy} + C_2)}{(\alpha_x^2 + \alpha_y^2 + C_1) \times (\beta_x^2 + \beta_y^2 + C_2)}. $
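As a rough illustration of Eq. (2), the sketch below computes the SSIM loss from global statistics of two flattened maps. The function name `ssim_loss`, the use of whole-map rather than sliding-window statistics, and the default constants (the common choices 0.01^2 and 0.03^2 for values in [0, 1]) are assumptions; the paper does not state them.

```python
def ssim_loss(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """1 - SSIM for two equally sized flat lists of values in [0, 1].

    mu_* play the role of alpha_* in Eq. (2); var_* and cov_xy play the
    role of beta_*^2 and beta_xy.
    """
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    ssim = ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
    return 1.0 - ssim
```

Identical maps give a loss of zero, while a map and its inverse give a loss close to 2, because their covariance is strongly negative.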

In Eq. (2), $\alpha_x$ and $\alpha_y$ are the mean values of the label map and the predicted map, $\beta_x$ and $\beta_y$ are their standard deviations, and $\beta_{xy}$ is their covariance. To avoid division by zero when the means and standard deviations vanish, two small constants, $C_1$ and $C_2$, are added. Furthermore, the BCE loss function in Eq. (3) is used.

(3)
$ L_{BCE} = \sum_{i=0}^{size(Y)} \left[ \Delta Y_i \log(\Delta P_i) + (1-\Delta Y_i) \times \log(1-\Delta P_i) \right]. $

In Eq. (3), $\Delta Y$ and $\Delta P$ represent the label segmentation map and the predicted segmentation map, respectively. Finally, the image segmentation losses and the edge loss are combined to calculate the final loss function in Eq. (4).

(4)
$ L = A(L_{center} + L_{BCE}) + B \times L_{ssim} + O \times L_b. $

In Eq. (4), $L_{center}$ is the text center loss function, which is the Dice loss. $A$, $B$, and $O$ are hyperparameters set to 0.7, 0.2, and 0.1, respectively, which weight the importance of each loss term to the network. To reduce overfitting, the smooth Dice loss in Eq. (5) is used.

(5)
$ L_{center} = 1 - \frac{1 + 2\sum_i P_{center}(i) \times T_{center}(i)}{1 + \sum_i P_{center}(i)^2 + \sum_i T_{center}(i)^2}. $
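A minimal sketch of the smooth center loss in Eq. (5) and the weighted combination in Eq. (4) follows. The function names are hypothetical, the inputs are assumed to be flattened lists of pixel values, and the individual loss terms passed to `total_loss` are assumed to have been computed elsewhere.

```python
def smooth_center_loss(pred, target):
    """Eq. (5): 1 - (1 + 2 * sum(P*T)) / (1 + sum(P^2) + sum(T^2))."""
    inter = sum(p * t for p, t in zip(pred, target))
    denom = sum(p * p for p in pred) + sum(t * t for t in target)
    # The added 1 in numerator and denominator keeps the ratio defined
    # even when both maps are entirely zero.
    return 1.0 - (1.0 + 2.0 * inter) / (1.0 + denom)

def total_loss(l_center, l_bce, l_ssim, l_b, a=0.7, b=0.2, o=0.1):
    """Eq. (4): L = A(L_center + L_BCE) + B * L_ssim + O * L_b."""
    return a * (l_center + l_bce) + b * l_ssim + o * l_b
```

For example, two all-zero maps give a center loss of 0 instead of an undefined 0/0, which is the case the smoothing term is meant to handle.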

In Eq. (5), $P_{center}(i)$ and $T_{center}(i)$ are the predicted $i$-th pixel value for text center segmentation and the $i$-th pixel value of the text center label, respectively. This function avoids division by zero when both the label and the segmentation map are all zeros. Combining multiple loss functions accelerates the convergence of network training, and the different functions act on different channels, which alleviates gradient flattening in the later stage of training and thereby promotes network learning. Long Short-Term Memory (LSTM) is a type of Recurrent Neural Network (RNN), and Fig. 3 shows its structural unit.

As Fig. 3 indicates, the RNN structure is prone to vanishing and exploding gradients. LSTM therefore adds a forget gate mechanism to ensure that error information propagates effectively during network training. A main focus of network training is the production of data labels. Text data are annotated with quadrangles, i.e., four coordinate points that represent the text area. The segmentation-based character detection method shrinks the text box inward to form the center area of the text. Binarizing the image to create segmentation labels reduces the error of manual annotation. To avoid text regions sticking together after segmentation, the label-making method in Eq. (6) is adopted.

(6)
$ r_i = \min\big(D(p_i, p_{(i~(\text{mod}~4))+1}),\; D(p_i, p_{((i+2)~(\text{mod}~4))+1})\big). $
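The reference length in Eq. (6) can be sketched as below, reading it as the shorter of the two edges that meet at each vertex (the distance to the next vertex and to the previous vertex). That reading, the 0-based indexing, and the function name `reference_lengths` are assumptions for illustration.

```python
import math

def reference_lengths(quad):
    """Return [r_1, ..., r_4] for a quadrangle given as four (x, y)
    vertices listed in order around the box."""
    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])

    n = len(quad)
    return [
        min(dist(quad[i], quad[(i + 1) % n]),   # edge to the next vertex
            dist(quad[i], quad[(i - 1) % n]))   # edge to the previous vertex
        for i in range(n)
    ]
```

For an axis-aligned 4 by 2 box, every vertex touches one edge of length 4 and one of length 2, so every reference length is 2.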

In Eq. (6), $D(p_i, p_j)$ is the distance between vertices $p_i$ and $p_j$ of the label quadrangle. The two longer edges are shrunk first, followed by the two shorter edges; the longer pair is determined by comparing the average lengths of the two pairs of opposite edges. Each edge $(p_i, p_{(i(\text{mod}~4))+1})$ is shrunk by moving its two endpoints inward along the edge. In the basic label production, a fixed ratio is used to reduce each side length. However, with both long and short texts present, a fixed ratio is prone to breaking text regions, so the ratio is set according to Eq. (7).

(7)
$ R = \begin{cases} 0.35, & \frac{s}{l} > 5, \\ 0.15, & \frac{s}{l} \le 5. \end{cases} $

In Eq. (7), $R$ is the scaling ratio of the short side, and $l$ and $s$ represent the long and short sides of the text. As the length-width ratio increases, the change in the long side decreases. In network training, the two labeling methods are combined and the two kinds of semantic information are fused to locate the segmented text area, which introduces new information and improves the effectiveness of training. For the text detection module, the detection indicators for characters and text are Recall (R), Precision (P), Accuracy (A), and F1 score. R is the proportion of actual positive samples that the model correctly identifies as positive, as given in Eq. (8).

(8)
$ Recall = TP / (TP + FN). $

In Eq. (8), $TP$ is the number of correctly labeled positive samples, i.e., text correctly predicted as text, and $FN$ is the number of positive samples mislabeled as negative. Text areas are positive samples, and background areas are negative samples. P is the proportion of correctly labeled positive samples among all samples predicted as positive, as given in Eq. (9).

(9)
$ Precision = TP / (TP + FP). $

In Eq. (9), $FP$ is the number of negative samples mislabeled as positive, i.e., background predicted as text. A is the detection accuracy over the whole test set, as given in Eq. (10).

(10)
$ Accuracy = (TP + TN) / (TP + TN + FP + FN). $

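In Eq. (10), $TN$ is the number of correctly labeled negative samples, i.e., background predicted as background, and $FN$ is the number of positive samples mislabeled as negative, i.e., text areas predicted as background. The F1 score is the harmonic mean of P and R, as given in Eq. (11).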

(11)
$ F1 = \frac{2TP}{2TP + FP + FN} = \frac{2 \times Precision \times Recall}{Precision + Recall}. $

In Eq. (11), $TP$ is the number of correctly labeled positive samples, $FP$ is the number of negative samples mislabeled as positive, and $FN$ is the number of positive samples mislabeled as negative.
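The four indicators in Eqs. (8)-(11) follow directly from the confusion counts. The sketch below uses a hypothetical helper name and made-up counts purely to show the arithmetic; returning 0.0 for an empty denominator is an assumption, not something the paper specifies.

```python
def detection_metrics(tp, fp, tn, fn):
    """Recall, precision, accuracy, and F1 from confusion counts."""
    def ratio(num, den):
        return num / den if den else 0.0  # assumed convention for empty cases

    return {
        "recall": ratio(tp, tp + fn),                      # Eq. (8)
        "precision": ratio(tp, tp + fp),                   # Eq. (9)
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),     # Eq. (10)
        "f1": ratio(2 * tp, 2 * tp + fp + fn),             # Eq. (11)
    }

# Illustrative counts only (not taken from the paper's experiments).
example = detection_metrics(tp=8, fp=2, tn=5, fn=2)
```

Both forms of Eq. (11) agree: with these counts, 2TP / (2TP + FP + FN) and the harmonic mean of precision and recall give the same F1 value.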

Fig. 3. RNN cell unit diagram.

../../Resources/ieie/IEIESPC.2026.15.3.323/fig3.png

3.3. Character Recognition Network and Model Construction

The text recognition module is also an important part of the OCR system. Building on character detection, it converts text images into electronic documents that can be saved. Text recognition networks usually follow one of three approaches: image-correction-based, AM-based, and multi-directional encoding. Following existing research on recognition technology, an attention-based multi-directional character recognition network is used, as shown in Fig. 4.

In Fig. 4, feature encoding refers to feature extraction. A residual network serves as the feature extraction network to extract the two-dimensional spatial features of the image. The feature map learned by the network is then updated with attention parameters and finally summed element-wise with the features for sequence encoding. Based on the two-dimensional spatial properties of images and the long-sequence features of text, the model trains a sequence encoder with attention whose output is decoded by the text recognition algorithm. In the recognition task, the image is first encoded by the CNN into a two-dimensional feature map, and a two-dimensional attention structure then updates the positional parameters of characters according to Eq. (12).

(12)
$ g_{ij} = \tanh \left( W_v \times v_{ij} + \sum_{(p,q) \in N_{ij}} W_{p-i,q-j} \times \tilde{v}_{pq} + W_h \times h'_t \right). $

In Eq. (12), $v_{ij}$ is the feature vector formed at position $(i, j)$ across all channels of the feature map. $N_{ij}$ is the set of eight pixels adjacent to coordinate $(i, j)$. $h'_t$ is the hidden state of the LSTM at decoding step $t$. $W_v$ and $W_h$ are learnable weights. Eq. (13) gives the coordinate weight.

(13)
$ \phi_{ij} = softmax(w_g^T \times g_{ij}). $

In Eq. (13), $\phi_{ij}$ is the weight of coordinate $(i, j)$. Eq. (14) represents the weight of local features.

(14)
$ c_t = \sum_{i, j} \phi_{ij} \times v_{ij}. $

In Eq. (14), $c_t$ is the local feature at decoding step $t$, obtained as the sum of the feature vectors $v_{ij}$ weighted by $\phi_{ij}$. The coordinates must satisfy the conditions in Eq. (15).

(15)
$ \begin{cases} i = 1,~...,~H, \\ j = 1,~...,~W. \end{cases} $
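Eqs. (12)-(14) together describe one decoding step of two-dimensional attention. The sketch below is a plain-Python illustration under several assumptions: the function and parameter names are hypothetical, the neighbor features are taken to be the same feature vectors as $v_{pq}$, neighbors outside the map are skipped (zero padding), and the softmax is taken over all $H \times W$ positions.

```python
import math

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def attention_2d(v, h_t, w_v, w_n, w_h, w_g):
    """One 2D attention step.

    v   : H x W grid of d-dimensional feature vectors
    h_t : LSTM hidden state at step t
    w_v : k x d matrix, w_h : k x len(h_t) matrix, w_g : length-k vector
    w_n : dict mapping a neighbor offset (dp, dq) to a k x d matrix
    Returns (phi, c_t): the H x W weight map and the weighted feature.
    """
    H, W = len(v), len(v[0])
    wh = matvec(w_h, h_t)
    scores = []
    for i in range(H):
        for j in range(W):
            g = [a + b for a, b in zip(matvec(w_v, v[i][j]), wh)]
            for (dp, dq), w in w_n.items():
                p, q = i + dp, j + dq
                if 0 <= p < H and 0 <= q < W:
                    g = [a + b for a, b in zip(g, matvec(w, v[p][q]))]
            g = [math.tanh(x) for x in g]                       # Eq. (12)
            scores.append(sum(a * b for a, b in zip(w_g, g)))
    top = max(scores)                                           # for stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    phi = [[exps[i * W + j] / z for j in range(W)] for i in range(H)]  # Eq. (13)
    d = len(v[0][0])
    c_t = [sum(phi[i][j] * v[i][j][c] for i in range(H) for j in range(W))
           for c in range(d)]                                   # Eq. (14)
    return phi, c_t
```

With all weights set to zero, every position receives the same score, so the weight map is uniform and the output is simply the mean feature vector; non-zero weights shift the weight toward positions with higher scores.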

In Eq. (15), $i$ and $j$ are the row and column indices of a coordinate, and $H$ and $W$ are the height and width of the feature map. To better extract image features and to allow comparison with other methods, ResNet50 was used in the experiment, with the final fully connected layer and pooling layer of the residual network removed. An LSTM structure was used in the sequence feature extraction and decoding stages. The added AM selects local two-dimensional information of the image during decoding to improve recognition ability. When the attention module is applied in feature extraction and decoding, attention is automatically aligned with the image feature regions and fused with the convolutional feature map, which improves the feature response and promotes feature selection for encoding.

Finally, the overall system was designed based on the above modules. Because text detection and recognition require substantial computing resources, both run on the server. To reduce the load on back-end computing devices, text auditing is placed on the client, so that the recognition results are transmitted to the client for keyword detection. Fig. 5 shows the interaction between the front and back ends.

As Fig. 5 shows, the front-end and back-end operating interfaces of the system connect the text detection and recognition processes so that the better-performing models can be selected. Fig. 6 shows the system sequence diagram.

As Fig. 6 shows, the user selects and uploads images, and the front end sends a request to the server for text detection. The detected image regions are passed to text recognition, and the recognition results are returned to the client, where the keyword extraction module produces and displays the final result. The poster text processed in this study contains complex and rich text information, which is handled with deep learning technology for Chinese and English detection. The text recognition model can recognize text images of variable length in both Chinese and English. Finally, a recognition function for some special symbols is added to this module.

Fig. 4. Text recognition network.

../../Resources/ieie/IEIESPC.2026.15.3.323/fig4.png

Fig. 5. System front-end and back-end interaction diagram.

../../Resources/ieie/IEIESPC.2026.15.3.323/fig5.png

Fig. 6. System sequence diagram.

../../Resources/ieie/IEIESPC.2026.15.3.323/fig6.png

4. Model Experiments and System Evaluation

In the multi-scale segmentation CNN with attention, the ResNet18 classification network is used as the feature extraction network, and the pre-training dataset is a synthesized text detection dataset mixing Chinese and English. Images are then normalized before training on real data, and training uses data augmentation, including random cropping, blurring, filtering, and rotation. During training, the loss is calculated and the parameters are updated; during validation, each sample is processed with the network parameters kept unchanged. Finally, the test set is used to measure A, and the parameters with the highest A are taken as the final model. The real dataset was trained iteratively in the experimental environment. Fig. 7 shows how the loss function and the binarization loss of the model change on the ICDAR2015 real dataset.

As Fig. 7 shows, both the training loss and the validation loss on the real dataset decrease as the number of iterations increases. At 1000 iterations, both have converged, with a training loss of 0.2 and a validation loss of 0.8. The parameters at this point are retained, and the segmentation results are binarized. Network segmentation is used in the character detection model to reduce model complexity and simplify the post-segmentation process. In post-processing, a built-in function directly obtains the contour of each connected component. The minimum enclosing polygon of each connected component, usually a rectangle, is then computed to obtain the coordinates of the text area in the image. The model was trained on the ICDAR2015 dataset, its performance indicators were computed according to the formulas above, and the results were compared with the benchmark model on the ICDAR2013 dataset in Fig. 8.
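The post-processing step described above (binarize, find connected components, take an enclosing rectangle) can be sketched in plain Python. The paper does not name the built-in function it uses, so this sketch substitutes a breadth-first search with 4-connectivity and axis-aligned rectangles; the 0.5 threshold and the function name `text_boxes` are likewise assumptions.

```python
from collections import deque

def text_boxes(prob_map, threshold=0.5):
    """Return one (x_min, y_min, x_max, y_max) box per connected
    foreground region of a 2D probability map."""
    h, w = len(prob_map), len(prob_map[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if prob_map[y][x] <= threshold or seen[y][x]:
                continue
            # Start a new component and grow it with breadth-first search.
            seen[y][x] = True
            queue = deque([(y, x)])
            x0 = x1 = x
            y0 = y1 = y
            while queue:
                cy, cx = queue.popleft()
                x0, x1 = min(x0, cx), max(x1, cx)
                y0, y1 = min(y0, cy), max(y1, cy)
                for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                               (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < h and 0 <= nx < w and not seen[ny][nx]
                            and prob_map[ny][nx] > threshold):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((x0, y0, x1, y1))
    return boxes
```

Each returned tuple gives the pixel coordinates of one detected text region; a production system would more likely use rotated minimum-area rectangles to handle slanted text.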

As Fig. 8 shows, on a test set of 500 images the model achieves P=85.9%, A=83.1%, R=81.4%, and an F1 score of 81.8%. The main character detection models are then compared in Fig. 9.

As Fig. 9 shows, DB Net has the highest P at 86.3%, while PAN has the highest R at 81.9% and the highest F1 score at 82.9%. Overall, PAN performs well. The multi-scale model was then improved, and models with and without the multi-scale structure were analyzed visually. The results confirm that the multi-scale structural model detects well: it balances large-scale and small-scale text and combines semantic features at different levels to present rich feature information, which further indicates the superiority of multi-scale models in character detection. In the text recognition task, the model is first pre-trained and then trained and adjusted on real data to obtain the best recognition model. Model training thus includes pre-training and fine-tuning. The pre-training dataset is artificially synthesized and only has the channel means subtracted, while fine-tuning requires preprocessing of the real dataset. During pre-training, the loss is monitored over iterations to avoid overfitting, as shown in Fig. 10.

As Fig. 10 shows, the synthesized dataset includes 1 million images, with a learning rate of 0.0008 and 2400 training iterations. The learning rate decreases stepwise as iterations increase, the training loss decreases accordingly, and the validation loss first decreases and then stabilizes, at which point the model has converged. Character recognition was then tested on each dataset in terms of A. Four datasets are mainly used. The ICDAR2017RCTW scene text recognition dataset mainly consists of Chinese images, most with relatively clear backgrounds and text. The Baidu scene text recognition dataset includes both Chinese and English, and its image quality varies with the capture equipment. The CUTE80 dataset is mainly in English, with mostly clear backgrounds and text. The SVTP dataset consists of English data captured from Google Street View, with perspective distortion. Visualization experiments were conducted on the ICDAR2017RCTW and Baidu scene text recognition datasets. The recognition network, which considers both the two-dimensional characteristics of text images and the sequential characteristics of text, is robust in recognizing curved and distorted text. The Baidu scene text recognition dataset, CUTE80 dataset, and SVTP dataset were then used to test the recognition model, and the model's recognition rates for Chinese and English were compared in Fig. 11.

As Fig. 11 shows, the recognition accuracies on the three datasets are 86%, 89%, and 93%, respectively. The CUTE80 and SVTP datasets score higher because both are pure English datasets with relatively clear backgrounds and text. In addition, the CUTE80 dataset contains a small amount of curved text, while the SVTP dataset contains none. Finally, the performance of other methods is compared in Table 1.

Table 1 shows that detection performance varies across methods. DB Net has the highest A, R, and F1 score, at 96%, 84%, and 88%, respectively, and the other methods perform at varying levels. More data are therefore needed in text detection experiments to identify the optimal solution. Based on the deep learning technology adopted by the OCR system framework, a multi-scale network structure that accounts for both global and local information is designed, and it achieves good results on the datasets, improving the model's performance in system detection and recognition.

Fig. 7. Changes in the loss function and binarization loss.

../../Resources/ieie/IEIESPC.2026.15.3.323/fig7.png

Fig. 8. Result graph of performance indicators on the dataset.

../../Resources/ieie/IEIESPC.2026.15.3.323/fig8.png

Fig. 9. Performance comparison of different models.

../../Resources/ieie/IEIESPC.2026.15.3.323/fig9.png

Fig. 10. Changes in learning rate, training loss, and validation loss.

../../Resources/ieie/IEIESPC.2026.15.3.323/fig10.png

Fig. 11. Comparison of recognition performance by dataset and recognition type.

../../Resources/ieie/IEIESPC.2026.15.3.323/fig11.png

Table 1. Comparison of Detection Performance of Different Methods.

Method        Accuracy (%)   Recall (%)   F1 score (%)
DB-Net        96             84           88
Deep-Text     84             81           88
CTPN          93             83           88
Text-FCN      67             75           70
Faster-RCNN   75             71           73
Baseline      81             71           75
EAST          83             78           81
CE-Net        86             79           83
FTPN          69             78           73
FOTS          89             82           85

5. Conclusion

For the extraction and recognition of poster text information, this research uses SS technology, CNN, and an OCR system to construct models and algorithms for character detection, text recognition, and text keyword extraction. In text detection, SS is used with an optimized label production method, and a combined loss function is used to train the model. At 1000 iterations, the training loss is 0.2 and the validation loss is 0.8. On the ICDAR2015 dataset, the model achieves P=85.9%, A=83.1%, R=81.4%, and an F1 score of 81.8%. In text recognition, an improved sequence encoding model is combined with two-dimensional spatial features to establish a multi-scale sequence encoding and decoding text recognition algorithm with attention. The recognition rates for different scenes are 86%, 89%, and 93%, and the recognition rates for different types are 97.7%, 98.5%, 98.9%, and 98.6%, respectively. Finally, based on the OCR system framework deployed across server and client, the text detection and recognition modules were integrated and the keyword extraction results were output to the client, allowing performance testing of the entire system. The results indicate that the SS-based multi-scale sequence encoding detection and recognition model with AM performs well in extracting poster text information. However, the research still lacks large-scale experimental data and multifaceted testing in practical application environments. Sakshi et al. constructed patterns for handwritten symbol recognition to obtain features and improve classification and recognition capabilities [20]. Further research and improvement are therefore needed in subsequent work.

Acknowledgment

The research is supported by: Anhui Province Higher Education Science Research Project (Philosophy and Social Sciences): Research on the Protection and Utilization of Intangible Cultural Heritage in the Huizhou Region from the Perspective of Rural Revitalization: A Case Study of Shexian Woodcarving Inheritance (No.2022AH052033).

Disclosure statement

The author reports there are no competing interests to declare.

References

[1] M. Liao, Z. Zou, Z. Wan, C. Yao, X. Bai, Real-time scene text detection with differentiable binarization and adaptive scale fusion, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 45, No. 1, pp. 919-931, 2022.
[2] G. Wu, Z. Zhang, Y. Xiong, CarveNet: a channel-wise attention-based network for irregular scene text recognition, International Journal on Document Analysis and Recognition, Vol. 25, No. 3, pp. 177-186, 2022.
[3] M. Li, T. Lv, J. Chen, L. Cui, Y. Lu, D. Florencio, C. Zhang, Z. Li, F. Wei, TrOCR: transformer-based optical character recognition with pre-trained models, Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 37, No. 11, pp. 13094-13102, 2023.
[4] T. H. Phan, D. C. Tran, M. F. Hassan, Vietnamese character recognition based on CNN model with reduced character classes, Bulletin of Electrical Engineering and Informatics, Vol. 10, No. 2, pp. 962-969, 2021.
[5] Y. C. Liu, H. Joren, O. Gupta, D. Raviv, MRZ code extraction from visa and passport documents using convolutional neural networks, International Journal on Document Analysis and Recognition, Vol. 25, No. 1, pp. 29-39, 2022.
[6] Y. Liu, C. Shen, L. Jin, H. Tong, P. Chen, C. Liu, H. Chen, ABCNet v2: adaptive Bezier-curve network for real-time end-to-end text spotting, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 44, No. 11, pp. 8048-8064, 2021.
[7] S. Hu, Q. Wang, K. Huang, M. Wen, F. Coenen, Retrieval-based language model adaptation for handwritten Chinese text recognition, International Journal on Document Analysis and Recognition, Vol. 26, No. 2, pp. 109-119, 2023.
[8] T. M. Ghazal, Convolutional neural network based intelligent handwritten document recognition, Computers, Materials & Continua, Vol. 70, No. 3, pp. 4563-4581, 2022.
[9] X. Ma, H. Xu, X. Zhang, H. Wang, An improved deep learning network structure for multitask text implication translation character recognition, Complexity, Vol. 2021, No. 5, pp. 901-911, 2021.
[10] A. Oluwasammi, M. U. Aftab, Z. Qin, S. T. Ngo, T. V. Doan, S. B. Nguyen, S. H. Nguyen, G. H. Nguyen, Features to text: a comprehensive survey of deep learning on semantic segmentation and image captioning, Complexity, Vol. 2021, No. 8, pp. 1-19, 2021.
[11] W. Wang, E. Xie, X. Li, D. Fan, K. Song, L. Ding, T. Lu, P. Luo, L. Shao, PVT v2: improved baselines with pyramid vision transformer, Computational Visual Media, Vol. 8, No. 3, pp. 415-424, 2022.
[12] T. Diwan, G. Anirudh, J. V. Tembhurne, Object detection using YOLO: challenges, architectural successors, datasets and applications, Multimedia Tools and Applications, Vol. 82, No. 6, pp. 9243-9275, 2023.
[13] R. Karthika, L. Parameswaran, A novel convolutional neural network based architecture for object detection and recognition with an application to traffic sign recognition from road scenes, Pattern Recognition and Image Analysis, Vol. 32, No. 2, pp. 351-362, 2022.
[14] S. Biswas, P. Riba, J. Lladós, U. Pal, Beyond document object detection: instance-level segmentation of complex layouts, International Journal on Document Analysis and Recognition, Vol. 24, No. 3, pp. 269-281, 2021.
[15] M. Jia, L. Shen, X. Shen, L. Liao, M. Chen, X. He, Z. Chen, J. Li, MNER-QG: an end-to-end MRC framework for multimodal named entity recognition with query grounding, Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 37, No. 7, pp. 8032-8040, 2023.
[16] Y. Guo, Z. Mustafaoglu, D. Koundal, Spam detection using bidirectional transformers and machine learning classifier algorithms, Journal of Computational and Cognitive Engineering, Vol. 2, No. 1, pp. 5-9, 2023.
[17] Y. Wu, J. Jiang, Z. Huang, Y. Tian, FPANet: feature pyramid aggregation network for real-time semantic segmentation, Applied Intelligence, Vol. 52, No. 3, pp. 3319-3336, 2022.
[18] B. Gao, Y. Pan, C. Li, S. Geng, H. Zhao, Are we hungry for 3D LiDAR data for semantic segmentation? A survey of datasets and methods, IEEE Transactions on Intelligent Transportation Systems, Vol. 23, No. 7, pp. 6063-6081, 2022.
[19] X. Zhao, R. Qin, Q. Zhang, F. Yu, Q. Wang, B. He, DcNet: dilated convolutional neural networks for side-scan sonar image semantic segmentation, Journal of Ocean University of China, Vol. 20, No. 5, pp. 1089-1096, 2021.
[20] Sakshi, V. Kukreja, A retrospective study on handwritten mathematical symbols and expressions: classification and recognition, Engineering Applications of Artificial Intelligence, Vol. 103, 2021.

Xianfeng Zeng

Xianfeng Zeng received his master's degree in Art Design from Wuhan University of Technology in 2015. He is currently an associate professor and deputy dean of the College of Art and Creativity, Anhui University of Applied Technology. He has served as an invited peer reviewer for the university's journal, has published articles in more than 10 professional domestic journals, and holds 12 granted patents. His areas of interest include visual design, intangible cultural heritage preservation, digital media, and related fields.