
  1. (School of Foreign Studies, Xi’an Medical University, Xi’an, 710021, China)



Keywords: Deep belief networks, Support vector machines, Pronunciation classification, Online learning, Error checking models

1. Introduction

With the diversification of teaching forms, Online Learning (OL) has gradually become an important way to acquire knowledge. As an important branch of English learning, online English learning can provide students with more flexible and convenient learning methods. However, this freedom and openness also bring new problems, especially in the assessment and correction of pronunciation accuracy. In traditional Pronunciation Classification (PC) methods, the problems of accuracy and efficiency are particularly prominent. These methods often rely on manual feature extraction, such as acoustic features like the Mel-frequency cepstral coefficient, a process that is complex and sensitive to parameter tuning, resulting in limited classifier performance [1, 2]. In addition, these methods are computationally inefficient when dealing with large amounts of data, making it difficult to provide real-time feedback on learners' pronunciation problems. For example, although the hidden Markov model has achieved some success in speech recognition, its demanding model initialization and parameter estimation requirements often make it difficult to adapt to groups of learners with large individual differences in Pronunciation Error Detection (PED) [3, 4]. In an OL environment, learners need immediate and accurate feedback to improve their pronunciation. Although many online English learning platforms provide pronunciation exercises and error correction functions, these systems often give only simple error prompts and cannot identify precise error types [5, 6]. Therefore, how to optimize the PED function of online English learning platforms and improve the pronunciation accuracy of learners has become an urgent problem to be solved.
Deep Belief Network (DBN), as an unsupervised learning method, can effectively extract high-level features from data, while Support Vector Machine (SVM), as a supervised learning method, can classify data accurately; their combination can have a substantial impact on students' English learning and even their oral English communication. In order to improve the accuracy and efficiency of PC error detection, this study proposes an innovative PC error detection model. It aims to solve the challenge of pronunciation accuracy assessment in online English learning through the organic integration of DBN and SVM. DBN is responsible for extracting high-level features from complex pronunciation data, while SVM uses these features for accurate classification and error detection. This modular approach not only improves the accuracy of PC, but also optimizes the efficiency of the error detection process, providing an effective pronunciation improvement tool for online English learners.

The study's first section provides an overview of the state of research on online ELL, DBN networks, and SVM-based PC. The second section first improves the DBN network structure based on the Restricted Boltzmann Machine (RBM), then uses the improved DBN network to categorize pronunciation types, and finally introduces SVM, integrating DBN and SVM to construct the PED classification model. In the third section, application analysis and simulation tests are used to confirm the effectiveness of the Pronunciation Categorization Error Checking Model (PCECM). The experimental results are compiled in the fourth section, which also examines the benefits and drawbacks of the study's methodology.

2. Related Works

With the diversification of English teaching methods, online English Language Learning (ELL) has gained widespread popularity. In ELL, oral pronunciation accuracy and fluency are crucial indicators of learners' abilities. However, traditional pronunciation assessment methods that rely on manual scoring are time-consuming, costly, and not scalable for large-scale ELL. As a result, automated pronunciation evaluation and error detection systems have received considerable attention, and numerous experts have conducted research in this area. For example, Okyar H analyzed the effectiveness of online ELL during the COVID-19 pandemic and found that many students felt loneliness and anxiety, which reduced their learning effectiveness [7]. Cahyaningsih P D used Symmetric Multi-Processing (SMP) to study online teaching strategies for English teachers and concluded that teachers need to apply critical thinking to improve teaching effectiveness [8]. Alzamil A conducted a longitudinal study comparing OL and traditional classroom learning and found that classroom learning produced better results [9]. Wei F proposed a feature extraction-based anomaly detection method for online ELL, using frequent pattern mining and SVM for classification [10]. Jiao F et al. used SVM for error detection in oral practice and constructed a speech recognition module using MATLAB, with positive results [11]. Lou Z et al. proposed a labeling rule based on speech recognition to evaluate the accuracy of spoken pronunciation, with an accuracy rate of up to 86.67% [12]. Paul B et al. used Mel-frequency cepstral coefficients (MFCC) and Deep Neural Networks (DNN) to improve spoken PC, achieving an average accuracy of 89.21% for vowels and 88.56% for diphthongs [13]. Hai Y combined SVM with prior adaptation algorithms to construct a syllable and rhyme feature recognition model, which showed better recognition performance than traditional models [14].

In summary, it is of great significance to study spoken PC error checking for online ELL using DBN and SVM. Summarizing the research of related experts and scholars shows that spoken PC error checking has an important impact on spoken English learning; if the performance of PC error checking can be significantly improved, the learning effect of spoken English can be enhanced to a certain extent. Therefore, the study integrates DBN with SVM to construct a PCECM that can optimize online ELL and provide learners with a more personalized, efficient, and interesting learning experience.

3. DBN and SVM Based Pronunciation Error Detection Model in Learning

To meet the requirement of timely correction of pronunciation in the online ELL process, the study uses DBN and SVM to classify and check the pronunciation errors of online ELL speech training as a way to improve the efficiency of the OL process. It can also provide targeted pronunciation instruction in a timely, efficient, and convenient manner, and is not subject to the time and space limitations of traditional face-to-face instruction. The effect of online ELL is optimized by integrating DBN and SVM.

3.1. Model Design for Pronunciation Classification Based on Improved DBN Network

To improve the recognition of pronunciation errors in online ELL, it is necessary to categorize and identify the types of online ELL pronunciation so as to improve overall recognition and error detection. Among deep neural networks, DBN can use a multilayer neural network with feature detection for hierarchical learning. Hierarchical learning through detection neurons obtains the corresponding features, which can then be backpropagated during learning, achieving global optimization. Therefore, the study utilizes DBN to model articulatory speech in terms of spectrogram and acoustic features [15-17]. The network is a deep neural network built on statistical mechanics, which describes the speech spectrogram and acoustic features through an energy function and a probability function. The energy function represents the energy values of different PC error detection cases, and the optimal PC result can be determined by comparing the energy values across cases. The probability function, in turn, calculates the probability values of different PC results; by comparing these probabilities, the final PC result can be determined [18, 19]. Multiple Restricted Boltzmann Machines (RBMs) are stacked to construct the DBN network. In an RBM, $v$ denotes the input speech spectrogram, $h$ denotes the hidden representation, and $n$ denotes the hidden layer size. Given the states of the hidden layer units, the visible layer units can be activated independently. The energy of an RBM configuration is defined by Eq. (1).

(1)
$E(v,h;\theta) = -\sum_i a_iv_i - \sum_j b_jh_j - \sum_{i,j} v_iW_{ij}h_j.$

In Eq. (1), $\theta$ denotes the set of all parameters in the RBM. $a_i$ and $b_j$ denote the biases of visible unit $v_i$ and hidden unit $h_j$, respectively. $W_{ij}$ denotes the connection weight between visible unit $v_i$ and hidden unit $h_j$. When the energy-defining parameters are determined, the corresponding joint state distribution can be expressed by Eq. (2).

(2)
$\left\{\begin{aligned} P(v,h;\theta) = \frac{e^{-E(v,h;\theta)}}{Z}, \\ Z(\theta) = \sum_{v,h} e^{-E(v,h;\theta)}. \end{aligned}\right.$

In Eq. (2), $Z(\theta)$ denotes the partition function. Since the hidden units of the RBM are conditionally independent given the visible units, the activation probability of a hidden unit in the online ELL speech classification process can be described by Eq. (3).
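Eqs. (1) and (2) can be verified numerically on a toy RBM small enough to enumerate every configuration; the parameter values below are arbitrary illustrations, not trained speech features.

```python
import itertools
import numpy as np

def energy(v, h, a, b, W):
    """E(v,h;theta) = -sum_i a_i v_i - sum_j b_j h_j - sum_ij v_i W_ij h_j."""
    return -a @ v - b @ h - v @ W @ h

# Toy RBM: 2 visible and 2 hidden binary units (made-up parameters).
a = np.array([0.1, -0.2])
b = np.array([0.0, 0.3])
W = np.array([[0.5, -0.5],
              [0.2, 0.1]])

states = [np.array(s, dtype=float) for s in itertools.product([0, 1], repeat=2)]
# Partition function Z(theta): sum of exp(-E) over all (v, h) configurations.
Z = sum(np.exp(-energy(v, h, a, b, W)) for v in states for h in states)

# The joint probabilities of Eq. (2) must sum to one over all states.
total = sum(np.exp(-energy(v, h, a, b, W)) / Z for v in states for h in states)
print(round(total, 6))  # -> 1.0
```

Exhaustive enumeration is only feasible for toy sizes; for realistic layer widths the partition function is intractable, which is why sampling-based training is used below.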

(3)
$P_1(h_j = 1 \mid v,\theta) = \mathrm{sigmoid}\left(b_j + \sum_i v_iW_{ij}\right).$

Conversely, once the hidden layer nodes have been sampled according to the activation probabilities of Eq. (3), the activation probability of each visible unit given the hidden layer is obtained from Eq. (4).

(4)
$P_2(v_i = 1 \mid h,\theta) = \mathrm{sigmoid}\left(a_i + \sum_j h_jW_{ij}\right).$

When performing pronunciation training for online ELL, the study utilizes the RBM model to capture deep features in the speech data. The activation states of each neuron in the RBM can be determined by computing the output probability of the hidden layer; these states are essentially a coded representation of the input speech data. To maximize the model's performance, the study trains the RBM on feature data so that it learns the statistical regularities of the speech data. The likelihood function plays a very important role in this process: it describes the probability that the model produces the observed data and is the basis for evaluating the merits of the model parameters. By maximizing the likelihood function, the optimal set of parameters can be found. With the optimal parameter set, the model fits the training data more closely, enabling an accurate fit to the online ELL pronunciation data [20-22]. To increase the accuracy of data classification, the parameters of the English speech classification training procedure are updated after the training fit is finished. If the number of samples in the English PC training is $T$, the parameter training criterion can be expressed by Eq. (5).

(5)
$\theta^* = \arg\max_\theta \sum_{i=1}^T \log P(v_i;\theta).$

Fig. 1. Flowchart of DBN online English learning pronunciation classification model based on RBM improvement.

../../Resources/ieie/IEIESPC.2026.15.1.42/fig1.png

Based on this training criterion, the study uses the stochastic gradient method to optimize the likelihood function on the English pronunciation training data, and the corresponding parameter update rule can be expressed by Eq. (6).

(6)
$\left\{\begin{aligned} \Delta a_i = \varepsilon((v_i)_{data} - (v_i)_{recon}), \\ \Delta b_j = \varepsilon((h_j)_{data} - (h_j)_{recon}), \\ \Delta W_{ij} = \varepsilon((v_ih_j)_{data} - (v_ih_j)_{recon}). \end{aligned}\right.$

In Eq. (6), $\varepsilon$ denotes the learning rate for the pronunciation feature updates. Taken together, training the online ELL speech PC in a DBN network first requires initializing the RBM; initialization provides the basis for faster sampling. The update for each parameter is then computed from the sampled data and used to refresh the parameters. Finally, the training cycle is repeated until a model usable for classification is obtained.
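The training loop just described can be sketched as contrastive divergence with a single sampling step (CD-1), which is the standard way the Eq. (6) updates are estimated. This is a minimal illustration on one made-up binarized feature vector, not the study's actual training code.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, a, b, W, eps=0.1):
    """One CD-1 step implementing the Eq. (6) parameter updates."""
    ph0 = sigmoid(b + v0 @ W)                    # Eq. (3): P(h=1 | v)
    h0 = (rng.random(ph0.shape) < ph0) * 1.0     # sample the hidden units
    v1 = sigmoid(a + h0 @ W.T)                   # Eq. (4): reconstruction
    ph1 = sigmoid(b + v1 @ W)
    a += eps * (v0 - v1)                         # delta a: data - recon
    b += eps * (ph0 - ph1)                       # delta b
    W += eps * (np.outer(v0, ph0) - np.outer(v1, ph1))  # delta W
    return a, b, W

v = np.array([1.0, 0.0, 1.0, 1.0])   # one binarized feature vector (made up)
a, b, W = np.zeros(4), np.zeros(3), rng.normal(0.0, 0.01, (4, 3))
for _ in range(200):
    a, b, W = cd1_update(v, a, b, W)

recon = sigmoid(a + sigmoid(b + v @ W) @ W.T)  # mean-field reconstruction
print(recon.round(2))                          # close to the training vector
```

After a few hundred updates the reconstruction probabilities approach the training vector, which is exactly the "repeat the training cycle" step described above.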

Since the feature data obtained in the English PC recognition process are real-valued, it is insufficient to use only a single RBM for the bottom layer of the DBN. Using a single RBM to construct the lower layer of the DBN network reduces the network's performance and effectiveness, because a single RBM can only capture local features, while DBN networks need to capture global features for better feature learning and classification [23, 24]. Therefore, to increase the network's efficiency and performance, the study builds the DBN network using multi-layer RBMs. Fig. 1 displays the flowchart of the study's enhanced DBN online ELL PC model, which is based on RBM.

Combined with Fig. 1, it can be noted that when using the improved DBN to classify pronunciation, the data is passed from top to bottom through the RBM in the top layer. By utilizing the top RBM in the DBN network to alternate sampling with the RBMs of other layers, the balance of the pronunciation data can be significantly improved. This not only improves the robustness of the model, but also improves the overall running speed of the model.
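Once the RBMs are stacked, extracting the top-level features reduces to repeated sigmoid layers applied bottom-up. The layer sizes and random weights below are placeholders for trained parameters, purely to show the propagation structure.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical layer sizes: 40 spectrogram features -> 25 -> 10 top features.
sizes = [40, 25, 10]
rng = np.random.default_rng(1)
weights = [rng.normal(0.0, 0.1, (m, n)) for m, n in zip(sizes, sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def dbn_features(v):
    """Bottom-up pass: each layer's activations feed the RBM above it."""
    for W, b in zip(weights, biases):
        v = sigmoid(b + v @ W)
    return v

x = rng.random(40)            # one frame of made-up acoustic features
print(dbn_features(x).shape)  # -> (10,)
```

In greedy layer-wise training, each RBM in this stack would be trained (as in the CD sketch earlier) on the activations produced by the layer below it.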

3.2. Construction of Pronunciation Categorization Error Checking Model Based on DBN Network and SVM

The analysis of the improved DBN network model reveals that the model can effectively classify pronunciation; the next step is an error detection study on the classification results. In online ELL PC error detection, PED uses the phonemes and observation vectors of the pronunciation to decide whether the detected pronunciation is normal. The detection process can be formally defined by Eq. (7).

(7)
$\text{given } q \text{ and } o_1^T, \text{ decide} \in \left\{\begin{aligned} 1, \\ 0. \end{aligned}\right.$

In Eq. (7), $q$ denotes an isolated articulatory phoneme and $o_1^T$ the observation vector; 1 denotes correct pronunciation and 0 denotes mispronunciation. PC error checking thus decides whether the articulation described by $o_1^T$ and $q$ is correct. The study then formalizes the PED pattern classification process, which can be expressed by Eq. (8).

(8)
$d(xp) = \max_{d(\cdot)\in\vartheta} P\left(\omega_{index(d(xp))} \mid xp\right).$

Combined with the analysis of Eq. (8), it can be noted that the acoustic layer needs to be modeled when modeling the correct and incorrect articulation of articulatory phonemes, which will increase the labeling workload. To reduce the difficulty of building the error detection model, the study reduces the spatial complexity of the features by transforming the dimensionality of the collected phoneme features to meet the requirements of error detection. The process of reducing the spatial complexity of the features can be expressed in Eq. (9).

(9)
$xp = (o_1^T) \Rightarrow yp = f(xp).$

In Eq. (9), $xp$ denotes the model's acoustic computational parameters and $yp$ denotes the modeled features. After completing feature training with the DBN model, the study introduces SVM into the model; that is, SVM is used as the final PED classifier on top of the DBN. PED using SVM models each type of pronunciation error one by one; after modeling is completed, the posterior probabilities (PostPs) of all classification models are calculated, and the error type with the largest PostP value is taken as the representative model of the current error [25, 26]. The study first obtains the score domain by transforming the feature domain through the classification model, and then trains the SVM in the score domain using manually labeled pronunciation error types. Fig. 2 is the schematic diagram of SVM-based score region division.

Fig. 2. Schematic diagram of score region division based on SVM.

../../Resources/ieie/IEIESPC.2026.15.1.42/fig2.png
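The one-model-per-error-type decision rule can be illustrated with a toy score-domain model. Here a 1-D Gaussian per error type stands in for the trained per-type SVMs; the error-type names, means, and variances are invented for illustration.

```python
import math

# Hypothetical per-error-type score models: (mean, variance) of the score
# each model assigns. These stand in for trained per-type classifiers.
models = {"substitution": (0.2, 0.05),
          "omission":     (0.5, 0.05),
          "insertion":    (0.8, 0.05)}
priors = {k: 1.0 / len(models) for k in models}

def posteriors(score):
    """P(error type | score) by Bayes; the largest posterior wins."""
    lik = {k: math.exp(-(score - m) ** 2 / (2 * var))
           for k, (m, var) in models.items()}
    z = sum(priors[k] * lik[k] for k in models)
    return {k: priors[k] * lik[k] / z for k in models}

p = posteriors(0.45)           # a score nearest the "omission" model
print(max(p, key=p.get))       # -> omission
```

The `max` over posteriors mirrors the text's rule of taking the error type with the largest PostP as the representative model of the current error.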

Based on the DBN classification, when SVM is used for classification error checking, the study defines classification error checking to be performed over all classifiers as a way to improve its coverage. The distance from a point to the classification surface must be precisely described in the parameter solving procedure [27, 28]; this distance can be expressed by Eq. (10), and the margin derived from it yields the classification error detection parameter.

(10)
$d\left(x_n, y(x) = 0\right) = \frac{t_ny(x_n)}{\|w\|}.$

In Eq. (10), $t_n$ denotes the category label used in the definition process, $x_n$ denotes the training samples, and $w$ denotes the trained classifier parameters. To make the point-to-surface distance a valid judgment criterion, a suitable scaling factor must also be selected during classification error checking; this scaling can be expressed by Eq. (11).

(11)
$t_{\hat{n}}(y(x_{\hat{n}}) + b) = 1.$

In Eq. (11), $x_{\hat{n}}$ is the sample nearest the classification surface and $t_{\hat{n}}$ is its category label. By determining the distance between points and the classification surface during classification error detection, correct and incorrect pronunciations can be determined more accurately. A schematic diagram of margin-based SVM classification error detection is shown in Fig. 3.

Fig. 3. Schematic diagram of SVM classification error detection based on margin.

../../Resources/ieie/IEIESPC.2026.15.1.42/fig3.png
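The point-to-surface distance of Eq. (10) is easy to compute directly. The weight vector, bias, and sample below are made-up numbers chosen so the arithmetic is easy to follow; the linear form $y(x) = w \cdot x + b$ is assumed.

```python
import numpy as np

def margin_distance(x_n, t_n, w, b):
    """Eq. (10): signed distance t_n * y(x_n) / ||w|| to the surface y(x)=0."""
    return t_n * (w @ x_n + b) / np.linalg.norm(w)

w = np.array([3.0, 4.0])               # ||w|| = 5
b = -1.0
x = np.array([2.0, 1.0])               # y(x) = 3*2 + 4*1 - 1 = 9
print(margin_distance(x, +1.0, w, b))  # -> 1.8
```

A positive value means the sample lies on its label's side of the surface; the smallest such value over the training set is the margin that the SVM maximizes.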

After completing the margin-maximizing classification task, the study utilizes the margin to transform the score-domain features so that the set of samples requiring PED can be accurately detected. In the process of classification error detection using the DBN-SVM model, it is first necessary to extract the acoustic feature values corresponding to each phoneme from the given articulation data; these feature values accurately reflect key pronunciation information such as sound quality, duration, and pitch. These features are then detected with the trained DBN-SVM model and the corresponding score-domain features are calculated. Lastly, manual labeling is combined with classification error detection to ensure its correctness, which involves carefully evaluating and validating each sample's labeling results. This not only ensures the accuracy of the training data, but also provides data support for subsequently forming an effective error detection classifier [29, 30]. The training data can then be used to create an efficient error detection classifier, as shown in the training flowchart of the PED classifier in Fig. 4.

Fig. 4. Training flowchart of pronunciation error detection classifier.

../../Resources/ieie/IEIESPC.2026.15.1.42/fig4.png

Analysis of the PED classifier shows that false alarms and missed detections in the PC error detection process increase the number of apparent error types. To solve this problem, the study adds a heuristic strategy to the DBN-SVM model, which uses a heuristic decision criterion to calculate the posterior probability; the calculation is given by Eq. (12).

(12)
$P(A \mid B) = P(B \mid A) \cdot \frac{P(A)}{P(B)}.$

In Eq. (12), $P(A \mid B)$ is the posterior probability (PostP) of error type $A$ given the detected error data $B$; $P(B \mid A)$ is the probability of observing $B$ given error $A$; $P(A)$ is the prior probability (PriP) of error $A$; and $P(B)$ is the prior probability of the detected classification error $B$. After addressing false alarms and missed detections with this heuristic strategy, the study redefines the discrimination function for error detection and classification; the redefined function can be expressed by Eq. (13).
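Since Eq. (12) is Bayes' rule, the heuristic posterior can be checked with a one-line computation; the probability values are made up purely for illustration.

```python
# Eq. (12) as a numeric check with made-up probabilities.
p_b_given_a = 0.9   # P(B|A): detector reports B when error A is truly present
p_a = 0.1           # prior probability of error type A
p_b = 0.3           # prior probability of the detector reporting B
p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 2))  # -> 0.3
```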

(13)
$d(x) = \left\{\begin{aligned} \omega_1, &\quad \text{discriminant}(x) > 0\ (Y), \\ \omega_0, &\quad \text{otherwise}\ (N). \end{aligned}\right.$

In Eq. (13), $Y$ and $N$ indicate whether the discriminant condition holds, assigning the sample $x$ to class $\omega_1$ or $\omega_0$, respectively. This discrimination function makes error checking feasible under different false-alarm and miss conditions, thus improving the reliability of error checking. Fig. 5 displays the PCECM flowchart, which is based on DBN-SVM.

Fig. 5. Flowchart of pronunciation classification error detection model based on DBN-SVM.

../../Resources/ieie/IEIESPC.2026.15.1.42/fig5.png

Combined with Fig. 5, the process of error detection for online ELL PC using the model is as follows. First, data acquisition: data are obtained from the online ELL pronunciation library and the speech is aligned with sentences, words, and phonemes to obtain the PED classification dataset. Second, data preprocessing: the data are normalized so that the values used for error detection fall within a fixed range and variability between data sets is minimized. Third, the DBN structure is designed using RBMs, which determines the hidden layers, visible layer, processing parameters, and other settings of the DBN. Next, the RBM parameters are trained, which helps improve the accuracy and performance of the model: the network's RBMs are trained one layer at a time to extract deep data features, continuing until every RBM in the DBN network has completed its training. Finally, the model's performance is assessed by evaluating the metrics on the test results.
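The normalization in the data preprocessing step above can be sketched as per-feature min-max scaling; the feature matrix here is invented to show features on very different scales.

```python
import numpy as np

def min_max_normalize(X):
    """Scale each feature column into [0, 1]; constant columns map to 0."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / np.where(hi > lo, hi - lo, 1.0)

# Made-up matrix: rows are samples, columns are features with very
# different scales (e.g. pitch in Hz vs. a unit-free score).
X = np.array([[100.0, 0.2],
              [300.0, 0.8],
              [200.0, 0.5]])
print(min_max_normalize(X))
```

After scaling, both columns lie in [0, 1], so no single feature dominates the DBN input or the SVM distance computations by virtue of its units alone.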

4. Performance Analysis of Pronunciation Categorization Error Checking Model with DBN-SVM Fusion

To validate the performance of PCECM constructed by fusing DBN-SVM, the study uses the data processing performance and application performance of the model as validation objects for performance analysis. Meanwhile, the model training loss value, Bilingual Evaluation Understudy (BLEU) score, accuracy, F1 value, capture accuracy, and error detection score are used as validation metrics for model specific performance analysis.
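For reference, the F1 value used throughout this section is the harmonic mean of precision and recall; the confusion counts below are made up purely to show the arithmetic.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 83 true positives, 12 false alarms, 17 misses.
p, r, f1 = precision_recall_f1(tp=83, fp=12, fn=17)
print(round(f1, 2))  # -> 0.85
```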

4.1. Data Processing Performance Analysis of Pronunciation Categorization Error Checking Model for DBN-SVM

In order to fully evaluate the performance of the fused DBN-SVM PC error detection model, Hidden Markov Models (HMM) and Mel-frequency Cepstral Coefficients-Support Vector Machines (MFCC-SVM) are selected as comparison baselines. These models are chosen because of their wide application and acceptance in speech recognition and PC. HMM is widely used to model speech signals because of its advantages in processing time-series data, although it has limitations in dealing with individual pronunciation differences and dynamic environmental changes. MFCC-SVM combines the generality of MFCC features with the classification capabilities of SVM, although MFCC features may not fully capture all complex pronunciation characteristics, and SVM is sensitive to feature selection and preprocessing steps. The study carefully controlled the experimental conditions and used Matlab 7 simulation software as the test platform to ensure the fairness and accuracy of the comparison. To guarantee the correctness and integrity of the audio stream, the number of sample points was fixed at 120, and 100 kHz was chosen as the pronunciation feature extraction frequency to capture minute variations in the pronunciation process. To guarantee the validity of the simulation results, a total of 1026 audio segments of online ELL pronunciation were gathered; after screening, 983 segments met the experimental requirements and were used to construct the dataset for evaluating the PCECM of DBN-SVM.

To validate the data processing ability of the DBN-SVM model in the PED process, the study uses the training function loss value and BLEU score as validation metrics, and uses HMM, MFCC-SVM, and DBN-SVM for comparison. Fig. 6 displays the findings of the comparison between the three models’ BLEU scores and training loss values.

Fig. 6. Comparison results of training loss values and BLEU scores for three models.

../../Resources/ieie/IEIESPC.2026.15.1.42/fig6.png

In Fig. 6(a), the loss values of the three models decrease with oscillations as training progresses. The loss value of DBN-SVM stabilizes at 869 iterations and no longer changes significantly, while the loss values of MFCC-SVM and HMM stabilize at 1357 and 1403 iterations, respectively. In Fig. 6(b), all three models score well on BLEU: DBN-SVM reaches 0.85, while MFCC-SVM and HMM reach 0.81 and 0.79, respectively. This shows that the DBN-SVM model constructed in the study is more reliable in the data processing of PC error checking and holds a degree of superiority. Regarding the data processing performance of the DBN-SVM model, the penalty parameter regulates model complexity, prevents overfitting, and improves accuracy and generalization, while the vector base parameter influences the model's stability and convergence rate, helping it find the optimal solution more quickly. Fig. 7 illustrates the DBN-SVM model's data processing accuracy under different vector base and penalty parameters.

Fig. 7. Data processing accuracy of DBN-SVM model under different influence of basis and penalty parameters.

../../Resources/ieie/IEIESPC.2026.15.1.42/fig7.png

The comparison in Fig. 7 reveals that data processing accuracy first rises and then falls as the vector base parameter and penalty parameter increase. The maximum accuracy occurs when the vector base parameter is 0.3 and the penalty parameter is 18. This indicates that by adjusting these two parameters, the model can be better adapted to the actual situation, improving the effect and performance of the PCECM. The study next uses the F1 value and error detection accuracy rate as validation indices to examine the models' data processing performance; the comparison of the three models' F1 values and detection accuracy rates is displayed in Fig. 8.
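The parameter sweep behind Fig. 7 amounts to a grid search over the penalty and vector base parameters. The accuracy surface below is invented (peaked at the reported optimum of 18 and 0.3), since reproducing the real surface would require retraining the model at every grid point.

```python
import itertools

def evaluate(c, g):
    """Stand-in accuracy surface; in practice this would retrain the
    DBN-SVM with penalty c and vector base parameter g and return the
    validation accuracy."""
    return 1.0 - abs(c - 18) / 50.0 - abs(g - 0.3)

penalties = [6, 12, 18, 24, 30]       # candidate penalty parameters
bases = [0.1, 0.2, 0.3, 0.4, 0.5]     # candidate vector base parameters
best = max(itertools.product(penalties, bases), key=lambda cg: evaluate(*cg))
print(best)  # -> (18, 0.3)
```

Exhaustive grid search is simple and reproducible, though for larger grids a coarse-to-fine sweep or random search would cut the retraining cost considerably.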

Fig. 8. Comparison results of detection accuracy and F1 value of three models.

../../Resources/ieie/IEIESPC.2026.15.1.42/fig8.png

In Fig. 8(a), there is a clear gap between the error detection abilities of the three models on pronunciation data. DBN-SVM achieves the highest error detection accuracy, followed by MFCC-SVM and HMM, with accuracies of 91.69%, 85.19%, and 80.66%, respectively. As illustrated in Fig. 8(b), the F1 values of DBN-SVM, MFCC-SVM, and HMM in data error detection are 0.83, 0.79, and 0.72, respectively, demonstrating the data processing ability and robustness of the model constructed in the study. To further validate the DBN-SVM model's performance, the study employs PED capture accuracy as a validation metric; Fig. 9 displays the comparison of the three models' PED capture accuracy and capture error.

Fig. 9. Comparison results of pronunciation error detection capture accuracy and capture error of three models.

../../Resources/ieie/IEIESPC.2026.15.1.42/fig9.png

In Fig. 9(a), in the comparison of data capture accuracy for PC error checking, all three models eventually achieve 100% capture accuracy. However, the DBN-SVM model reaches 100% capture accuracy after 206 iterations, the MFCC-SVM after 273 iterations, and the HMM only after 350 iterations. In Fig. 9(b), the capture errors of DBN-SVM, MFCC-SVM, and HMM in capturing PED errors are 7.31%, 10.93%, and 14.37%, respectively. This indicates that the model constructed in the study can accomplish data acquisition in the PC error checking process with fewer iterations, and its acquisition accuracy fully meets everyday requirements.

4.2. Performance Analysis of Pronunciation Categorization Error Checking Model Application for DBN-SVM

To validate the application performance of the PCECM with DBN-SVM constructed by the study, the study validates the error detection performance for monosyllables, disyllables and polysyllables in English pronunciation. The comparison results of the three types of syllables’ error detection accuracy and error rate in the DBN-SVM model are displayed in Fig. 10.

Fig. 10. Comparison results of error detection accuracy and error rate of three syllables in the DBN-SVM model.

../../Resources/ieie/IEIESPC.2026.15.1.42/fig10.png

In Fig. 10(a), the error detection efficiency for polysyllabic words is lower than that for disyllabic and monosyllabic words in the PC error detection process. This is because polysyllabic words tend to contain more syllables and phonemes, leading to more confusion and errors in pronunciation recognition and classification. In addition, polysyllabic words may involve more phonetic variations and linking rules, making error detection more difficult. In contrast, disyllabic and monosyllabic words tend to be simpler and more straightforward, with more fixed and stable pronunciation rules, making high PC error detection efficiency easier to achieve. The accuracy rates for monosyllabic, disyllabic, and polysyllabic words are 95.07%, 93.72%, and 88.69%, respectively. In Fig. 10(b), in the comparison of error detection error rates for the three syllable types, monosyllables have the lowest rate; the error rates for monosyllables, disyllables, and polysyllables are 4.39%, 6.13%, and 8.92%, respectively. This indicates that the model constructed in the study can effectively accomplish error detection across syllable types. Further, the more advanced Convolutional Neural Networks (CNN) and Transformer models are added for comparison, and the results are shown in Table 1.

Table 1. Comparative experimental results.

Model Accuracy (%) Recall (%) F1 value Training time (s) Testing time (s) Number of parameters
DBN-SVM 94.12 93.56 0.92 3500 0.48 200K
HMM 84.21 83.11 0.78 2900 0.27 100K
MFCC-SVM 88.45 87.67 0.85 3200 0.39 150K
CNN 91.89 90.12 0.89 7800 0.67 300K
Transformer 92.56 91.34 0.90 11500 0.95 1000K

In Table 1, the DBN-SVM model shows excellent performance in all key indicators, with an accuracy of 94.12%, a recall of 93.56%, and an F1 value of 0.92, all better than the other comparison models, including the traditional HMM and MFCC-SVM as well as the advanced CNN and Transformer models. The DBN-SVM model not only performs well in classification accuracy, but also has clear advantages in training efficiency, with a training time of 3500 s and a test time of 0.48 s, far lower than the CNN and Transformer models, making DBN-SVM more attractive in OL environments that require fast response. In addition, the DBN-SVM model has 200K parameters; compared with the heavily parameterized CNN and Transformer models, its lower complexity helps reduce the risk of overfitting and improves generalization. These results show that the DBN-SVM model provides an excellent balance between accuracy and efficiency in the PC error detection task and is particularly suitable for online English learning platforms.

The study integrates many pronunciation variables for model performance verification in order to further confirm the model’s performance. The comparison results of pronunciation accuracy for various feature combinations are displayed in Table 2.

Table 2. Comparison results of pronunciation accuracy for different feature combinations.

Feature combination | Training set accuracy (%) | Test set accuracy (%)
Basic features | 94.36 | 93.49
Word clustering | 95.09 | 94.29
Statistical information of unlabeled corpus | 96.17 | 95.06
Word pronunciation | 96.05 | 95.18
Word clustering + word pronunciation | 96.15 | 93.59
Word pronunciation + unlabeled corpus statistics | 96.28 | 93.48
Word clustering + word pronunciation + unlabeled corpus statistics | 96.89 | 94.73

As Table 2 shows, feature combination can significantly improve PC error detection accuracy. On the training set, the basic pronunciation features alone yield the lowest accuracy (94.36%), while word clustering + word pronunciation + unlabeled corpus statistics yields the highest (96.89%). On the test set, the accuracies of all feature combinations drop somewhat relative to the training set: word pronunciation + unlabeled corpus statistics gives the lowest error detection accuracy (93.48%), and word pronunciation alone gives the highest (95.18%). This suggests that the PCECM constructed in the study can greatly improve lexical labeling ability in the PC error detection process. Improved lexical standardization not only enhances the reliability of PED during online ELL but also provides a sound basis for pronunciation correction. To validate the effectiveness of the DBN-SVM model in online ELL PC error detection, the study takes the error detection of speech speed, intonation, rhythm, and emotion as indicators and scores the pronunciation of ten different OL students, reporting the final scores in Table 3.
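The feature-combination experiments in Table 2 amount to concatenating feature groups before training the classifier. A minimal sketch of that procedure follows, with synthetic stand-ins for the basic, word-clustering, and unlabeled-corpus feature groups; the group dimensions, data, and labels are hypothetical.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 300
basic = rng.normal(size=(n, 13))    # hypothetical basic acoustic features
cluster = rng.normal(size=(n, 8))   # hypothetical word-clustering features
corpus = rng.normal(size=(n, 5))    # hypothetical unlabeled-corpus statistics
y = (basic[:, 0] > 0).astype(int)   # synthetic labels

def evaluate(*groups):
    """Concatenate feature groups column-wise and report train/test accuracy."""
    X = np.hstack(groups)
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)
    clf = SVC().fit(Xtr, ytr)
    return clf.score(Xtr, ytr), clf.score(Xte, yte)

for name, groups in {
    "basic": (basic,),
    "basic + cluster": (basic, cluster),
    "basic + cluster + corpus": (basic, cluster, corpus),
}.items():
    tr, te = evaluate(*groups)
    print(f"{name}: train={tr:.3f} test={te:.3f}")
```

Comparing the train and test columns per combination is exactly how Table 2 reveals that the richest combination overfits slightly (highest training accuracy but not the highest test accuracy).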

Table 3. Comparison results of scores for different error detection scoring indicators.

Voice sequence number | Speech speed error detection score | Pronunciation error detection score | Rhythm error detection score | Emotional score | Final score
1 | 7.3 | 8.6 | 8.3 | 6.8 | 7.9
2 | 7.9 | 8.4 | 7.9 | 7.1 | 7.8
3 | 8.1 | 8.2 | 8.1 | 6.8 | 8.0
4 | 7.6 | 8.5 | 7.6 | 6.6 | 7.6
5 | 6.2 | 8.1 | 7.7 | 7.0 | 7.9
6 | 7.4 | 7.9 | 7.6 | 6.1 | 7.8
7 | 6.9 | 8.2 | 7.9 | 6.5 | 8.0
8 | 8.1 | 8.1 | 8.0 | 6.6 | 7.9
9 | 7.6 | 8.6 | 7.6 | 6.8 | 7.7
10 | 7.8 | 8.3 | 7.9 | 7.1 | 7.6

According to the analysis of Table 3, the speech speed error detection score has a maximum of 8.1 and a minimum of 6.2, the pronunciation (pitch) error detection score has a maximum of 8.6 and a minimum of 7.9, and the pronunciation emotion score has a maximum of 7.1 and a minimum of 6.1. Validation analysis of the samples shows that the voice samples perform well in speech speed, pitch, rhythm, and emotion, although emotional expression and rhythm control may need further improvement. This indicates that the model can effectively evaluate the pronunciation of online ELLs and provide corresponding optimization directions, offering more professional technical support for online ELLs.
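A final rating of the kind shown in Table 3 can be produced by a weighted aggregate of the four sub-scores. The sketch below is purely illustrative: the paper does not specify its aggregation rule, so the equal weights here are a hypothetical placeholder, not the model's actual formula.

```python
def final_score(speed, pitch, rhythm, emotion,
                weights=(0.25, 0.25, 0.25, 0.25)):
    """Weighted aggregate of the four sub-scores (weights are hypothetical)."""
    scores = (speed, pitch, rhythm, emotion)
    return round(sum(w * s for w, s in zip(weights, scores)), 1)

# Example: aggregate the sub-scores of voice sample 1 from Table 3.
print(final_score(7.3, 8.6, 8.3, 6.8))
```

In practice the weights would be tuned (or learned) so that the aggregate tracks expert ratings, which is why the published final scores need not equal a simple average of the four columns.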

5. Conclusion

Aiming at the poor classification accuracy and low computational efficiency of PC methods in traditional ELL, the study organically integrates DBN and SVM to construct a PCECM and conducts an in-depth study of its application in online ELL. The DBN-SVM model accurately detected errors in monosyllabic, disyllabic, and polysyllabic online ELL pronunciation, with accuracy rates of 95.07%, 93.72%, and 88.69%, respectively. Additionally, the model had an average scoring speed of 46.3 s and an accuracy rate of 92.08% in PC error detection. The DBN-SVM model thus achieved high accuracy and efficiency in PC and error detection tasks, with a high level of error detection across multiple feature combinations: the accuracy for word clustering + word pronunciation + unlabeled corpus statistics reached 96.89%, and the highest pitch error detection score was 8.6. This indicates that the model can identify and correct students' pronunciation errors more accurately. Applying the PCECM constructed in this study to online ELL could effectively optimize the learning process and help students master English pronunciation, improving their efficiency and performance. Despite these results, the study still has shortcomings: the influence of environmental factors such as background noise on PED received little attention. Future work will take the handling of such factors, for example noise reduction, as a research direction to improve the overall effect of the error detection model and the reliability of the PCECM.

Acknowledgements

Disclosure of potential conflicts of interest

The author declares that there is no conflict of interest.

Research involving Human Participants and/or Animals

None.

Informed consent

The author agreed that the paper could be published.

References

[1] Inayati N., Rachmadhani R. A., Utami B. N., 2021, Students' strategies in online autonomous English language learning, Journal of English Education Studies, Vol. 6, No. 1, pp. 59-67.
[2] Alodwan T., 2021, Online learning during the COVID-19 pandemic from the perspectives of English as foreign language students, Educational Research and Reviews, Vol. 16, No. 7, pp. 279-288.
[3] Choudhuri S., Adeniye S., Sen A., 2023, Distribution alignment using complement entropy objective and adaptive consensus-based label refinement for partial domain adaptation, Artificial Intelligence Advances, Vol. 1, No. 1, pp. 43-51.
[4] Aisyah R. N., Istiqomah D. M., Muchlisin M., 2021, Rising English students' motivation in online learning platform: Telegram apps support, Utamax: Jurnal Teknologi Pendidikan, Vol. 3, No. 2, pp. 90-96.
[5] Duan R., Kawahara T., Dantsuji M., 2019, Cross-lingual transfer learning of non-native acoustic modeling for pronunciation error detection and diagnosis, IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 28, pp. 391-401.
[6] Miqawati A. H., 2020, Pronunciation learning, participation, and attitude enhancement through mobile assisted language learning, Journal of English Education, Vol. 8, No. 2, pp. 211-218.
[7] Okyar H., 2023, University-level EFL students' views on learning English online: A qualitative study, Education and Information Technologies, Vol. 28, No. 1, pp. 81-107.
[8] Cahyaningsih D. P., 2021, Teaching strategies used by English teachers in online learning, Universal Publishing Group, Vol. 5, No. 2, pp. 79-83.
[9] Alzamil A., 2022, L2 learning of English conditionals: Online versus traditional classroom teaching, Sino-US English Teaching, Vol. 19, No. 3, pp. 79-87.
[10] Wei F., 2023, Study on behaviour anomaly detection method of English online learning based on feature extraction, International Journal of Intelligent Systems, Vol. 15, No. 1, pp. 41-47.
[11] Jiao F., Song J., Zhao X., 2021, A spoken English teaching system based on speech recognition and machine learning, International Journal of Emerging Technologies in Learning, Vol. 16, No. 14, pp. 68-82.
[12] Lou Z., Ren Y., 2021, Investigating issues with machine learning for accent classification, Journal of Physics: Conference Series, Vol. 1738, No. 1, pp. 1-11.
[13] Paul B., Phadikar S., 2023, A novel pre-processing technique of amplitude interpolation for enhancing the classification accuracy of Bengali phonemes, Multimedia Tools and Applications, Vol. 82, No. 5, pp. 7735-7755.
[14] Hai Y., 2020, Computer-aided teaching mode of oral English intelligent learning based on speech recognition and network assistance, Journal of Intelligent and Fuzzy Systems, Vol. 39, No. 4, pp. 5749-5760.
[15] Chittaragi N. B., Koolagudi S. G., 2020, Automatic dialect identification system for Kannada language using single and ensemble SVM algorithms, Language Resources and Evaluation, Vol. 54, No. 2, pp. 553-585.
[16] Zhang Z., Wang Y., Yang J., 2021, Text-conditioned transformer for automatic pronunciation error detection, Speech Communication, Vol. 130, pp. 55-63.
[17] Liu Y., Quan Q., 2022, AI recognition method of pronunciation errors in oral English speech with the help of big data for personalized learning, Journal of Knowledge Management, Vol. 21, No. 2, pp. 1-19.
[18] Zhu Y., 2021, Development of wireless sensor device for machine English oral pronunciation noise detection, Journal of Sensors, Vol. 2021.
[19] Shi Y., Ko Y. C., 2022, Construction of English pronunciation judgment and detection model based on deep learning neural networks data stream fusion, International Journal of Pattern Recognition and Artificial Intelligence, Vol. 36, No. 6.
[20] Li M., Bai R., 2021, Recognition of English information and semantic features based on support vector machine and machine learning, Journal of Intelligent & Fuzzy Systems, Vol. 40, No. 2, pp. 2205-2215.
[21] Fouz-González J., 2020, Using apps for pronunciation training: An empirical evaluation of the English File Pronunciation app, International Journal of Educational Research Review, Vol. 5, No. 1, pp. 62-85.
[22] Widya W., Agustiana E., 2020, English vowels pronunciation accuracy: An acoustic phonetics study with Praat, Journal of English Language Teaching, Vol. 4, No. 2, pp. 112-120.
[23] Tejedor-García C., Escudero-Mancebo D., Cámara-Arenas E., 2020, Assessing pronunciation improvement in students of English using a controlled computer-assisted pronunciation tool, IEEE Transactions on Learning Technologies, Vol. 13, No. 2, pp. 269-282.
[24] Yang L., Fu K., Zhang J., Shinozaki T., 2020, Pronunciation erroneous tendency detection with language adversarial represent learning, Proceedings of Interspeech 2020, pp. 3024-3028.
[25] Fang C., 2021, Intelligent online English teaching system based on support vector machine algorithm and complex network, Journal of Intelligent & Fuzzy Systems, Vol. 40, No. 2, pp. 2709-2719.
[26] Wang J., 2020, Speech recognition of oral English teaching based on deep belief network, International Journal of Emerging Technologies in Learning, Vol. 15, No. 10, pp. 100-112.
[27] Yang L., 2021, Research on the realization path of college English education based on the support vector machine algorithm model under the background of cloud computing and wireless communication, Scientific Programming, Vol. 2021.
[28] Liu Z., Xu Y., 2023, Deep learning assessment of syllable affiliation of intervocalic consonants, The Journal of the Acoustical Society of America, Vol. 153, No. 2, pp. 848-866.
[29] Hou Q., Li C., Kang M., 2021, Intelligent model for speech recognition based on support vector machine: A case study on English language, Journal of Intelligent & Fuzzy Systems, Vol. 40, No. 2, pp. 2721-2731.
[30] Yarra C., Ghosh P. K., 2022, Automatic syllable stress detection under non-parallel label and data condition, Speech Communication, Vol. 138, pp. 80-87.
Rong Zhang

Rong Zhang was born in May 1983, female, native of Yulin, Shaanxi Province, China, of Han ethnicity. She obtained a bachelor’s degree in English linguistics from Xi’an International Studies University in 2007 and a master’s degree in teaching English to speakers of other languages (TESOL) from the University of York, UK, in 2009. Her research focuses on English linguistics, English teaching, and intercultural communication. Since 2010, she has been a lecturer at the School of Foreign Studies, Xi’an Medical University. She has published one monograph, presided over and participated in three research projects, and led one teaching reform project. Additionally, she has published more than ten academic papers.