To meet the requirement of timely pronunciation correction in the online ELL process,
the study uses a DBN and an SVM to classify and detect pronunciation errors in online
ELL speech training, thereby improving the efficiency of the OL process. The approach
can also provide targeted pronunciation instruction in a timely, efficient, and convenient
manner, free from the time and space limitations of traditional face-to-face
instruction. The effect of online ELL is optimized by integrating the DBN and SVM.
3.1. Model Design for Pronunciation Classification Based on Improved DBN Network
To improve the recognition of pronunciation errors in online ELL, the types of online
ELL pronunciation must first be categorized and identified, which improves overall
recognition and error detection. Among deep neural networks, the DBN can use a multilayer
neural network with feature detection for hierarchical learning. Hierarchical learning
through detecting neurons obtains the corresponding features, which can then be
backpropagated during learning, thus achieving global optimization.
Therefore, the study applies the DBN to articulatory speech features, specifically
spectrogram features and acoustic features [15-17]. The DBN is a deep neural network
grounded in statistical mechanics, and it describes the speech spectrogram and acoustic features
through the energy function and probability function. The energy function is mainly
used to represent the energy values in different PC error detection cases, and the
optimal PC results can be determined by comparing the energy values in different cases.
The probability function, on the other hand, is used to calculate the probability
values of different PC results, and by comparing the magnitude of the probability
of different results, the final PC result can be determined [18, 19]. Multiple Restricted
Boltzmann Machines (RBMs) are stacked to construct the DBN network.
In the RBM, $v$ denotes the visible layer that receives the input speech spectrogram,
$h$ denotes the hidden layer, and $n$ denotes the number of hidden units. When the
states of the hidden-layer units are fixed, the visible-layer units can be activated
independently of one another. The energy of the RBM in this state can then be computed
from the defining formula; Eq. (1) illustrates the energy definition calculation formula.
In Eq. (1), $\theta$ denotes the set of all parameters in the RBM. $a_i$ and $b_j$ denote
the bias values of the visible unit $v_i$ and the hidden unit $h_j$, respectively, and
$W_{ij}$ denotes the connection weight between visible unit $v_i$ and hidden unit $h_j$.
When the energy-defining parameters are determined,
their corresponding joint state distributions can be expressed by Eq. (2).
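Since Eq. (1) and Eq. (2) are not reproduced in this excerpt, the following sketch assumes the standard binary-RBM energy $E(v,h\,|\,\theta)=-\sum_i a_i v_i-\sum_j b_j h_j-\sum_{i,j} v_i W_{ij} h_j$ and the Boltzmann joint distribution $P(v,h\,|\,\theta)=e^{-E(v,h)}/Z(\theta)$; the toy sizes and random weights are illustrative only.

```python
import numpy as np
from itertools import product

# Toy binary RBM: 4 visible units, 3 hidden units (sizes illustrative only).
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4, 3))   # connection weights W_ij
a = np.zeros(4)                           # visible biases a_i
b = np.zeros(3)                           # hidden biases b_j

def energy(v, h):
    """Assumed Eq. (1) form:
    E(v, h | theta) = -sum_i a_i v_i - sum_j b_j h_j - sum_ij v_i W_ij h_j."""
    return -(a @ v) - (b @ h) - (v @ W @ h)

# Assumed Eq. (2) form: P(v, h | theta) = exp(-E(v, h)) / Z(theta).
# For toy sizes the partition function Z can be computed exactly by
# enumerating all 2^4 * 2^3 joint configurations.
configs = [(np.array(v, float), np.array(h, float))
           for v in product([0, 1], repeat=4)
           for h in product([0, 1], repeat=3)]
Z = sum(np.exp(-energy(v, h)) for v, h in configs)

def joint_prob(v, h):
    """Probability of one joint configuration under the toy RBM."""
    return np.exp(-energy(v, h)) / Z
```

Comparing the energies of different configurations via `energy` mirrors how the text selects the optimal PC result, while `joint_prob` corresponds to comparing probability values.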
In Eq. (2), $Z(\theta)$ denotes the partition function. Since the visible units of the
RBM are activated conditionally on the states of the hidden units, the activation
probability of the visible units in the online ELL speech classification process can
be described by Eq. (3).
Based on the estimated activation probability of the visible units and speech sampling,
the nodes of the hidden layer can be determined by Eq. (4), which yields the output
probability of the hidden layer.
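The conditional activation probabilities of Eq. (3) and Eq. (4) are not shown in this excerpt; the sketch below assumes the standard sigmoid forms for a binary RBM, $P(v_i{=}1\,|\,h)=\sigma(a_i+\sum_j W_{ij}h_j)$ and $P(h_j{=}1\,|\,v)=\sigma(b_j+\sum_i v_i W_{ij})$, plus the Bernoulli sampling of hidden nodes the text describes.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_v_given_h(h, W, a):
    """Assumed Eq. (3) form: P(v_i = 1 | h) = sigma(a_i + sum_j W_ij h_j)."""
    return sigmoid(a + W @ h)

def p_h_given_v(v, W, b):
    """Assumed Eq. (4) form: P(h_j = 1 | v) = sigma(b_j + sum_i v_i W_ij)."""
    return sigmoid(b + v @ W)

def sample_hidden(v, W, b, rng):
    """Bernoulli-sample hidden states from their activation probabilities,
    yielding the hidden-layer output the text refers to."""
    p = p_h_given_v(v, W, b)
    return (rng.random(p.shape) < p).astype(float)
```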
When performing pronunciation training for online ELL, the study utilizes the RBM
model to capture deep features in the speech data. The activation states of each neuron
in the RBM can be determined by computing the output probability of the hidden layer.
These states are essentially a coded representation of the input speech data. To maximize
the model’s performance, the study trains the RBM on feature data, attempting to teach
it the statistical laws found in the speech data. The likelihood function has a very
important role in this process. The likelihood function describes the probability
that the model produces the observed data and is the basis for evaluating the merits
of the model parameters. By maximizing the likelihood function and using it as a basis,
the optimal set of parameters can be found. The model fits the training data more
closely when the ideal parameter set is used, and the training fit to the online ELL
pronunciation data is made possible by choosing the right parameter set [20-22]. To
increase the accuracy of data classification, the settings of the English speech
classification training procedure can be updated after the training fit is finished.
At this point, if the number of samples in the English PC training is $T$, the new
parameter-set training criterion can be represented by Eq. (5).
Fig. 1. Flowchart of DBN online English learning pronunciation classification model
based on RBM improvement.
Under the new parameter-set training criterion, the study uses the stochastic gradient
method to handle the values of the likelihood function on the English pronunciation
training data; the corresponding updated parameter criterion can be expressed by Eq. (6).
In Eq. (6), $\varepsilon$ denotes the learning efficiency of the updated pronunciation
feature learning. Taken together, the above analysis shows that training online ELL
speech PC in a DBN network first requires initialization using the RBM; this initialization
serves as the basis for faster sampling. The update variables for each parameter are
then retrieved from the sampled data and used as the basis for parameter refreshing.
The final step is to repeat the training cycle to obtain a model that can be used
for classification.
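The paper's exact update rule (Eq. (6)) is not reproduced here; contrastive divergence (CD-1) is the usual stochastic-gradient approximation for RBM likelihood training and is sketched below under that assumption, with `eps` playing the role of the learning efficiency $\varepsilon$. All sizes and inputs are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, a, b, eps=0.1, rng=None):
    """One CD-1 parameter update: raise the probability of the data v0 and
    lower that of its one-step reconstruction v1. eps plays the role of
    the learning efficiency (epsilon) in the text's Eq. (6)."""
    if rng is None:
        rng = np.random.default_rng(0)
    ph0 = sigmoid(b + v0 @ W)                        # P(h=1 | v0)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)  # sample hidden states
    pv1 = sigmoid(a + W @ h0)                        # reconstruction probs
    v1 = (rng.random(pv1.shape) < pv1).astype(float)  # sampled reconstruction
    ph1 = sigmoid(b + v1 @ W)                        # P(h=1 | v1)
    W = W + eps * (np.outer(v0, ph0) - np.outer(v1, ph1))
    a = a + eps * (v0 - v1)
    b = b + eps * (ph0 - ph1)
    return W, a, b

# One update on a random binary input (sizes illustrative only).
rng = np.random.default_rng(1)
W = rng.normal(scale=0.1, size=(6, 4))
a, b = np.zeros(6), np.zeros(4)
v0 = (rng.random(6) < 0.5).astype(float)
W, a, b = cd1_update(v0, W, a, b, rng=rng)
```

Repeating this update over the training set corresponds to the repeated training cycle described above.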
Since the feature data obtained in the English PC recognition process are all real-valued,
using only a single RBM to construct the bottom layer of the DBN is insufficient and
would reduce the performance and effectiveness of the DBN network. This is because a
single RBM can only capture local features, whereas the DBN network needs to capture
global features for better feature learning and classification tasks [23, 24]. Therefore,
to increase the network's efficiency and performance, the study builds the DBN network
using multi-layer RBMs. Fig. 1 displays the flowchart of the study's enhanced DBN online ELLPC model, which is based
on RBM.
Combined with Fig. 1, it can be noted that when using the improved DBN to classify pronunciation, the
data is passed from top to bottom through the RBM in the top layer. By utilizing the
top RBM in the DBN network to alternate sampling with the RBMs of other layers, the
balance of the pronunciation data can be significantly improved. This not only improves
the robustness of the model, but also improves the overall running speed of the model.
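The multi-layer stacking just described can be sketched as a bottom-up pass in which each RBM's hidden probabilities feed the next RBM as visible input. The weights below are random placeholders standing in for RBMs already trained by contrastive divergence, and the layer sizes are illustrative only.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Stacked-RBM sketch: visible layer -> hidden layer 1 -> hidden layer 2.
rng = np.random.default_rng(0)
sizes = [8, 6, 4]
weights = [rng.normal(scale=0.1, size=(m, n)) for m, n in zip(sizes, sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def dbn_features(v):
    """Deterministic mean-field propagation: each layer's hidden
    probabilities become the next layer's visible input."""
    x = v
    for W, b in zip(weights, biases):
        x = sigmoid(b + x @ W)
    return x

feat = dbn_features(rng.random(8))   # 4-dimensional top-layer feature
```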
3.2. Construction of Pronunciation Categorization Error Checking Model Based on DBN
Network and SVM
The analysis of the improved DBN network model reveals that the model can effectively
classify the pronunciation, and the next step is to conduct an error detection study
on the classification. In online ELLPC error detection, PED is based on the phonemes
and observation vectors of the pronunciation to decide whether the detected pronunciation
is normal or not. The detection process can be defined by a formalization, and the
definition can be represented by Eq. (7).
In Eq. (7), $q$ denotes an isolated articulatory phoneme and $o_1^T$ denotes the
observation vector; 1 denotes correct pronunciation and 0 denotes mispronunciation.
The purpose of PC error checking is to decide whether $o_1^T$ and $q$ are pronounced
correctly during the articulation process. The study then formalizes the PED pattern
classification process, which can be expressed in Eq. (8).
Combined with the analysis of Eq. (8), it can be noted that the acoustic layer needs to be modeled when modeling the correct
and incorrect articulation of articulatory phonemes, which will increase the labeling
workload. To reduce the difficulty of building the error detection model, the study
reduces the spatial complexity of the features by transforming the dimensionality
of the collected phoneme features to meet the requirements of error detection. The
process of reducing the spatial complexity of the features can be expressed in Eq.
(9).
In Eq. (9), $x_p$ denotes the model's acoustic computational parameters and $y_p$ denotes
the modeled features. After completing the feature training using the DBN model, the study introduced
SVM into the model. That is, SVM is used as the final PED classifier in the top layer
of the DBN model. PED with SVM models each type of pronunciation error one by one;
after the modeling is completed, the posterior probabilities (PostPs) of all
classification models are calculated, and the error type with the largest PostP value
is taken as the representative model of the current error [25, 26]. The study first
obtains the score domain by transforming the feature domain through the classification
model, and then uses SVM to train in the score domain using manually
the classification model, and then uses SVM to train in the score domain using manually
labeled pronunciation error types. Fig. 2 is the schematic diagram of SVM-based score region division.
Fig. 2. Schematic diagram of score region division based on SVM.
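The score-domain SVM training just described can be illustrated with a minimal linear SVM trained by hinge-loss subgradient descent, a self-contained stand-in for whatever SVM implementation the study used. The score features and labels below are synthetic, not data from the paper.

```python
import numpy as np

def train_linear_svm(X, t, lam=0.01, eps=0.1, epochs=200, seed=0):
    """Linear SVM via hinge-loss subgradient descent with L2 penalty lam."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            if t[i] * (w @ X[i] + b) < 1:           # margin violated
                w += eps * (t[i] * X[i] - lam * w)  # hinge subgradient step
                b += eps * t[i]
            else:
                w -= eps * lam * w                  # regulariser-only step
    return w, b

# Two linearly separable clusters of "score-domain" points, labelled +1/-1
# (e.g. correct pronunciation vs. one manually labelled error type).
X = np.array([[2.0, 2.0], [2.5, 1.8], [-2.0, -2.0], [-1.8, -2.4]])
t = np.array([1, 1, -1, -1])
w, b = train_linear_svm(X, t)
preds = np.sign(X @ w + b)
```

In the paper's one-model-per-error-type setup, one such classifier would be trained per error type and the PostPs of all models compared.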
Based on the DBN classification, when using SVM for classification error checking,
the study defines the error checking to be performed across all classifiers so as to
improve its coverage. In the parameter-solving procedure, the distance from a point
to the classification surface must be precisely described [27, 28], and this distance
can be expressed by Eq. (10). The process uses the margin to obtain the classification
error detection parameters.
In Eq. (10), $t_n$ denotes the category information corresponding to the definition process,
$x_n$ denotes the training samples, and $w$ denotes the training classifier parameters.
To meet the criterion for judging the distance between a point and the classification
surface, a correct scaling factor must also be selected during the classification
error checking process. This scaling factor can be expressed using Eq. (11).
In Eq. (11), $x_{\hat{n}}$ denotes the characteristics of the point and $t_{\hat{n}}$
denotes its target category. By determining the distance between the point and the
categorization surface
in the process of classification error detection, it is possible to more accurately
determine the correct and incorrect pronunciation. A schematic diagram of Margin-based
SVM classification error detection is shown in Fig. 3.
Fig. 3. Schematic diagram of SVM classification error detection based on margin.
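Eq. (10) is not reproduced in this excerpt; the sketch below assumes the standard signed-distance form $t_n (w^\top x_n + b)/\lVert w \rVert$, which matches the quantities the text names ($t_n$ the category label, $x_n$ the training sample, $w$ the classifier parameters).

```python
import numpy as np

def signed_margin(w, b, x_n, t_n):
    """Assumed Eq. (10) form: signed distance of sample x_n from the
    separating surface w.x + b = 0, scaled by its label t_n in {-1, +1}.
    Positive values mean the point lies on its correct side."""
    return t_n * (w @ x_n + b) / np.linalg.norm(w)

# A point two units above the horizontal surface x2 = 0, labelled +1.
d = signed_margin(np.array([0.0, 1.0]), 0.0, np.array([3.0, 2.0]), 1)
```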
After completing the maximum-margin classification error detection task, the study
used the margin to transform the score-domain features so as to accurately detect the
set of samples requiring PED. In the process of classification error detection
using the DBN-SVM model, it is first necessary to extract the acoustic feature values
corresponding to each phoneme from the given articulatory data. These feature values
can accurately reflect key information about the pronunciation, such as sound quality,
duration and pitch. These features are then detected using the trained DBN-SVM
model and the corresponding score domain features are calculated. Lastly, manual labeling
must be combined with classification error detection to assure the correctness of
the latter. This involves carefully evaluating and validating each sample’s labeling
findings. This not only ensures the accuracy of the training data, but also provides
corresponding data support for the subsequent formation of an effective error detection
classifier [29, 30]. The training data can be used to create an efficient classifier
for error detection,
as shown in the training flowchart of the PED classifier in Fig. 4.
Fig. 4. Training flowchart of pronunciation error detection classifier.
The analysis of the PED classifier shows that false alarms and missed detections in
the PC error detection process increase the number of error types. To solve this
problem, the study adds a heuristic strategy to the DBN-SVM model, which uses a
heuristic decision criterion to calculate the posterior probability; the calculation
process can be represented by Eq. (12).
In Eq. (12), $P(A,B)$ is the PostP of error $A$ given the classification error-check
data $B$, and $P(B,A)$ is the PostP of error $B$ given the classification error-check
data $A$. $P(A)$ is the a priori probability (PriP) of error $A$, and $P(B)$ is the
PriP of detecting the classified errors $B$. After addressing false alarms and missed
detections using heuristic strategies,
the study redefined the differentiation function for error detection and classification.
The redefined function can be expressed by Eq. (13).
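Eq. (12) is not reproduced in this excerpt; given the quantities the text names, it appears to follow Bayes' rule, and the sketch below assumes that form. The probabilities used are made up for illustration.

```python
def posterior(p_b_given_a, p_a, p_b):
    """Assumed Bayes-rule form of Eq. (12):
    P(A|B) = P(B|A) * P(A) / P(B) -- the PostP of error A given the
    detected classification-error data B."""
    return p_b_given_a * p_a / p_b

# Illustrative (made-up) probabilities for one error type.
p = posterior(p_b_given_a=0.9, p_a=0.1, p_b=0.3)
```

Comparing such posteriors across error types, with thresholds tuned to trade off false alarms against missed detections, is the heuristic decision the text describes.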
In Eq. (13), $Y$ denotes the feature dimension value and $N$ denotes the number of samples in the model.
Defining the differentiation function makes it possible to handle error checking under
different false-alarm and missed-detection conditions, thus improving the reliability
of error checking. Fig. 5 displays the PCECM flowchart, which is based on DBN-SVM.
Fig. 5. Flowchart of pronunciation classification error detection model based on DBN-SVM.
Combined with Fig. 5, the process of error detection for online ELLPC using the model
is as follows. First, data acquisition: the data are obtained from the online ELL
pronunciation library and the speech is aligned with sentences, words and phonemes to
obtain the PED categorization dataset. Second, data pretreatment: the data are
normalized so that the values used for error detection fall within a fixed range and
the variability between data sets is minimized. Third, the DBN structure is designed
using RBMs: constructing the DBN with RBMs determines the hidden layer, visible layer,
processing parameters and other information of the DBN. Next comes parameter training
for the RBMs, which helps improve the accuracy and performance of the model: the
network's RBMs are trained one layer at a time to extract data features from the deep
network, and training continues until each RBM in the DBN network has completed its
training. Finally, the model's performance is assessed by evaluating the metrics in
light of the test results.
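The data pretreatment step above calls for normalizing features into a fixed range. The text does not specify the scaler, so the sketch below assumes simple column-wise min-max scaling; the toy feature matrix is illustrative only.

```python
import numpy as np

def min_max_normalize(X, lo=0.0, hi=1.0):
    """Column-wise min-max scaling into [lo, hi]. Columns must not be
    constant (a constant column would divide by zero)."""
    X = np.asarray(X, dtype=float)
    mn, mx = X.min(axis=0), X.max(axis=0)
    return lo + (X - mn) / (mx - mn) * (hi - lo)

# Toy acoustic-feature matrix (rows = samples, columns = features).
X = np.array([[1.0, 10.0],
              [2.0, 25.0],
              [3.0, 30.0]])
Xn = min_max_normalize(X)
```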