3.1 Speech Recognition based on an HMM Model
The automatic scoring model for English interpretation built by a research institute
is essentially a process of converting speech into text and scoring the content. The
specific process is shown in Fig. 1.
In Fig. 1, speech recognition technology decodes speech signals to generate text. This process
can be transformed into a mathematical problem by using the Bayesian statistical modeling
framework, and the expression of this mathematical problem is shown in Formula (1) [14-16]:
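Formula (1) is the standard Bayesian decision rule for speech recognition; written out consistently with the definitions that follow (with $\hat{W}$ denoting the recognized word sequence), it is:

$$\hat{W}=\arg\max_{W} p(W\left| O\right.)=\arg\max_{W}\frac{p(O\left| W\right.)\,p(W)}{p(O)} \quad (1)$$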
Fig. 1. Structure of the Automatic Scoring Model for English Interpretations.
In Formula (1), $O$ represents the acoustic feature vector of the input speech; $W$ represents the
word sequence corresponding to the acoustic feature vector; $p(W\left| O\right.)$ represents
the posterior probability, defined as the probability that a word sequence
occurs given a particular acoustic feature vector; and $p(O\left| W\right.)$ represents
the matching degree between the acoustic feature vector and the word sequence. This model
is called an acoustic model (AM), in which $p(W)$ represents the probability of word
sequences in the text and $p(O)$ represents a constant term independent of the sequence
of words. At present, the AM is constructed using the HMM, whose states cannot be observed
directly and can only be inferred from the observed vectors [17-19]. Each observed vector in the HMM is governed by a certain probability density distribution.
The DNN model has an advantage in obtaining the HMM probability distribution density
by using the Bayesian formula. The HMM parameter set is denoted $\phi $, the observed vector
is denoted $O$, and its sequence expression is $O=\left\{o_{1},o_{2},\cdots ,o_{T}\right\}$.
Solving the model is thus transformed into computing the likelihood of generating the
observed sequence, expressed as $p(O\left| \phi \right.)$. The likelihood can be computed
recursively by the forward algorithm, and this recursion reduces the time complexity of
calculating $p(O\left| \phi \right.)$. The solution equation for the HMM state sequence is Formula
(2):
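Consistent with the definitions below, Formula (2) is the standard maximization over state sequences $S$:

$$\hat{S}=\arg\max_{S} p(O,S\left| \phi \right.) \quad (2)$$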
In Formula (2), $\hat{S}$ represents the state sequence. The calculation of the AM is Formula (3):
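A plausible form of Formula (3), assuming the conventional HMM factorization (with $s(0)$ the entry state), is:

$$p(O,S\left| \phi \right.)=\prod_{t=1}^{T} a_{s(t-1)s(t)}\, b_{s(t)}(o_{t}) \quad (3)$$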
In Formula (3), $b_{s(t)}$ represents the output probability density distribution; $s_{(t)}$
indicates the state at a given time; and $a_{s(t)}$ indicates the probability of
state transition at a given time. If $\phi _{j}(t)$ is the maximum likelihood score
of the observed vector in state $j$ at time $t$, Formula (4) can be obtained according to the recursion formula:
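Formula (4) is the standard Viterbi recursion, consistent with the definition of $\phi _{j}(t)$ below:

$$\phi _{j}(t)=\max_{i}\left\{\phi _{i}(t-1)\,a_{ij}\right\} b_{j}(o_{t}) \quad (4)$$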
In Formula (4), $\phi _{j}(t)$ is the maximum likelihood score in state $j$ at time $t$.
Taking the logarithm of the likelihood score prevents underflow of the likelihood value.
By solving the constructed AM and the maximum likelihood score, speech recognition can
be achieved.
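The Viterbi recursion of Formula (4), combined with the log-domain trick against underflow, can be sketched as follows (a minimal illustration with illustrative argument names, not the paper's implementation):

```python
import math

def viterbi_log(log_pi, log_a, log_b):
    """Log-domain Viterbi: most likely HMM state sequence.

    log_pi[i]   : log initial probability of state i
    log_a[i][j] : log transition probability from state i to state j
    log_b[t][j] : log output probability of observation o_t in state j
    Working in the log domain prevents underflow of the likelihood value.
    """
    T, N = len(log_b), len(log_pi)
    # phi[t][j]: max log-likelihood of any path ending in state j at time t
    phi = [[-math.inf] * N for _ in range(T)]
    back = [[0] * N for _ in range(T)]
    for j in range(N):
        phi[0][j] = log_pi[j] + log_b[0][j]
    for t in range(1, T):
        for j in range(N):
            # Formula (4): phi_j(t) = max_i { phi_i(t-1) + log a_ij } + log b_j(o_t)
            best_i = max(range(N), key=lambda i: phi[t - 1][i] + log_a[i][j])
            phi[t][j] = phi[t - 1][best_i] + log_a[best_i][j] + log_b[t][j]
            back[t][j] = best_i
    # Backtrace from the best final state
    state = max(range(N), key=lambda j: phi[T - 1][j])
    path = [state]
    for t in range(T - 1, 0, -1):
        state = back[t][state]
        path.append(state)
    return path[::-1], max(phi[T - 1])
```

Because only additions of log-probabilities are used, even long observation sequences do not underflow.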
3.2 Construction of the DNN Model
The basic speech recognition method is obtained by AM construction, but the pronunciation
state is not reflected in the speech recognition task, so the DNN model is introduced
into the HMM. Suppose the constructed DNN model has $L$ hidden layers (HL) and one output
layer (OL). The DNN model maps the input data to the output $p(s\left|
o\right.)$. The DNN model's forward process is shown in Formula (5):
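Consistent with the definitions below, Formula (5) is the standard layer-wise forward computation (with $h^{0}=o$ the input vector):

$$h^{l}=\sigma \left(W^{l}h^{l-1}+bias^{l}\right),\quad l=1,\cdots ,L \quad (5)$$

with the OL applying softmax to produce $p(s\left| o\right.)=\mathrm{softmax}\left(W^{L+1}h^{L}+bias^{L+1}\right)$.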
In Formula (5), $W^{l}$ represents the weight matrix; $bias^{l}$ indicates the network bias; and $\sigma $ represents
the sigmoid activation function, which is replaced by the softmax function in the OL. This study uses the
cross-entropy criterion to update the DNN model. Stochastic Gradient Descent (SGD)
is used to update the parameters on small batches of data. Due to the strong randomness of SGD,
the gradient updates fluctuate greatly. Therefore, a momentum (impulse) factor
can be introduced to reduce the fluctuation caused by randomness, and a weight attenuation
factor can be added at the same time to avoid overfitting by penalizing
weights that are too large. In the DNN model, the matrix multiplications of the SGD
process take a lot of time; the operation time can be reduced by using the parallel
computing capability of a graphics processor. In the HMM model, the
expression of the probability distribution density is Formula (6):
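A plausible form of Formula (6), assuming the usual Gaussian mixture emission density (with $\pi _{ki}$ the mixture weights), is:

$$p(o\left| s_{i}\right.)=\sum_{k}\pi _{ki}\,N(o\left| \mu _{ki},\Sigma _{ki}\right.) \quad (6)$$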
In Formula (6), $\pi $ represents the mixture weighting coefficient; $N(o\left| \mu _{ki},\Sigma _{ki}\right.)$ represents
a Gaussian component with mean vector $\mu _{ki}$ and covariance matrix $\Sigma _{ki}$; and $p(o\left| s_{i}\right.)$
represents the emission distribution that the DNN is used to model. The Bayesian formula
is used to transform the DNN output so that Formula (7) is obtained:
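Applying the Bayesian formula to the DNN output $p(s_{i}\left| o\right.)$, consistent with the definitions below, gives:

$$p(o\left| s_{i}\right.)=\frac{p(s_{i}\left| o\right.)\,p(o)}{p(s_{i})} \quad (7)$$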
Formula (7) is the posterior probability of a node state when the DNN model inputs a vector.
In (7), $p(s_{i})$ represents the prior probability of the HMM state, which can be obtained
through the training set, and $p(o)$ is a constant. The DNN form constructed by Formula
(7) and the DNN architecture applied to the HMM are shown in Fig. 2.
Fig. 2 shows the framework of the DNN-HMM model. Under this framework, the performance of
interpreting English pronunciation can be reflected by calculating the logarithmic
posterior probability. The result of decoding by the speech recognizer is used as
reference data. When phoneme $z$ is decoded, assuming that the observed vector obtained
by Viterbi decoding is $O=\left\{o_{1},o_{2},\cdots ,o_{N}\right\}$, then the posterior
probability corresponding to phoneme $z$ is $pp(z\left| O\right.)$. The posterior probability
expression is shown in Formula (8):
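A plausible form of Formula (8), assuming the conventional goodness-of-pronunciation definition and matching the two approximations described below, is:

$$pp(z\left| O\right.)=\frac{p(O\left| z\right.)\,p(z)}{\sum_{q}p(O\left| q\right.)\,p(q)}\approx \frac{p(O\left| z\right.)}{\max_{q\in Q_{z}}p(O\left| q\right.)} \quad (8)$$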
Fig. 2. The DNN-HMM acoustic model frame structure.
There are two approximations in Formula (8): the first ignores the prior probabilities of all phonemes, and the second
retains only the largest term in the denominator to simplify the calculation.
$Q_{z}$ represents the denominator's calculation space, which consists of mispronunciations
of phoneme $z$, making the posterior probability of phoneme $z$ more specific. After obtaining
the posterior probability of the phoneme, its logarithmic cumulative sum is shown
in Formula (9):
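A plausible form of Formula (9), assuming the frame-level factorization and the neglect of transition probabilities noted below (with $s_{t}$ the state aligned to frame $o_{t}$), is:

$$\log pp(z\left| O\right.)\approx \sum_{t=1}^{N}\log p(o_{t}\left| s_{t}\right.)-\max_{q\in Q_{z}}\sum_{t=1}^{N}\log p(o_{t}\left| s_{t}^{q}\right.) \quad (9)$$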
Formula (9) ignores the transition probability of the HMM; this study assumes that the transition
probability need not be included in the likelihood calculation. Substituting the
DNN formula into Formula (9), Formula (10) is obtained:
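Substituting Formula (7) into each term of Formula (9) and dropping the constant $p(o_{t})$, a plausible form of Formula (10) is:

$$\log pp(z\left| O\right.)\approx \sum_{t=1}^{N}\log \frac{p(s_{t}\left| o_{t}\right.)}{p(s_{t})}-\max_{q\in Q_{z}}\sum_{t=1}^{N}\log \frac{p(s_{t}^{q}\left| o_{t}\right.)}{p(s_{t}^{q})} \quad (10)$$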
According to Formula (10), the factor $p(o)$, which has no influence on the likelihood
calculation, can be ignored; $p(s_{i}\left| o\right.)$ can be obtained
directly from the model output; and $p(s_{i})$ can be obtained from the training set. Then,
Formula (10) is the final calculation of the likelihood in the DNN framework.
Suppose there are $N_{k}$ sentences in a certain segment of an English interpretation,
and each sentence is decoded to obtain a different phoneme. Then, the final posterior
probability feature of this speech is calculated with Formula (11):
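A plausible form of Formula (11), matching the two-level averaging described below (the symbols $M_{i}$ for the number of phonemes in sentence $i$ and $z_{ij}$ for its $j$-th phoneme are illustrative), is:

$$WPP(k)=\frac{1}{N_{k}}\sum_{i=1}^{N_{k}}\frac{1}{M_{i}}\sum_{j=1}^{M_{i}}\log pp(z_{ij}\left| O_{i}\right.) \quad (11)$$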
In Formula (11), $WPP(k)$ represents a posterior probability feature, which calculates the average
of the posterior probability of all phonemes in each sentence in this speech segment,
and then calculates the average of all sentences to obtain the estimated value of
the posterior probability feature of this speech segment. This study designs a scoring
model for English interpretation that also adds or subtracts points
according to the interpreter's state. In addition, when the interpreter produces less content
but has better pronunciation, the posterior probability feature still
performs well; for manual scoring, however, the score should be reduced.
Therefore, a weighting of the score $S$ against $WPP(k)$ is used in this study to adjust this mismatch
between human and machine scoring. The weighting method is shown in Formula (12):
After weighting, the matching degree between the posterior probability feature and the
human score is higher, and the correlation of the man-machine scores is stronger.
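The two-level averaging behind the $WPP(k)$ feature can be sketched as follows (an illustrative helper, not the paper's code; it averages log phoneme posteriors per sentence and then across sentences):

```python
import math

def wpp(segment_log_posteriors):
    """Segment-level posterior probability feature (WPP sketch).

    segment_log_posteriors: list of sentences, each a list of log
    posterior probabilities, one per decoded phoneme in that sentence.
    Returns the mean over sentences of each sentence's mean log posterior.
    """
    # Per-sentence average of log phoneme posteriors
    sentence_means = [sum(s) / len(s) for s in segment_log_posteriors]
    # Average over all sentences in the speech segment
    return sum(sentence_means) / len(sentence_means)
```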
3.3 English Interpretation Scoring via RNN
In this study, a posterior probability feature estimation model is constructed, which
is highly dependent on the recognition effect of the model in English interpreter
assessment scoring. If an identification error occurs, the posterior probability of
a subsequent calculation cannot provide useful information for interpretation scoring.
Moreover, when the model recognizes interpreted sentences, the score of a word is
determined only by the previous two or three words, which is hardly scientific. If the
language model can see more of the historical information, the interpretation scoring
model is more reliable and the recognition effect of the model improves, so
that the posterior probability calculation is more accurate. Therefore, researchers
have used the Recurrent Neural Network (RNN) to construct language models [20-22]. The RNN structure has a different data transmission mode: it concatenates the output
of the HL at the current moment with the vector describing the word at the next
moment to form new input data, which is transmitted through the
structure. The output of each transmission step retains historical information
about the data, so the model introduces more of the data's information in the training
process. The network structure of the RNN language model is shown in Fig. 3.
Fig. 3. Basic Structure of the RNN Language Model.
In Fig. 3, $w(t)$ represents the vector form of the current input data; $s(t)$ represents the
output of the HL at time $t$; $s(t-1)$ represents HL output at the previous moment
and is the input for the present moment; $y(t)$ represents the output vector of the
model; and $c(t)$ represents the clustering of words, which is mainly used to accelerate
the training of the model. At this time, output $y(t)$ is passed through the softmax function
to ensure that the probability of each predicted word lies within the range
0 to 1, which avoids the complicated backoff smoothing operation in the model. Assuming
the dimension of $c(t)$ is $M$, the pre-trained words will be divided into $M$ categories,
and the sum of word frequencies in each category is basically the same. While training
the model, it is necessary to update the weight of $c(t)$ and $y(t)$ in the same category.
The RNN language model has some problems in decoding efficiency: the model cannot be used
directly for one-pass decoding. Therefore, in this study, score re-estimation is carried out
on the decoding output of the RNN language model, and the top-scoring sentence after
re-estimation is taken as the new recognition result. Experiments have shown that an RNN
language model performs better after interpolation with an n-gram language model, so the
language model score after re-estimation is also obtained by interpolation. The re-estimation
score of a candidate sentence is calculated by Formula (13):
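A plausible form of Formula (13), assuming linear interpolation of the two language model scores and a per-word penalty (with $\left| W_{k}\right|$ the number of words in the sentence), is:

$$Score_{k}=AcScore_{k}+lmScale\left[\lambda\, lm_{ngram}^{k}+(1-\lambda)\, lm_{RNN}^{k}\right]+C\left| W_{k}\right| \quad (13)$$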
In Formula (13), $Score_{k}$ represents the re-estimation score; $AcScore_{k}$ represents
the acoustic model score; $W_{k}$ represents the words in the entire sentence;
$C$ is the word penalty; $\lambda $ denotes the interpolation coefficient; $lm_{ngram}^{k}$
and $lm_{RNN}^{k}$ indicate the scores of the respective language models; and $lmScale$ is the
score scaling factor of the language model during decoding. Formula (13) is used to find the sentence with the largest re-estimation score, and that sentence
is taken as the new reference data for re-estimating the posterior probability. The flow
chart is shown in Fig. 4.
Fig. 4. Flow Chart of RNN Re-estimation Identification Results.
In Fig. 4, for the candidate data from the first decoding pass, the acoustic model score of each
candidate is kept unchanged, and the original language model score is replaced by the
interpolated score to calculate a new sentence score. The candidate with the highest
new score is selected from the candidate data as the new recognition result.
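The re-estimation step above can be sketched as follows (an illustrative N-best rescoring helper with hypothetical field names, not the paper's implementation): the acoustic score is kept, the language model score is replaced by the interpolated score, and the best candidate is returned.

```python
def rescore(candidates, lam, lm_scale, word_penalty):
    """N-best rescoring sketch following Formula (13)-style scoring.

    Each candidate is a dict with keys 'words' (list of words),
    'ac_score' (acoustic model score), 'lm_ngram' and 'lm_rnn'
    (the two language model scores). Field names are illustrative.
    Returns the candidate with the highest re-estimated score.
    """
    def new_score(c):
        # Interpolate the n-gram and RNN language model scores
        lm_interp = lam * c["lm_ngram"] + (1.0 - lam) * c["lm_rnn"]
        # Keep the acoustic score; add scaled LM score and word penalty
        return c["ac_score"] + lm_scale * lm_interp + word_penalty * len(c["words"])
    return max(candidates, key=new_score)
```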