
  1. (College of Xiangsihu, Guangxi Minzu University, Nanning, 530225, China)



Keywords: Acoustic phonetics features, Time warping, Support vector regression, English, Reading aloud, Pronunciation quality, Manual rating

1. Introduction

English, as a globally used language, greatly facilitates communication between people from different countries. With the advance of globalization and the spread of English education, assessing English pronunciation has become increasingly important. For non-native learners, pronunciation accuracy is crucial to acquiring fluent English: correct pronunciation improves both communication skills and language fluency [1]. The phonetic characteristics of English are complex and varied, and pronunciation accuracy is influenced by many factors, such as tone, stress, and vowel linking. Traditional methods for evaluating read-aloud pronunciation are usually based on manual judgment or speech recognition technology, which inevitably introduces subjectivity or recognition errors and makes it difficult to capture subtle differences in English pronunciation, such as the articulation position and acoustic characteristics of long and short vowels [2, 3]. Moreover, insufficient mastery of phonetics and the influence of mother-tongue pronunciation habits lead Chinese students to rely mainly on phoneme duration to perceive differences in spoken language [4]. At present, pronunciation quality is evaluated mainly with respect to a given text, i.e., reading aloud from the text. Therefore, this study proposes an acoustic phonetics feature alignment method based on Dynamic Time Warping (DTW) and Support Vector Regression (SVR). Considering the English learning habits of Chinese learners, multi-dimensional acoustic features are extracted from the phonetic characteristics and differences of English pronunciation to improve oral English evaluation.
As a sequence alignment technique, DTW can measure the similarity between speech signals while accounting for the time dimension, while SVR maintains high data processing accuracy without letting the input-space dimension dominate computational complexity [5]. By comparing the similarity between feature sequences in speech signals, DTW-SVR can assess pronunciation accuracy and effectively evaluate pronunciation quality, which has theoretical and practical significance for advancing speech evaluation technology and improving the quality of English learning and teaching. The innovation of this research lies in comparing the computed phoneme standard score against a decision threshold within a speech recognition framework when detecting pronunciation errors. Addressing the pronunciation characteristics of different phonemes and the easily confused phonemes common in Chinese students' English, the study designs an information search network, a Viterbi state-sequence search, and maximum-likelihood clustering supervision analysis based on a GMM-HMM acoustic model to better extract acoustic phonetics features. When evaluating English pronunciation quality, the method does not rely on evaluation-dimension indicators alone; the DTW-SVR algorithm comprehensively evaluates quality along dimensions such as fluency and intonation, achieving effective fusion of feature data. For English speech recognition, rather than simply applying traditional algorithms to feature-sequence analysis, the study improves the key techniques of pronunciation error detection and pronunciation quality evaluation, integrates the two organically, and constructs a complete English pronunciation quality evaluation model.
The proposed approach can effectively evaluate pronunciation quality, offering reference value for improving computer-aided language learning systems and evaluation models of English read-aloud pronunciation quality for Chinese students. The research analyzes English pronunciation evaluation in four parts. First, current evaluation techniques for English reading, speaking, and pronunciation are reviewed. Second, the acoustic phonetics characteristics and the design of the pronunciation quality evaluation model are elaborated. Third, the application results of the methods in pronunciation evaluation are explored. Finally, the article is summarized.

2. Related Works

The English pronunciation evaluation method is an important tool for assessing the accuracy and fluency of an individual's English pronunciation. Traditional methods include subjective and objective evaluation. Subjective evaluation relies on the judgment of experts or teachers, which is easily influenced by personal attitudes and evaluation criteria and therefore lacks objectivity. Objective evaluation is based on computer algorithms and automation technology. Cao D proposed a speech recognition technique based on fuzzy measures to evaluate spoken English, realized by extracting different feature parameters and designing automatic learning rules; the results confirmed that this method achieved higher recognition effectiveness than traditional algorithms [6]. Wang Y et al. developed a computer evaluation plug-in based on computer-aided synthesis technology that could automatically recognize and rate spoken English. The plug-in included modules such as voice evaluation and oral dialogue, providing learners with timely feedback, and offered new technical means for scoring oral English error correction [7]. Ran D et al. proposed using artificial intelligence speech recognition technology to correct English accent pronunciation and analyzed phoneme-level speech and acoustic models; control experiments confirmed that the correction model had significant application effects [8]. Yuan Z et al. proposed using fuzzy algorithms to improve the quality of English translation systems. They used Gaussian processing to complete image input and recognition in English translation, preserving pixel edge information of the images. The results confirmed that the algorithm could effectively denoise English translation images with high recognition accuracy [9].
To address the difficulty of regional accent recognition, Cetin O proposed using convolutional neural networks to extract spectral image features of speech signals, improving model performance through fast Fourier transform, logarithmic compression, spectral image processing, and transfer learning. The results confirmed that the method's accuracy exceeded 90% and that it effectively overcame the interference of heterogeneous data on algorithm performance [10]. Song Z designed English speech recognition algorithms with deep neural networks and used linear feature fusion to build a framework for English speech features and attributes. The results confirmed that the deep speech recognition algorithm could extract speech features across different dimensions and effectively improve system performance [11].

Acoustic models are often trained on the pronunciation of native speakers of the target language, and confidence measures can evaluate the similarity between a learner's pronunciation and the standard pronunciation. When evaluating pronunciation quality at the phonetic-unit level, Chan used logarithmic probability scores for comparison; the results confirmed a good correlation between this score and manual scoring [12]. To address the limitations of posterior probability measurement in English speech recognition, Gang Z proposed using artificial emotion recognition and a high-speed hybrid model to filter out speech-quality clutter and improve the quality of phonemes in speech acoustic modeling, with a targeted approach based on the characteristics of the clutter distribution. The results confirmed that the clutter suppression technology effectively overcame the problems of traditional speech detection systems and performed well in English teaching experiments [13]. Aissiou M et al. proposed a genetic-algorithm-improved acoustic and speech decoding model. They classified and discriminated speech-continuum parameter vectors, divided standard Arabic vowels into categories, and extracted vocal tract parameter coefficients. The results confirmed that the method's average classification accuracy for corpus phonemes under noisy conditions exceeded 98% [14]. Fang Y proposed an intelligent English speaking assessment system based on DTW, designing hardware and software modules separately. The results confirmed that the system achieved an accuracy above 65% and a response time below 15 ms, greatly improving the accuracy and efficiency of intelligent oral English evaluation [15]. Fan Y utilized deep learning and mobile platforms to design a speech evaluation system. The results confirmed that the system's agreement with manual speech evaluation exceeded 85% and its correct speech recognition rate was well above 90%, greatly improving learners' efficiency and showing high application value [16].

At present, research on pronunciation quality evaluation is mainly text-dependent. In acoustic feature detection, some scholars classify and detect known types of pronunciation errors; for example, Cao D et al. introduced fuzzy measures and used different feature parameters with automatic learning rules to achieve English speech recognition. Such methods struggle to handle pronunciation errors outside the predefined scope, and their accuracy is hard to guarantee. Other scholars have proposed training separate classifiers for different types of pronunciation errors: Aissiou M et al. proposed genetic-algorithm-improved acoustic and speech decoding models, using speech parameter vector classification and vocal tract parameter coefficient extraction to improve classification accuracy, and Ran D et al. analyzed phoneme-level speech and acoustic models to correct English accent pronunciation. Gang Z utilized artificial emotion recognition and high-speed hybrid models to evaluate the quality of phonemes in speech acoustic modeling. These approaches achieve significant recognition effects but require more training time. A small number of scholars have designed speech evaluation systems, but it is difficult for such systems to cover all dimensions related to pronunciation quality. Unlike previous approaches, this study proposes multi-dimensional speech feature recognition, fusion, and algorithmic processing under a mainstream speech recognition framework, taking into account differences in learners' mother-tongue accents, prosody, and other phonetic characteristics. The proposed pronunciation evaluation technology can detect pronunciation errors while fully considering acoustic phonetics characteristics, matching the pronunciation habits of Chinese learners of spoken English.

3. Design of English Pronunciation Evaluation Method Based on Acoustic Speech Recognition Model

Acoustic phonetics features refer to various characteristics of speech signals, such as fundamental frequency and formant frequency. These features reflect the sound quality and pronunciation accuracy of speech, and understanding them across different reading results is important for distinguishing pronunciation effects [17]. This study evaluates the quality of English read-aloud pronunciation based on acoustic speech features, implemented in three parts: feature extraction, pronunciation error detection, and fusion-based quality evaluation. The aim is to provide technical means for evaluating the quality of spoken English.

3.1. Language Model Design for Acoustic Feature Extraction

The Hidden Markov Model (HMM), as a temporal statistical model, is widely used in fields such as speech recognition, bioinformatics, and pattern recognition due to its robust statistical foundation and scalability. An HMM models stochastic processes with hidden states and infers hidden state sequences from a series of observations; the hidden states and observed values are its two key random variables [18]. An HMM consists of an initial state probability distribution, a state transition probability matrix, and an observation probability matrix. Its dual randomness effectively describes the time-varying characteristics of speech signals, analyzing the acoustic characteristics of short-term speech frames. The study therefore uses HMM for acoustic modeling. When designing the pronunciation quality evaluation system, speech signals are preprocessed with pre-emphasis, framing, and windowing to condition the signals and extract acoustic features. In speech production, the high-frequency components are attenuated by factors such as lip radiation and external noise, so pre-emphasis is used to boost the high-frequency components and balance the signal across frequency bands. The speech signal is pre-emphasized through a first-order high-pass filter, as in Eq. (1).

(1)
$ y(n) = x(n) - \mu x(n - 1). $

In Eq. (1), $y(n)$ is the output signal, $\mu$ is the pre-emphasis coefficient, and $x(n)$ is the input speech signal. Because speech signals are non-stationary, their spectral characteristics change rapidly over time; therefore, after pre-emphasizing the continuous signal, this study divides it into short-term frames for analysis, and windowing is applied to reduce the interference of framing on the statistical characteristics of the signal. Eq. (2) gives the window function.

(2)
$ w(n) = 0.54 - 0.46 \cos\left(\frac{2\pi n}{N}\right). $

In Eq. (2), $n$ is the sample index, $w(n)$ is the Hamming window, and $N$ is the window length. During acoustic feature extraction, speech signals are transformed into feature-vector parameters. Mel Frequency Cepstrum Coefficients (MFCC) match human auditory perception well and have low computational complexity [19]. MFCC converts linear frequency to Mel frequency using a set of triangular band-pass filters distributed non-uniformly along the linear frequency axis, then processes the speech signal in the Mel frequency domain. Each triangular band-pass filter yields a logarithmic energy output, and taking the inverse discrete Fourier transform of the log filter energies improves speech processing performance. The MFCC frequency conversion is expressed as Eq. (3).

(3)
$ m = 2595 \lg\left(1 + \frac{f}{700}\right). $

In Eq. (3), $f$ is the linear frequency and $m$ is the Mel frequency. Fig. 1 shows the MFCC feature extraction process.

Fig. 1. MFCC feature extraction flowchart.

../../Resources/ieie/IEIESPC.2026.15.1.67/fig1.png
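The preprocessing and Mel-conversion steps above (the first-order pre-emphasis filter $y(n) = x(n) - \mu x(n-1)$, the Hamming window, and the Mel mapping) can be sketched in Python. The coefficient mu = 0.97, the frame length of 400 samples, and the hop of 160 samples are illustrative assumptions, not values specified in the paper.

```python
import numpy as np

def pre_emphasis(x, mu=0.97):
    """Eq. (1): y(n) = x(n) - mu * x(n-1); boosts high frequencies."""
    y = np.empty_like(x, dtype=float)
    y[0] = x[0]
    y[1:] = x[1:] - mu * x[:-1]
    return y

def hamming(N):
    """Eq. (2): w(n) = 0.54 - 0.46 * cos(2*pi*n / N)."""
    n = np.arange(N)
    return 0.54 - 0.46 * np.cos(2 * np.pi * n / N)

def frame_signal(x, frame_len, hop):
    """Split the signal into overlapping short-time frames and window them."""
    n_frames = 1 + (len(x) - frame_len) // hop
    w = hamming(frame_len)
    return np.stack([x[i * hop : i * hop + frame_len] * w
                     for i in range(n_frames)])

def hz_to_mel(f):
    """Eq. (3): m = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)
```

Each windowed frame would then go through the FFT, Mel filter bank, log, and DCT stages of Fig. 1 to yield the MFCC vectors.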

The preprocessed speech signal is subjected to a fast Fourier transform to obtain the spectral energy. A Mel filter bank then performs a weighted sum and frequency-domain transformation of the spectrum. The logarithm of each filter output is subjected to the Discrete Cosine Transform (DCT) to decorrelate the outputs. Considering the dynamic nature of the speech signal, the resulting cepstral coefficients are differentiated to obtain the final MFCC feature vectors [20]. The study uses HMM to model phonemes and their sequences in speech and uses state networks to achieve probabilistic connection of phoneme sequences, thereby extracting signal feature vectors. A common acoustic model for speech recognition is the Gaussian Mixture Model-Hidden Markov Model (GMM-HMM), shown in Fig. 2.

Fig. 2. GMM-HMM acoustic model.

../../Resources/ieie/IEIESPC.2026.15.1.67/fig2.png

The GMM fits the probability distribution of the data. When recognizing and decoding, the language model applies prior knowledge of word sequences to delete mismatched words, reduce data volume, limit the search space, and improve decoding speed. The language model assigns probabilities to word sequences or sentences, computed via the chain rule in Eq. (4).

(4)
$ P(W) = \prod_{i=1}^{s} P(w_i | w_1, w_2, w_3, \dots w_{i-1}). $

In Eq. (4), $s$ is the sequence length and $P(w_i \mid w_1, w_2, w_3, \dots, w_{i-1})$ is the conditional probability of word $w_i$ given its preceding words. When the word sequence is long, N-gram language models are used to simplify the conditional probabilities. In the speech recognition system, a pronunciation dictionary maps each word to its phoneme sequence. For the analysis of English pronunciation, an information search network is established; Fig. 3 is a schematic diagram of its structure.

Fig. 3. Three layer search network.

../../Resources/ieie/IEIESPC.2026.15.1.67/fig3.png

In Fig. 3, the three-layer search network consists of word, phoneme, and HMM state layers, which respectively realize the connection of text word sequences, the phoneme mapping of word sequences, and the acoustic statistical representation. Constraints and guidance information are then designed to align the speech with the given text, and the hidden state sequence is found with the Viterbi algorithm. Based on dynamic programming, Viterbi obtains the optimal state path by recursion and backtracking over the maximum probability of the state observations at each time step. For large-vocabulary search, the study uses Viterbi beam search to improve computational efficiency and avoid time-consuming path search: a heuristic function with a set beam width constrains the node paths, deleting low-likelihood nodes during traversal while expanding high-likelihood ones [21, 22]. This speeds up finding the optimal path while avoiding the resource waste of an oversized search space, with the beam width limiting the nodes retained in each iteration.
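As an illustration of the beam-pruned Viterbi search described above, the following minimal sketch decodes a small dense HMM. The paper's actual network is the three-layer word/phoneme/state structure of Fig. 3, and the beam value here is an illustrative assumption.

```python
import numpy as np

def viterbi_beam(log_pi, log_A, log_B, beam=10.0):
    """Beam-pruned Viterbi decoding.

    log_pi: (S,) initial state log-probabilities
    log_A:  (S, S) transition log-probabilities
    log_B:  (S, T) per-frame emission log-likelihoods
    beam:   states scoring more than `beam` below the current best
            are pruned before expansion (low-likelihood node deletion)
    """
    S, T = log_B.shape
    delta = log_pi + log_B[:, 0]          # best score ending in each state
    back = np.zeros((T, S), dtype=int)    # backpointers
    for t in range(1, T):
        alive = delta >= delta.max() - beam
        scores = np.where(alive[:, None], delta[:, None] + log_A, -np.inf)
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_B[:, t]
    # backtrack the optimal state path
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(delta.max())
```

A wider beam keeps more candidate paths at each frame (closer to exact Viterbi); a narrower beam trades optimality for speed.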

When GMM-HMM trains observation features in hidden states, its likelihood score considers only the current frame and ignores the influence of other states, which leads to some information loss [23]. Therefore, cluster analysis is conducted on the acoustic features, and the features are mapped to the hidden states of the HMM to obtain features under supervised state classes. To determine the probability model of the acoustic features and fully utilize state-space information, the study uses maximum likelihood clustering to generate the supervised state analysis in Eq. (5).

(5)
$ d(x, ss_i) = P(x \mid ss_i) = g(x \mid \mu_{ss_i}, \Sigma_{ss_i}). $

In Eq. (5), $d(x, ss_i)$ is the likelihood-based distance between the observed speech input feature $x$ and the supervised state class $ss_i$, and $g$ is the state-likelihood score model, a Gaussian with mean $\mu_{ss_i}$ and covariance $\Sigma_{ss_i}$. Maximum likelihood linear regression maximizes the likelihood of the adaptation data by estimating linear transformations of the model parameters, adjusting the acoustic model to match the test speech and achieving good results with only a small amount of data. When adjusting the acoustic model, the state Gaussian distributions of the HMM share the same transformation matrix.
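Eq. (5) scores a feature against each supervised state class with a Gaussian likelihood. A minimal sketch, assuming diagonal covariances for simplicity (the paper does not state the covariance structure):

```python
import numpy as np

def gaussian_log_score(x, mean, var):
    """Log-density of a diagonal-covariance Gaussian: the score model
    g(x | mu_ss, Sigma_ss) of Eq. (5)."""
    x, mean, var = map(np.asarray, (x, mean, var))
    return float(-0.5 * np.sum(np.log(2 * np.pi * var)
                               + (x - mean) ** 2 / var))

def assign_state_class(x, means, variances):
    """Assign feature x to the supervised state class ss_i with the
    highest likelihood d(x, ss_i) = P(x | ss_i)."""
    scores = [gaussian_log_score(x, m, v)
              for m, v in zip(means, variances)]
    return int(np.argmax(scores))
```
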

3.2. Design of English Pronunciation Error Detection Algorithm

For pronunciation error detection, the study takes the phonemes of English word pronunciation as the object of analysis. Different detection strategies are adopted depending on whether a mispronounced phoneme is confused with other vowel phonemes or with other phonemes: detection based on differences in acoustic phonetics characteristics, and detection by comparing confidence scores with phoneme thresholds. The study extracts acoustic features from preprocessed English pronunciation, then uses the acoustic model to align the features with the given text, distinguishing the boundaries of different phoneme segments. On this basis, the standard score required for phoneme error detection is calculated to obtain the final detection result [24]. The common scoring algorithms for detecting pronunciation errors in speech evaluation are the log-likelihood algorithm, the log posterior probability algorithm, and the Goodness of Pronunciation (GOP) algorithm. The log-likelihood algorithm defines the likelihood of a phoneme based on the state transition and emission probabilities of the HMM. It mainly uses the similarity between the learner's pronunciation and the standard pronunciation as the judgment criterion and depends heavily on the length of the speech [25, 26]. The score differences across phoneme segments reflect the different contributions of individual frame scores to the whole segment; however, because external speech factors inevitably interfere, this algorithm spends additional computation on normalization, limiting detection performance. The log posterior probability is calculated from the probability scores of all frames of the phoneme, as in Eq. (6).

(6)
$ LPP(q_i) = \sum_{t=\tau_i}^{\tau_i + d_i - 1} \log P(q_i | o_t) P(q_i) $

In Eq. (6), $q_i$ is the phoneme, $t$ is the frame index, and $P(q_i)$ is the prior probability. $o_t$ is the observation vector of frame $t$, $P(q_i \mid o_t)$ is the posterior probability of the phoneme given the observation, and $\tau_i$ and $d_i$ are the start time and duration of the phoneme. $LPP(q_i)$ is the log posterior probability score over all frames of the phoneme. This algorithm focuses on how standard the learner's pronunciation is, but it is sensitive to changes in the speech spectrum caused by individual differences or transmission-channel variation, which can interfere with the results. The GOP score is essentially a transformation of the log posterior probability: it belongs to the posterior probability family and computes a quantitative accuracy score for a given speech segment. In the calculation, the prior probabilities of different phonemes are assumed equal and the denominator sum is approximated by its maximum term, simplifying the formula. With this detection strategy, the accuracy of learners' pronunciation can be compared; the intuitive indicator is the standard score, where a high score indicates accurate phoneme pronunciation. In practice, a global threshold is designed and compared with the GOP score: if the GOP score exceeds the threshold, the pronunciation is judged correct; otherwise there is a pronunciation problem. However, in actual detection the GOP score ranges differ across phonemes, so a single threshold degrades detection performance [27].
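A simplified sketch of this thresholding step: the GOP-style score is taken here as the average frame log-posterior of the aligned segment (a common simplification, not necessarily the paper's exact formula), and each segment is compared against a decision threshold.

```python
import numpy as np

def gop_score(frame_log_posteriors):
    """Average frame-level log-posterior over the aligned phoneme
    segment (a simplified GOP-style score)."""
    return float(np.mean(frame_log_posteriors))

def detect_errors(segment_scores, threshold=-4.0):
    """Mark each phoneme segment: score >= threshold -> correct.
    The single global threshold shown here is the baseline the text
    criticizes; per-phoneme thresholds plug in the same way."""
    return {ph: s >= threshold for ph, s in segment_scores.items()}
```
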
Therefore, the study introduces speech recognition technology to optimize the model, which may reduce differences in acoustic characteristics so that an English pronunciation model suitable for Chinese students can be constructed. Specifically, Maximum Likelihood Linear Regression-Maximum A Posteriori (MLLR-MAP) adaptation is used to calculate the scores of phoneme segments in spoken speech, and segments above the threshold serve as adaptation data to adjust the model parameters. To reduce the mismatch between the acoustic model and the detected speech, a separate threshold is trained for each type of phoneme. Likelihood estimation often ignores information about competing phonemes in acoustic modeling, limiting its ability to distinguish easily confused phonemes. Therefore, this study classifies the long and short vowel phonemes that are easily confused during read-aloud pronunciation using discriminative acoustic phonetics features. Fig. 4 is a schematic diagram of the framework for detecting easily confused phonemes.

Fig. 4. Schematic diagram of the framework for detecting easily confused phonemes in word pronunciation.

../../Resources/ieie/IEIESPC.2026.15.1.67/fig4.png

After the English read-aloud pronunciation features are extracted, the acoustic features are represented by MFCC features, while the phonetics features are extracted with speech analysis software and cover three aspects: formants, fundamental frequency, and phoneme-segment duration. Mispronunciation detection of confused phonemes is cast as a binary classification problem, and a Support Vector Machine (SVM) is used to obtain good classification performance on the training data and produce the phoneme detection results. The differing dimensions and frame counts of the MFCC feature vectors described above make it difficult to feed them uniformly into the classifier [28]. To reduce the differences in frame counts while preserving the phoneme's acoustic information, this study takes the column-wise mean of the MFCC feature matrix to obtain statistical MFCC features, which reduces the feature dimension while retaining the mean of all frame feature vectors. Table 1 shows the pronunciation differences between long and short vowels.

Table 1. The pronunciation of long and short vowels.

Pronunciation point | Tongue position | Rounding/spreading of the lips
Long vowel | Tongue placed naturally, tip against the lower gums, front of the tongue more forward and high | Flat, spread lips; tense oral muscles
Short vowel | Tip of the tongue against the lower gums, hard palate lifted, tongue position slightly lower and slightly back | Lips slightly open, between flat and neutral; loose oral muscles

The study normalizes the three dimensions of phonemic features, represented by Eq. (7), to avoid interference from individual speech differences on the measurement results of phonemic features.

(7)
$ normalized\ duration = duration \times articulation\ rate. $

In Eq. (7), $normalized\ duration$ is the normalized duration of the phoneme segment and $articulation\ rate$ is the articulation rate, i.e., the number of phonemes per unit time excluding pauses.
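The statistical MFCC pooling and the duration normalization of Eq. (7) reduce to a few lines; a minimal sketch:

```python
import numpy as np

def statistical_mfcc(mfcc_frames):
    """Column-wise mean of a (frames x dims) MFCC matrix, giving one
    fixed-length vector regardless of the segment's frame count."""
    return np.asarray(mfcc_frames, dtype=float).mean(axis=0)

def normalized_duration(duration, articulation_rate):
    """Eq. (7): normalized duration = duration * articulation rate,
    where the rate excludes phonemes produced during pauses."""
    return duration * articulation_rate
```

The pooled vector and the normalized duration (together with formant and fundamental-frequency measures) form the fixed-length input to the SVM classifier.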

3.3. Design of Quality Evaluation for English Pronunciation

For a comprehensive evaluation of students' English pronunciation, this study extracts evaluation features from three aspects: pronunciation standardness, fluency, and prosody, and then fuses them with SVR. The SVR algorithm maps data into a high-dimensional feature space via a nonlinear function for processing. For pronunciation fluency, evaluation features are extracted within the speech recognition framework, and a regression scoring model maps the fused features to the final fluency score. The study selects speech rate, articulation rate, pronunciation time ratio, mean run length, and mean pause length as fluency features. On this basis, the word segment duration ratio is introduced for fluency analysis, as in Eq. (8).

(8)
$ wordDurationRatio = \frac{1}{n} \sum_{i=1}^{n} \frac{dur_i}{Dur_i}. $

In Eq. (8), $i$ indexes the word, $dur_i$ is the actual duration of the word when read aloud, $Dur_i$ is the corresponding standard pronunciation duration, and $n$ is the total number of words in the text. Once students have mastered pronunciation to a certain extent, their intonation can be evaluated further. Pronunciation and intonation differ significantly across texts when reading aloud, so the evaluation must ensure consistency between the learner's pronunciation and the reference standard pronunciation of the text. Pitch feature sequences are compared, their similarity is calculated, and intonation scores are obtained using the rating mapping model in Eq. (9).

(9)
$ score = \frac{100}{1 + a'(dist)^{b'}}. $

In Eq. (9), $dist$ is the DTW distance between the two pitch feature sequences, $a'$ and $b'$ are trained parameters, and $score$ is the final intonation score, ranging from 0 to 100. Fusing the feature data allows a better evaluation of students' read-aloud pronunciation quality. Traditional linear regression ignores the nonlinear relationship between evaluation features and manual scores in actual pronunciation quality evaluation, which distorts the results. Although some nonlinear regression algorithms represent nonlinear relationships better, they require large training samples and risk overfitting or extreme-point problems [29]. Therefore, the study chooses SVR to map the feature scores to the target. SVR is the application of SVM to regression: it establishes a functional relationship between input and output on a limited training dataset, finding a regression function that minimizes the prediction error on the training samples. The SVR regression function is given by Eq. (10).

(10)
$ f(x) = \langle w, \Phi(x) \rangle + b. $

In Eq. (10), $w$ is the weight vector, $b$ is the bias term, $\Phi(x)$ is the nonlinear mapping of the data $x$, and $\langle \cdot, \cdot \rangle$ is the inner product. The $\varepsilon$-insensitive loss function in SVR defines a positive tolerance: when the regression function's prediction error on the actual target lies within this tolerance, the loss is 0. To handle deviations beyond the insensitive zone on either side, slack variables are introduced with the constraint conditions in Eq. (11).

(11)
$ \begin{cases} y_i - f(x_i) \le \varepsilon + \xi_i, \\ f(x_i) - y_i \le \varepsilon + \xi_i^*, \\ \xi_i, \xi_i^* \ge 0. \end{cases} $

In Eq. (11), $\varepsilon$ is the insensitivity tolerance, $\xi_i$ and $\xi_i^*$ are the slack variables, $f(x_i)$ is the regression function, and $y_i$ is the target value. When setting the objective function, a penalty coefficient $C$ ($C > 0$) for prediction errors is introduced, and the objective is transformed into a Lagrangian function using Lagrange multipliers. Solving the partial derivatives under the constraint $0 < \alpha_i^* < C$ yields the final SVR function in Eq. (12).

(12)
$ f(x) = \sum_{i=1}^{m} (\alpha_i - \alpha_i^*) K(x_i, x) + b. $

In Eq. (12), $K(x_i, x)$ is the kernel function. SVR replaces the inner product with a kernel function, effectively avoiding heavy computation on high-dimensional data. Using the Radial Basis Function (RBF) kernel gives Eq. (13).

(13)
$ K(x, x_i) = \exp(-\gamma \|x_i - x\|^2). $

In Eq. (13), $\gamma$ is the kernel parameter. The RBF is a radially symmetric scalar function, typically defined as a monotonic function of the Euclidean distance between any point in space and a center, and its effect is usually local. When evaluating pronunciation quality with SVR, the collected pronunciation data first undergo standardness, fluency, and prosodic feature extraction and score calculation. The feature scores are then separately normalized using a cubic polynomial function to keep the processing interval consistent with the manual evaluation partition [30]. The multi-dimensional evaluation features constructed in this experiment are input into the SVR training sample set to train the parameters of the scoring model, with the manual scoring results as the output. The evaluation features under test are fused with SVR to obtain a comprehensive evaluation of pronunciation quality. The quality evaluation model for English reading pronunciation is designed and constructed in Fig. 5.
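As a brief illustration of the ideas behind Eqs. (11) and (13), the following sketch computes the $\varepsilon$-insensitive loss and the RBF kernel on hypothetical values; the specific numbers and default parameters here are illustrative, not taken from the paper.

```python
import numpy as np

def eps_insensitive_loss(y_true, y_pred, eps=0.1):
    """Idea of Eq. (11): loss is zero inside the eps tube, linear outside."""
    return np.maximum(np.abs(y_true - y_pred) - eps, 0.0)

def rbf_kernel(x, xi, gamma=0.5):
    """Eq. (13): K(x, x_i) = exp(-gamma * ||x_i - x||^2)."""
    x, xi = np.asarray(x, float), np.asarray(xi, float)
    return float(np.exp(-gamma * np.sum((xi - x) ** 2)))

# Errors of 0.05, 0.5, 0.2 against a 0.1-wide insensitive tube.
print(eps_insensitive_loss(np.array([1.0, 2.0, 3.0]),
                           np.array([1.05, 2.5, 2.8])))  # [0.  0.4 0.1]

# Identical points give K = 1; the kernel decays with distance (local effect).
print(rbf_kernel([1.0, 2.0], [1.0, 2.0]))  # 1.0
print(rbf_kernel([1.0, 2.0], [3.0, 4.0]))  # exp(-4) ~ 0.0183
```

The first error (0.05) falls inside the tube and incurs no loss, matching the first case of Eq. (11).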

Fig. 5. Schematic diagram of the overall structure of English pronunciation quality evaluation.

../../Resources/ieie/IEIESPC.2026.15.1.67/fig5.png

In Fig. 5, the structure comprises five stages: speech signal preprocessing, acoustic feature extraction, Viterbi search, pronunciation error detection, and pronunciation quality evaluation. The evaluation covers pronunciation error detection and quality evaluation, analyzed in terms of reading standardness, fluency, pitch, and pronunciation.
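The feature-fusion step of the pipeline can be sketched with scikit-learn's SVR (RBF kernel, penalty $C$, $\varepsilon$-tube). The feature names, synthetic data, and hyperparameter values below are illustrative assumptions, not the paper's actual configuration or data.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Hypothetical normalized evaluation features per utterance:
# [standardness, fluency, pitch, rhythm], each in [0, 1].
X = rng.uniform(0.0, 1.0, size=(200, 4))
# Synthetic "manual scores": a nonlinear blend of the features plus noise.
y = 5.0 * np.tanh(X @ np.array([0.4, 0.3, 0.2, 0.1])) \
    + rng.normal(0.0, 0.05, 200)

model = SVR(kernel="rbf", C=10.0, epsilon=0.1, gamma="scale")
model.fit(X[:160], y[:160])              # 80% training split
pred = model.predict(X[160:])            # 20% test split
corr = np.corrcoef(pred, y[160:])[0, 1]  # agreement with "manual" scores
print(f"test correlation: {corr:.3f}")
```

In practice the inputs would be the normalized feature scores described above and the targets the averaged manual ratings; the RBF kernel lets the fused score capture the nonlinear feature-rating relationship that a linear regression would miss.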

To avoid excessive reliance on hyperparameters and heuristic design in the proposed method, a computational framework based on speech frames is proposed to reduce the number of DTW iterations. When testing stored speech sequence data, the external bus and memory are used to compute the lower-bound function and match speech feature sequences. A dynamic time warping calculation unit, scheduled by the measurement and control module, then computes and compares all DTW distances and outputs the minimum-distance result. In the SVR stage, acoustic feature extraction and DTW-based data preprocessing reduce the computational complexity and improve the model's generalization ability.
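A minimal sketch of the DTW distance together with an LB_Keogh-style lower-bound filter follows; the paper does not specify its lower-bound function, so the envelope-based bound here is an assumption, and the two short pitch sequences are illustrative.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic-programming DTW with squared local cost."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(np.sqrt(D[n, m]))

def lb_keogh(a, b, r=1):
    """LB_Keogh-style lower bound: a cheap filter run before full DTW."""
    lb = 0.0
    for i, v in enumerate(a):
        window = b[max(0, i - r): i + r + 1]
        lo, hi = min(window), max(window)
        if v > hi:
            lb += (v - hi) ** 2
        elif v < lo:
            lb += (lo - v) ** 2
    return lb ** 0.5

ref = [0.0, 1.0, 2.0, 3.0, 2.0]   # reference pitch sequence (illustrative)
qry = [0.0, 1.0, 1.0, 2.0, 3.0]   # learner pitch sequence (illustrative)
print(dtw_distance(ref, qry), lb_keogh(ref, qry))  # 1.0 0.0
```

Because the lower bound never exceeds the true DTW distance, candidates whose bound already exceeds the current best distance can be skipped, which is the pruning idea behind reducing DTW iterations.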

4. Application Results of English Pronunciation Evaluation Based on Acoustic Phonetics Features

As a phonetic language, the accuracy, fluency, and naturalness of English pronunciation are crucial for learners seeking to improve their pronunciation and speaking abilities. Evaluation techniques can help learners correct their pronunciation while reading aloud and provide timely, accurate feedback. A recognition model based on acoustic phonetics features is designed, and its evaluation performance and application results are analyzed to provide learners with more effective tools. The study conducts validation analysis on collected speech data. The experimental environment is a 64-bit Microsoft Windows 7 operating system on a computer with an Intel(R) Core(TM) i3-2130 CPU @ 3.40 GHz and 18 GB of memory. Processing is implemented in Java, with Eclipse as the development tool and Excel 2014 as the data analysis tool. The study measures the intonation similarity between learner speech and reference standard speech by computing the DTW distance between pitch feature sequences in a database. After calculating the DTW distances of the voice data, 80% of the data was selected as the training set and the remaining 20% as the test set. Table 2 shows the specific experimental configuration [31].

Table 2. Experimental configuration.

Front-end preprocessing: pre-emphasis coefficient 0.97; window length 25 ms; frame length 25 ms; frame overlap 15 ms; MFCC feature dimension 39
Knowledge base: acoustic model: monophone model (HMM with three emitting states, GMM with eight Gaussian components); language model: trigram model; pronunciation dictionary: Carnegie Mellon University (CMU) dictionary
Speech recognition engine: Sphinx4 speech recognition engine
Detection database: pronunciation phonemes (CMU ARCTIC corpus); confusable phonemes (L2-ARCTIC corpus); intonation (L2-ARCTIC corpus); pronunciation quality (CMU ARCTIC corpus)
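Under the Table 2 front-end settings, a 25 ms frame with 15 ms overlap implies a 10 ms hop. The sketch below applies pre-emphasis and framing to a synthetic tone; the 16 kHz sampling rate is an assumption, since the paper does not state one.

```python
import numpy as np

def preemphasize(signal, alpha=0.97):
    """Pre-emphasis from Table 2: y[n] = x[n] - 0.97 * x[n-1]."""
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

def frame_signal(signal, sr=16000, frame_ms=25, overlap_ms=15):
    """25 ms frames with 15 ms overlap (10 ms hop), per Table 2."""
    frame_len = int(sr * frame_ms / 1000)          # 400 samples at 16 kHz
    hop = frame_len - int(sr * overlap_ms / 1000)  # 160-sample hop
    n_frames = 1 + (len(signal) - frame_len) // hop
    return np.stack([signal[i * hop: i * hop + frame_len]
                     for i in range(n_frames)])

x = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)  # 1 s test tone
frames = frame_signal(preemphasize(x))
print(frames.shape)  # (98, 400)
```

Each row would then be windowed and converted to the 39-dimensional MFCC features listed in the table.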

The study selected 30% of the database as the test set and 70% as the training set. The correlation between the test results and the manual scoring results was calculated, and the final result was taken as the mean over ten-fold cross-validation. The data before and after speech signal preprocessing are compared in Fig. 6.
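The ten-fold averaging protocol described above can be sketched as follows. The linear stand-in predictor and synthetic data are hypothetical placeholders (the paper's actual model is the SVR scorer); only the fold-splitting and correlation-averaging logic is the point here.

```python
import numpy as np

def ten_fold_mean_correlation(fit, X, y, k=10, seed=0):
    """Mean Pearson correlation between predictions and manual scores
    over k cross-validation folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, k)
    corrs = []
    for f in folds:
        train = np.setdiff1d(idx, f)
        model = fit(X[train], y[train])
        corrs.append(np.corrcoef(model(X[f]), y[f])[0, 1])
    return float(np.mean(corrs))

# Hypothetical predictor: least-squares linear fit returned as a closure.
def linear_fit(Xtr, ytr):
    w, *_ = np.linalg.lstsq(np.c_[Xtr, np.ones(len(Xtr))], ytr, rcond=None)
    return lambda Xte: np.c_[Xte, np.ones(len(Xte))] @ w

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(0, 0.1, 100)
print(round(ten_fold_mean_correlation(linear_fit, X, y), 3))
```

Averaging across folds reduces the sensitivity of the reported correlation to any single train/test split.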

Fig. 6. Changes in sound frequency signal before and after preprocessing.

../../Resources/ieie/IEIESPC.2026.15.1.67/fig6.png

Fig. 6 compares the sound frequency signals before and after preprocessing. In Fig. 6(a) (before preprocessing), the maximum frequency of the audio signal reached 9000 Hz with large overall fluctuations; below sequence number 500, most signals concentrated in the high- and low-frequency bands, while between sequence numbers 500-900 there was obvious noise data with large frequency changes and fluctuations. In Fig. 6(b) (after preprocessing), the frequency variation of the signal was markedly improved, with the range basically controlled within $\pm$6000 Hz; the curve changed more smoothly, and the noise reduction effect was significant. This effectively avoids recognition errors caused by noise interference with the fundamental frequency. An English pronunciation quality evaluation was then conducted on the proposed DTW-SVR. First, the accuracy of vocabulary and syllable pronunciation was analyzed, with logarithmic posterior probability and single GOP as comparison algorithms. Fig. 7 shows the test results.

Fig. 7. Accuracy detection results of vocabulary pronunciation and syllable pronunciation.

../../Resources/ieie/IEIESPC.2026.15.1.67/fig7.png

In Fig. 7, the proposed fusion algorithm performed better in detecting vocabulary pronunciation errors than the traditional logarithmic posterior probability algorithm, with a maximum accuracy of 0.4. When the accuracy was below 0.4, GOP's recall was higher than that of the traditional algorithm, and its improvement in error detection performance was relatively stable. The proposed fusion method achieved higher precision and recall values than the other algorithms and can effectively detect pronunciation errors. For syllable pronunciation accuracy, the proposed algorithm likewise showed good precision and recall. Because speech signal feature extraction is easily affected by external objective factors, which interferes with recognition accuracy, the proposed feature fusion method was compared with single-feature extraction test data. The formant, fundamental frequency, and phoneme segment duration characteristics of the speech data were detected and analyzed. Fig. 8 shows the detection results.

Fig. 8. Results of feature recognition accuracy under different signal-to-noise ratios.

../../Resources/ieie/IEIESPC.2026.15.1.67/fig8.png

A small signal-to-noise ratio (SNR) indicates significant differences in feature information. In Fig. 8, the speech information extraction performance of the fused features was significantly higher than that of the cepstral coefficient features and prosodic features; at an SNR of 6 dB, its accuracy differed from the other two approaches by over 10%. As the SNR increased, the average recognition accuracy of the hybrid feature algorithm stayed above 85%, which was 68.34% and 52.16% higher than that of the MFCC and GOP features, respectively. This indicates that fused features can better recognize speech signals and distinguish different features, with good stability. A speech quality evaluation analysis was then conducted on the proposed acoustic recognition model. The scoring results were analyzed in terms of intonation, speed, pitch accuracy, and rhythm, and compared with manual scoring results in Fig. 9. Five experts from the fields of linguistics and phonology were invited to form a scoring team. Each expert independently rated the recorded test database data on a five-point scale, with a maximum score of 5 being the best result. The average of the expert ratings was then used as the final result for the test data. Experts set rules based on their own experience when making rating decisions; given individual evaluation criteria, the evaluation of English pronunciation may be influenced to some extent by factors such as recording quality and subjective grading standards, but the results retain a degree of professionalism and credibility.
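For reference, the SNR (in dB) used along the x-axis of Fig. 8 can be computed from signal and noise power as below; the test tone and noise level here are synthetic examples, not the paper's data.

```python
import numpy as np

def snr_db(signal, noise):
    """SNR in dB: 10 * log10(signal power / noise power)."""
    p_signal = np.mean(np.square(signal))
    p_noise = np.mean(np.square(noise))
    return 10.0 * np.log10(p_signal / p_noise)

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 220 * np.arange(8000) / 8000)  # 1 s tone
noise = 0.25 * rng.normal(size=8000)                       # additive noise
print(round(snr_db(clean, noise), 1))  # around 9 dB for these settings
```

A sine of amplitude 1 has power 0.5 and the noise power here is about 0.0625, so the ratio is roughly 8, i.e. about 9 dB.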

Fig. 9. Pronunciation quality results of machine rating and manual rating.

../../Resources/ieie/IEIESPC.2026.15.1.67/fig9.png

Fig. 9 shows the confusion matrices of pronunciation quality ratings for the three algorithms, with machine ratings on the horizontal axis and manual ratings on the vertical axis. There was a significant spread in the GMM-HMM scoring results, with accuracies of 72.313%, 72.425%, 65.489%, and 70.265% on the four aspects of English pronunciation: intonation, speed, pitch accuracy, and rhythm. The rating accuracy of the statistical speech recognition model was above 70%, with a maximum of 79.504%. The proposed acoustic model recognition algorithm achieved an accuracy above 85% on all four aspects, with a maximum of 90.140%. The consistency between the algorithms' scores and the manual scores was then compared using the Pearson Correlation Coefficient (PCC) in Fig. 10.
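The per-aspect accuracies in Fig. 9 come from comparing machine ratings against manual ratings; a sketch with hypothetical five-point ratings for a single aspect follows (the rating values are invented for illustration).

```python
import numpy as np

def rating_accuracy(machine, manual, n_levels=5):
    """Accuracy and confusion matrix for five-point machine vs manual ratings."""
    cm = np.zeros((n_levels, n_levels), dtype=int)
    for m, h in zip(machine, manual):
        cm[m - 1, h - 1] += 1       # row: machine rating, column: manual rating
    acc = np.trace(cm) / cm.sum()   # diagonal = exact agreements
    return acc, cm

# Hypothetical ratings for one aspect (e.g., intonation), scores 1-5.
machine = [5, 4, 4, 3, 5, 2, 4, 3, 5, 4]
manual  = [5, 4, 3, 3, 5, 2, 4, 4, 5, 4]
acc, cm = rating_accuracy(machine, manual)
print(f"accuracy: {acc:.1%}")  # accuracy: 80.0%
```

Off-diagonal mass in the matrix shows where the machine systematically over- or under-rates relative to the experts.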

Fig. 10. Pearson correlation coefficient results of scores under different algorithms.

../../Resources/ieie/IEIESPC.2026.15.1.67/fig10.png

The closer the PCC is to 1, the stronger the linear correlation between the two variables. In Fig. 10, the proposed algorithm showed good fitting performance in quality rating, with a PCC of 0.934, far higher than the 0.816 and 0.893 of the GMM-HMM and statistical speech recognition models. The proposed DTW-SVR was then analyzed for pronunciation fluency in Table 3.
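The PCC between machine and manual scores can be computed directly from centered dot products; the paired scores below are hypothetical.

```python
import numpy as np

def pearson_cc(x, y):
    """Pearson correlation coefficient between machine and manual scores."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

# Hypothetical paired scores: near-linear agreement gives a PCC close to 1.
machine = [3.1, 4.0, 2.2, 4.8, 3.6]
manual  = [3.0, 4.1, 2.0, 5.0, 3.5]
print(round(pearson_cc(machine, manual), 3))  # 0.999
```

A perfectly linear relation (e.g., doubling every score) yields a PCC of exactly 1, which is why PCC suits consistency checks between two scorers.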

Table 3. Results of different algorithms for reading aloud pronunciation fluency.

Evaluation features GMM-HMM Statistical phonological model DTW-SVR
Speaking speed 0.431 0.681 0.851
Pronunciation speed 0.389 0.639 0.809
Pronunciation time ratio 0.412 0.662 0.832
Average flow length 0.408 0.658 0.828
Average pause length 0.352 0.602 0.772
Phonological segment length 0.403 0.653 0.823
Word segment length ratio 0.419 0.669 0.839
Pronunciation fluency overall performance 0.551 0.801 0.971

Table 3 shows the correlation of different algorithms' scores with manual scores on the pronunciation fluency features. GMM-HMM scored below 0.5 on all seven feature dimensions, with an overall fluency correlation of 0.551. The statistical speech recognition model's feature scores were all below 0.7, with an overall fluency correlation of 0.801. The proposed DTW-SVR showed good fluency evaluation results, with correlations above 0.75 on every feature, the speaking-speed feature performing best, and an overall correlation of 0.971. These results confirm that DTW-SVR is closest to the manual scoring results and that learners' reading fluency is assessed more reliably under this model. Resource consumption of the proposed algorithm was then analyzed in Fig. 11.

Fig. 11. Time consumption and memory running resource consumption before and after algorithm improvement.

../../Resources/ieie/IEIESPC.2026.15.1.67/fig11.png

In Fig. 11, the pre-improvement model is the traditional speech recognition acoustic model (GMM-HMM), while the improved model introduces the time warping and support vector regression ideas on top of the GMM-HMM model. There was a significant difference in time and resource consumption between the two. Specifically, in Fig. 11(a), the time consumption of the improved acoustic model was generally lower than before the improvement; in evaluations with high information content, its time consumption generally remained below 0.15%. Once the data volume exceeded 30, the proportion of time consumed by the improved model decreased, and the decrease was significantly larger than that of the pre-improvement model. The reason is that high-capacity information contains varied content, and the improved model can use SVR to delimit the data space, reducing excessive computation; additionally, the time warping algorithm measures the similarity between speech signals while accounting for the time dimension, greatly reducing the amount of sequence processing. In Fig. 11(b), the average memory consumed by the improved algorithm over four runs was 1.05%, a significant reduction from the 3.06% average before the improvement.

5. Conclusion

In this research, DTW-SVR is proposed based on acoustic phonetics features to evaluate the quality of English pronunciation, and its application is analyzed. The results confirm that, after preprocessing the sound frequency signal, the frequency variation range is basically controlled within $\pm$6000 Hz and the noise elimination effect is significant. The precision and recall of the fusion method exceed those of the GOP and logarithmic posterior probability algorithms on vocabulary and syllable pronunciation accuracy, and its average recognition accuracy on formant, fundamental frequency, and phoneme segment duration features is above 85%, which is 68.34% and 52.16% higher than that of the MFCC and GOP features, respectively. In pronunciation quality evaluation, the accuracy of GMM-HMM on the four aspects of English pronunciation (intonation, speed, pitch accuracy, and rhythm) is 72.313%, 72.425%, 65.489%, and 70.265%, respectively, while the proposed method's rating accuracy is above 85% on all four aspects, with a maximum of 90.140%. Its PCC reaches 0.934, far higher than the 0.816 and 0.893 of the GMM-HMM and statistical speech recognition models. The proposed DTW-SVR also shows good pronunciation fluency evaluation results, with correlations above 0.75, the speaking-speed feature performing best, and an overall correlation of 0.971, far higher than GMM-HMM (0.551) and the statistical speech recognition model (0.801). Its quality evaluation performance is outstanding, with a time consumption below 0.15% for high-information content and an average memory consumption of 1.05%, a significant reduction from the 3.06% before improvement. The proposed DTW-SVR based on acoustic phonetics features can effectively improve the quality of English pronunciation evaluation, and its results are closer to manual scoring.
The statistical analysis of English intonation evaluation data is an important focus of future research.

References

1 
Rogerson-Revell P. M. , 2021, Computer-assisted pronunciation training (CAPT): current issues and future directions, RELC Journal, Vol. 52, No. 1, pp. 189-205DOI
2 
Khan A. , Sarfaraz A. , 2019, RNN-LSTM-GRU based language transformation, Soft Computing, Vol. 23, No. 24, pp. 13007-13024DOI
3 
Feng S. , Lee T. , 2019, Exploiting cross-lingual speaker and phonetic diversity for unsupervised subword modeling, IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 27, No. 12, pp. 2000-2011DOI
4 
Li N. , 2021, An improved machine learning algorithm for text-voice conversion of English letters into phonemes, Journal of Intelligent & Fuzzy Systems, Vol. 40, No. 2, pp. 2743-2753DOI
5 
Jiao F. , Song J. , Zhao X. , Zhao P. , 2021, A spoken English teaching system based on speech recognition and machine learning, International Journal of Emerging Technologies in Learning, Vol. 16, No. 14, pp. 68-82DOI
6 
Cao D. , Guo Y. , 2020, Algorithm research of spoken English assessment based on fuzzy measure and speech recognition technology, International Journal of Biometrics, Vol. 12, No. 1, pp. 120-129DOI
7 
Wang Y. , Zhao P. , 2020, A probe into spoken English recognition in English education based on computer-aided comprehensive analysis, International Journal of Emerging Technologies in Learning, Vol. 15, No. 3, pp. 223-233DOI
8 
Ran D. , Yingli W. , Haoxin Q. , 2021, Artificial intelligence speech recognition model for correcting spoken English teaching, Journal of Intelligent & Fuzzy Systems, Vol. 40, No. 2, pp. 3513-3524DOI
9 
Piotrowska M. , Czyżewski A. , Ciszewski T. , Korvel G. , Kurowski A. , Kostek B. , 2021, Evaluation of aspiration problems in L2 English pronunciation employing machine learning, The Journal of the Acoustical Society of America, Vol. 150, No. 1, pp. 120-132DOI
10 
Cao D. , Guo Y. , 2020, Algorithm research of spoken English assessment based on fuzzy measure and speech recognition technology, International Journal of Biometrics, Vol. 12, No. 1, pp. 120-129DOI
11 
Song Z. , 2020, English speech recognition based on deep learning with multiple features, Computing, Vol. 102, No. 3, pp. 663-682DOI
12 
Chan J. Y. H. , 2022, The evolution of assessment in English pronunciation: the case of Hong Kong (1978-2018), Language Assessment Quarterly, Vol. 19, No. 1, pp. 1-26DOI
13 
Gang Z. , 2021, Quality evaluation of English pronunciation based on artificial emotion recognition and Gaussian mixture model, Journal of Intelligent & Fuzzy Systems, Vol. 40, No. 4, pp. 7085-7095DOI
14 
Aissiou M. , 2020, A genetic model for acoustic and phonetic decoding of standard Arabic vowels in continuous speech, International Journal of Speech Technology, Vol. 23, No. 2, pp. 425-434DOI
15 
Fang Y. , 2022, Design of oral English intelligent evaluation system based on DTW algorithm, Mobile Networks and Applications, Vol. 27, No. 4, pp. 1378-1385DOI
16 
Fan Y. , Liu L. , 2023, The impact of student learning aids on deep learning and mobile platform on learning behavior, Library Hi Tech, Vol. 41, No. 5, pp. 1376-1394DOI
17 
Dongmei L. , 2021, Design of English text-to-speech conversion algorithm based on machine learning, Journal of Intelligent & Fuzzy Systems, Vol. 40, No. 2, pp. 2433-2444DOI
18 
Deshmukh A. M. , 2020, Comparison of hidden Markov model and recurrent neural network in automatic speech recognition, European Journal of Engineering and Technology Research, Vol. 5, No. 8, pp. 958-965DOI
19 
Karjanto N. , Simon L. , 2019, English-medium instruction calculus in Confucian-heritage culture: flipping the class or overriding the culture?, Studies in Educational Evaluation, Vol. 63, pp. 122-135DOI
20 
Liang H. , 2021, Role of artificial intelligence algorithm for taekwondo teaching effect evaluation model, Journal of Intelligent & Fuzzy Systems, Vol. 40, No. 2, pp. 3239-3250DOI
21 
Huang W. , 2021, Simulation of English teaching quality evaluation model based on Gaussian process machine learning, Journal of Intelligent & Fuzzy Systems, Vol. 40, No. 2, pp. 2373-2383DOI
22 
Gubian M. , Blything R. , Davis C. J. , Jeffery S. B. , 2023, Does that sound right? a novel method of evaluating models of reading aloud: rating nonword pronunciations, Behavior Research Methods, Vol. 55, No. 3, pp. 1314-1331Google Search
23 
Almusharraf A. , 2022, EFL learners' confidence, attitudes, and practice towards learning pronunciation, International Journal of Applied Linguistics, Vol. 32, No. 1, pp. 126-141Google Search
24 
Nazir F. , Majeed M. N. , Ghazanfar M. A. , Maqsood M. , 2023, A computer-aided speech analytics approach for pronunciation feedback using deep feature clustering, Multimedia Systems, Vol. 29, No. 3, pp. 1699-1715DOI
25 
Brena R. F. , Zuvirie E. , Preciado A. , Valdiviezo A. , Gonzalez-Mendoza M. , 2021, Automated evaluation of foreign language speaking performance with machine learning, International Journal on Interactive Design and Manufacturing, Vol. 15, No. 2-3, pp. 317-331DOI
26 
Piotrowska M. , Czyżewski A. , Ciszewski T. , Korvel G. , Kurowski A. , Kostek B. , 2021, Evaluation of aspiration problems in L2 English pronunciation employing machine learning, The Journal of the Acoustical Society of America, Vol. 150, No. 1, pp. 120-132DOI
27 
Savchenko A. V. , Savchenko V. V. , Savchenko L. V. , 2022, Gain-optimized spectral distortions for pronunciation training, Optimization Letters, Vol. 16, No. 7, pp. 2095-2113DOI
28 
Evers K. , Chen S. , 2022, Effects of an automatic speech recognition system with peer feedback on pronunciation instruction for adults, Computer Assisted Language Learning, Vol. 35, No. 8, pp. 1869-1889DOI
29 
Huang W. , Lu J. , Liu J. , 2020, Applications of robust recognition technology in the foreign language speech assessment system, International Journal of High Performance Systems Architecture, Vol. 9, No. 2-3, pp. 87-96DOI
30 
Hai Y. , 2020, Computer-aided teaching mode of oral English intelligent learning based on speech recognition and network assistance, Journal of Intelligent & Fuzzy Systems, Vol. 39, No. 4, pp. 5749-5760DOI
31 
Mokayed H. , Quan T. Z. , Alkhaled L. , Sivakumar V. , 2023, Real-time human detection and counting system using deep learning computer vision techniques, Artificial Intelligence and Applications, Vol. 1, No. 4, pp. 221-229Google Search
Ying Zhang
../../Resources/ieie/IEIESPC.2026.15.1.67/au1.png

Ying Zhang was born in Guangxi Province, China, in 1985. She received her B.A. degree in English from Wuhan University of Science and Technology, Hubei, China, in 2008 and an M.A. degree in Chinese culture from the Hong Kong Polytechnic University, Hong Kong, China, in 2010. From 2012 to 2015, she was a Teaching Assistant in the School of Foreign Studies, Xiangsihu College of Guangxi Minzu University, Guangxi, China. Since 2015, she has been a Lecturer in the School of Foreign Studies, Xiangsihu College of Guangxi Minzu University, Guangxi Province. She is the author of one book and has published more than six articles. Her research interests include pedagogy for English language teaching, second language acquisition, AI-assisted language learning, and translation studies.

Xiaoqian Liang
../../Resources/ieie/IEIESPC.2026.15.1.67/au2.png

Xiaoqian Liang was born in Guangxi Province, China, in 1992. She received her B.A. degree in English from Nanning Normal University, Guangxi, China, in 2011 and an M.A. degree in applied linguistics and language teaching from the University of Southampton, Southampton, England, in 2016. Since 2020, she has been a Teaching Assistant in the School of Foreign Studies, Xiangsihu College of Guangxi Minzu University, Guangxi, China. She has published more than six articles. Her research interests include applied linguistics, second language acquisition, teaching English as a second language, and modern languages.

Shenning Yue
../../Resources/ieie/IEIESPC.2026.15.1.67/au3.png

Shenning Yue was born in Guangxi Province, China, in 1987. He received his B.S. degree in telecommunication from the Royal Melbourne Institute of Technology (RMIT), Australia, in 2013 and an M.S. degree in teaching English to speakers of other languages from the University of Sydney, Australia, in 2021. From 2014 to 2016, he was an Account Manager at SME'S R US Pty Ltd., a consulting firm in Australia. From 2017 to 2021, he was a co-founder and EFL teacher at Dreamweaver Language Center, Guilin, Guangxi Province, China. Since 2022, he has been a Teaching Assistant in the School of Foreign Studies, Xiangsihu College of Guangxi Minzu University, Guangxi Province, China. His research interests include computational linguistics and natural language processing.