
  1. (School of Internet of Things Engineering, JiangSu Vocational College of Information Technology, Wuxi 214153, China)
  2. (School of Microelectronics, JiangSu Vocational College of Information Technology, Wuxi 214153, China)



Keywords: SACNN, Self-attentive, Voiceprint recognition, Optimization, Signal denoising

1. Introduction

The application of deep learning in voiceprint recognition has been inspired by its excellent performance in image and natural language processing. In recent years, due to the advantages of neural networks in feature extraction, an increasing number of deep learning-based voiceprint recognition methods, such as speech feature fusion methods [1] and the ResNet-LSTM model [2], have been proposed. These methods have elevated the performance of voiceprint recognition models to new heights, demonstrating robust performance even in noisy environments.

The acoustic model based on DNNs [3] shows its advantages through outstanding modelling capability on text-dependent speech datasets. Deep learning methods leverage their powerful data processing abilities to generate highly dense representations and accurately align speech at the frame level [4]. This advantage is particularly significant in text-dependent recognition tasks. However, this approach comes with certain costs, such as higher computational complexity than GMM-UBM and i-vector voiceprint models, because DNNs typically have more training parameters than GMM models [5]. In some cases, the training data must be labelled to train a DNN acoustic model with better recognition performance. Despite these challenges, the continuous development of computer and internet technologies now allows large-scale speech databases to be provided for voiceprint recognition research.

Two classic deep learning speaker models, the d-vector [6] and the x-vector [7], were proposed in 2014 and 2017, respectively, building on neural networks' enhanced feature extraction capabilities. In both approaches, neural networks are used to train the voiceprint recognition system, and frame-level speech processing is used to extract speaker voiceprint attributes. Following this design, further acoustic models have appeared, providing a crucial basis for developing end-to-end techniques. Hannah Muckenhirn et al. [8] suggested a speaker verification method that uses a CNN to extract features directly from the input voice signal, cross-validates all of the network model's parameter vectors, and randomly initializes the weights between the hidden and output layers [9]. Meanwhile, end-to-end deep learning methods have started to appear, the most notable being the Deep Speaker model proposed by Baidu's acoustic research team [10].

Later, Google introduced the Transformer model incorporating attention mechanisms, which, combined with a CNN, resulted in the SACNN model. This model has begun to see application in various fields, though its use in voiceprint recognition is still exploratory, with limited research. Traditional voiceprint recognition models often rely on fixed feature extraction and classification methods, which fail to fully use the complex spatiotemporal information in speech signals. In addition, existing models do not perform well when dealing with noise and variable acoustic environments, resulting in reduced recognition accuracy. Based on this, this paper uses the SACNN-Self-attentive model to optimize voiceprint recognition technology. A wavelet algorithm is used to pre-process the speech data for noise reduction. By combining convolutional neural networks and self-attention mechanisms, this method can more effectively capture the subtle features of speech signals. While enhancing the model's attention to essential features, the self-attention mechanism significantly improves robustness to noise and environmental changes. Experiments show that the SACNN model converges well in voiceprint recognition, and its accuracy on the two datasets is 1.12% and 1.24% higher than that of Deep Speaker. The experimental results show that the model exhibits higher accuracy and stability in various test environments, demonstrating its potential in voiceprint recognition technology.

2. Voiceprint Recognition Technology

2.1. Voiceprint Recognition System Framework

From the perspective of disciplinary classification, voiceprint technology is typically categorized under the domain of audio processing techniques. With the emergence of numerous interdisciplinary fields, voiceprint recognition technology can be classified under the broader category of biometric identification. Physiological and behavioral characteristics are the two types of biometric features. Common physiological features include fingerprints, DNA, faces, and retinas, while voiceprints, handwriting, and gait are typically considered behavioral features. The unique voiceprint information carried in speech implies that voiceprint recognition can become an important component of biometric identification technologies, similar to fingerprint recognition and facial recognition, and can serve as a substitute for traditional digital passwords, playing a significant role in various security and encryption-focused domains.

Text-dependent, text-independent, and text-prompted voiceprint recognition are the three categories under which voiceprint recognition, also known as speaker recognition, falls. Text-dependent recognition refers to situations where the same text is spoken during both voiceprint enrollment and testing, which limits its applicability in real-life scenarios. Text-independent recognition, by contrast, is far more widely applicable, since successful recognition in text-independent situations implies that recognition can also be achieved in text-dependent situations. Therefore, all the voiceprint recognition research in this paper is text-independent. Within text-independent speaker recognition, there are two functions: speaker verification and speaker identification. Speaker verification is like opening a door whose lock accepts only one key: it involves finding the registered voiceprint among different test voices, and only when that voiceprint is encountered will the door open.

On the other hand, speaker identification is like retrieving items from a supermarket locker: you input your identification information, and the system automatically distinguishes among multiple stored users and opens the locker that belongs to you. The voiceprint recognition procedure begins by entering the registrant's voiceprint characteristics into a database. Then, the test subject speaks, and the speech signal enters a well-trained model, which outputs the voiceprint features. These features are compared for similarity with the voiceprint features in the database, and an evaluation score is generated. The system determines that the two individuals are identical when the score exceeds a predetermined threshold. If multiple individuals in the database are judged the same as the test subject, the system outputs the identification result of the person with the highest score. If the test subject's score does not exceed the threshold, the voiceprint identification system classifies the person as an unregistered user, while the speaker verification system determines that the registrant and the test subject are not the same person. The model that extracts voiceprint features is trained and optimized on a large amount of speech data using a well-designed network architecture. Fig. 1 shows the speaker recognition system's architecture.

Voiceprint recognition research dates back to the 1960s [11]. Over the following five decades, various advanced scientific technologies have contributed to the development of voiceprint recognition. In numerous acoustic research tasks, for instance, acoustic features such as Mel-frequency cepstral coefficients and linear predictive cepstral coefficients, as well as modelling methods such as vector quantization and dynamic time warping, have been extensively employed [12]. Later, in 2000, Reynolds et al. [13] presented the Gaussian mixture model universal background model (GMM-UBM), built on the Gaussian mixture model (GMM). Since then, this model served as the foundational model for voiceprint recognition for over a decade. A GMM can fit multiple Gaussian density functions into probability density functions of various shapes. Implementing this model involves arranging the mean vector of each Gaussian component in the GMM and combining them into a single vector, which is then used as the voiceprint model; this vector is known as a mean supervector. However, in practical application scenarios, the speech data collected from speakers is limited, yet the GMM requires sufficient data for training to achieve good recognition performance. Therefore, the universal background model (UBM) was created. The UBM uses limited speech data, through adaptive methods, to train a target speaker's voiceprint model.

In recent years, researchers have benefited from the powerful feature extraction capabilities of deep neural networks (DNNs). Thanks to the many deep learning-based techniques that have been developed [15-17], the performance of voiceprint recognition models has been brought to a new level, even in complicated real-world contexts [14].

2.2. Voiceprint Recognition System Process

Voiceprint recognition systems have five components: speech detection, speech preprocessing, acoustic feature extraction, the voiceprint feature extraction model, and feature similarity matching. Regarding the workflow, the voiceprint recognition system can be divided into training, enrollment, and inference stages. The preprocessing in each stage remains consistent, including speech detection to remove silence, a speech preprocessing module for pre-emphasis, framing, and windowing, and an acoustic feature extraction module to transform the speech signal into spectrograms or other acoustic features. The voiceprint feature extraction model undergoes parameter updates during the training stage and uses fixed parameters to generate voiceprint features during the enrollment and inference stages. In the inference stage, the voiceprint features of the speech to be verified and the voiceprint features in the database are input to the similarity matching module to compute the similarity score. The system then determines the outcome based on the output similarity score and a predefined threshold.

Speech detection, also called voice activity detection (VAD), is a method for identifying periods of silence and speech. It analyzes the audio signal and detects changes in the signal (e.g., significant amplitude or changes in frequency content) to determine whether speech is present at the current audio position. Speech detection helps filter out background noise and other non-speech sounds in the audio, thereby improving the accuracy of the voiceprint recognition system. Currently, there are two commonly used speech detection methods: threshold-based and model-based. Threshold-based methods use statistical techniques to calculate the energy values of manually extracted acoustic features and set a reasonable threshold as the decision point, as sketched below. Model-based methods employ machine learning or deep learning models to directly determine the presence of speech in a given time segment.
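For illustration, a minimal threshold-based VAD sketch is given below. It assumes NumPy, a 16 kHz signal framed into 25 ms windows with a 10 ms hop, and a hypothetical decision threshold of 35 dB below the loudest frame; these values are illustrative rather than taken from the paper's system.

```python
import numpy as np

def energy_vad(signal, frame_len=400, hop=160, threshold_db=-35.0):
    """Mark frames whose log-energy is within `threshold_db` dB of the
    loudest frame as speech (simple threshold-based VAD sketch)."""
    # Split the signal into overlapping frames.
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    frames = np.stack([signal[i * hop:i * hop + frame_len] for i in range(n_frames)])
    # Frame-wise log energy in dB, compared against the loudest frame.
    energy_db = 10.0 * np.log10(np.sum(frames ** 2, axis=1) + 1e-10)
    return energy_db > (energy_db.max() + threshold_db)   # boolean mask, one entry per frame
```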

The speech preprocessing process consists of three subprocesses: pre-emphasis, framing, and windowing. Pre-emphasis is a technique used to improve the quality of speech signals. In real-world environments, the power of a speech signal attenuates as the frequency increases. Pre-emphasis boosts the speech signal's high-frequency components to compensate for this attenuation. The calculation process is as follows:

(1)
$ H(z) = 1-\alpha z^{-1}, $

where $\alpha$ is the pre-emphasis coefficient, and its value range is generally (0.9, 1). Pre-emphasis visibly changes the time- and frequency-domain power of the speech signal, boosting its high-frequency components relative to the original signal.
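Eq. (1) corresponds to the difference equation $y[n] = x[n] - \alpha x[n-1]$. A minimal NumPy sketch, assuming the commonly used value $\alpha = 0.97$ (within the (0.9, 1) range stated above), is:

```python
import numpy as np

def pre_emphasis(x, alpha=0.97):
    """Apply H(z) = 1 - alpha*z^{-1}, i.e., y[n] = x[n] - alpha*x[n-1];
    the first sample is kept unchanged."""
    return np.append(x[0], x[1:] - alpha * x[:-1])
```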

Fig. 1. Speaker recognition system architecture diagram.


Fig. 2. Two mainstream acoustic features.


The framing process divides the densely sampled speech signal into short segments called frames. This reduces the data volume per analysis unit and facilitates subsequent signal analysis and processing. Framing has two parameters: frame length and frame interval. Speech signals are typically considered quasi-stationary over 10 to 30 milliseconds, making frame lengths within this range suitable, and the frame interval should be shorter than the frame length to maintain continuity between frames. The windowing process suppresses the Gibbs phenomenon, in which spurious local peaks appear in the spectrum after the transform, and helps mitigate spectral leakage, the trailing effect that spreads energy across the entire frequency band. Windowing multiplies each sample of a frame by a weight determined by a window function; in voiceprint recognition tasks, the Hamming window is commonly used.
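A short sketch of framing and Hamming windowing, assuming NumPy, a 16 kHz sampling rate, a 25 ms frame length, and a 10 ms frame interval (all illustrative values within the ranges discussed above):

```python
import numpy as np

def frame_and_window(x, sr=16000, frame_ms=25, hop_ms=10):
    """Split a signal into overlapping frames and apply a Hamming window to each."""
    frame_len, hop = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    frames = np.stack([x[i * hop:i * hop + frame_len] for i in range(n_frames)])
    return frames * np.hamming(frame_len)   # broadcast the window over all frames
```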

Mel-frequency cepstral coefficients (MFCC) and filterbank coefficients are the two acoustic characteristics that are most frequently utilized in voiceprint recognition. Following the short-time Fourier transform, the spectrum of a frame is used to determine both characteristics. MFCC uses a Mel filterbank to adjust the linear spectrogram so that it better reflects the nonlinear aspects of human hearing. The conversion formula between the Mel frequency scale $f_{mel}$ and the linear frequency scale f is as follows:

(2)
$ f_{mel} = 2595 \times \log_{10} \left(1 + \frac{f}{700}\right). $

After obtaining the Mel frequency-scaled spectrogram, the logarithm function is applied to the magnitude (amplitude) of the spectrogram to correct the loudness. Then, to lessen the correlation between the filterbank characteristics, a discrete cosine transform (DCT) is used. This process yields the MFCC, as Fig. 2(a) shows. MFCC has been the mainstream handcrafted acoustic feature in speech-related applications and is still widely used even in the era of deep learning.

Filterbank coefficients (Fbank), as shown in Fig. 2(b), are the acoustic features obtained by removing the final DCT step from the computation of MFCC. Fbank is the primary feature used in deep learning-based voiceprint recognition. The DCT step removes the correlation between filterbank features, which aligns well with the feature independence assumption in statistical models, particularly Gaussian mixture models (GMMs). However, this assumption is no longer necessary when neural networks are used as models. Removing the DCT step reduces computational complexity and avoids the potential damage this linear transformation can cause to the nonlinear relationships between the original features.
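The relationship between Fbank and MFCC can be sketched as follows, assuming the librosa and SciPy libraries (not named in the paper) and illustrative settings of 40 Mel filters, a 512-point FFT, and a 10 ms hop at 16 kHz; the only difference between the two features is the final DCT step.

```python
import numpy as np
import librosa                      # assumed dependency for loading audio and Mel filtering
from scipy.fftpack import dct

def fbank_and_mfcc(wav_path, sr=16000, n_mels=40, n_mfcc=13):
    """Compute log-Mel filterbank (Fbank) features and, via an extra DCT, MFCCs."""
    y, _ = librosa.load(wav_path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=512,
                                         hop_length=160, n_mels=n_mels)
    fbank = np.log(mel + 1e-6).T                                  # (frames, n_mels) log-Mel energies
    mfcc = dct(fbank, type=2, axis=1, norm='ortho')[:, :n_mfcc]   # DCT decorrelates the channels
    return fbank, mfcc
```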

The similarity matching process computes the similarity of two input voiceprint features. Three computing strategies are commonly used: cosine similarity, Euclidean distance, and a machine learning model. The similarity score obtained by cosine similarity lies in the range $[-1, 1]$; the Euclidean distance mainly measures the difference between features, and its value range after conversion to a similarity score is $(0, +\infty)$. A machine learning model is relatively flexible, but its computational cost is relatively large.
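A minimal sketch of the first two scoring strategies, assuming NumPy embeddings and a hypothetical decision threshold of 0.7 (in practice the threshold would be tuned on development data):

```python
import numpy as np

def cosine_score(a, b):
    """Cosine similarity between two voiceprint embeddings, in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-10))

def euclidean_score(a, b):
    """One possible conversion of Euclidean distance into a similarity in (0, +inf)."""
    return float(1.0 / (np.linalg.norm(a - b) + 1e-10))

def is_same_speaker(a, b, threshold=0.7):
    """Accept the claim if the cosine score exceeds a predefined threshold."""
    return cosine_score(a, b) > threshold
```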

3. SACNN-Self-attentive Model

3.1. Principles and Implementation Details of Self-attentive

In an RNN, each time step's computation depends on both the previous time step's output in the same layer and the output of the layer below, a structure that helps the model learn global features. However, it can also lead to the problem of long sequence dependencies. Although LSTM can partially overcome the gradient explosion and vanishing gradient issues that RNNs face with long sequence dependencies, it has a more complex computation process. Furthermore, RNN-based models with attention mechanisms still possess the general limitations of RNNs and depend strongly on the sequential nature of data during computation, making it challenging to extract the relationships between words in textual sentences.

To address this problem, in 2017, the Google team proposed a model structure called the "Transformer." It replaces the LSTM with a fully attention-based mechanism and applies it to machine translation, achieving excellent results. The core of the Transformer model structure lies in the self-attention mechanism. This mechanism extracts textual features by capturing the relationships between words in a sentence. Moreover, the self-attention mechanism is not based on sequential structure, which eliminates the dependency on the sequence of the text during computation. As a result, it can handle long-range dependencies more effectively, optimize computations, and reduce the complexity of calculations for each layer's neurons. The Transformer model efficiently controls training costs by reducing the computational burden.

The Transformer model structure, as depicted in Fig. 3, is essentially an Encoder-Decoder model with attention mechanisms, similar to the seq2seq model. The Encoder consists of 6 identical encoding modules, each comprising two sub-modules: one sub-module consists of Multi-Head Attention, Residual Connection, and Layer Normalization; the other sub-module consists of a Feed-Forward Neural Network (FFN), along with Residual Connection and Layer Normalization. It's important to note that the input of each sub-module comes from the output of the connected sub-module in the previous layer.

In contrast, the Decoder is made up of six identical decoding units. Every decoding module has three sub-modules, as opposed to two in the Encoder. The first sub-module uses Masked Multi-Head Attention, since the prediction at position i in the Decoder may depend only on the previous outputs, which distinguishes it from the sub-modules in the Encoder. The second and third sub-modules are the same as the first and second sub-modules in the Encoder. However, the input of the second sub-module (Multi-Head Attention) includes both the output from the first sub-module and the Encoder output. This enables the output at each position in the Decoder to be correlated with the corresponding positions in the Encoder, essentially solving the long-range dependence issue in the seq2seq paradigm.

The self-attention process, which forms the basis of the Transformer model, will now be examined. As shown in Fig. 3, the Scaled Dot-Product Attention and the Multi-Head Attention comprise the self-attention mechanism. The Scaled Dot-Product Attention is represented on the left side of the figure, while the Multi-Head Attention is represented on the right side.

First, the scaled dot-product attention mechanism is analyzed. In Fig. 3, Q, K, and V represent the Query, Key, and Value, respectively. In reading comprehension terms, they correspond to the query, the keywords, and the answer.

(3)
$ Q = W^Q X, $
(4)
$ K = W^K X, $
(5)
$ V = W^V X, $

where $X$ represents the input vector, and $W^Q$, $W^K$, and $W^V$ represent the initialization weight matrices of $Q$, $K$, and $V$, respectively.

Thus, the result of the scaled dot-product attention mechanism can be calculated from the vectors $Q$, $K$, and $V$, as expressed in Eq. (6).

(6)
$ Attention(Q,K,V) = softmax\left(\frac{QK^T}{\sqrt{d_k}}\right)V, $

where $d_k$ represents the dimension of the $Q$ and $K$ vectors, and the factor $1/\sqrt{d_k}$ scales the inner product of $Q$ and $K$.

Fig. 3. Network structure of SACNN model.


This scaling prevents the inner product of the two from becoming excessively large. The weights over $V$ are obtained by normalizing with the softmax function, and multiplying them with $V$ gives the final weighted sum. Next, the multi-head attention mechanism is analyzed. Fig. 3 shows the calculation process of Eqs. (7) and (8).

(7)
$ MultiHead = Concat(head_1, head_2, ..., head_n)W^O, $
(8)
$ head_i = Attention(QW_i^Q, KW_i^K, VW_i^V), $

where $W_i^Q \in R^{d_m \times d_k}$; $W_i^K \in R^{d_m \times d_k}$; $W_i^V \in R^{d_m \times d_v}$; $W^O \in R^{d_v \times d_m}$; $d_k$ and $d_v$ represent the vector dimensions of the $Q$/$K$ and $V$ projections, respectively; and $d_m$ represents the word vector dimension of the model.
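A compact NumPy sketch of Eqs. (3)-(8) follows. It uses the row-vector convention ($Q = XW^Q$ rather than $W^QX$), and the per-head projections of Eq. (8) are realized by splitting single projection matrices into $n$ equal slices, which is equivalent to keeping separate $W_i^Q$, $W_i^K$, $W_i^V$ matrices; all sizes are illustrative.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Eq. (6): softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k), axis=-1) @ V

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """Eqs. (3)-(5) project X into Q, K, V; Eqs. (7)-(8) attend per head, concatenate, and project."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    heads = [scaled_dot_product_attention(q, k, v)
             for q, k, v in zip(np.split(Q, n_heads, axis=-1),
                                np.split(K, n_heads, axis=-1),
                                np.split(V, n_heads, axis=-1))]
    return np.concatenate(heads, axis=-1) @ Wo

# Example: 5 input frames, model dimension d_m = 8, 2 attention heads.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
Wq, Wk, Wv, Wo = (rng.normal(size=(8, 8)) for _ in range(4))
out = multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads=2)   # -> shape (5, 8)
```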

3.2. SACNN-Self-attentive Model

The CNN model SACNN will be constructed in this part using the self-attention mechanism as a basis. In Fig. 3, the particular model structure is displayed. As can be seen from Fig. 3, the Self-Attention mechanism-based SACNN model constructed in this paper mainly consists of four parts, namely, input layer, convolutional neural network layer (CNN layer), self-attention mechanism layer and output layer.

The main function of the input layer is to transform the data into vector form. Let the length of the data be $n$; after preprocessing and input into the prediction model, it can be represented as $X = [x_1, x_2, x_3, ..., x_n]^T$.

In the CNN layer, this paper deviates from the conventional use of two-dimensional convolutions in computer vision and instead adopts one-dimensional convolutions, which share the same underlying principles. This keeps the computational complexity of the convolutional neural network (CNN) manageable while preserving its strong feature extraction capabilities. Although recurrent neural networks can capture the continuity information of time series data, their feature extraction abilities are generally limited: information loss and an inability to extract structural information arise when dealing with long input sequences.

In comparison, CNNs exhibit strong feature extraction capabilities, making them well suited to overcoming these problems. Therefore, the model constructed in this paper employs convolutional layers to extract information from the data. Using a single fixed kernel size to capture data information would be relatively limited, so to ensure that global information is collected, this work employs several convolutional kernels of varying sizes to capture the feature information of the data, as sketched below.
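A minimal sketch of such a multi-kernel one-dimensional convolution layer is shown below; PyTorch is assumed (the paper does not name a framework), and the channel counts and kernel sizes are illustrative.

```python
import torch
import torch.nn as nn

class MultiKernelConv1d(nn.Module):
    """Parallel 1-D convolutions with different kernel sizes; outputs are
    concatenated along the channel axis to combine local and wider context."""
    def __init__(self, in_ch=1, out_ch=32, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(in_ch, out_ch, k, padding=k // 2) for k in kernel_sizes)
        self.act = nn.ReLU()

    def forward(self, x):                       # x: (batch, in_ch, sequence length)
        return self.act(torch.cat([branch(x) for branch in self.branches], dim=1))

# Example: a batch of 4 sequences of length n = 300 with one input channel.
features = MultiKernelConv1d()(torch.randn(4, 1, 300))   # -> (4, 96, 300)
```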

In the self-attention mechanism layer, since self-attention mechanisms excel at handling sequence data such as natural language, incorporating them into the model can further highlight influential factors and improve model accuracy. The computation of the self-attention mechanism, which updates the learned weights based on the input itself, can be divided into three steps: similarity calculation, normalization of the similarity values, and weighted summation using the feature weight coefficients, which represent the self-attention weights of a given input element. This paper adopts the dot-product method for similarity calculation and uses the softmax function to transform the dot-product results into a weight vector. Finally, a weighted summation is performed with the obtained weight vector, producing the final self-attention output.

Furthermore, to address the issue of overfitting, a Dropout layer is added after the concatenation layer of the self-attention mechanism. By randomly ignoring nodes, the Dropout layer prevents the training process from excessively relying on specific feature values.

In the output layer, the data passed through the Dropout layer undergoes feature dimension reduction via a fully connected layer, producing the output $Y$ of length $m$:

(9)
$ Y = [y_1, y_2, \dots, y_m]^T, $
(10)
$ y_t = f(W_o s_t + b_o), $

where $W_o$ and $b_o$ represent the weight matrix and the bias vector, respectively, and $y_t$ represents the output of the model at time $t$.

4. Optimization of Voiceprint Recognition Based on SACNN

4.1. Speech Signal Denoising Technology

Traditional signal denoising methods pass the signal through various filters after a Fourier transform, but their denoising effect is not ideal for non-stationary signals with abrupt changes and frequency-band aliasing. In contrast, the wavelet transform (WT) has good time-frequency localization characteristics, performs multi-scale processing of signals at different resolutions in the time-frequency space, and is widely used in many fields such as speech noise reduction, image processing, weather prediction, seismic surveying, and machine vision.

Let $f(t)$ be a finite energy signal; the discrete wavelet transform of this signal can be defined as:

(11)
$ (W_\psi f)(a,b) = \langle f,\psi_{a,b}\rangle = |a|^{-1/2} \int_{-\infty}^{+\infty} f(t)\,\overline{\psi\left(\frac{t-b}{a}\right)}\,dt, $

where $\psi_{a,b}$ is called the generating or basis function of the wavelet transform:

(12)
$ \psi_{a,b}(t) = |a|^{-1/2}\psi\left(\frac{t-b}{a}\right), \quad a > 0,\ b \in R. $

In Eq. (11), $a$ is the scaling factor and $b$ is the displacement (translation) factor. As illustrated in Fig. 4, a time-frequency coordinate system is established for the wavelet transform, in which the position of the time window is influenced only by the displacement factor. As the scale factor increases, the time window widens, the frequency window narrows, and the center of the frequency window shifts towards the low-frequency direction; conversely, with a narrower time window, the frequency window widens and its center moves towards the high-frequency direction. The essence of the wavelet transform lies in manipulating these two factors to construct a combination that can represent any signal in the space [18]. By utilizing the scale factor, a tower-like decomposition of a given signal can be performed, as depicted in Fig. 4, following the classical Mallat algorithm [19]. This algorithm provides a computational method for wavelet decomposition and reconstruction, simplifying the overall wavelet calculations.

Fig. 4. Time-frequency coordinates of wavelet transform and Mallat algorithm.


In the Mallat algorithm, there is an impulse response function: $h(n)$. Therefore, the scale function and the wavelet function are defined as follows.

(13)
$ \begin{cases} \phi(t) = \sum_n h(n)\phi(2t -n), \\ \psi(t) = \sum_n g(n)\phi(2t -n). \end{cases} $

In Eq. (13), $g(n) = (-1)^{1-n}h(1 - n)$. The signal $x(t)$ is decomposed by the Mallat algorithm with the scale set to $j$ ($j \ge 1$). The approximate signal and detailed signal obtained by the decomposition are, respectively:

(14)
$ \begin{cases} A_jx(t) = \langle x(t), \phi_{j,k}(t)\rangle \\ = 2^{-j/2} \int x(t)\phi(2^{-j}t -2k)dt, \\ D_jx(t) = \langle x(t), \psi_{j,k}(t)\rangle \\ = 2^{-j/2} \int x(t)\psi(2^{-j}t -2k)dt. \end{cases} $

It can be seen from Eq. (14) that decomposing the signal $x(t)$ proceeds step by step from scale $j$ to $j + 1$, i.e., from finer to coarser resolution, and finally yields a low-frequency approximate signal $A_jx$ and a high-frequency detailed signal $D_jx$ [20]:

(15)
$ \begin{cases} A_{j+1}x = \sum_k h(k -2n)A_jx, \\ D_{j+1}x = \sum_k g(k -2n)A_jx, \end{cases} j \ge 1. $

Eq. (16) is the Mallat wavelet reconstruction formula:

(16)
$ x = \sum_{k} h(n-2k)A_{j+1}x + \sum_{k} g(n-2k)D_{j+1}x, \quad j \ge 1. $

After a finite-energy signal undergoes the wavelet transform, it is decomposed into a set of detail signals and approximation signals, and each sample point of every signal has its wavelet decomposition coefficient $\omega_{j,k}$. When the signal contains noise, the noise is decomposed along with the host signal, so a threshold is set for each decomposition level. If the absolute value of a wavelet coefficient does not exceed the threshold, the coefficient is considered to come from noise and is zeroed out; if it exceeds the threshold, it is considered to come mainly from the signal and is left untouched or processed through a specific threshold function [21]. This yields estimated wavelet coefficients that replace the original ones. Once all wavelet coefficients are processed, wavelet reconstruction is performed to achieve the denoising effect. The key to wavelet threshold denoising lies in finding an appropriate threshold function, also known as a threshold rule. Conventional wavelet threshold functions may be broadly classified into three categories: hard, soft, and semi-soft [22]. Based on extensive experimentation and empirical analysis, this paper proposes improvements to the semi-soft threshold denoising method:

Let the high frequency signal be $Wa_{j,k}$, then the formula for estimating noise standard deviation is as follows:

(17)
$ \sigma_j = \frac{1}{0.6745} \times \frac{1}{N} \sum_{k=1}^N |Wa_{j,k}|, \quad 1 \le j \le J. $

Since the SNR of the real original signal varies, the threshold setting must also be adjusted to the current circumstances. The unified threshold formula commonly found in the literature is as follows [23]:

(18)
$ \lambda_{1, j} = \sigma_j \sqrt{2\log(N)}. $

After the signal is decomposed to scale $J$, $J$ groups of high-frequency signal coefficients are obtained. The wavelet coefficients of each group are arranged in ascending order of absolute value, giving a vector:

(19)
$ P = [Wa_{j,n}], 1 \le n \le N. $

This vector is used to calculate the evaluation vector for the $j$-th layer wavelet coefficients, $R = [r_n]$, $1 \le n \le N$, where

(20)
$ r_n = \sum_{k=1}^n Wa_{j,k} + (N -n)Wa_{j,n} + (N -2n)\sigma_j^2. $

The entries of the evaluation vector are then sorted, the minimum value is taken as the approximation error $CD_{\min}$, and the corresponding wavelet coefficient $Wa_{j,m}$ is found. The adaptive threshold of the $j$-layer wavelet decomposition is then calculated from this value as follows:

(21)
$ \lambda_{a, j} = \sqrt{CD_{\min}}. $

The threshold selection function of the J-layer wavelet decomposition is:

(22)
$ \lambda_j = \begin{cases} \lambda_{1, j}, & (P_{a, j} -\sigma_j^2 < \rho_{N, j}), \\ \min(\lambda_{1, j}, \lambda_{a, j}), & (P_{a, j} -\sigma_j^2 \ge \rho_{N, j}), \end{cases} $

where $P_{a, j}$ is the average value of the absolute value of the wavelet coefficient, and $\rho_{N, j}$ is the minimum energy level of the wavelet coefficient vector. The calculation formula is as follows:

(23)
$ P_{a, j} = \frac{1}{N} \sum_{k=1}^N |Wa_{j,k}|. $

Since the wavelet coefficients judged to result from noise have been altered, the signal must be restored: the estimated wavelet coefficient values replace the original ones through a series of computations, and wavelet reconstruction is the final step in achieving noise reduction. A coefficient $\Gamma(\sigma_j)$ is introduced to reflect the noise intensity of the $j$-layer wavelet high-frequency signal. Its calculation formula is as follows:

(24)
$ \Gamma(\sigma_j) = \sqrt{\sigma_j/A_j}. $

In Eq. (24), $A_j$ represents the amplitude of the high-frequency partial coefficient of the J-layer wavelet.

The wavelet coefficient estimates are then calculated as follows:

(25)
$ w_{j,k} = \begin{cases} w_{j,k} -\Gamma(\sigma_j)\times\lambda_j, & w_{j,k} > \lambda_j, \\ w_{j,k} +\Gamma(\sigma_j)\times\lambda_j, & w_{j,k} < -\lambda_j, \\ 0, & -\lambda_j \le w_{j,k} \le \lambda_j. \end{cases} $

The detailed steps of the improved wavelet threshold denoising algorithm are as follows. The original signal $x(t)$ is discretized, and the low-pass filter $h$ and high-pass filter $g$ are set up. A $J$-layer wavelet decomposition is performed, yielding the wavelet coefficients and the wavelet detail signal amplitude $A_j$ of each layer. The noise standard deviation $\sigma_j$ and noise intensity coefficient $\Gamma(\sigma_j)$ of each layer are calculated from the detail coefficients of that layer. The unified threshold $\lambda_{1, j}$ of each layer is calculated, along with the adaptive threshold $\lambda_{a, j}$, the average absolute value $P_{a, j}$ of the layer's wavelet coefficients, and the minimum energy level $\rho_{N, j}$ of the layer's wavelet coefficient vector. The wavelet threshold of each layer is then calculated according to Eq. (22), the wavelet coefficients are adjusted to complete the threshold denoising at that scale, and finally wavelet reconstruction is carried out according to Fig. 4.
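For reference, a baseline sketch of wavelet threshold denoising is given below. It uses the PyWavelets library (an assumed dependency), the per-level noise estimate of Eq. (17), the unified threshold of Eq. (18), and plain soft thresholding; the paper's improved semi-soft rule of Eqs. (19)-(25) would replace the threshold selection and shrinkage steps.

```python
import numpy as np
import pywt   # PyWavelets, assumed; the 'db4' wavelet and 4 levels are illustrative choices

def wavelet_denoise(x, wavelet='db4', level=4):
    """Baseline wavelet threshold denoising: decompose, soft-threshold the
    detail coefficients level by level, then reconstruct (Mallat algorithm)."""
    coeffs = pywt.wavedec(x, wavelet, level=level)        # [A_J, D_J, ..., D_1]
    denoised = [coeffs[0]]                                 # keep the approximation signal
    for d in coeffs[1:]:
        sigma = np.mean(np.abs(d)) / 0.6745                # noise std per level, Eq. (17)
        lam = sigma * np.sqrt(2.0 * np.log(len(d)))        # unified threshold, Eq. (18)
        denoised.append(pywt.threshold(d, lam, mode='soft'))
    return pywt.waverec(denoised, wavelet)[:len(x)]
```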

Based on this main idea, simulation experiments are conducted on existing data, with the results shown in Fig. 5. The experiments demonstrate that the improved wavelet threshold denoising described in this paper is more effective at suppressing and removing noise while preserving the majority of the useful signal information, and the denoised results are better suited for subsequent processing.

Fig. 5. Signal noise reduction effect diagram.


4.2. Voiceprint Recognition Based on the SACNN-Self-Attentive Model

The structure of the SACNN model proposed in the third section is shown in Table 1.

From the table, it can be observed that SACNN is primarily composed of stacked Attentive ResBlock blocks with different output dimensions. The first convolutional layer, conv1, and the pooling layer, pool1, extract low-level features and reduce spatial dimensions to accelerate model training. Following these layers are three groups of Attentive ResBlock blocks with different output dimensions: Attentive ResBlock256, Attentive ResBlock512, and Attentive ResBlock1024. It is noteworthy that the final two groups' initial block has a stride of $2 \times 2$, whilst the following blocks have $1 \times 1$. This design aims to facilitate extracting highly overlapping features by applying simple dimension reduction to the input feature maps.

Table 1. SACNN model structure.

Layer name | Structure | Step size | Output size
Conv1 | $3\times3$, 64 | $1\times1$ | (None, 299, 64, 64)
maxpool1 | $3\times3$ | $2\times2$ | (None, 150, 32, 64)
Attentive ResBlock256 | $\begin{bmatrix} 1\times1, 64 \\ 3\times3, 64 \\ 1\times1, 256 \end{bmatrix} \times 2$ | $1\times1$, $1\times1$ | (None, 150, 32, 256)
Attentive ResBlock512 | $\begin{bmatrix} 1\times1, 128 \\ 3\times3, 128 \\ 1\times1, 512 \end{bmatrix} \times 3$ | $2\times2$, $1\times1$, $1\times1$ | (None, 75, 16, 512)
Attentive ResBlock1024 | $\begin{bmatrix} 1\times1, 256 \\ 3\times3, 256 \\ 1\times1, 1024 \end{bmatrix} \times 2$ | $2\times2$, $1\times1$ | (None, 38, 8, 1024)
average | - | - | (None, 1024)
fc1 | 1024 | - | (None, 1024)
fc2 | N | - | (None, N)

The subsequent average layer transforms frame-level speaker embeddings into a sentence-level speaker embedding. The last fully connected layer, fc2, uses the softmax activation function to map the sentence-level information to particular speaker identities.

Batch normalization (BN) is applied throughout the structure before the ReLU activation function. The model has approximately 5.5 million parameters and a size of 19 MB. The fc1 layer of this model has 1024 neurons, which is on the order of the total number of speakers in large-scale datasets, making the model more appropriate for large-scale datasets even though the parameter count grows.

Some of the model's training parameter settings are shown in Table 2. In general, the training batch size and the number of data-loading threads are set according to the memory size and usage of the server. Here, in order to keep GPU utilization above 90% during training, batch_size is set to 128, which speeds up training.

Table 2. Experimental training parameter setting.

Parameter name | Parameter value | Description
batch_size | 128 | Batch size for training
epochs | 30 | Number of training rounds
learning_rate | 1e-3 | Initial learning rate
workers | 4 | Number of data-loading threads

The experiments utilize two open-source datasets: TIMIT [24] and LibriSpeech [25]. Table 3 presents the data partitioning details. Additionally, the datasets are pre-processed with the wavelet denoising algorithm before the experiments. The following paragraphs introduce each dataset separately.

Table 3. Data partitioning of the TIMIT and LibriSpeech datasets.

Data set name | Data set class | Number of speakers
TIMIT | Training set | 462
TIMIT | Test set | 168
LibriSpeech | Training set | 251
LibriSpeech | Test set | 40

The TIMIT dataset consists of speech samples from 630 speakers drawn from eight different dialect regions of American English. There are 438 male and 192 female speakers, and the speech samples have a sampling frequency of 16 kHz. Each speaker has ten speech samples, each lasting 2 to 3 seconds. In the voiceprint recognition experiments in this section, the training set consists of 462 speakers, while the test set consists of 168 speakers.

LibriSpeech is a dataset containing text and speech from audiobooks, with a total duration of 1000 hours and 2484 speakers. This paper selects two subsets of the dataset, train-clean-100 and test-clean. The train-clean-100 subset includes 251 speakers, while the test-clean subset includes 40.

This paper utilizes all speakers in the test set for voiceprint identification experiments. Half of the sentences from each speaker are used as the enrollment set, while the remaining sentences are used as the evaluation set. For voiceprint verification experiments, half of the speakers from the test set are used as enrolled speakers, and the other half are used as non-enrolled speakers. The enrollment set consists of half of the sentences from the enrolled speakers, while the evaluation set consists of the remaining sentences combined with all the sentences from the non-enrolled speakers. It is important to note that the evaluation sets of both experiments are shuffled to randomize the data distribution.

The voice signal must be pre-processed before the experiment. The speech signal pre-processing step converts all audio to a single channel at a 16 kHz sampling frequency and then uses voice activity detection (VAD) to remove the silent parts of the audio signal. The traditional way to handle variable-length audio is to crop it into fixed-size segments (such as 3-second segments), as shown in Fig. 6 and followed in this paper.

4.3. Experiment of Voiceprint Recognition Based on SACNN Model

As mentioned in the preceding section, every audio file in the training and test sets underwent the following preprocessing steps. First, silence removal was performed to eliminate unnecessary non-speech content. Then, the audio files were cropped into 3-second segments to ensure uniform duration. Subsequently, Fbank features were extracted from the cropped audio segments to capture the spectral information. Finally, the extracted Fbank features were arranged into spectrograms of size (299, 40, 1); spectrograms are representations of the audio signal's frequency content over time.

These spectrograms were saved as pickle files using Python's pickle module, which serializes objects by writing the resulting data stream to file objects.
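A small sketch of this step is shown below, assuming NumPy and a hypothetical crop-or-pad policy to reach the (299, 40, 1) shape; the exact policy used in the experiments is not specified.

```python
import pickle
import numpy as np

def save_features(fbank, out_path):
    """Crop or zero-pad Fbank features to a (299, 40, 1) array and pickle it."""
    spec = np.zeros((299, 40, 1), dtype=np.float32)
    n = min(299, fbank.shape[0])
    spec[:n, :, 0] = fbank[:n, :40]
    with open(out_path, 'wb') as f:
        pickle.dump(spec, f)

def load_features(path):
    """Read a pickled spectrogram back for training."""
    with open(path, 'rb') as f:
        return pickle.load(f)
```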

Fig. 6. Voiceprint information preprocessing process.


Fig. 7. Loss graphs and accuracy on different data sets.


The two loss curves in Fig. 7 show that the model converges well on the training sets. Some datasets have fewer categories, resulting in less data in the overall training set, yet the improved model still trains and optimizes well. During the training phase, both the SACNN and CNN models were fed the spectrograms as inputs, and the objective was for the models to correctly identify the corresponding speaker identity for each audio sample in the training set. The models were trained with the cross-entropy loss function, and weights were updated by error backpropagation using stochastic gradient descent with momentum, as sketched below.
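A hedged sketch of this training setup, assuming PyTorch (the framework is not named in the paper) and an assumed momentum value of 0.9, with batch size 128, initial learning rate 1e-3, and 30 epochs as listed in Table 2:

```python
import torch
import torch.nn as nn

def train(model, train_loader, num_epochs=30, lr=1e-3, device='cuda'):
    """Cross-entropy training with SGD plus momentum (momentum=0.9 is an assumption)."""
    model = model.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for epoch in range(num_epochs):
        for spectrograms, speaker_ids in train_loader:    # e.g. (128, 1, 299, 40), (128,)
            spectrograms = spectrograms.to(device)
            speaker_ids = speaker_ids.to(device)
            optimizer.zero_grad()
            loss = criterion(model(spectrograms), speaker_ids)
            loss.backward()                               # error backpropagation
            optimizer.step()                              # SGD-with-momentum update
```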

Experimental results were obtained, and the loss function curves for the models on the two datasets are depicted in Fig. 7.

Accuracy is the assessment metric of the voiceprint recognition experiments; higher accuracy indicates better model performance. The Deep Speaker model is also used for comparison with the SACNN model proposed in this paper. The comparison results are obtained by running the model structure from the corresponding paper under the same experimental conditions. The experimental results on the LibriSpeech and TIMIT datasets are shown in Fig. 7.

Comparing the experimental results shows that the SACNN model converges well on the test datasets. Meanwhile, compared with the Deep Speaker model, the accuracy of the SACNN model is significantly improved, reaching 98.35% and 98.43% on the two datasets, respectively, which is 1.12% and 1.24% higher than the Deep Speaker model on the same datasets.

5. Conclusion

This study aims to optimize speaker recognition technology by introducing the SACNN-Self-attentive model. Through analysis and discussion of the experimental results, several conclusions have been drawn:

We successfully applied the SACNN-Self-attentive model to the speaker recognition task. By combining the self-attention mechanism with a CNN, we effectively captured key information in speaker features, and the model exhibited good convergence on the datasets. Furthermore, experimental results demonstrated that our model achieved accuracies of 98.35% and 98.43%, which are 1.12% and 1.24% higher than the Deep Speaker model's accuracy on the same datasets, respectively. We also applied the wavelet algorithm to preprocess the speech signals, effectively preserving important features while suppressing information irrelevant to the speaker recognition task.

The performance of the SACNN model largely depends on the quality and quantity of the training data; when the training data is insufficient or noisy, the model's generalization ability decreases. Introducing the self-attention mechanism increases the computational complexity of the model, which may cause efficiency bottlenecks in actual deployment. In addition, SACNN is a black-box model whose internal mechanism and decision-making process are difficult to explain, which limits its application in scenarios with strict interpretability requirements.

Future research should explore new lightweight self-attention mechanisms to balance model performance and computational efficiency. Combining SACNN with other techniques, such as generative adversarial networks (GANs) and transfer learning, could improve the model's generalization performance in data-scarce scenarios. SACNN could also be applied to more voiceprint recognition scenarios, such as cross-device and cross-language settings, to explore its potential in practical applications.

SACNN generally performs well in voiceprint recognition, but some problems remain worthy of further study. Future research may focus on improving the model's efficiency, generalization, and interpretability to meet the needs of a broader range of applications.

References

[1] Zhang X.-M., 2021, Based on the speaker recognition research and application of deep learning.
[2] Liu Y., Liang H., Liu G., Hu Q., 2021, Voiceprint recognition method based on ResNet-LSTM, Computer System Application, Vol. 30, No. 6, pp. 215-219.
[3] Krizhevsky A., Sutskever I., Hinton G., 2012, ImageNet classification with deep convolutional neural networks, Advances in Neural Information Processing Systems, Vol. 25.
[4] Chen L., Lee K. A., Ma B., Guo W., Li H., Dai L.-R., 2015, Phone-centric local variability vector for text-constrained speaker verification.
[5] Li G., Hari S. K. S., Sullivan M., Tsai T., Pattabiraman K., Emer J., Keckler S. W., 2017, Understanding error propagation in deep learning neural network (DNN) accelerators and applications.
[6] Dey S., Madikeri S., Ferras M., Motlicek P., 2016, Deep neural network-based posteriors for text-dependent speaker verification.
[7] Novotny O., Plchot O., Matejka P., Mosner L., Glembek O., 2018, On the use of X-vectors for robust speaker recognition.
[8] Muckenhirn H., Doss M. M., Marcel S., 2018, Towards directly modeling raw speech signal for speaker verification using CNNs, pp. 4884-4888.
[9] Heidari A. A., Faris H., Mirjalili S., Alijarah I., Mafarja M., 2020, Ant lion optimizer: theory, literature review, and application in multi-layer perceptron neural networks, Nature-Inspired Optimizers: Theories, Literature Reviews and Applications, pp. 23-46.
[10] Li C., Ma X., Jiang B., Li X., Zhang X., Liu X., Cao Y., Kannan A., Zhu Z., 2017, Deep speaker: An end-to-end neural speaker embedding system, arXiv preprint arXiv:1705.02304.
[11] Kinnunen T., Li H., 2010, An overview of text-independent speaker recognition: From features to supervectors, Speech Communication, Vol. 52, No. 1, pp. 12-40.
[12] Hanifa R. M., Isa K., Mohamad S., 2021, A review on speaker recognition: Technology and challenges, Computers & Electrical Engineering, Vol. 90, pp. 107005.
[13] Reynolds D. A., Quatieri T. F., Dunn R. B., 2000, Speaker verification using adapted Gaussian mixture models, Digital Signal Processing, Vol. 10, No. 1-3, pp. 19-41.
[14] McLaren M., Ferrer L., Castan D., Lawson A., 2016, The speakers in the wild (SITW) speaker recognition database, pp. 818-822.
[15] Lei Y., Scheffer N., Ferrer L., McLaren M., 2014, A novel scheme for speaker recognition using a phonetically-aware deep neural network, pp. 1695-1699.
[16] Snyder D., Garcia-Romero D., Sell G., Povey D., Khudanpur S., 2018, X-vectors: Robust DNN embeddings for speaker recognition, pp. 5329-5333.
[17] Kabir M. M., Mridha M. F., Shin J., Jahan I., Quwsar A., 2021, A survey of speaker recognition: Fundamental theories, recognition methods and opportunities, IEEE Access, Vol. 9, pp. 79236-79263.
[18] Xu B., Cen K., Huang J., Shen H., Chen X., 2020, A review of graph convolutional neural networks, Chinese Journal of Computers, Vol. 43, No. 5, pp. 755-780.
[19] Luo D., Li Y., Luo Z., Han C., 2023, Detection and analysis of hanging basket wire rope broken strands based on Mallat algorithm, pp. 518-532.
[20] Li H., Zhou Y., Tian F., Li S., Sun T., 2015, A new adaptive wavelet thresholding function vibration signal denoising algorithm, Journal of Instruments and Meters, Vol. 4, No. 10, pp. 2200-2206.
[21] Chang G., 2000, Adaptive wavelet thresholding for image denoising and compression, IEEE Transactions on Image Processing, Vol. 9.
[22] Guo H.-Y., Jing X.-J., Shang Y., 2010, Research on vehicle license plate location based on wavelet transform and mathematical morphology, Computer Technology and Development, Vol. 20, No. 5, pp. 13-16.
[23] Hou P.-G., Zhao J., Liu M., 2006, License plate location method based on wavelet transform and line scan, Journal of System Simulation, pp. 811-813.
[24] 1990, The DARPA TIMIT acoustic-phonetic continuous speech corpus, NIST Speech CD.
[25] Panayotov V., Chen G., Povey D., Khudanpur S., 2015, Librispeech: An ASR corpus based on public domain audio books, pp. 5206-5210.
Guoqiang Lu

Guoqiang Lu was born in Jiangsu Province, China, in 1981. He received his B.S. degree in electronic science and technology from Southeast University, Jiangsu, China, in 2005, and his M.S. degree in instrument science and technology from Nanjing University of Aeronautics and Astronautics, Jiangsu, China, in 2008. He is currently pursuing a Ph.D. degree in instrument science and technology at Nanjing University of Aeronautics and Astronautics.

Yanmin Bai

Yanmin Bai was born in Jiangsu Province, China, in 1982. She received her B.S. and M.S. degrees in computer application technology from Nanjing University of Aeronautics and Astronautics, Jiangsu, China, in 2005 and 2008, respectively. She currently works at JiangSu Vocational College of Information Technology. She has published four papers.