4.1. Speech Signal Denoising Technology
Traditional signal denoising methods, in which the signal is passed through various filters after a Fourier transform, achieve far from ideal denoising for non-stationary signals with sudden changes and frequency-band aliasing. In contrast, the wavelet transform (WT) has good time-frequency localization characteristics and performs multi-scale processing of signals at different resolutions in time-frequency space; it is widely used in many fields such as speech noise reduction, image processing, weather prediction, seismic investigation, and machine vision.
Let $f(t)$ be a finite-energy signal; the wavelet transform of this signal can be defined as

$$W_f(a,b) = \frac{1}{\sqrt{|a|}}\int_{-\infty}^{+\infty} f(t)\,\psi^{*}\!\left(\frac{t-b}{a}\right)\mathrm{d}t \quad (11)$$

where $\psi_{a,b}(t) = \frac{1}{\sqrt{|a|}}\,\psi\!\left(\frac{t-b}{a}\right)$ is called the generating or basis function of the wavelet transform. In Eq. (11), $a$ is the scale factor and $b$ is the displacement factor; in the discrete wavelet transform, $a$ and $b$ are sampled dyadically. As illustrated in Fig. 4, a time-frequency coordinate system is established for the wavelet transform. The position of the time window is determined solely by the displacement factor, while its width is controlled by the scale factor: as the scale factor increases, the time window widens, the frequency window narrows, and the center of the frequency window shifts towards the low-frequency direction; conversely, as the scale factor decreases, the time window narrows, the frequency window widens, and the center of the frequency window moves towards the high-frequency direction. The essence of the wavelet transform lies in manipulating these two factors to construct a combination that can represent any signal in space [18]. By varying the scale factor, a given signal can be decomposed in space in a pyramid-like (tower) fashion, as depicted in Fig. 4, following the classical Mallat algorithm [19]. This algorithm provides a computational method for wavelet decomposition and reconstruction, simplifying the overall wavelet calculations.
Fig. 4. Time-frequency coordinates of wavelet transform and Mallat algorithm.
In the Mallat algorithm there is an impulse response function $h(n)$. The scale function and the wavelet function are accordingly defined by the two-scale relations

$$\phi(t) = \sqrt{2}\sum_{n} h(n)\,\phi(2t-n) \quad (12)$$

$$\psi(t) = \sqrt{2}\sum_{n} g(n)\,\phi(2t-n) \quad (13)$$

In formula (13), $g(n) = (-1)^{1-n}h(1-n)$. The signal $x(t)$ is decomposed by the Mallat algorithm with the scale set to $j$ ($j \ge 1$). The approximate signal and the detail signal obtained by the decomposition are, respectively,

$$A_{j+1}x(n) = \sum_{k} h(k-2n)\,A_j x(k), \qquad A_0x = x \quad (14)$$

$$D_{j+1}x(n) = \sum_{k} g(k-2n)\,A_j x(k) \quad (15)$$
It can be seen from Eqs. (14) and (15) that decomposing the signal $x(t)$ proceeds step by step from scale $j$ to $j+1$, that is, from high resolution to low resolution, until the signal is finally decomposed into a set of high-frequency signals $D_jx$ (detail signals) and a low-frequency signal $A_Jx$ (approximate signal) [20].
Eq. (16) is the Mallat wavelet reconstruction formula:

$$A_j x(n) = \sum_{k} h(n-2k)\,A_{j+1}x(k) + \sum_{k} g(n-2k)\,D_{j+1}x(k) \quad (16)$$
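For illustration, the decomposition of Eqs. (14)-(15) and the reconstruction of Eq. (16) can be reproduced with the PyWavelets library; this is a minimal sketch under the assumption of a Daubechies-4 wavelet, which the text does not specify:

```python
import numpy as np
import pywt

# One level of Mallat decomposition and reconstruction (db4 filters assumed)
x = np.random.randn(1024)              # finite-energy test signal
cA, cD = pywt.dwt(x, "db4")            # approximation A and detail D, Eqs. (14)-(15)
x_rec = pywt.idwt(cA, cD, "db4")       # reconstruction, Eq. (16)
print(np.allclose(x, x_rec[:len(x)]))  # perfect reconstruction -> True
```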
After a finite-energy signal undergoes the wavelet transform, it is decomposed into a set of detail signals and approximation signals, and each sample point of every signal has a wavelet decomposition coefficient $\omega_{j,k}$. When the signal contains noise, the noise is decomposed along with the host signal. Each coefficient's absolute value is therefore compared against a threshold: if it exceeds the threshold, the coefficient is considered to come mainly from the signal, and this portion of wavelet coefficients is left untouched [21]. Conversely, if it falls below the threshold, the coefficient is considered to come from noise and is zeroed out or processed through a specific threshold function, which yields an estimate for this portion of wavelet coefficients to replace the original ones. Once all wavelet coefficients are processed, wavelet reconstruction is performed to achieve the denoising effect. The key to wavelet threshold denoising lies in finding an appropriate threshold function, also known as a threshold rule.
Conventional wavelet threshold functions may be broadly classified into three categories: hard, soft, and semi-soft [22]. Based on extensive experimentation and empirical analysis, this paper proposes improvements to the semi-soft threshold denoising method:
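For reference, one common form of the conventional semi-soft rule is sketched below (the paper's improved variant is developed in Eqs. (17)-(25)); the thresholds $\lambda_1 < \lambda_2$ and the function name are illustrative:

```python
import numpy as np

def semisoft_threshold(w, lam1, lam2):
    """Common semi-soft rule: zero below lam1, linear transition on
    (lam1, lam2], identity above lam2 (requires lam1 < lam2)."""
    out = np.zeros_like(w)
    mid = (np.abs(w) > lam1) & (np.abs(w) <= lam2)
    big = np.abs(w) > lam2
    out[mid] = np.sign(w[mid]) * lam2 * (np.abs(w[mid]) - lam1) / (lam2 - lam1)
    out[big] = w[big]
    return out
```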
Let the high-frequency signal coefficients be $Wa_{j,k}$; the noise standard deviation of each layer is then estimated as

$$\sigma_j = \frac{\operatorname{median}\left(\left|Wa_{j,k}\right|\right)}{0.6745} \quad (17)$$

Since the SNR of the real original signal varies, the threshold setting must also be adjusted to the current circumstances. The unified threshold formula found in the body of recognized literature [23] is

$$\lambda_{1,j} = \sigma_j\sqrt{2\ln N} \quad (18)$$

where $N$ is the number of wavelet coefficients in the layer.
After the signal is decomposed to scale $J$, $J$ groups of high-frequency signal coefficients are obtained. The wavelet coefficients of each group are arranged from small to large in absolute value, yielding a vector:
This vector is used to calculate the evaluation vector for the $J$-th group of wavelet coefficients, $R = [r_n]$, $1 \le n \le N$, where
The elements of the evaluation vector are then sorted from large to small, the minimum value is taken as the approximation error, and the corresponding wavelet coefficient $Wa_{j,m}$ is found. The threshold of the $J$-layer wavelet decomposition is then calculated from this wavelet coefficient as follows:
The threshold selection function of the J-layer wavelet decomposition is:
where $P_{a, j}$ is the average of the absolute values of the layer's wavelet coefficients, and $\rho_{N, j}$ is the minimum energy level of the wavelet coefficient vector. Their calculation formulas are as follows:
Since the wavelet coefficients identified above are believed to be the result of noise, it is necessary to restore the signal toward its original state. Wavelet reconstruction is the final step in achieving noise reduction: through a series of computations, the actual wavelet coefficient values are replaced by the estimated ones. A coefficient $\Gamma(\sigma_j)$ is introduced to reflect the noise intensity of the $j$-layer high-frequency wavelet signal. Its calculation formula is as follows:
In Eq. (24), $A_j$ represents the amplitude of the high-frequency detail coefficients of the $j$-layer wavelet decomposition.
The wavelet coefficient estimate is then calculated as:
The detailed steps of the improved wavelet threshold denoising algorithm are as follows. (1) The original signal $x(t)$ is discretized, and the low-pass filter $h$ and high-pass filter $g$ are set up. (2) A $J$-layer wavelet decomposition is performed, yielding the wavelet coefficients and the detail-signal amplitude $A_j$ of each layer. (3) The noise standard deviation $\sigma_j$ and noise intensity coefficient $\Gamma(\sigma_j)$ of each layer are calculated from the detail-signal coefficients of that layer. (4) The unified threshold $\lambda_{1, j}$ of each layer is calculated. (5) The adaptive threshold $\lambda_{a, j}$ of each layer, the average absolute value $P_{a, j}$ of the layer's wavelet coefficients, and the minimum energy level $\rho_{N, j}$ of the layer's wavelet coefficients are calculated. (6) The wavelet threshold of each layer is calculated according to Eq. (22), and the wavelet coefficients are adjusted to complete the threshold denoising at that scale. (7) Finally, wavelet reconstruction is carried out as in Fig. 4.
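A minimal end-to-end sketch of this flow is given below using the PyWavelets library (not named in the paper). Because the adaptive threshold $\lambda_{a,j}$ of Eqs. (19)-(23) is not fully reproducible from the text, the sketch substitutes the unified threshold of Eq. (18) with soft thresholding; the wavelet choice and function names are assumptions:

```python
import numpy as np
import pywt

def wavelet_denoise(x, wavelet="db4", level=4):
    """Per-layer wavelet threshold denoising (unified threshold used
    here in place of the paper's adaptive threshold)."""
    coeffs = pywt.wavedec(x, wavelet, level=level)   # [A_J, D_J, ..., D_1]
    out = [coeffs[0]]                                # keep the approximation signal
    for d in coeffs[1:]:
        sigma = np.median(np.abs(d)) / 0.6745        # noise std estimate, Eq. (17)
        lam = sigma * np.sqrt(2 * np.log(len(d)))    # unified threshold, Eq. (18)
        out.append(pywt.threshold(d, lam, mode="soft"))
    return pywt.waverec(out, wavelet)[:len(x)]       # reconstruct the denoised signal
```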
Taking this as the main idea, simulation experiments are conducted on existing data, with the results shown in Fig. 5. The experiments demonstrate that the improved wavelet threshold denoising described in this study is more effective at suppressing and removing noise while preserving the majority of the fault information, and the denoising results are better suited for subsequent fault diagnosis of the data.
Fig. 5. Signal noise reduction effect diagram.
4.2. Voiceprint Recognition Based on the SACNN Self-Attentive Model
The structure of the SACNN model proposed in Section 3 is shown in Table 1. As the table shows, SACNN is primarily composed of stacked Attentive ResBlock units with different output dimensions. The first convolutional layer, conv1, and the pooling layer, pool1, extract low-level features and reduce spatial dimensions to accelerate model training. They are followed by three groups of Attentive ResBlock units with different output dimensions: Attentive ResBlock256, Attentive ResBlock512, and Attentive ResBlock1024. It is noteworthy that the initial block of the final two groups has a stride of $2 \times 2$, while the remaining blocks have a stride of $1 \times 1$. This design applies a simple dimension reduction to the input feature maps to facilitate the extraction of highly overlapping features.
Table 1. SACNN model structure.
| Layer name | Structure | Step size | Output size |
| --- | --- | --- | --- |
| conv1 | $3\times3$, 64 | $1\times1$ | (None, 299, 64, 64) |
| maxpool1 | $3\times3$ | $2\times2$ | (None, 150, 32, 64) |
| Attentive ResBlock256 | $\begin{bmatrix} 1\times1, 64 \\ 3\times3, 64 \\ 1\times1, 256 \end{bmatrix} \times 2$ | $1\times1$, $1\times1$ | (None, 150, 32, 256) |
| Attentive ResBlock512 | $\begin{bmatrix} 1\times1, 128 \\ 3\times3, 128 \\ 1\times1, 512 \end{bmatrix} \times 3$ | $2\times2$, $1\times1$, $1\times1$ | (None, 75, 16, 512) |
| Attentive ResBlock1024 | $\begin{bmatrix} 1\times1, 256 \\ 3\times3, 256 \\ 1\times1, 1024 \end{bmatrix} \times 2$ | $2\times2$, $1\times1$ | (None, 38, 8, 1024) |
| average | - | - | (None, 1024) |
| fc1 | 1024 | - | (None, 1024) |
| fc2 | N | - | (None, N) |
The subsequent average layer transforms frame-level speaker embeddings into sentence-level speaker embeddings. The last fully connected layer, fc2, uses the SoftMax activation function to map the sentence-level representation to particular speaker identities. Batch normalization (BN) is applied throughout the structure before each ReLU activation. The model has approximately 5.5 million parameters and a size of 19 MB. The fc1 layer contains 1024 neurons, a number comparable to the total number of speakers in large-scale datasets, which makes the model more appropriate for such datasets even though the parameter count grows.
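As a rough illustration of the blocks in Table 1, the following PyTorch sketch implements one bottleneck residual block with an attention gate. The exact self-attentive wiring from Section 3 is not reproduced here, so a squeeze-and-excitation-style channel gate stands in for it; all names and the reduction ratio are illustrative:

```python
import torch
import torch.nn as nn

class AttentiveResBlock(nn.Module):
    """Bottleneck residual block (1x1 -> 3x3 -> 1x1) with a channel-attention
    gate standing in for the paper's self-attentive branch (assumption)."""
    def __init__(self, in_ch, mid_ch, out_ch, stride=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1, stride=stride, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        self.attn = nn.Sequential(          # SE-style channel-attention gate
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(out_ch, out_ch // 16, 1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch // 16, out_ch, 1), nn.Sigmoid(),
        )
        self.shortcut = nn.Sequential()     # identity skip by default
        if stride != 1 or in_ch != out_ch:  # projection skip when shapes differ
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )

    def forward(self, x):
        y = self.body(x)
        y = y * self.attn(y)                # reweight channels by attention
        return torch.relu(y + self.shortcut(x))
```

Stacking such blocks with the output dimensions and strides of Table 1, followed by global average pooling and the fc1/fc2 layers, would yield the overall SACNN topology.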
Some parameter settings used during training are shown in Table 2. In general, the training batch size and the number of data-loading threads are set according to the memory size and usage of the server. Here, batch_size is set to 128 so that GPU utilization exceeds 90% during training, which speeds up training.
Table 2. Experimental training parameter setting.
| Parameter name | Parameter value | Description |
| --- | --- | --- |
| batch_size | 128 | Batch size for training |
| epochs | 30 | Number of training rounds |
| learning_rate | 1e-3 | Initial learning rate |
| workers | 4 | Number of data-loading threads |
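These values map directly onto a data-loading configuration; a minimal PyTorch sketch, where `train_set` is an assumed Dataset yielding (spectrogram, speaker_id) pairs:

```python
from torch.utils.data import DataLoader

def make_loader(train_set):
    """Loader configured with the Table 2 values."""
    # batch_size=128 and workers=4 follow Table 2; pin_memory is an added
    # assumption that helps keep GPU utilization above 90%.
    return DataLoader(train_set, batch_size=128, shuffle=True,
                      num_workers=4, pin_memory=True)
```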
The experiments utilize two open-source datasets: TIMIT [24] and LibriSpeech [25]. Table 3 presents the data partitioning details. Before the experiments, the datasets are pre-processed with the wavelet denoising algorithm described above. The following paragraphs introduce each dataset separately.
Table 3. Data partitioning of the TIMIT and LibriSpeech datasets.

| Dataset name | Dataset class | Number of speakers |
| --- | --- | --- |
| TIMIT | Training set | 462 |
| TIMIT | Test set | 168 |
| LibriSpeech | Training set | 251 |
| LibriSpeech | Test set | 40 |
The TIMIT dataset consists of speech samples from 630 speakers across eight American English dialect regions, 438 male and 192 female, recorded at a sampling frequency of 16 kHz. Each speaker has ten speech samples of 2 to 3 seconds each. In the voiceprint recognition experiments in this section, the training set consists of 462 speakers, while the test set consists of 168 speakers.
LibriSpeech is a dataset of text and speech from audiobooks, with a total duration of 1000 hours and 2484 speakers. This chapter uses two of its subsets, train-clean-100 and test-clean: train-clean-100 includes 251 speakers, while test-clean includes 40.
This chapter utilizes all speakers in the test set for voiceprint identification experiments.
Half of the sentences from each speaker are used as the enrollment set, while the
remaining sentences are used as the evaluation set. For voiceprint verification experiments,
half of the speakers from the test set are used as enrolled speakers, and the other
half are used as non-enrolled speakers. The enrollment set consists of half of the
sentences from the enrolled speakers. In contrast, the evaluation set consists of
the remaining sentences concatenated with all the sentences from the non-enrolled
speakers. It is important to note that both experiments' evaluation sets are shuffled
to randomize the data distribution.
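A minimal sketch of the identification-protocol split described above (function and variable names are illustrative):

```python
import random

def split_enroll_eval(utts_by_speaker, seed=0):
    """Half of each speaker's sentences enroll the speaker; the remaining
    sentences form the shuffled evaluation set."""
    rng = random.Random(seed)
    enroll, evaluation = {}, []
    for spk, utts in utts_by_speaker.items():
        utts = list(utts)
        rng.shuffle(utts)
        half = len(utts) // 2
        enroll[spk] = utts[:half]                      # enrollment sentences
        evaluation += [(spk, u) for u in utts[half:]]  # evaluation sentences
    rng.shuffle(evaluation)                            # randomize distribution
    return enroll, evaluation
```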
The voice signal must be pre-processed before the experiment. The pre-processing step converts all audio into a single channel at a sampling frequency of 16 kHz, and then applies voice activity detection (VAD) to remove the silent parts of the audio signal. The traditional way to handle variable-length audio, followed in this article, is to crop it into fixed-size segments (such as 3-second segments), as shown in Fig. 6.
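A minimal sketch of this chain, assuming the librosa library (the paper does not name one) with an energy-based split standing in for VAD:

```python
import numpy as np
import librosa

def preprocess(path, sr=16000, seg_sec=3):
    """Mono 16 kHz load, silence removal, and fixed 3 s cropping."""
    y, _ = librosa.load(path, sr=sr, mono=True)      # single channel at 16 kHz
    intervals = librosa.effects.split(y, top_db=30)  # drop low-energy (silent) regions
    y = np.concatenate([y[s:e] for s, e in intervals])
    seg = sr * seg_sec
    return [y[i*seg:(i+1)*seg] for i in range(len(y) // seg)]
```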
4.3. Voiceprint Recognition Experiments Based on the SACNN Model
As mentioned in the preceding section, every audio file in the training and test sets underwent the following preprocessing steps. First, silence removal was performed to eliminate unnecessary noise. Then, the audio files were cropped into 3-second segments to ensure uniform duration. Subsequently, Fbank (filter-bank) features were extracted from the cropped audio segments to capture the spectral information. Finally, the extracted Fbank features were assembled into spectrograms of size (299, 40, 1); spectrograms are visual representations of the frequency content of the audio signals over time.
These spectrograms were saved as pickle files using pickle, a Python module that serializes objects by writing the resulting data stream into file objects.
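A sketch of this feature-extraction and serialization step, assuming librosa log-Mel (Fbank) features with a 25 ms window and 10 ms hop (values chosen, not taken from the paper, so that a 3 s segment yields 299 frames):

```python
import pickle
import numpy as np
import librosa

def fbank_to_pickle(segment, out_path, sr=16000):
    """40-dim log-Mel (Fbank) spectrogram of shape (299, 40, 1), pickled."""
    mel = librosa.feature.melspectrogram(
        y=segment, sr=sr, n_fft=400, hop_length=160, n_mels=40)
    feat = librosa.power_to_db(mel).T[:299]           # (299, 40), time-major
    feat = feat[:, :, np.newaxis].astype(np.float32)  # (299, 40, 1)
    with open(out_path, "wb") as f:
        pickle.dump(feat, f)                          # serialize to a pickle file
```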
Fig. 6. Voiceprint information preprocessing process.
Fig. 7. Loss graphs and accuracy on different data sets.
During the training phase, both the SACNN and CNN models were fed the spectrograms from the training set as inputs, with the objective of correctly identifying the speaker identity of each audio sample. The models were trained with the cross-entropy loss function, and weights were updated by error backpropagation using stochastic gradient descent with momentum. The resulting loss function curves for the models on the two datasets are depicted in Fig. 7. The two loss curves show that the model converges well on the training sets; although some datasets have fewer categories and therefore less total training data, the improved model still optimizes well during training.
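A minimal sketch of this objective and update rule (the momentum value is an assumption; `model` and `loader` are assumed to come from the sketches above):

```python
import torch
import torch.nn as nn

def train(model, loader, epochs=30, lr=1e-3, momentum=0.9):
    """Cross-entropy training with momentum SGD (momentum value assumed)."""
    criterion = nn.CrossEntropyLoss()                # speaker-classification loss
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum)
    for _ in range(epochs):                          # 30 epochs per Table 2
        for spec, speaker_id in loader:
            optimizer.zero_grad()
            loss = criterion(model(spec), speaker_id)  # forward pass + loss
            loss.backward()                            # error backpropagation
            optimizer.step()                           # SGD-with-momentum update
    return model
```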
Accuracy is the evaluation metric for the voiceprint recognition experiments; higher accuracy indicates better model performance. The Deep Speaker model is also used for comparison with the SACNN model proposed in this article; its results are obtained by running the model structure from the corresponding paper under the same experimental conditions. The experimental results on the LibriSpeech and TIMIT datasets are shown in Fig. 7.
Comparing the experimental results, the SACNN model converges well on the test datasets. Moreover, the accuracy of the SACNN model is significantly improved over that of the Deep Speaker model, reaching 98.35% and 98.43% on the two datasets, which is 1.12% and 1.24% higher, respectively, than the Deep Speaker model on the same datasets.