2.1. Bidirectional Encoder Representations from Transformers
The Transformer is built from self-attention mechanisms and feedforward neural networks. The self-attention mechanism focuses on the current token while taking its surrounding context into account, significantly enhancing the model's ability to comprehend context [10, 11]. The Transformer adopts an encoder-decoder structure, with each side composed of 6 identical stacked layers. Each encoder layer contains two sub-layers: the first is multi-head attention and the second is a feedforward neural network; residual connections and layer normalization are applied around each sub-layer, and the output dimension is 512. The decoder structure is similar: its first sub-layer is masked multi-head attention, its remaining two sub-layers correspond to those of the encoder, and it likewise uses residual connections and layer normalization [12].
BERT is a pre-trained model based on the Transformer architecture. Through self-supervised learning on massive amounts of unlabeled text, it acquires rich representations of language knowledge, and it is particularly strong at understanding context-sensitive vocabulary and analyzing complex sentence meanings.
BERT applies the attention mechanism within modern deep learning to improve the performance and alignment quality of machine translation [13]. The constituent elements of the Source can be regarded as a series of $\langle Key_i, Value_i \rangle$ data pairs. Given a specific Query item in the Target, the similarity between the Query and each Key is computed to obtain the weight coefficient of the corresponding Value, and the Values are then weighted and summed. The formula is shown in Eq. (1).
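Assuming the standard Key-Value attention formulation implied by the definitions that follow, Eq. (1) presumably takes the form

$$\text{Attention}(Query, Source) = \sum_{i=1}^{L} \text{Similarity}(Query, Key_i) \cdot Value_i \tag{1}$$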
The Similarity term measures the correlation between the Query vector (Query) and the Key vector (Key). Value represents the information content associated with each Key, i.e., a vector that stores specific information; L is the length of the input sentence, Query is the word sequence in the Target, and i (i = 1, 2, ..., n) indexes the positions of the sequence. Source refers to the semantic coding corresponding to each word in the input sentence, so the result obtained is more accurate [14]. The calculation of the attention mechanism proceeds roughly as follows: first, the weight coefficients are computed from the Query and the Keys, a process that can be subdivided into calculating the similarity or correlation between the Query and each Key and then normalizing the results; finally, the Values are weighted and summed according to the weight coefficients.
In the first stage, the dot product of the two vectors, the cosine similarity of the two vectors, or a neural network is used to calculate the similarity between the Query and each Key. The formulas are shown in Eqs. (2)-(4), where MLP denotes a multi-layer perceptron.
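Under this reading, Eqs. (2)-(4) presumably correspond to the three standard similarity measures:

$$\text{Sim}(Query, Key_i) = Query \cdot Key_i \tag{2}$$

$$\text{Sim}(Query, Key_i) = \frac{Query \cdot Key_i}{\|Query\|\,\|Key_i\|} \tag{3}$$

$$\text{Sim}(Query, Key_i) = \text{MLP}(Query, Key_i) \tag{4}$$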
Query represents the characteristics of the content to be found or retrieved from the information set. Key is matched against the Query and determines which parts of the information set the Query is related to. In the second stage, the SoftMax function is introduced to numerically normalize the calculation results of the previous stage, using formula (5), in which SoftMax is the activation function, e denotes the exponential, $A_i$ is the resulting weight coefficient, and L is the input sentence length. In the third stage, each $Value_i$ is weighted by its corresponding coefficient $A_i$ and the results are summed, as shown in formula (6).
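Assuming the usual SoftMax normalization and weighted sum, formulas (5) and (6) presumably read

$$A_i = \text{SoftMax}(\text{Sim}_i) = \frac{e^{\text{Sim}_i}}{\sum_{j=1}^{L} e^{\text{Sim}_j}} \tag{5}$$

$$\text{Attention}(Query, Source) = \sum_{i=1}^{L} A_i \cdot Value_i \tag{6}$$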
Here, each $Value_i$ is the value vector scaled by its weight coefficient $A_i$. The attention mechanism is simple to compute, has few parameters, and runs fast, which solves the problem that an RNN cannot be computed in parallel [15, 16]. Because it is not constrained by distance, it effectively reduces the dependency complexity between the source sequence and the target sequence, and empirical results show that models built on it perform very well. The key component of BERT is the self-attention mechanism. Self-attention first computes three new vectors, Query (Q), Key (K), and Value (V), all of the same dimension. It then obtains the representation of each word by adjusting the matrix of association coefficients between the words of a sentence, and the formula is shown in Eq. (7).
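Given the dimensions defined below, Eq. (7) is presumably the standard scaled dot-product self-attention:

$$\text{Attention}(Q, K, V) = \text{SoftMax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \tag{7}$$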
Here, Q, K, and V are the input vector matrices of the characters; the dimensions of Q and K are both $d_k$, and the vector dimension of V is $d_v$. BERT's multi-head attention mechanism projects Q, K, and V through linear transformations, applies scaled dot-product attention, and concatenates the different self-attention results to extract sentence semantics. The formulas are shown in Eqs. (8)-(9).
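Assuming the standard multi-head formulation, Eqs. (8)-(9) presumably read

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\,W^{O} \tag{8}$$

$$\text{head}_i = \text{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V}) \tag{9}$$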
Here, $W_i^Q \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^K \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^V \in \mathbb{R}^{d_{model} \times d_v}$, and $W^O \in \mathbb{R}^{h d_v \times d_{model}}$, where h is the number of attention heads, i is the head index, and $W^O$ is the output weight matrix. Regarding probability calculation, the probability of the next word appearing is computed from left to right [17]. Given training sentences composed of words $w_1, w_2, \ldots, w_m$, a language model that assigns occurrence probabilities is obtained through neural network training, as shown in formula (10).
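Assuming the standard left-to-right factorization of sentence probability, formula (10) presumably reads

$$p(S) = p(w_1, w_2, \ldots, w_m) = \prod_{i=1}^{m} p(w_i \mid w_1, \ldots, w_{i-1}) \tag{10}$$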
Here, p denotes the probability that the next word appears, S denotes the input sequence, and the $w_i$ are the words of the sequence. Traditional language models are static and cannot represent word meaning and grammar dynamically. Pre-trained models such as ELMo, GPT, and BERT come pre-loaded with semantic information, and large-scale corpora are used to enhance word-meaning expression, improve model robustness, and avoid repeated training from scratch [18, 19]. BERT is based on the Transformer's encoder, which gives it a deeper model structure, efficient context capture, and significantly enhanced feature extraction capabilities [20]. BERT training is divided into two stages, pre-training and fine-tuning, and its structure is shown in Fig. 1. Pre-training includes a masked language model and next-sentence prediction, and the fine-tuning stage is adapted to specific tasks.
The pre-training of BERT includes a masked language model (MLM) and next sentence prediction (NSP) [21]. In the masking task, 15% of words are randomly selected for masking; of these, 80% are replaced with [MASK], 10% remain unchanged, and 10% are replaced with random words. The Chinese dataset uses whole-word masking. The NSP task predicts whether two sentences are consecutive and uses [CLS] and [SEP] to mark the sentences: if they are consecutive, the [CLS] position outputs IsNext; otherwise, it outputs NotNext. The BERT input contains [CLS] and [SEP] tags to identify the beginning and end of the sentence. The input comprises token, sentence, and position vectors; the sentence vectors assist the NSP task, the position vectors supplement sequential information, and the model learns sequence features through absolute position encoding.
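As a minimal illustration of this masking protocol (a word-level sketch only; BERT itself operates on WordPiece sub-tokens, and the function and variable names here are illustrative rather than taken from this work):

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, vocab, select_prob=0.15):
    """Select ~15% of tokens; of these, replace 80% with [MASK],
    10% with a random word, and leave 10% unchanged."""
    masked, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < select_prob:          # token chosen for prediction
            labels[i] = tok                        # the model must recover this token
            r = random.random()
            if r < 0.8:
                masked[i] = MASK_TOKEN             # 80%: replace with [MASK]
            elif r < 0.9:
                masked[i] = random.choice(vocab)   # 10%: replace with a random word
            # remaining 10%: keep the original token
    return masked, labels

tokens = "the model learns bidirectional context from unlabeled text".split()
masked, labels = mask_tokens(tokens, vocab=tokens)
```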
2.2. Bidirectional Long Short-term Memory
BiLSTM is a variant of the recurrent neural network that alleviates its gradient problems. It uses context efficiently, mines semantics deeply, reduces workload, improves entity recognition accuracy, and is widely used for extracting semantic information from text [22, 23]. Internally, BiLSTM operates in three stages: the forgetting stage selectively discards old information through the forget gate; the memory stage stores new information through the input gate and updates the cell state; and the output stage determines the output and activates the state information. A unidirectional LSTM processes information in only one direction, whereas the bidirectional LSTM is better suited to processing textual context. The structure of BiLSTM can be represented by the following formulas (11)-(12):
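Given the gate definitions that follow, formulas (11)-(12) presumably take the standard LSTM form

$$f_t = \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right) \tag{11}$$

$$i_t = \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right) \tag{12}$$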
where $f_t$ denotes the forget gate, $i_t$ denotes the input gate, $\sigma$ is the sigmoid activation function, $W_f$ and $W_i$ are the weight matrices of the forget gate and the input gate respectively, $h_{t-1}$ is the hidden state at time t − 1, $x_t$ is the current input, and $b_f$ and $b_i$ are the bias terms. The LSTM stores key information through its gating mechanism and forgets unimportant parts. Compared with the single storage state of an RNN, the LSTM can capture information dynamically and has a stronger memory. BiLSTM deeply mines long-distance textual information through its memory units and control gates, mitigates gradient vanishing, and has significant application value in information retrieval, automatic question answering, and knowledge graph construction.
2.3. BERT-BiLSTM Fusion Mechanism
The challenges we face include the increased difficulty of training and inference caused by the model's high complexity, the limited scale of training data, which may constrain improvements in the system's generalization ability, and the large consumption of computing resources during training and inference. In response to these limitations, we explore building more efficient model architectures, leveraging larger and more diverse training datasets, and optimizing the use of computing resources. The technical route of BERT-BiLSTM integration shows distinct strengths and strong potential. It combines the advantages of two deep learning models, markedly improving translation quality and speeding up the translation process, making it well suited to the needs of modern communication [24, 25].
In English phrase translation, BERT can keenly capture the specific contextual meaning
behind each word, thus avoiding ambiguity and misunderstanding caused by literal translation
and ensuring the accuracy of the translation. Meanwhile, BiLSTM, with its unique bidirectional
memory unit, can simultaneously retain past and future contextual information in sequence
data processing, which is crucial for correctly identifying and transforming grammatical
structures in phrases [26,
27]. In English phrase translation specifically, BiLSTM can effectively track the chain of relationships between words and help the model understand complex structures, such as attributive clauses and non-finite verbs, so that the final translation better conforms to grammatical norms and reads naturally and fluently.
Integrating BERT and BiLSTM balances deep semantic mining with fine-grained grammatical control [28]. In a typical implementation, BERT's powerful pre-training is used to obtain high-dimensional semantic features of the source-language phrases. These features are then fed into the BiLSTM network, which reorganizes them using its strength in sequence processing to generate the corresponding expressions in the target language. This process makes full use of BERT's advantages in word-level and sentence-level understanding and gives full play to BiLSTM's expertise in grammatical reconstruction, contributing to a significant improvement in translation quality.
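This pipeline can be sketched as follows; the sketch is a minimal PyTorch illustration assuming the HuggingFace transformers BertModel, with the model name, hidden size, and the omission of the target-language decoder chosen purely for illustration rather than taken from this work:

```python
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class BertBiLSTMEncoder(nn.Module):
    """BERT supplies contextual features for the source phrase; a BiLSTM
    re-encodes them before a downstream decoder/classifier (not shown)."""
    def __init__(self, bert_name="bert-base-uncased", lstm_hidden=256):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)
        self.bilstm = nn.LSTM(
            input_size=self.bert.config.hidden_size,   # 768 for BERT-Base
            hidden_size=lstm_hidden,
            batch_first=True,
            bidirectional=True,
        )

    def forward(self, input_ids, attention_mask):
        # BERT's last hidden states: (batch, seq_len, hidden_size)
        features = self.bert(input_ids=input_ids,
                             attention_mask=attention_mask).last_hidden_state
        # BiLSTM re-encoding: (batch, seq_len, 2 * lstm_hidden)
        output, _ = self.bilstm(features)
        return output

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoder = BertBiLSTMEncoder()
batch = tokenizer(["machine translation of English phrases"], return_tensors="pt")
hidden = encoder(batch["input_ids"], batch["attention_mask"])
```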
The BERT-BiLSTM fusion solution also brings additional efficiency gains. Since many common language patterns are already accumulated during BERT's pre-training phase, often only a small amount of fine-tuning is needed for a specific translation task, significantly reducing the data and computing resources required for model training. Coupled with the fast response of BiLSTM when processing fixed-length inputs, this can effectively improve the operating efficiency of the overall system.