Firstly, the study discusses facial expression recognition and analyzes traditional CNN architectures. On this basis, the self-cure network (SCN) is introduced. After describing the modules of this network, a calibration strategy (CS) is further introduced for adaptive threshold adjustment, and a new facial expression recognition model is proposed. In addition, the study constructs an emotional space and emotional transfer paths, then discusses emotional motivation and fading compensation separately, proposing an emotional interaction algorithm. Finally, a new emotional interaction model is obtained by combining this algorithm with the facial expression recognition model.
3.1 Facial Expression Recognition Algorithm Based on Improved Convolutional Neural
Network
Due to the complexity and diversity of facial expressions, facial expression recognition places high demands on both accuracy and robustness. As a classic foundational network, the CNN has given rise to various new recognition algorithms that build on its convolutional characteristics. A typical CNN consists of multiple modules, namely the convolutional layer, the pooling layer, and the fully connected layer [14]. The convolution operation process is shown in Fig. 1.
Fig. 1. Convolutional operation process.
From Fig. 1, the convolution kernel first multiplies a local region of the input data element-wise. The kernel then slides across the input according to its position, and the output values computed at each position sequentially form a two-dimensional tensor, namely the feature map [15]. The convolution operation of the convolutional layer can extract different feature information, such as texture, color, and edges. The pooling layer compresses the feature map spatially, reducing its dimension while preserving feature information.
The pooling operation is shown in Fig. 2.
Fig. 2. Pooling operation.
In Fig. 2, by selecting the maximum, minimum, or average value within a specific region, pooling reduces both the computational complexity and the number of network parameters.
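To make the two operations concrete, the following minimal NumPy sketch applies a single convolution kernel and 2 x 2 max pooling to a toy input; the kernel values, input size, and stride are illustrative assumptions, not part of the model described here.

import numpy as np

def conv2d(x, kernel):
    """Slide a kernel over x, multiplying element-wise and summing
    at each position to build the 2-D feature map (stride 1, no padding)."""
    kh, kw = kernel.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool2d(x, size=2):
    """Keep the maximum value in each size x size region,
    compressing the feature map spatially."""
    oh, ow = x.shape[0] // size, x.shape[1] // size
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = x[i * size:(i + 1) * size, j * size:(j + 1) * size].max()
    return out

x = np.random.rand(8, 8)                 # toy single-channel "image"
edge_kernel = np.array([[1., 0., -1.],
                        [1., 0., -1.],
                        [1., 0., -1.]])  # illustrative edge-detecting kernel
fmap = conv2d(x, edge_kernel)            # 6 x 6 feature map
pooled = max_pool2d(fmap)                # 3 x 3 map after 2 x 2 max pooling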
However, traditional CNNs have some limitations in facial expression recognition tasks. Firstly, the different facial regions and the dynamic changes of expressions play a crucial role in expression, yet traditional CNNs cannot model this information well. Secondly, due to the uneven distribution of training datasets and the similarity between categories, traditional CNNs are easily affected by sample bias in facial expression recognition tasks, resulting in reduced recognition accuracy [16]. In view of this, the SCN, which evolves from the CNN, is introduced. Through dynamic learning and adaptive adjustment, it repairs internal damage or errors while maintaining the normal operation of the network [17]. Dynamic learning manifests as the network detecting damage or errors and using this information to update its own configuration to fix the problem. Adaptive adjustment manifests as the ability of the SCN to adjust its behavior based on new inputs and feedback, such as reallocating resources, reacquiring routing information, or adjusting parameter configurations. When the network detects damage or errors, it can thus self-repair: by reconfiguring its own connections and weights, the network returns to its normal state. The network structure of the SCN is shown in Fig. 3.
Fig. 3. SCN architecture diagram.
In Fig. 3, the entire SCN structure consists of three main modules: the self-attention module, the regularization module, and the noise labeling module. Firstly, feature extraction is performed on the image. Secondly, the importance of each part is weighted using the self-attention weighting module. Then, the weighted features are sorted using the regularization module, and in each sorting pass the average weight is used as the regularization threshold [18]. Finally, the noise labeling module labels the highly important features for subsequent screening and training. The importance weighting process of the self-attention module is shown in equation (1).
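Following the standard SCN formulation and the symbol definitions below, equation (1) takes the form

$\alpha_{i} =\sigma \left(W_{\alpha}^{T} x_{i} \right)$ (1)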
In equation (1), $\sigma$ represents the Sigmoid function. $W_{\alpha}^{T}$ represents the transpose of the parameters of the attention layer. $\alpha_{i}$ represents the importance of the $i$-th sample. $x_{i}$ denotes the $i$-th element of the input sequence. The self-attention weighting module calculates the importance scores of each part of the image, such as pixels or regions. The Sigmoid function maps these scores to between 0 and 1, which assigns weights to different parts of the input image. The main processing method of the regularization module is weight ranking, which divides samples into high-weight and low-weight feature groups and enforces a clear separation between the two types of data [19]. The process is shown in equation (2).
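Consistent with the definitions below and with the standard SCN rank-regularization loss, equation (2) can be written as

$L=\max \left\{0,\; \delta_{1} -\left(\alpha_{H} -\alpha_{L} \right)\right\}$ (2)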
In equation (2), $L$ represents the regularization function. $\alpha_{H}$ and $\alpha_{L}$ represent the average weight values of the high-weight and low-weight groups, respectively. $\delta_{1}$ represents a fixed hyper-parameter. Noise labeling is used to determine, in the form of labels, the threshold for each group of data; the threshold, in turn, is determined by a combination of performance metrics such as accuracy and recall after model training. If the predicted probability exceeds the threshold, the sample is adjusted to a high-weight feature sample. The process is shown in equation (3).
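A relabeling rule consistent with the definitions below (with $l'$ denoting the resulting label, a symbol introduced here for readability) is

$l'=\begin{cases} l_{\max}, & P_{\max} -P_{gtInd} >\delta_{2} \\ l_{org}, & \text{otherwise} \end{cases}$ (3)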
In equation (3), $\delta_{2}$ represents a predetermined threshold. $l_{\max}$ and $l_{org}$ represent the label with the maximum predicted probability and the original label, respectively. $P_{\max}$ and $P_{gtInd}$ represent the maximum prediction probability and the prediction probability of the given label, respectively. Although the SCN can effectively suppress sample uncertainty, repeatedly using a predetermined threshold weakens the noise labeling, thereby reducing the robustness of the network. In view of this, the CS is introduced to reduce the impact of uncertain samples in model training [20]. Compared with other methods, the advantage of the CS is that it improves the performance and robustness of the model in tasks such as facial expression recognition by calculating sample calibration weights and resetting the loss function, and by applying the calibration weights during the attention weighting process. The strategy can thus be roughly divided into two directions: the first is to calculate the sample calibration weights and reset the loss function based on them; the second is to assign the calibration weights to the attention weighting process. The thresholds are adjusted in a timely manner by monitoring the performance of the model on the validation or test set. This process is shown in equation (4).
In equation (4), $P_{gtInd}$ represents the importance weighting of the optimized $i$-th sample, while the remaining symbols are as defined above. The regularization process tends to produce smaller losses when predicting images with higher weights; images with smaller weights, however, incur significant losses when predicted incorrectly [21]. Regarding this issue, the CS is used to optimize the regularization process: the CS function replaces the regularization function, as shown in equation (5).
In equation (5), $L_{cs}$ represents the CS function and $M$ represents the number of incorrect samples. In the noise labeling process, there may also be differences in weight values within the groups of lower importance, so it is unreasonable to reuse the fixed threshold for judgment. Moreover, the CS can dynamically adjust the threshold according to the actual situation, so it adapts better to different data distributions and model performance and improves the accuracy and reliability of the noise labeling. Therefore, the threshold for noise labeling is also replaced, as shown in equation (6) [22].
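Replacing the fixed threshold $\delta_{2}$ in equation (3) with the calibrated threshold gives a rule of the form

$l'=\begin{cases} l_{\max}, & P_{\max} -P_{gtInd} >\delta_{2} \cdot cw \\ l_{org}, & \text{otherwise} \end{cases}$ (6)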
In equation (6), $\delta_{2} \cdot cw$ represents the calibrated sample weight threshold, while the remaining symbols are as defined above. On the basis of optimizing and improving the above modules, the final SCN-CS facial expression recognition model is proposed, as shown in Fig. 4.
In Fig. 4, the overall framework of the recognition model is still based on the SCN, with updates made to each part. Firstly, the student facial images are input and subjected to initial convolution operations using a CNN, which decomposes them into multi-layer features. Secondly, the SCN algorithm performs the self-attention importance weighting, regularization ranking, and noise labeling operations on these features. After completion, the CS performs threshold replacement for the above three steps, recalculates the results, and finally outputs them. The model effectively improves the robustness of the plain SCN algorithm: through the calibration strategy and dynamic threshold adjustment, the model can better adapt to changes in the data, automatically coping with different data distributions and noise conditions during training, which improves its robustness and performance in facial expression recognition under data fluctuations.
Fig. 4. SCN-CS model structure.
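For illustration only, the following NumPy sketch mirrors one simplified SCN-CS training step as described above; the function name, the split into equal halves, and the default values of $\delta_{1}$, $\delta_{2}$, and the calibration weight cw are assumptions for exposition, not the authors' implementation.

import numpy as np

def scn_cs_step(features, labels, probs, w_alpha, delta1=0.15, delta2=0.2, cw=1.0):
    """One simplified SCN-CS pass over a mini-batch (illustrative only).

    features : (B, D) extracted facial features
    labels   : (B,)  integer class labels
    probs    : (B, C) softmax predictions of the backbone
    w_alpha  : (D,)  attention-layer parameters
    cw       : calibration weight rescaling the relabeling threshold
    """
    # 1. Self-attention importance weighting, eq. (1): alpha_i = sigmoid(w^T x_i)
    alpha = 1.0 / (1.0 + np.exp(-(features @ w_alpha)))

    # 2. Rank regularization, eq. (2): sort weights, split into high/low
    #    groups, penalize if the group means are not separated by delta1
    order = np.argsort(-alpha)
    half = len(alpha) // 2
    a_high = alpha[order[:half]].mean()
    a_low = alpha[order[half:]].mean()
    rank_loss = max(0.0, delta1 - (a_high - a_low))

    # 3. Noise relabeling with the calibrated threshold delta2 * cw, eq. (6)
    p_max = probs.max(axis=1)
    p_gt = probs[np.arange(len(labels)), labels]
    relabel = (p_max - p_gt) > delta2 * cw
    new_labels = np.where(relabel, probs.argmax(axis=1), labels)

    return alpha, rank_loss, new_labels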
3.2 Construction of an Interactive Model Based on Emotional Compensation and Motivation
After constructing the facial expression recognition model, it is still difficult to provide targeted assistance for virtual online classroom teaching from the recognition results alone [23]. Human emotions are rich and varied. According to dimensional analysis, human emotions can be divided into six basic categories: happiness, sadness, surprise, disgust, anger, and fear. Based on these six basic emotions, a four-direction emotional space is constructed, as shown in Fig. 5 [24].
Fig. 5. Four-direction emotion 3D model.
In Fig. 5, emotions in the three-dimensional space are divided into eight parts along the three coordinates. Excitement, joy, happiness, and relaxation are positive emotions, while calmness, depression, tension, and anger are negative emotions. Positive emotions have a motivating effect on students' emotions in class, while negative emotions have an inhibiting effect. The emotional space expression in this state is shown in equation (7).
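One expression consistent with the definitions below is the pair of sets

$S=\left\{s_{1} ,s_{2} ,\cdots ,s_{N} \right\},\quad A=\left\{A_{1} ,A_{2} ,\cdots ,A_{N} \right\}$ (7)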
In equation (7), $S$ represents the set of emotions. $s_{1} $, $s_{2} $ represent basic emotions.
$A$ represents the probability set of a certain emotional state. $A_{1} $, $A_{2}
$ represent the probability of emotional states. The probability of one emotion is
shown in equation (8).
In equation (8), $A_{i}$ denotes the probability of the $i$-th emotion, and $N$ denotes the number of basic emotional states.
Although emotions cannot be directly observed, preliminary judgments can be made by processing expression features through dedicated means, such as the Hidden Markov Model (HMM). Compared with other models of the same type, the HMM can calculate emotion transition probabilities by building emotion sequences, thus effectively capturing the dynamic changes in sequential data, which makes it suitable for describing the evolution of emotions over time. Therefore, the study utilizes the HMM to construct the relational expressions of the above eight emotions, as shown in Fig. 6.
Fig. 6. HMM of 8 emotional states.
According to Fig. 6 combined with equation (8), the transition probability between emotions of the same category is higher, while the transition probability between emotions of different categories is lower. Therefore, to optimize the transfer between emotional states, the study uses the HMM to convert the transition conditions into a probability evaluation problem [25]. The forward-backward algorithm is used to solve this probability evaluation problem [26]. The calculation of the forward and backward variables is shown in equation (9).
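Consistent with the standard forward-backward recursions (where $a_{ij}$ and $b_{j}(\cdot)$ denote, under the usual HMM convention, the state transition and observation probabilities), equation (9) can be written as

$\alpha_{t} (j)=\left[\sum_{i=1}^{N}\alpha_{t-1} (i)\, a_{ij} \right]b_{j} (o_{t} ),\quad \beta_{t} (i)=\sum_{j=1}^{N}a_{ij}\, b_{j} (o_{t+1} )\, \beta_{t+1} (j)$ (9)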
In equation (9), $\lambda$ represents the parameters of the HMM emotion model. $O$ represents the observation sequence. $\alpha_{T}$ and $\beta_{T}$ represent the forward and backward variables at time $T$. After continuous iteration, the parameters of the model tend to become rational. The judgment of the optimal observation sequence is shown in equation (10).
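One reading consistent with the definitions below is the convergence test

$\left|P\left(O\mid \lambda \right)-P\left(O\mid \lambda_{0} \right)\right|<\varepsilon$ (10)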
In equation (10), $\varepsilon$ represents the threshold and $\lambda_{0}$ represents the model parameters after multiple iterations. If the probability result satisfies the above equation, the emotional transition probability is output. Conscious stimulation is used to regulate emotions and stabilize the learning state; the study defines this stimulation as motivation and dilution compensation [27]. Incentives act on students' emotional states through motivational factors, thereby achieving a positive guiding effect. Incentive factors can be further divided into four categories, namely reducing learning difficulty, removing difficult problems, enhancing classroom fun, and improving teaching quality [28]. For the convenience of subsequent calculations, empirical probabilities are used to express the relation between these four incentive factors and the transition state probabilities, as shown in equation (11).
In equation (11), $\tau$ represents an empirical constant. $A_{k}(i,j)$ represents the transition probability from state $i$ to state $j$. $I_{k1}, I_{k2}, I_{k3}, I_{k4}$ represent the four types of incentive factors, respectively. In addition, influenced by external stimuli, students' emotions may also undergo a fading process, as shown in equation (12).
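One formulation consistent with the symbol definitions below, reading the rate of change $d$ as the time derivative, is the exponential fading law

$\frac{dE(t)}{dt} =-\psi \left(E(t)-E\right)$ (12)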
In equation (12), $E$ represents the emotion in the ideal state. $E(t)$ represents the emotional state
at moment $t$. $d$ represents the rate of change. $\psi $ represents the emotional
dilution factor, which can directly reflect the rate of emotional dilution in students.
Based on the above emotional state description, an Emotional Compensation and Encouragement Algorithm (ECEA) is proposed on top of the facial expression recognition model [29]. This algorithm projects student emotions onto the emotion axis through dimension reduction for judgment. The region on the emotion axis that satisfies the optimal emotion condition is defined as the target emotion region, as shown in equation (13).
In equation (13), $TEA$ represents the target emotional region. $\pi $ represents the emotional state
model. $C$ represents the dimension reduction coefficient of emotions. The condition
for adopting emotional motivation strategies is shown in equation (14).
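A condition consistent with the description below is

$P_{TEA} -O_{TEA} \ge \varphi$ (14)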
In equation (14), $P_{TEA}$ represents the optimal emotional region and $O_{TEA}$ represents the current emotional region. $\varphi$ represents a constant on the emotion axis, with $\varphi \in \{0, 1, 2, \cdots, n\}$. The interaction model actively adopts emotional motivation strategies only when the student's current emotional state falls below the optimal emotional state. In summary, the final Emotional Compensation and Encouragement Model (ECEM) is proposed; the interaction process of the model is shown in Fig. 7 [30].
Fig. 7. ECEM model interaction flow.
In Fig. 7, the entire process is roughly divided into four parts. Firstly, the image is recognized and classified using the facial expression recognition model. Secondly, the recognition results are input into the interaction model and the initialization parameters are set. The formula then determines whether the current emotion lies in the target emotion region: if it does, no motivational factor is needed, i.e., the motivational factor is 0, and the current emotion is output. If the target emotion region is not met, the parameters of the current emotion are re-initialized and the probability of the next possible emotion is re-estimated through the calculation formula. Finally, the result is checked against the optimal emotional region on the emotion axis: if it meets the optimal emotional region, the incentive factor is output; if not, the process returns, the parameters are re-initialized, and the optimal emotional region is recalculated until the condition is satisfied. In summary, this model can achieve emotional monitoring and interaction for students during remote virtual teaching, improving teaching quality.
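As a closing illustration, the following self-contained Python sketch walks through the four parts of this flow for a single recognized emotion; the axis scale, the thresholds, the numeric incentive "boost" values, and the update rule are all invented for this sketch and are not the authors' implementation.

# Illustrative stand-ins for the four incentive factors named in the text;
# the numeric boost values are invented for this sketch only.
INCENTIVES = {
    "reduce_learning_difficulty": 0.15,
    "remove_difficult_problems": 0.10,
    "enhance_classroom_fun": 0.20,
    "improve_teaching_quality": 0.12,
}

def ecem_step(current_emotion, tea_low=0.6, tea_high=1.0, max_iters=10):
    """One illustrative ECEM interaction step.

    current_emotion: recognized emotion projected onto the 1-D emotion
    axis (0 = most negative, 1 = most positive); thresholds are assumed.
    Returns the final emotion and the incentive factor used (None if the
    student was already in the target emotion region).
    """
    # Part 3 of the flow: already in the target emotion region (TEA),
    # so the motivational factor is 0 and the current emotion is output.
    if tea_low <= current_emotion <= tea_high:
        return current_emotion, None

    # Part 4: re-estimate the emotion under the strongest incentive until
    # it enters the optimal region or the iteration budget is exhausted.
    emotion, factor = current_emotion, None
    for _ in range(max_iters):
        factor, boost = max(INCENTIVES.items(), key=lambda kv: kv[1])
        emotion = min(1.0, emotion + boost * (1.0 - emotion))  # damped rise
        if tea_low <= emotion <= tea_high:
            break
    return emotion, factor

# Example: a student recognized as mildly negative (0.3 on the axis).
final_emotion, factor = ecem_step(0.3)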