3.1 Pattern Recognition based on the Gaussian Mixture Model
At present, the conclusions drawn from analyzing the learning behavior data of individual college students are mostly one-sided. Online learning differs from in-class learning in that it covers a wider range of topics and involves more students; as the number of samples expands, analyzing the behavioral data of student groups therefore becomes more meaningful than analyzing individuals. To better understand the characteristics of students' group learning, this study first defines the potential learning group, the learning mode, and learning motivation. The learning group refers to individuals with similar learning behaviors. The learning mode represents the common learning characteristics of a potential online learning group. Learning motivation refers to the students' intention and desire to learn independently. Table 1 summarizes the online learning behavior data of students in different courses at one university.
As shown in Table 1, the online learning behavior data selected for this study include the number of
students, homework exercises, unit tests, videos watched, discussions, and weeks of
learning. Considering that data defects and noise will interfere with the identification
model, these data need to be transformed and cleaned in advance. The purpose of data
transformation is to delete invalid data and unify the dimensions of different data.
Data cleaning in this study used the mean method (mean imputation) to fill in missing values.
Based on the above data, this study divides the student learning behavior data into two kinds of features: learning effort and learning harvest. The effort feature corresponds to the students' learning behavioral effort, and the harvest feature corresponds to the learning effect. Eq. (1) is the mathematical expression of learning behavioral effort.
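Based on the definitions that follow, a plausible form of Eq. (1) is the weighted sum of the weekly amounts of the individual behaviors:

\[ effort^{w}=\sum_{i=1}^{n}a_{i}\,ef_{i}^{w} \]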
In Eq. (1), $effort^{w}$ represents the amount of learning behavioral effort in a week, $a_{i}$
represents the weight of the $i$-th learning behavior calculated by the Pearson correlation
coefficient method, $ef_{i}^{w}$ represents the amount of effort a student devotes to the $i$-th learning behavior in week $w$, and $n$ is the total number of student learning behavior
categories. Eq. (2) is the mathematical expression of learning harvest.
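A plausible reading of Eq. (2) is a min-max normalization of the weekly harvest; the raw weekly score, written here as the hypothetical quantity $score^{w}$, is an assumption of this reconstruction:

\[ effect^{w}=\frac{score^{w}-effect_{\min }^{w}}{effect_{\max }^{w}-effect_{\min }^{w}} \]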
In Eq. (2), $effect^{w}$ represents the learning harvest the student gained in week $w$, while $effect_{\max}^{w}$ and $effect_{\min}^{w}$ represent the maximum and minimum harvest values, respectively. Eq. (3) is the mathematical expression of the Pearson correlation coefficient.
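A plausible form of Eq. (3) is the Pearson correlation written in terms of standard scores; the second variable $Y$ (with mean $\overline{Y}$ and standard deviation $\delta_{Y}$) and the sample count $m$ are symbols introduced here:

\[ \rho_{X,Y}=\frac{1}{m-1}\sum_{i=1}^{m}\left(\frac{X_{i}-\overline{X}}{\delta_{X}}\right)\left(\frac{Y_{i}-\overline{Y}}{\delta_{Y}}\right) \]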
In Eq. (3), $X_{i}$ represents sample data; $\left(\frac{X_{i}-\overline{X}}{\delta _{X}}\right)$
represents the standard score of the sample data, in which $\overline{X}$ represents
the mean value of the sample data, and $\delta _{X}$ is the standard deviation of
the sample data. Eq. (4) is the learning efficiency calculation.
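Consistent with the description of learning efficiency as the ratio of weekly effort to weekly gain, a plausible form of Eq. (4) is:

\[ ratio^{w}=\frac{effort^{w}}{effect^{w}} \]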
In Eq. (4), $ratio^{w}$ represents the weekly learning efficiency of each student taking a certain
course. Learning efficiency is based on the ratio of the students' weekly effort to
their weekly gain. The learning efficiency sequence of a course consists of the weekly learning efficiency values over all study weeks, recorded as $E_{ratio}=\left(ratio^{1},\,ratio^{2},\ldots,\,ratio^{w}\right)$. The learning efficiency sequence is the input to the Gaussian mixture model (GMM). The GMM performs clustering analysis by modeling the data as a linear combination of several Gaussian distributions. Eq. (5) is the mathematical expression of the GMM.
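A plausible reconstruction of Eq. (5) is the standard Gaussian mixture density with $K$ components; the symbol $K$ for the number of clusters is introduced here:

\[ P\left(E_{ratio}\,|\,\theta\right)=\sum_{k=1}^{K}\alpha_{k}\,\phi\left(E_{ratio}\,|\,\theta_{k}\right),\qquad \alpha_{k}\ge 0,\quad \sum_{k=1}^{K}\alpha_{k}=1 \]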
In Eq. (5), $P\left(E_{ratio}\,|\,\theta\right)$ represents the probability of the learning efficiency sequence under the GMM, $\phi\left(E_{ratio}\,|\,\theta_{k}\right)$ represents the $k$-th Gaussian distribution function, and $\alpha_{k}$ is a nonnegative mixture weight. The expression $\theta_{k}=\left(\mu_{k},\,\delta_{k}^{2}\right)$ is the intrinsic parameter of the $k$-th Gaussian distribution function, in which $\mu_{k}$ and $\delta_{k}^{2}$ represent its mean and variance, and $K$ indicates the number of clusters (Gaussian components). The output of the GMM in this study is the learning behavior mode of the students. Fig. 1 is a schematic of the student learning motivation prediction model.
Fig. 1 illustrates the capsule neural network that this study uses to predict students' learning motivation. The network includes a feature capsule layer, a motivation capsule layer, and an output layer. From the perspective of network flow, the feature capsule layer is the input layer. This layer is responsible for combining the students' learning behavior feature vector and the learning efficiency sequence vector to obtain the feature capsule. The core of the motivation capsule layer is a dynamic routing unit and four independent capsule units. The dynamic routing unit is responsible for transmitting the feature information of the previous layer, and the capsule units retain the learning motivation information. The output layer uses a fully connected layer and normalization processing to obtain the classification results. Eq. (6) calculates the output layer loss function.
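The exact form of Eq. (6) is not fixed by the surrounding description; one common choice consistent with it, the cross-entropy loss averaged over the $n$ training samples (the symbol $L$ is introduced here), is:

\[ L=-\frac{1}{n}\sum_{i=1}^{n}\hat{y}_{i}\log y_{i} \]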
In Eq. (6), $n$ represents the number of network training samples, and $y_{i}$ and $\hat{y}_{i}$ represent the predicted label and the true (tag) label, respectively. The capsule neural network carries out network input and output in the form of vectors, which is beneficial for representing the characteristic information of the learning data.
Fig. 1. Schematic diagram of the student learning motivation prediction model.
Table 1. Data Sheet of Students' Online Learning Behavior in Different Courses.
Curriculum | # of students | # of homework exercises | # of unit tests | # of videos watched | # of discussions | # of study weeks
Advanced math | 11228 | 11125 | 102281 | 11373 | 6724 | 16
College Physics | 8550 | 9328 | 82940 | 10859 | 5411 | 12
English Listening and Speaking | 9305 | 11111 | 90245 | 11242 | 2925 | 12
Computer Fundamentals | 10768 | 11114 | 67650 | 11330 | 3467 | 10
College Chinese | 6308 | 3231 | 50877 | 8809 | 3493 | 10
Sports | 6875 | 3554 | 53194 | 4573 | 3035 | 12
3.2 Prediction from LSTM and the Self-attention Mechanism
The input data of the performance prediction model depend on the learning behavior mode and the learning motivation obtained in the previous section. Considering that predicting
students' performance is a multi-classification task, this research analyzes the relationship
between students' learning behavior data based on Bi-LSTM and the self-attention mechanism
(SAM) to predict performance. Fig. 2 is a schematic of the Bi-LSTM network structure.
As shown in Fig. 2, the Bi-LSTM is composed of a forward LSTM and a backward LSTM, arranged one above the other in the network structure. The input layer simultaneously sends the sequence features into the forward and backward LSTMs, and the output of the whole network is the concatenation of the outputs of these two sub-networks. Compared with a one-way LSTM, the Bi-LSTM carries information both from front to back and from back to front. This bidirectional structure helps the model capture the dynamic relationships between features of the learning sequence and identify dependencies in the data more accurately. Eq. (7) is the mathematical expression of the Bi-LSTM network.
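A plausible reconstruction of Eq. (7), with $x_{t}$ (introduced here) denoting the input feature at time $t$ and $[\,\cdot\,;\,\cdot\,]$ denoting concatenation, is:

\[ \overrightarrow{h}_{t}=\overrightarrow{LSTM}\left(x_{t},\overrightarrow{h}_{t-1}\right),\qquad \overleftarrow{h}_{t}=\overleftarrow{LSTM}\left(x_{t},\overleftarrow{h}_{t+1}\right),\qquad h_{t}=\left[\overrightarrow{h}_{t};\overleftarrow{h}_{t}\right] \]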
In Eq. (7), $\overrightarrow{LSTM}$ and $\overleftarrow{LSTM}$ represent the forward LSTM network and the backward LSTM network, respectively; $\overrightarrow{h}_{t}$ and $\overleftarrow{h}_{t}$ represent the state values of the forward and backward hidden layers at time $t$; and $h_{t}$ represents the Bi-LSTM network state value at time $t$, obtained by concatenating the state values of the forward and backward hidden layers. Although the Bi-LSTM can acquire the dynamic relationships of the learning feature sequence, it does not have a unit for weighting the sequence features. Therefore, this study introduces the self-attention mechanism to weight the learning feature sequence. Fig. 3 is a schematic of the weight vector generation of the SAM.
In Fig. 3, the difference between the self-attention mechanism and the general attention mechanism
lies in the introduction of three weight vectors: the query weight vector (Q), the
key weight vector (K), and the value weight vector (V). Since the SAM generates these
three weight vectors with the same input value, it has the ability to analyze the
internal relationship of the input sequence characteristics. Eq. (8) expresses the weighted sequence feature vector.
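A plausible reconstruction of Eq. (8) is the standard scaled dot-product self-attention:

\[ A\left(Q,K,V\right)=\mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V \]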
In Eq. (8), $d_{k}$ represents the dimension of vectors $Q$ and $K$, and $A\left(Q,K,V\right)$ is the weighted sequence feature vector. Eq. (8) shows that the first step of the $A\left(Q,K,V\right)$ calculation is to obtain $QK^{T}$ through dot-product multiplication. The second step is to scale and normalize $QK^{T}$. Finally, the normalized result is multiplied by $V$. Eq. (9) is the output result of the performance prediction model.
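Consistent with the description of the output layer as a fully connected layer followed by normalization, a plausible form of Eq. (9) is:

\[ y=\mathrm{softmax}\left(W_{s}\,\alpha\right) \]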
In Eq. (9), $\alpha$ is the input vector, $W_{s}$ is the training parameter matrix, and $y$ is the output. Eq. (10) expresses the loss function used by the model to optimize the output.
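A plausible reconstruction of Eq. (10) is the multi-class cross-entropy loss over the $N$ output categories (the symbol $L$ is introduced here):

\[ L=-\sum_{j=1}^{N}y_{j}\log p_{j} \]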
In Eq. (10), $N$ is the number of output result categories, and $y_{j}$ and $p_{j}$ represent the real label and the predicted label, respectively. Fig. 4 shows the flow chart for performance prediction.
As shown in Fig. 4, the Bi-LSTM network obtains the output value of the hidden layer from the weekly learning feature sequence. This output value, together with the learning mode, learning motivation, and basic attribute feature vectors, is input to the self-attention mechanism for the weighting calculation. After that, the weighted results are output through the fully connected layer and the normalization layer.
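The following is a minimal, self-contained sketch of this forward pass written in PyTorch. The layer sizes, variable names, and the way the mode, motivation, and attribute vectors are fused with the Bi-LSTM output are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ScorePredictor(nn.Module):
    """Sketch of a Bi-LSTM + self-attention performance predictor.

    Dimensions and the fusion scheme are assumptions for illustration only.
    """
    def __init__(self, feat_dim=8, extra_dim=6, hidden=32, num_classes=5):
        super().__init__()
        # Bidirectional LSTM over the weekly learning feature sequence.
        self.bilstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        d_model = 2 * hidden + extra_dim  # hidden states concatenated with extra features
        # Projections that produce Q, K, V from the same input (self-attention).
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        # Fully connected output layer followed by softmax normalization.
        self.fc = nn.Linear(d_model, num_classes)

    def forward(self, weekly_seq, extra):
        # weekly_seq: (batch, weeks, feat_dim) weekly learning feature sequence
        # extra: (batch, extra_dim) learning mode + motivation + basic attributes
        h, _ = self.bilstm(weekly_seq)                    # (batch, weeks, 2*hidden)
        extra_rep = extra.unsqueeze(1).expand(-1, h.size(1), -1)
        x = torch.cat([h, extra_rep], dim=-1)             # per-step fusion (assumption)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        scores = torch.matmul(q, k.transpose(1, 2)) / (q.size(-1) ** 0.5)
        attn = torch.softmax(scores, dim=-1)              # normalized attention weights
        weighted = torch.matmul(attn, v)                  # weighted sequence features
        pooled = weighted.mean(dim=1)                     # summarize the sequence
        return torch.softmax(self.fc(pooled), dim=-1)     # class probabilities

# Example usage with random data: 4 students, 16 weeks, 8 weekly features, 6 extra features.
model = ScorePredictor()
probs = model(torch.randn(4, 16, 8), torch.randn(4, 6))
print(probs.shape)  # torch.Size([4, 5])
```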
Fig. 2. The Bi-LSTM network’s structure.
Fig. 3. Weighted calculation of the self-attention mechanism.
Fig. 4. Schematic of the performance prediction process.