Vision-based Multi-task Hybrid Model for Teacher-Student Behavior Recognition in Classroom
Environment
Zhou Huan 1,a
Zhu Wenrui 2,a,*
1 School of Big Data and Artificial Intelligence, Anhui Xinhua University, Hefei, AH 230088, China (zhouhuan0813@ustc.edu.cn)
2 School of Civil Engineering, Xi'an University of Architecture and Technology, Xi'an, SN 710055, China (wenrui_zhu@foxmail.com)
Copyright © The Institute of Electronics and Information Engineers (IEIE)
Keywords
Classroom behavior, Dual-stream framework, Multi-task hybrid model, Multi-mode learning, Spatio-temporal graph convolutional network
1. Introduction
The rapid development of artificial intelligence can effectively advance the modernization of education [1]. From the students' perspective, big data analysis can help customize student-centred, personalized teaching and learning programmes that increase students' interest in learning.
Rich digital teaching resources can also provide students with multiple views and
effectively solve the problem of unbalanced educational resources. From the teachers'
perspective, using computer vision technology to analyze students' behavior in classroom
surveillance videos can improve the interaction between teaching and learning, and
the design of teaching sessions can be further optimized according to students' feedback.
In the past, most analysis of classroom teaching information came from professionals who reviewed recorded lecture videos and the corresponding student performance. This approach is inefficient and ignores the correlation between teacher and student behavior. Most current research on classroom behavior recognition is based on single-frame image recognition and ignores the temporal features of behavior [2], which often causes misdetection of similar behaviors. On the other hand, single-modal data is difficult to adapt to complex, multi-task recognition. For example, student behavior detection involves many targets that must be localized and are prone to overlap; skeleton data can effectively address these problems. Teacher behavior detection, by contrast, involves a single target with small motion in the optical flow. Given these problems, we need to select data sources according to the specific task and design different models for temporal and spatial feature extraction.
Another difficulty of behavior recognition in the classroom environment is obtaining large amounts of teacher and student behavior data, which are scarce in public behavior datasets. Our solution is to use dual-position shooting with multi-view, cross-sensor cameras to capture videos of teachers and students in the classroom simultaneously, thereby avoiding temporal misalignment of the multimodal data.
This article addresses the shortcomings of previous classroom behavior recognition methods and proposes a vision-based multi-task hybrid model. The model takes RGB, optical flow, and skeleton data as inputs to support real-time detection of teacher-student behaviors in the classroom environment. By analyzing the correspondence between teacher and student behaviors in different contexts, we can uncover their hidden correlations. This helps improve the interaction between teaching and learning and promotes the development of intelligent education.
The contribution of this paper can be summarized as follows.
· For the teacher behavior recognition task, we propose a spatio-temporal dual-stream framework based on 2D-CNN and ViT [3]. As shown in Fig. 1, the 2D-CNN backbone is mainly used to extract spatial features from RGB videos, and ViT is mainly used to extract temporal features from optical flow. The extracted spatio-temporal features are concatenated along the channel dimension to obtain the fused features, which are finally fed into a fully connected (FC) layer to obtain the behavior classification.
· For the student behavior recognition task, we propose a multi-level stacked spatio-temporal graph convolutional network (MSSTGCN) based on skeleton data. As shown in Fig. 2, multi-level skeleton information is sequentially fed into spatio-temporal graph convolution blocks for feature extraction. The multi-scale features are sequentially aggregated by Non-local [4] blocks, which capture the fine-grained differences among similar behaviors to the greatest extent.
Fig. 1. The architecture of the spatio-temporal dual-stream framework.
Fig. 2. The architecture of the multi-level stacked spatio-temporal graph convolutional network (MSSTGCN).
2. Related Work
2.1 Spatio-temporal Behavior Recognition based on RGB and Optical Flow
As an important part of video perception and processing, spatio-temporal behavior
recognition aims to detect the target behavior in each video frame. Similar to object detection, it requires both classification and localization. The difference, however, is that spatio-temporal action recognition emphasizes the processing of temporal information. It is hard to determine the type of target behavior from a single frame alone, which easily leads to ambiguous classification. The temporal correlations contained in consecutive frames significantly improve the accuracy of behavior recognition.
The core challenge of spatio-temporal behavior recognition is to efficiently construct
temporal associations of targets in consecutive frames. To overcome this challenge,
a common idea is to leverage the spatio-temporal feature extraction capability of
3D-CNN to establish a robust action detection network. However, prevalent 3D-CNN-based
methods for spatio-temporal behavior recognition, such as I3D [5], X3D [6], and SlowFast [7], suffer from a shared drawback: their substantial parameter size leads to slow computational
speeds.
To satisfy the real-time requirement, some researchers have turned to spatio-temporal action detectors based on 2D-CNNs, such as ACT [8] and MOC [9]. Unlike 3D-CNN-based methods, these 2D-CNN-based methods approximate temporal modelling by concatenating the spatial features of each frame. However, the limited spatio-temporal feature extraction capability of a 2D-CNN is insufficient to meet the accuracy requirements of spatio-temporal behavior recognition. Therefore, this type of work often adds a parallel branch to process the optical flow corresponding to the input video clips, resulting in a dual-stream network that processes RGB and optical flow in parallel. Optical flow provides an explicit short-term temporal association by describing object motion between adjacent frames, and adding it can significantly enhance temporal feature extraction. For example, in the case of MOC, the behavior recognition mAP on the UCF101-24 dataset improves by almost 7% after adding optical flow.
2.2 Spatio-temporal Behavior Recognition based on Skeleton Data
Skeleton data is lightweight and easy to calibrate, and it has recently been widely used in spatio-temporal behavior recognition in complex scenes. The idea of skeleton-based
spatio-temporal behavior recognition is to perform motion modelling and feature extraction
through acquired skeleton information, obtain feature vectors reflecting motion information,
and complete the classification of the behaviors on this basis. Common skeleton-based
spatio-temporal behavior recognition methods mainly include RNN-based methods, CNN-based
methods, and GCN-based methods.
RNN is good at processing time series data with long-term dependence like 3D skeleton
joints, but is not ideal for modelling spatial information of joints. Many researchers
have focused on improving the spatial feature extraction ability of RNNs. For example,
Liu et al. [10] used ST-LSTM to traverse the human body in the form of a bidirectional tree to improve
the adjacency attribute between skeleton joints. Zheng et al. [11] used RRN to learn the spatial features in the skeleton and explore the complementarity
of the connectivity between the joints for behavior recognition.
CNN can learn the semantic information of the skeleton through efficient spatial modelling, but it is weaker than RNN at processing the temporal information of motion. Many researchers have worked to strengthen CNNs' ability to extract temporal features. Tu et al. [12] converted skeleton data into multi-temporal sequences and proposed two-stream 3D CNNs with different kernel sizes to capture multi-scale temporal features. Li et al. [13] proposed HCN, which learns global co-occurrence features across all joints.
Human skeletons are natural topologies. Compared to RNN and CNN, GCN is better at
handling non-Euclidean data such as skeletons. Most GCN-based spatio-temporal behavior
recognition methods use different blocks to extract temporal and spatial features
respectively. The most classic example is the ST-GCN proposed by Yan et al. [14]. This network uses multi-layer graph convolution to construct a spatio-temporal graph.
The physical structure of the human body is represented by joints and spatial edges.
Temporal edges are added to replace the original optical flow. Based on ST-GCN, Wen
et al. [15] extend the correlation between adjacency joints to all joints, effectively integrating
multi-level semantic information of the joints to learn higher-order features. Thakkar
et al. [16] proposed a part-based GCN that divides the human body into four subgraphs with the
intention of capturing potential correlations between distant joints. Although these methods consider the impact of higher-order semantic information on skeleton-based behavior recognition, they have yet to carry out further research on the fusion of features at different scales.
2.3 Spatio-temporal Behavior Recognition Methods in Classroom Environment
Before AI was applied to classroom teaching, teachers primarily relied on methods such as the Flanders Interaction Analysis System (FIAS) and its information-technology-based improvement, iFIAS [17], to analyze students' classroom behaviors. However, these methods rely on manual analysis and cannot achieve sustainable, large-scale observation and analysis.
In recent years, mainstream methods for classroom behavior recognition are all based
on object detection and pose estimation. Zhang et al. [18] applied YOLO to detect students' facial movements, predicting classroom engagement
through behaviors like yawning, smiling, and eye closure. Huang et al. [19] proposed a deep spatio-temporal residual convolutional neural network to detect and
track multiple students' behavior trajectories in teaching videos in real-time, achieving
promising results. Lin et al. [20] combined student posture with object detection, reducing erroneous connections between
skeletal nodes and misclassification of similar behaviors. Mo et al. [21] introduced a multi-task classroom behavior recognition network comprising a pose
estimator, object detector, and MTHN module, successfully predicting student behaviors
by integrating multi-scale features.
However, these studies have primarily focused on analyzing the spatial features of classroom behaviors, neglecting the impact of temporal features on contextual semantics. To address this, Zhang et al. [22] incorporated Multi-Scale Temporal Attention (MSTA) and Efficient Temporal Attention (ETA) modules into the SlowFast model, enhancing its ability to capture temporal features. Xu et al. [23] utilized the Temporal Shift Module (TSM) to augment the 2D-CNN backbone of YOWO,
thereby enhancing its capability to acquire temporal information related to behaviors.
The above studies all address student behavior recognition in the classroom environment; there is almost no research focusing on teacher-student spatio-temporal behavior recognition. This paper aims to explore the correlation between teacher and student
behaviors in the classroom by designing a multi-task hybrid model.
3. Vision-based Multi-task Hybrid Model
The schematic pipeline of the proposed model is shown in Fig. 3. This hybrid model consists of two parts: the teacher behavior recognition model
and the student behavior recognition model.
Fig. 3. The schematic pipeline of the vision-based multi-task hybrid model.
3.1 Teacher Behavior Recognition Model based on Dual-stream Framework
The teacher behavior recognition model is a single-stage network, consisting of two
branches: spatial stream and temporal stream. The input of the spatial stream is a
sequence of consecutive RGB frames, from which a 2D-CNN backbone extracts features
from static images and fuses features from different frames along the channel dimension
to classify actions in the video. The 2D-CNN backbone in this paper uses ResNet-18, with its small parameter count, to extract spatial features for each frame. These spatial feature maps are then concatenated along the channel dimension to form a thick feature map. The classification layer convolves the spatial features of all frames simultaneously, implicitly incorporating the notion of temporal order. In behavior recognition tasks, the temporal association obtained by merely concatenating feature maps is not sufficient, so we add optical flow as the temporal stream input.
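As a concrete illustration, the following is a minimal PyTorch sketch of the spatial stream, assuming torchvision's ResNet-18 (truncated before its pooling and classification layers) as the backbone and a clip of K RGB frames as input; the module and variable names and the clip length are our own, not the exact configuration of the paper.

```python
# Minimal sketch of the spatial stream: per-frame ResNet-18 features are
# concatenated along the channel dimension to form the "thick" feature map.
import torch
import torch.nn as nn
import torchvision.models as models

class SpatialStream(nn.Module):
    def __init__(self):
        super().__init__()
        resnet = models.resnet18(weights=None)  # pre-trained weights could be loaded here
        # keep everything up to the last residual stage, drop avgpool/fc
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        b, k, c, h, w = clip.shape                          # clip: (B, K frames, 3, H, W)
        feats = self.backbone(clip.view(b * k, c, h, w))    # per-frame maps (B*K, 512, h', w')
        feats = feats.view(b, k, *feats.shape[1:])          # (B, K, 512, h', w')
        return feats.flatten(1, 2)                          # channel concat: (B, K*512, h', w')

x = torch.randn(1, 8, 3, 224, 224)       # a clip of 8 RGB frames
print(SpatialStream()(x).shape)          # torch.Size([1, 4096, 7, 7])
```

Folding the frame axis into the batch dimension lets the same 2D backbone process every frame, and the final flatten realizes the channel-wise concatenation described above.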
The temporal stream uses consecutive optical flow as input to capture temporal information
between frames via ViT. Optical flow represents the temporal change of pixels between
adjacent frames, which includes the features from both horizontal and vertical vector
channels, effectively reflecting motion information. The process of feature extraction
from optical flow based on ViT is shown in Fig. 4. First, the optical flow images are divided into patches, and each patch is mapped
to a one-dimensional vector. These vectors, along with class tokens and positional
encodings, are then fed into the Transformer Encoder. The Encoder has 12 layers, each
layer comprising temporal self-attention block, layer normalization, dropout, and
multilayer perceptron. The temporal self-attention block enables the model to capture
the changes of temporal features, so as to further optimize the extraction of temporal
features. This is specifically achieved by performing temporal self-attention operations on patches with the same spatial index across different frames when extracting features with ViT. The self-attention operation is given by

$\mathrm{Attention}\left(Q,K,V\right)=\mathrm{Softmax}\left(\frac{QK^{T}}{\sqrt{d_{K}}}\right)V$ (1)
where $Q$ is the query matrix, $K$ is the key matrix, $V$ is the value matrix, and
$d_{K}$ is the row vector dimension of the $K$ matrix.
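To make Eq. (1) concrete, here is a minimal sketch of temporal self-attention over patch embeddings, assuming they are arranged as (batch, frames, patches, dim); the projection matrices w_q, w_k, w_v and the tensor sizes are illustrative rather than the paper's exact configuration.

```python
# Attention is computed over the frame axis for each spatial patch index, so
# patches at the same position in different optical-flow frames attend to each other.
import math
import torch
import torch.nn.functional as F

def temporal_self_attention(x, w_q, w_k, w_v):
    # x: (B, T, P, D) patch embeddings; w_q/w_k/w_v: (D, D) projection matrices
    b, t, p, d = x.shape
    x = x.transpose(1, 2)                            # (B, P, T, D): group by patch index
    q, k, v = x @ w_q, x @ w_k, x @ w_v              # linear projections
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)  # QK^T / sqrt(d_K), shape (B, P, T, T)
    attn = F.softmax(scores, dim=-1)                 # softmax over the frame axis
    out = attn @ v                                   # weighted sum over frames
    return out.transpose(1, 2)                       # back to (B, T, P, D)

x = torch.randn(2, 8, 196, 768)                      # 8 flow frames, 14x14 patches, dim 768
w_q, w_k, w_v = (torch.randn(768, 768) * 0.02 for _ in range(3))
print(temporal_self_attention(x, w_q, w_k, w_v).shape)   # torch.Size([2, 8, 196, 768])
```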
After obtaining the spatial-stream and temporal-stream features, we fuse them through channel fusion. The features are concatenated along the channel dimension and then processed by a 1×1 convolution and a 3×3 convolution, each followed by a Batch Normalization (BN) layer and a LeakyReLU activation. The fused features are then fed into a Deep Neural Network (DNN) head to perform classification, coordinate regression, and confidence estimation for the different behaviors.
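The fusion step can be sketched as a small PyTorch module; the channel counts below are illustrative, assuming the two streams have already been brought to the same spatial resolution.

```python
# Minimal sketch of channel fusion: concat -> 1x1 conv -> BN -> LeakyReLU
#                                         -> 3x3 conv -> BN -> LeakyReLU
import torch
import torch.nn as nn

class ChannelFusion(nn.Module):
    def __init__(self, c_spatial: int, c_temporal: int, c_out: int):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(c_spatial + c_temporal, c_out, kernel_size=1, bias=False),
            nn.BatchNorm2d(c_out),
            nn.LeakyReLU(inplace=True),
            nn.Conv2d(c_out, c_out, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(c_out),
            nn.LeakyReLU(inplace=True),
        )

    def forward(self, spatial_feat, temporal_feat):
        x = torch.cat([spatial_feat, temporal_feat], dim=1)  # concat along channels
        return self.fuse(x)

fused = ChannelFusion(4096, 768, 1024)(torch.randn(1, 4096, 7, 7), torch.randn(1, 768, 7, 7))
print(fused.shape)  # torch.Size([1, 1024, 7, 7])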
Fig. 4. Feature extraction from optical flow based on Vision Transformer (ViT).
3.2 Student Behavior Recognition Model based on MSSTGCN
Traditional skeleton-based behavior recognition methods tend to focus solely on 1-order skeleton semantic information (physical connections between joints) and 2-order skeleton semantic information (the potential association between two joints with a hop distance of 2), which is reasonable for behavior recognition tasks with significant variations in movement. However, given the similar categories and minor variations of student behaviors in the classroom environment, we can no longer ignore higher-order skeleton semantic information. Therefore, we designed a multi-level stacked spatio-temporal graph convolutional network to model the long-range spatial dependence of the human body and extract rich spatio-temporal features.
MSSTGCN consists of multiple parallel branches, each composed of three spatio-temporal
graph convolutional blocks (STGCBs) that process skeleton information at specific
scales. The structure of the STGCB is shown in Fig. 5, which includes a GCN for extracting spatial features and a TCN for extracting temporal
features. Both are followed by a BN layer and a LeakyReLU function. Inspired by the design in Reference [31], we manually set different adjacency matrices for the GCNs in different STGCBs so that each branch extracts skeleton features at a specific scale. For instance, to obtain 1-order information only, we construct an adjacency matrix only between a joint and its directly connected joints. To obtain 2-order information, we construct an adjacency matrix only between a joint and the joints that are two hops away from it. Following this logic, by manually configuring adjacency matrices, we can obtain all higher-order semantic information for a given joint, as sketched below.
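The per-order adjacency construction can be sketched as follows, assuming a toy skeleton given as an edge list; `k_hop_adjacency` is a hypothetical helper that marks joint pairs exactly k hops apart, which is one plausible reading of the description above.

```python
# Build the binary adjacency matrix connecting joints that are exactly k hops apart.
import numpy as np

def k_hop_adjacency(edges, num_joints, k):
    a1 = np.zeros((num_joints, num_joints), dtype=np.float32)
    for i, j in edges:
        a1[i, j] = a1[j, i] = 1.0                         # 1-order (physical) connections
    dist = np.full((num_joints, num_joints), np.inf)      # hop distances between joints
    np.fill_diagonal(dist, 0)
    reach = np.eye(num_joints, dtype=np.float32)
    hop = a1 + np.eye(num_joints, dtype=np.float32)
    for d in range(1, num_joints):
        reach = (reach @ hop > 0).astype(np.float32)      # joints reachable within d hops
        dist[(reach > 0) & np.isinf(dist)] = d            # first time reached -> distance d
    return (dist == k).astype(np.float32)                 # keep only exactly-k-hop pairs

edges = [(0, 1), (1, 2), (2, 3), (1, 4), (4, 5)]          # toy 6-joint skeleton
print(k_hop_adjacency(edges, num_joints=6, k=2))
```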
As shown in Fig. 2, we use a stepped stacking structure for feature fusion between branches. The features extracted by the previous branch are concatenated with the features extracted by the current branch and then fed into a Non-local block for enhanced extraction of critical features; the result is subsequently passed to the next branch. Students' classroom behaviors often differ only in subtle details, and the Non-local operation strengthens the model's ability to distinguish such details among similar behaviors. Without the Non-local block, some subtle feature differences may gradually be lost during aggregation with the next-level features, leaving the model unable to distinguish similar behaviors after multiple rounds of feature aggregation. The calculation of the Non-local block is as follows.

$y_{i}=\frac{1}{c\left(x\right)}\sum_{\forall j}f\left(x_{i},x_{j}\right)g\left(x_{j}\right)$ (2)

$z_{i}=w_{z}y_{i}+x_{i}$ (3)
where $i$ is the spatial or temporal position index of the desired response, and $j$ indexes all possible positions contributing to the response. $x$ is the input feature and $y$ is the output feature. The function $f$ computes the correlation between positions $i$ and $j$, and the function $g$ computes a linear mapping of the input signal at position $j$ by multiplying it with a parameter matrix. The normalization factor $c\left(x\right)$ is implemented with Softmax so that the output values stay within a bounded range, which aids model convergence. Eq. (2) gathers information from all positions $j$ that may be relevant to position $i$; these positions can come from preceding or succeeding frames, unlike traditional convolution, which only aggregates information from adjacent positions, so more extensive features can be extracted. In Eq. (3), $w_{z}$ is a learnable linear-mapping parameter, "$+x_{i}$" denotes the residual connection, and $z_{i}$ is the output of the Non-local block.
Fig. 6 shows the calculation process of the Non-local block. First, the input is convolved
by ${\theta}$, ${\varphi}$ and g to obtain ${\theta}$($x_{i}$), ${\varphi}$($x_{j}$)
and g($x_{j}$). Then, ${\theta}$($x_{i}$) and ${\varphi}$($x_{j}$) are dot multiplied
to obtain $f\left(x_{i},x_{j}\right)$, which is normalized by Softmax. This result
is multiplied by g($x_{j}$) to obtain $y_{i}$, which corresponds to the result of
Eq. (2). Finally, $y_{i}$ is convolved with $w_{z}$ and added to the original input $x_{i}$
to obtain the output $z_{i}$.
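A minimal PyTorch sketch of an embedded-Gaussian Non-local block consistent with Eqs. (2)-(3) and Fig. 6 is shown below; the intermediate channel size and input shape are illustrative, and the exact configuration used in MSSTGCN may differ.

```python
# Non-local block: theta/phi/g projections, softmax-normalised similarity f,
# weighted sum y (Eq. (2)), and residual output z (Eq. (3)).
import torch
import torch.nn as nn
import torch.nn.functional as F

class NonLocalBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        inter = channels // 2
        self.theta = nn.Conv2d(channels, inter, 1)
        self.phi = nn.Conv2d(channels, inter, 1)
        self.g = nn.Conv2d(channels, inter, 1)
        self.w_z = nn.Conv2d(inter, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, t, v = x.shape                               # (batch, channels, frames, joints)
        theta = self.theta(x).flatten(2).transpose(1, 2)   # (B, TV, C')
        phi = self.phi(x).flatten(2)                       # (B, C', TV)
        g = self.g(x).flatten(2).transpose(1, 2)           # (B, TV, C')
        f = F.softmax(theta @ phi, dim=-1)                 # f(x_i, x_j) / c(x)
        y = (f @ g).transpose(1, 2).view(b, -1, t, v)      # Eq. (2)
        return self.w_z(y) + x                             # Eq. (3): residual output

x = torch.randn(2, 64, 30, 17)                             # 30 frames, 17 joints
print(NonLocalBlock(64)(x).shape)                          # torch.Size([2, 64, 30, 17])
```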
If we use $U_{n}$ to denote the result of feature fusion in the $n$-th branch and $F_{n}$ to denote the feature extracted by the graph convolution operation in the $n$-th branch, the output of each branch in MSSTGCN can be written as

$U_{1}=F_{1},\qquad U_{n}=\mathrm{NonLocal}\left(\mathrm{Concat}\left(U_{n-1},F_{n}\right)\right),\quad n\geq 2$
Fig. 5. The structure of the STGCB.
Fig. 6. The calculation process of the Non-local block.
4. Experiments
4.1 Dataset
Given the absence of large publicly available datasets for evaluating teacher-student behavior recognition in the classroom environment, this research collected raw data from teaching videos of real classroom environments at a Chinese university. By analyzing the characteristics of teacher and student behaviors in the classroom, we summarized four representative teacher behaviors (writing on the blackboard, front-standing lecture, side-standing lecture, inspecting the classroom) and six student behaviors (listening, reading, writing, playing with mobile phones, sleeping, talking). The description of each behavior is given in Table 1. Listening, reading, and writing represent students' positive classroom behaviors, while playing with mobile phones, sleeping, and talking represent students' negative classroom behaviors. We aim to reveal the correlation between teacher and student behaviors in the classroom environment by analyzing the proportion of student behavior states under each teacher behavior.
The dataset used in this research was collected from 8 different classroom environments, with each scene having a total duration of about 90 minutes (45 minutes from the teacher's front view and 45 minutes from the students' front view). During data preprocessing, we crop and segment the raw videos, standardizing them into short clips with a resolution of 512${\times}$424 and a frame rate of 30 fps. Each clip is 8-10 seconds long, and the total number of clips exceeds 4500. We further annotate the clips frame by frame to label the behavior categories and save the labels as JSON files. Some labelled frames are shown in Fig. 7.
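The clip segmentation can be sketched with OpenCV as follows, assuming raw lecture videos are split into fixed-length clips; the file names, the 9-second clip length, and the JSON label layout are hypothetical, while the 512×424 resolution and 30 fps follow the numbers above.

```python
# Minimal preprocessing sketch: resize frames to 512x424 and write 30 fps clips.
import cv2
import json

def segment_video(src_path, out_prefix, clip_seconds=9, size=(512, 424), fps=30):
    cap = cv2.VideoCapture(src_path)
    frames_per_clip = clip_seconds * fps
    fourcc = cv2.VideoWriter_fourcc(*"mp4v")
    clip_idx, frame_idx, writer = 0, 0, None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_idx % frames_per_clip == 0:     # start a new clip
            if writer is not None:
                writer.release()
            writer = cv2.VideoWriter(f"{out_prefix}_{clip_idx:04d}.mp4", fourcc, fps, size)
            clip_idx += 1
        writer.write(cv2.resize(frame, size))
        frame_idx += 1
    if writer is not None:
        writer.release()
    cap.release()

# Per-frame labels are stored as JSON, e.g. {"frame": 12, "label": "listening"}
with open("clip_0000.json", "w") as f:
    json.dump([{"frame": 12, "label": "listening"}], f)
```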
Fig. 7. Teacher behaviors and student behaviors in the dataset.
Table 1. The description of each behavior.
Teacher behavior | Behavior state
Writing on the blackboard | Standing with the back to the students and writing by hand
Front-standing lecture | Standing face to face with the students and talking
Side-standing lecture | Standing sideways towards the students and talking
Inspecting the classroom | Walking back and forth around the students
Student behavior | Behavior state
Listening | Sitting upright and looking ahead
Reading | Bowing the head and looking at the book
Writing | Bowing the head and writing with a pen
Playing with mobile phone | Bowing the head and looking at the mobile phone
Sleeping | Bowing the head and closing the eyes
Talking | Watching others and opening the mouth
4.2 Implementation Details and Analysis
In this research, we use the PyTorch deep learning framework to build the model, TV-L1 [24] to calculate the optical flow between adjacent frames, and DWPose [25] to capture upper-body skeleton information. When training the teacher behavior recognition model, we divide the data into training, validation, and test sets in an 8:1:1 ratio. The batch size is set to 32, and we use SGD as the optimizer with an initial learning rate of 0.0001 and a momentum of 0.9. The classification loss is Focal Loss, the bounding-box regression loss is Smooth L1 Loss, and the confidence loss is MSE Loss. The model is trained for 40 epochs. To prevent overfitting, we load the pre-trained weights of ViT: we first linearly rescale the optical flow to the range 0-255, giving it the same value range as the RGB channels. Then, we average the weights of the first convolutional layer of the ViT over its input channels and replicate this average according to the number of optical-flow channels. Finally, we modify the number of input channels of the first convolutional layer of the ViT and load the averaged weights. When training the student behavior recognition model, we set the batch size to 16 and use the cross-entropy loss as the loss function, keeping the other settings unchanged.
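The weight adaptation described above can be sketched as follows, assuming timm's vit_base_patch16_224 as the source of pre-trained weights and a 2-channel optical-flow input (horizontal and vertical components); the function name is our own.

```python
# Adapt pre-trained ViT patch-embedding weights to optical-flow input: average the
# 3-channel weights of the first (patch-embedding) convolution over the input-channel
# axis and replicate them to match the number of flow channels.
import torch
import torch.nn as nn
import timm

def adapt_vit_to_flow(flow_channels: int = 2) -> nn.Module:
    vit = timm.create_model("vit_base_patch16_224", pretrained=True)
    old_conv = vit.patch_embed.proj                      # Conv2d(3, 768, 16, 16)
    new_conv = nn.Conv2d(flow_channels, old_conv.out_channels,
                         kernel_size=old_conv.kernel_size, stride=old_conv.stride)
    with torch.no_grad():
        mean_w = old_conv.weight.mean(dim=1, keepdim=True)        # average over RGB
        new_conv.weight.copy_(mean_w.repeat(1, flow_channels, 1, 1))
        if old_conv.bias is not None:
            new_conv.bias.copy_(old_conv.bias)
    vit.patch_embed.proj = new_conv                      # swap in the 2-channel embedding
    return vit

# Optical flow is first rescaled linearly to [0, 255] to match the RGB value range.
flow = torch.rand(1, 2, 224, 224) * 255.0
print(adapt_vit_to_flow(2)(flow).shape)                  # e.g. torch.Size([1, 1000])
```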
Fig. 8 shows the confusion matrix for the classification of different behaviors by the multi-task
hybrid model on the test set. We find that the spatio-temporal dual-stream network
achieved over 90% classification accuracy for all four teacher behaviors. Among these,
the accuracy for inspecting the classroom is the highest at 98.4%, while the accuracy
for side-standing lecture is the lowest at 92.6%. According to our analysis, inspecting the classroom, which involves walking back and forth, exhibits more significant temporal and spatial feature differences than the other, relatively static behaviors, making it easier for the model to understand and classify. Due to issues related to viewing angle
and instantaneous states, side-standing lecture is easily misclassified by the model
as front-standing lecture or writing on the blackboard, leading to a lower accuracy.
MSSTGCN achieved an accuracy of over 87% for student behavior recognition. Among these,
the accuracy for sleeping is the highest at 95.9%, while the accuracy for behaviors
like listening, playing with mobile phones, and talking are all above 90%. Reading
and writing exhibit high similarity in skeleton information, making these two behaviors
more likely to be confused by the model, resulting in slightly lower classification
accuracy.
To further analyze the correlation between teacher behaviors and student behaviors,
we match and statistically analyze the teacher-student behavior recognition results
at the same time in the same scene. The results are shown in Fig. 9. We found that when the teacher turns away from the students to write on the blackboard, the average proportion of students' negative behaviors is 68.2%, with playing with mobile phones accounting for more than half of these at 37.4%. When the teacher faces the students, the average proportion of students' negative behaviors decreases significantly, and the average proportion of students' positive behaviors increases substantially. Specifically, during a side-standing lecture, the average proportion of students' positive behaviors is 56.3%, with listening accounting for 28.6%. During a front-standing lecture, the average proportion of students' positive behaviors rises to 67.9%, with listening rising to 40.5%.
It is worth noting that compared to writing on the blackboard, the proportion of students'
sleeping behavior did not significantly decrease when the teacher is facing the students.
This indicates that student sleeping behavior is hardly influenced by the teacher's
lecturing state. When the teacher is inspecting the classroom, the average proportion
of students' positive behaviors reaches 88.2%, and the average proportion of students'
negative behaviors decreases to 11.8%. Specifically, the proportion of playing with
mobile phones and talking behaviors are both maintained below 3%, and the proportion
of sleeping behavior also decreases to about 8%, demonstrating that teacher's inspecting
behavior effectively constrains student behaviors in the classroom environment.
Fig. 8. The confusion matrix for the classification of different behaviors.
Fig. 9. The teacher-student behavior recognition results at the same time in the same scene.
4.3 Ablation and Comparison Experiments
To validate the effectiveness of the multi-task hybrid model in classroom behavior
recognition, this research conducted ablation and comparison experiments under the
same experimental environment and parameter settings.
For teacher behavior recognition, the ablation experiments investigate the impact of using single-modality data and different backbones on performance. The results are shown in Table 2. Clearly, the accuracy of the dual-stream framework with RGB + optical flow is superior to that of any single-stream framework. Using ResNet-18 as the backbone of the RGB stream performs better than VGG-16 and slightly worse than the other 2D-CNN backbones. Using ViT as the backbone of the optical-flow stream is significantly superior to any RNN-based backbone. Considering model complexity (Params) and computational efficiency (GFLOPs), we choose ResNet-18 + ViT as the backbone of the dual-stream framework. The comparison experiments evaluate the performance of other state-of-the-art methods based on RGB or optical flow on the teacher behavior dataset, and the results are shown in Table 3.
For student behavior recognition, the ablation experiments investigate the impact of inputting different levels of skeleton information and of using the Non-local block in the spatio-temporal graph convolutional network. The results are shown in Table 4. Compared with 1-order and 2-order information alone, inputting higher-order semantic
information can enhance the model's ability to distinguish complex and similar behaviors.
However, it is worth noting that merely inputting richer skeletal information provides
limited assistance in feature extraction. Incorporating the Non-local block among
features at different scales allows for better feature fusion. The comparison experiments
evaluate the performance of other state-of-the-art skeleton-based methods on the student
behavior dataset, and the results are shown in Table 5.
To elucidate the reasons for using methods based on different modal data in different
tasks, we also set up cross-validation experiments. The results are shown in Table 6.
Table 2. The ablation results for spatio-temporal dual-stream framework.
Modality | Backbone | Acc (%) | Params ($\times10^{6}$) | GFLOPs ($\times10^{9}$)
RGB only | VGG-16 | 76.3 | 140.89 | 16.45
RGB only | DarkNet-19 | 81.0 | 23.10 | 7.99
RGB only | ResNet-50 | 81.8 | 29.71 | 4.96
RGB only | DenseNet-169 | 82.4 | 36.24 | 10.40
RGB only | ResNet-18 | 79.2 | 14.66 | 4.28
Optical flow only | LSTM | 68.5 | - | 62.78
Optical flow only | GRU | 64.9 | - | 54.27
Optical flow only | ViT | 87.6 | 55.57 | 77.91
RGB + optical flow | VGG-16 + ViT | 91.6 | 198.02 | 98.34
RGB + optical flow | DarkNet-19 + ViT | 96.3 | 80.54 | 89.17
RGB + optical flow | ResNet-50 + ViT | 96.8 | 87.62 | 86.43
RGB + optical flow | DenseNet-169 + ViT | 97.1 | 93.89 | 89.35
RGB + optical flow | ResNet-18 + ViT | 95.5 | 71.99 | 85.12
Table 3. The comparison results with other state-of-the-art methods based on RGB or optical flow.
Methods | Modality | Acc (%) | GFLOPs ($\times10^{9}$)
Two-stream [12] | RGB + optical flow | 86.5 | -
SlowFast [7] | RGB | 90.7 | 65.79
MARS [26] | RGB + optical flow | 90.4 | -
MOC [9] | RGB | 89.7 | 29.77
YOWO [27] | RGB | 91.2 | 43.83
TubeR [28] | RGB | 93.9 | 122.46
Dual-stream framework (ours) | RGB + optical flow | 95.5 | 85.12
|
Table 4. The ablation results for MSSTGCN.
Methods | Input (skeleton information) | Acc (%) | GFLOPs ($\times10^{9}$)
MSSTGCN (without Non-local block) | 1-order | 86.5 | 17.95
MSSTGCN (without Non-local block) | 1-order and 2-order | 86.9 | 17.98
MSSTGCN (without Non-local block) | 1-order, 2-order, and 3-order | 87.2 | 17.99
MSSTGCN (without Non-local block) | 1-order, 2-order, 3-order, …, and n-order | 88.3 | 17.99
MSSTGCN (with Non-local block) | 1-order | 89.4 | 18.07
MSSTGCN (with Non-local block) | 1-order and 2-order | 90.1 | 18.24
MSSTGCN (with Non-local block) | 1-order, 2-order, and 3-order | 90.6 | 18.33
MSSTGCN (with Non-local block) | 1-order, 2-order, 3-order, …, and n-order | 90.8 | 18.36
|
Table 5. The comparison results with other state-of-the-art skeleton-based methods.
Methods | Acc (%) | GFLOPs ($\times10^{9}$)
ST-GCN [14] | 81.4 | 16.34
Motif-GCN [15] | 85.7 | -
PB-GCN [16] | 83.7 | -
2s-AGCN [29] | 85.2 | 39.15
MST-GCN [30] | 89.6 | 22.66
MSAAGCN [31] | 91.1 | 54.80
MSSTGCN (ours) | 90.8 | 18.36
|
Table 6. The cross-validation results on different tasks.
Methods | Task | Acc (%) | FPS (f·s$^{-1}$)
Dual-stream framework | Teacher behavior recognition | 95.5 | 28
MSSTGCN | Teacher behavior recognition | 87.1 | 47
Dual-stream framework | Student behavior recognition | 84.6 | 11
MSSTGCN | Student behavior recognition | 90.8 | 23
|
5. Conclusions
This paper proposes a vision-based multi-task hybrid model to explore the impact of
teacher behavior on student behavior by recognizing behaviors in the classroom environment.
This hybrid model consists of two parts: a teacher behavior recognition model based
on spatio-temporal dual-stream framework and a student behavior recognition model
based on multi-level stacked spatio-temporal graph convolutional network. By collecting
video data from real classroom environments, this experiment validates the effectiveness
of the proposed method. Further analysis of the experimental results reveals that
teacher's behavior significantly influences students' classroom learning states. In
the future, we will consider enhancing the diversity of classroom environments and
increasing the variety and number of behavior samples to refine the study of interactions
between behaviors.
ACKNOWLEDGMENTS
This research is supported by the Natural Science Research Project of Anhui Province
(KJ2021A1160) and the Quality Engineering Project of Anhui Province (2022sx060 & 2022jyxm659).
REFERENCES
J. Zhao, J. Li, and J. Jian, ``A study on posture-based teacher-student behavioral
engagement pattern,'' Sustainable Cities and Society, Vol. 67, 102749, Apr. 2021.
W. Xie, Y. Tao, J. Gao, D. Zhou, and W. Wang, ``YOWO Based Real-time Recognition of
Classroom Learning Behaviors,'' Modern Educational Technology (in Chinese), Vol. 32,
no. 6, pp. 107-114, Jun. 2022.
A. Dosovitskiy, et al., ``An Image is Worth 16x16 Words: Transformers for Image Recognition
at Scale,'' arXiv preprint arXiv:2010.11929, 2021.
X. Wang, R. Girshick, A. Gupta, and K. He, "Non-local Neural Networks," in Proc. of
CVPR, pp. 7794-7803, Jun. 2018.
J. Carreira and A. Zisserman, ``Quo Vadis, Action Recognition? A New Model and the
Kinetics Dataset,'' in Proc. of CVPR, pp. 4724-4733, Jul. 2017.
C. Feichtenhofer, ``X3D: Expanding Architectures for Efficient Video Recognition,''
in Proc. of CVPR, pp. 200-210, Jun. 2020.
C. Feichtenhofer, H. Fan, J. Malik and K. He, ``SlowFast Networks for Video Recognition,''
in Proc. of ICCV, pp. 6201-6210, Oct. 2019.
V. Kalogeiton, P. Weinzaepfel, V. Ferrari, and C. Schmid, ``Action tubelet detector
for spatio-temporal action localization,'' in Proc. of ICCV, pp. 4415-4423, Oct. 2017.
Y. Li, Z. Wang, L. Wang, and G. Wu, ``Actions as Moving Points,'' in Proc. of ECCV,
Part XVI 16, pp. 68-84, Aug. 2020.
J. Liu, A. Shahroudy, D. Xu, and G. Wang, ``Spatio-Temporal LSTM with Trust Gates
for 3D Human Action Recognition,'' in Proc. of ECCV, Part III 14, pp. 816-833, Jul.
2016.
W. Zheng, L. Li, Z. Zhang, Y. Huang and L. Wang, ``Relational Network for Skeleton-Based
Action Recognition,'' in Proc. of ICME, pp. 826-831, Jul. 2019.
J. Tu, M. Liu and H. Liu, "Skeleton-Based Human Action Recognition Using Spatial Temporal
3D Convolutional Neural Networks," in Proc. of ICME, pp. 1-6, Jul. 2018.
C. Li, Q. Zhong, D. Xie, and S. Pu, "Co-occurrence Feature Learning from Skeleton
Data for Action Recognition and Detection with Hierarchical Aggregation," in Proc.
of IJCAI, pp. 786-792, Jul. 2018.
S. Yan, Y. Xiong, and D. Lin, "Spatial Temporal Graph Convolutional Networks for
Skeleton-Based Action Recognition," in Proc. of AAAI, pp. 7444-7452, Feb. 2018.
Y. Wen, L. Gao, H. Fu, F. Zhang, and S. Xia, "Graph CNNs with Motif and Variable Temporal
Block for Skeleton-Based Action Recognition," in Proc. of AAAI, pp. 8989-8996, Jan.
2019.
K. Thakkar, and P. J. Narayanan, "Part-based Graph Convolutional Network for Action
Recognition," arXiv preprint arXiv:1809.04983, 2018.
N. A. Flanders, "Analyzing teacher behavior," Massachusetts: Addison-Wesley, 1970.
Y. Zhang, et al., "AI Education Based on Evaluating Concentration of Students in Class:
Using Machine Vision to Recognize Students’ Classroom Behavior," in Proc. of ICVIP,
pp. 126-133, Dec. 2021.
Y. Huang, M. Liang, X. Wang, Z. Chen, and X. Cao, ``Multi-person classroom action recognition
in classroom teaching videos based on deep spatiotemporal residual convolution neural
network,'' Journal of Computer Applications (in Chinese), Vol. 42, no. 3, pp. 736-742,
Mar. 2022.
F. Lin, H. H. Ngo, C. R. Dow, K. H. Lam, and H. L. Le, ``Student Behavior Recognition
System for the Classroom Environment Based on Skeleton Pose Estimation and Person
Detection,'' Sensors, Vol. 21, no. 16, 5314, Aug. 2021.
J. Mo, R. Zhu, H. Yuan, Z. Shou, and L. Chen, ``Student behavior recognition based
on multitask learning,'' Multimedia Tools and Applications, Vol. 82, no. 12, pp. 19091-19108,
May. 2023.
S. Zhang, et al., ``MSTA-SlowFast: A Student Behavior Detector for Classroom Environments,''
Sensors, Vol. 23, no. 11, 5205, May. 2023.
X. Xu, and J. Zhang, ``Classroom Behavior Recognition of Students Based on Improved
YOWO Algorithm,'' Computer Systems & Applications (in Chinese), Vol. 33, no. 4, pp.
113-122, Jan. 2024.
F. Steinbr\"{u}cker, T. Pock and D. Cremers, "Large displacement optical flow computation
without warping," in Proc. of ICCV, pp. 1609-1614, Sep. 2009.
Z. Yang, A. Zeng, C. Yuan, and Y. Li, "Effective Whole-body Pose Estimation with Two-stages
Distillation," in Proc. of ICCVW, pp. 4212-4222, Oct. 2023.
N. Crasto, P. Weinzaepfel, K. Alahari, and C. Schmid, "MARS: Motion-Augmented RGB
Stream for Action Recognition," in Proc. of CVPR, pp. 7874-7883, Jun. 2019.
O. K\"{o}p\"{u}kl\"{u}, X. Wei, and G. Rigoll, ``You Only Watch Once: A Unified CNN
Architecture for Real-Time Spatiotemporal Action Localization,'' arXiv preprint arXiv:1911.06644,
2021.
J. Zhao, et al., "TubeR: Tubelet Transformer for Video Action Detection," in Proc.
of CVPR, pp. 13588-13597, Jun. 2022.
L. Shi, Y. Zhang, J. Cheng, and H. Lu, "Two-Stream Adaptive Graph Convolutional Networks
for Skeleton-Based Action Recognition," in Proc. of CVPR, pp. 12018-12027, Jun. 2019.
Z. Chen, S. Li, B. Yang, Q. Li, and H. Liu, "Multi-Scale Spatial Temporal Graph Convolutional
Network for Skeleton-Based Action Recognition," in Proc. of AAAI, Vol. 35, no. 2,
pp. 1113-1122, Feb. 2021.
Z. Zheng, Y. Wang, X. Zhang, and J. Wang, ``Multi-Scale Adaptive Aggregate Graph Convolutional
Network for Skeleton-Based Action Recognition,'' Applied Sciences, Vol. 12, no. 3,
1402, Jan. 2022.
Huan Zhou is an Associate Professor at the School of Big Data and Artificial Intelligence,
Anhui Xinhua University (AHXU), Hefei, China. She received her B.E. degree in Electronic
and Information Engineering from Hefei Normal University (HFNU) in 2014, and received
her M.E. degree in Optical Engineering from Nanjing University of Information Science
and Technology (NUIST) in 2017. Her research interests include data mining and machine
learning.
Wenrui Zhu received his B.E. degree in Network Engineering from Anhui Jianzhu University
(AHJZU) in 2019, and received his M.E. degree in Electronic and Information Engineering
from AHJZU in 2023. Currently, he is pursuing the Ph.D. degree at Xi'an University
of Architecture and Technology (XAUAT), Xi'an, China. His research interests include
computer vision and human action recognition.