
  1. (School of Big Data and Artificial Intelligence, Anhui Xinhua University / Hefei, AH 230088, China zhouhuan0813@ustc.edu.cn )
  2. (School of Civil Engineering, Xi'an University of Architecture and Technology / Xi'an, SN 710055, China wenrui_zhu@foxmail.com )



Keywords: Classroom behavior, Dual-stream framework, Multi-task hybrid model, Multi-mode learning, Spatio-temporal graph convolutional network

1. Introduction

The rapid development of artificial intelligence can effectively promote the modernization of education [1]. From the students' perspective, big data analysis can assist in customizing student-centered, personalized teaching and learning programs to increase students' interest in learning. Rich digital teaching resources can also provide students with multiple perspectives and effectively alleviate the problem of unbalanced educational resources. From the teachers' perspective, using computer vision technology to analyze students' behavior in classroom surveillance videos can improve the interaction between teaching and learning, and the design of teaching sessions can be further optimized according to students' feedback.

In the past, most analysis of classroom teaching information came from professionals who observed recorded lecture videos and the corresponding student performance. This approach is inefficient and ignores the correlation between teacher and student behavior. Most current research on classroom behavior recognition is based on single-frame image recognition and ignores the temporal features of behavior [2], which often causes misdetection of similar behaviors. On the other hand, it is difficult to adapt to complex multi-task recognition using single-modal data. For example, student behavior detection involves many targets that must be localized and are prone to overlap; skeleton data can effectively address these problems. Teacher behavior detection, in contrast, involves a single target and small-magnitude optical flow. Given these problems, we need to adopt different data sources for each specific task and design different models to complete the temporal and spatial feature extraction.

Another difficulty of behavior recognition in the classroom environment is obtaining large amounts of data on teacher and student behaviors, which are scarce in public behavior datasets. In response to this problem, our solution is to shoot from two positions with multi-ocular cross-sense-source cameras, simultaneously capturing videos of teachers and students in the classroom environment and thereby avoiding temporal misalignment of the multimodal data.

This article addresses the defects of previous classroom behavior recognition work and proposes a vision-based multi-task hybrid model. The model uses RGB, optical flow and skeleton data as inputs to serve the real-time detection of teacher and student behaviors in the classroom environment. By analyzing the correspondence between teacher and student behaviors in different contexts, we can discover their hidden correlations. This will help improve the interaction between teaching and learning and promote the development of intelligent education.

The contributions of this paper can be summarized as follows.

· For the teacher behavior recognition task, we propose a spatio-temporal dual-stream framework based on 2D-CNN and ViT [3]. As shown in Fig. 1, the 2D-CNN backbone is mainly used to extract spatial features from RGB videos, and ViT is mainly used to extract temporal features from optical flow. The acquired spatio-temporal features are concatenated along the channel dimension to obtain the fused features, which are finally input to the fully connected (FC) layer to obtain the behavior classification.

· For the student behavior recognition task, we propose MSSTGCN based on skeleton data. As shown in Fig. 2, multi-level skeleton information is sequentially fed into spatio-temporal graph convolution blocks for feature extraction. The multi-scale features are sequentially aggregated by Non-local [4] blocks, which capture the fine-grained differences among similar behaviors to the greatest extent.

Fig. 1. The architecture of the spatio-temporal dual-stream framework.
../../Resources/ieie/IEIESPC.2024.13.6.587/fig1.png
Fig. 2. The architecture of the multi-level stacked spatio-temporal graph convolutional network (MSSTGCN).
../../Resources/ieie/IEIESPC.2024.13.6.587/fig2.png

2. Related Work

2.1 Spatio-temporal Behavior Recognition based on RGB and Optical Flow

As an important part of video perception and processing, spatio-temporal behavior recognition aims to detect the target behavior in each video frame. Similar to object detection methods, it requires both classification and localization. The difference is that spatio-temporal action recognition emphasizes the processing of temporal information. It is hard to determine the type of a target behavior from a single frame alone, which easily leads to ambiguous classification. The temporal correlations contained in consecutive frames significantly improve the accuracy of behavior recognition.

The core challenge of spatio-temporal behavior recognition is to efficiently construct temporal associations of targets in consecutive frames. To overcome this challenge, a common idea is to leverage the spatio-temporal feature extraction capability of 3D-CNN to establish a robust action detection network. However, prevalent 3D-CNN-based methods for spatio-temporal behavior recognition, such as I3D [5], X3D [6], and SlowFast [7], suffer from a shared drawback: their substantial parameter size leads to slow computational speeds.

To satisfy the real-time requirement, some researchers have started considering spatio-temporal action detectors based on 2D-CNNs, such as ACT [8] and MOC [9]. Unlike 3D-CNN-based methods, these 2D-CNN-based methods cleverly simulate temporal modeling by concatenating spatial features from each frame. However, the limited spatio-temporal feature extraction capability of 2D-CNNs is insufficient to meet the accuracy requirements of spatio-temporal behavior recognition. Therefore, this type of work often sets up an additional parallel branch to process the optical flow corresponding to the input video clips, resulting in a dual-stream network that processes RGB and optical flow in parallel. Optical flow explicitly encodes short-term temporal association by calculating object motion between adjacent frames, and adding it can significantly enhance temporal feature extraction. For example, in the case of MOC, the behavior recognition mAP on the UCF101-24 dataset improves by almost 7% after adding optical flow.

2.2 Spatio-temporal Behavior Recognition based on Skeleton Data

Skeleton data is lightweight and easy to calibrate. Recently, it has been widely used for spatio-temporal behavior recognition in complex scenes. The idea of skeleton-based spatio-temporal behavior recognition is to perform motion modeling and feature extraction on the acquired skeleton information, obtain feature vectors reflecting motion information, and complete the classification of behaviors on this basis. Common skeleton-based spatio-temporal behavior recognition methods mainly include RNN-based, CNN-based, and GCN-based methods.

RNNs are good at processing time-series data with long-term dependence, such as 3D skeleton joints, but are not ideal for modeling the spatial information of joints. Many researchers have focused on improving the spatial feature extraction ability of RNNs. For example, Liu et al. [10] used ST-LSTM to traverse the human body in the form of a bidirectional tree to improve the adjacency attribute between skeleton joints. Zheng et al. [11] used RRN to learn the spatial features in the skeleton and explore the complementarity of the connectivity between the joints for behavior recognition.

CNNs can learn the semantic information of the skeleton through efficient spatial modeling, but are weaker than RNNs at processing the temporal information of motion. Many researchers are committed to strengthening the ability of CNNs to extract temporal features. Tu et al. [12] proposed two-stream 3D CNNs with different kernel sizes to capture multi-scale temporal features and convert skeleton data into multi-temporal sequences. Li et al. [13] proposed HCN, which learns global co-occurrence features for all joints.

Human skeletons are natural topologies. Compared to RNNs and CNNs, GCNs are better at handling non-Euclidean data such as skeletons. Most GCN-based spatio-temporal behavior recognition methods use different blocks to extract temporal and spatial features respectively. The most classic example is ST-GCN, proposed by Yan et al. [14]. This network uses multi-layer graph convolution to construct a spatio-temporal graph: the physical structure of the human body is represented by joints and spatial edges, and temporal edges are added to replace the original optical flow. Based on ST-GCN, Wen et al. [15] extended the correlation between adjacent joints to all joints, effectively integrating multi-level semantic information of the joints to learn higher-order features. Thakkar et al. [16] proposed a part-based GCN that divides the human body into four subgraphs with the intention of capturing potential correlations between distant joints. Although these methods consider the impact of higher-order semantic information on skeleton-based behavior recognition, they have yet to carry out further research on the fusion of features at different scales.

2.3 Spatio-temporal Behavior Recognition Methods in Classroom Environment

Before AI was applied to classroom teaching, teachers primarily relied on methods such as the Flanders Interaction Analysis System (FIAS) and its information-technology-based improvement, iFIAS [17], to analyze students' classroom behaviors. However, these methods fall within manual analysis and cannot achieve sustainable, large-scale observation and analysis.

In recent years, mainstream methods for classroom behavior recognition have been based on object detection and pose estimation. Zhang et al. [18] applied YOLO to detect students' facial movements, predicting classroom engagement through behaviors like yawning, smiling, and eye closure. Huang et al. [19] proposed a deep spatio-temporal residual convolutional neural network to detect and track multiple students' behavior trajectories in teaching videos in real time, achieving promising results. Lin et al. [20] combined student posture with object detection, reducing erroneous connections between skeletal nodes and misclassification of similar behaviors. Mo et al. [21] introduced a multi-task classroom behavior recognition network comprising a pose estimator, object detector, and MTHN module, successfully predicting student behaviors by integrating multi-scale features.

However, these studies have primarily focused on analyzing the spatial features of classroom behaviors, neglecting the impact of temporal features on contextual semantics. To address this, Zhang et al. [22] incorporated Multi-Scale Temporal Attention (MSTA) and Efficient Temporal Attention (ETA) modules into the SlowFast model, enhancing the model's ability to capture temporal features. Xu et al. [23] utilized the Temporal Shift Module (TSM) to augment the 2D-CNN backbone of YOWO, thereby enhancing its capability to acquire temporal information related to behaviors.

The above studies all address student behavior recognition in the classroom environment; there is almost no research focused on joint teacher-student spatio-temporal behavior recognition. This paper aims to explore the correlation between teacher and student behaviors in the classroom by designing a multi-task hybrid model.

3. Vision-based Multi-task Hybrid Model

The schematic pipeline of the proposed model is shown in Fig. 3. This hybrid model consists of two parts: the teacher behavior recognition model and the student behavior recognition model.

Fig. 3. The schematic pipeline of the vision-based multi-task hybrid model.
../../Resources/ieie/IEIESPC.2024.13.6.587/fig3.png

3.1 Teacher Behavior Recognition Model based on Dual-stream Framework

The teacher behavior recognition model is a single-stage network consisting of two branches: a spatial stream and a temporal stream. The input of the spatial stream is a sequence of consecutive RGB frames, from which a 2D-CNN backbone extracts features from the static images and fuses features from different frames along the channel dimension to classify actions in the video. The 2D-CNN backbone in this paper uses ResNet-18, which has a small parameter size, to extract spatial features for each frame. These spatial feature maps are then concatenated along the channel dimension to form a thick feature map. The classification layer convolves all frames' spatial features simultaneously, implicitly incorporating the concept of temporal sequence. In behavior recognition tasks, the temporal association obtained by merely concatenating feature maps is not sufficient, so optical flow is added as the input of the temporal stream.
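For concreteness, a minimal PyTorch sketch of this spatial stream is given below; the torchvision ResNet-18 truncation, the class name, and the tensor shapes are illustrative assumptions rather than the exact implementation.

```python
# Sketch: per-frame ResNet-18 features concatenated along the channel dimension.
import torch
import torch.nn as nn
from torchvision.models import resnet18

class SpatialStream(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = resnet18(weights=None)
        # Drop the average pooling and FC head, keep only the convolutional trunk.
        self.trunk = nn.Sequential(*list(backbone.children())[:-2])  # -> (N, 512, H/32, W/32)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (B, T, 3, H, W) -> per-frame spatial feature maps
        b, t, c, h, w = clip.shape
        feats = self.trunk(clip.reshape(b * t, c, h, w))       # (B*T, 512, h', w')
        feats = feats.reshape(b, t, *feats.shape[1:])          # (B, T, 512, h', w')
        # Concatenate frame features along the channel dimension ("thick" feature map).
        return feats.reshape(b, t * feats.shape[2], *feats.shape[3:])  # (B, T*512, h', w')

# Example: SpatialStream()(torch.randn(2, 8, 3, 224, 224)) -> (2, 4096, 7, 7)
```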

The temporal stream uses consecutive optical flow as input to capture temporal information between frames via ViT. Optical flow represents the temporal change of pixels between adjacent frames and includes features from both the horizontal and vertical vector channels, effectively reflecting motion information. The process of feature extraction from optical flow based on ViT is shown in Fig. 4. First, the optical flow images are divided into patches, and each patch is mapped to a one-dimensional vector. These vectors, along with class tokens and positional encodings, are then fed into the Transformer Encoder. The Encoder has 12 layers, each comprising a temporal self-attention block, layer normalization, dropout, and a multilayer perceptron. The temporal self-attention block enables the model to capture changes in temporal features and thus further optimize their extraction. Specifically, temporal self-attention is performed over the patches that share the same spatial index in different frames when extracting features with ViT. The self-attention operation is defined as follows.

(1)
$ \text{Attention}\left(\boldsymbol{Q},\boldsymbol{K},\boldsymbol{V}\right)=\text{SoftMax}\left(\frac{\boldsymbol{Q}\cdot \boldsymbol{K}^{\mathrm{T}}}{\sqrt{d_{K}}}\right)\cdot \boldsymbol{V} $

where $Q$ is the query matrix, $K$ is the key matrix, $V$ is the value matrix, and $d_{K}$ is the row vector dimension of the $K$ matrix.
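A minimal sketch of Eq. (1) applied as temporal self-attention, attending across frames over patches that share the same spatial index, is shown below; the tensor layout and function name are illustrative assumptions.

```python
# Sketch: scaled dot-product attention over the frame axis for same-position patches.
import torch
import torch.nn.functional as F

def temporal_self_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    # q, k, v: (B, P, T, d) -- batch, spatial patch index, frame index, embedding dim.
    d_k = k.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / d_k ** 0.5   # (B, P, T, T)
    attn = F.softmax(scores, dim=-1)                             # attention weights over frames
    return torch.matmul(attn, v)                                 # (B, P, T, d)

# Example: out = temporal_self_attention(x, x, x) for x of shape (B, P, T, d).
```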

After obtaining the spatial and temporal stream features, we adopt channel fusion to combine them. The features are concatenated along the channel dimension and then processed by a 1×1 convolution and a 3×3 convolution, each followed by a Batch Normalization (BN) layer and a LeakyReLU function. The fused features are then fed into a Deep Neural Network (DNN) to achieve classification, coordinate regression, and confidence estimation for the different behaviors.
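The channel-fusion step can be sketched as follows; the output channel width and class name are illustrative assumptions.

```python
# Sketch: concatenate the two streams along channels, then 1x1 and 3x3 convolutions,
# each followed by BN and LeakyReLU.
import torch
import torch.nn as nn

class ChannelFusion(nn.Module):
    def __init__(self, spatial_ch: int, temporal_ch: int, out_ch: int = 256):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(spatial_ch + temporal_ch, out_ch, kernel_size=1),
            nn.BatchNorm2d(out_ch),
            nn.LeakyReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.LeakyReLU(inplace=True),
        )

    def forward(self, f_spatial: torch.Tensor, f_temporal: torch.Tensor) -> torch.Tensor:
        return self.fuse(torch.cat([f_spatial, f_temporal], dim=1))
```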

Fig. 4. Feature extraction from optical flow based on Vision Transformer (ViT).
../../Resources/ieie/IEIESPC.2024.13.6.587/fig4.png

3.2 Student Behavior Recognition Model based on MSSTGCN

Traditional skeleton-based behavior recognition methods tend to focus solely on 1-order skeleton semantic information (physical connections between joints) and 2-order skeleton semantic information (the potential association between two joints whose hop distance is 2), which is reasonable for behavior recognition tasks with significant variations in movement. However, considering the similar categories and minor variations of student behaviors in the classroom environment, we can no longer ignore higher-order skeleton semantic information. Therefore, we designed a multi-level stacked spatio-temporal graph convolutional network to model the long-range spatial dependence of the human body and extract rich spatio-temporal features.

MSSTGCN consists of multiple parallel branches, each composed of three spatio-temporal graph convolutional blocks (STGCBs) that process skeleton information at a specific scale. The structure of the STGCB is shown in Fig. 5; it includes a GCN for extracting spatial features and a TCN for extracting temporal features, each followed by a BN layer and a LeakyReLU function. Inspired by the design in reference [31], we manually set a different adjacency matrix for the GCN in each STGCB so that it extracts skeleton features at a specific scale. For instance, to obtain 1-order information only, we construct an adjacency matrix only between a joint and its directly connected joints. To obtain 2-order information, we construct an adjacency matrix only between a joint and the joints that are two hops away from it. Following this logic, by manually configuring adjacency matrices we can obtain all higher-order semantic information for a given joint.
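The manually configured k-hop adjacency matrices can be constructed as in the following sketch; the joint count and edge list are placeholders rather than the actual DWPose upper-body graph, and the function name is an illustrative assumption.

```python
# Sketch: build an adjacency matrix that keeps only joint pairs at hop distance exactly k.
import numpy as np

def k_hop_adjacency(num_joints: int, edges: list[tuple[int, int]], k: int) -> np.ndarray:
    """Adjacency containing only joint pairs whose shortest hop distance equals k."""
    a1 = np.zeros((num_joints, num_joints), dtype=int)
    for i, j in edges:
        a1[i, j] = a1[j, i] = 1
    # Shortest hop distance via repeated multiplication of the 1-hop adjacency.
    dist = np.full((num_joints, num_joints), np.inf)
    np.fill_diagonal(dist, 0)
    reach = np.eye(num_joints, dtype=int)
    for hop in range(1, num_joints):
        reach = (reach @ a1 > 0).astype(int)          # pairs connected by a walk of length `hop`
        dist[(reach == 1) & np.isinf(dist)] = hop     # first hop at which a pair is reached
    return (dist == k).astype(np.float32)

# Example on a 5-joint chain: A2 = k_hop_adjacency(5, [(0, 1), (1, 2), (2, 3), (3, 4)], k=2)
```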

As shown in Fig. 2, we use a stepped stacking structure for feature fusion between the branches. The features extracted by the previous branch are concatenated with the features extracted by the current branch and then input into a Non-local block for enhanced extraction of critical features, after which they are passed to the next branch. Students' classroom behaviors often differ only in subtle details, and the Non-local operation strengthens the model's ability to distinguish these details among similar behaviors. Without the Non-local block, some subtle feature differences would gradually be overlooked by the model during aggregation with the next-level features, leaving the model unable to distinguish similar behaviors after multiple rounds of feature aggregation. The calculation of the Non-local block is as follows.

(2)
$ y_{i}=\frac{1}{c\left(x\right)}\sum _{\forall j}\mathrm f\left(x_{i},x_{j}\right)g\left(x_{j}\right) $
(3)
$ z_{i}=w_{z}y_{i}+x_{i} $

where $i$ represents the spatial or temporal position index of the desired response, and $j$ represents a possible position index contributing to the response. $x$ is the input feature, and $y$ is the output feature. The function $f$ calculates the correlation between positions $i$ and $j$. The function $g$ calculates a linear mapping of the input signal at position $j$ by multiplying it with a parameter matrix. $c\left(x\right)$ is a normalization factor; normalizing with Softmax keeps the output within a certain range, which aids model convergence. Eq. (2) aggregates information from all positions $j$ that may be relevant to position $i$. These relevant positions can come from preceding or succeeding frames, unlike traditional convolution, which only aggregates information from adjacent positions; the Non-local block can therefore extract more extensive features. In Eq. (3), $w_{z}$ is a linear mapping parameter learned during network training, "$+x_{i}$" represents a residual connection, and $z_{i}$ is the output of the Non-local block.

Fig. 6 shows the calculation process of the Non-local block. First, the input is convolved by ${\theta}$, ${\varphi}$ and $g$ to obtain ${\theta}(x_{i})$, ${\varphi}(x_{j})$ and $g(x_{j})$. Then, ${\theta}(x_{i})$ and ${\varphi}(x_{j})$ are multiplied to obtain $f\left(x_{i},x_{j}\right)$, which is normalized by Softmax. This result is multiplied by $g(x_{j})$ to obtain $y_{i}$, corresponding to Eq. (2). Finally, $y_{i}$ is convolved with $w_{z}$ and added to the original input $x_{i}$ to obtain the output $z_{i}$.
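A compact sketch of such a Non-local block operating on temporal feature maps, following the embedded-Gaussian form of Wang et al. [4], is given below; the 1D layout and the halved intermediate channel width are illustrative assumptions.

```python
# Sketch of Eqs. (2)-(3): pairwise relations over the temporal axis plus a residual connection.
import torch
import torch.nn as nn

class NonLocalBlock1D(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        inter = max(channels // 2, 1)
        self.theta = nn.Conv1d(channels, inter, kernel_size=1)
        self.phi = nn.Conv1d(channels, inter, kernel_size=1)
        self.g = nn.Conv1d(channels, inter, kernel_size=1)
        self.w_z = nn.Conv1d(inter, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T)
        theta = self.theta(x).transpose(1, 2)        # (B, T, C')
        phi = self.phi(x)                            # (B, C', T)
        g = self.g(x).transpose(1, 2)                # (B, T, C')
        f = torch.softmax(theta @ phi, dim=-1)       # Eq. (2): pairwise relation f, Softmax-normalized
        y = (f @ g).transpose(1, 2)                  # (B, C', T), the response y_i
        return self.w_z(y) + x                       # Eq. (3): linear mapping w_z plus residual x_i
```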

If we use $U_{n}$ to represent the result of feature fusion in the $n$-th branch and $F_{n}$ to represent the feature extracted by the graph convolution operation in the $n$-th branch, the outputs of the branches in MSSTGCN are as follows.

(4)
$ \begin{array}{l} U_{1}=F_{1}\\ U_{2}=\text{Non-Local}\left(\text{Concat}\left(U_{1},F_{2}\right)\right)\\ U_{3}=\text{Non-Local}\left(\text{Concat}\left(U_{2},F_{3}\right)\right)\\ \vdots \\ U_{n}=\text{Non-Local}\left(\text{Concat}\left(U_{n-1},F_{n}\right)\right) \end{array} $
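The stepped stacking of Eq. (4) can be sketched as follows. The branch, Non-local, and channel-reduction modules are passed in as assumed module lists; in particular, the per-level channel reducers are an assumption introduced here to keep the fused channel width constant and are not detailed in the text above.

```python
# Sketch: U_1 = F_1; U_n = NonLocal(Concat(U_{n-1}, F_n)), fused along the channel dimension.
import torch
import torch.nn as nn

class MultiBranchFusion(nn.Module):
    def __init__(self, branches: nn.ModuleList, non_locals: nn.ModuleList, reducers: nn.ModuleList):
        super().__init__()
        self.branches, self.non_locals, self.reducers = branches, non_locals, reducers

    def forward(self, skeleton_inputs: list[torch.Tensor]) -> torch.Tensor:
        u = self.branches[0](skeleton_inputs[0])                      # U_1 = F_1
        for n in range(1, len(self.branches)):
            f_n = self.branches[n](skeleton_inputs[n])                # F_n from the n-th scale
            fused = torch.cat([u, f_n], dim=1)                        # Concat(U_{n-1}, F_n)
            u = self.non_locals[n - 1](self.reducers[n - 1](fused))   # assumed channel reduction + Non-local
        return u
```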
Fig. 5. The structure of the STGCB.
../../Resources/ieie/IEIESPC.2024.13.6.587/fig5.png
Fig. 6. The calculation process of the Non-local block.
../../Resources/ieie/IEIESPC.2024.13.6.587/fig6.png

4. Experiments

4.1 Dataset

Given the absence of large publicly available datasets for evaluating model performance on teacher-student behavior recognition in the classroom environment, this research collected raw data from teaching videos of real classroom environments at a Chinese university. By analyzing the characteristics of teacher and student behaviors in the classroom, we summarized four representative teacher behaviors (writing on the blackboard, front-standing lecture, side-standing lecture, inspecting the classroom) and six student behaviors (listening, reading, writing, playing with mobile phones, sleeping, talking). The description of each behavior is shown in Table 1. Listening, reading and writing represent students' positive behaviors in the classroom, while playing with mobile phones, sleeping and talking represent students' negative behaviors. We aim to reveal the correlation between teacher and student behaviors in the classroom environment by analyzing the proportion of students' behavior states under the various teacher behaviors.

The dataset used in this research was collected from 8 different classroom environments, with each scene having a total duration of about 90 minutes (45 minutes from the teacher's front view and 45 minutes from the students' front view). During data preprocessing, we cropped and segmented the raw videos, standardizing them into short video clips with a resolution of 512×424 and a frame rate of 30 fps. Each video clip is 8-10 seconds long, and the total number of clips exceeds 4500. We further extracted frames from the video clips to label the behavior categories and saved the labels as JSON files. Some labelled frame images are shown in Fig. 7.
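A rough sketch of this clip extraction and labelling step is given below; the OpenCV-based tooling, function names, and JSON schema are illustrative assumptions, as the exact preprocessing tools are not specified.

```python
# Sketch: cut a raw lecture video into a fixed-size, fixed-fps clip and save a JSON label.
import cv2
import json

def extract_clip(src: str, dst: str, start_s: float, end_s: float,
                 size=(512, 424), fps: float = 30.0) -> None:
    cap = cv2.VideoCapture(src)
    writer = cv2.VideoWriter(dst, cv2.VideoWriter_fourcc(*"mp4v"), fps, size)
    cap.set(cv2.CAP_PROP_POS_MSEC, start_s * 1000)
    while cap.get(cv2.CAP_PROP_POS_MSEC) < end_s * 1000:
        ok, frame = cap.read()
        if not ok:
            break
        writer.write(cv2.resize(frame, size))    # standardize resolution to 512x424
    cap.release()
    writer.release()

def save_label(path: str, clip: str, behavior: str) -> None:
    with open(path, "w") as f:
        json.dump({"clip": clip, "behavior": behavior}, f)
```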

Fig. 7. Teacher behaviors and student behaviors in the dataset.
../../Resources/ieie/IEIESPC.2024.13.6.587/fig7.png
Table 1. The description of each behavior.

Teacher behavior | Behavior state
Writing on the blackboard | Standing with the back to the students and writing by hand
Front-standing lecture | Standing face to face with the students and talking
Side-standing lecture | Standing sideways towards the students and talking
Inspecting the classroom | Walking back and forth around the students

Student behavior | Behavior state
Listening | Sitting upright and looking ahead
Reading | Bowing the head and looking at the book
Writing | Bowing the head and writing with a pen
Playing with mobile phone | Bowing the head and looking at the mobile phone
Sleeping | Bowing the head and closing the eyes
Talking | Watching others and opening the mouth

4.2 Implementation Details and Analysis

In this research, we use the PyTorch deep learning framework to construct the model, TV-L1 [24] to calculate the optical flow between adjacent frames, and DWPose [25] to capture the upper-body skeleton information. When training the teacher behavior recognition model, we split the data into training, validation and test sets with a ratio of 8:1:1. The batch size is set to 32, and we use SGD as the optimizer with an initial learning rate of 0.0001 and a momentum of 0.9. The classification loss is Focal Loss, the bounding box regression loss is Smooth L1 Loss, and the confidence loss is MSE Loss. The model is trained for 40 epochs. To prevent overfitting, when loading the pre-trained weights of the ViT, we first linearly transform the optical flow to the range 0-255, converting it to the same value range as the RGB channels. Then, we average the weights of the first convolutional layer of the ViT and replicate this average for each optical flow channel. Finally, we modify the number of input channels of the first convolutional layer of the ViT and load the averaged weights. When training the student behavior recognition model, we set the batch size to 16 and use the cross-entropy loss, keeping the other parameter settings unchanged.
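The adaptation of the pre-trained ViT patch-embedding weights to the two-channel optical flow input can be sketched as follows; the use of timm, the vit_base_patch16_224 checkpoint name, and the function name are illustrative assumptions rather than the exact setup.

```python
# Sketch: average the pretrained RGB patch-embedding kernels and replicate them per flow channel.
import torch
import torch.nn as nn
import timm

def adapt_vit_to_flow(flow_channels: int = 2) -> nn.Module:
    vit = timm.create_model("vit_base_patch16_224", pretrained=True)
    old_proj = vit.patch_embed.proj                               # Conv2d(3, embed_dim, 16, stride=16)
    # Average the kernel weights over the RGB input-channel axis.
    mean_w = old_proj.weight.data.mean(dim=1, keepdim=True)       # (embed_dim, 1, 16, 16)
    # New first convolution with as many input channels as the optical flow has.
    new_proj = nn.Conv2d(flow_channels, old_proj.out_channels,
                         kernel_size=old_proj.kernel_size, stride=old_proj.stride)
    new_proj.weight.data = mean_w.repeat(1, flow_channels, 1, 1)  # copy the averaged kernel per channel
    if old_proj.bias is not None:
        new_proj.bias.data = old_proj.bias.data.clone()
    vit.patch_embed.proj = new_proj
    return vit
```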

Fig. 8 shows the confusion matrix for the classification of different behaviors by the multi-task hybrid model on the test set. We find that the spatio-temporal dual-stream network achieved over 90% classification accuracy for all four teacher behaviors. Among these, the accuracy for inspecting the classroom is the highest at 98.4%, while the accuracy for side-standing lecture is the lowest at 92.6%. According to our analysis, inspecting the classroom, which involves walking back and forth, exhibits more significant temporal and spatial feature differences than the other, relatively static behaviors, making it easier for the model to understand and classify. Due to issues of viewing angle and instantaneous states, side-standing lecture is easily misclassified as front-standing lecture or writing on the blackboard, leading to lower accuracy. MSSTGCN achieved an accuracy of over 87% for student behavior recognition. Among these, the accuracy for sleeping is the highest at 95.9%, while the accuracies for listening, playing with mobile phones, and talking are all above 90%. Reading and writing exhibit highly similar skeleton information, making these two behaviors more likely to be confused by the model and resulting in slightly lower classification accuracy.

To further analyze the correlation between teacher behaviors and student behaviors, we matched and statistically analyzed the teacher and student behavior recognition results at the same time in the same scene. The results are shown in Fig. 9. We found that when the teacher turns their back to write on the blackboard, the average proportion of students' negative behaviors is 68.2%, with playing with mobile phones accounting for more than half of these at 37.4%. When the teacher faces the students, the average proportion of students' negative behaviors decreases significantly, and the average proportion of positive behaviors increases substantially. Specifically, during side-standing lecture, the average proportion of students' positive behaviors is 56.3%, with listening accounting for 28.6%. During front-standing lecture, the average proportion of positive behaviors rises to 67.9%, with listening rising to 40.5%. It is worth noting that, compared with writing on the blackboard, the proportion of students' sleeping behavior does not decrease significantly when the teacher faces the students, indicating that sleeping is hardly influenced by the teacher's lecturing state. When the teacher is inspecting the classroom, the average proportion of students' positive behaviors reaches 88.2%, and the average proportion of negative behaviors decreases to 11.8%. Specifically, the proportions of playing with mobile phones and talking both remain below 3%, and the proportion of sleeping also decreases to about 8%, demonstrating that the teacher's inspecting behavior effectively constrains student behavior in the classroom environment.

Fig. 8. The confusion matrix for the classification of different behaviors.
../../Resources/ieie/IEIESPC.2024.13.6.587/fig8.png
Fig. 9. The teacher-student behavior recognition results at the same time in the same scene.
../../Resources/ieie/IEIESPC.2024.13.6.587/fig9.png

4.3 Ablation and Comparison Experiments

To validate the effectiveness of the multi-task hybrid model in classroom behavior recognition, this research conducted ablation and comparison experiments under the same experimental environment and parameter settings.

For teacher behavior recognition, the ablation experiments investigate the impact of using single-modality data and different backbones on performance. The results are shown in Table 2. Obviously, the accuracy of the dual-stream framework with RGB + optical flow is superior to that of any single-stream framework. Using ResNet-18 as the backbone of the RGB stream performs better than VGG-16 and slightly worse than the other 2D-CNN backbones. Using ViT as the backbone of the optical flow stream is significantly superior to any of the RNN-based backbones. Considering the model complexity (Params) and computational efficiency (GFLOPs), we choose ResNet-18 + ViT as the backbone of the dual-stream framework. The comparison experiments evaluate the performance of other state-of-the-art methods based on RGB or optical flow on the teacher behavior dataset, and the results are shown in Table 3.

For student behavior recognition, the ablation experiments investigate the impact of inputting different levels of skeleton information and of the Non-local block on the spatio-temporal graph convolutional network. The results are shown in Table 4. Compared with the 1-order and 2-order information, inputting higher-order semantic information enhances the model's ability to distinguish complex and similar behaviors. However, it is worth noting that merely inputting richer skeleton information provides limited assistance for feature extraction; incorporating the Non-local block among features at different scales allows for better feature fusion. The comparison experiments evaluate the performance of other state-of-the-art skeleton-based methods on the student behavior dataset, and the results are shown in Table 5.

To elucidate the reasons for using methods based on different modal data in different tasks, we also set up cross-validation experiments. The results are shown in Table 6.

Table 2. The ablation results for spatio-temporal dual-stream framework.

Modality | Backbone | Acc (%) | Params (×10⁶) | GFLOPs (×10⁹)
RGB only | VGG-16 | 76.3 | 140.89 | 16.45
RGB only | DarkNet-19 | 81.0 | 23.10 | 7.99
RGB only | ResNet-50 | 81.8 | 29.71 | 4.96
RGB only | DenseNet-169 | 82.4 | 36.24 | 10.40
RGB only | ResNet-18 | 79.2 | 14.66 | 4.28
Optical flow only | LSTM | 68.5 | - | 62.78
Optical flow only | GRU | 64.9 | - | 54.27
Optical flow only | ViT | 87.6 | 55.57 | 77.91
RGB + optical flow | VGG-16 + ViT | 91.6 | 198.02 | 98.34
RGB + optical flow | DarkNet-19 + ViT | 96.3 | 80.54 | 89.17
RGB + optical flow | ResNet-50 + ViT | 96.8 | 87.62 | 86.43
RGB + optical flow | DenseNet-169 + ViT | 97.1 | 93.89 | 89.35
RGB + optical flow | ResNet-18 + ViT | 95.5 | 71.99 | 85.12

Table 3. The comparison results with other state-of-the-art methods based on RGB or optical flow.

Methods | Modality | Acc (%) | GFLOPs (×10⁹)
Two-stream [12] | RGB + optical flow | 86.5 | -
SlowFast [7] | RGB | 90.7 | 65.79
MARS [26] | RGB + optical flow | 90.4 | -
MOC [9] | RGB | 89.7 | 29.77
YOWO [27] | RGB | 91.2 | 43.83
TubeR [28] | RGB | 93.9 | 122.46
Dual-stream framework (ours) | RGB + optical flow | 95.5 | 85.12

Table 4. The ablation results for MSSTGCN.

Methods | Input (skeleton information) | Acc (%) | GFLOPs (×10⁹)
MSSTGCN (without Non-local block) | 1-order | 86.5 | 17.95
MSSTGCN (without Non-local block) | 1-order and 2-order | 86.9 | 17.98
MSSTGCN (without Non-local block) | 1-order, 2-order, and 3-order | 87.2 | 17.99
MSSTGCN (without Non-local block) | 1-order, 2-order, 3-order, …, and n-order | 88.3 | 17.99
MSSTGCN (with Non-local block) | 1-order | 89.4 | 18.07
MSSTGCN (with Non-local block) | 1-order and 2-order | 90.1 | 18.24
MSSTGCN (with Non-local block) | 1-order, 2-order, and 3-order | 90.6 | 18.33
MSSTGCN (with Non-local block) | 1-order, 2-order, 3-order, …, and n-order | 90.8 | 18.36

Table 5. The comparison results with other state-of-the-art skeleton-based methods.

Methods | Acc (%) | GFLOPs (×10⁹)
ST-GCN [14] | 81.4 | 16.34
Motif-GCN [15] | 85.7 | -
PB-GCN [16] | 83.7 | -
2s-AGCN [29] | 85.2 | 39.15
MST-GCN [30] | 89.6 | 22.66
MSAAGCN [31] | 91.1 | 54.80
MSSTGCN (ours) | 90.8 | 18.36

Table 6. The cross-validation results on different tasks.

Methods | Tasks | Acc (%) | FPS (f·s⁻¹)
Dual-stream framework | Teacher behavior recognition | 95.5 | 28
MSSTGCN | Teacher behavior recognition | 87.1 | 47
Dual-stream framework | Student behavior recognition | 84.6 | 11
MSSTGCN | Student behavior recognition | 90.8 | 23

5. Conclusions

This paper proposes a vision-based multi-task hybrid model to explore the impact of teacher behavior on student behavior by recognizing behaviors in the classroom environment. The hybrid model consists of two parts: a teacher behavior recognition model based on a spatio-temporal dual-stream framework and a student behavior recognition model based on a multi-level stacked spatio-temporal graph convolutional network. Experiments on video data collected from real classroom environments validate the effectiveness of the proposed method. Further analysis of the experimental results reveals that the teacher's behavior significantly influences students' classroom learning states. In the future, we will consider enhancing the diversity of classroom environments and increasing the variety and number of behavior samples to refine the study of interactions between behaviors.

ACKNOWLEDGMENTS

This research is supported by the Natural Science Research Project of Anhui Province (KJ2021A1160) and the Quality Engineering Project of Anhui Province (2022sx060 & 2022jyxm659).

REFERENCES

1 
J. Zhao, J. Li, and J. Jian, ``A study on posture-based teacher-student behavioral engagement pattern,'' Sustainable Cities and Society, Vol. 67, 102749, Apr. 2021.DOI
2 
W. Xie, Y. Tao, J. Gao, D. Zhou, and W. Wang, ``YOWO Based Real-time Recognition of Classroom Learning Behaviors,'' Modern Educational Technology (in Chinese), Vol. 32, no. 6, pp. 107-114, Jun. 2022.DOI
3 
A. Dosovitskiy, et al., ``An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale,'' arXiv preprint arXiv:2010.11929, 2021.DOI
4 
X. Wang, R. Girshick, A. Gupta, and K. He, "Non-local Neural Networks," in Proc. of CVPR, pp. 7794-7803, Jun. 2018.DOI
5 
J. Carreira and A. Zisserman, ``Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset,'' in Proc. of CVPR, pp. 4724-4733, Jul. 2017.DOI
6 
C. Feichtenhofer, ``X3D: Expanding Architectures for Efficient Video Recognition,'' in Proc. of CVPR, pp. 200-210, Jun. 2020.DOI
7 
C. Feichtenhofer, H. Fan, J. Malik and K. He, ``SlowFast Networks for Video Recognition,'' in Proc. of ICCV, pp. 6201-6210, Oct. 2019.DOI
8 
V. Kalogeiton, P. Weinzaepfel, V. Ferrari, and C. Schmid, ``Action tubelet detector for spatio-temporal action localization,'' in Proc. of ICCV, pp. 4415-4423, Oct. 2017.DOI
9 
Y. Li, Z. Wang, L. Wang, and G. Wu, ``Actions as Moving Points,'' in Proc. of ECCV, Part XVI 16, pp. 68-84, Aug. 2020.DOI
10 
J. Liu, A. Shahroudy, D. Xu, and G. Wang, ``Spatio-Temporal LSTM with Trust Gates for 3D Human Action Recognition,'' in Proc. of ECCV, Part III 14, pp. 816-833, Jul. 2016.DOI
11 
W. Zheng, L. Li, Z. Zhang, Y. Huang and L. Wang, ``Relational Network for Skeleton-Based Action Recognition,'' in Proc. of ICME, pp. 826-831, Jul. 2019.DOI
12 
J. Tu, M. Liu and H. Liu, "Skeleton-Based Human Action Recognition Using Spatial Temporal 3D Convolutional Neural Networks," in Proc. of ICME, pp. 1-6, Jul. 2018.DOI
13 
C. Li, Q. Zhong, D. Xie, and S. Pu, "Co-occurrence Feature Learning from Skeleton Data for Action Recognition and Detection with Hierarchical Aggregation," in Proc. of IJCAI, pp. 786-792, Jul. 2018.DOI
14 
S. Yan, Y. Xiong, and D. Lin, " Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition," in Proc. of AAAI, pp. 7444-7452, Feb. 2018.DOI
15 
Y. Wen, L. Gao, H. Fu, F. Zhang, and S. Xia, "Graph CNNs with Motif and Variable Temporal Block for Skeleton-Based Action Recognition," in Proc. of AAAI, pp. 8989-8996, Jan. 2019.DOI
16 
K. Thakkar, and P. J. Narayanan, "Part-based Graph Convolutional Network for Action Recognition," arXiv preprint arXiv:1809.04983, 2018.DOI
17 
N. A. Flanders, "Analyzing teacher behavior," Massachusetts: Addison-Wesley, 1970.URL
18 
Y. Zhang, et al., "AI Education Based on Evaluating Concentration of Students in Class: Using Machine Vision to Recognize Students’ Classroom Behavior," in Proc. of ICVIP, pp. 126-133, Dec. 2021.DOI
19 
Y. Huang, M. Liang, X. Wang, Z. Chen, and X. Cao, ``Multi-person classroom action recognition in classroom teaching videos based on deep spatiotemporal residual convolution neural network,'' Journal of Computer Applications (in Chinese), Vol. 42, no. 3, pp. 736-742, Mar. 2022.DOI
20 
F. Lin, H. H. Ngo, C. R. Dow, K. H. Lam, and H. L. Le, ``Student Behavior Recognition System for the Classroom Environment Based on Skeleton Pose Estimation and Person Detection,'' Sensors, Vol. 21, no. 16, 5314, Aug. 2021.DOI
21 
J. Mo, R. Zhu, H. Yuan, Z. Shou, and L. Chen, ``Student behavior recognition based on multitask learning,'' Multimedia Tools and Applications, Vol. 82, no. 12, pp. 19091-19108, May. 2023.DOI
22 
S. Zhang, et al., ``MSTA-SlowFast: A Student Behavior Detector for Classroom Environments,'' Sensors, Vol. 23, no. 11, 5205, May. 2023.DOI
23 
X. Xu, and J. Zhang, ``Classroom Behavior Recognition of Students Based on Improved YOWO Algorithm,'' Computer Systems & Applications (in Chinese), Vol. 33, no. 4, pp. 113-122, Jan. 2024.DOI
24 
F. Steinbr\"{u}cker, T. Pock and D. Cremers, "Large displacement optical flow computation without warping," in Proc. of ICCV, pp. 1609-1614, Sep. 2009.DOI
25 
Z. Yang, A. Zeng, C. Yuan, and Y. Li, "Effective Whole-body Pose Estimation with Two-stages Distillation," in Proc. of ICCVW, pp. 4212-4222, Oct. 2023.DOI
26 
N. Crasto, P. Weinzaepfel, K. Alahari, and C. Schmid, "MARS: Motion-Augmented RGB Stream for Action Recognition," in Proc. of CVPR, pp. 7874-7883, Jun. 2019.DOI
27 
O. K\"{o}p\"{u}kl\"{u}, X. Wei, and G. Rigoll, ``You Only Watch Once: A Unified CNN Architecture for Real-Time Spatiotemporal Action Localization,'' arXiv preprint arXiv:1911.06644, 2021.DOI
28 
J. Zhao, et al., "TubeR: Tubelet Transformer for Video Action Detection," in Proc. of CVPR, pp. 13588-13597, Jun. 2022.DOI
29 
L. Shi, Y. Zhang, J. Cheng, and H. Lu, "Two-Stream Adaptive Graph Convolutional Networks for Skeleton-Based Action Recognition," in Proc. of CVPR, pp. 12018-12027, Jun. 2019.DOI
30 
Z. Chen, S. Li, B. Yang, Q. Li, and H. Liu, "Multi-Scale Spatial Temporal Graph Convolutional Network for Skeleton-Based Action Recognition," in Proc. of AAAI, Vol. 35, no. 2, pp. 1113-1122, Feb. 2021.DOI
31 
Z. Zheng, Y. Wang, X. Zhang, and J. Wang, ``Multi-Scale Adaptive Aggregate Graph Convolutional Network for Skeleton-Based Action Recognition,'' Applied Sciences, Vol. 12, no. 3, 1402, Jan. 2022.DOI
Huan Zhou
../../Resources/ieie/IEIESPC.2024.13.6.587/au1.png

Huan Zhou is an Associate Professor at the School of Big Data and Artificial Intelligence, Anhui Xinhua University (AHXU), Hefei, China. She received her B.E. degree in Electronic and Information Engineering from Hefei Normal University (HFNU) in 2014, and received her M.E. degree in Optical Engineering from Nanjing University of Information Science and Technology (NUIST) in 2017. Her research interests include data mining and machine learning.

Wenrui Zhu
../../Resources/ieie/IEIESPC.2024.13.6.587/au2.png

Wenrui Zhu received his B.E. degree in Network Engineering from Anhui Jianzhu University (AHJZU) in 2019, and received his M.E. degree in Electronic and Information Engineering from AHJZU in 2023. Currently, he is pursuing the Ph.D. degree at Xi'an University of Architecture and Technology (XAUAT), Xi'an, China. His research interests include computer vision and human action recognition.