
(Digital Technology School, Sias University, Zhengzhou, Henan 451150, China)



Keywords: Classroom behavior, Deep learning, YOLO v5s

1. Introduction

Student behavior determines, to a certain extent, how much knowledge students acquire in the classroom. Automatic analysis of classroom behavior is therefore of great importance in evaluating teaching effectiveness [1]. Traditionally, monitoring students' classroom behavior relies on the teacher's direct observation, which is time-consuming and may distract from teaching. With the rapid development of deep learning in fields such as image recognition and object detection, some scholars have begun to study how deep learning can automatically identify student behavior in the classroom. Among the related research on behavior identification, Wu constructed a classroom behavior recognition model [2] that combines the particle swarm optimization k-nearest neighbors (PSO-kNN) algorithm with emotional image processing algorithms and proved highly accurate in identifying both emotion and behavior. Xie et al. proposed a deep learning algorithm based on spatio-temporal representation learning to evaluate college students' classroom posture [3]. The results revealed that the proposed algorithm had 5% higher accuracy than a baseline 3D convolutional neural network (CNN) and was an effective tool for identifying abnormal behavior in college classrooms. Lin et al. [4] proposed a system that uses a deep neural network to classify actions and identify student behavior. The experimental results showed that the proposed system had a 15.15% higher average accuracy rate and a 12.15% higher average recall rate than a skeleton-based approach. Pang [5] combined a conventional clustering analysis algorithm and the random forest algorithm with a human skeleton model to recognize students' classroom behavior, and experiments verified the effectiveness of recognition based on human skeleton models. Mao [6] proposed an intelligent image recognition system for students' classroom behavior; in a large number of experiments simulating many classroom behaviors, the system accurately and quickly identified incorrect classroom behavior and issued timely reminders. Ma and Yang [7] constructed a system for analyzing and assessing classroom behavior using deep learning face recognition technology and found that the system could effectively evaluate students' classroom behavior.

This article explores the recognition of students' classroom behavior in universities. Data were collected and input into a You Only Look Once version 5 small (YOLO v5s) model, which was used to identify and analyze students' behavior in the classroom. The results were compared with those of other object detection models to demonstrate the effectiveness and feasibility of the student classroom behavior recognition model proposed in this article.

2. Algorithms for Behavior Recognition

Currently, most universities have smart classrooms equipped with video surveillance devices that record students' behavior, which can then be recognized from the surveillance footage. Deep learning object detection and recognition algorithms are either single-step or two-step. Compared with two-step algorithms, single-step algorithms train more stably and recognize faster. YOLO [8], a single-step recognition algorithm based on target regression, is one of the best algorithms in the field of object detection. YOLO v5 is one of the newer versions and offers both high accuracy and high speed. It comes in v5s, v5m, v5l, and v5x variants, built on networks of different depths and widths. This paper evaluated these four variants against our actual application scenario, and we ultimately chose YOLO v5s [9] as our detection model. YOLO v5s consists of six modules: Focus, CBL, CSP, SPP, upsampling, and Concat, as described below.

(1) Focus structure: The original image (640×640×3) is imported and sliced into a 320×320×12 feature map, which a subsequent convolution turns into a 320×320×32 feature map (a minimal code sketch of this slicing step follows the module list).

(2) CBL: This module consists of convolution (Conv), batch normalization (BN), and the Leaky ReLU activation function; the feature map is convolved, normalized, and activated in sequence. A convolution with a kernel size of 3 and a stride of 2 downsamples the feature map, whereas a convolution with a kernel size of 1 and a stride of 1 performs feature mapping.

(3) Cross Stage Partial (CSP): The CSP1\_X structure is used in the main backbone network, and the CSP2\_X structure is used in the neck.

(4) Spatial Pyramid Pooling (SPP): The feature map is subjected to k×k maximum pooling operations to enlarge the receptive field of the main features.

(5) Upsampling: This module uses the nearest-neighbor method to double the size of the advanced feature map.

(6) Concat: This module adds advanced features to low-level features to create a new feature map.
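To make the Focus slicing step concrete, the following is a minimal PyTorch sketch written from the description in (1) above, not from the authors' code; the module and parameter names and the Leaky ReLU slope are our assumptions.

```python
# A minimal PyTorch sketch of the Focus slicing step described in module (1);
# the combined Conv + BN + Leaky ReLU follows the CBL pattern of module (2).
import torch
import torch.nn as nn

class Focus(nn.Module):
    """Slice a (B, 3, 640, 640) image into (B, 12, 320, 320), then convolve to 32 channels."""
    def __init__(self, in_ch=3, out_ch=32, k=3):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch * 4, out_ch, k, stride=1, padding=k // 2, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.LeakyReLU(0.1, inplace=True),
        )

    def forward(self, x):
        # Gather every second pixel at four phase offsets and stack along channels:
        # (B, C, H, W) -> (B, 4C, H/2, W/2)
        x = torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2],
                       x[..., ::2, 1::2], x[..., 1::2, 1::2]], dim=1)
        return self.conv(x)

print(Focus()(torch.randn(1, 3, 640, 640)).shape)  # torch.Size([1, 32, 320, 320])
```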

3. Case Analysis

3.1 Data Acquisition and Processing

A dataset was constructed for the experiment in this study and was divided into training and testing sets at a 7:3 ratio. The experimental data were classroom videos obtained from the Digital Technology School, Sias University, Zhengzhou, China. After obtaining consent from teachers and students, cameras installed in the classroom were used to record videos of the classes. Student behavior was evaluated as shown in Table 1. Since there were very few cases of sleeping or standing up to answer questions in the initial collection, we deliberately asked students to perform these actions in subsequent collections to supplement the dataset.

After data acquisition and before the experiment began, the collected data were processed. First, five classroom behaviors (raising the head to listen, standing up to answer questions, sleeping, playing with a mobile phone, and turning to chat) were extracted from the video data, with each behavior limited to approximately 20 minutes of video. Video frames were then extracted at equal intervals, yielding a dataset of approximately 4,200 images. The images were then labeled with LabelImg, a Python-based image annotation tool [10], installed on a Windows system. A folder named JPEGImages was created to store the images, and a folder named Annotations was created to store the annotation files. All the images were imported into LabelImg and annotated, and the tool generated a corresponding XML file for each labeled image. Images before labeling are shown in Figs. 1-1 and 1-3, and images after labeling are shown in Figs. 1-2 and 1-4. After labeling was complete, the XML files and the corresponding images were uniformly named and stored in the Annotations and JPEGImages folders, respectively.
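For illustration, a minimal sketch of equal-interval frame extraction follows, assuming OpenCV; the file name, output folder, and sampling interval are illustrative choices, not the authors' exact script.

```python
# A minimal sketch of equal-interval frame extraction with OpenCV.
import cv2
from pathlib import Path

def extract_frames(video_path, out_dir="JPEGImages", every_n=60):
    """Save every n-th frame of a class video as a JPEG for LabelImg annotation."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    cap = cv2.VideoCapture(str(video_path))
    idx = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:                      # end of video
            break
        if idx % every_n == 0:          # keep frames at equal intervals
            name = f"{Path(video_path).stem}_{saved:05d}.jpg"
            cv2.imwrite(str(out / name), frame)
            saved += 1
        idx += 1
    cap.release()
    return saved

print(extract_frames("raise_head_to_listen.mp4"))  # number of frames saved
```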

Fig. 1. Comparison of student classroom behavior video images before and after labeling.
Table 1. Criteria for Determining Classroom Behavior.

Behavior | Judgment criteria
Raising the head to listen to the lecture | Looking up at the teacher, blackboard, or PowerPoint presentation, and taking notes
Standing up to answer questions | Standing in front of the chair
Sleeping on the table | Bending over the table
Playing with a mobile phone | Looking down at a phone and holding it in the hand
Turning the head to chat | Turning the head and talking to another student

3.2 Experimental Steps and Parameter Settings

The steps of the experiment are as follows. First, the data were collected and processed, the images were labeled with the LabelImg tool, and the training set of the processed dataset was input to the model for training. Second, the model was trained and refined iteratively according to the training results, after which the test set was input for the experiment. Third, because universities hold both large and small classes, different algorithms were used to identify students' classroom behaviors under different numbers of people in the classroom. Fourth, because students' classroom behaviors differ and each behavior has a distinct posture, the recognition results for the five classroom behaviors were evaluated with three different algorithms. Finally, the results of the three algorithms were evaluated under different intersection over union (IoU) thresholds. The three algorithms were YOLO v5s, a single shot multibox detector (SSD), and a region-based convolutional neural network (R-CNN). The purpose of this paper is to demonstrate that the YOLO v5s object detection model is feasible for recognizing student behavior in the classroom.

To ensure valid and reliable recognition results from the YOLO v5s model, the training parameters were kept consistent across the experiments with the proposed method. The parameter settings are as follows: the stochastic gradient descent (SGD) algorithm was chosen for network optimization, the initial learning rate was 0.001, the batch size was 8, the number of epochs was 100, the damping index was set to 0.5, and the length-width ratio of the anchor box was 1:2. The binary cross-entropy loss function was used for classification, and the following bounding-box loss function [11] was used:

(1)
$L_{CIoU}=1-\left(IoU-\frac{D_{2}^{2}}{D_{c}^{2}}-\frac{V^{2}}{\left(1-IoU\right)+V}\right)$,

where $L_{CIoU}$ represents the loss value of border prediction, $IoU$ represents the overlap between the predicted and true boxes, $D_{2}$ represents the distance between the centers of the predicted and true boxes, $D_{c}$ represents the diagonal length of the smallest box enclosing both, and $V$ is a parameter that measures consistency in the length-width ratio.
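As an illustration, the following PyTorch sketch implements Eq. (1) directly from the definitions above, together with the stated optimizer settings. It is written from the equation, not from the authors' code; boxes are assumed to be in (x1, y1, x2, y2) format, and reading the "damping index" as SGD momentum is our assumption.

```python
# A sketch of the CIoU border loss of Eq. (1) and the stated training setup.
import math
import torch
import torch.nn as nn

def ciou_loss(pred, target, eps=1e-7):
    """L_CIoU = 1 - (IoU - D2^2 / Dc^2 - V^2 / ((1 - IoU) + V)), per Eq. (1)."""
    # IoU: overlap between predicted and true boxes
    x1 = torch.max(pred[:, 0], target[:, 0]); y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2]); y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # D2: distance between box centers; Dc: diagonal of the smallest enclosing box
    d2 = ((pred[:, 0] + pred[:, 2]) - (target[:, 0] + target[:, 2])) ** 2 / 4 \
       + ((pred[:, 1] + pred[:, 3]) - (target[:, 1] + target[:, 3])) ** 2 / 4
    dc2 = (torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])) ** 2 \
        + (torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])) ** 2 + eps

    # V: consistency of the length-width (aspect) ratios
    v = (4 / math.pi ** 2) * (
        torch.atan((target[:, 2] - target[:, 0]) / (target[:, 3] - target[:, 1] + eps))
        - torch.atan((pred[:, 2] - pred[:, 0]) / (pred[:, 3] - pred[:, 1] + eps))
    ) ** 2

    return 1 - (iou - d2 / dc2 - v ** 2 / ((1 - iou) + v + eps))

# Optimizer with the stated settings: SGD, lr 0.001, momentum ("damping index") 0.5;
# training then runs for 100 epochs with a batch size of 8.
model = nn.Linear(10, 4)  # stand-in for the YOLO v5s network
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.5)
```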

3.3 Model Evaluation Indicators

Precision (P), recall (R), average precision (AP), and mean average precision (mAP) [12] were used as evaluation indicators to assess the recognition results of the model. Considering the practical application requirements of the model, real-time detection and recognition of students' classroom behaviors were required. Therefore, the detection speed (in frames per second) was also evaluated. The precision and recall rate expressions are:

(2)
$\mathrm{P}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}}$,

and

(3)
$\mathrm{R}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}$,

where TP indicates a positive example recognized as positive, FP indicates a negative example recognized as positive, and FN indicates a positive example recognized as negative.

In addition to these two indicators, AP and mAP were used; AP measures the detection performance for a specific class, while mAP measures the detection performance of the model over all classes. AP is calculated as follows. When recognizing students' classroom behaviors, the precision and recall of each behavioral class can be computed, giving a precision/recall curve for each class. The area under that curve is the AP value, and mAP is the mean of the AP values over all categories. The expressions for AP and mAP are:

(4)
$\mathrm{AP}=\int _{0}^{1}\mathrm{P}\left(\mathrm{R}\right)\mathrm{dR}$,

and

(5)
$\mathrm{mAP}=\frac{\sum _{\mathrm{i}=1}^{\mathrm{C}}\mathrm{AP}_{\mathrm{i}}}{\mathrm{C}}$,

where P and R stand for precision and recall, respectively, and C is the number of categories.
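A minimal NumPy sketch of Eqs. (2)-(5) follows; the TP/FP/FN counts are illustrative assumptions, while the last line reproduces the YOLO v5s mAP of Table 3 from its per-behavior AP values.

```python
# A minimal NumPy sketch of Eqs. (2)-(5).
import numpy as np

def precision(tp, fp):
    return tp / (tp + fp)                         # Eq. (2)

def recall(tp, fn):
    return tp / (tp + fn)                         # Eq. (3)

def average_precision(recalls, precisions):
    """Area under the precision/recall curve (Eq. (4)), via the trapezoid rule."""
    r = np.asarray(recalls, dtype=float)
    p = np.asarray(precisions, dtype=float)
    order = np.argsort(r)
    r, p = r[order], p[order]
    return float(np.sum((r[1:] - r[:-1]) * (p[1:] + p[:-1]) / 2))

def mean_average_precision(ap_per_class):
    return sum(ap_per_class) / len(ap_per_class)  # Eq. (5)

print(precision(tp=90, fp=6), recall(tp=90, fn=5))             # illustrative counts
print(mean_average_precision([97.4, 93.5, 93.6, 95.9, 98.6]))  # 95.8, as in Table 3
```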

3.4 Result Analysis

Considering the different class sizes in universities, and to make the recognition model more applicable in the future, the experiment covered two situations: a moderate class size and a large class size. For the moderate class size, one class was selected for video recording; for the large class size, two to three classes were selected. Classroom behavior was identified using the different recognition models for each class size. From the results in Table 2, we can see that the precision of the YOLO v5s model for the medium and large classes was 94.37% and 94.29%, respectively; recall was 95.71% and 94.29%, respectively; and mAP was 96.02 and 95.48, respectively, suggesting that detection and recognition quality was similar across class sizes. At the same time, there was little difference in detection speed between the medium and large class sizes (118.25 fps and 117.65 fps, respectively, a difference of only 0.6 fps). Compared with the SSD and R-CNN models, the detection speed of YOLO v5s was much higher, indicating it is more suitable for real-time detection of students' classroom behavior. Although the evaluation indices of the YOLO v5s recognition model for the large class were slightly lower than for the moderate class, the overall difference was not significant, indicating the YOLO v5s recognition model can be applied to classroom behavior recognition under different class sizes.

Each student's behavior in the classroom varies, and so does body posture. It is therefore crucial for the model to accurately classify and recognize each behavior. This article evaluated the AP of the recognition results from the three algorithms. As shown in Table 3, the AP values of the YOLO v5s model for the five behaviors were 97.4 for raising the head and listening, 93.5 for sleeping on the table, 93.6 for looking down and playing with a phone, 95.9 for turning the head to chat, and 98.6 for standing up to answer questions. These values were all higher than the corresponding AP values of the SSD and R-CNN models. Moreover, across all three models, the AP values for standing up to answer questions, raising the head and listening, and turning the head to chat were higher than those for sleeping and looking down and playing with a phone. This suggests that some students have similar upper-body postures when sleeping on the table and when looking down at a phone, leading to confusion between these two categories.

Table 4 reports mAP under different IoU thresholds; a higher mAP means a more accurate model, so the IoU threshold is a crucial element in calculating mAP. This paper evaluated the recognition results of the three algorithms at different thresholds [15]. In Table 4, mAP@0.5 and mAP@0.75 represent the mAP over all classes with the IoU threshold set at 0.5 and 0.75, respectively, while mAP@0.5:0.95 represents the mAP averaged over IoU thresholds from 0.5 to 0.95 in increments of 0.05. Based on the data in Table 4, YOLO v5s had a higher mAP than the SSD and R-CNN models at every IoU threshold, reaching 95.8, 94.3, and 92.9 for mAP@0.5, mAP@0.75, and mAP@0.5:0.95, respectively. This indicates that YOLO v5s performs excellently in object detection and can be used for real-time recognition of classroom behavior in college students, achieving the expected experimental results.
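The mAP@0.5:0.95 convention can be illustrated with a short sketch; `ap_at_iou` below is a hypothetical placeholder for a full evaluator, not part of the experiment.

```python
# A minimal sketch of mAP@0.5:0.95: AP is evaluated at IoU thresholds
# 0.50, 0.55, ..., 0.95 and the results are averaged.
import numpy as np

def ap_at_iou(threshold):
    return 0.97 - 0.08 * threshold          # toy monotone curve, illustration only

thresholds = np.arange(0.50, 0.96, 0.05)    # ten thresholds from 0.5 to 0.95
map_50_95 = float(np.mean([ap_at_iou(t) for t in thresholds]))
print(len(thresholds), round(map_50_95, 3))
```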

Table 2. Recognition Results of Three Algorithms for Different Class Sizes.

Category | Model | Precision | Recall rate | mAP | Detection speed (fps)
Medium class size | YOLO v5s | 94.37% | 95.71% | 96.02 | 118.34
Medium class size | SSD [13] | 87.88% | 82.86% | 88.66 | 92.14
Medium class size | R-CNN [14] | 80.65% | 71.43% | 80.27 | 91.93
Large class size | YOLO v5s | 94.29% | 94.29% | 95.48 | 117.65
Large class size | SSD | 84.36% | 77.14% | 84.91 | 89.37
Large class size | R-CNN | 76.19% | 68.57% | 79.35 | 87.64

Table 3. Evaluation of Identification Results under Different Classroom Behaviors.

Model | AP: Raise the head and listen | AP: Sleep on the table | AP: Look down and play with a phone | AP: Turn the head to chat | AP: Stand up to answer a question | mAP
YOLO v5s | 97.4 | 93.5 | 93.6 | 95.9 | 98.6 | 95.8
SSD | 93.3 | 81.1 | 82.9 | 91.2 | 93.5 | 88.4
R-CNN | 82.2 | 74.8 | 75.1 | 80.5 | 82.9 | 79.1

Table 4. Evaluation of Recognition Results under Different IoU Thresholds.

Model | mAP@0.5 | mAP@0.75 | mAP@0.5:0.95
YOLO v5s | 95.8 | 94.3 | 92.9
SSD | 88.4 | 85.9 | 80.2
R-CNN | 79.1 | 78.5 | 76.3

4. Discussion

Real-time recognition of students' classroom behavior using deep learning techniques can help evaluate classroom situations and improve the quality of teaching. In this study, the YOLO v5s recognition model was used to detect and recognize students' behavior in the classroom, and teachers can use this information to evaluate students in their regular classes. The experimental results showed that under different classroom densities and IoU thresholds, the YOLO v5s model was superior to the SSD and R-CNN models in precision, recall, AP, mAP, and detection speed, revealing that YOLO v5s can be applied to real-time classroom behavior recognition under different classroom densities. Once classroom behaviors can be identified, the different types of behavior need to be managed; some bad behavior arises because it is difficult to engage students effectively with the content and to establish a genuine relationship with them [16]. Some studies have suggested that providing students with social rewards, such as praise, encouragement, and care, to promote good classroom behavior is the most accepted management approach [17]. This paper argues that the key to implementing student behavior management in the classroom is teacher behavior [18], and that it should involve both classroom management and student interactions, such as strengthening attendance systems and frequently asking students to answer questions. Future research will focus on classroom management, such as building a classroom management system based on the YOLO v5s recognition model. According to the teaching needs of colleges and universities, the system would be divided into three ports: student, teacher, and administrator [19].

(1) Students can view their attendance and video recognition results by entering their student ID and password [20].

(2) Teachers can view behavior detection results and attendance records from the classroom video (a hypothetical sketch of this detection step follows the list). They can send the results and attendance records to students, and they have permission to modify the data. For example, if a student attended class for only a short time but the attendance record was miscalculated, the teacher can correct it.

(3) The operation rights of the administrator include and exceed those of students and teachers. The administrator can organize courses and process class videos, such as saving them in different locations according to different courses and semesters, and can delete videos from the previous semester.
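As a sketch of how the teacher port might invoke the trained detector, the following assumes the public ultralytics/yolov5 torch.hub API; the weights file `best.pt` and the input image stand in for classroom-trained weights and a real video frame, and are assumptions rather than the authors' artifacts.

```python
# A hypothetical sketch of the teacher port's detection step.
import torch

model = torch.hub.load('ultralytics/yolov5', 'custom', path='best.pt')
results = model('classroom_frame.jpg')    # one frame from the class video
results.print()                           # per-behavior detections and confidences
detections = results.pandas().xyxy[0]     # boxes with class names and confidences
print(detections[['name', 'confidence']])
```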

5. Conclusion

This article provided a brief introduction to student classroom behavior and the YOLO object detection algorithm. Prior to the experiment, video data were converted to images and labeled using the Python-based LabelImg tool. Then, the YOLO v5s model was used to build an object detection and recognition model to identify and analyze student classroom behavior. The performance of the model was evaluated based on precision, recall, AP, mAP, and detection speed. The experimental results showed that under medium and large classroom densities, respectively, the YOLO v5s model achieved precision of 94.37% and 94.29%, recall of 95.71% and 94.29%, and mAP of 96.02 and 95.48, with detection speeds of 118.25 fps and 117.65 fps. The recognition results were consistent at both classroom densities. The mAP values of YOLO v5s at the different IoU thresholds were higher than those of the SSD and R-CNN models, reaching 95.8, 94.3, and 92.9. This paper demonstrates that YOLO v5s is an excellent model in the field of object detection and can be effectively applied to real-time recognition of college students' behavior in the classroom.

REFERENCES

[1] B. Yang, Z. Yao, H. Lu, Y. Zhou, and J. Xu, ``In-classroom learning analytics based on student behavior, topic and teaching characteristic mining,'' Pattern Recognition Letters, Vol. 129, pp. 224-231, Jan. 2020.
[2] S. Wu, ``Simulation of classroom student behavior recognition based on PSO-kNN algorithm and emotional image processing,'' Journal of Intelligent and Fuzzy Systems, Vol. 40, No. 4, pp. 1-11, Dec. 2020.
[3] Y. Xie, S. Zhang, and Y. Liu, ``Abnormal Behavior Recognition in Classroom Pose Estimation of College Students Based on Spatiotemporal Representation Learning,'' Traitement du Signal, Vol. 38, No. 1, pp. 89-95, Feb. 2021.
[4] F. Lin, H. Ngo, C. Dow, K. H. Lam, and H. L. Le, ``Student Behavior Recognition System for the Classroom Environment Based on Skeleton Pose Estimation and Person Detection,'' Sensors, Vol. 21, No. 16, pp. 1-20, Aug. 2021.
[5] C. Pang, ``Simulation of student classroom behavior recognition based on cluster analysis and random forest algorithm,'' Journal of Intelligent and Fuzzy Systems, Vol. 40, No. 2, pp. 2421-2431, Feb. 2021.
[6] L. Mao, ``Remote classroom action recognition based on improved neural network and face recognition,'' Journal of Intelligent and Fuzzy Systems, Vol. 2021, No. 1, pp. 1-11, Mar. 2021.
[7] C. Ma and P. Yang, ``Research on Classroom Teaching Behavior Analysis and Evaluation System Based on Deep Learning Face Recognition Technology,'' Journal of Physics: Conference Series, Vol. 1992, No. 3, pp. 1-7, Aug. 2021.
[8] X. Feng, Y. Piao, and S. Sun, ``Vehicle tracking algorithm based on deep learning,'' Journal of Physics: Conference Series, Vol. 1920, No. 1, pp. 1-7, May 2021.
[9] Z. Ying, Z. Lin, Z. Wu, K. Liang, and X. Hu, ``A modified-YOLOv5s model for detection of wire braided hose defects,'' Measurement, Vol. 190, pp. 110683.1-110683.11, Jan. 2022.
[10] S. Tabassum, S. Ullah, N. H. Al-Nur, and S. Shatabda, ``Poribohon-BD: Bangladeshi local vehicle image dataset with annotation for classification,'' Data in Brief, Vol. 33, No. 1, pp. 1-6, Dec. 2020.
[11] S. Wu and X. Li, ``IoU-Balanced loss functions for single-stage object detection,'' Pattern Recognition Letters, Vol. 156, pp. 96-103, Apr. 2022.
[12] S. Li, Y. Li, Y. Li, M. Li, and X. Xu, ``YOLO-FIRI: Improved YOLOv5 for Infrared Image Object Detection,'' IEEE Access, Vol. 9, pp. 141861-141875, Oct. 2021.
[13] R. Ranjan, A. Bansal, J. Zheng, H. Xu, J. Gleason, B. Lu, A. Nanduri, J. C. Chen, C. D. Castillo, and R. Chellappa, ``A Fast and Accurate System for Face Detection, Identification, and Verification,'' IEEE Transactions on Biometrics, Behavior, and Identity Science, Vol. 1, No. 2, pp. 82-96, Apr. 2019.
[14] U. H. Gawande, K. O. Hajari, and Y. G. Golhar, ``Scale Invariant Mask R-CNN for Pedestrian Detection,'' Electronic Letters on Computer Vision and Image Analysis, Vol. 19, No. 3, pp. 98-117, Nov. 2020.
[15] D. Sun, Y. Yang, M. Li, J. Yang, B. Meng, R. Bai, L. Li, and J. Ren, ``A Scale Balanced Loss for Bounding Box Regression,'' IEEE Access, Vol. 8, pp. 108438-108448, June 2020.
[16] W. C. Hunter, A. D. Jasper, K. Barnes, L. L. Davis, K. Davis, J. D. Singleton, S. Barton-Arwood, and T. M. Scott, ``Promoting positive teacher-student relationships through creating a plan for Classroom Management On-boarding,'' Multicultural Learning and Teaching, Vol. 18, No. 1, Feb. 2021.
[17] J. D. McLennan, H. Sampasa-Kanyinga, K. Georgiades, and E. Duku, ``Variation in Teachers' Reported Use of Classroom Management and Behavioral Health Strategies by Grade Level,'' School Mental Health, Vol. 12, No. 1, pp. 67-76, Mar. 2020.
[18] A. Al-Bahrani, ``Classroom management and student interaction interventions: Fostering diversity, inclusion, and belonging in the undergraduate economics classroom,'' The Journal of Economic Education, Vol. 53, No. 3, pp. 259-272, May 2022.
[19] J. Zhang, ``Computer Assisted Instruction System Under Artificial Intelligence Technology,'' Pediatric Obesity, Vol. 16, No. 5, pp. 1-13, Mar. 2021.
[20] N. P. Putra, S. Loppies, and R. Zubaedah, ``Prototype of College Student Attendance Using Radio Frequency Identification (RFID) at Musamus University,'' IOP Conference Series: Materials Science and Engineering, Vol. 1125, No. 1, pp. 1-8, May 2021.

Author

Ms. Xing Su

Ms. Xing Su is currently a lecturer at Sias University in Zhengzhou. She graduated from Fort Hays State University in the United States with a master's degree. Her research interests include economic management and educational management.

Wei Wang

Wei Wang is a lecturer at Sias University in Zhengzhou, China. He graduated from Fort Hays State University in the United States with a master’s degree, and from Peking University HSBC Business School with an EMBA. He is working on his PhD at the University of Kuala Lumpur, Malaysia. His research interests include resource and environmental economics and industrial management. He has published one paper and one book.