Su Xing1*
Wang Wei1
(Digital Technology School, Sias University, Zhengzhou, Henan 451150, China)
Copyright © The Institute of Electronics and Information Engineers (IEIE)
Keywords
Classroom behavior, Deep learning, YOLO v5s
1. Introduction
Student behavior determines, to a certain extent, the amount of knowledge students
acquire in the classroom. Therefore, automatic analysis of classroom behavior is of
great importance in evaluating teaching effectiveness [1]. Traditionally, observing students' classroom behavior has relied on teachers' direct observation, which is very time-consuming and may distract from teaching efforts. With the rapid development
of deep learning technology in fields such as image recognition and object detection,
some scholars have begun to study how to use deep learning to automatically identify
student behavior in the classroom. Among the related research on behavior identification,
Wu constructed a classroom behavior recognition model [2] that combines the particle swarm optimization-k-nearest neighbors (PSO-kNN) algorithm with emotional image processing algorithms and proved highly accurate in identifying both emotion and behavior. Xie et al. proposed a deep learning algorithm based on spatio-temporal
representation learning to evaluate college students' classroom posture [3]. The results revealed that the proposed algorithm had a 5% higher accuracy than the
baseline 3D convolutional neural network (CNN), and it was an effective tool for identifying
abnormal behavior in college classrooms. Lin et al. [4] proposed a system that uses a deep neural network to classify actions and identify
student behavior. The experiment results showed that the proposed system had a 15.15%
higher average accuracy rate and a 12.15% higher average recall rate than the skeleton-based
approach. Pang [5] combined a conventional clustering analysis algorithm and the random forest algorithm
with a human skeleton model to recognize students' classroom behavior. Through experiments,
the effectiveness of recognizing behavior based on human skeleton models was verified.
Mao [6] proposed an intelligent image recognition system for students' classroom behavior.
Through a large number of experiments simulating many classroom behaviors, the system was shown to help students accurately and quickly identify incorrect classroom behavior and to issue timely reminders. Ma and Yang [7] constructed a system for analyzing and assessing classroom behavior using deep learning
face recognition technology and found that the system could effectively evaluate students'
classroom behavior.
This article explores the recognition of students' classroom behavior in universities. Collected data were input into a You Only Look Once Version 5 Small (YOLO v5s) model, and the behavior of students in the classroom was identified and analyzed. The results were compared with those of other object detection models to demonstrate the effectiveness and feasibility of the student classroom behavior recognition model proposed in this article.
2. Algorithms for Behavior Recognition
Currently, most universities have smart classrooms equipped with video surveillance
devices to record students' behavior, which can be recognized based on the content
of the surveillance video. Target detection and recognition algorithms based on deep
learning can be either single-step or two-step. Compared with two-step recognition algorithms, single-step algorithms are more stable to train and faster at recognition. YOLO [8] is one of the best algorithms in the field of target detection and is a single-step
recognition algorithm based on target regression. YOLO v5 is one of the newer versions,
has high accuracy, and is fast. It comes in v5s, v5m, v5l, and v5x versions based
on networks with different depths and widths. This paper evaluated these four models
based on actual application scenarios, and we ultimately chose YOLO v5s [9] as the model for our target detection. YOLO v5s consists of six modules: Focus, CBL,
CSP, SPP, upsampling, and Concat, as described below; a code sketch of the first two modules follows the list.
(1) Focus structure: The original image (640×640×3) is imported and sliced into a feature map (320×320×12). Then, after a convolution operation, it becomes a 320×320×32 feature map.
(2) CBL: This module consists of a convolution layer (Conv), batch normalization (BN), and the Leaky ReLU activation function. The feature map is convolved, normalized, and activated sequentially. A convolution with a kernel size of 3 and a stride of 2 downsamples the feature map, whereas a convolution with a kernel size of 1 and a stride of 1 is used for feature mapping.
(3) Cross Stage Partial (CSP): The CSP1_X structure is used in the main backbone network, and the CSP2_X structure is used in the neck.
(4) Spatial Pyramid Pooling (SPP): The feature map is subjected to k×k maximum pooling operations to enlarge the receptive field of the main features.
(5) Upsampling: This module uses the nearest-neighbor method to double the size of
the advanced feature map.
(6) Concat: This module adds advanced features to low-level features to create a new
feature map.
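To make the Focus slicing and the CBL building block concrete, the following is a minimal PyTorch sketch written for illustration; it follows the shapes described in (1) and (2) above, but the class and variable names are ours, not those of the official YOLO v5 code.

```python
import torch
import torch.nn as nn

class CBL(nn.Module):
    """Conv + BN + Leaky ReLU, as in module (2)."""
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.LeakyReLU(0.1, inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class Focus(nn.Module):
    """Slice a 640x640x3 image into 320x320x12, then convolve to 32 channels."""
    def __init__(self, c_in=3, c_out=32):
        super().__init__()
        self.conv = CBL(4 * c_in, c_out, k=3, s=1)

    def forward(self, x):
        # Take every second pixel in four offset patterns and stack on channels.
        patches = [x[..., ::2, ::2], x[..., 1::2, ::2],
                   x[..., ::2, 1::2], x[..., 1::2, 1::2]]
        return self.conv(torch.cat(patches, dim=1))

x = torch.randn(1, 3, 640, 640)  # one 640x640 RGB image
print(Focus()(x).shape)          # torch.Size([1, 32, 320, 320])
```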
3. Case Analysis
3.1 Data Acquisition and Processing
A dataset was constructed for the experiment in this study and was divided into training
and testing sets at a 7:3 ratio. The experimental data were classroom videos obtained
from the Digital Technology School, Sias University, Zhengzhou, China. After obtaining
consent from teachers and students, cameras installed in the classroom were used to
record videos of the classes. Student behavior was judged according to the criteria shown in Table 1. Since there were very few cases of sleeping or standing up to answer questions in
the initial collection, we deliberately asked students to perform these actions in
subsequent collections to supplement the dataset.
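For illustration, a sketch of such a 7:3 split in Python (the file names and random seed are placeholder assumptions, not details reported by the experiment):

```python
import random

# Shuffle the ~4,200 labeled images and split them 7:3 into
# training and testing sets; names here are placeholders.
random.seed(0)
images = [f"frame_{i:05d}.jpg" for i in range(4200)]
random.shuffle(images)
cut = int(0.7 * len(images))
train_set, test_set = images[:cut], images[cut:]
print(len(train_set), len(test_set))  # 2940 1260
```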
After data acquisition and before the experiment began, the collected data were processed.
First, five classroom behaviors (raising the head to listen, standing up to answer
questions, sleeping, playing with a mobile phone, and turning to chat) were collected
from the video data. Each type of behavior was limited to approximately 20 minutes
of video. Video frames were then extracted at equal intervals. Finally, a dataset
of approximately 4,200 images was obtained. The images were then labeled using LabelImg, a Python-based image annotation tool [10]. The tool was installed on a Windows system, and two folders were created: JPEGImages to store the images and Annotations to store the annotation files. All the images were imported at once, and labeling with LabelImg generated a corresponding XML file for each image. Images before labeling are shown in Figs. 1-1 and 1-3, and images after labeling in Figs. 1-2 and 1-4. After all the images were labeled, the XML files and their corresponding images were uniformly named and stored in the Annotations and JPEGImages folders, respectively.
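As an illustration of the equal-interval frame extraction described above, a minimal OpenCV sketch (the file names and the two-second interval are our assumptions, not values reported by the experiment):

```python
import os
import cv2

def extract_frames(video_path, out_dir, interval_s=2.0):
    """Save one frame every interval_s seconds from a classroom video."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    step = max(1, int(cap.get(cv2.CAP_PROP_FPS) * interval_s))
    idx = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:  # keep frames at equal time intervals
            cv2.imwrite(os.path.join(out_dir, f"frame_{saved:05d}.jpg"), frame)
            saved += 1
        idx += 1
    cap.release()
    return saved

print(extract_frames("class_video.mp4", "JPEGImages"))  # frames saved
```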
Fig. 1. Comparison of student classroom behavior video images before and after labeling.
Table 1. Criteria for Determining Classroom Behavior.

| Behavior | Judgment criteria |
|---|---|
| Raising the head to listen to the lecture | Looking up at the teacher, blackboard, or PowerPoint presentation, and taking notes |
| Standing up to answer questions | Standing in front of the chair |
| Sleeping on the table | Bending over the table |
| Playing with a mobile phone | Looking down at a phone and holding it in the hand |
| Turning the head to chat | Turning the head and talking to another student |
3.2 Experimental Steps and Parameter Settings
The experiment proceeded as follows. First, the data were collected and processed, the images were labeled using the LabelImg tool, and the training set of the processed dataset was input to the model for training. Second, the model was refined through repeated training according to the results, after which the test set was input for evaluation. Third, because universities hold both large and small classes, different algorithms were used to identify students' classroom behaviors with different numbers of people in the classroom. Fourth, because students' classroom behaviors differ and each behavior involves a distinct posture, the recognition results for the five classroom behaviors were evaluated using three different algorithms. Finally, the results of the three algorithms were evaluated under different intersection over union (IoU) thresholds. The three algorithms were YOLO v5s, a single shot multibox
detector (SSD), and a region-based convolutional neural network (R-CNN). The purpose
of this paper is to demonstrate that the YOLO v5s object detection model is feasible
for recognizing student behavior in the classroom.
To ensure true and effective recognition results from the YOLO v5s model, the training parameters were kept consistent across the comparison experiments with the proposed method. The stochastic gradient descent (SGD) algorithm was chosen for network optimization, with an initial learning rate of 0.001, a batch size of 8, 100 epochs, a damping index of 0.5, and an anchor box length-width ratio of 1:2. The binary cross-entropy loss function was used for classification, and the CIoU border loss function [11] was used for bounding box regression:

$$L_{CIoU} = 1 - IoU + \frac{D^{2}}{D_{c}^{2}} + \alpha V$$
where $L_{CIoU}$ represents the loss value of border prediction, $IoU$ represents the overlap between the predicted and true boxes, $D$ represents the distance between the centers of the predicted and true boxes, $D_{c}$ represents the diagonal length of the smallest box enclosing both boxes, $V$ is a parameter that measures consistency in the length-width ratio, and $\alpha$ is a positive weighting coefficient.
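For concreteness, below is a from-scratch sketch of this border loss following the standard CIoU formulation (our own illustrative implementation, not the training code used in the experiments); boxes are (x1, y1, x2, y2) tensors of shape (N, 4).

```python
import math
import torch

def ciou_loss(pred, target, eps=1e-7):
    """Sketch of the CIoU border loss for (x1, y1, x2, y2) box tensors."""
    # IoU: overlap between predicted and ground-truth boxes.
    x1 = torch.max(pred[:, 0], target[:, 0])
    y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2])
    y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)
    # D^2: squared distance between the two box centers.
    d2 = ((pred[:, 0] + pred[:, 2] - target[:, 0] - target[:, 2]) ** 2
          + (pred[:, 1] + pred[:, 3] - target[:, 1] - target[:, 3]) ** 2) / 4
    # D_c^2: squared diagonal of the smallest box enclosing both boxes.
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    dc2 = cw ** 2 + ch ** 2 + eps
    # V: consistency of the length-width ratios, weighted by alpha.
    wt, ht = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    wp, hp = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    v = (4 / math.pi ** 2) * (torch.atan(wt / (ht + eps))
                              - torch.atan(wp / (hp + eps))) ** 2
    alpha = v / (1 - iou + v + eps)
    return (1 - iou + d2 / dc2 + alpha * v).mean()
```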
3.3 Model Evaluation Indicators
Precision (P), recall (R), average precision (AP), and mean average precision (mAP)
[12] were used as evaluation indicators to assess the recognition results of the model.
Considering the practical application requirements of the model, real-time detection
and recognition of students' classroom behaviors were required. Therefore, the detection
speed (in frames per second) was also evaluated. The precision and recall rate expressions are:

$$P = \frac{TP}{TP + FP}$$

and

$$R = \frac{TP}{TP + FN}$$
where TP is the number of positive examples recognized as positive, FP is the number of negative examples recognized as positive, and FN is the number of positive examples recognized as negative.
In addition to the above two indicators, there were AP and mAP, where AP measures
the detection performance of a specific class, while mAP measures the detection performance
of the model for all classes. The calculation of AP is explained as follows. When
recognizing students' classroom behaviors, the precision and recall rates of each
behavioral class can be calculated, and a precision/recall curve can be obtained for
each class of behavior. The area under the curve is the AP value, whereas mAP is the
mean of the AP values over all categories. The expressions for AP and mAP are:

$$AP = \int_{0}^{1} P(R)\, dR$$

and

$$mAP = \frac{1}{C}\sum_{i=1}^{C} AP_{i}$$
where P and R stand for precision and recall, respectively, C is the number of categories, and $AP_{i}$ is the AP of the i-th category.
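These indicators are straightforward to compute; the sketch below is illustrative, and the per-class AP values are the YOLO v5s results later reported in Table 3 (scaled to 0-1), whose mean reproduces the reported mAP of 95.8.

```python
import numpy as np

def precision_recall(tp, fp, fn):
    """Precision and recall from detection counts, per the formulas above."""
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(precisions, recalls):
    """AP as the area under a precision/recall curve (trapezoidal rule)."""
    order = np.argsort(recalls)
    return np.trapz(np.asarray(precisions)[order], np.asarray(recalls)[order])

# mAP is the mean of per-class AP; values are YOLO v5s APs from Table 3.
ap_per_class = {"raise_head": 0.974, "sleep": 0.935, "play_phone": 0.936,
                "chat": 0.959, "stand_up": 0.986}
print(sum(ap_per_class.values()) / len(ap_per_class))  # 0.958 -> mAP of 95.8
```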
3.4 Result Analysis
Considering the different class sizes in universities, and in order to better apply
the recognition model in the future, the experiment covered two situations: a moderate
class size and a large class size. For the moderate class size, one class was selected
for video recording, while for the large class size, two to three classes were selected.
Classroom behavior identification was then performed under the different recognition models and
for different class sizes. From the results presented in Table 2, we can see that under the different classroom densities, the precision of the YOLO
v5s model for medium and large classes was 94.37% and 94.29%, respectively. Recall
was 95.71% and 94.29%, respectively, and mAP was 96.02 and 95.48, respectively, suggesting
that detection and recognition effects were similar under different class sizes. At
the same time, there was not much difference in detection speed for medium and large
class sizes (118.34 fps and 117.65 fps, respectively, per Table 2, a difference of less than 1 fps).
Compared with the SSD and R-CNN models, the detection speed of YOLO~v5s was much higher,
indicating it was more suitable for real-time detection of students' classroom behavior.
Although the evaluation index data of the YOLO~v5s recognition model in the large
class situation was slightly lower than for the moderate class, the overall difference
was not significant, indicating the YOLO v5s recognition model can be applied to classroom
behavior recognition under different class sizes.
Each student's behavior in the classroom varies, as does the corresponding body posture. Therefore,
it is crucial for the model to accurately classify and recognize each behavior. This
article evaluated AP for recognition results from three different algorithms. As shown
in Table 3, the AP values of the YOLO v5s model for the five behaviors were 97.4 for raising the head and listening, 93.5 for sleeping on the table, 93.6 for looking down and playing with a phone, 95.9 for turning the head to chat, and 98.6 for standing up to answer questions. These values were all higher than the corresponding AP under the SSD and R-CNN models.
Moreover, by analyzing the overall AP of the three different models, it was found
that the AP for standing up to answer questions, raising the head and listening, and
turning the head to chat were higher than the AP for sleeping and looking down and
playing with a phone. This implies that some students have similar upper body postures
for sleeping on the table and looking down and playing with a phone, leading to detection
errors in these two categories.
mAP is the average accuracy of the model under different IoU thresholds. A higher
mAP means a more accurate model. Therefore, IoU is a crucial function for calculating
mAP. This paper evaluated the recognition results of three different algorithms at
different thresholds [15]. In Table 4, mAP@0.5 and mAP@0.75 denote the mAP computed at IoU thresholds of 0.5 and 0.75, respectively, while mAP@0.5:0.95 denotes the mAP averaged over IoU thresholds from 0.5 to 0.95 in increments of 0.05. A higher mAP indicates better detections by the model.
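As a reference for how these metrics are built, below is a minimal IoU computation and the ten thresholds behind mAP@0.5:0.95 (an illustrative sketch, not the evaluation code used in the experiments):

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# mAP@0.5:0.95 averages the mAP computed at each of these ten thresholds.
thresholds = np.arange(0.5, 1.0, 0.05)
print(np.round(thresholds, 2))  # [0.5 0.55 0.6 ... 0.9 0.95]
```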
Based on the data in Table 4, we can see that YOLO v5s had a higher mAP than SSD and R-CNN models under the different
IoU thresholds, reaching 95.8, 94.3, and 92.9, respectively. This indicates the YOLO
v5s performance is excellent in the field of object detection, and it can be used
for real-time recognition of classroom behavior in college students, achieving the
expected experimental results.
Table 2. Recognition Results of Three Algorithms for Different Class Sizes.

| Category | Model | Precision | Recall rate | mAP | Detection speed (fps) |
|---|---|---|---|---|---|
| Medium class size | YOLO v5s | 94.37% | 95.71% | 96.02 | 118.34 |
| | SSD [13] | 87.88% | 82.86% | 88.66 | 92.14 |
| | R-CNN [14] | 80.65% | 71.43% | 80.27 | 91.93 |
| Large class size | YOLO v5s | 94.29% | 94.29% | 95.48 | 117.65 |
| | SSD | 84.36% | 77.14% | 84.91 | 89.37 |
| | R-CNN | 76.19% | 68.57% | 79.35 | 87.64 |
Table 3. Evaluation of Identification Results under Different Classroom Behaviors.

| Model | AP: Raise the head and listen | AP: Sleep on the table | AP: Look down and play with a phone | AP: Turn the head to chat | AP: Stand up to answer a question | mAP |
|---|---|---|---|---|---|---|
| YOLO v5s | 97.4 | 93.5 | 93.6 | 95.9 | 98.6 | 95.8 |
| SSD | 93.3 | 81.1 | 82.9 | 91.2 | 93.5 | 88.4 |
| R-CNN | 82.2 | 74.8 | 75.1 | 80.5 | 82.9 | 79.1 |
Table 4. Evaluation of Recognition Results under Different IoU Thresholds.

| Model | mAP@0.5 | mAP@0.75 | mAP@0.5:0.95 |
|---|---|---|---|
| YOLO v5s | 95.8 | 94.3 | 92.9 |
| SSD | 88.4 | 85.9 | 80.2 |
| R-CNN | 79.1 | 78.5 | 76.3 |
4. Discussion
Real-time recognition of students' classroom behavior using deep learning techniques
can evaluate classroom situations and help improve the quality of teaching. In this
study, the YOLO v5s recognition model was used to detect and recognize students' behavior
in the classroom. Teachers can use this information to evaluate students' behavior in their
regular classes. The experiment results of this study showed that under different
classroom densities and IoU thresholds, the YOLO v5s model was superior to SSD and
R-CNN models in terms of precision, recall, AP, mAP, and detection speed. The results
revealed that YOLO v5s can be applied to real-time classroom behavior recognition
under different classroom densities. After the model identifies classroom behaviors, the different types of behavior still need to be managed; bad behavior often stems from difficulty in effectively engaging students with the content and from failing to establish a genuine relationship with them [16]. Some studies have suggested that providing students with social rewards, such as
praise, encouragement, and care, to promote good classroom behavior is the most accepted
management approach [17]. This paper argues that the key to implementing student behavior management in the
classroom is teacher behavior [18], and that both classroom management and student interactions should be involved,
such as strengthening the establishment of attendance systems and frequently asking
students to answer questions. In the future, research directions will focus on classroom
management, such as building a classroom management system based on the YOLO v5s recognition
model. According to the teaching needs of colleges and universities, the system would be divided into three ports: student, teacher, and administrator [19], with the permissions outlined below; a minimal code sketch of this permission model follows the list.
(1) Students can view their attendance and video recognition results by entering their
student ID and password [20].
(2) Teachers can identify and view behavior detection results and attendance records
through the classroom video. They can send the results and attendance records to students,
and have permission to modify the data. For example, if a student attends class for
only a short time, but the attendance record was miscalculated, the teacher can change
it.
(3) The operation rights of the administrator include and exceed those of students
and teachers. The administrator can organize courses and process class videos, such
as saving them in different locations according to different courses and semesters,
and can delete videos from the previous semester.
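A minimal sketch of this three-port permission model (all names are illustrative assumptions, not the authors' implementation):

```python
# Role-based permissions for the three ports described above; the
# permission names are hypothetical placeholders.
ROLE_PERMISSIONS = {
    "student": {"view_own_attendance", "view_own_recognition_results"},
    "teacher": {"view_class_results", "view_attendance",
                "send_results_to_students", "modify_attendance"},
}
# The administrator holds every permission plus course and video management.
ROLE_PERMISSIONS["administrator"] = (
    set().union(*ROLE_PERMISSIONS.values())
    | {"organize_courses", "archive_videos_by_course", "delete_old_videos"}
)

def can(role: str, action: str) -> bool:
    """Check whether a role is allowed to perform an action."""
    return action in ROLE_PERMISSIONS.get(role, set())

print(can("teacher", "modify_attendance"))  # True
print(can("student", "delete_old_videos"))  # False
```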
5. Conclusion
This article provides a brief introduction to student classroom behavior and the YOLO
object detection algorithm. Prior to the experiment, video data were converted to
images and labeled using the Python-based LabelImg tool. Then, the YOLO v5s model
was used to build an object detection and recognition model to identify and analyze
student classroom behavior. The performance of the model was evaluated based on precision,
recall, AP, mAP, and detection speed. The experiment results showed that under medium
and large classroom densities, respectively, the YOLO v5s model achieved precision
of 94.37% and 94.29%, recall rates of 95.71% and 94.29%, and mAP of 96.02 and 95.48, with respective detection speeds of 118.34 fps and 117.65 fps. The recognition results
were consistent at both classroom densities. The mAP values from YOLO v5s at different
IoU thresholds were higher than those of the SSD and R-CNN models, reaching 95.8,
94.3, and 92.9. This paper demonstrates that YOLO v5s is an excellent model in the
field of object detection, and it can be effectively applied to real-time recognition
of college students' behavior in the classroom.
REFERENCES
[1] B. Yang, Z. Yao, H. Lu, Y. Zhou, and J. Xu, ``In-classroom learning analytics based on student behavior, topic and teaching characteristic mining,'' Pattern Recognition Letters, Vol. 129, pp. 224-231, Jan. 2020.
[2] S. Wu, ``Simulation of classroom student behavior recognition based on PSO-kNN algorithm and emotional image processing,'' Journal of Intelligent and Fuzzy Systems, Vol. 40, No. 4, pp. 1-11, Dec. 2020.
[3] Y. Xie, S. Zhang, and Y. Liu, ``Abnormal Behavior Recognition in Classroom Pose Estimation of College Students Based on Spatiotemporal Representation Learning,'' Traitement du Signal, Vol. 38, No. 1, pp. 89-95, Feb. 2021.
[4] F. Lin, H. Ngo, C. Dow, K. H. Lam, and H. L. Le, ``Student Behavior Recognition System for the Classroom Environment Based on Skeleton Pose Estimation and Person Detection,'' Sensors, Vol. 21, No. 16, pp. 1-20, Aug. 2021.
[5] C. Pang, ``Simulation of student classroom behavior recognition based on cluster analysis and random forest algorithm,'' Journal of Intelligent and Fuzzy Systems, Vol. 40, No. 2, pp. 2421-2431, Feb. 2021.
[6] L. Mao, ``Remote classroom action recognition based on improved neural network and face recognition,'' Journal of Intelligent and Fuzzy Systems, Vol. 2021, No. 1, pp. 1-11, Mar. 2021.
[7] C. Ma and P. Yang, ``Research on Classroom Teaching Behavior Analysis and Evaluation System Based on Deep Learning Face Recognition Technology,'' Journal of Physics: Conference Series, Vol. 1992, No. 3, pp. 1-7, Aug. 2021.
[8] X. Feng, Y. Piao, and S. Sun, ``Vehicle tracking algorithm based on deep learning,'' Journal of Physics: Conference Series, Vol. 1920, No. 1, pp. 1-7, May 2021.
[9] Z. Ying, Z. Lin, Z. Wu, K. Liang, and X. Hu, ``A modified-YOLOv5s model for detection of wire braided hose defects,'' Measurement, Vol. 190, pp. 110683.1-110683.11, Jan. 2022.
[10] S. Tabassum, S. Ullah, N. H. Al-Nur, and S. Shatabda, ``Poribohon-BD: Bangladeshi local vehicle image dataset with annotation for classification,'' Data in Brief, Vol. 33, No. 1, pp. 1-6, Dec. 2020.
[11] S. Wu and X. Li, ``IoU-Balanced loss functions for single-stage object detection,'' Pattern Recognition Letters, Vol. 156, pp. 96-103, Apr. 2022.
[12] S. Li, Y. Li, Y. Li, M. Li, and X. Xu, ``YOLO-FIRI: Improved YOLOv5 for Infrared Image Object Detection,'' IEEE Access, Vol. 9, pp. 141861-141875, Oct. 2021.
[13] R. Ranjan, A. Bansal, J. Zheng, H. Xu, J. Gleason, B. Lu, A. Nanduri, J. C. Chen, C. D. Castillo, and R. Chellappa, ``A Fast and Accurate System for Face Detection, Identification, and Verification,'' IEEE Transactions on Biometrics, Behavior, and Identity Science, Vol. 1, No. 2, pp. 82-96, Apr. 2019.
[14] U. H. Gawande, K. O. Hajari, and Y. G. Golhar, ``Scale Invariant Mask R-CNN for Pedestrian Detection,'' Electronic Letters on Computer Vision and Image Analysis, Vol. 19, No. 3, pp. 98-117, Nov. 2020.
[15] D. Sun, Y. Yang, M. Li, J. Yang, B. Meng, R. Bai, L. Li, and J. Ren, ``A Scale Balanced Loss for Bounding Box Regression,'' IEEE Access, Vol. 8, pp. 108438-108448, June 2020.
[16] W. C. Hunter, A. D. Jasper, K. Barnes, L. L. Davis, K. Davis, J. D. Singleton, S. Barton-Arwood, and T. M. Scott, ``Promoting positive teacher-student relationships through creating a plan for Classroom Management On-boarding,'' Multicultural Learning and Teaching, Vol. 18, No. 1, Feb. 2021.
[17] J. D. McLennan, H. Sampasa-Kanyinga, K. Georgiades, and E. Duku, ``Variation in Teachers' Reported Use of Classroom Management and Behavioral Health Strategies by Grade Level,'' School Mental Health, Vol. 12, No. 1, pp. 67-76, Mar. 2020.
[18] A. Al-Bahrani, ``Classroom management and student interaction interventions: Fostering diversity, inclusion, and belonging in the undergraduate economics classroom,'' The Journal of Economic Education, Vol. 53, No. 3, pp. 259-272, May 2022.
[19] J. Zhang, ``Computer Assisted Instruction System Under Artificial Intelligence Technology,'' Pediatric Obesity, Vol. 16, No. 5, pp. 1-13, Mar. 2021.
[20] N. P. Putra, S. Loppies, and R. Zubaedah, ``Prototype of College Student Attendance Using Radio Frequency Identification (RFID) at Musamus University,'' IOP Conference Series: Materials Science and Engineering, Vol. 1125, No. 1, pp. 1-8, May 2021.
Authors
Ms. Xing Su is currently a lecturer at Sias University in Zhengzhou, China. She graduated from Fort Hays State University in the United States with a master's degree. Her research interests include economic management and educational management.
Wei Wang is a lecturer at Sias University in Zhengzhou, China. He graduated from
Fort Hays State University in the United States with a master’s degree, and from Peking
University HSBC Business School with an EMBA. He is working on his PhD at the University
of Kuala Lumpur, Malaysia. His research interests include resource and environmental
economics and industrial management. He has published one paper and one book.