Guiheng Zhi
Copyright © The Institute of Electronics and Information Engineers(IEIE)
Keywords
Deep learning, Convolutional neural network, Movement recognition
1. Introduction
Recognizing movements in videos is a long-standing problem in the
field of computer vision [1]. A computer’s ability to recognize human movements is of considerable
practical value in people’s lives. For example, movements can be used to command
a computer to perform certain tasks, which has drawn attention to the recognition of human
body movements. Deep learning, which is a branch of machine learning,
is attracting growing attention as science and technology develop.
Human action recognition based on deep learning is progressing accordingly,
with methods such as the graph convolutional network (GCN) [2], the deep recurrent neural network (RNN) [3], and the convolutional neural network (CNN) [4]. Most existing approaches to recognizing human actions use deep learning techniques.
Previously, traditional computer vision methods that manually extract features were
commonly applied to recognizing human movements; however, changes in the environments
where movements occur can affect recognition results. Nowadays, human movement recognition
is mainly used in sports training [5], human-computer interaction [6], intelligent monitoring [7], and elderly care. Yu et al. tested whether a CNN can recognize human behaviors
in videos and found that, by recognizing those behaviors, the CNN could determine whether people are in a critical
state and give a timely warning, which makes it valuable
for pre-hospital first aid [8]. Xu and Qiu automatically extracted activity features related to human life using
a CNN algorithm and found it could recognize six human activities, such as sitting,
standing, walking, jogging, and going up stairs [9]. Liu et al. found that the accuracy of a 3D CNN in recognizing movements from the
EgoGesture dataset was 72.4%, surpassing current dynamic gesture recognition methods
and confirming its effectiveness [10]. Wang and Zhang proposed a CNN with multidimensional serial feature extraction modules
for obscured faces, and combined it with a deep learning method to improve the recognition
rate [11]. A review of research by domestic and international scholars shows that most techniques
use either pictures or videos for human action recognition. This paper uses videos
because they capture the series of movements that people make when giving commands
to computers in their daily lives. Taking ballet movements as an example, this paper
uses a deep learning method, the CNN, to recognize and analyze those
movements, and optimizes training of the CNN with a particle swarm optimization (PSO)
algorithm. The optimized CNN, the traditional CNN, and a support vector machine (SVM) method are then compared
on 1,000 ballet movement videos, which provides an effective reference for ballet movement recognition
and support for dance training.
2. Dance Movement Recognition Methods
2.1 Convolutional Neural Networks
The CNN [12] is one of the most typical and frequently used approaches in deep learning. Its robustness
to image transformations such as rotation, translation, and size scaling makes it highly suitable for processing
image data. Here is a brief introduction to the most important parts of the CNN approach.
The convolutional layer is responsible for extracting features from the input and
is the most important part of the CNN approach. The convolution is executed multiple times,
each time with a different convolutional kernel, so that different features are extracted. Its
calculation is as follows:

$$y_{mn}=f\left(\sum_{i}\sum_{j}w_{ij}\,x_{m+i,n+j}+b\right)$$

where $y_{mn}$ is the output value at point (m, n), $x_{m+i,n+j}$ represents the image pixel value at point (m+i, n+j), $w_{ij}$
represents the weight of the convolutional kernel at point (i, j), b represents
the bias of the layer, and f is the network activation function.
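The convolution described above can be illustrated with a minimal sketch in plain Python. It assumes a single-channel image, a single kernel, stride 1, and no padding; the function names `relu` and `conv2d_single` and the toy values are illustrative choices, not taken from the paper.

```python
def relu(z):
    """Activation f, here assumed to be ReLU: max(0, z)."""
    return max(0.0, z)


def conv2d_single(x, w, b, f=relu):
    """Apply one kernel w to image x with stride 1 and no padding.

    Each output is y[m][n] = f(sum_i sum_j w[i][j] * x[m+i][n+j] + b).
    """
    kh, kw = len(w), len(w[0])
    out_h = len(x) - kh + 1
    out_w = len(x[0]) - kw + 1
    y = []
    for m in range(out_h):
        row = []
        for n in range(out_w):
            # Weighted sum of the pixels covered by the kernel at (m, n).
            s = sum(w[i][j] * x[m + i][n + j]
                    for i in range(kh) for j in range(kw))
            row.append(f(s + b))
        y.append(row)
    return y


# Toy 3x3 "image" and 2x2 kernel (hypothetical values).
image = [
    [1, 2, 0],
    [0, 1, 3],
    [4, 0, 1],
]
kernel = [
    [1, 0],
    [0, -1],
]
feature_map = conv2d_single(image, kernel, b=0.5)
print(feature_map)
```

Running several kernels over the same image, one call per kernel, would give one feature map per kernel.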
The pooling layer is mainly responsible for subsampling the dance movement feature
map obtained in the convolutional layer. It can compress a large amount of image data
while maintaining the scale feature. The specific calculation is:
where $x_{m\cdot S_{1}+i,\,n\cdot S_{2}+j}$ represents the pixel value of point $\left(m\cdot S_{1}+i,\; n\cdot S_{2}+j\right)$,
and $y_{mn}$ represents the output value after the pooling operation.
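As an illustration of the subsampling step, the sketch below pools a feature map over non-overlapping windows of size $S_{1}\times S_{2}$, reading the pixels at $(m\cdot S_{1}+i,\; n\cdot S_{2}+j)$. The text does not state which pooling operator was used, so max pooling is an assumption here, as are the function name `max_pool2d` and the toy values.

```python
def max_pool2d(x, s1, s2):
    """Max pooling over non-overlapping s1 x s2 windows (assumed operator).

    y[m][n] = max over i < s1, j < s2 of x[m*s1 + i][n*s2 + j].
    """
    out_h = len(x) // s1
    out_w = len(x[0]) // s2
    return [
        [
            max(x[m * s1 + i][n * s2 + j]
                for i in range(s1) for j in range(s2))
            for n in range(out_w)
        ]
        for m in range(out_h)
    ]


# Toy 4x4 feature map (hypothetical values), pooled with 2x2 windows.
fmap = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [3, 2, 1, 1],
]
pooled = max_pool2d(fmap, 2, 2)
print(pooled)
```

The 4 × 4 map shrinks to 2 × 2 while the strongest response in each window is kept.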
The fully connected layer combines the extracted feature maps and classifies input
images using those features, based on the training data. Every neuron in this layer is
connected to every neuron of the previous layer. The input of this layer is the vector output
from the previous layer, and the output of this layer is the final output of the CNN.
The activation function is a crucial component of a CNN. Without an activation function,
the network can only represent a linear mapping from input to output. To model a nonlinear structure,
the activation function is added as a nonlinear unit. Commonly used activation
functions include sigmoid, tanh, and ReLU [13]. The ReLU function compares each input value with 0 and outputs the larger of the two.
Therefore, when the input value is less than or equal to 0, the output value is
0. The function only needs to judge whether its input value x is
positive, so the computation is small and the network converges
faster than with the other two functions. This
paper uses ReLU because it both mitigates overfitting and improves network
convergence speed. Its formula is

$$f(x)=\max(0,x)$$
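The three activation functions named above can be compared numerically with a short sketch. The definitions are the standard ones; the sample inputs are arbitrary.

```python
import math


def sigmoid(x):
    """Logistic sigmoid: 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    """Hyperbolic tangent."""
    return math.tanh(x)


def relu(x):
    """ReLU: the larger of 0 and x; needs only a sign check."""
    return max(0.0, x)


for x in (-2.0, 0.0, 2.0):
    print(f"x={x:+.1f}  sigmoid={sigmoid(x):.4f}  "
          f"tanh={tanh(x):+.4f}  relu={relu(x):.1f}")
```

Unlike sigmoid and tanh, ReLU involves no exponential, which is consistent with the small computation cost noted in the text.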
If a trained model produces a much larger error on unseen test data than on the training
set, its prediction performance is poor; this phenomenon is known as overfitting.
To prevent overfitting, this paper adopts the
dropout method [14] to optimize the model. The dropout principle is as follows. During each step of network
training, some randomly selected neurons are temporarily removed while the remaining neurons continue
to participate in training without any changes. The removed neurons are
restored with their previous values in subsequent training steps.
This operation reduces the correlation between neurons and prevents the network
from overfitting to local features. In this paper, the dropout parameter
was set to 0.5.
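A minimal sketch of dropout with a rate of 0.5 is given below. It uses the common "inverted dropout" variant, in which the surviving activations are rescaled during training so that nothing changes at test time; the text does not say which variant was used, so this is an assumption, and the function name `dropout` and the sample activations are illustrative.

```python
import random


def dropout(activations, rate=0.5, training=True, rng=random):
    """Inverted dropout (assumed variant).

    During training each activation is zeroed with probability `rate`;
    survivors are divided by (1 - rate) to keep the expected value.
    At test time the activations pass through unchanged.
    """
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
hidden = [0.2, 1.5, 0.7, 0.9, 1.1, 0.4]
train_out = dropout(hidden, rate=0.5, training=True, rng=rng)
test_out = dropout(hidden, rate=0.5, training=False)
print(train_out)
print(test_out)
```

A fresh random mask is drawn on every call, so a neuron dropped in one training step can take part again in the next.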
2.2 Improving Movement Recognition
The above CNN method is able to recognize ballet movements, but the traditional CNN
is prone to overfitting during the training process. To improve
the recognition performance of the CNN, the PSO algorithm replaces the backward adjustment
of weight parameters in the traditional CNN method. The forward calculation of the
improved CNN method in the training process is consistent with the traditional CNN,
and when the error obtained from the forward calculation does not converge within
the set range, the iterative formula of the PSO algorithm is used to iterate the position
and velocity of the particle swarm. The coordinates of each particle in the swarm
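represent a parameter scheme, and the iterative formula of the PSO algorithm is

$$\begin{cases}v_{i}(t+1)=\varpi v_{i}(t)+c_{1}r_{1}\left(P_{i}(t)-x_{i}(t)\right)+c_{2}r_{2}\left(G_{g}(t)-x_{i}(t)\right)\\ x_{i}(t+1)=x_{i}(t)+v_{i}(t+1)\end{cases}$$

where $v_{i}(t+1)$ and $x_{i}(t+1)$ are the speed and position of particle $i$ after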
one iteration, $v_{i}(t)$ and $x_{i}(t)$ are the speed and position of particle $i$
before the iteration, $\varpi $ is the inertia weight of the particle, $c_{1}$ and
$c_{2}$ are learning factors, $r_{1}$ and $r_{2}$ are random numbers between 0 and
1, $P_{i}(t)$ denotes the optimal position experienced by particle $i$ (excluding
particles exceeding the limit), and $G_{g}(t)$ is the best position experienced by
the particle population after excluding particles that exceed the limit. After the
iteration, the parameter scheme represented by the particle is substituted into the
CNN method, which performs forward calculation again, repeating the above process
until the error converges to within a preset range.
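The particle update can be sketched in plain Python as follows. This is a toy illustration rather than the training code of the paper: the "error" is a simple quadratic in two parameters standing in for the CNN's forward-pass error, positions are clipped to a box as one possible way of handling particles that exceed the limit, and the names `pso_minimize`, `toy_error`, and `limit` are hypothetical. The swarm size (20), learning factors (1.5), and inertia weight (0.8) follow the settings reported in Section 3.2.

```python
import random


def pso_minimize(error_fn, dim, n_particles=20, w=0.8, c1=1.5, c2=1.5,
                 limit=5.0, iterations=50, seed=0):
    """Standard PSO: v <- w*v + c1*r1*(P - x) + c2*r2*(G - x); x <- x + v."""
    rng = random.Random(seed)
    x = [[rng.uniform(-limit, limit) for _ in range(dim)]
         for _ in range(n_particles)]
    v = [[0.0] * dim for _ in range(n_particles)]
    p_best = [xi[:] for xi in x]             # best position of each particle
    p_err = [error_fn(xi) for xi in x]
    g = min(range(n_particles), key=p_err.__getitem__)
    g_best, g_err = p_best[g][:], p_err[g]   # best position of the swarm
    history = [g_err]
    for _ in range(iterations):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                v[i][d] = (w * v[i][d]
                           + c1 * r1 * (p_best[i][d] - x[i][d])
                           + c2 * r2 * (g_best[d] - x[i][d]))
                # Assumption: keep positions inside [-limit, limit].
                x[i][d] = max(-limit, min(limit, x[i][d] + v[i][d]))
            err = error_fn(x[i])
            if err < p_err[i]:
                p_best[i], p_err[i] = x[i][:], err
                if err < g_err:
                    g_best, g_err = x[i][:], err
        history.append(g_err)
    return g_best, g_err, history


def toy_error(params):
    """Stand-in for the network error; minimum at (1, -2)."""
    return (params[0] - 1.0) ** 2 + (params[1] + 2.0) ** 2


best, err, history = pso_minimize(toy_error, dim=2)
print(best, err)
```

In the setting described above, each particle would hold a full set of CNN weights and `error_fn` would run the forward calculation on the training data.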
The improved CNN approach autonomously extracts features from input images of real
dance movements, eliminating the need for manual feature input. The specific process
of dance movement recognition using the CNN approach is as follows. (1) A sufficient
number of dance action videos are collected to form an initial dataset. (2) The data are
pre-processed: after segmentation and denoising, image frames are extracted from the
collected dance action videos and adjusted in size, color, etc.
to meet the input requirements of the network model. (3) The network
model is trained: the processed image frames are divided into two parts, one of which
is used as the training set, and the parameters are adjusted repeatedly on
that set until optimal performance is achieved. (4) After the optimal network
model is obtained, the remaining image frames are used as the test set, and the dance
movements are recognized by the network model.
3. Experiment Analysis
3.1 Image Pre-processing
The pre-processing of images in dance action videos is an important step in the recognition
system because it determines the effect of CNN training and testing. To ensure the
validity of the experiment results, this paper randomly selected 1,000 ballet training
video samples as a dataset. The dance movements in the videos were divided into five
main categories: drawing circles with legs (Fig. 1(1)), small kicks (Fig. 1(2)), two-position mid-jumps (Fig. 1(3)), single-leg squats (Fig. 1(4)), and large squats (Fig. 1(5)). The number of videos of each type is shown in Table 1. Ballet was chosen because different types of dances have
their own characteristics, and training on and recognizing all of them would
result in an overwhelming workload.
Ballet is a graceful art form originating from the Italian Renaissance which flourished
in France throughout its development and perfection. It combines elements of dance,
music, and drama while highlighting dancers' body posture techniques as well as expressions.
Being a classical dance style, it not only showcases elegance with artistic beauty
but carries on a historical and cultural heritage. At the same time, the captured
dance video images underwent pre-processing that mainly involved noise removal, contrast
adjustment, grayscale transformation, video length reduction, and uniform resizing
to 224 ${\times}$ 224. Additionally, image processing operations such as Gaussian
blur were applied to prevent image blur after cropping. These image pre-processing
operations ensured that the dance videos in the dataset had similar resolutions, durations,
contrast, and sizes for faster training of the CNN model.
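Two of the pre-processing operations listed above, grayscale transformation and uniform resizing to 224 × 224, can be sketched in plain Python. The luminance weights are the standard ITU-R BT.601 coefficients and the resizing uses nearest-neighbour sampling; the text does not specify either choice, so both are assumptions, as are the function names and the tiny 2 × 2 stand-in frame.

```python
def to_grayscale(rgb):
    """Convert rows of (r, g, b) pixels to luminance (BT.601 weights)."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in rgb]


def resize_nearest(img, out_h, out_w):
    """Nearest-neighbour resize of a 2-D image to out_h x out_w."""
    in_h, in_w = len(img), len(img[0])
    return [
        [img[m * in_h // out_h][n * in_w // out_w] for n in range(out_w)]
        for m in range(out_h)
    ]


# Hypothetical 2x2 RGB frame standing in for a decoded video frame.
frame = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (255, 255, 255)],
]
gray = to_grayscale(frame)
resized = resize_nearest(gray, 224, 224)
print(len(resized), len(resized[0]))
```

In practice an image library would normally handle both steps; the sketch only makes the arithmetic explicit.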
Fig. 1. Five types of dance movement.
Table 1. The Dance Video Input.
Movement | Number of videos | Proportion
Drawing circles with legs | 189 | 18.9%
Small kick | 213 | 21.3%
Two-position mid-jump | 147 | 14.7%
Single-leg squat | 218 | 21.8%
Large squat | 233 | 23.3%
3.2 Experiment Design
In order to verify the performance of the optimized CNN method in recognizing ballet
movements, traditional CNN and SVM methods were also tested. The optimized CNN was
obtained by adding PSO to the traditional CNN, so both methods were compared to test
the improvement that PSO brings to the recognition performance of the CNN. The SVM method
is a traditional machine learning classification algorithm. Because recognizing ballet
movements amounts to classifying dance movement types,
the SVM was used as a reference to verify the performance of the optimized CNN method against
other recognition algorithms [15]. The 1,000 ballet videos in the initial dataset were divided into a training set
and a test set at an 8:2 ratio. The parameters of the traditional CNN are shown in
Table 2. The parameters of the optimized CNN were the same as the traditional CNN. The parameters
of the PSO are as follows: particle swarm size: 20, learning factor: 1.5, and inertia
weight: 0.8. The parameters of the SVM method are as follows: the kernel was a sigmoid
function, and the penalty factor was set to 1.
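The 8:2 division of the 1,000 videos can be sketched as a shuffled split. The text does not say how the videos were assigned to each set, so the random shuffle, the fixed seed, and the name `split_dataset` are assumptions made for illustration.

```python
import random


def split_dataset(items, train_ratio=0.8, seed=0):
    """Shuffle the items and cut them into training and test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_ratio))
    return shuffled[:cut], shuffled[cut:]


video_ids = list(range(1000))  # stand-in identifiers for the 1,000 videos
train_ids, test_ids = split_dataset(video_ids)
print(len(train_ids), len(test_ids))
```

With 1,000 videos this yields 800 for training and 200 for testing.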
Table 2. Initial CNN Parameter Settings.
Parameter | Value
Batch size | 5
Learning rate | 0.001
Optimizer | Adam
Activation function | ReLU
Number of iterations | 30
3.3 Analysis of Results
The SVM method directly computed the support vector hyperplane based on the data in
the training set during the training process, which was different from the step-by-step
iterative process of the traditional and improved CNN methods. Fig. 2 shows the error convergence curves of the traditional and optimized CNN methods during
the training process. The recognition error of both algorithms decreased
and stabilized as the number of iterations increased. The optimized
CNN converged faster, stabilizing after about five iterations, whereas the
traditional method stabilized after about 20 iterations. After
stabilizing, the error of the optimized CNN method was smaller than that of the traditional
CNN.
Fig. 2. The convergence curves of the traditional and optimized CNN methods.
The impact of the number of consecutive frames on network recognition accuracy was
determined first. The 1,000 dance videos were converted to images, and six sets of
consecutive frames at five-frame intervals (from 5 frames up to 30 frames) were extracted for
experimentation. Fig. 3 shows that as the number of consecutive frames increased, recognition accuracy also increased
and gradually stabilized, with very little gain in accuracy
once the number of frames exceeded 25. More consecutive frames require
more computation and more time. Therefore, to limit the computation time, the
final number of frames was set at 25.
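The six frame-count settings can be enumerated with a short sketch. How the consecutive frames were positioned within each video is not stated, so taking them from the start of the clip is an assumption, and `consecutive_frames` and the 120-frame stand-in clip are hypothetical.

```python
def consecutive_frames(frames, n, start=0):
    """Return n consecutive frames beginning at index `start`."""
    if n > len(frames) - start:
        raise ValueError("clip is too short for the requested frame count")
    return frames[start:start + n]


# Frame counts tested: 5, 10, ..., 30.
candidate_counts = list(range(5, 31, 5))

clip = list(range(120))  # stand-in for the decoded frames of one video
samples = {n: consecutive_frames(clip, n) for n in candidate_counts}
print({n: len(s) for n, s in samples.items()})
```

Each setting would then be used to train and evaluate the network, and the accuracy compared across the six settings.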
Fig. 3. The influence of the number of consecutive frames on recognition accuracy.
The recognition accuracy and speed of the three algorithms are listed in Table 3. The comparison shows that the SVM method had the lowest accuracy and
the lowest efficiency. The traditional CNN method had greater accuracy and efficiency than the SVM,
and the optimized CNN method had the highest accuracy and efficiency.
Table 3. Results of ballet movement recognition by the different methods.
Movement | Accuracy (%), SVM | Accuracy (%), Traditional CNN | Accuracy (%), Optimized CNN | Speed (s), SVM | Speed (s), Traditional CNN | Speed (s), Optimized CNN
Drawing circles with legs | 82.03 | 90.31 | 95.79 | 3.47 | 2.03 | 1.13
Small kick | 85.69 | 91.22 | 96.62 | 3.56 | 2.15 | 1.69
Two-position mid-jump | 85.27 | 89.45 | 94.27 | 2.98 | 1.87 | 1.02
Single-leg squat | 86.16 | 92.32 | 97.09 | 3.19 | 2.36 | 1.55
Large squat | 81.72 | 89.24 | 94.51 | 3.41 | 2.47 | 1.36
Average | 84.17 | 90.16 | 95.66 | 3.32 | 2.68 | 1.35
4. Discussion
As an art form, dance not only cultivates the emotions, but exercises the body. Traditionally,
during dance practice, a coach needs to help correct the dance movements, but coaches
bring their own habits to the teaching process and have limited energy, which may
result in incorrect movements by the dancers and limited teaching efficiency. As intelligent
algorithms develop, they are increasingly applied in the field of image recognition.
In this paper, intelligent algorithms were applied to recognize dance movements.
To keep the research tractable, this paper focused on ballet, used the CNN to
recognize movements, and introduced the PSO algorithm to adjust the weight parameters
and improve the recognition performance of the CNN method. In the experiments,
the optimized CNN method was compared with traditional CNN and SVM methods; the
results are shown above. In the comparison of results, the optimized CNN method
had the highest efficiency and recognition accuracy for ballet movements, the traditional
CNN method was second, and the SVM method was last. The SVM method obtained the image
features first when recognizing images of the dance movements. Since the features
were extracted manually, the information contained in them was not comprehensive enough.
Although the SVM method used kernel functions to project the image features into a
high-dimensional space, it was still difficult to effectively fit the hyperplane of
the SVM method to the nonlinear features of the images. Compared with the SVM, the
traditional CNN automatically extracted image features using convolutional kernels,
and combined global convolutional features using more than one convolutional kernel,
which made full use of image feature information. The activation function effectively
fit the nonlinear relationships in the data, so the CNN provided better recognition accuracy and efficiency.
The optimized CNN with the PSO algorithm to adjust the parameters retained the advantages
of the traditional CNN method. Moreover, PSO was used to iterate the particle swarm
to avoid overfitting during CNN training, so the recognition accuracy and efficiency
were the highest.
5. Conclusion
This paper introduced human movement recognition via CNN approaches. Based on deep
learning, the CNN was used to identify ballet movements. Training for the CNN was
optimized using the PSO algorithm. Then, 1,000 ballet movement videos were used as
the dataset for comparing the optimized CNN, traditional CNN, and SVM methods. Compared
to the traditional CNN, the optimized method converged faster during training with
less error after convergence to stability. The SVM method showed the least efficiency
and lowest recognition accuracy of the ballet movements, whereas the traditional CNN
was higher, and the optimized CNN method was the highest.
This paper used the convolutional kernel in the CNN method to automatically extract
image features to recognize ballet movements. To improve recognition performance from
the CNN algorithm, the PSO algorithm was introduced to adjust the weight parameters
during the training process, which provided an effective reference for intelligent
recognition of ballet movements and assisted exercise of dance movements. The limitation
of this paper is that the optimized CNN method was only used for ballet movement recognition,
so a future research direction is to generalize the CNN algorithm to recognize other
kinds of dance movements.
REFERENCES
[1] A. Ullah, J. Ahmad, K. Muhammad, M. Sajjad, and S. W. Baik, "Action Recognition in Video Sequences using Deep Bi-Directional LSTM With CNN Features," IEEE Access, Vol. 6, No. 99, pp. 1155-1166, 2018.
[2] B. K. Gao, L. Dong, H. B. Bi, and Y. Z. Bi, "Focus on temporal graph convolutional networks with unified attention for skeleton-based action recognition," Applied Intelligence, Vol. 52, pp. 5608-5616, 2022.
[3] H. Wang and L. Wang, "Learning content and style: Joint action recognition and person identification from human skeletons," Pattern Recognition, Vol. 81, pp. 23-35, Sep. 2018.
[4] Z. Yang, Y. Li, J. Yang, and J. Luo, "Action Recognition with Spatio-Temporal Visual Attention on Skeleton Image Sequences," IEEE Transactions on Circuits and Systems for Video Technology, Vol. 29, No. 8, pp. 2405-2415, Aug. 2018.
[5] F. Malawski and B. Kwolek, "Recognition of Action Dynamics in Fencing Using Multimodal Cues," Image and Vision Computing, Vol. 75, No. JUL., pp. 1-10, May 2018.
[6] Y. Xu, J. Cheng, L. Wang, H. Xia, F. Liu, and D. Tao, "Ensemble One-dimensional Convolution Neural Networks for Skeleton-based Action Recognition," IEEE Signal Processing Letters, Vol. 25, No. 7, pp. 1044-1048, Jan. 2018.
[7] H. Zhang, M. Xin, S. Wang, Y. Yang, L. Zhang, and H. Wang, "End-to-end temporal attention extraction and human action recognition," Machine Vision and Applications, Vol. 29, No. 7, pp. 1127-1142, 2018.
[8] Q. Yu, P. Jiang, Y. Wang, and Z. Wang, "Research on first aid measures based on convolutional neural network recognition human actions," Chinese Critical Care Medicine, Vol. 32, No. 11, pp. 1385-1387, Nov. 2020.
[9] Y. Xu and T. T. Qiu, "Human Activity Recognition and Embedded Application Based on Convolutional Neural Network," Journal of Artificial Intelligence and Technology, Vol. 2021, No. 1, pp. 51-60, Dec. 2021.
[10] Y. Liu, D. Jiang, H. Duan, Y. Sun, and G. Li, "Dynamic Gesture Recognition Algorithm Based on 3D Convolutional Neural Network," Computational Intelligence and Neuroscience, Vol. 2021, No. 12, pp. 1-12, 2021.
[11] X. Wang and W. Zhang, "Anti-occlusion face recognition algorithm based on a deep convolutional neural network," Computers & Electrical Engineering, Vol. 96, pp. 1-12, 2021.
[12] X. Ran, Z. Shan, Y. Shi, and C. Lin, "Short-Term Travel Time Prediction: A Spatiotemporal Deep Learning Approach," International Journal of Information Technology & Decision Making (IJITDM), Vol. 18, No. 04, pp. 1087-1111, Apr. 2019.
[13] J. M. Kudari, A. Jebakumari, and S. Kumar, "Adlin Jebakumari S and Sushma B S, Image Classifier Using the Adam Optimizer and the Relu Activation Function," International Journal of Advanced Research in Engineering & Technology, Vol. 12, No. 3, pp. 56-60, Mar. 2021.
[14] A. Poernomo and D. K. Kang, "Biased Dropout and Crossmap Dropout: Learning towards effective dropout regularization in convolutional neural network," Neural Networks, Vol. 104, pp. 60-67, Apr. 2018.
[15] S. Mehrang, J. Pietilä, and I. Korhonen, "An Activity Recognition Framework Deploying the Random Forest Classifier and A Single Optical Heart Rate Monitoring and Triaxial Accelerometer Wrist-Band," Sensors, Vol. 18, No. 2, pp. 1-13, Feb. 2018.
Guiheng Zhi, born in October 1983, graduated from Guangxi Arts University in 2007
with a major in choreography and then stayed there to teach. He studied at Xiamen
University in 2012 and obtained a master's degree in engineering in 2014. He is a
lecturer and teaches courses that include basic training of classical dance, national
folk dance, modern dance, and choreography techniques. His professional research directions
are performance and choreography.