
  1. (Henan University of Animal Husbandry and Economy, Zhengzhou, Henan 450000, China)



Keywords: Multi-feature fusion, Dance videos, Key frame extraction, Movement recognition

1. Introduction

Under the influence of continuous developments in multimedia technology, video has become an increasingly common form of communication, playing a greater role in everyday life and entertainment [1]. To effectively extract information from videos, video processing has become a hot topic for researchers [2]. Action recognition is very important in video analysis and processing [3]; it can provide services for intelligent surveillance [4], human-computer interaction [5], and related applications, and it has great practical value. However, videos often contain a large number of frames. For larger and more complex videos, key frame extraction [6] can reduce the computational load of tasks such as action recognition and achieve higher efficiency and quality. With the advancement of algorithms such as deep learning, more and more methods have been applied to video processing [7].

Lee et al. [8] designed a three-dimensional convolutional neural network (3D-CNN) with spatiotemporal characteristics to analyze soccer game videos and extract motion information for each object. They found the method had superior speed and accuracy. Singh and Choubey [9] studied the analysis of surveillance videos, introduced the social force model, and achieved behavior recognition through frame pre-processing plus optical flow and social force calculations with a k-nearest neighbor classifier. Chakraborty and Pal [10] defined a motion granule to granulate video sequences and combined it with motion entropy to recognize different motion patterns. They demonstrated the effectiveness of the method through offline and real-time video analysis. Miao and Liu [11] detected motion in videos based on frame differences, extracted 3D histogram of oriented gradients features, and classified human motions using an extreme learning machine (ELM). Currently, movement recognition research focuses more on areas like human-computer interaction and video surveillance, and there is less research on dance videos. The processing and analysis of dance videos are of great help in supporting dance education and organizing dance video resources. Moreover, dance movements are highly complex and repetitive. Therefore, this paper focuses on key frame extraction and action recognition in dance videos, and presents a multi-feature fusion method for processing dance videos.

2. Multi-feature Fusion Method for Movement Recognition

2.1 Key Frame Extraction

A dance video often contains a large number of frames, and many movements are repeated, which imposes a significant computational burden on dance action recognition. Key frame extraction can be used to reduce the computational complexity while enhancing the efficiency of dance movement recognition. Key frame extraction refers to removing redundant content from a video and selecting representative image frames; these frames play an essential role in subsequent recognition and classification tasks and have significant applications in video summarization and retrieval [12].

The approach used in this paper is based on K-means [13], which first requires the selection of appropriate features for effective clustering. The following features are used.

(1) Color: The RGB image of a dance video frame is converted to the hue, saturation, value (HSV) color space [14] to extract HSV features. The formulas are:

(1)
$H=\left\{\begin{array}{l} 60\times \frac{G-B}{V-min\left(R,G,B\right)},V=R\\ 120+60\times \frac{B-R}{V-min\left(R,G,B\right)},V=G\\ 240+60\times \frac{R-G}{V-min\left(R,G,B\right)},V=B \end{array}\right.$,
(2)
$S=\left\{\begin{array}{l} 1-\frac{min\left(R,G,B\right)}{V},V\neq 0\\ 0,V=0 \end{array}\right.$,

and

(3)
$V=max\left(R,G,B\right)$.

To further reduce the computational effort, HSV color is then synthesized as a one-dimensional vector based on the formula $L=9H+3S+V$.
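
As a minimal sketch of this color feature, the conversion could be implemented as follows. It assumes, as is common with the formula $L=9H+3S+V$, that H, S, and V are first quantized to 8, 3, and 3 levels so that $L$ falls in $[0, 71]$; the quantization levels and the use of a 72-bin histogram to summarize each frame are our assumptions, not details stated in the paper.

```python
import cv2
import numpy as np

def color_feature(frame_bgr):
    """Quantized HSV color feature L = 9H + 3S + V for one video frame.

    Assumption: H is quantized to 8 bins, S and V to 3 bins each, which is a
    common choice for this formula but is not specified in the paper.
    """
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)       # OpenCV: H in [0,179], S,V in [0,255]
    h = np.floor(hsv[..., 0] / 180.0 * 8).clip(0, 7)        # 8 hue bins
    s = np.floor(hsv[..., 1] / 256.0 * 3).clip(0, 2)        # 3 saturation bins
    v = np.floor(hsv[..., 2] / 256.0 * 3).clip(0, 2)        # 3 value bins
    L = 9 * h + 3 * s + v                                    # one-dimensional color index per pixel
    # A normalized 72-bin histogram of L summarizes the frame's color content.
    hist, _ = np.histogram(L, bins=72, range=(0, 72), density=True)
    return hist
```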

(2) Texture: In order to extract texture features from a dance video frame, a local binary pattern (LBP) [15] is employed. To reduce the computational effort, an LBP equivalent pattern is used to achieve dimensionality reduction processing as follows:

(4)
$LBP_{P,R}^{riu2}=\left\{\begin{array}{l} \sum _{i=0}^{P-1}s\left(b_{i}-b_{c}\right),U\left(LBP_{P,R}\right)\leq 2\\ P+1,otherwise \end{array}\right.$,

and

(5)
$s\left(x\right)=\left\{\begin{array}{l} 1,x>0\\ 0,x\leq 0 \end{array}\right.$,

where $P$ is the number of sampling points within a circular neighborhood of radius $R$, $b_{i}$ is the pixel value of the $i$-th sampling point, $b_{c}$ is the pixel value of the center point, $U\left(LBP_{P,R}\right)$ is the number of 0/1 transitions in the circular binary pattern, and $s\left(x\right)$ thresholds the difference $b_{i}-b_{c}$.
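
A hedged sketch of this texture feature is given below, using scikit-image's rotation-invariant uniform LBP (its `'uniform'` method corresponds to the riu2 pattern in Eq. (4)). Summarizing the LBP map with a $(P+2)$-bin histogram and the defaults $P=8$, $R=1$ are illustrative assumptions, not values from the paper.

```python
import numpy as np
from skimage.feature import local_binary_pattern

def texture_feature(gray_frame, P=8, R=1):
    """Rotation-invariant uniform LBP histogram of a grayscale frame.

    P, R: number of sampling points and neighborhood radius (Eq. (4));
    the defaults 8 and 1 are illustrative, not taken from the paper.
    """
    lbp = local_binary_pattern(gray_frame, P, R, method='uniform')  # labels in {0, ..., P+1}
    hist, _ = np.histogram(lbp, bins=P + 2, range=(0, P + 2), density=True)
    return hist
```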

Based on the above two features, the inter-frame difference is used to describe the similarity between dance video frames $I_{c}$ and $I_{c+1}$, expressed as follows:

(6)
$\begin{array}{l} D\left(I_{c},I_{c+1}\right)=a_{1}\cdot D\left(C_{c},C_{c+1}\right)+a_{2}\cdot D\left(V_{c},V_{c+1}\right)\\ =a_{1}\sqrt{\left(C_{c}-C_{c+1}\right)^{2}}+a_{2}\sqrt{\left(V_{c}-V_{c+1}\right)^{2}} \end{array}$,

where $D\left(C_{c},C_{c+1}\right)$ denotes the color feature similarity, $D\left(V_{c},V_{c+1}\right)$ denotes the texture similarity, and $a_{1}$ and $a_{2}$ are the weights of the two similarities, with $a_{1}=a_{2}=0.5$.

Then, using $D\left(I_{c},I_{c+1}\right)$ as the feature, all the dance video frames are classified with the K-means algorithm. After classification, key frames are extracted from each class. For each class, if the number of frames is greater than 8, then the first, middle, and last frames are used as key frames, and if the number of frames is less than or equal to 8, then only the first and last frames are used as key frames.
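
A minimal sketch of this clustering and selection rule is shown below. It assumes the per-frame feature is the scalar difference $D\left(I_{c},I_{c+1}\right)$ and that the number of clusters $k$ is chosen in advance; the paper does not state $k$, so it is treated as a parameter here.

```python
import numpy as np
from sklearn.cluster import KMeans

def extract_key_frames(diffs, k=5):
    """Cluster frames by their inter-frame difference D(I_c, I_{c+1}) and pick
    key frames per cluster: first/middle/last if the cluster has more than
    8 frames, otherwise first/last. k is an assumed parameter.
    """
    diffs = np.asarray(diffs).reshape(-1, 1)           # one scalar feature per frame
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(diffs)

    key_frames = []
    for c in range(k):
        idx = np.flatnonzero(labels == c)              # frame indices in this cluster
        if len(idx) == 0:
            continue
        if len(idx) > 8:
            key_frames += [idx[0], idx[len(idx) // 2], idx[-1]]
        else:
            key_frames += [idx[0], idx[-1]]
    return sorted(set(key_frames))
```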

2.2 Movement Recognition

The multi-feature fusion method presented in this paper is used for movement recognition. First, the features of key frame images are extracted from both spatial and temporal aspects.

(1) Spatial features

In order to better extract information from video frames, the proposed method uses a 3D-CNN [16] to extract spatial features from the dance videos. The 3D-CNN obtains spatial features of key frames through 3D convolutional kernels, and the formula can be written as

(7)
$f\left(x\right)=wx+b$,

where $f\left(x\right)$ is the eigenvalue after convolution, $x$ is the pixel value matrix, $w$ is the convolution kernel matrix, and $b$ is the bias.

After convolution, nonlinear processing is applied through the activation function (here, the ReLU function):

(8)
$f\left(x\right)=\left\{\begin{array}{l} 0,x\leq 0\\ x,x>0 \end{array}\right.$.

Then, the feature map obtained in the previous step is compressed through a pooling operation, which can be expressed as

(9)
$N_{P}=\frac{W_{P}-F_{P}}{S_{P}}+1$,

where $W_{P}$ is the size of the feature map before pooling, $F_{P}$ is the pooling size, $S_{P}$ is the pooling step length, and $N_{P}$ is the size of the feature map after pooling.
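
As an illustrative example of Eq. (9) (the values are ours, not from the paper), a $32\times 32$ feature map pooled with $F_{P}=2$ and $S_{P}=2$ yields $N_{P}=\frac{32-2}{2}+1=16$, i.e., a $16\times 16$ pooled map.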

(2) Time features

For temporal features in dance videos, the proposed method uses a long short-term memory (LSTM) network [17] for extraction. The LSTM network structure is shown in Fig. 1.

As seen in Fig. 1, the LSTM controls the cell state through three gates, where $h_{t-1}$ is the output at the previous moment and $x_{t}$ is the input at the current moment. The LSTM temporal feature extraction process is as follows.

① The information to be forgotten is calculated through the forgetting gate: $f_{t}=\sigma \left(U_{f}\cdot \left[h_{t-1},x_{t}\right]+c_{f}\right)$.

② The information to be saved is calculated through the input gate: $i_{t}=\sigma \left(U_{i}\cdot \left[h_{t-1},x_{t}\right]+c_{i}\right)$.

③ The candidate information is generated as $\tilde{C}_{t}=\tanh \left(U_{k}\cdot \left[h_{t-1},x_{t}\right]+c_{k}\right)$ and combined with the forget and input gates to update the cell state: $C_{t}=f_{t}\times C_{t-1}+i_{t}\times \tilde{C}_{t}$.

④ The output value of the current cell state is obtained through the output gate: $o_{t}=\sigma \left(U_{o}\cdot \left[h_{t-1},x_{t}\right]+c_{o}\right)\,,$ $h_{t}=o_{t}\tanh \left(C_{t}\right).$

The parameters in the above equations are presented in Table 1.
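
To make gate equations ①-④ concrete, the following is a minimal numpy sketch of a single LSTM step. The weight matrices $U_{*}$ and biases $c_{*}$ correspond to the parameters in Table 1, but their shapes and construction here are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, U_f, c_f, U_i, c_i, U_k, c_k, U_o, c_o):
    """One LSTM step following Eqs. ①-④; U_* are weight matrices applied to
    the concatenation [h_{t-1}, x_t], c_* are bias vectors."""
    z = np.concatenate([h_prev, x_t])          # [h_{t-1}, x_t]
    f_t = sigmoid(U_f @ z + c_f)               # ① forget gate
    i_t = sigmoid(U_i @ z + c_i)               # ② input gate
    C_tilde = np.tanh(U_k @ z + c_k)           # ③ candidate information
    C_t = f_t * C_prev + i_t * C_tilde         # ③ new cell state
    o_t = sigmoid(U_o @ z + c_o)               # ④ output gate
    h_t = o_t * np.tanh(C_t)                   # ④ hidden state output
    return h_t, C_t
```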

Based on the above extracted spatial and temporal features, the process of the multi-feature fusion movement recognition method is shown in Fig. 2.

Fig. 2 shows the steps of this method as follows.

(1) The key frames of a dance video are converted into a five-dimensional tensor (length, width, number of channels, batch size, and number of frames) that is used as input to the 3D-CNN. After convolution and pooling operations, the spatial feature map is obtained.

(2) The spatial feature map is converted into a 3D tensor (batch, time step, and the product of the pixel matrix length and width) and used as input to the LSTM network to obtain spatial and temporal feature maps of the key frames.

(3) The feature map obtained after multi-feature fusion is input to the fully connected layer and the softmax layer. Finally, a one-dimensional output gives the movement recognition result for the dance video; a sketch of the full pipeline follows.
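
The following is a hedged Keras sketch of this pipeline under the configuration described in Section 3.1 (two 3${\times}$3 3D convolutions, two 1${\times}$1 3D convolutions, max pooling, an LSTM over the reshaped spatial maps, and fully connected plus softmax layers). The input size, filter counts, LSTM width, and number of classes are illustrative assumptions, not values reported in the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_model(num_frames=16, height=60, width=80, channels=3, num_classes=12):
    """Multi-feature fusion sketch: 3D-CNN spatial features -> reshape ->
    LSTM temporal features -> fully connected + softmax.
    All sizes are illustrative assumptions."""
    inp = layers.Input(shape=(num_frames, height, width, channels))  # 5D tensor with batch

    # Spatial features: two 3x3x3 and two 1x1x1 3D convolutions with max pooling.
    x = layers.Conv3D(16, 3, padding='same', activation='relu')(inp)
    x = layers.Conv3D(16, 3, padding='same', activation='relu')(x)
    x = layers.MaxPooling3D(pool_size=(1, 2, 2))(x)
    x = layers.Conv3D(32, 1, padding='same', activation='relu')(x)
    x = layers.Conv3D(32, 1, padding='same', activation='relu')(x)
    x = layers.MaxPooling3D(pool_size=(1, 2, 2))(x)

    # Reshape to (batch, time steps, features) for the LSTM.
    x = layers.Reshape((num_frames, -1))(x)
    x = layers.LSTM(128)(x)                                          # temporal features

    # Fully connected layer and softmax output for movement recognition.
    x = layers.Dense(64, activation='relu')(x)
    out = layers.Dense(num_classes, activation='softmax')(x)
    return models.Model(inp, out)
```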

Fig. 1. The LSTM network structure.
../../Resources/ieie/IEIESPC.2023.12.6.495/fig1.png
Fig. 2. The movement recognition method based on multi-feature fusion.
../../Resources/ieie/IEIESPC.2023.12.6.495/fig2.png
Table 1. LSTM Parameters and their Meanings.
../../Resources/ieie/IEIESPC.2023.12.6.495/tb1.png

3. Results and Analysis

3.1 Experimental Setup

The experiments were conducted under the TensorFlow framework [18] on a Linux system. Tensor format changes were performed with the reshape function; 3${\times}$3 convolution kernels were used for the first two layers of the 3D-CNN and 1${\times}$1 kernels for the last two layers. Max-pooling was used, and the network was trained with the Adam optimization algorithm at a learning rate of 0.001.
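
Continuing the sketch from Section 2.2, a minimal training configuration matching the stated setup (Adam optimizer, learning rate 0.001) could look as follows; the loss function and batch size are assumptions, since the paper does not specify them.

```python
import tensorflow as tf

model = build_model()  # from the sketch in Section 2.2
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss='sparse_categorical_crossentropy',  # assumed loss for integer movement labels
              metrics=['accuracy'])
# model.fit(train_clips, train_labels, epochs=50, batch_size=8)  # illustrative call; data pipeline not shown
```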

There were two kinds of experimental data. The first was the DanceDB dataset [19], which contained 48 dance videos involving 12 dance movements. Each movement was labeled with an emotion tag (e.g., scared, angry). The frame rate of the videos was 20 frames per second, and the size of each frame was 480${\times}$360.

The second was a self-built dataset containing 96 dance videos recorded by six students majoring in dance, all in the same setting. These videos covered three types of dance, each with complex and variable movements. The frame rate of these videos was 30 FPS, and again, the size of each frame was 480${\times}$360. Some example frames are shown in Fig. 3.

All the videos had key frames marked by professional dance teachers, and the following indicators were selected to assess the effects of key frame extraction.

(1) Recall ratio: the number of key frames correctly extracted, divided by the number of key frames correctly extracted plus the number of missed frames.

(2) Precision ratio: the number of key frames correctly extracted, divided by the number of key frames correctly extracted plus the number of frames falsely detected.

(3) Deletion factor: the number of frames falsely detected, divided by the number of key frames correctly extracted.

The effectiveness of movement recognition was evaluated using accuracy: the number of videos correctly recognized divided by the total number of videos.
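
A minimal sketch of these indicators as defined above is given below, assuming key frames are represented by frame indices and recognition results by per-video labels.

```python
def keyframe_metrics(predicted, ground_truth):
    """Recall ratio, precision ratio, and deletion factor as defined above.
    predicted / ground_truth: iterables of key frame indices."""
    predicted, ground_truth = set(predicted), set(ground_truth)
    correct = len(predicted & ground_truth)        # correctly extracted key frames
    missed = len(ground_truth - predicted)         # missed key frames
    false_det = len(predicted - ground_truth)      # falsely detected frames
    recall = correct / (correct + missed) if correct + missed else 0.0
    precision = correct / (correct + false_det) if correct + false_det else 0.0
    deletion_factor = false_det / correct if correct else float('inf')
    return recall, precision, deletion_factor

def recognition_accuracy(pred_labels, true_labels):
    """Fraction of videos whose movement class is correctly recognized."""
    hits = sum(p == t for p, t in zip(pred_labels, true_labels))
    return hits / len(true_labels)
```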

Fig. 3. Example dance video frames from the self-built dataset.
../../Resources/ieie/IEIESPC.2023.12.6.495/fig3.png

3.2 Analysis of Results

First, the key frame extraction method from this paper was compared with the following two methods:

① The color feature-based approach proposed by Jadhav and Jadhav [20], and

② The scale-invariant feature transform approach proposed by Hannane et al. [21].

The key frame extraction results of the three methods are shown in Table 2.

From Table 2, we see that, first, the recall ratios of the methods proposed by Jadhav and Jadhav and Hannane et al. were below 80% for the DanceDB dataset, while the recall ratio of the proposed multi-feature fusion method was 82.27% (10.82% higher than Jadhav and Jadhav, and 8.81% higher than Hannane et al.). Secondly, the precision ratios of the Jadhav and Jadhav and Hannane et al. methods were below 70%, while the precision ratio from multi-feature fusion was 72.84% (5.97% higher than Jadhav and Jadhav and 4.65% higher than Hannane et al.). Third, the deletion factor of the multi-feature fusion method on the DanceDB dataset was 3.01, which was 1.41 lower than the method by Jadhav and Jadhav, and 0.55 lower than the method by Hannane et al.

The recall and precision ratios of all the methods when processing the self-built dataset improved to some extent, compared to the DanceDB dataset, and the deletion factors were also smaller, which may be due to the relatively small number of dance types in the self-built dataset. In comparison, the recall rate of the multi-feature fusion method was 84.82%, and the precision ratio was 81.07%, both of which are higher than 80% and significantly higher than the other two methods. The deletion factor with multi-feature fusion was 2.25, which was significantly smaller than the other two methods. Based on the results with the two datasets, the multi-feature fusion method had fewer cases of missed and incorrect detections of key frames, as well as better extraction results.

Taking the dance called Searching as an example, the key frames output by all three methods are presented in Fig. 4. More key frames were extracted by the Jadhav and Jadhav and Hannane et al. methods, but some of them were not clear. The key frames extracted by the multi-feature fusion method gave a complete description of the movement changes in the dance video, provided a good overview of the video, and the extracted movements were clear. Therefore, the multi-feature fusion method can be used to provide movement recognition services.

Based on the key frame extractions, the results of movement recognition were analyzed to compare the effects of different features and different classifiers on the results of dance video movement recognition. The compared methods are

Method 1: using only spatial features plus the softmax classifier,

Method 2: using only temporal features plus the softmax classifier, and

Method 3: using spatio-temporal features plus the SVM classifier [22].

Comparison of accuracy from the above methods and the multi-feature fusion method is presented in Fig. 5.

According to Fig. 5, movement recognition accuracy was low in all cases when only one feature was used. The multi-feature fusion method exhibited accuracy of 42.67% with the DanceDB dataset, which is 11% higher than Method 1 and 8.71% higher than Method 2. For the self-built dataset, multi-feature fusion had an accuracy of 50.64%, which is 17.26% higher than Method 1 and 14.72% higher than Method 2. This indicates that the extracted features could have some influence on movement recognition when using the same classifier. The comparison shows that using only spatial features or only temporal features led to a decrease in recognition accuracy, while the combination of spatio-temporal features produced better recognition of dance video movements.

Comparing Method 3 and the multi-feature fusion method, the difference in classifiers resulted in differences in accuracy. Recognition accuracy from Method 3 on the DanceDB dataset was 38.56% (4.11% lower than the method proposed in this paper), and with the self-built dataset, recognition accuracy from Method 3 was 40.11% (10.53% lower than the proposed method). This indicates that the softmax classifier provided better recognition of different dance movements than the SVM classifier. The SVM required a lot of computation time for multi-classification recognition, and the selection of the kernel function and parameters depended on manual experience, which is somewhat arbitrary. Therefore, the softmax classifier was more reliable.

A movement identification approach based on trajectory feature fusion was proposed by Megrhi et al. [23]. It was compared with the proposed multi-feature fusion method, and the results are presented in Fig. 6.

Fig. 6 shows that movement recognition accuracy from the method proposed in this paper was significantly higher for both dance video datasets. The accuracy of both methods was higher with the self-built dataset, likely because it contained more videos covering fewer dance types. For DanceDB, the recognition accuracy of the Megrhi et al. method was 39.52% (3.15% lower than multi-feature fusion), and for the self-built dataset, it was 46.62% (4.02% lower than multi-feature fusion). These results demonstrate that the multi-feature fusion method is effective in identifying movements from dance videos.

The recognition accuracies of the method proposed by Megrhi et al. and the multi-feature fusion method for different dance movements in the self-built set were further analyzed, and the results are shown in Fig. 7.

Fig. 7 shows that accuracy from the Megrhi et al. method was below 50% for all three dances, among which the lowest accuracy (45.12%) was with Green Silk Gauze Skirt, and the highest accuracy (47.96%) was with Memories of the South. Compared with the Megrhi et al. method, recognition accuracy from multi-feature fusion was 5%, 2.55%, and 4.51% higher for the three dances, which proves the reliability of the multi-feature fusion method in recognizing different types of dance movement.

Table 2. Comparison of Key Frame Extraction Effects.

Dataset            | Method               | Recall ratio | Precision ratio | Deletion factor
DanceDB            | Jadhav and Jadhav    | 71.45%       | 66.87%          | 4.42
DanceDB            | Hannane et al.       | 73.46%       | 68.19%          | 3.56
DanceDB            | Multi-feature fusion | 82.27%       | 72.84%          | 3.01
Self-built dataset | Jadhav and Jadhav    | 73.06%       | 71.29%          | 2.77
Self-built dataset | Hannane et al.       | 75.77%       | 73.36%          | 2.41
Self-built dataset | Multi-feature fusion | 84.82%       | 81.07%          | 2.25

Fig. 4. Comparison of key frame extraction results.
../../Resources/ieie/IEIESPC.2023.12.6.495/fig4.png
Fig. 5. Comparison of accuracy from dance video movement recognition.
../../Resources/ieie/IEIESPC.2023.12.6.495/fig5.png
Fig. 6. Comparison of recognition accuracy from multi-feature fusion and trajectory feature fusion.
../../Resources/ieie/IEIESPC.2023.12.6.495/fig6.png
Fig. 7. Accuracy comparison with the self-built set.
../../Resources/ieie/IEIESPC.2023.12.6.495/fig7.png

4. Conclusion

This paper presents a multi-feature fusion method for key frame extraction and movement recognition from dance videos. The proposed method was assessed with the DanceDB dataset and a self-built dataset. The results showed that this method could accurately extract key frames from videos and provide high recall and precision ratios. Moreover, the softmax classifier under multi-feature fusion achieved accuracy of 42.67% with DanceDB and 50.64% with the self-built dataset, demonstrating the reliability of the proposed method for practical dance video analysis. In future research, more dance videos need to be used, and the proposed method needs to be tested for movement recognition in various fields to better understand its applicability.

REFERENCES

[1] F. Ikram and H. Farooq, ``Multimedia Recommendation System for Video Game Based on High-Level Visual Semantic Features,'' Scientific Programming, Vol. 2022, pp. 1-12, 2022.
[2] J. Yang, Y. Jiang, and S. Wang, ``Enhancement or Super-Resolution: Learning-based Adaptive Video Streaming with Client-Side Video Processing,'' in ICC 2022 - IEEE International Conference on Communications, Seoul, Korea, Republic of, pp. 739-744, May 2022.
[3] M. Majd and R. Safabakhsh, ``A motion-aware ConvLSTM network for action recognition,'' Applied Intelligence, Vol. 49, No. 7, pp. 1-7, 2019.
[4] K. Zhang and W. Ling, ``Joint Motion Information Extraction and Human Behavior Recognition in Video Based on Deep Learning,'' IEEE Sensors Journal, Vol. 20, No. 20, pp. 11919-11926, 2020.
[5] T. Hoshino, S. Kanoga, M. Tsubaki, and A. Aoyama, ``Comparing subject-to-subject transfer learning methods in surface electromyogram-based motion recognition with shallow and deep classifiers,'' Neurocomputing, Vol. 489, pp. 599-612, 2022.
[6] I. Mademlis, A. Tefas, and I. Pitas, ``A salient dictionary learning framework for activity video summarization via key-frame extraction,'' Information Sciences, Vol. 432, pp. 319-331, 2018.
[7] L. Xu, S. Yan, X. Chen, and P. Wang, ``Motion Recognition Algorithm Based on Deep Edge-Aware Pyramid Pooling Network in Human-Computer Interaction,'' IEEE Access, Vol. 7, pp. 163806-163813, 2019.
[8] J. Lee, Y. Kim, M. Jeong, C. Kim, D. W. Nam, J. S. Lee, S. Moon, and W. Y. Yoo, ``3D convolutional neural networks for soccer object motion recognition,'' in 2018 20th International Conference on Advanced Communication Technology (ICACT), Chuncheon, Korea (South), pp. 354-358, Feb. 2018.
[9] U. Singh and M. K. Choubey, ``Motion Pattern Recognition from Crowded Video,'' in 2020 8th International Conference on Reliability, Infocom Technologies and Optimization (Trends and Future Directions) (ICRITO), Noida, India, pp. 431-435, Jun. 2020.
[10] D. B. Chakraborty and S. K. Pal, ``Rough video conceptualization for real-time event precognition with motion entropy,'' Information Sciences, Vol. 543, pp. 488-503, 2021.
[11] A. Miao and F. Liu, ``Application of human motion recognition technology in extreme learning machine,'' International Journal of Advanced Robotic Systems, Vol. 18, No. 1, pp. 4-18, 2021.
[12] Y. Zhang, S. Zhang, J. Zhang, K. Guo, and Z. Cai, ``Key Frame Extraction of Surveillance Video Based on Frequency Domain Analysis,'' Intelligent Automation and Soft Computing, Vol. 29, No. 1, pp. 259-272, 2021.
[13] V. Agalya, M. Kandasamy, E. Venugopal, and B. Maram, ``CPRO: Competitive Poor and Rich Optimizer-Enabled Deep Learning Model and Holoentropy Weighted-Power K-Means Clustering for Brain Tumor Classification Using MRI,'' International Journal of Pattern Recognition and Artificial Intelligence, Vol. 36, No. 4, pp. 1-30, 2022.
[14] I. Ivanov and V. Skryshevsky, ``Porous Silicon Bragg Reflector Sensor: Applying HSV Color Space for Sensor Characterization,'' in 2021 IEEE 16th International Conference on the Experience of Designing and Application of CAD Systems (CADSM), Lviv, Ukraine, pp. 15-19, Feb. 2021.
[15] V. Phivin and A. C. S. Jini, ``Performance Analysis of Fuzzy C-Means Clustering using Multichannel Decoded Local Binary Pattern,'' International Journal of Engineering Trends and Technology, Vol. 61, No. 1, pp. 49-55, 2018.
[16] A. Caglayan and A. B. Caglayan, ``Volumetric Object Recognition Using 3-D CNNs on Depth Data,'' IEEE Access, Vol. 6, pp. 20058-20066, 2018.
[17] K. Gavahi, P. Abbaszadeh, and H. Moradkhani, ``DeepYield: A combined convolutional neural network with long short-term memory for crop yield forecasting,'' Expert Systems with Applications, Vol. 184, 2021.
[18] S. Osah, A. A. Acheampong, C. Fosu, and I. Dadzie, ``Deep Learning model for predicting daily IGS Zenith Tropospheric Delays in West Africa using TensorFlow and Keras,'' Advances in Space Research, Vol. 68, No. 3, pp. 1243-1262, 2021.
[19] DanceDB dataset, https://dancedb.cs.ucy.ac.cy/main/performances
[20] M. Jadhav and D. S. Jadhav, ``Video Summarization Using Higher Order Color Moments (VSUHCM),'' Procedia Computer Science, Vol. 45, pp. 275-281, 2015.
[21] R. Hannane, A. Elboushaki, and K. Afdel, ``Efficient Video Summarization Based on Motion SIFT-Distribution Histogram,'' in 2016 13th International Conference on Computer Graphics, Imaging and Visualization (CGiV), Beni Mellal, Morocco, pp. 312-317, Apr. 2016.
[22] F. Camastra, V. Capone, A. Ciaramella, A. Riccio, and A. Staiano, ``Prediction of environmental missing data time series by Support Vector Machine Regression and Correlation Dimension estimation,'' Environmental Modelling & Software, Vol. 150, pp. 1-7, 2022.
[23] S. Megrhi, A. Megrhi, and W. Souidène, ``Trajectory feature fusion for human action recognition,'' in 2014 5th European Workshop on Visual Information Processing (EUVIP), Paris, France, pp. 1-6, Dec. 2014.

Author

Jie Yan
../../Resources/ieie/IEIESPC.2023.12.6.495/au1.png

Jie Yan obtained her master’s degree. She works at the Henan University of Animal Husbandry and Economy, and her interests include music and dance.