3.1 Experimental Setup
The experiments were conducted under the TensorFlow framework [18] on a Linux system. Tensor formats were converted with the reshape function. The convolution kernels of the first two layers of the 3D-CNN were 3×3, and those of the last two layers were 1×1. Max-pooling was used, and the network was trained with the Adam optimization algorithm at a learning rate of 0.001.
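As a concrete illustration, the following is a minimal sketch of such a 3D-CNN written against the Keras API bundled with TensorFlow. Only the 3×3 and 1×1 kernels, max-pooling, the softmax output, and Adam at a learning rate of 0.001 come from the text; the layer widths, the temporal kernel depth, the 16-frame clip length, and the downscaled input resolution are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of the 3D-CNN described above (TensorFlow / Keras).
# Assumed: layer widths, temporal kernel depth, 16-frame clips, and a
# 120x90 downscaled input; the paper fixes only the 3x3 / 1x1 spatial
# kernels, max-pooling, the softmax classifier, and Adam with lr=0.001.
import tensorflow as tf
from tensorflow.keras import layers

NUM_CLASSES = 12  # e.g., the 12 movement labels in DanceDB

model = tf.keras.Sequential([
    # Upstream, tf.reshape converts the stacked key frames into this
    # 5-D (frames, height, width, channels) tensor format.
    tf.keras.Input(shape=(16, 120, 90, 3)),
    # First two layers: 3x3 spatial convolutions (temporal depth of 3 assumed).
    layers.Conv3D(32, (3, 3, 3), padding="same", activation="relu"),
    layers.MaxPooling3D(pool_size=(1, 2, 2)),
    layers.Conv3D(64, (3, 3, 3), padding="same", activation="relu"),
    layers.MaxPooling3D(pool_size=(2, 2, 2)),
    # Last two layers: 1x1 convolutions.
    layers.Conv3D(64, (1, 1, 1), padding="same", activation="relu"),
    layers.Conv3D(128, (1, 1, 1), padding="same", activation="relu"),
    layers.GlobalAveragePooling3D(),
    # Softmax classifier over the movement classes.
    layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
```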
                  
There were two kinds of experimental data. The first was the DanceDB dataset [19], which contained 48 dance videos involving 12 dance movements. Each movement was labeled with an emotion tag (e.g., scared, angry). The frame rate of the videos was 20 frames per second, and the size of each frame was 480×360.
                  
The second was a self-built dataset containing 96 dance videos recorded by six students majoring in dance, all in the same setting. These videos covered three types of dance, each with complex and variable movements. The frame rate of these videos was 30 frames per second, and the size of each frame was again 480×360. Some example frames are shown in Fig. 3.
                  
                  All the videos had key frames marked by professional dance teachers, and the following
                     indicators were selected to assess the effects of key frame extraction.
                  
                  (1) Recall ratio: the number of key frames correctly extracted, divided by the number
                     of key frames correctly extracted plus the number of missed frames.
                  
                  (2) Precision ratio: the number of key frames correctly extracted, divided by the
                     number of key frames correctly extracted plus the number of frames falsely detected.
                  
                  (3) Deletion factor: the number of frames falsely detected, divided by the number
                     of key frames correctly extracted.
                  
                  The effectiveness of movement recognition was evaluated using accuracy: the number
                     of videos correctly recognized divided by the total number of videos.
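In symbols (the notation is ours, not the paper's), writing $TP$ for correctly extracted key frames, $FN$ for missed key frames, and $FP$ for falsely detected frames, the indicators above read

$$\text{Recall} = \frac{TP}{TP + FN}, \qquad \text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Deletion factor} = \frac{FP}{TP},$$

$$\text{Accuracy} = \frac{\text{videos correctly recognized}}{\text{total videos}}.$$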
                  
                  
                        Fig. 3. Example dance video frames from the self-built dataset.
 
                
               
                     3.2 Analysis of Results
                  First, the key frame extraction method from this paper was compared with the following
                     two methods:
                  
                  ① The color feature-based approach proposed by Jadhav and Jadhav [20], and
                  
                  ② The scale-invariant feature transform approach proposed by Hannane et al. [21].
                  
                  The key frame extraction results of the three methods are shown in Table 2.
                  
From Table 2, we see that, first, the recall ratios of the Jadhav and Jadhav and Hannane et al. methods were below 80% on the DanceDB dataset, while the recall ratio of the proposed multi-feature fusion method was 82.27%, i.e., 10.82 percentage points higher than Jadhav and Jadhav and 8.81 points higher than Hannane et al. Second, the precision ratios of the Jadhav and Jadhav and Hannane et al. methods were below 70%, while the precision ratio of multi-feature fusion was 72.84% (5.97 points higher than Jadhav and Jadhav and 4.65 points higher than Hannane et al.). Third, the deletion factor of the multi-feature fusion method on the DanceDB dataset was 3.01, which was 1.41 lower than that of Jadhav and Jadhav and 0.55 lower than that of Hannane et al.
                  
On the self-built dataset, the recall and precision ratios of all three methods improved to some extent relative to the DanceDB dataset, and the deletion factors were smaller, which may be due to the relatively small number of dance types in the self-built dataset. There, the recall ratio of the multi-feature fusion method was 84.82% and its precision ratio was 81.07%, both above 80% and significantly higher than those of the other two methods. Its deletion factor was 2.25, significantly smaller than those of the other two methods. Across both datasets, the multi-feature fusion method missed and falsely detected fewer key frames and produced better extraction results.
                  
Taking the dance Searching as an example, the key frames output by the three methods are presented in Fig. 4. The Jadhav and Jadhav and Hannane et al. methods extracted more key frames, but some of them were unclear. The key frames extracted by the multi-feature fusion method described the movement changes in the dance video completely, gave a good overview of the video, and depicted the movements clearly. The multi-feature fusion method is therefore suitable as a basis for movement recognition.
                  
Based on the extracted key frames, the movement recognition results were analyzed to compare the effects of different features and different classifiers on dance video movement recognition. The compared methods are
                  
                  Method 1: using only spatial features plus the softmax classifier,
                  Method 2: using only temporal features plus the softmax classifier, and
                  Method 3: using spatio-temporal features plus the SVM classifier [22].
                  
A comparison of the accuracy of these methods and the multi-feature fusion method is presented in Fig. 5.
                  
According to Fig. 5, movement recognition accuracy was low whenever only one type of feature was used. The multi-feature fusion method achieved an accuracy of 42.67% on the DanceDB dataset, 11 percentage points higher than Method 1 and 8.71 points higher than Method 2. On the self-built dataset, multi-feature fusion achieved 50.64%, 17.26 points higher than Method 1 and 14.72 points higher than Method 2. This indicates that, with the same classifier, the choice of extracted features has a clear influence on movement recognition: using only spatial or only temporal features lowered recognition accuracy, while combining spatio-temporal features recognized dance video movements better.
                  
Comparing Method 3 with the multi-feature fusion method, the difference in classifiers produced a difference in accuracy. Method 3 reached 38.56% on the DanceDB dataset (4.11 percentage points lower than the method proposed in this paper) and 40.11% on the self-built dataset (10.53 points lower). This indicates that the softmax classifier recognized the different dance movements better than the SVM classifier. The SVM also required considerable computation time for multi-class recognition, and the choice of its kernel function and parameters depended on manual experience, which is somewhat arbitrary. The softmax classifier was therefore more reliable.
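The contrast between the two classifiers can be sketched on fixed-length feature vectors. The snippet below is a minimal illustration only: scikit-learn's SVC stands in for the SVM of [22], a multinomial logistic regression stands in for the softmax head, and random arrays stand in for the fused spatio-temporal features; all of these stand-ins are assumptions, not the paper's implementation.

```python
# Hypothetical comparison of an SVM (as in Method 3) and a softmax
# classifier on the same fused spatio-temporal feature vectors.
# Feature dimensions, the train/test split, and the data itself are
# placeholders, not the paper's pipeline.
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(96, 256))    # one fused feature vector per video (assumed dim.)
y = rng.integers(0, 3, size=96)   # three dance types, as in the self-built set

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# SVM: the kernel and C must be chosen by hand; this is the manual
# tuning the text criticizes.
svm = SVC(kernel="rbf", C=1.0).fit(X_tr, y_tr)

# Softmax (multinomial logistic) classifier over the same features.
softmax = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

print("SVM accuracy:    ", accuracy_score(y_te, svm.predict(X_te)))
print("Softmax accuracy:", accuracy_score(y_te, softmax.predict(X_te)))
```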
                  
                  A movement identification approach based on trajectory feature fusion was proposed
                     by Megrhi et al. [23]. It was compared with the proposed multi-feature fusion method, and the results are
                     presented in Fig. 6.
                  
Fig. 6 shows that movement recognition accuracy from the method proposed in this paper was significantly higher on both dance video datasets. Both methods were more accurate on the self-built dataset, which may be because it contained more dance videos (96 versus 48). On DanceDB, the recognition accuracy of the Megrhi et al. method was 39.52% (3.15 percentage points lower than multi-feature fusion), and on the self-built dataset it was 46.62% (4.02 points lower). These results demonstrate that the multi-feature fusion method is effective in identifying movements in dance videos.
                  
                  The recognition accuracies of the method proposed by Megrhi et al. and the multi-feature
                     fusion method for different dance movements in the self-built set were further analyzed,
                     and the results are shown in Fig. 7.
                  
Fig. 7 shows that the accuracy of the Megrhi et al. method was below 50% for all three dances; the lowest (45.12%) was on Green Silk Gauze Skirt and the highest (47.96%) on Memories of the South. Compared with the Megrhi et al. method, the recognition accuracy of multi-feature fusion was 5.00, 2.55, and 4.51 percentage points higher for the three dances, which demonstrates the reliability of the multi-feature fusion method in recognizing different types of dance movement.
                  
                  
Table 2. Comparison of Key Frame Extraction Effects.

| Dataset | Method | Recall ratio | Precision ratio | Deletion factor |
|---|---|---|---|---|
| DanceDB | Jadhav and Jadhav | 71.45% | 66.87% | 4.42 |
| DanceDB | Hannane et al. | 73.46% | 68.19% | 3.56 |
| DanceDB | Multi-feature fusion | 82.27% | 72.84% | 3.01 |
| Self-built | Jadhav and Jadhav | 73.06% | 71.29% | 2.77 |
| Self-built | Hannane et al. | 75.77% | 73.36% | 2.41 |
| Self-built | Multi-feature fusion | 84.82% | 81.07% | 2.25 |
                   
                  
                        Fig. 4. Comparison of key frame extraction results.
 
                  
                        Fig. 5. Comparison of accuracy from dance video movement recognition.
 
                  
                        Fig. 6. Comparison of recognition accuracy from multi-feature fusion and trajectory feature fusion.
 
                  
                        Fig. 7. Accuracy comparison with the self-built set.