
  1. (School of Physical Education and Health Education, Hunan University of Information Technology, Changsha, 410151, China kiki19900708@163.com)



Cheerleading, Action recognition, Inertial sensors, Pose motion capture, 3D pose estimation, HGCN, Self-attention mechanism

1. Introduction

In recent years, as sports have diversified, cheerleading has developed into a discipline that combines gymnastics, dance, and music, and it has attracted widespread attention. Cheerleading not only improves participants’ physical fitness and coordination, but also cultivates teamwork and artistic expression. Because cheerleading movements are complex and varied, accurately identifying athletes’ technical movements remains a difficult research problem [1, 2]. Traditional action recognition relies on manual annotation and visual observation, which is slow, error-prone, and subjective. To improve the accuracy and efficiency of cheerleading action recognition, many scholars have combined sensor technologies with posture motion capture methods [3, 4]. Inertial sensors capture an athlete’s posture and motion in real time by measuring quantities such as acceleration and angular velocity. These data can provide more accurate motion information than visual methods, which helps improve recognition accuracy. However, because of inherent sensor errors, external interference, and the complexity of real motion, processing and exploiting these data effectively is a key issue. Posture motion capture technology offers a complementary route: multi-view camera systems combined with deep learning algorithms can capture and analyze athletes’ three-dimensional pose data. Among neural network approaches to action recognition, the Graph Convolutional Network (GCN) has performed particularly well on human skeleton-based action recognition in recent years [5]. A GCN processes graph-structured data directly and recognizes human movements by capturing the relationships between skeletal joints.
However, traditional GCNs remain inaccurate and slow when dealing with complex multi-joint dependencies. To address these limitations, this research combines inertial sensors, pose motion capture technology, and an improved GCN to build a new algorithm for recognizing cheerleaders’ technical movements. Its innovation lies in improving the recognition performance of GCN for characteristic actions by introducing a Self-attention Mechanism (SAM) together with time-sparse and channel-sparse hypergraphs. The resulting system can help cheerleaders identify movement errors promptly during training and can provide sports institutions with new training and evaluation methods.

The contributions of this study are as follows: (1) An Improved Spatiotemporal Hypergraph Convolution (ISTHC) algorithm is proposed, which combines the self-attention mechanism with the spatiotemporal optimization of hypergraphs. Whereas a traditional GCN can only model binary adjacency, ISTHC explicitly expresses multi-joint cooperative relationships through dynamic hyperedges. (2) A dynamic hyperedge construction rule for cheerleading is defined, which overcomes the limitation that a traditional skeleton graph models only binary connections and provides a new graph representation for sports action recognition. (3) The average recognition accuracy over the four categories of cheerleading actions is 97.6%, a single recognition takes less than 0.1 seconds, and action errors can be fed back in real time. Test results from a provincial cheerleading team show that the system increases training efficiency by 42.3% and reduces the workload of manual evaluation by 68%.

2. Related Work

Lightweight, low-cost magnetic and inertial sensors are now widely used for attitude estimation. However, inherent sensor errors, external disturbances, and accelerations caused by the actual motion of the platform significantly degrade estimation accuracy. To reduce this impact, B. Candan and H. E. Soken proposed a robust attitude estimation algorithm that compensates for sensor errors and external accelerations. Simulation results demonstrated that the approach had significant advantages over the benchmark algorithm [6]. In the field of orbit maintenance, accurately determining the attitude and inertia parameters of non-cooperative targets can improve the quality of orbit maintenance. Q. Meng et al. therefore proposed a model-free approach that sequentially registers point clouds captured by depth cameras. The study combined a multiplicative extended Kalman filter with pose graph optimization to reduce the impact of measurement noise and drift errors, and the experimental results demonstrated the effectiveness of the method [7]. Drone flight control requires accurate pose measurement, and using optical flow sensors to detect drone motion relative to the ground can improve positioning accuracy. X. Li et al. therefore proposed a new optical flow measurement model suitable for discrete-time conditions. The model provided a symmetric description of inter-frame vectors, and the authors also presented a data fusion scheme based on cubic transformation, aiming to directly enhance position estimation. Flight tests showed that the model and scheme significantly improved positioning accuracy in various outdoor environments [8]. T. Li and H. Yu proposed a visual-inertial sensor system consisting of three sensor modules attached to the torso, upper arm, and forearm; the orientation of each module was calculated through visual-inertial fusion, and the system was calibrated using arbitrary arm movements. The experimental results showed that the shoulder and elbow joint angles were highly correlated with those from an optical motion capture system, with a correlation coefficient of 0.986. Except for the forearm rotation angle, the root mean square error of the joint angles was less than 4° [9].

GCN has been extensively applied to human skeleton-based action recognition. Although GCN performs well on this task, its channels share a single adjacency matrix and it ignores dependencies between different joints, whereas a Convolutional Neural Network (CNN) can better model complex dependencies. W. Yang et al. therefore proposed a method that integrates GCN and CNN. Their results indicated that the hybrid network retained structural information while modeling inter-frame joint relationships, and that its recognition accuracy on human skeletons was significantly better than that of existing methods [10]. K. Hu et al. proposed a dynamic GCN model based on an attention weighting strategy, in which a new dynamic adjacency matrix is constructed through the attention weighting mechanism to capture the dynamic relationships of the skeleton under multiple actions and to fully extract discriminative action features. Extensive experiments showed that the model performed well on both the NTU-RGB+D and Skeleton Kinetics datasets [11]. GCN also has difficulty defining semantic-level adjacency matrices and cannot fully utilize joint velocity information. J. Zhang et al. therefore proposed a graph-aware transformer that uses joint velocity information in a data-driven manner to learn spatiotemporal motion features. The experimental results showed that the graph-aware transformer achieved significant improvement over the GCN benchmark model on multiple publicly available datasets [12]. Y. Liu et al. proposed a skeleton large kernel attention operator and a spatiotemporal skeleton large kernel attention module to expand the receptive field and improve channel adaptability, together with a joint motion modeling strategy that focuses on important temporal interaction information. The experimental results showed that the spatiotemporal large kernel attention GCN achieved high recognition accuracy on publicly available skeleton datasets [13].

In summary, in recent years, many scholars have conducted in-depth research on action recognition technology and achieved many important research results. At the same time, the application of GCN in different technological fields has also shown significant results. However, there are still problems with low efficiency and low accuracy in recognizing technical movements of cheerleaders. In view of this, the research attempts to improve the GCN algorithm and apply it to cheerleader technical action recognition. It is expected to provide more effective technical support and theoretical basis for this field, promoting the development of recognition technology.

3. Action Recognition of Cheerleading Athletes Combining Inertial Sensors and ISTHC

To enhance the recognition effect of cheerleaders’ technical movements, the study first combines inertial sensors, posture motion capture technology, and human skeleton map to collect 3D posture data of cheerleaders. Secondly, the GCN is optimized and the extracted 3D pose data is applied to design the cheerleader action recognition model.

3.1. 3D Pose Data Acquisition Based on Inertial Sensors and Pose Motion Capture

Cheerleading is a comprehensive sports activity that combines dance, gymnastics, and music, emphasizing the smoothness and coordination of movements. Because cheerleading movements are complex and varied, there are non-physical dependencies between body joints, which increases the difficulty of action recognition [14]. To extract and analyze these joint relationships accurately, the study introduces a human skeleton diagram, which intuitively represents the connections and interactions of the joints during cheerleading. The skeleton diagram of a cheerleading action is shown in Fig. 1.

Fig. 1. Skeleton of cheerleading athletes.


Fig. 1 shows the original movement and the simplified skeleton of a cheerleader. In Fig. 1, the skeleton vertices represent joint points, and the connecting lines represent the edges of the skeleton graph. Once fully connected, the graph can represent the attribute relationships between joints. Cheerleading exercises involve highly coordinated multi-joint movements (such as synchronized arm swings and leg jumps) with non-physical dependencies between joints. Traditional action recognition methods, such as CNN, have difficulty modeling such complex spatial relationships because they cannot directly process non-Euclidean data. Therefore, this study chose GCN as the basic framework. GCN excels at processing graph-structured data (such as human skeleton graphs) and propagates features by encoding joint connections in an adjacency matrix. This structure extends CNN from regular grids to unordered graphs of arbitrary structure and takes skeleton information as input to analyze the motion patterns of the target [15, 16]. The main structure of GCN is shown in Fig. 2.

Fig. 2. GCN structure diagram.


The GCN structure in Fig. 2 is similar to the CNN structure, consisting of an input layer, a graph convolutional layer, and an output layer. GCN typically utilizes the feature information of nodes and their neighboring nodes in the graph to obtain feature vectors through weighted averaging, thereby increasing the weight of nodes with lower degrees. Then, these feature vectors are trained and learned through neural networks to effectively utilize the structural information of the graph [17]. GCN calculates the new feature representation of the current node by weighted averaging the features of neighboring nodes, which can be achieved through matrix multiplication. The propagation mode between GCN layers is shown in Eq. (1) [18].

(1)
$ H^{(l+1)} = f(H^{(l)}, A) = \sigma\left(\tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}}H^{(l)}W^{(l)}\right). $

In Eq. (1), $A$ represents the adjacency matrix. $\tilde{A} = A + I$ represents the adjacency matrix with the identity matrix added, so that each node is also connected to itself. $H^{(l)}$ represents the features of layer $l$. $\sigma$ signifies the nonlinear activation function. $W^{(l)}$ represents the trainable parameter matrix of layer $l$. $\tilde{D}$ represents the degree matrix of $\tilde{A}$. $\tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}}$ represents the symmetric normalization of $\tilde{A}$. The propagation mode $z$ of the entire two-layer GCN is shown in Eq. (2).

(2)
$ z = f(X, A) = \mathrm{softmax}\left(\hat{A}\,\mathrm{ReLU}\left(\hat{A}XW^{(0)}\right)W^{(1)}\right). $

In Eq. (2), $\hat{A} = \tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}}$ is the normalized adjacency matrix from Eq. (1), $W^{(0)}$ and $W^{(1)}$ signify the trainable parameters of the input and output layers, and $X$ represents the multidimensional feature matrix of the nodes. However, traditional GCN has two major limitations in cheerleading action recognition. Firstly, the pre-defined adjacency matrix of GCN can only represent binary joint relationships (such as “hand-elbow”) and cannot model high-order interactions (such as the collaborative “hand-waist-foot” relationship in lifting actions). Secondly, cheerleading movements have spatiotemporal characteristics and require joint analysis of multi-frame motion sequences. To address these issues, the study upgrades GCN to a Hypergraph Convolutional Network (HGCN). Unlike GCN, HGCN uses hyperedges to group multiple joints (for example, connecting hands, waist, and feet simultaneously during lifting actions), thus modeling complex multi-joint dependencies [19, 20]. A single hyperedge in HGCN can, for instance, represent the collaborative relationship of three joints at once, which is crucial for capturing cheerleading-specific movements.
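As a minimal sketch of the difference described above, the following NumPy code builds the normalized adjacency matrix of Eq. (1) for a toy skeleton, applies one propagation step, and then constructs a hypergraph incidence matrix in which a single hyperedge groups the hand, waist, and foot. The five-joint layout, the feature sizes, the random weights, and the function names are illustrative assumptions and are not taken from the study.

```python
import numpy as np

# Hypothetical 5-joint toy skeleton (indices are illustrative only):
# 0 = hand, 1 = elbow, 2 = waist, 3 = knee, 4 = foot
NUM_JOINTS = 5
BONES = [(0, 1), (1, 2), (2, 3), (3, 4)]  # pairwise (binary) connections


def normalized_adjacency(num_nodes, edges):
    """Return D~^{-1/2} (A + I) D~^{-1/2} for an undirected graph."""
    a = np.zeros((num_nodes, num_nodes))
    for i, j in edges:
        a[i, j] = a[j, i] = 1.0
    a_tilde = a + np.eye(num_nodes)       # add self-connections
    deg = a_tilde.sum(axis=1)             # node degrees of A~
    d_inv_sqrt = np.diag(deg ** -0.5)
    return d_inv_sqrt @ a_tilde @ d_inv_sqrt


def gcn_layer(a_hat, h, w):
    """One propagation step of Eq. (1), with ReLU as the activation."""
    return np.maximum(a_hat @ h @ w, 0.0)


def incidence_matrix(num_nodes, hyperedges):
    """Build the |V| x |E| matrix H with H[v, e] = 1 if joint v is in hyperedge e."""
    h = np.zeros((num_nodes, len(hyperedges)))
    for e, joints in enumerate(hyperedges):
        h[list(joints), e] = 1.0
    return h


rng = np.random.default_rng(0)
features = rng.normal(size=(NUM_JOINTS, 3))  # e.g. 3D joint coordinates
weights = rng.normal(size=(3, 8))            # trainable in practice, random here

a_hat = normalized_adjacency(NUM_JOINTS, BONES)
out = gcn_layer(a_hat, features, weights)    # shape (5, 8)

# The first hyperedge groups hand, waist, and foot at once,
# a relation that no single pairwise edge can express.
H = incidence_matrix(NUM_JOINTS, [(0, 2, 4), (0, 1), (3, 4)])
```

In this toy graph the hand and the foot have no direct entry in the adjacency matrix, while the incidence matrix places them in the same hyperedge column.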

3.2. Design of Action Recognition Algorithm for Cheerleaders Based on ISTHC

Traditional HGCN has significant advantages in extracting high-order feature information through hyperedges and can effectively model the relationships among multiple joints. However, it handles coordinated movements of distant joints such as hands and feet poorly, especially when recognizing cheerleading movements in which the upper and lower body move synchronously [21]. Therefore, to accurately identify the technical movements of cheerleading, an improved hypergraph convolutional network (IHGCN) is proposed. The framework structure of IHGCN is shown in Fig. 3.

Fig. 3. IHGCN structure diagram.


In Fig. 3, IHGCN mainly consists of two parts: the SAM module and the topology module. Although HGCN provides a hypergraph structure, its edge weights are static. The study therefore introduces SAM, which dynamically assigns weights based on the importance of joints in specific actions. SAM adaptively strengthens key joints (such as the arms in swinging movements) and weakens secondary joints (such as fine adjustments of the head), thereby improving feature discrimination.

In the IHGCN model, the hypergraph is defined as a triplet consisting of a set of vertices $V$, a set of hyperedges $E$, and a weight matrix $W$ that assigns a weight to each hyperedge. $v$, $e$, and $w$ respectively denote a vertex, a hyperedge, and the weight of a hyperedge. The convolution operation in IHGCN is shown in Eq. (3) [22].

(3)
$ x^{l+1} = \sigma\left(D_v^{\frac{1}{2}}HWD_e^{\frac{1}{2}}H^TD_v^{\frac{1}{2}}x^lw^l\right) = \sigma(H_mx^lw^l). $

In Eq. (3), $\sigma$ represents the nonlinear activation function. $H$ and $H^T$ respectively represent the incidence matrix of the hypergraph and its transpose. $x^l$ represents the feature matrix of the 3D action sequence at layer $l$, and $w^l$ signifies the corresponding trainable parameter matrix. $W$ signifies the weight matrix of the hyperedges. $D_v$ represents the diagonal matrix of vertex degrees. $D_e$ represents the diagonal matrix of hyperedge degrees. $H_m$ represents the adjacency matrix transformed from $H$. Through SAM, the initial correlation matrix of the skeletal joints can be obtained in both the temporal and spatial dimensions. The temporal module assigns an attention weight to each frame of the action sequence, so that important dynamic joints are retained and unimportant ones are suppressed. The output of the self-attention layer after temporal optimization is shown in Eq. (4).

(4)
$ \alpha = softmax\left(\frac{QK^T}{\sqrt{d}}\right). $

In Eq. (4), $\alpha$ represents the attention matrix, $Q$ and $K$ represent the query and key matrices, and $d$ represents their feature dimension. After obtaining the output of the attention layer, the function $\tau$ is used to refine each frame of the hypergraph, as shown in Eq. (5).

(5)
$ H_t = \tau(H_m, \alpha) = H_m \cdot \alpha. $

In Eq. (5), $H_m$ represents the adjacency matrix derived from the hypergraph transformation. $H_t$ represents Time Sparse Hypergraph (TTH). The channel module is similar to a general CNN, which has independent spatial kernels to capture different spatial information. The channel module proposed in this study can dynamically recommend a unique hypergraph convolution kernel for each channel, allowing joint connections under different motion forms to be split and refined. After the convolution operation is completed, it is re-aggregated to obtain the final Channel Sparse Hypergraph (CTH). The process of generating CTH is shown in Eq. (6).

(6)
$ X' = \Omega(M(C(XW_\alpha, XW_\beta), H_m)). $

In Eq. (6), $\Omega$ represents an increasing function, $M$ represents the aggregation function, and $C$ represents the push function. $W_\alpha$ and $W_\beta$ are weight matrices. The construction of the dynamic recommendation is shown in Eq. (7).

(7)
$ Q = C(XW_\alpha, XW_\beta) = XW_\alpha - XW_\beta. $

In Eq. (7), all symbols have the same meanings as before. Reducing the dimensionality of the input data makes the dynamic push easier, so the joint relationship matrix of each sample is easier to obtain. The adjacency matrix of feature channels for different actions is shown in Eq. (8).

(8)
$ H_c = \sigma(QW_\gamma, H_m) = QW_\gamma + H_m. $

In Eq. (8), $W_\gamma$ represents the weight matrix. Due to the complexity of general action temporal series and channel sequences, in order to improve the spatiotemporal relationship of the fused hypergraph joints, the study introduces Spatio-Temporal Hypergraph Convolution (STHC) to establish a spatiotemporal relationship window for multiple joints [23]. Each action sequence is window decomposed and finally combined, as shown in Eq. (9).

(9)
$ H^\zeta = \begin{bmatrix} H & \cdots & H \\ \vdots & \ddots & \vdots \\ H & \cdots & H \end{bmatrix} \in R^{N \times T \times E}. $

In Eq. (9), $H^\zeta$ represents the combined spatiotemporal hypergraph. $\zeta$ represents a hyperedge frame. $R^{N \times T \times E}$ is the spatio-temporal feature matrix obtained by window update, and its dimension is $N \times T \times E$, where $N$ represents the number of joints, $T$ represents the time step, and $E$ represents the number of feature channels. By continuously updating the window, feature $X^*$ can be obtained, as shown in Eq. (10).

(10)
$ X^* = \sigma\left(D_v^{\frac{1}{2}}H^\zeta WD_e^{\frac{1}{2}}(H^\zeta)^TD_v^{\frac{1}{2}}x^lw^l\right). $
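The hypergraph convolution underlying Eqs. (3) and (10) can be sketched in NumPy as follows. The sketch uses the degree normalization that is standard in the hypergraph neural network literature ($D_v^{-1/2}$ on both sides and $D_e^{-1}$ in the middle), which is an assumption about the intended normalization; the toy incidence matrix, the unit hyperedge weights, and the function name are likewise hypothetical. For the spatiotemporal case of Eq. (10), the windowed matrix $H^\zeta$ would take the place of $H$.

```python
import numpy as np


def hypergraph_conv(x, h, edge_w, theta):
    """One hypergraph convolution step (assumed standard normalization).

    x:      (N, C) joint features
    h:      (N, E) incidence matrix; every joint must lie in at least one hyperedge
    edge_w: (E,)   positive hyperedge weights (the diagonal of W)
    theta:  (C, C_out) trainable parameters
    """
    w = np.diag(edge_w)
    d_v = h @ edge_w            # vertex degree: weighted number of incident hyperedges
    d_e = h.sum(axis=0)         # hyperedge degree: number of joints in each hyperedge
    dv_inv_sqrt = np.diag(d_v ** -0.5)
    de_inv = np.diag(1.0 / d_e)
    # h_m plays the role of the joint-to-joint matrix H_m derived from H.
    h_m = dv_inv_sqrt @ h @ w @ de_inv @ h.T @ dv_inv_sqrt
    return np.maximum(h_m @ x @ theta, 0.0), h_m


# Toy setup: 0 = hand, 1 = elbow, 2 = waist, 3 = knee, 4 = foot (hypothetical indices).
hyperedges = [(0, 2, 4), (0, 1), (3, 4)]
h = np.zeros((5, len(hyperedges)))
for e, joints in enumerate(hyperedges):
    h[list(joints), e] = 1.0

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))        # one frame of 3D joint features
theta = rng.normal(size=(3, 4))    # trainable in practice, random here
out, h_m = hypergraph_conv(x, h, np.ones(len(hyperedges)), theta)
```

Because the hand and the foot share the first hyperedge, the corresponding entry of `h_m` is nonzero, so features flow between them in a single layer even though no bone connects them.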

Finally, to reduce timing and channel redundancy, the study integrated TTH and CTH. TTH prunes redundant frames in the temporal sequence (such as transition poses between jumping actions), while CTH optimizes the channel dimension feature map to focus on task related joints. These modules together form the Improved Spatiotemporal Hypergraph Convolution (ISTHC) framework, which achieves unified modeling of spatial, temporal, and channel dimensions. The ISTHC framework is shown in Fig. 4.

Fig. 4. ISTHC structure diagram.


In Fig. 4, the entire model adds the STHC module to the IHGCN framework. Firstly, HGCN performs hypernode feature extraction on the raw technical action data of the cheerleaders. Secondly, TTH is generated in SAM, and CTH is generated in the topology module. Afterwards, STHC performs window segmentation and recombination on the hypergraph obtained by fusing TTH and CTH, in order to extract more diverse motion joints of cheerleaders.
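The refinement steps of Eqs. (4), (5), (7), and (8) can be illustrated with a short NumPy sketch. It is not the study's implementation: the product in Eq. (5) is read here as an element-wise reweighting, a single frame is processed, and all shapes, weight matrices, and function names are hypothetical.

```python
import numpy as np


def softmax(z, axis=-1):
    """Row-wise softmax; the maximum is subtracted first for numerical stability."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def attention_matrix(x, w_q, w_k):
    """Eq. (4): alpha = softmax(Q K^T / sqrt(d)) for joint features x of shape (N, C)."""
    q, k = x @ w_q, x @ w_k
    return softmax(q @ k.T / np.sqrt(q.shape[-1]))


def temporal_refine(h_m, alpha):
    """Eq. (5), read as element-wise reweighting of H_m by alpha (an assumption)."""
    return h_m * alpha


def channel_refine(x, h_m, w_alpha, w_beta, w_gamma):
    """Eqs. (7)-(8): Q = X W_alpha - X W_beta, then H_c = Q W_gamma + H_m."""
    q_rel = x @ w_alpha - x @ w_beta
    return q_rel @ w_gamma + h_m


rng = np.random.default_rng(0)
N, C, R = 5, 3, 2                    # joints, input channels, reduced channels
x = rng.normal(size=(N, C))          # one frame of 3D joint features
h_m = np.full((N, N), 1.0 / N)       # placeholder joint-to-joint matrix

alpha = attention_matrix(x, rng.normal(size=(C, R)), rng.normal(size=(C, R)))
h_t = temporal_refine(h_m, alpha)    # time-refined matrix, shape (N, N)

w_alpha = rng.normal(size=(C, R))
w_gamma = rng.normal(size=(R, N))
h_c = channel_refine(x, h_m, w_alpha, rng.normal(size=(C, R)), w_gamma)
# With identical projections the difference Q vanishes and H_c reduces to H_m.
h_c_same = channel_refine(x, h_m, w_alpha, w_alpha, w_gamma)
```

The last line shows the role of the difference term in Eq. (7): the channel-specific matrix departs from the shared $H_m$ only where the two projections of the input disagree.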

4. The Recognition Effect of Cheerleader Movements Combining Inertial Sensors and ISTHC

To verify the performance of the proposed model, a suitable experimental environment is established, and a series of tests are conducted. Firstly, ablation testing is conducted on the algorithm section. The benchmark performance is compared among similar models. Secondly, each model is applied to practical problems, and the performance of various models in the actual cheerleader action recognition environment is tested.

4.1. The Ablation Test Results of ISTHC Model

A high-performance experimental platform is constructed for comparative testing. Its hardware configuration includes an Intel® Core™ i9-10900K CPU @ 3.70GHz$\times$32 and an NVIDIA GeForce RTX 3080 GPU. All experiments are run in the PyTorch framework with weight decay set to 0.002. Two publicly available skeleton datasets are selected for the experiment, PoseTrack and DanceTrack. PoseTrack is a video sequence dataset containing over 150,000 human motion poses, suitable for various motion scenarios. DanceTrack is a large dataset focused on dance movements, containing 50 different types of dance movements with a total of approximately 70,000 samples. A total of 20,000 preprocessed valid samples are collected from the two datasets and divided into training and testing sets in an 8:2 ratio. The final ISTHC model is composed of multiple components, including HGCN, SAM, TTH, CTH, and STHC, so ablation experiments are first used to test the impact of the different modules on the whole model. Four combinations are compared: HGCN-SAM, HGCN-SAM-TTR (HGCN-SAM with the temporal module), HGCN-SAM-CTH, and ISTHC. The action recognition accuracy is shown in Fig. 5.

Figs. 5(a) and 5(b) show the recognition accuracy of the four ablation combinations on the training and testing sets, respectively. From Fig. 5(a), HGCN combined with SAM performed worst of the four combinations, with a highest recognition accuracy on the training set of only 94.1%. Adding the TTR and CTH modules separately to HGCN-SAM gave HGCN-SAM-TTR and HGCN-SAM-CTH, whose highest recognition accuracies on the training set were 95.8% and 96.2%, respectively, so each module clearly improved the overall performance of the algorithm. As shown in Fig. 5(b), ISTHC had the best recognition performance of the four combinations, with a recognition accuracy of 97.9% on the testing set. The recognition times of the four ablation combinations on different datasets during testing are compared in Fig. 6.

Fig. 5. Recognition accuracy of different combinations.


Fig. 6. Recognition time of different combinations.


Fig. 6 shows how the recognition time of the four ablation combinations varied during testing. As the sample size increased, the recognition time of all four combinations fluctuated to varying degrees, and ISTHC fluctuated the least. Overall, the average recognition times of HGCN-SAM, HGCN-SAM-TTR, HGCN-SAM-CTH, and ISTHC over 1000 test samples were 0.19 s, 0.16 s, 0.15 s, and 0.07 s, respectively. ISTHC therefore has the lowest average recognition time of the four combinations, indicating faster and more efficient recognition.
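An average recognition time of this kind can be measured with a simple wall-clock loop. The sketch below shows one possible way to do so in Python; the `dummy_model` function and the synthetic samples are stand-ins for a trained classifier and real pose data, and the study does not state how its timings were obtained.

```python
import statistics
import time


def mean_latency(predict, samples):
    """Return the mean wall-clock time in seconds per call of predict over samples."""
    durations = []
    for sample in samples:
        start = time.perf_counter()
        predict(sample)
        durations.append(time.perf_counter() - start)
    return statistics.mean(durations)


def dummy_model(sample):
    """Hypothetical stand-in for a real action classifier."""
    return sum(sample) > 0


# 1000 synthetic samples, mirroring the test size quoted above.
samples = [[0.1 * i, -0.2 * i, 0.3] for i in range(1000)]
avg_seconds = mean_latency(dummy_model, samples)
```

When timing a GPU model, the device should be synchronized before each clock reading, otherwise asynchronous kernel launches make the per-sample time appear shorter than it is.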

4.2. Performance Comparison Testing of Different Algorithms

In addition to the ablation tests, the study compares ISTHC with other models from related fields to further verify its performance. Traditional HGCN, the model from reference [24], and the improved GCN from reference [25] are selected as comparison models, with loss values as the reference indicator. The loss curves of the four models on the training and testing sets are shown in Fig. 7.

Figs. 7(a) and 7(b) show the changes in loss curves of HGCN, reference [24], reference [25], and ISTHC in the training and testing sets, respectively. From Fig. 7, the loss values of ISTHC reached a stable state quickly in both datasets, while the other three models required longer iterations to reach stability. According to Fig. 7(a), HGCN, reference [24], reference [25], and ISTHC reached a stable state after 118, 82, 86, and 37 iterations, respectively. Similarly, in Fig. 7(b), HGCN, reference [24], reference [25], and ISTHC reached a stable state after 112, 85, 89, and 56 iterations, respectively. Overall, ISTHC has the best iterative performance and good stability, which can quickly adapt to the action recognition work of cheerleaders. In order to more accurately quantify the performance comparison results of various models, the study tests the above models using precision, recall, F1, Mean Squared Error (MSE), and Mean Absolute Error (MAE) as reference indicators, as displayed in Table 1.

Fig. 7. Loss curves of the four comparison models.


Table 1. Benchmark Performance of Four Comparison Models

| Algorithm | Precision | Recall | F1 | MSE | MAE |
|---|---|---|---|---|---|
| HGCN | 0.75 | 0.82 | 0.78 | 0.43 | 0.58 |
| Reference [24] | 0.86 | 0.88 | 0.87 | 0.25 | 0.35 |
| Reference [25] | 0.89 | 0.90 | 0.91 | 0.26 | 0.32 |
| ISTHC | 0.96 | 0.98 | 0.97 | 0.12 | 0.25 |

According to Table 1, ISTHC had better benchmark performance compared with HGCN, reference [24], and reference [25], with precision, recall, and F1 values as high as 0.96, 0.98, and 0.97, respectively. Traditional HGCN had the worst benchmark performance, with precision, recall, and F1 values as low as 0.75, 0.82, and 0.78, respectively. In addition, ISTHC also performed the best in terms of error, with MSE values of 0.43, 0.25, 0.26, and 0.12 for HGCN, reference [24], reference [25], and ISTHC, and MAE values of 0.58, 0.35, 0.32, and 0.25, respectively. In summary, the benchmark performance of the new algorithm proposed in the study is the best, with more stable and superior recognition ability compared with similar recognition algorithms.
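For reference, F1 is the harmonic mean of precision and recall, $F1 = 2PR/(P+R)$. The short Python sketch below recomputes it from the precision and recall columns of Table 1; the dictionary layout and names are illustrative. The harmonic mean reproduces the reported F1 for HGCN, reference [24], and ISTHC, and gives 0.89 for reference [25], where the table lists 0.91. Such a gap can arise if F1 is averaged per class rather than computed from aggregate precision and recall; the study does not state which averaging was used.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from Table 1.
TABLE_1 = {
    "HGCN": (0.75, 0.82, 0.78),
    "Reference [24]": (0.86, 0.88, 0.87),
    "Reference [25]": (0.89, 0.90, 0.91),
    "ISTHC": (0.96, 0.98, 0.97),
}

harmonic = {name: round(f1_score(p, r), 2) for name, (p, r, _) in TABLE_1.items()}
# harmonic == {"HGCN": 0.78, "Reference [24]": 0.87, "Reference [25]": 0.89, "ISTHC": 0.97}
```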

4.3. Analysis of Model Application Effectiveness

To verify the recognition performance of the ISTHC in practical applications, the study first collects and preprocesses the 3D pose data of cheerleaders based on inertial sensors. Secondly, the posture dataset is divided into four categories: basic actions, formation transformation actions, strength actions, and difficulty actions, each containing different specific actions. The processed pose dataset is used to test the performance of four models in practical applications. The accuracy of different models in recognizing various actions is displayed in Table 2.

Table 2. The Accuracy of Different Algorithm Models to Recognize Various Actions (%)

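| Action classification | Specific action | HGCN | Reference [24] | Reference [25] | ISTHC |
|---|---|---|---|---|---|
| Basic Movements (Action 1) | Jumping | 85.2 | 88.6 | 91.3 | 98.8 |
| Basic Movements (Action 1) | Kick | 83.4 | 88.9 | 90.8 | 98.5 |
| Basic Movements (Action 1) | Swinging arm | 84.1 | 89.4 | 92.1 | 99.2 |
| Formation Change Action (Action 2) | Global transformation | 85.5 | 89.2 | 89.6 | 96.3 |
| Formation Change Action (Action 2) | Partial transformation | 81.9 | 88.7 | 90.2 | 97.1 |
| Power Movement (Action 3) | Lift up | 82.7 | 86.9 | 91.1 | 97.5 |
| Power Movement (Action 3) | Toss joint | 83.3 | 86.3 | 89.8 | 96.8 |
| Difficult Moves (Action 4) | Writhe | 82.8 | 88.1 | 90.5 | 97.7 |
| Difficult Moves (Action 4) | Multiple flip | 84.5 | 86.7 | 89.5 | 96.6 |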
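The per-model averages over the nine actions in Table 2 can be recomputed with a few lines of Python. The data structure and names below are illustrative; the numbers are copied from the table. The ISTHC average comes to 97.6%, which matches the average accuracy quoted in the contributions.

```python
from statistics import mean

# Columns: HGCN, Reference [24], Reference [25], ISTHC (accuracy in %), from Table 2.
TABLE_2 = {
    "Jumping": (85.2, 88.6, 91.3, 98.8),
    "Kick": (83.4, 88.9, 90.8, 98.5),
    "Swinging arm": (84.1, 89.4, 92.1, 99.2),
    "Global transformation": (85.5, 89.2, 89.6, 96.3),
    "Partial transformation": (81.9, 88.7, 90.2, 97.1),
    "Lift up": (82.7, 86.9, 91.1, 97.5),
    "Toss joint": (83.3, 86.3, 89.8, 96.8),
    "Writhe": (82.8, 88.1, 90.5, 97.7),
    "Multiple flip": (84.5, 86.7, 89.5, 96.6),
}
MODELS = ("HGCN", "Reference [24]", "Reference [25]", "ISTHC")

# Mean accuracy of each model across all nine actions, rounded to one decimal.
mean_accuracy = {
    model: round(mean(row[i] for row in TABLE_2.values()), 1)
    for i, model in enumerate(MODELS)
}
# mean_accuracy == {"HGCN": 83.7, "Reference [24]": 88.1, "Reference [25]": 90.5, "ISTHC": 97.6}
```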

According to Table 2, the accuracy of ISTHC in identifying 9 different cheerleading movements was above 95%, with the highest accuracy of 99.2% for identifying swinging arm action. The accuracy of identifying global transformation was relatively low, only at 96.3%. The accuracy of the model in reference [25] in identifying various cheerleading movements reached as high as 92.1% and as low as 89.5%. The accuracy of the model in reference [24] in identifying various cheerleading movements reached as high as 89.4% and as low as 86.3%. The actual performance of the HGCN is the worst, with the highest accuracy of identifying various cheerleading movements being only 85.5% and the lowest being 81.9%. The human bone matching points of cheerleading movements identified by four models are compared, as shown in Fig. 8.

The four subgraphs in Fig. 8 respectively show the bone point matching of HGCN, reference [24], reference [25], and ISTHC models for identifying a cheerleading action. Based on Fig. 8, ISTHC performs the best in matching bone points in practical recognition problems, which can accurately identify various key bone points in cheerleading movements.

Fig. 8. Bone point matching performance of cheerleading movements recognized by different models.


Table 3 compares five algorithms (OpenPose, 3D CNN, ST-GCN, Transformer, and ISTHC) on human pose estimation tasks along three dimensions: accuracy, real-time performance, and computational efficiency. In terms of accuracy, ISTHC performs best in the conventional, occlusion, and low-light scenes, with accuracy rates of 97.60%, 94.50%, and 93.80%, respectively, while every other algorithm is less accurate in all three scenes. In terms of real-time performance, ISTHC has the lowest processing delay (0.1 ms/frame) and the highest maximum frame rate (1000 fps), far ahead of the other algorithms, while OpenPose and Transformer have the longest delays and the lowest frame rates. In terms of computational efficiency, ISTHC has the smallest number of parameters (1.2M) and the shortest training time (4.9 hours), whereas 3D CNN and OpenPose have the most parameters and the longest training times. Overall, ISTHC performs best in the task of human pose estimation, especially in accuracy and real-time performance, while the other algorithms each have shortcomings in some aspects.

Table 3. Comparative analysis with the existing techniques.

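| Index | Contrast dimension | OpenPose | 3D CNN | ST-GCN | Transformer | ISTHC |
|---|---|---|---|---|---|---|
| Accuracy rate (%) | Conventional scene | 92.10 | 88.70 | 89.30 | 90.50 | 97.60 |
| Accuracy rate (%) | Occlusion scene | 78.20 | 82.40 | 85.10 | 83.80 | 94.50 |
| Accuracy rate (%) | Low-light scene | 65.30 | 72.10 | 80.20 | 77.60 | 93.80 |
| Real-time performance | Processing delay (ms/frame) | 50 | 35 | 25 | 40 | 0.1 |
| Real-time performance | Maximum frame rate (fps) | 20 | 28 | 40 | 25 | 1000 |
| Computational efficiency | Parameter quantity (M) | 25.5 | 18.3 | 4.7 | 12.1 | 1.2 |
| Computational efficiency | Training time (h) | 15.2 | 12.8 | 8.2 | 11.5 | 4.9 |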
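When frames are processed one at a time, the processing delay and the maximum frame rate in Table 3 are linked by the reciprocal relation fps = 1000 / delay (with the delay in milliseconds). The Python sketch below applies this relation to the delay row of the table; the function and variable names are illustrative. The relation reproduces the listed frame rates for OpenPose, 3D CNN, ST-GCN, and Transformer (28.6 fps against a listed 28 for 3D CNN). For ISTHC it gives 10000 fps for 0.1 ms/frame, whereas the table lists 1000 fps, so the two ISTHC figures may have been measured under different conditions; the study does not say.

```python
def max_frame_rate(delay_ms):
    """Upper bound on frames per second when frames are processed sequentially."""
    return 1000.0 / delay_ms


# Processing delays in ms/frame, copied from Table 3.
DELAYS_MS = {"OpenPose": 50, "3D CNN": 35, "ST-GCN": 25, "Transformer": 40, "ISTHC": 0.1}

fps = {name: round(max_frame_rate(delay), 1) for name, delay in DELAYS_MS.items()}
```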

5. Conclusion

To improve the recognition of cheerleaders’ technical movements, an ISTHC algorithm model was developed by combining inertial sensors, posture motion capture technology, and an improved HGCN. Because ISTHC consists of multiple modules, ablation testing was conducted first. ISTHC achieved higher recognition accuracy than the other combinations on the training and testing sets, at 98.3% and 97.9%, respectively. Compared with traditional HGCN, reference [24], and reference [25], ISTHC reached a stable state on the training and testing sets after only 37 and 56 iterations, respectively, indicating faster and more stable convergence. In addition, ISTHC achieved precision, recall, and F1 values of 0.96, 0.98, and 0.97 in benchmark performance tests, with MSE and MAE values as low as 0.12 and 0.25, respectively. In the confusion matrix, the recognition scores of ISTHC for the five types of cheerleading movements were all above 90 points, far higher than those of the other three comparison models. In practical applications, the accuracy of ISTHC in recognizing different movements reached up to 99.2%, with an average recognition time of less than 0.1 s. It accurately identifies the key movements of cheerleaders and matches skeletal points more accurately. In summary, ISTHC demonstrates significant superiority and feasibility in cheerleading technical action recognition. However, cheerleading contains numerous combination movements, which can reduce the effectiveness of action recognition. Subsequent research can therefore extend the recognition and analysis to various combination actions, further broadening the application scope of this technology.

Acknowledgment

This research was funded by the 2023 Scientific Research Project of the Education Department of Hunan Province, “Research on the Construction Ideas and Promotion Strategies of Physical Education Curriculum Ideological and Political Education in Ordinary Colleges and Universities in Hunan Province” (23B1032), and the 2025 Project of the Hunan Social Science Achievement Review Committee, “Research on the Construction and Enhancement Strategies of Core Competencies of Physical Education Teachers in Colleges and Universities in the New Era” (XSP25YBC091).

References

[1] Y. Xing, J. Zhu, Y. Li, J. Huang, J. Song, An improved spatial temporal graph convolutional network for robust skeleton-based action recognition, Applied Intelligence, Vol. 53, No. 4, pp. 4592-4608, 2023.
[2] H. Mokayed, T. Z. Quan, L. Alkhaled, V. Sivakumar, Real-time human detection and counting system using deep learning computer vision techniques, Artificial Intelligence and Applications, Vol. 1, No. 4, pp. 221-229, 2023.
[3] Q. Zhu, H. Deng, Spatial adaptive graph convolutional network for skeleton-based action recognition, Applied Intelligence, Vol. 53, No. 14, pp. 17796-17808, 2023.
[4] Q. Cheng, J. Cheng, Z. Ren, Q. Zhang, J. Liu, Multi-scale spatial-temporal convolutional neural network for skeleton-based action recognition, Pattern Analysis and Applications, Vol. 26, No. 3, pp. 1303-1315, 2023.
[5] M. Rahevar, A. Ganatra, Spatial-temporal gated graph attention network for skeleton-based action recognition, Pattern Analysis and Applications, Vol. 26, No. 3, pp. 929-939, 2023.
[6] B. Candan, H. E. Soken, Robust attitude estimation using magnetic and inertial sensors, IFAC-PapersOnLine, Vol. 56, No. 2, pp. 4502-4507, 2023.
[7] Q. Meng, D. Han, Z. Wang, A model-free method for attitude estimation and inertial parameter identification of a noncooperative target, Advances in Space Research, Vol. 71, No. 3, pp. 1735-1751, 2023.
[8] X. Li, Q. Xu, Y. Tang, C. Hu, J. Niu, C. Xu, Unmanned aerial vehicle position estimation augmentation using optical flow sensor, IEEE Sensors Journal, Vol. 23, No. 13, pp. 14773-14780, 2023.
[9] T. Li, H. Yu, Upper body pose estimation using a visual-inertial sensor system with automatic sensor-to-segment calibration, IEEE Sensors Journal, Vol. 23, No. 6, pp. 6292-6302, 2023.
[10] W. Yang, J. Zhang, J. Cai, Z. Xu, HybridNet: integrating GCN and CNN for skeleton-based action recognition, Applied Intelligence, Vol. 53, No. 1, pp. 574-585, 2023.
[11] K. Hu, J. Jin, C. Shen, M. Xia, L. Weng, Attentional weighting strategy-based dynamic GCN for skeleton-based action recognition, Multimedia Systems, Vol. 29, No. 4, pp. 1941-1954, 2023.
[12] J. Zhang, W. Xie, C. Wang, R. Tu, Z. Tu, Graph-aware transformer for skeleton-based action recognition, The Visual Computer, Vol. 39, No. 10, pp. 4501-4512, 2023.
[13] Y. Liu, H. Zhang, Y. Li, K. He, D. Xu, Skeleton-based human action recognition via large-kernel attention graph convolutional network, IEEE Transactions on Visualization and Computer Graphics, Vol. 29, No. 5, pp. 2575-2585, 2023.
[14] L. Weng, W. Lou, X. Shen, F. Gao, A 3D graph convolutional networks model for 2D skeleton-based human action recognition, IET Image Processing, Vol. 17, No. 3, pp. 773-783, 2023.
[15] M. Lovanshi, V. Tiwari, Human skeleton pose and spatio-temporal feature-based activity recognition using ST-GCN, Multimedia Tools and Applications, Vol. 83, No. 5, pp. 12705-12730, 2024.
[16] D. Dhahbane, S. Sakhi, A. Nemra, Hardware implementation of attitude estimation methods using multiple GPS receivers, Unmanned Systems, Vol. 11, No. 4, pp. 301-315, 2023.
[17] J. Yu, S. Xian, Z. Zhang, X. Hou, J. He, J. Mu, X. Chou, Synergistic piezoelectricity enhanced BaTiO3/polyacrylonitrile elastomer-based highly sensitive pressure sensor for intelligent sensing and posture recognition applications, Nano Research, Vol. 16, No. 4, pp. 5490-5502, 2023.
[18] M. Ciccarelli, F. Corradini, M. Germani, G. Menchi, L. Mostarda, A. Papetti, M. Piangerelli, SPECTRE: a deep learning network for posture recognition in manufacturing, Journal of Intelligent Manufacturing, Vol. 34, No. 8, pp. 3469-3481, 2023.
[19] S. Mefteh, M. B. Kaaniche, R. Ksantini, A. Bouhoula, A novel multispectral corner detector and a new local descriptor: an application to human posture recognition, Multimedia Tools and Applications, Vol. 82, No. 19, pp. 28937-28956, 2023.
[20] H. Zhang, X. Liu, D. Yu, L. Guan, D. Wang, C. Ma, Z. Hu, Skeleton-based action recognition with multi-stream, multi-scale dilated spatial-temporal graph convolution network, Applied Intelligence, Vol. 53, No. 14, pp. 17629-17643, 2023.
[21] C. L. Yang, S. C. Hsu, Y. W. Hsu, Y. C. Kang, HAR-time: human action recognition with time factor analysis on worker operating time, International Journal of Computer Integrated Manufacturing, Vol. 36, No. 8, pp. 1219-1237, 2023.
[22] D. T. Pham, Q. T. Pham, T. T. Nguyen, T. L. Le, H. Vu, A lightweight graph convolutional network for skeleton-based action recognition, Multimedia Tools and Applications, Vol. 82, No. 2, pp. 3055-3079, 2023.
[23] L. Yu, L. Tian, Q. Du, J. A. Bhutto, Multi-stream adaptive 3D attention graph convolution network for skeleton-based action recognition, Applied Intelligence, Vol. 53, No. 12, pp. 14838-14854, 2023.
[24] Y. Xie, S. Li, C. T. Wu, Z. Lai, M. Su, A novel hypergraph convolution network for wafer defect patterns identification based on an unbalanced dataset, Journal of Intelligent Manufacturing, Vol. 35, No. 2, pp. 633-646, 2024.
[25] P. Xuan, S. Lu, H. Cui, S. Wang, T. Nakaguchi, T. Zhang, Learning association characteristics by dynamic hypergraph and gated convolution enhanced pairwise attributes for prediction of disease-related lncRNAs, Journal of Chemical Information and Modeling, Vol. 64, No. 8, pp. 3569-3578, 2024.

Yulin Kuang

Yulin Kuang was born in July 1990 in Chenzhou, Hunan Province, China. She received her bachelor’s degree in sports training from Hunan Normal University in 2013 and her master’s degree in physical education from Hunan Normal University in 2015. Her research interests are physical education teaching and sports training. Since 2015, she has been a full-time physical education teacher in the School of General Education, Hunan University of Information Technology. She has published 8 academic papers, has participated in 6 scientific research projects and 6 patents, and has received 3 academic awards.