To improve the recognition of cheerleaders’ technical movements, the study first combines
inertial sensors, posture motion capture technology, and a human skeleton graph to
collect 3D posture data of cheerleaders. Second, the graph convolutional network (GCN)
is optimized and applied to the extracted 3D pose data to build the cheerleader action
recognition model.
3.1. 3D Pose Data Acquisition Based on Inertial Sensors and Pose Motion Capture
Cheerleading is a comprehensive sports activity that combines dance, gymnastics, and
music, emphasizing the smoothness and coordination of movements. Because cheerleading
movements are complex and diverse, there are non-physical dependencies between body
joints that are not directly connected, which increases the difficulty of action
recognition [14]. To accurately extract and analyze these joint relationships, the
study introduces a human skeleton graph, which intuitively represents the connections
and interactions among the joints in cheerleading exercises. The skeleton graph of
a cheerleading action is shown in Fig. 1.
Fig. 1. Skeleton of cheerleading athletes.
Fig. 1 shows the original movement and the simplified skeleton of a cheerleader. In
Fig. 1, the skeleton vertices represent joint points, while the connecting lines
represent the edges of the skeleton graph. Once fully connected, the graph can represent
the attribute relationships between joints. Cheerleading exercises involve highly coordinated
multi-joint movements (such as synchronized arm swings and leg jumps) with non-physical
dependencies between joints. Traditional action recognition methods, such as convolutional
neural networks (CNNs), struggle to model such complex spatial relationships because
they cannot directly process non-Euclidean data. Therefore, this study chose GCN as
the basic framework. GCN excels at processing graph-structured data (such as human
skeleton graphs) and propagates features by encoding joint connections in an adjacency
matrix. This network structure extends CNN from regular grids to unordered graphs
of arbitrary structure, using skeleton information images as input to analyze the
motion patterns of target objects [15, 16]. The main structure of GCN is shown in Fig. 2.
Fig. 2. GCN structure diagram.
The GCN structure in Fig. 2 is similar to the CNN structure, consisting of an input
layer, a graph convolutional layer, and an output layer. GCN computes the new feature
vector of each node as a weighted average of the features of the node and its neighboring
nodes; the degree-based normalization increases the relative weight of nodes with
lower degrees. These feature vectors are then trained through the neural network so
that the structural information of the graph is used effectively [17]. The weighted
averaging can be implemented as a matrix multiplication. The propagation mode between
GCN layers is shown in Eq. (1) [18].
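Assuming Eq. (1) follows the standard layer-wise propagation rule for GCN, which is consistent with the symbols defined in the text, it can be written as follows. The identity matrix $I$ and the explicit degree definition are stated here as assumptions of that standard form.

```latex
H^{(l+1)} = \sigma\!\left( \tilde{D}^{-\frac{1}{2}} \tilde{A}\, \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)} \right),
\qquad \tilde{A} = A + I,
\qquad \tilde{D}_{ii} = \sum_{j} \tilde{A}_{ij}
```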
In Eq. (1), $A$ represents the adjacency matrix. $\tilde{A} = A + I$ represents the
adjacency matrix with the identity matrix $I$ added, so that each node is also connected
to itself. $H^{(l)}$ represents the features of layer $l$. $\sigma$ signifies the
nonlinear activation function. $W^{(l)}$ represents the trainable parameter matrix
of layer $l$. $\tilde{D}$ represents the degree matrix of $\tilde{A}$.
$\tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}}$ represents the symmetric
normalization of $\tilde{A}$. The propagation mode $z$ of the entire GCN is shown
in Eq. (2).
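The propagation described by Eqs. (1) and (2) can be sketched in NumPy as follows. This is a minimal illustration, not the study's implementation: it assumes the standard two-layer form with a ReLU hidden layer and a softmax output, and the three-joint chain, the random weights, and all function names are hypothetical.

```python
import numpy as np

def normalize_adjacency(A):
    """Return D~^{-1/2} (A + I) D~^{-1/2} for a symmetric adjacency matrix A."""
    A_tilde = A + np.eye(A.shape[0])           # add self-connections
    deg = A_tilde.sum(axis=1)                  # diagonal entries of D~
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(A_hat, H, W):
    """One propagation step sigma(A_hat H W), with ReLU as the assumed sigma."""
    return np.maximum(A_hat @ H @ W, 0.0)

def softmax(x):
    """Row-wise softmax, shifted for numerical stability."""
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def two_layer_gcn(A, X, W0, W1):
    """Assumed two-layer form: softmax(A_hat ReLU(A_hat X W0) W1)."""
    A_hat = normalize_adjacency(A)
    hidden = gcn_layer(A_hat, X, W0)
    return softmax(A_hat @ hidden @ W1)

# Hypothetical toy skeleton: hand - elbow - shoulder (3 joints in a chain).
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 3))    # one 3D coordinate per joint
W0 = rng.normal(size=(3, 4))   # input-layer parameters
W1 = rng.normal(size=(4, 2))   # output-layer parameters (2 hypothetical classes)
z = two_layer_gcn(A, X, W0, W1)
print(z.shape)  # one row of class scores per joint
```

Because the hand has degree 2 and the elbow degree 3 after self-connections are added, the hand-elbow entry of the normalized matrix is $1/\sqrt{6}$, which shows how lower-degree nodes receive relatively larger weights.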
In Eq. (2), $W^{(0)}$ signifies the trainable parameters of the input layer. $X$ represents
the multidimensional feature matrix. However, traditional GCN has two major limitations
in cheerleading action recognition. First, the predefined adjacency matrix of GCN
can only represent pairwise joint relationships (such as “hand-elbow”) and cannot model
high-order interactions (such as the collaborative relationship among “hand-waist-foot”
in lifting actions). Second, cheerleading movements have spatiotemporal characteristics
and require joint analysis of multi-frame motion sequences. To address these issues,
the study upgraded GCN to a Hypergraph Convolutional Network (HGCN). Unlike GCN, HGCN
uses hyperedges to group multiple joints (such as simultaneously connecting the hands,
waist, and feet during lifting actions), thus modeling complex multi-joint dependencies
[19, 20]. For example, a single hyperedge in HGCN can represent the collaborative
relationship of three joints, which is crucial for capturing cheerleading-specific
movements.
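The difference between a pairwise edge and a hyperedge can be made concrete with an incidence matrix, in which each column lists the joints that belong to one hyperedge. The sketch below is illustrative only: the five-joint skeleton, the choice of hyperedges, and the helper name `incidence_matrix` are assumptions, not the study's actual configuration.

```python
import numpy as np

# Hypothetical simplified skeleton with five joints.
JOINTS = ["hand", "elbow", "waist", "knee", "foot"]

def incidence_matrix(num_vertices, hyperedges):
    """Build an (N, E) matrix with H[v, e] = 1 if joint v belongs to hyperedge e."""
    H = np.zeros((num_vertices, len(hyperedges)))
    for e, members in enumerate(hyperedges):
        H[list(members), e] = 1.0
    return H

hyperedges = [
    (0, 1),      # ordinary pairwise relation: hand-elbow
    (0, 2, 4),   # hyperedge: hand-waist-foot, as in a lifting action
    (2, 3, 4),   # hyperedge: waist-knee-foot
]
H = incidence_matrix(len(JOINTS), hyperedges)

vertex_degree = H.sum(axis=1)  # number of hyperedges each joint belongs to
edge_degree = H.sum(axis=0)    # number of joints grouped by each hyperedge
print(vertex_degree)
print(edge_degree)
```

An adjacency matrix could only record the hand-waist-foot relation as three separate pairs, whereas the second column of `H` keeps the three joints together as one unit.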
3.2. Design of Action Recognition Algorithm for Cheerleaders Based on ISTHC
Traditional HGCN has significant advantages in extracting high-order feature information
using hyperedges and can effectively model the relationships among multiple joints.
However, it has shortcomings in handling coordinated movements of distant joints such
as the hands and feet, especially in recognizing cheerleading movements that involve
synchronized movements of the upper and lower body [21]. Therefore, to accurately
identify the technical movements of cheerleading, an improved hypergraph convolutional
network (IHGCN) is proposed. The framework structure of IHGCN is shown in Fig. 3.
Fig. 3. IHGCN structure diagram.
In Fig. 3, IHGCN mainly consists of two parts: the SAM module and the topology module.
Although HGCN provides a hypergraph structure, its edge weights are static. The study
therefore introduced SAM, which dynamically assigns weights based on the importance
of joints in specific actions. SAM adaptively strengthens key joints (such as the
arms in swinging movements) and weakens secondary joints (such as fine adjustments
of the head), thereby improving feature discrimination.
In the IHGCN model, the hypergraph is described by a triplet consisting of a set of
vertices $V$, a set of hyperedges $E$, and a weight matrix $W$ that assigns a weight
to each hyperedge. $v$, $e$, and $w$ respectively represent a vertex, a hyperedge,
and the weight of a hyperedge. The convolution operation in IHGCN is shown in Eq. (3) [22].
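A hypergraph convolution of this kind can be sketched as follows, assuming the widely used normalized form $\sigma(D_v^{-1/2} H W D_e^{-1} H^T D_v^{-1/2} X \Theta)$ with ReLU as $\sigma$. Whether Eq. (3) matches this form exactly is an assumption; the parameter name `Theta`, the toy incidence matrix, and the function names are hypothetical.

```python
import numpy as np

def hypergraph_operator(H, edge_weights):
    """Return D_v^{-1/2} H W D_e^{-1} H^T D_v^{-1/2}.

    H: (N, E) incidence matrix, edge_weights: (E,) diagonal of W.
    Assumes every joint belongs to at least one hyperedge.
    """
    d_v = H @ edge_weights                 # weighted vertex degrees
    d_e = H.sum(axis=0)                    # hyperedge degrees
    scaled = H / np.sqrt(d_v)[:, None]     # D_v^{-1/2} H
    return (scaled * (edge_weights / d_e)) @ scaled.T

def hypergraph_conv(X, H, edge_weights, Theta):
    """One hypergraph convolution step with ReLU as the assumed activation."""
    G = hypergraph_operator(H, edge_weights)
    return np.maximum(G @ X @ Theta, 0.0)

# Hypothetical toy hypergraph: 5 joints, 3 hyperedges (one pair, two triples).
H = np.array([[1, 1, 0],
              [1, 0, 0],
              [0, 1, 1],
              [0, 0, 1],
              [0, 1, 1]], dtype=float)
edge_weights = np.ones(3)                  # static, equal hyperedge weights
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))                # one 3D coordinate per joint
Theta = rng.normal(size=(3, 8))            # trainable projection (hypothetical size)
out = hypergraph_conv(X, H, edge_weights, Theta)
print(out.shape)
```

In this sketch `edge_weights` is fixed; replacing it with values produced by an attention module is one way to obtain the dynamic weighting that the text attributes to SAM.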
In Eq. (3), $\sigma$ represents the nonlinear activation function. $H$ and $H^T$ respectively
represent the hypergraph incidence matrix and its transpose. $w^l$ signifies the weight
matrix of the 3D action sequence. $x^l$ represents the parameter matrix of the 3D
action sequence. $W$ signifies the weight matrix of the hyperedges. $W^l$ signifies
the parameter matrix of the hyperedges. $D_v$ represents the diagonal matrix of vertex
degrees. $D_e$ represents the diagonal matrix of hyperedge degrees. $H_m$ represents
the adjacency matrix transformed from $H$. Through SAM, the initial correlation matrix
of the skeleton joints can be obtained in both the temporal and spatial dimensions.
The temporal channel module assigns an attention weight to each frame of the video
action sequence, so that important dynamic joints are identified and unimportant ones
are eliminated. The output of the self-attention layer after temporal channel optimization
is shown in Eq. (4).
In Eq. (4), $\alpha$ represents the attention matrix. After obtaining the output of the attention
layer, the function $\tau$ is used to refine each frame of the hypergraph, as shown
in Eq. (5).
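The idea of weighting frames by attention and then refining the sequence can be illustrated with the sketch below. It is only one plausible reading, not the form of Eqs. (4) and (5): it assumes softmax-normalized frame scores for $\alpha$ and uses a top-$k$ selection as a hypothetical stand-in for $\tau$. The scoring vector `w` and both function names are invented for illustration.

```python
import numpy as np

def frame_attention(X, w):
    """Compute one attention weight per frame.

    X: (T, N, C) sequence of T frames, N joints, C channels.
    w: (C,) hypothetical scoring vector.
    Returns alpha of shape (T,), non-negative and summing to 1.
    """
    scores = X.mean(axis=1) @ w            # pool joints, score each frame
    e = np.exp(scores - scores.max())      # stable softmax over frames
    return e / e.sum()

def keep_top_frames(alpha, k):
    """Hypothetical stand-in for tau: mask of the k highest-weight frames."""
    idx = np.argsort(alpha)[-k:]
    mask = np.zeros(alpha.shape, dtype=bool)
    mask[idx] = True
    return mask

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 5, 3))            # 10 frames, 5 joints, 3D coordinates
w = rng.normal(size=3)
alpha = frame_attention(X, w)
mask = keep_top_frames(alpha, k=6)
X_sparse = X[mask]                         # temporally sparse sequence
print(X_sparse.shape)
```

Frames with low attention weight, such as transition poses, are the ones dropped by the mask, and the original frame order is preserved among the frames that remain.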
In Eq. (5), $H_m$ represents the adjacency matrix derived from the hypergraph transformation.
$H_t$ represents the Time Sparse Hypergraph (TTH). The channel module is similar to
a general CNN, which has independent spatial kernels to capture different spatial
information. The channel module proposed in this study dynamically recommends a unique
hypergraph convolution kernel for each channel, allowing joint connections under different
motion forms to be split and refined. After the convolution operation is completed,
the per-channel results are re-aggregated to obtain the final Channel Sparse Hypergraph
(CTH). The process of generating CTH is shown in Eq. (6).
In Eq. (6), $\Omega$ represents an increasing function. $M$ represents the aggregation function.
$C$ represents the push function. $W_\alpha$ and $W_\beta$ are both weight matrices.
The construction mode of dynamic recommendation is shown in Eq. (7).
In Eq. (7), all symbols have the same meanings as before. Reducing the dimensionality
of the input data lowers the difficulty of dynamic push and makes it easier to obtain
the joint relationship matrix for each sample. The adjacency matrix of feature channels
for different actions is shown in Eq. (8).
In Eq. (8), $W_\gamma$ represents the weight matrix. Because the temporal and channel
sequences of general actions are complex, the study introduces Spatio-Temporal Hypergraph
Convolution (STHC), which establishes a spatiotemporal relationship window for multiple
joints, to strengthen the spatiotemporal relationships among the joints of the fused
hypergraph [23]. Each action sequence is decomposed into windows, which are finally
recombined, as shown in Eq. (9).
In Eq. (9), $H^\zeta$ represents the combined spatiotemporal hypergraph. $\zeta$ represents
a hyperedge frame. $R^{N \times T \times E}$ is the spatio-temporal feature matrix
obtained by window update, with dimensions $N \times T \times E$, where $N$ represents
the number of joints, $T$ represents the number of time steps, and $E$ represents
the number of feature channels. By continuously updating the window, the feature $X^*$
can be obtained, as shown in Eq. (10).
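The window decomposition and recombination can be sketched on a feature array of shape $N \times T \times E$ as follows. This is an illustration under stated assumptions rather than the form of Eqs. (9) and (10): the window length, the stride, the averaging of overlapping frames, and both function names are hypothetical choices.

```python
import numpy as np

def split_windows(X, window, stride):
    """Cut X of shape (N, T, E) into temporal slices of shape (N, window, E)."""
    T = X.shape[1]
    return [X[:, s:s + window, :] for s in range(0, T - window + 1, stride)]

def combine_windows(windows, T, stride):
    """Recombine windows into (N, T, E), averaging frames covered more than once.

    Frames not covered by any window (possible when T - window is not a
    multiple of stride) are left as zeros.
    """
    N, window, E = windows[0].shape
    out = np.zeros((N, T, E))
    count = np.zeros((1, T, 1))
    for i, w in enumerate(windows):
        s = i * stride
        out[:, s:s + window, :] += w
        count[:, s:s + window, :] += 1
    return out / np.maximum(count, 1)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8, 3))             # 5 joints, 8 time steps, 3 channels
windows = split_windows(X, window=4, stride=2)
X_star = combine_windows(windows, T=8, stride=2)
print(len(windows), X_star.shape)
```

In practice each window would be passed through a hypergraph convolution before recombination; here the windows are recombined unchanged, so `X_star` equals `X`, which makes the bookkeeping easy to check.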
Finally, to reduce temporal and channel redundancy, the study integrated TTH and CTH.
TTH prunes redundant frames in the temporal sequence (such as transition poses between
jumping actions), while CTH optimizes the channel-dimension feature map to focus on
task-related joints. Together, these modules form the Improved Spatiotemporal Hypergraph
Convolution (ISTHC) framework, which achieves unified modeling of the spatial, temporal,
and channel dimensions. The ISTHC framework is shown in Fig. 4.
Fig. 4. ISTHC structure diagram.
In Fig. 4, the entire model adds the STHC module to the IHGCN framework. First, HGCN
performs hypernode feature extraction on the original technical action data of the
cheerleader. Second, TTH is generated in SAM, and CTH is generated in the topology
module. Afterwards, STHC performs window segmentation and recombination on the hypergraph
fused from the temporally refined hypergraph and the channel hypergraph, in order
to extract more diverse joint motion information of cheerleaders.