3.1 Technology and Theoretical Basis of Motion Recognition
Tennis is extremely demanding, not only technically but also physically [23]. Because its technical actions are complicated, mistakes are easy to make during the learning process. Getting started is especially difficult for beginners, and incorrect technical movements may even cause sports injuries such as muscle strain. If wrong actions are not corrected in time at an early stage, students easily form incorrect technical stereotypes, which affect the mastery and improvement of technique at the next stage [24]. Therefore, during teaching, teachers should diagnose and analyze students' wrong actions promptly and propose correction methods, so as to improve the quality of tennis teaching.
                  
Video-based human motion recognition is a basic subject in computer vision research. A DL model is a nonlinear network model with many hidden layers. Through training on large-scale raw data, the network learns to extract the features that best express the original data and then predicts or classifies samples. The DL model architecture is shown in Fig. 1. With its rapid development, DL technology has shown advantages in fields such as computer vision and natural language processing [25].
                  
                  
                        Fig. 1. DL model architecture diagram.
 
A CNN starts at the bottom of the image and gradually extracts features towards the top. At lower levels, it learns simple edge and color features, such as lines, curves, and colors. As the hierarchy deepens, it gradually learns more complex features, such as shapes, object parts, and ultimately complete objects. This bottom-up feature extraction is the strength of a CNN, as it allows the network to automatically find the features most useful for recognition tasks. In addition, a CNN is fairly robust to changes in image size, rotation, and flipping, which makes it perform well in many tasks, especially image recognition. Compared with the traditional way of extracting data features manually, a CNN automatically extracts richer and more abstract features of objects from the data itself. An NN can approximate any nonlinear continuous function with arbitrary precision, and many problems in the modeling process are highly nonlinear. With the continuous development of DL technology, CNNs have been widely used by researchers, and their effectiveness has been verified in many network models. A CNN adopts a form of partial connection, in which only some neurons in the network are connected. Generally, a CNN consists of three parts: convolution layers for extracting features, pooling layers for reducing the size of feature maps, and fully connected layers. A CNN is characterized by its convolution operation, and the convolution process is shown in Fig. 2.
                  
                  
                        Fig. 2. Convolution process.
 
In a pixel block of an image, each pixel value is multiplied by the corresponding convolution kernel weight, and the results are summed to yield one output value for that block. The kernel is then shifted to a new pixel block, and repeating this computation across the entire image completes the convolution.
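This sliding-kernel operation can be sketched as a minimal NumPy implementation (no padding; the image, kernel, and stride values below are purely illustrative):

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Slide `kernel` over a square `image` (no padding) and return the feature map."""
    k = kernel.shape[0]
    n = image.shape[0]
    m = (n - k) // stride + 1  # output size
    out = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            # element-wise product of the current pixel block and the kernel, then sum
            block = image[i*stride:i*stride+k, j*stride:j*stride+k]
            out[i, j] = np.sum(block * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])  # simple edge-like kernel
print(conv2d(image, kernel).shape)            # (3, 3)
```

A 4x4 input convolved with a 2x2 kernel at stride 1 yields a 3x3 feature map, consistent with the output-size rule given in Section 3.3.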
                  
                
               
                     3.2 Human Motion Recognition Method
Human tracking is a technology that uses various sensors, algorithms, and computer vision techniques to identify and track the position and posture of the human body in space in real time. By establishing a motion model of the human body, the body can be tracked and recognized. For example, a 3D human model can be used to simulate human motion, and algorithms can be used to fit actual human motion data. By comparing motion models over several consecutive frames, the correspondence between human bodies or joint points can be determined. There are many ways of matching, such as location-based matching, material- and color-based matching, and speed-based matching. The feature extraction method and the recognition algorithm are the two most important parts of the recognition process. A diagram of human motion recognition is shown in Fig. 3.
                  
                  
                        Fig. 3. Diagram of human motion recognition.
 
After human movements are represented with different features, recognizing them becomes a pattern classification problem. Classifiers can be divided into linear and nonlinear classifiers according to their classification planes. Because nonlinear classification algorithms are difficult to solve, a large number of human motion classification methods use a linear classifier.
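As an illustration, a nearest-centroid rule, one of the simplest linear classifiers, can be sketched on hypothetical 2D action features (the class names and feature values are invented for the example):

```python
import numpy as np

# Hypothetical feature vectors for two action classes ("serve" vs. "swing").
# The decision boundary between two class centroids is a hyperplane, so the
# nearest-centroid rule is a linear classifier.
serve = np.array([[1.0, 4.0], [1.2, 3.8], [0.8, 4.2]])
swing = np.array([[4.0, 1.0], [3.9, 1.3], [4.1, 0.9]])

centroids = {"serve": serve.mean(axis=0), "swing": swing.mean(axis=0)}

def classify(x):
    # assign x to the class whose centroid is closest (Euclidean distance)
    return min(centroids, key=lambda c: np.linalg.norm(x - centroids[c]))

print(classify(np.array([1.1, 4.1])))   # serve
```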
                  
In human motion recognition, data modalities are generally divided into three categories: video data, depth images, and skeletal motion sequences. Different algorithms or models are designed according to the data modality of the recognition task. The core idea of the method is to extract the whole human body contour, which covers the motion features, the overall structure, and the external shape of the human body. The model makes use of these three characteristics, and motion recognition is finally completed by the constructed model.
                  
                
               
                     3.3 Construction of Model
In the skeletal motion sequence modality, each sample uses a human skeleton with 25 joints to represent a moving individual, and the 3D position changes of the joint points over time represent human movements. The coordinates of these joint points are normalized so that the coordinate data are not affected by scale. A batch normalization layer was used to optimize the model and enhance the generalization ability of the network. It standardizes scattered data, accelerates the convergence of the loss function, and helps to alleviate gradient vanishing and improve gradient propagation.
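These normalization steps can be sketched as follows (the choice of joint 0 as the root joint and the tensor sizes are assumptions for illustration):

```python
import numpy as np

# Hypothetical skeleton sample: 30 frames x 25 joints x 3 coordinates.
rng = np.random.default_rng(0)
skeleton = rng.normal(loc=5.0, scale=2.0, size=(30, 25, 3))

# Scale/position normalization: express all joints relative to a root joint
# (index 0 here), so the coordinates do not depend on where the subject stands.
root = skeleton[:, :1, :]
centered = skeleton - root

# Batch-normalization-style standardization: zero mean, unit variance per
# coordinate channel, which helps the loss function converge faster.
mean = centered.mean(axis=(0, 1), keepdims=True)
std = centered.std(axis=(0, 1), keepdims=True) + 1e-5
normalized = (centered - mean) / std

print(normalized.shape)   # (30, 25, 3)
```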
                  
We collected tennis training action videos and extracted tennis training action features. For a 2D input of size $N\times N$, suppose the convolution kernel is $k\times k$, the stride is $s$, and the zero padding is $p$. If the output size after convolution is $m\times m$, then $m$ is calculated as follows:
                  
$m=\left[\frac{N-k+2p}{s}\right]+1$
$\left[\cdot \right]$ means rounding down. When appropriate convolution parameters are set, the scale can be kept unchanged after convolution; not every convolution reduces the input dimension, although this is necessary in most cases. In general, however, convolutional networks use a pooling layer for downsampling to achieve feature fusion and smoothing.
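As a quick check of this output-size rule, a small helper (the name and parameter values are illustrative) can compute $m$ for a few settings:

```python
def conv_output_size(n, k, s=1, p=0):
    """m = floor((n - k + 2p) / s) + 1 for an n x n input, k x k kernel,
    stride s, and zero padding p."""
    return (n - k + 2 * p) // s + 1

# "Same" convolution: with k=3, s=1, p=1 the scale is kept unchanged.
print(conv_output_size(32, 3, s=1, p=1))   # 32
# A strided convolution halves the spatial size instead.
print(conv_output_size(32, 3, s=2, p=1))   # 16
```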
                  
                  
                        Fig. 4. ReLU function diagram.
 
                  The ReLU function is shown in Fig. 4. When the input value is negative, its output is 0. When the input is positive, the
                     output is the input. Its expression is as follows:
$f\left(x\right)=\max \left(0,x\right)$
                  A sigmoid function maps the input to the interval of (0,1) so that the transmission
                     process information does not diverge. Its expression is as follows:
$\mathrm{sigmoid}\left(x\right)=\frac{1}{1+e^{-x}}$
                  The tanh function maps the input to the interval (-1,1), and its average value is
                     0. The function expression is as follows:
$\tanh \left(x\right)=\frac{e^{x}-e^{-x}}{e^{x}+e^{-x}}$
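The three activation functions above can be sketched directly in NumPy (the input values are illustrative):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)        # 0 for negative inputs, identity otherwise

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))  # maps inputs into (0, 1)

def tanh(x):
    return np.tanh(x)                # maps inputs into (-1, 1), zero-centered

x = np.array([-2.0, 0.0, 2.0])
print(relu(x))       # [0. 0. 2.]
print(sigmoid(0.0))  # 0.5
print(tanh(0.0))     # 0.0
```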
Suppose that in a 2D CNN, the input tensor of layer $l$ has size $C^{l}\times H^{l}\times W^{l}$, where $C^{l}$ is the number of input channels of layer $l$. The size of a single convolution kernel is $C^{l}\times h^{l}\times w^{l}$. Then, for a convolutional layer with $C^{l+1}$ hidden neurons, the output at the corresponding position is as follows:
$y_{i^{l},j^{l},d}=\sigma \left(\sum _{c=1}^{C^{l}}\sum _{u=1}^{h^{l}}\sum _{v=1}^{w^{l}}p_{c,u,v,d}\,x_{c,i^{l}+u-1,j^{l}+v-1}+b_{d}\right)$
$d$ is the neuron index in layer $l$, and $i^{l}$ and $j^{l}$ represent the location information. The constraints are Eqs. (6) and (7):
$1\leq i^{l}\leq H^{l}-h^{l}+1$

$1\leq j^{l}\leq W^{l}-w^{l}+1$
                  $p$ is the convolution kernel parameter, $b$ is the bias parameter in the convolution,
                     and $\sigma \left(\cdot \right)$ is the activation function.
                  
                  Given an image $P$ of size $M\times N$, the image matrix can be regarded as a one-dimensional
                     vector in row-major order, and a one-dimensional label vector is defined:
$A=\left[A_{1},A_{2},A_{3},\ldots ,A_{M\times N}\right]$
The value range of each element $A_{i}$ in $A$ is $\left\{0,1\right\}$. The image segmentation effect can be evaluated by calculating the cost function of the labeling $A$.
                  
                  
                  
                  
Let $p_{\lambda }$ be the probability density function, $\lambda =\left[\lambda _{1},\lambda _{2},\lambda _{3},\ldots ,\lambda _{M}\right]$ be the $M$ parameters of $p_{\lambda }$, and $X_{1}=\left[x_{t},t=1,2,3,\ldots ,T_{I}\right]$ be the effective features of tennis training action videos, where $d$ is the feature dimension after dimension reduction. The model includes $K$ Gaussian unit parameter sets:
$\lambda =\left\{w_{i},u_{i},\Sigma _{i}\right\},\quad i=1,2,3,\ldots ,K$
                  Its Gaussian mixture model is:
$p_{\lambda }\left(x_{t}\right)=\sum _{i=1}^{K}w_{i}p_{i}\left(x_{t}\right)$
$w_{i}$, $u_{i}$, and $\Sigma _{i}$ are the mixture weight, mean vector, and covariance matrix, respectively, and $p_{i}\left(x_{t}\right)$ is the $i$th Gaussian unit evaluated at $x_{t}$. According to the Bayesian equation, the probability that $x_{t}$ is assigned to the $i$th Gaussian unit is:
$\gamma _{t}\left(i\right)=\frac{w_{i}p_{i}\left(x_{t}\right)}{\sum _{j=1}^{K}w_{j}p_{j}\left(x_{t}\right)}$
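This Bayesian assignment step can be sketched with a hypothetical two-unit, one-dimensional mixture (all parameter values are illustrative):

```python
import numpy as np

# Hypothetical 1-D Gaussian mixture with K = 2 units.
w = np.array([0.4, 0.6])      # mixture weights
u = np.array([0.0, 5.0])      # means
sigma = np.array([1.0, 1.0])  # standard deviations

def gauss(x, mean, sd):
    # Gaussian probability density function
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

def posterior(x):
    """Probability that feature x is assigned to each Gaussian unit (Bayes rule)."""
    likelihood = w * gauss(x, u, sigma)
    return likelihood / likelihood.sum()

gamma = posterior(0.2)
print(gamma.argmax())   # 0: the feature is assigned to the first unit
```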
                  Then, the gradients of $x_{t}$ with respect to $\lambda =\left\{w_{i},u_{i},\Sigma
                     _{i}\right\}\,\,\,i=1,2,3,\ldots ,k$ are expressed as:
$\frac{\partial L\left(X_{1}|\lambda \right)}{\partial w_{i}}=\sum _{t=1}^{T_{I}}\left[\frac{\gamma _{t}\left(i\right)}{w_{i}}-\frac{\gamma _{t}\left(1\right)}{w_{1}}\right]$

$\frac{\partial L\left(X_{1}|\lambda \right)}{\partial u_{i}^{k}}=\sum _{t=1}^{T_{I}}\gamma _{t}\left(i\right)\frac{x_{t}^{k}-u_{i}^{k}}{\left(\sigma _{i}^{k}\right)^{2}}$

$\frac{\partial L\left(X_{1}|\lambda \right)}{\partial \sigma _{i}^{k}}=\sum _{t=1}^{T_{I}}\gamma _{t}\left(i\right)\left[\frac{\left(x_{t}^{k}-u_{i}^{k}\right)^{2}}{\left(\sigma _{i}^{k}\right)^{3}}-\frac{1}{\sigma _{i}^{k}}\right]$
                  
                  where $\sigma _{i}^{k}$ represents the standard deviation in the covariance matrix
                     $\Sigma _{i}$.
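As an illustration of these gradient statistics, the mean-gradient term can be sketched in NumPy under a diagonal-covariance assumption (the function name and values are hypothetical):

```python
import numpy as np

def mean_gradient(X, gamma, u_i, sigma_i):
    """Gradient of the log-likelihood w.r.t. the mean of one Gaussian unit,
    accumulated over frame features: sum_t gamma_t(i) * (x_t - u_i) / sigma_i^2.

    X: (T, d) features; gamma: (T,) posteriors for this unit;
    u_i: (d,) mean; sigma_i: (d,) per-dimension standard deviations."""
    return np.sum(gamma[:, None] * (X - u_i) / sigma_i ** 2, axis=0)

X = np.array([[1.0, 2.0], [3.0, 4.0]])
gamma = np.array([1.0, 0.0])   # all responsibility on the first frame
print(mean_gradient(X, gamma, np.array([0.0, 0.0]), np.array([1.0, 1.0])))  # [1. 2.]
```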
                  
                  The identification process essentially selects a model from the set of models that
                     best describes the observed signal. We first perform feature extraction on the input
                     action sequence to obtain an observation feature sequence $O$ corresponding to the
                     action sequence. We then calculate the probability $P\left(S\left| O,\theta \right.\right)$
                     corresponding to each model in the model set. Finally, the category of the action
                     is determined according to the following maximum likelihood equation:
$S^{*}=\arg \max _{S}P\left(S|O,\theta \right)$
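The maximum-likelihood decision amounts to picking the best-scoring model; a minimal sketch (the model set and log-likelihood values are hypothetical):

```python
# Hypothetical log-likelihoods log P(S | O, theta) of one observation
# sequence O under each action model in the model set.
log_likelihoods = {"serve": -120.4, "forehand": -98.7, "backhand": -135.2}

# Maximum-likelihood decision: the category whose model best explains O.
predicted = max(log_likelihoods, key=log_likelihoods.get)
print(predicted)   # forehand
```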
We calculate the conditional probability $P\left(S|O\right)$ of the label sequence $S$ given the observation sequence $O$ and use the forward-backward dynamic programming algorithm to find the labeling with the highest probability as the recognition result of this step. In this study, after three-dimensional pooling, both the spatial and temporal sizes of the feature map are reduced, which greatly reduces the computation of the subsequent network. The max pooling operation effectively reduces the number of parameters and the computational complexity of the network, enabling it to process input data faster and improving time efficiency.
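The three-dimensional pooling step can be sketched as non-overlapping max pooling over a (time, height, width) feature map (the window size and tensor dimensions are illustrative):

```python
import numpy as np

def max_pool_3d(x, k=2):
    """Non-overlapping k x k x k max pooling over (time, height, width)."""
    t, h, w = (dim // k for dim in x.shape)
    out = np.zeros((t, h, w))
    for i in range(t):
        for j in range(h):
            for l in range(w):
                # keep only the maximum response in each k x k x k window
                out[i, j, l] = x[i*k:(i+1)*k, j*k:(j+1)*k, l*k:(l+1)*k].max()
    return out

features = np.arange(4 * 4 * 4, dtype=float).reshape(4, 4, 4)
pooled = max_pool_3d(features)
print(features.size, "->", pooled.size)   # 64 -> 8
```

Pooling with a 2x2x2 window cuts both the temporal and spatial sizes in half, so the subsequent layers operate on one eighth of the original values.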