3.1 Technology and Theoretical Basis of Motion Recognition
Tennis is extremely demanding, not only in terms of technique but also in terms of
physical condition [23]. Because its technical actions are complicated, mistakes are
easy to make during the learning process. For beginners in particular, getting started
is quite difficult, and incorrect technical movements may even cause sports injuries
such as muscle strain. If wrong actions are not corrected in time at the early stage,
students easily form incorrect technical stereotypes, which affects the mastery and
improvement of technique at the next stage [24]. Therefore, in the teaching process,
teachers should diagnose and analyze students' wrong actions in time and propose
correction methods so as to better improve the quality of tennis teaching.
Video-based human motion recognition is a fundamental topic in computer vision research.
A DL model is a nonlinear network model with many hidden layers. Through training
on large-scale raw data, the network extracts the features that best express the
original data and then predicts or classifies samples. The DL model architecture
is shown in Fig. 1. With its rapid development, DL technology holds advantages in
fields such as computer vision and natural language processing [25].
Fig. 1. DL model architecture diagram.
A CNN starts at the bottom of the image and gradually extracts features towards the
top. At the lower levels, it learns to extract simple features such as lines, curves,
and colors. As the hierarchy deepens, it gradually learns more complex features, such
as shapes, object parts, and ultimately complete objects. This bottom-up feature
extraction is the strength of a CNN, as it allows the network to automatically find
the features most useful for recognition tasks. In addition, a CNN is robust to changes
in image size, rotation, and flipping, which makes it perform well in many tasks,
especially image recognition. Compared with traditional manual feature extraction,
a CNN automatically extracts richer and more abstract features directly from the
objects themselves. An NN can approximate any nonlinear continuous function with
arbitrary precision, and many problems in the modeling process are highly nonlinear.
With the continuous development of DL technology, CNNs have been widely used by
researchers, and their effectiveness has been verified in many network models. A CNN
adopts partial connection: only some neurons in the network are connected. Generally,
a CNN consists of three parts: a convolution layer for extracting features, a pooling
layer for reducing the size of the feature map, and a fully connected layer. A CNN
is characterized by its convolution operation, and the convolution process is shown
in Fig. 2.
Fig. 2. Convolution process.
In a pixel block of the image, each pixel value is multiplied by the corresponding
convolution kernel weight, and the results are summed to yield an output value for
that block. Subsequently, the kernel is shifted to a new pixel block, allowing the
convolution to be computed across the entire image. This series of operations is
referred to as the convolution process.
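The sliding-window computation described above can be sketched in NumPy. This is an illustrative valid convolution without padding; the function name `conv2d` and the averaging kernel are our own choices, not from the paper:

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Slide the kernel over the image and sum the elementwise products
    at each position (valid convolution, no zero padding)."""
    n, k = image.shape[0], kernel.shape[0]
    m = (n - k) // stride + 1  # output size per side
    out = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            block = image[i * stride:i * stride + k, j * stride:j * stride + k]
            out[i, j] = np.sum(block * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.ones((2, 2)) / 4.0          # simple 2x2 averaging kernel
print(conv2d(image, kernel).shape)      # (3, 3)
```

Each output value depends only on its local pixel block, which is exactly the partial-connection property discussed above.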
3.2 Human Motion Recognition Method
Human tracking is a technology that uses sensors, algorithms, and computer vision
techniques to identify and track the position and posture of the human body in space
in real time. By establishing a motion model of the human body, the body can be
tracked and recognized. For example, a 3D human model can be used to simulate human
motion, and algorithms can fit it to actual human motion data. By comparing motion
models over several consecutive frames, the correspondence between human bodies or
joint points can be determined. There are many ways of matching, such as position-based
matching, texture- and color-based matching, and speed-based matching. Feature
extraction and the recognition algorithm are the two most important parts of the
recognition process. A diagram of human motion recognition is shown in Fig. 3.
Fig. 3. Diagram of human motion recognition.
After human movements are represented by different features, recognizing them becomes
a pattern classification problem. Classifiers can be divided into linear and nonlinear
classifiers according to their classification planes. Nonlinear classification
algorithms are difficult to solve, so a large number of human motion classification
methods use linear classifiers.
In human motion recognition, data modalities are generally divided into three
categories: video data, depth images, and skeletal motion sequences. According to
the data modality of the recognition task, different algorithms or models are designed
to complete it. The core idea of our method is to extract the whole human body contour,
which includes the motion features, overall structure, and external shape of the human
body. We make use of these three characteristics in the model, and the constructed
model finally completes the motion recognition.
3.3 Construction of Model
In the skeletal motion sequence modality, each sample represents a moving individual
by a human skeleton with 25 joints, and the change of the joints' 3D positions over
time represents human movement. The coordinates of these joint points are normalized
so that the coordinate data are not affected by scale. A batch normalization layer
was used to optimize the model and enhance the generalization ability of the network.
It unifies scattered data through normalization, accelerates the convergence of the
loss function, and helps to reduce gradient dispersion and propagate gradients.
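The scale normalization of joint coordinates can be illustrated as follows. The paper does not specify the exact scheme, so the root-joint centering and mean-distance scaling used here are assumptions:

```python
import numpy as np

def normalize_skeleton(joints, root=0):
    """Scale-normalize a (T, 25, 3) sequence of 3D joint coordinates:
    center every frame on a root joint, then divide by the mean
    joint-to-root distance so the coordinates become scale-invariant.
    (Illustrative preprocessing; the paper's exact scheme is not given.)"""
    centered = joints - joints[:, root:root + 1, :]   # move root to the origin
    scale = np.linalg.norm(centered, axis=-1).mean()  # average distance to root
    return centered / scale

seq = np.random.rand(10, 25, 3)       # 10 frames, 25 joints, xyz coordinates
norm = normalize_skeleton(seq)
print(np.allclose(norm[:, 0], 0))     # True: the root joint sits at the origin
```

A uniformly rescaled skeleton (e.g., the same subject filmed closer to the camera) maps to the same normalized coordinates, which is the scale-invariance the text describes.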
We collected tennis training videos and extracted tennis training action features.
For a 2D input of size $N\times N$, suppose the convolution kernel is $k\times k$,
the stride is $s$, and the zero padding is $p$. If the output size after convolution
is $m\times m$, then $m$ is calculated as follows:

$m=\left[\frac{N-k+2p}{s}\right]+1.$

$\left[\cdot \right]$ means rounding down. When appropriate convolution parameters
are set, the scale after convolution can be kept unchanged; not every convolution
reduces the input dimension, although this is necessary in most cases. In general,
however, convolutional networks use a pooling layer for downsampling to achieve
feature fusion and smoothing.
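A small helper makes the output-size formula concrete (the function name is ours, for illustration):

```python
def conv_output_size(n, k, s=1, p=0):
    """m = floor((n - k + 2p) / s) + 1 for an n x n input,
    a k x k kernel, stride s, and zero padding p."""
    return (n - k + 2 * p) // s + 1

# A 3x3 kernel with stride 1 and padding 1 keeps a 32x32 input at 32x32
# ("same" convolution), while a 2x2 stride-2 window halves it.
print(conv_output_size(32, 3, s=1, p=1))  # 32
print(conv_output_size(32, 2, s=2, p=0))  # 16
```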
Fig. 4. ReLU function diagram.
The ReLU function is shown in Fig. 4. When the input is negative, the output is 0; when the input is positive, the
output equals the input. Its expression is as follows:

$f\left(x\right)=\max \left(0,x\right).$
The sigmoid function maps the input to the interval (0,1) so that the information
does not diverge during transmission. Its expression is as follows:

$f\left(x\right)=\frac{1}{1+e^{-x}}.$
The tanh function maps the input to the interval (-1,1), and its mean value is
0. The function expression is as follows:

$f\left(x\right)=\frac{e^{x}-e^{-x}}{e^{x}+e^{-x}}.$
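The three activation functions can be checked numerically; this is a minimal NumPy sketch following the standard expressions:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)        # 0 for negatives, identity for positives

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))  # maps the reals into (0, 1)

def tanh(x):
    return np.tanh(x)                # maps the reals into (-1, 1), zero mean

x = np.array([-2.0, 0.0, 2.0])
print(relu(x))     # [0. 0. 2.]
print(sigmoid(0))  # 0.5
print(tanh(0))     # 0.0
```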
Suppose that in a 2D CNN, the input tensor of layer $l$ has size $C^{l}\times H^{l}\times W^{l}$,
where $C^{l}$ is the number of input channels of layer $l$, and the size of a
single convolution kernel is $C^{l}\times h^{l}\times w^{l}$. Then, for a convolutional
layer with $C^{l+1}$ hidden neurons, the output at the corresponding position is as
follows:

$y_{i^{l},j^{l},d}=\sigma \left(\sum _{c=1}^{C^{l}}\sum _{u=1}^{h^{l}}\sum _{v=1}^{w^{l}}p_{c,u,v,d}\,x_{c,\,i^{l}+u-1,\,j^{l}+v-1}+b_{d}\right),$

where $d$ is the index of the hidden neuron, and $i^{l}$ and $j^{l}$ represent the
location information, subject to the constraints

$1\leq i^{l}\leq H^{l}-h^{l}+1, \qquad 1\leq j^{l}\leq W^{l}-w^{l}+1.$

$p$ is the convolution kernel parameter, $b$ is the bias parameter in the convolution,
and $\sigma \left(\cdot \right)$ is the activation function.
Given an image $P$ of size $M\times N$, the image matrix can be regarded as a one-dimensional
vector in row-major order, and a one-dimensional label vector is defined:

$A=\left[A_{1},A_{2},\ldots ,A_{M\times N}\right].$

The value range of each element $A_{i}$ in $A$ is $\left\{0,1\right\}$. The image
segmentation effect can be evaluated by calculating a cost function over the labeling $A$.
Let $p_{\lambda }$ be the probability density function, $\lambda =\left[\lambda _{1},\lambda
_{2},\lambda _{3},\ldots ,\lambda _{M}\right]$ be the $M$ parameter vectors of $p_{\lambda
}$, and $X_{1}=\left[x_{t},t=1,2,3,\ldots ,T_{I}\right]$ be the effective features
of the tennis training action videos, where $d$ is the feature dimension after dimension
reduction. The model contains $K$ Gaussian unit parameter sets:

$\lambda =\left\{w_{i},u_{i},\Sigma _{i}\right\},\quad i=1,2,3,\ldots ,K.$

Its Gaussian mixture model is:

$p_{\lambda }\left(x_{t}\right)=\sum _{i=1}^{K}w_{i}p_{i}\left(x_{t}\right).$
$w_{i}$, $u_{i}$, and $\Sigma _{i}$ are the mixture weight, mean vector, and covariance
matrix, respectively, and $p_{i}\left(x_{t}\right)$ is the $i$th Gaussian unit evaluated
at $x_{t}$. According to Bayes' rule, the probability that $x_{t}$ is assigned to the
$i$th Gaussian unit is:

$\gamma _{t}\left(i\right)=\frac{w_{i}p_{i}\left(x_{t}\right)}{\sum _{j=1}^{K}w_{j}p_{j}\left(x_{t}\right)}.$

Then, the gradients of $x_{t}$ with respect to $\lambda =\left\{w_{i},u_{i},\Sigma
_{i}\right\}$, $i=1,2,3,\ldots ,K$, are expressed in each dimension $k$ as:

$g_{u_{i}}^{k}=\frac{1}{\sqrt{w_{i}}}\,\gamma _{t}\left(i\right)\frac{x_{t}^{k}-u_{i}^{k}}{\sigma _{i}^{k}}, \qquad g_{\sigma _{i}}^{k}=\frac{1}{\sqrt{2w_{i}}}\,\gamma _{t}\left(i\right)\left[\frac{\left(x_{t}^{k}-u_{i}^{k}\right)^{2}}{\left(\sigma _{i}^{k}\right)^{2}}-1\right],$
where $\sigma _{i}^{k}$ represents the standard deviation in the covariance matrix
$\Sigma _{i}$.
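The posterior assignment given by Bayes' rule above can be sketched for a diagonal-covariance mixture. This is an illustrative implementation with toy parameter values, not the paper's fitted model:

```python
import numpy as np

def gmm_posteriors(x, w, u, sigma):
    """Posterior gamma_t(i) that feature x_t belongs to Gaussian unit i,
    for a diagonal-covariance mixture (Bayes' rule over the K units).
    x: (T, d) features; w: (K,) weights; u, sigma: (K, d) means and stds."""
    diff = (x[:, None, :] - u[None, :, :]) / sigma[None, :, :]
    log_p = (-0.5 * (diff ** 2).sum(-1)          # per-unit log-density
             - np.log(sigma).sum(-1)
             - 0.5 * u.shape[1] * np.log(2 * np.pi))
    log_wp = np.log(w)[None, :] + log_p
    log_wp -= log_wp.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(log_wp)
    return p / p.sum(axis=1, keepdims=True)      # each row sums to 1

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 2))                      # T=5 features, d=2
w = np.array([0.5, 0.5])
u = np.zeros((2, 2))
s = np.ones((2, 2))                              # two identical toy units
g = gmm_posteriors(x, w, u, s)
print(np.allclose(g.sum(axis=1), 1.0))           # True
```

With two identical units the posterior is 0.5 for each, as Bayes' rule requires; the gradients above would then be computed from these $\gamma _{t}(i)$ values.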
The identification process essentially selects the model from the model set that
best describes the observed signal. We first perform feature extraction on the input
action sequence to obtain an observation feature sequence $O$ corresponding to the
action sequence. We then calculate the probability $P\left(S\left| O,\theta \right.\right)$
corresponding to each model in the model set. Finally, the category of the action
is determined according to the following maximum likelihood equation:

$S^{\ast }=\arg \max _{S}P\left(S\left| O,\theta \right.\right).$
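In practice the maximum-likelihood decision reduces to an argmax over per-model scores. This is a toy sketch; the score values and the class interpretation are hypothetical:

```python
import numpy as np

def classify(log_likelihoods):
    """Pick the action class whose model best explains the observation
    sequence O: argmax over the per-model log P(S | O, theta)."""
    return int(np.argmax(log_likelihoods))

# Hypothetical log-likelihoods of one observation sequence under three
# action models (e.g., serve / forehand / backhand).
scores = np.array([-120.4, -98.7, -133.1])
print(classify(scores))  # 1 -> the second model fits best
```

Working in log-probabilities avoids underflow when the per-frame probabilities of a long sequence are multiplied together.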
We calculate the conditional probability $P\left(S\left| O\right.\right)$ of the label
sequence $S$ given the observation sequence $O$ and use the forward-backward dynamic
programming algorithm to find the labeling with the highest probability as the
recognition result of this step. In this study, after three-dimensional pooling,
both the spatial and temporal sizes of the feature map are reduced, which greatly
reduces the computation of the subsequent network. The max pooling operation
effectively reduces the number of parameters and the computational complexity of
the network, enabling the network to process input data faster and improving time
efficiency.
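The three-dimensional pooling step can be sketched as a non-overlapping max over a (T, H, W) feature map. The window size of 2 is an assumption for illustration; the paper does not state its pooling parameters:

```python
import numpy as np

def max_pool3d(x, k=2):
    """Non-overlapping k x k x k max pooling over a (T, H, W) feature map:
    for k=2 it halves the temporal and both spatial dimensions, shrinking
    the computation of later layers roughly k**3-fold."""
    t, h, w = (d // k for d in x.shape)
    x = x[:t * k, :h * k, :w * k]  # drop ragged edges not covered by a window
    return x.reshape(t, k, h, k, w, k).max(axis=(1, 3, 5))

feat = np.arange(4 * 4 * 4, dtype=float).reshape(4, 4, 4)
print(max_pool3d(feat).shape)  # (2, 2, 2)
```

Each output cell keeps only the strongest activation in its spatiotemporal window, which is why both the time size and the space size of the feature map shrink at once.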