3.1 Construction of the V-DBN Model
Children differ from adults in both physical and mental development. Football training
must therefore account for the characteristics of children’s motor development in order
to avoid injuries and to support children’s football education. To this end, the
characteristics of children’s football movements are collected through 3D skeleton
recognition, and the movements are then recognized by the proposed V-DBN model. For
children’s movements, 3D skeleton recognition extracts motion features from the aggregate
data (skeleton position data) and provides them to the human motion recognition model
[9]. The human motion data are collected by a Kinect device, and the layout of the
collected skeleton points is shown in Fig. 1.
Position information is collected for the key human bones. Each point carries position
information and time data, and each bone point in a frame is represented by a row. In
this research, two schemes for dividing the bone points are considered. Scheme A groups
bone points over a selected range of frames to solve for time series features, as shown
in Fig. 2(a). Scheme B divides the skeleton into three layers from the outside to the inside,
as shown in Fig. 2(b) [10]. Scheme A distinguishes changes in the human body better, but
does not account for changes within the bones, whereas Scheme B fully considers the effect
of each bone joint, but does not account for changes in specific parts. On the whole,
Scheme B considers the positional relationship of each bone point, so it was used to
calculate the time series characteristics of human actions.
Time series and spatial features together reflect the changes in human motion. In the
time series description, the position of each bone point is represented by displacement
$x$, and its change over time is represented by velocity $v$ and acceleration $a$. The
displacement expression is shown in formula (1) [11]:
In formula (1), $p$ is a bone’s three-dimensional space position parameter $(x,y,z)$, $i$ represents
a certain bone point of the person, and $f$ represents the current frame. The velocity
expression is shown in formula (2):
In formula (2), $\Delta t$ represents the time interval spanned by the frames $\left[f-1,f+1\right]$
on either side of the current frame, and the acceleration expression is shown in formula (3):
In order to describe spatial positions in human motion more conveniently, three time
series are defined and processed, as shown in formula (4):
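As a minimal sketch of formulas (1) to (4), assuming central finite differences for velocity and acceleration, the three time series and their concatenation could be computed as follows. The exact stencils, the function name `temporal_features`, and the per-frame interval `dt` are assumptions made for illustration and are not taken from the original formulas.

```python
import numpy as np

def temporal_features(positions, dt=1.0):
    """Displacement, velocity and acceleration for every bone point.

    positions: array of shape (F, J, 3), i.e. F frames, J bone points, (x, y, z).
    dt: time between two consecutive frames (assumed constant).
    Returns an array of shape (F - 4, J, 9) for frames f = 2 .. F - 3.
    """
    p = np.asarray(positions, dtype=float)
    # Displacement: the position p_i^f itself, restricted to frames
    # where all finite-difference stencils stay inside the clip.
    x = p[2:-2]
    # Velocity: central difference over the frame window [f - 1, f + 1].
    v = (p[3:-1] - p[1:-3]) / (2.0 * dt)
    # Acceleration: second central difference over [f - 2, f + 2] (assumed stencil).
    a = (p[4:] - 2.0 * p[2:-2] + p[:-4]) / (4.0 * dt ** 2)
    # Concatenate the three time series per bone point.
    return np.concatenate([x, v, a], axis=-1)

# Usage: one bone point moving as (t^2, t, 0) has constant acceleration (2, 0, 0).
frames = np.arange(7, dtype=float)
clip = np.stack([frames ** 2, frames, np.zeros(7)], axis=-1)[:, None, :]
features = temporal_features(clip)
```

Dropping the two boundary frames on each side keeps every stencil inside the recorded clip; padding the clip instead would be an equally reasonable choice.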
Human action features have the characteristics of both space sequences and time series.
The time series describes the movement of bone points through three types of motion
features, and these features differ from one another depending on the action [12]. In
particular, when people move slowly, velocity and acceleration show little difference, so
movement is mainly described by displacement. The positional relationship between the
reference point and the bone point is shown in formula (5):
In formula (5), $p_{i}^{f}$ and $p_{j}^{f}$ represent the spatial positions of skeleton node $i$
and reference point $j$ at frame $f$, respectively. Table 1 shows the relationships between the four relative spatial position parameters.
The spatial position feature is similar to the time series expression, and can be
described by concatenation, as shown in formula (6):
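A minimal sketch of the relative spatial position features follows, assuming that formula (5) is the difference between a bone point and its reference point and that formula (6) concatenates those differences. The joint indices in `JOINTS` are hypothetical placeholders (the actual Kinect numbering in Fig. 1 may differ), while the subsets and reference points follow Table 1.

```python
import numpy as np

# Hypothetical joint indices, for illustration only.
JOINTS = {"head": 0, "spine": 1, "left_hand": 2, "right_hand": 3,
          "left_hip": 4, "right_hip": 5, "left_foot": 6, "right_foot": 7}

# (skeletal point subset, reference point) for Space 1 to Space 4, as in Table 1.
SPACES = [
    (["right_hand", "left_hand"], "head"),
    (["right_foot", "right_hand", "head"], "left_hip"),
    (["left_hand", "left_foot", "head"], "right_hip"),
    (["right_foot", "left_foot", "head"], "spine"),
]

def spatial_features(positions):
    """positions: array of shape (F, J, 3).

    Returns a list of four arrays, one per space, each of shape
    (F, 3 * len(subset)), holding the concatenated relative positions.
    """
    p = np.asarray(positions, dtype=float)
    features = []
    for subset, reference in SPACES:
        idx = [JOINTS[name] for name in subset]
        # Assumed form of formula (5): p_i^f - p_j^f for each bone point i.
        relative = p[:, idx, :] - p[:, [JOINTS[reference]], :]
        # Assumed form of formula (6): concatenate the relative positions per frame.
        features.append(relative.reshape(len(p), -1))
    return features
```

Expressing each subset relative to a body-centred reference point removes the dependence on where the child stands in the camera frame.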
When a single feature is used to describe actions, similar actions receive nearly
identical descriptions, and actions cannot be distinguished when the distance between
classes is small [13]. For example, running and walking are close in description, which
degrades action recognition. Combining multiple features to enlarge the distance between
different actions can therefore improve the effectiveness of human action recognition.
For this reason, the V-DBN action recognition model (combining the VLAD model and the DBN
network) is used for optimization. The VLAD model unifies the length of the action
features by encoding them, so that they are described more fully [14]. An arbitrary time
series is expressed as in formula (7):
In formula (7), $n$ takes a value from $\left\{1,2,3\right\}$, $t_{n}^{l}$ represents the
$l$-th feature subsequence in the time dimension, and $L$ represents the total
number of subsequences. Any spatial position can then be expressed with formula (8):
In formula (8), $S_{m}^{l}$ represents the $l$-th feature subsequence in the $m$-th spatial position,
and $m$ takes a value from $\left\{1,2,3,4\right\}$. The temporal and spatial
feature sequences are converted into frames, and a total of 24 cluster centers are
obtained through the clustering algorithm [15]. The Euclidean distance between each cluster center and
each feature datum is then calculated to form 24 clusters. The expression of the time clusters
is given in formula (9):
In formula (9), $N_{u,n,qn}$ represents the time clustering center, $t_{n,f}$ represents the sequence
unit time frame, and $\mu _{u,np}$ represents the distance coefficient corresponding
to time. The spatial cluster expression is given in formula (10):
In formula (10), $NN_{u,m,qm}$ represents the spatial clustering center, $s_{n,f}$ represents the
sequence unit spatial frame, and $\mu _{u,mp}$ represents the distance coefficient
corresponding to space. Each human action, $U\times 23$, is described by vectors of
the same dimension, and each action vector contains four space sequences and three
time series features [16]. This large amount of data affects human action recognition, so the contrastive
divergence method is used to reduce the data processing load. The principle of the V-DBN’s
football action recognition for children is shown in Fig. 3.
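The length-unification step can be illustrated with a generic VLAD encoder. The sketch below is not the exact form of formulas (9) and (10): it assumes plain k-means for the unspecified clustering algorithm, hard nearest-center assignment by Euclidean distance, and a final L2 normalization. The function names `kmeans` and `vlad_encode` are introduced here for illustration.

```python
import numpy as np

def kmeans(data, k, iters=20, seed=0):
    """Plain k-means (assumed clustering algorithm); data has shape (N, d)."""
    rng = np.random.default_rng(seed)
    centers = data[rng.choice(len(data), size=k, replace=False)].copy()
    for _ in range(iters):
        dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        for c in range(k):
            members = data[labels == c]
            if len(members):          # keep the old center if a cluster is empty
                centers[c] = members.mean(axis=0)
    return centers

def vlad_encode(frames, centers):
    """Encode a clip of any length (F, d) into a fixed vector of size k * d."""
    dists = np.linalg.norm(frames[:, None, :] - centers[None, :, :], axis=-1)
    labels = dists.argmin(axis=1)     # nearest center by Euclidean distance
    k, d = centers.shape
    code = np.zeros((k, d))
    for c in range(k):
        members = frames[labels == c]
        if len(members):
            # Sum of residuals between the frames and their cluster center.
            code[c] = (members - centers[c]).sum(axis=0)
    code = code.ravel()
    norm = np.linalg.norm(code)
    return code / norm if norm > 0 else code

# Usage: clips of 30 and 75 frames map to vectors of identical length.
rng = np.random.default_rng(1)
centers = kmeans(rng.normal(size=(200, 9)), k=24)
short_code = vlad_encode(rng.normal(size=(30, 9)), centers)
long_code = vlad_encode(rng.normal(size=(75, 9)), centers)
```

Because the code length depends only on the number of centers and the feature dimension, actions of different durations become directly comparable, at the cost of discarding the frame order.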
In Fig. 3, the VLAD model unifies the feature data. In the restricted Boltzmann machine (RBM) layer,
the contrastive divergence algorithm is mainly used to model the initial feature data and
to optimize the data, which improves feature extraction.
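To make the contrastive divergence step concrete, the following is a minimal sketch of a single CD-1 update for one Bernoulli RBM layer. It assumes inputs scaled to $[0,1]$ and a single Gibbs step; the function name `cd1_step`, the learning rate, and the layer sizes are illustrative assumptions rather than settings from the V-DBN model.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_step(v0, W, b_vis, b_hid, lr=0.1, rng=None):
    """One CD-1 update on a batch v0 of shape (N, n_vis) with values in [0, 1].

    W (n_vis, n_hid), b_vis (n_vis,) and b_hid (n_hid,) are updated in place.
    Returns the mean squared reconstruction error measured before the update.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Positive phase: hidden probabilities and a binary sample given the data.
    h0_prob = sigmoid(v0 @ W + b_hid)
    h0 = (rng.random(h0_prob.shape) < h0_prob).astype(float)
    # Negative phase: one Gibbs step back to the visible layer and up again.
    v1_prob = sigmoid(h0 @ W.T + b_vis)
    h1_prob = sigmoid(v1_prob @ W + b_hid)
    # Move towards the data statistics and away from the model statistics.
    n = len(v0)
    W += lr * (v0.T @ h0_prob - v1_prob.T @ h1_prob) / n
    b_vis += lr * (v0 - v1_prob).mean(axis=0)
    b_hid += lr * (h0_prob - h1_prob).mean(axis=0)
    return float(np.mean((v0 - v1_prob) ** 2))

# Usage: a few updates on a toy batch of identical binary patterns.
rng = np.random.default_rng(0)
batch = np.tile([1.0, 1.0, 0.0, 0.0], (16, 1))
W = 0.01 * rng.normal(size=(4, 3))
b_vis, b_hid = np.zeros(4), np.zeros(3)
errors = [cd1_step(batch, W, b_vis, b_hid, rng=rng) for _ in range(200)]
```

CD-1 replaces the intractable model expectation with a one-step reconstruction, which is what keeps layer-wise pretraining cheap on large feature sets.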
Fig. 1. Distributions of the different skeleton points.
Fig. 2. Divisions for two kinds of sports bones.
Fig. 3. Schematic diagram of the V-DBN action recognition model.
Table 1. Corresponding results of spatial positions.
| Spatial description operator | Space 1 | Space 2 | Space 3 | Space 4 |
|---|---|---|---|---|
| Skeletal Point Subset | Right Hand and Left Hand | Right Foot, Right Hand, Head | Left Hand, Left Foot, Head | Right Foot, Left Foot, Head |
| Reference Frame | Head | Left hip | Right hip | Spine |
3.2 Construction of the LSTM Model
In the V-DBN action recognition model, the VLAD model unifies the length of the action
feature data, but it breaks the relationship between consecutive frames of a continuous
skeletal action, which hinders human action recognition. The V-DBN model operates on
unit frames mainly through clustering and places frames with similar action structures
into one category, which destroys the ordering of the skeleton sequence. To overcome this
problem, an LSTM model is used to process the continuous feature data at each moment
[17]. After processing, training on (and learning) the human body feature data can be
completed, as shown in Fig. 4 [18].
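The way an LSTM retains the frame order that clustering discards can be seen in a bare forward pass. The sketch below is a generic single-layer LSTM written in NumPy with random placeholder weights. The gate ordering, the hidden size, and the function name `lstm_forward` are assumptions, and a trained model from a deep learning library would be used in practice.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_forward(sequence, W, U, b):
    """Run a single-layer LSTM over a clip.

    sequence: (F, d_in) per-frame features, in temporal order.
    W: (d_in, 4 * H) input weights, U: (H, 4 * H) recurrent weights, b: (4 * H,).
    Returns the hidden states, shape (F, H), one per frame.
    """
    hidden = U.shape[0]
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    states = []
    for x_t in sequence:
        gates = x_t @ W + h @ U + b
        # Assumed gate layout: input, forget, output, candidate.
        i, f, o, g = np.split(gates, 4)
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        g = np.tanh(g)
        c = f * c + i * g        # the cell state carries information across frames
        h = o * np.tanh(c)
        states.append(h)
    return np.stack(states)

# Usage with placeholder weights: 20 frames of 9-dimensional features, H = 8.
rng = np.random.default_rng(0)
clip = rng.normal(size=(20, 9))
W = 0.1 * rng.normal(size=(9, 32))
U = 0.1 * rng.normal(size=(8, 32))
b = np.zeros(32)
states = lstm_forward(clip, W, U, b)
```

Feeding the same frames in reverse order yields a different final hidden state, which is the order sensitivity that a cluster-based encoding lacks.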
In Fig. 4, the sequence feature data are arranged in frame order and processed by the LSTM
model. The acquired feature data are input to the LSTM model in temporal order,
so the distribution of the action sequence features along the time dimension
can be obtained. For a more detailed description of bone distributions in the same
frame, the spatial angle feature is optimized on the basis of quaternions, and three
imaginary parts and one real part are used to describe the rotation point and retrograde
of the bone action. The corresponding coordinate relationship is constructed as seen
in Fig. 5 [19].
As shown in Fig. 5, vectors are used to describe the human skeleton, and the angle between
the bones is solved for each frame. The first step is to collect human bone data and
obtain two vectors, $v_{1}$ and $v_{2}$, each of which has three coordinate
components; the expression of the bone vector is given in formula (11):
In formula (11), $u_{x},u_{y},u_{z}$ represent the coordinate position information of the bone vector.
The rotation angle between the two vectors is then calculated, as given in formula (12):
By solving for the rotation angle between the bone vectors, the quaternion coefficients
can be obtained from formulas (11) and (12). The real coefficient of the quaternion is given in formula (13) [20]:
In formula (13), $q_{0}$ represents the real coefficient. The first imaginary coefficient, $q_{1}$,
is given in formula (14):
In formula (14), $u_{x}$ represents the $x$-axis component of the bone vector. The second imaginary coefficient,
$q_{2}$, is given in formula (15):
In formula (15), $u_{y}$ represents the $y$-axis component of the bone vector. The third imaginary coefficient,
$q_{3}$, is given in formula (16):
In formula (16), $u_{z}$ represents the $z$-axis component of the bone vector. The quaternion coefficients
are thus obtained by solving for the angle between the vectors. Using the bone
vector features in Table 2, action features can be recognized effectively.
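The quaternion construction can be sketched with the standard axis-angle form, assuming that $u=(u_{x},u_{y},u_{z})$ is the unit rotation axis $v_{1}\times v_{2}/\lVert v_{1}\times v_{2}\rVert$, that $\theta$ is the angle between $v_{1}$ and $v_{2}$, and that $q_{0}=\cos(\theta/2)$ and $(q_{1},q_{2},q_{3})=u\sin(\theta/2)$. Whether formulas (11) to (16) use exactly this convention is an assumption, and the function name `bone_quaternion` is introduced here for illustration.

```python
import numpy as np

def bone_quaternion(v1, v2, eps=1e-12):
    """Quaternion (q0, q1, q2, q3) rotating bone vector v1 onto v2.

    Both vectors must be non-zero 3D vectors.
    """
    v1 = np.asarray(v1, dtype=float)
    v2 = np.asarray(v2, dtype=float)
    # Rotation angle between the two bone vectors.
    cos_theta = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))
    # Rotation axis: normalized cross product.
    axis = np.cross(v1, v2)
    axis_norm = np.linalg.norm(axis)
    if axis_norm < eps:
        # Parallel or anti-parallel bones: any axis perpendicular to v1 works.
        helper = np.array([1.0, 0.0, 0.0]) if abs(v1[0]) < abs(v1[1]) \
            else np.array([0.0, 1.0, 0.0])
        axis = np.cross(v1, helper)
        axis_norm = np.linalg.norm(axis)
    u = axis / axis_norm
    half = theta / 2.0
    # One real part and three imaginary parts.
    return np.concatenate([[np.cos(half)], u * np.sin(half)])

# Usage: a 90-degree turn about the z axis gives approximately (0.7071, 0, 0, 0.7071).
q = bone_quaternion([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
```

Clipping the cosine before `arccos` guards against rounding errors that would otherwise produce NaN for nearly parallel bones.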
The LSTM model processes the continuous feature data of the human skeleton
at each moment, which addresses the broken relationship between consecutive
frames of continuous skeleton movements. The entire process of young
children's football movement recognition is shown in Fig. 6.
First, Kinect devices are used to collect young children's football movement data and to build
the original motion recognition library. The bone feature data are then extracted, the length
of the motion features is unified through the VLAD model, the feature data are optimized using the DBN
model, and the continuous motion feature data are optimized using the LSTM model. Finally,
the recognition and analysis of young children's football movement are completed.
Fig. 4. Schematic diagram of the LSTM model processing continuous feature data.
Fig. 5. Skeleton origin coordinates.
Fig. 6. The entire toddler soccer movement recognition process.
Fig. 6. Action recognition results for different iterations.
Table 2. Skeleton vector angle feature selection table.
| Angular Features | B1 | B2 | B3 | B4 | B5 |
|---|---|---|---|---|---|
| Skeletal Vector V1 | (1,3), (7,9), (10,12), (13,15) | (1,3), (4,6), (13,15), (11,12), (13,15) | (1,3), (4,6), (13,15), (8,9), (8,9) | (7,9), (10,12), (1,3), (5,6) | (12,6), (6,1), (1,9), (9,15) |
| Skeletal Vector V2 | (4,6) | (7,9) | (10,12) | (13,15) | (9,15), (15,12), (6,1), (1,9) |