IEIESPC(IEIE Transactions on Smart Processing and Computing)

IEIESPC Vol. 12, No. 04, p.312-322

ISSN (online) :

2287-5255

Received : 20 February 2023; Accepted : 28 March 2023

DOI :

https://doi.org/10.5573/IEIESPC.2023.12.4.312

Regular Paper

Children’s Football Action Recognition based on LSTM and a V-DBN

Zhaosheng Chen^1, Na Chen^2,^*

(Department of Physical Education, Yang-En University, Quanzhou, 362014, China)
(Ministry of Sports, Xiamen Institute of Technology, Xiamen, 361021, China Na04Chen@outlook.com)

^* Corresponding Author: Na Chen

License :

This is an Open-Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.(www.theieie.org).

Abstract

To improve the teaching of football to children, this paper proposes a V-DBN model for football action recognition based on 3D skeleton recognition, which combines the Vector of Locally Aggregated Descriptors (VLAD) model with a deep belief network (DBN). The contrastive divergence method is used to reduce the dimensionality of the action features, and action recognition is completed through analysis of key parameters. In tests on the MSRAction3D data set, Grassmann manifold and graph-based action classification and recognition reach accuracies of 79.2% and 93.4%, respectively, after 100 iterations of training, whereas the V-DBN reaches 98.6%. On the UTKinect-Action database, the average recognition rates of the Grassmann manifold and graph-based methods are 88.38% and 91.31%, respectively, while the V-DBN reaches 93.96%, the best overall performance. However, its recognition of some individual actions is only average. When LSTM is used to optimize the model for children’s football action recognition, the average accuracy of LSTM+V-DBN is 0.981, compared with 0.892 for the V-DBN alone. The optimized LSTM+V-DBN model therefore performs better in children’s action recognition. This research provides a reference for applying human action recognition technology in children’s football education.

Keywords

Football, Action recognition, V-DBN, LSTM, Children’s features

1. Introduction

Football is one of the most popular sports in modern society, valued for its entertainment and fitness benefits. Football education is gradually becoming popular in early childhood education and is well liked by teachers and students. However, children’s football education needs to take account of children’s physical and mental characteristics, including their physical strength, mental abilities, and sports skills, so that a reasonable training plan can be customized according to their developmental needs. Identifying and tracking children’s movement characteristics with existing human motion recognition technology can therefore promote the development of football education more effectively and help ensure the children’s safety. At present, human motion recognition is widely used in security monitoring, somatosensory games, unmanned control, and medical and health fields, and its practical effect differs across application environments. Motion feature data are extracted through 3D vision technology, and both human behavior and the motion scene affect recognition. Therefore, based on human skeleton point recognition, the Vector of Locally Aggregated Descriptors (VLAD) model is fused with a deep belief network (DBN) into a V-DBN model for human action recognition. Because the V-DBN model cannot deal well with the continuity between skeleton sequences in diverse football environments, a spatio-temporal dual-stream Long Short-term Memory (LSTM) network is used to optimize the model. Compared with traditional human motion recognition technology, 3D skeleton recognition is more accurate, but it still faces problems such as complex recognition scenes and difficulty in calculating structural features. This research offers a reference for promoting the development of human motion recognition technology in the field of education.

2. Related Work

Human motion recognition is widely used in security monitoring, games, sports training, and other fields. Through real-time monitoring of human behavior and movements, human motion trajectories can be captured to improve ergonomics. Experts at home and abroad have conducted corresponding research on human action recognition. Jaouedi et al. conducted research on human motion features, and capturing such features is conducive to enhancing the security field ^[1]. An evaluation method was used to extract human behavioral characteristics, and neural network training captured the behaviors that result when humans appear in different environments. The features of the recognition data are extracted, and the effective recognition of human actions is completed through a mixed action recognition model. After testing, the proposed method can be effectively applied in the security field, and the accuracy rate of the data set test was above 96%. Kong and Fu conducted research on existing vision technology and found that it can effectively predict and infer the behavior of the target state ^[2]. A human behavior recognition model was proposed on the basis of vision technology, and the model was applied to the field of human motion prediction. This field is suitable for installation in unmanned vehicles, for security monitoring, and other fields, and the relevant database is selected to verify the effect of the model. Tests showed that the proposed recognition model had good predictive performance and can be effectively applied to future recognition environments. Tu et al. conducted a comparative study of convolutional networks. Although the technology showed good performance in motion recognition, it still faces difficulties in capturing complex human motions ^[3]. 
Therefore, based on motion-enhanced space vector technology, recognition of, and judgments about, human motion are completed through self-adaptive segmentation and self-adaptive feature data extraction, and the final motion capture is completed by selecting keyframes. The scheme was verified using various data sets, and the proposed method can effectively collect human body features and provide advanced recognition performance. Kong and Fu conducted research on the existing visual technology, and the use of visual technology in the field of video action recognition has been able to predict unknown states quite well. Therefore, on the basis of visual action technology, a technology was proposed to infer and capture the human action state ^[4]. Such a technology can be effectively applied to the field of unmanned driving and video surveillance. Tests of the proposed program showed good feasibility.

Football is very popular in modern society, but effective football training requires an understanding of players’ movement characteristics. In recent years, deep learning has been applied to football action recognition, providing important support for the development of football training. Quaid and Jalal studied people’s behavioral characteristics and found that the signals change with time, but recognizing the behavioral characteristics of football players remains a problem. Therefore, it is necessary to observe athletes’ physical and mental state by monitoring muscles, daily life, and energy consumption ^[5]. On this basis, they proposed a genetic-model approach that monitors the behavioral data of athletes through sensors, analyzes the correlation between human behavior and the data, and extracts effective feature information. The authors verified on a data set that their scheme had excellent recognition results. Wang proposed a football recognition scheme to promote the development of football. The scheme first identifies and extracts static images of the football, then uses visual technology to collect dynamic and static image data, and completes data collection on a football platform. Finally, feature information from the data set is calculated, and the football trajectory is obtained. Through experiments, Wang found that the greater the training volume for the proposed scheme, the higher the accuracy. Compared with schemes of the same type, the recognition rate of Wang’s scheme was higher than 80.1%, which meets football recognition requirements ^[6]. Cuperman et al. found that compiling statistics on the football process is conducive to the development of the sport.
Research on existing action recognition technology found problems such as a low recognition effect, so a sensor was used to extract sports-related data, and a variety of human football sports data was tested under deep learning, including running, shooting, sprinting, etc. ^[7]. After testing, the proposed recognition technology proved better than traditional single-action football recognition, with an accuracy of 98.3%, which meets the football evaluation requirements. Cai and Zhao found that visual technology can effectively capture students’ facial expressions and movements, which is of great significance for teachers who want to optimize teaching and improve teaching quality. Therefore, visual action recognition technology was applied to the field of football training to collect student actions and expressions and to realize extraction of human action feature data in an attention model to complete the identification of student movement features ^[8]. A test of the program showed it has good motion recognition and is better than other recognition programs.

According to the above studies, human behavior and movement technology have made breakthroughs in recent years. In particular, the application of intelligent technologies, such as deep learning in the field of human motion recognition, has significantly improved the effect of human body recognition. In children’s football, the use of human action recognition technology integrated with deep learning will improve the effect of children’s football training, and has important reference value for the development of football education.

3. Algorithm for Recognizing Children’s Football Movements

3.1 Construction of the V-DBN Model

The characteristics of children’s physical and mental development are different from those of adults. It is necessary to grasp the characteristics of children’s sports development when carrying out football training to avoid injuries and ensure the development of children’s football education. Therefore, the characteristics of children’s football movements are collected based on 3D bone recognition, and recognition of children’s movements can be completed through the proposed V-DBN model. In children’s movements, 3D skeleton recognition completes the extraction of motion features with aggregate data (skeleton position data), and provides them to the human motion recognition model ^[9]. The human motion data are collected by the Kinect device, and the human skeleton point collection structure can be seen in Fig. 1.

Position information for key human bones is collected, and each point represents position information and time data, with a bone point in each frame represented by a row. In this research, bone recognition is divided into two types. Scheme A uses the frame selection range of bone points as a group to solve time series features, as shown in Fig. 2(a). Scheme B divides the skeleton into three layers from the outside to the inside, as shown in Fig. 2(b) ^[10]. The skeletal point division method of Scheme A can better distinguish changes in the human body, but lacks consideration of changes within the bones, whereas Scheme B fully considers the effects of each bone joint, but lacks consideration of changes in specific parts. On the whole, Scheme B considers the positional relationship of each bone point, so it was used to calculate the time series characteristics of human actions.

Time series and spatial features are used to reflect the changes in human motion. In the description of the time series relationship, the movement of each bone point is represented by displacement $x$, velocity $v$, and acceleration $a$; the displacement expression is shown in formula (1) ^[11]:

(1)

$ x_{i}^{f}=p_{i}^{f}-p_{i}^{1},1<f<c,1\leq i\leq 15 $

In formula (1), $p$ is a bone point’s three-dimensional spatial position parameter $(x,y,z)$, $i$ represents a certain bone point of the person, and $f$ represents the current frame. The velocity expression is shown in formula (2):

(2)

$ v_{i}^{f}=\frac{p_{i}^{f+1}-p_{i}^{f-1}}{\Delta t},1<f<c,1\leq i\leq 15 $

In formula (2), $\Delta t$ represents the time interval spanned by frames $\left[f-1,f+1\right]$, and the acceleration expression is shown in formula (3):

(3)

$ a_{i}^{f}=\frac{p_{i}^{f+1}-p_{i}^{f-1}}{\Delta t^{2}},1<f<c,1\leq i\leq 15 $

In order to describe spatial positions in human motion more conveniently, three time series are defined and processed, as shown in formula (4):

(4)

$ T=\left[T_{1},T_{2},T_{3}\right] $
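As a rough illustration of formulas (1), (2), and (4), the following NumPy sketch computes the three temporal features for the interior frames of a skeleton sequence and concatenates them. The function name, the `frame_step` parameter, and the array layout (frames, joints, coordinates) are assumptions made for this example. Formula (3) as printed shares its numerator with formula (2); the sketch instead uses the standard second-order central difference for acceleration, which is an assumption about the intended definition and not a statement of the authors' implementation.

```python
import numpy as np

def temporal_features(p, frame_step=1.0):
    """Return [displacement, velocity, acceleration] for interior frames.

    p has shape (c, joints, 3). frame_step is the (assumed) time between
    consecutive frames, so the interval [f-1, f+1] has length 2 * frame_step.
    """
    p = np.asarray(p, dtype=float)
    inner = p[1:-1]                                   # frames with both neighbours
    disp = inner - p[0]                               # formula (1): offset from first frame
    vel = (p[2:] - p[:-2]) / (2.0 * frame_step)       # formula (2): central difference
    # Assumption: standard second central difference, not the printed formula (3).
    acc = (p[2:] - 2.0 * inner + p[:-2]) / frame_step ** 2
    return np.concatenate([disp, vel, acc], axis=-1)  # formula (4): T = [T1, T2, T3]

# Toy sequence: every joint moves along x as f**2 over 6 frames.
frames = np.arange(6, dtype=float)
p = np.zeros((6, 15, 3))
p[:, :, 0] = (frames ** 2)[:, None]
T = temporal_features(p)
print(T.shape)      # (4, 15, 9)
print(T[0, 0])      # displacement, velocity, acceleration of joint 0 at the second frame
```

For slow movements the velocity and acceleration blocks stay close to zero, which matches the remark that such movements are described mainly by displacement.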

Considering that human action features have the characteristics of both space sequences and time series, the time series describes the movement of bone points through three types of motion features, and there are certain differences between the features due to the differences in human actions ^[12]. Especially when people are moving slowly, there is not a big difference between speed and acceleration in people’s movement, and movement is mainly described by displacement. Then, the positional relationship between the reference point and the bone point is shown in formula (5):

(5)

$ p_{i,j}^{f}=p_{i}^{f}-p_{j}^{f} $

In formula (5), $p_{i}^{f}$ and $p_{j}^{f}$ represent the spatial positions of skeleton node $i$ and reference node $j$ at frame $f$, respectively, and $p_{i,j}^{f}$ is their relative position. Table 1 shows the relationships between the four relative spatial position parameters.

The spatial position feature is similar to the time series expression, and can be described by concatenation, as shown in formula (6):

(6)

$ S=\left[S_{1},S_{2},S_{3},S_{4}\right] $
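The relative-position features of formula (5) and their concatenation in formula (6) can be sketched as follows. The subsets and reference joints follow Table 1, but the joint index mapping `JOINT` is hypothetical (the source does not list the indices), as are the function and variable names.

```python
import numpy as np

# Hypothetical joint indices, for illustration only.
JOINT = {"head": 0, "spine": 1, "left_hand": 2, "right_hand": 3,
         "left_hip": 4, "right_hip": 5, "left_foot": 6, "right_foot": 7}

# (skeletal point subset, reference joint) for Space 1..4, as in Table 1.
SPACES = [
    (("right_hand", "left_hand"), "head"),
    (("right_foot", "right_hand", "head"), "left_hip"),
    (("left_hand", "left_foot", "head"), "right_hip"),
    (("right_foot", "left_foot", "head"), "spine"),
]

def spatial_features(p):
    """Return [S1, S2, S3, S4]; each S_m holds p_i^f - p_j^f per frame."""
    p = np.asarray(p, dtype=float)
    parts = []
    for subset, ref in SPACES:
        idx = [JOINT[name] for name in subset]
        rel = p[:, idx, :] - p[:, [JOINT[ref]], :]   # formula (5), broadcast over subset
        parts.append(rel.reshape(len(p), -1))        # one row of features per frame
    return parts

rng = np.random.default_rng(0)
p = rng.normal(size=(5, 8, 3))            # 5 frames, 8 joints, (x, y, z)
S = spatial_features(p)
S_all = np.concatenate(S, axis=1)         # formula (6): S = [S1, S2, S3, S4]
print([s.shape for s in S], S_all.shape)
```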

When a single action feature is used, similar actions receive nearly identical descriptions, especially when the distance between classes is small, and the actions cannot be distinguished ^[13]. For example, running and walking are close in description, which affects the action recognition effect. Combining multiple features and expanding the distance between different actions can therefore improve the effectiveness of human action recognition. For this reason, the V-DBN action recognition model (combining the VLAD model and the DBN network) is used for optimization. The VLAD model encodes the action features so that they have a unified length and are described more fully ^[14]. An arbitrary time series is expressed as in formula (7):

(7)

$ T_{n}=\left[t_{n}^{1},t_{n}^{2},\ldots ,t_{n}^{l},\ldots ,t_{n}^{L}\right] $

In (7), $n$ takes a value from $\left\{1,2,3\right\}$, $t_{n}^{l}$ represents the $l$-th feature subsequence in the time dimension, and $L$ represents the total number of subsequences. Any spatial position sequence can then be expressed as in (8):

(8)

$ S_{m}=\left[S_{m}^{1},S_{m}^{2},\ldots ,S_{m}^{l},\ldots ,S_{m}^{L}\right] $

In formula (8), $S_{m}^{l}$ represents the $l$-th feature subsequence of the $m$-th spatial position, and $m$ takes a value from $\left\{1,2,3,4\right\}$. The temporal and spatial feature sequences are converted into frames, and a total of 24 cluster centers are obtained through the clustering algorithm ^[15]. The Euclidean distance between each cluster center and each feature vector is used to assign the data to 24 clusters; the expression for the time clusters is given in formula (9):

(9)

$ NN_{u,n,qn}=\left\{t_{n,f}\left| q_{n}=\underset{p}{\arg\min}\left\| t_{n,f}-\mu _{u,np}\right\| \right.\right\} $

In formula (9), $NN_{u,n,qn}$ represents the time clustering center, $t_{n,f}$ represents the sequence unit time frame, and $\mu _{u,np}$ represents the distance coefficient corresponding to time. Then, the spatial cluster expression is seen in (10):

(10)

$ NN_{u,m,qm}=\left\{s_{m,f}\left| q_{m}=\underset{p}{\arg\min}\left\| s_{m,f}-\mu _{u,mp}\right\| \right.\right\} $

In (10), $NN_{u,m,qm}$ represents the spatial clustering center, $s_{m,f}$ represents the sequence unit spatial frame, and $\mu _{u,mp}$ represents the distance coefficient corresponding to space. Each human action, $U\times 23$, is described by vectors of the same dimension, and each action vector contains four space sequences and three time series features ^[16]. The large amount of data affects human action recognition. Therefore, the contrastive divergence method is used to reduce the dimensionality of the data, and the principle of the V-DBN’s football action recognition for children can be seen in Fig. 3.
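The nearest-center assignment in formulas (9) and (10) is the first step of a VLAD-style encoding. The sketch below shows one common way to turn a variable-length sequence of frame features into a fixed-length vector: each frame is assigned to its nearest center by Euclidean distance, and the residuals to that center are summed. The centers are assumed to be given (for example from k-means), the final L2 normalization is a conventional VLAD step not stated in the text, and all names are illustrative.

```python
import numpy as np

def vlad_encode(frames, centers):
    """Encode a (num_frames, dim) sequence as a fixed-length VLAD vector."""
    frames = np.asarray(frames, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # Euclidean distance from every frame to every center.
    dists = np.linalg.norm(frames[:, None, :] - centers[None, :, :], axis=2)
    nearest = dists.argmin(axis=1)                # arg min over centers, as in (9)/(10)
    vlad = np.zeros_like(centers)
    for k in range(len(centers)):
        members = frames[nearest == k]
        if len(members):
            vlad[k] = (members - centers[k]).sum(axis=0)   # accumulate residuals
    vec = vlad.ravel()
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec        # assumed L2 normalization

centers = np.array([[0.0, 0.0], [10.0, 10.0]])    # 2 toy centers (the paper uses 24)
short_action = np.array([[1.0, 0.0], [0.0, 1.0], [11.0, 10.0]])
long_action = np.vstack([short_action, short_action, [[9.0, 10.0]]])
code_short = vlad_encode(short_action, centers)
code_long = vlad_encode(long_action, centers)
print(code_short.shape, code_long.shape)          # both have length 4
```

Sequences of 3 and 7 frames produce codes of the same length, which is the property the VLAD stage is used for here.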

In Fig. 3, the VLAD model is responsible for unification of feature data. In the RBM layer, the contrastive divergence algorithm is mainly used to complete the simulation of the initial feature data, and optimizes the data to improve the feature extraction effect.
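For readers unfamiliar with contrastive divergence, a minimal single-step (CD-1) update for one RBM layer is sketched below. It is a generic textbook form under assumed settings (binary toy data, sigmoid units, zero-mean normal weight initialization, a learning rate of 0.1), not the authors' configuration; the function and variable names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_step(v0, W, b_vis, b_hid, lr, rng):
    """One CD-1 update on a batch v0 (values in [0, 1]); returns reconstruction error."""
    h0_prob = sigmoid(v0 @ W + b_hid)                          # positive phase
    h0 = (rng.random(h0_prob.shape) < h0_prob).astype(float)   # sample hidden units
    v1 = sigmoid(h0 @ W.T + b_vis)                             # reconstruct visible layer
    h1_prob = sigmoid(v1 @ W + b_hid)                          # negative phase
    n = v0.shape[0]
    W += lr * (v0.T @ h0_prob - v1.T @ h1_prob) / n            # in-place parameter updates
    b_vis += lr * (v0 - v1).mean(axis=0)
    b_hid += lr * (h0_prob - h1_prob).mean(axis=0)
    return float(np.mean((v0 - v1) ** 2))

rng = np.random.default_rng(0)
data = (rng.random((20, 6)) < 0.5).astype(float)       # toy binary features
variance = 0.01                                        # assumed initialization variance
W = rng.normal(0.0, np.sqrt(variance), size=(6, 4))    # 6 visible, 4 hidden units
b_vis = np.zeros(6)
b_hid = np.zeros(4)
errors = [cd1_step(data, W, b_vis, b_hid, lr=0.1, rng=rng) for _ in range(5)]
print(errors)
```

The hidden layer has fewer units than the visible layer, which is how an RBM stack can act as a dimensionality-reducing feature extractor.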

Fig. 1. Distributions of the different skeleton points.

Fig. 2. Divisions for two kinds of sports bones.

Fig. 3. Schematic diagram of the V-DBN action recognition model.

Table 1. Corresponding results of spatial positions.

Spatial description operator	Space 1	Space 2	Space 3	Space 4
Skeletal Point Subset	Right Hand and Left Hand	Right Foot, Right Hand, Head	Left Hand, Left Foot, Head	Right Foot, Left Foot, Head
Reference Frame	Head	Left hip	Right hip	Spine

3.2 Construction of the LSTM Model

In the V-DBN action recognition model, the VLAD model unifies the length of the action feature data, but the relationship between the upper and lower frames of continuous skeletal action is interrupted, which is not conducive to the recognition of human action. To overcome the above problems, the LSTM model optimizes the process to recognize human actions ^[17]. Since V-DBN human action model recognition mainly uses the clustering method to realize the operation of the unit frame, and puts frames with similar action structures into one category, the relationship between the front and back sequences of the skeleton is destroyed, and continuous feature data of each moment are processed via LSTM. After processing, training on (and learning) the human body feature data can be completed, which is shown in Fig. 4 ^[18].
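To make the role of frame order concrete, the sketch below runs a sequence of per-frame feature vectors through a standard LSTM cell and returns the last hidden state. It uses the textbook gate equations with randomly initialized weights; the dimensions, names, and gate ordering are assumptions for illustration and do not reproduce the network used in the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_last_state(seq, W, U, b):
    """Run a standard LSTM cell over seq (T, D) and return the final hidden state."""
    hidden = U.shape[1]
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    for x in seq:                                   # frames are consumed in time order
        z = W @ x + U @ h + b                       # stacked pre-activations, length 4 * hidden
        i, f, o, g = np.split(z, 4)                 # input, forget, output gates and candidate
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
    return h

rng = np.random.default_rng(0)
D, H, T = 3, 4, 5                                   # feature size, hidden size, frames
W = rng.normal(size=(4 * H, D))
U = rng.normal(size=(4 * H, H))
b = np.zeros(4 * H)
seq = rng.normal(size=(T, D))
h_forward = lstm_last_state(seq, W, U, b)
h_reversed = lstm_last_state(seq[::-1], W, U, b)
print(h_forward, h_reversed)
```

Feeding the same frames in reverse order yields a different state, whereas an order-free clustering of frames would not distinguish the two sequences.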

Fig. 4 is arranged in frame order, and sequence feature data are processed through the LSTM model. The acquired feature data are input to the LSTM model in a time relationship, so the distribution relationship of the action sequence features in the time dimension can be obtained. For a more detailed description of bone distributions in the same frame, the spatial angle feature is optimized on the basis of quaternions, and three imaginary parts and one real part are used to describe the rotation point and retrograde of the bone action. The corresponding coordinate relationship is constructed as seen in Fig. 5 ^[19].

As shown in Fig. 5, a vector is used to describe the human skeleton relationship, and the angle between the bones of each frame is solved. The first step is to collect human bone data, and obtain two vectors, $v_{1}$ and $v_{2}$, where each vector has three pieces of coordinate position information; then, the expression of the bone vector is seen in (11):

(11)

$ u\left(u_{x},u_{y},u_{z}\right)=v_{1}\times v_{2} $

In (11), $u_{x},u_{y},u_{z}$ represents the coordinate position information of the bone vector. We then calculate the rotation angle between the two vectors, as seen in (12):

(12)

$ \cos \theta =\frac{v_{1}\cdot v_{2}}{\left| v_{1}\right| \left| v_{2}\right| } $

By solving the rotation angle parameters between bone vectors, the coefficient corresponding to the quaternion can be obtained from (11) and (12), and the real coefficient of the quaternion is as seen in (13) ^[20]:

(13)

$ q_{0}=\cos \frac{\theta }{2} $

In (13), $q_{0}$ represents the real coefficient. The first imaginary coefficient, $q_{1}$, is expressed as seen in (14):

(14)

$ q_{1}=u_{x}\sin \frac{\theta }{2} $

In (14), $u_{x}$ represents the $x$-axis component of bone vector $u$. The second imaginary coefficient, $q_{2}$, is expressed in (15):

(15)

$ q_{2}=u_{y}\sin \frac{\theta }{2} $

In (15), $u_{y}$ represents the $y$-axis component of bone vector $u$. The third imaginary coefficient, $q_{3}$, is expressed in (16):

(16)

$ q_{3}=u_{z}\sin \frac{\theta }{2} $

In (16), $u_{z}$ represents the $z$-axis component of bone vector $u$. The coefficients of the quaternion are thus obtained by solving the angle between the vectors. Using the bone vector features in Table 2, recognition of human action features can be effectively realized.
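Formulas (11) to (16) can be combined into one small function, sketched here with NumPy. One assumption is added: the cross product from formula (11) is normalized to unit length before use, which is the usual requirement for an axis-angle quaternion and is not stated explicitly in the text. The function name is illustrative.

```python
import numpy as np

def bone_quaternion(v1, v2):
    """Quaternion (q0, q1, q2, q3) describing the rotation from bone v1 to bone v2."""
    v1 = np.asarray(v1, dtype=float)
    v2 = np.asarray(v2, dtype=float)
    axis = np.cross(v1, v2)                                   # formula (11)
    cos_theta = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))  # formula (12)
    theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))          # clip guards rounding error
    axis_norm = np.linalg.norm(axis)
    if axis_norm < 1e-12:
        # Parallel or anti-parallel bones: the axis is undefined; this sketch
        # leaves the imaginary part at zero.
        u = np.zeros(3)
    else:
        u = axis / axis_norm                                  # assumption: unit rotation axis
    q0 = np.cos(theta / 2.0)                                  # formula (13)
    q123 = u * np.sin(theta / 2.0)                            # formulas (14)-(16)
    return np.array([q0, q123[0], q123[1], q123[2]])

q = bone_quaternion([2.0, 0.0, 0.0], [0.0, 3.0, 0.0])
print(q)   # 90 degree rotation about the z axis
```

Because the axis is normalized and the angle comes from the normalized dot product, the result does not depend on bone lengths, only on their directions.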

The LSTM model is used to process the continuous feature data of the human skeleton at each moment to solve the problem of breaking the relationship between the upper and lower frames of the continuous skeleton movements. The entire process of young children's football movement recognition is shown in Fig. 6.

First, Kinect devices are used to collect young children’s football movement data and build the original motion recognition library. The bone feature data are then extracted, the length of the motion features is unified through the VLAD model, the feature data are optimized using the DBN model, and the continuous motion feature data are optimized using the LSTM model. Finally, recognition and analysis of young children’s football movements are completed.

Fig. 4. Schematic diagram of the LSTM model processing continuous feature data.

Fig. 5. Skeleton origin coordinates.

Fig. 6. The entire toddler soccer movement recognition process.

Fig. 6. Action recognition results for different iterations.

Table 2. Skeleton vector angle feature selection table.

Angular Features	B1	B2	B3	B4	B5
Skeletal Vector V1	(1,3), (7,9), (10,12), (13,15)	(1,3), (4,6), (13,15), (11,12), (13,15)	(1,3), (4,6), (13,15), (8,9), (8,9)	(7,9), (10,12), (1,3), (5,6)	(12,6), (6,1), (1,9), (9,15)
Skeletal Vector V2	(4,6)	(7,9)	(10,12)	(13,15)	(9,15), (15,12), (6,1), (1,9)

4. Action Recognition Simulation

To verify the proposed action recognition algorithm, a test platform was set up with Windows 10 on an Intel i7 processor with 32GB of memory. The tests used the MSRAction3D and UTKinect-Action data sets, both of which are classic benchmarks in skeleton-based testing. MSRAction3D is composed of 20 different actions, with each action completed by different individuals. The UTKinect-Action data set comprises 10 kinds of action that include sitting down, standing up, walking, moving, throwing, pushing, pulling, picking up, swinging, and clapping. Grassmann manifold ^[21] and graph-based action classification and recognition ^[22] were used for comparison. The model training set was the 1st, 3rd, 5th, and 9th target actions, and the model test set was the 2nd, 4th, 6th, and 10th target actions. Classification accuracy of the algorithms for different numbers of iterations on the MSRAction3D action data is shown in Fig. 6.

Fig. 6(a) shows the classification results of each recognition algorithm over 50 iterations of training. At zero iterations, the accuracies of the Grassmann manifold, graph-based, and proposed V-DBN methods were 71.2%, 86.6%, and 77.3%, respectively. After 50 iterations, the accuracy of V-DBN was 93.2%, that of the graph-based method was 93.7%, and that of Grassmann manifold was 79.6%. Grassmann manifold accuracy tended to be stable after 20 iterations. Fig. 6(b) shows the classification results of each recognition algorithm over 100 iterations of training. Graph-based recognition accuracy reached 94.9% after 50 iterations and then decreased, reaching 93.4% after 100 iterations, whereas Grassmann manifold accuracy was 79.2%. V-DBN had an accuracy of 98.6% after 100 iterations, which is significantly better than both Grassmann manifold and graph-based action classification and recognition. In addition, the per-action recognition results of the three algorithms were tested using the UTKinect-Action database, as shown in Table 3.

Table 3 shows the results on the UTKinect-Action database. The proposed V-DBN action recognition algorithm adapted well to the 10 kinds of actions, and its average recognition rate was the highest of the three methods, at 93.96%; the Grassmann manifold and graph-based average recognition rates were 88.38% and 91.31%, respectively. Although the overall recognition effect of V-DBN was better than that of the Grassmann manifold and graph-based approaches, its accuracy on some individual actions, such as Pull at 82.6%, was relatively low, so there is still room for improvement. In actual model training, the initialization parameters also affect the accuracy of action recognition. To select the best weight matrix initialization strategy, the weights were drawn from a normal distribution with a mean of 0 and variances of 1, 0.1, and 0.01, and the contrastive divergence method was used to test the hidden layer of the model. Test results are shown in Fig. 7.

From the normal distribution initialization weight matrix test, Fig. 7(a) shows reconstruction error results from different variances. The variance reflects the effect of the hidden layer on the data simulation; the lower the error value, the better the test effect. When the variance was 1, 0.1, and 0.001, the error values were 700, 210, and 198 when the number of iterations was 1, and they were 7, 8, and 8 at five iterations. Therefore, when the variance was set to 1, the final error was the smallest, and the model provided better performance. Fig. 7(b) shows the reconstruction error results under different learning rates. Different learning rate settings will affect the model training convergence effect and the data recognition effect. We set the learning rate to 1, 0.5, and 0.1, and after five iterations, when the learning rate was set to 1, the average error was 6, and the model obtained the best test results.

At the same time, considering the shortcomings of V-DBN in certain action recognitions and classifications, LSTM was used to optimize the V-DBN model structure, and dropout was used to improve the generalization effect of input action recognition data. In order to get closer to the actual effect of children’s football, a total of 400 pieces of children’s football-related action data were collected using the Kinect device. Four children participated in creating the data, and each child performed 10 actions 10 times to ensure accuracy. The rest of the settings were consistent with the previous parameters; the learning rate was set to 1, the variance was set to 1, and the action recognition tests performed. The experiment consisted of two parts: spatial location action feature recognition and spatial dual-stream action feature recognition. Fig. 8 shows the results from recognizing action features in spatial locations.

Fig. 8(a) shows that with 100 iterations, the LSTM+V-DBN model had the highest action recognition rate at 0.923, while the accuracy of V-DBN, Grassmann manifold, and graph-based action recognition were 0.832, 0.802, and 0.815, respectively. In general, LSTM+V-DBN improved action recognition by 9.6%, compared with V-DBN alone. Fig. 8(b) shows action recognition results after 200 iterations. Considering that the LSTM+V-DBN model recognizes action types through the fusion of four bone sets in position feature recognition, the actual training requires multiple iterations. In the 200-iteration test, the effects of the V-DBN, Grassmann manifold, and graph-based models did not change much, while the recognition accuracy of LSTM+V-DBN after 200 iterations was 0.963. This shows that the position of the skeleton set integrated in the LSTM+V-DBN information required more training iterations but significantly improved the model’s recognition of human actions. Fig. 9 shows the action feature recognition results from the spatial two-stream LSTM.

Because the Grassmann manifold and graph-based approaches cannot be compared under unified parameters for the quaternion spatio-temporal features, only the action recognition effects of the LSTM+V-DBN and V-DBN models were tested. Fig. 9(a) shows the action recognition results after 100 iterations. The results show that LSTM+V-DBN recognized actions more clearly than V-DBN and with higher accuracy. At 100 iterations, the recognition accuracy rates of LSTM+V-DBN and V-DBN were 0.932 and 0.816, respectively. Fig. 9(b) shows the action recognition results after 200 iterations. After 100 iterations, the action recognition accuracy of the V-DBN model tended to be stable, while LSTM+V-DBN needed about 180 iterations to reach a stable recognition rate. After 200 iterations, its final accuracy in action recognition was 0.965. The combined LSTM+V-DBN model therefore had a better effect on human action recognition, mainly because the added spatial angle feature data significantly improved the model’s extraction of complex action features. Common actions in children’s football were then selected for recognition, and the final results are shown in Table 4.

The mean recognition accuracy of LSTM+V-DBN was 0.981, significantly higher than V-DBN’s 0.892 and higher than the Grassmann manifold approach at 0.888 and the graph-based approach at 0.881. In single-action recognition, although the V-DBN model could achieve relatively high results, its accuracy on individual actions such as Pull and Dribble was low. The LSTM+V-DBN optimization effectively alleviates this weakness of V-DBN in single-action recognition. The recognition results for Dribble and Pull were 0.976 and 0.916, respectively, which are significantly better than the 0.846 and 0.826 obtained with V-DBN.

Fig. 7. Error results from reconstructed data under different model parameters.

Fig. 8. Spatial location action feature recognition results.

Fig. 9. Spatial dual-stream action feature recognition results.

Table 3. Accuracy of each algorithm using the UTKinect-Action data set.

Action No.	Action type	Grassmann manifold	Graph-based	V-DBN
1	Carry	98.6%	99.7%	100%
2	Throw	61.7%	60.4%	95.3%
3	Walk	98.3%	97.3%	94.7%
4	Push	63.6%	82.3%	95.4%
5	Clap hands	95.3%	100.0%	93.8%
6	Wave	99.6%	99.8%	96.3%
7	Pick up	100.0%	97.5%	100.0%
8	Stand up	100.0%	90.4%	90.1%
9	Sit down	80.1%	92.6%	91.4%
10	Pull	86.6%	93.1%	82.6%
Mean		88.38%	91.31%	93.96%
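The Mean row of Table 3 can be checked directly from the per-action values. The short script below recomputes it; the dictionary simply transcribes the table, and the variable names are illustrative.

```python
# Per-action accuracies (%) transcribed from Table 3, actions 1 to 10.
table3 = {
    "Grassmann manifold": [98.6, 61.7, 98.3, 63.6, 95.3, 99.6, 100.0, 100.0, 80.1, 86.6],
    "Graph-based": [99.7, 60.4, 97.3, 82.3, 100.0, 99.8, 97.5, 90.4, 92.6, 93.1],
    "V-DBN": [100.0, 95.3, 94.7, 95.4, 93.8, 96.3, 100.0, 90.1, 91.4, 82.6],
}

# Unweighted mean over the 10 actions, rounded to two decimals.
means = {method: round(sum(vals) / len(vals), 2) for method, vals in table3.items()}
for method, mean in means.items():
    print(f"{method}: {mean:.2f}%")
```

The recomputed means agree with the reported 88.38%, 91.31%, and 93.96%.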

Table 4. The results of children’s football action recognition.

Action No.	Action type	Grassmann manifold	Graph-based	V-DBN	LSTM+V-DBN
1	Run fast	0.624	0.702	0.821	0.986
2	Volley	0.823	0.723	0.821	0.976
3	Walk	0.983	0.973	0.947	1.000
4	Juggling	0.915	0.896	0.915	0.987
5	Clap hands	0.953	1.000	0.938	1.000
6	Dribble	0.916	0.786	0.846	0.976
7	Pick up	1.000	0.975	1.000	1.000
8	Stand up	1.000	0.904	0.901	1.000
9	Sit down	0.801	0.926	0.914	0.976
10	Pull	0.866	0.931	0.826	0.916
Mean		0.888	0.881	0.892	0.981

5. Conclusion

Human action recognition is widely used in medicine, gaming, education, and other fields. The combination of deep learning and machine vision recognition technology promotes the development of modern education. To better implement football education for young children, a V-DBN human action recognition algorithm based on human skeleton recognition was proposed, which describes human motion relationships by analyzing time series and spatial position sequence features. Considering the impact of V-DBN on the continuity between skeleton sequences, LSTM was used to optimize the VLAD model so that it describes the characteristics of human motion more accurately. A performance test showed that with UTKinect-Action data, V-DBN had the best overall action recognition, with an average recognition rate of 93.96%, higher than the Grassmann manifold and graph-based approaches at 88.38% and 91.31%, respectively. However, V-DBN performed relatively poorly in recognizing actions such as clap hands, sit down, and pull. Therefore, LSTM was used to optimize the recognition effect. In children’s football action recognition, the average recognition rate of LSTM+V-DBN was 0.981, while that of V-DBN was 0.892, so the optimized LSTM+V-DBN model recognized actions better. LSTM+V-DBN also significantly improved the recognition of actions such as dribble and pull. The proposed children’s football action recognition algorithm thus performed well and meets the development requirements of children’s football education. However, variations in how the same movement is performed in different scenarios were not considered in this study of human body movement, which leaves room for later improvement.

REFERENCES

Jaouedi N, Boujnah N, Bouhlel M S. A new hybrid deep learning model for human action recognition. Journal of King Saud University-Computer and Information Sciences, 2020, 32(4): 447-453.

Kong Y, Fu Y. Human action recognition and prediction: A survey. International Journal of Computer Vision, 2022, 130(5): 1366-1401.

Tu Z, Li H, Zhang D, Dauwels J, Li B. Action-stage emphasized spatiotemporal VLAD for video action recognition. IEEE Transactions on Image Processing, 2019, 28(6): 2799-2812.

Quaid M A K, Jalal A. Wearable sensors based human behavioral pattern recognition using statistical features and reweighted genetic algorithm. Multimedia Tools and Applications, 2020, 79(9): 6061-6083.

Wang T. Exploring intelligent image recognition technology of football robot using omnidirectional vision of internet of things. The Journal of Supercomputing, 2022, 78(8): 10501-10520.

Cuperman R, Jansen K M B, Ciszewski M G. An end-to-end deep learning pipeline for football activity recognition based on wearable acceleration sensors. Sensors, 2022, 22(4): 1347-1352.

Cai Y, Zhao T. Performance analysis of distance teaching classroom based on machine learning and virtual reality. Journal of Intelligent & Fuzzy Systems, 2021, 40(2): 2157-2167.

Li T, Sun J, Wang L. An intelligent optimization method of motion management system based on BP neural network. Neural Computing and Applications, 2021, 33(2): 707-722.

Inan T, Cavas L. Estimation of market values of football players through artificial neural network: a model study from the Turkish super league. Applied Artificial Intelligence, 2021, 35(13): 1022-1042.

Zhao K, Jiang W, Jin X, Xuming X. Artificial intelligence system based on the layout effect of both sides in volleyball matches. Journal of Intelligent & Fuzzy Systems, 2021, 40(2): 3075-3084.

Mahaseni B, Faizal E R M, Raj R G. Spotting football events using two-stream convolutional neural network and dilated recurrent neural network. IEEE Access, 2021, 9: 61929-61942.

Li J, Zhan W, Hu Y. Generic tracking and probabilistic prediction framework and its application in autonomous driving. IEEE Transactions on Intelligent Transportation Systems, 2019, 21(9): 3634-3649.

Ibrahim A W S. Augmented Reality Appling with Consistency of Behavior using Oriented Bounding Box Algorithm. Turkish Journal of Computer and Mathematics Education (TURCOMAT), 2021, 12(10): 2511-2517.

Behravan I, Razavi S M. A novel machine learning method for estimating football players’ value in the transfer market. Soft Computing, 2021, 25(3): 2499-2511.

Guangjing L, Cuiping Z. Research on static image recognition of sports based on machine learning. Journal of Intelligent & Fuzzy Systems, 2019, 37(5): 6205-6215.

Flah M, Nunez I, Ben Chaabene W, Nehdi M. Machine learning algorithms in civil structural health monitoring: a systematic review. Archives of Computational Methods in Engineering, 2021, 28(4): 2621-2643.

Herold M, Goes F, Nopp S, Meyer T. Machine learning in men’s professional football: Current applications and future directions for improving attacking play. International Journal of Sports Science & Coaching, 2019, 14(6): 798-817.

Lv Q. Simulation of football sport PID controller based on BP neural network. Journal of Intelligent & Fuzzy Systems, 2021, 40(4): 7483-7495.

Slama R, Wannous H, Daoudi M, Srivastava A. Accurate 3D action recognition using learning on the Grassmann manifold. Pattern Recognition, 2015, 48(2): 556-567.

Li M, Leung H. Graph-based approach for 3D human skeletal action recognition. Pattern Recognition Letters, 2016, 87: 195-202.

Author

Zhaosheng Chen

Zhaosheng Chen was born on March 5, 1989, and is a lecturer. He received a bachelor's degree in physical education from Quanzhou Normal University in 2011 and a master's degree in physical education from Fujian Normal University in 2017. He is a Ph.D. candidate in physical education at Krirk University, Thailand. He currently works at Yang-En University as a lecturer specializing in football and has published one academic article.

Na Chen

Na Chen was born on October 17, 1986, and is a lecturer. She received a bachelor's degree in Physical Education from Jimei University (September 2006 to July 2010) and, as an on-the-job graduate student, a master's degree in Physical Education and Training from Hubei University (March 2011 to December 2015). She currently works at Xiamen University of Technology as a lecturer in university physical education. She has published three academic articles and one software work, and serves as an editorial board member of the school-based textbook "College Physical Education and Health".
