
  1. (Department of Physical Education, Hunan International Economics University, Changsha, 410205, China)
  2. (School of Arts and Media, Qingdao Hengxing University of Science and Technology, Qingdao, 266000, China)



YOLOv3, basketball, gesture recognition, enhancement techniques

1. Introduction

Today, as technology rapidly advances, artificial intelligence is transforming various industries, and the sports field is no exception. Basketball is popular all over the world, and scientific, precise training and competition have become important means of improving athletes' performance [1]. Key gesture recognition, one of the key applications of AI technology in sports, is of great significance for the in-depth analysis of athletes' technical movements, the evaluation of training effects, and the guidance of tactical arrangements.

As an advanced algorithm in current target detection, YOLOv3 stands out in many application scenarios with its high accuracy and fast speed. However, in the complex basketball environment, YOLOv3 still faces challenges such as lighting changes, occlusion, and the diversity of movements among different athletes, all of which may reduce the accuracy of key gesture recognition. How to improve the YOLOv3 algorithm so that it better adapts to basketball scenes has therefore become an important research topic.

The purpose of this study is to propose a key gesture recognition method for basketball based on an improved YOLOv3 algorithm [2, 3]. First, we address shortcomings of the original algorithm by optimizing the feature extraction network and enhancing the model's adaptability to complex backgrounds. Second, spatiotemporal information is introduced, and sequence data analysis is used to improve the recognition accuracy of continuous movements. Third, a basketball database is created for training and testing the enhanced model. Experimental verification shows that the improved YOLOv3 algorithm significantly enhances basketball posture recognition accuracy while maintaining high real-time performance [4, 5]. This provides basketball coaches with more scientific and objective evaluation tools and helps them guide athletes' technical training and tactical drills more effectively. In addition, this study explores how to apply key gesture recognition in actual training environments, integrating VR and AR for more immersive training. Beyond providing new technical support for basketball training and competition, the work offers reference and inspiration for intelligent training and analysis in other sports.

2. Object Detection and Human Pose Estimation

2.1. Moving Object Detection

Moving object detection is a core task of video analysis and is crucial for identifying and tracking dynamic objects in images. In this paper, we focus on moving object detection, especially pedestrian detection in complex scenes. We examine the two major categories of mainstream object detection methods in detail: traditional hand-crafted feature methods and emerging deep learning approaches.

Although traditional methods based on hand-crafted features perform well in scenes with obvious features and simple backgrounds, they rely on manually designed feature extraction and struggle in complex, changeable scenes [6, 7]. They are effective for static backgrounds and slow-moving targets, but inadequate for dynamic backgrounds and fast-moving targets.

Deep learning has led to advanced object detection methods that enhance accuracy and robustness by learning high-level features. These methods primarily fall into two categories: one-stage and two-stage.

Two-stage detectors such as R-CNN, SPP-Net, Fast R-CNN, and Faster R-CNN generate candidate regions before extracting features and classifying them. R-CNN uses selective search to propose candidate regions, extracts features with a CNN, and classifies them with an SVM. SPP-Net builds on this with spatial pyramid pooling to cut computation, while Fast R-CNN and Faster R-CNN refine the architecture for faster detection [8, 9]. In contrast, one-stage algorithms such as YOLO and SSD use CNNs to predict category and position simultaneously, avoiding the candidate region generation step of the two-stage methods. YOLO divides the image into grids and classifies and locates the objects within each grid. SSD combines the rapid detection of YOLO with the multi-scale detection capability of RPN-style anchors, which boost detection of targets at multiple scales.

2.2. YOLOv3 Target Recognition

In our approach to moving object detection, we leverage SNN-YOLOv3 (Spiking Neural Network-YOLOv3), an innovative variant of the YOLOv3 framework that integrates the efficiency of YOLOv3 with the event-driven dynamics of spiking neural networks, offering better performance and adaptability. The input image is first standardized to a fixed dimension of 416x416 pixels, so that the network can handle images of various sizes while maintaining detection accuracy and processing efficiency. Because spiking neural networks process information more efficiently than traditional neural networks, SNN-YOLOv3 detects moving objects with high precision at reduced computational cost. A convolutional neural network (CNN) then extracts the key feature information from the image, and the target is classified and located based on these features [10, 11].

In the feature extraction stage, the CNN extracts key information from raw images, which is crucial for distinguishing different targets and determining their locations. The network then generates multiple bounding boxes, each corresponding to a potential target, and calculates a Confidence Score for each box (Eq. (1)), which represents the probability that the box contains an object and how well it fits. To improve detection accuracy, we use the Non-Maximum Suppression (NMS) algorithm, which removes duplicate detection boxes and retains the best results: it keeps the box with the highest confidence and removes other boxes whose overlap with it is too high.

In the YOLOv3 algorithm, each grid cell predicts 3 differently sized bounding boxes at each of the three detection scales, so 9 anchor box shapes are used in total. The sizes and proportions of these anchors are obtained by clustering the ground-truth boxes in the training data, allowing them to adapt to targets of different shapes and sizes. In this way, YOLOv3 achieves good detection performance while maintaining a high detection speed.

(1)
$ Confidence = Pr(Object) \times IOU_{pred}^{truth}. $

In object detection, $Pr(Object)$ is the probability that an object exists within a bounding box, and IoU (Intersection over Union) is a key metric measuring the overlap between a predicted bounding box and a ground-truth box.
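As a concrete illustration, the confidence of Eq. (1) can be computed from a box's objectness probability and its IoU with the ground-truth box. The sketch below is ours, not the paper's implementation; the corner-format boxes and function names are illustrative conventions:

```python
def iou(box_a, box_b):
    # Boxes as (x1, y1, x2, y2) corners; returns intersection-over-union in [0, 1].
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def confidence(pr_object, box_pred, box_truth):
    # Eq. (1): Confidence = Pr(Object) * IoU(pred, truth).
    return pr_object * iou(box_pred, box_truth)
```

For two unit-offset 2x2 boxes, the intersection is 1 and the union is 7, so the IoU is 1/7; a perfectly aligned prediction keeps the full objectness probability as its confidence.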

Within our enhanced YOLOv3 algorithm, a multi-scale feature extractor first processes the input image, producing three feature maps of $13 \times 13$, $26 \times 26$, and $52 \times 52$ cells. These maps capture features at different granularities, enriching the algorithm's ability to discern objects of different sizes [12].

Following this, the feature maps are fused and a large array of candidate bounding boxes is generated. The total number of boxes is $(13 \times 13 + 26 \times 26 + 52 \times 52) \times 3 = 10{,}647$, ensuring broad coverage of possible object placements within the image.
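The candidate-box count above can be verified directly from the three grid sizes and the three anchors per cell:

```python
# Candidate boxes across the three YOLOv3 detection scales,
# with three anchor boxes predicted per grid cell.
grids = [13, 26, 52]
boxes_per_cell = 3
total_boxes = sum(g * g for g in grids) * boxes_per_cell
print(total_boxes)  # 10647
```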

Subsequently, the tNMS (tailored Non-Maximum Suppression) mechanism selects the most probable bounding box [13]: the box with the highest confidence score that surpasses the predefined threshold, effectively distinguishing the most likely object from the candidates.

Moreover, our enhancement introduces significant performance gains over the original YOLOv3 model. Through meticulous testing and comparisons, we've quantified these improvements, showcasing our method's superior detection accuracy and efficiency.

The NMS algorithm first sorts the prediction boxes by confidence from high to low and then checks them one by one. For each box, if its IoU with a retained box exceeds a preset threshold, the box is discarded; otherwise it is retained. The process repeats until all boxes are handled, and the retained boxes are the final results. Since this study focuses on detecting pedestrians, we limit the recognized category to "person" and ignore other object categories when applying YOLOv3. This improves the focus of the detection algorithm, reduces misjudgment, and speeds up processing, so that the moving target can be effectively identified in the image.
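The greedy procedure just described can be sketched as follows. This is a minimal class-agnostic NMS of our own, not the exact implementation in the paper; the box format and the 0.5 threshold are illustrative:

```python
def _iou(a, b):
    # IoU of two (x1, y1, x2, y2) boxes.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def nms(detections, iou_thresh=0.5):
    # detections: list of (box, score). Sort by score, keep the best box,
    # and discard any box that overlaps an already-kept box too much.
    dets = sorted(detections, key=lambda d: d[1], reverse=True)
    kept = []
    for box, score in dets:
        if all(_iou(box, kb) <= iou_thresh for kb, _ in kept):
            kept.append((box, score))
    return kept
```

With two heavily overlapping detections and one distant one, the lower-scoring duplicate is suppressed while the distant box survives.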

3. Human Pose Estimation

3.1. Typical Human Pose Estimation Model

Traditional pose estimation uses the pictorial structure (PS) model, which decomposes an object into components and models the spatial relationships between them. The PS model mainly comprises the graph model, component appearance descriptions, and graph inference techniques [14, 15]. Human component detection has been improved by cascading and advanced segmentation contours; adopting a general detector with a weak pose model reduces the search space, and confidence propagation algorithms improve accuracy [16, 17]. Spatial relationship modeling has been enhanced by multi-stage classifiers and cascaded prediction of joint positions, while particle-based inference and FAUST mesh alignment improve computational efficiency and realism. However, traditional methods rely heavily on hand-designed templates, and it is difficult for them to accurately estimate complex or multi-person poses [18].

This paper adopts DeepPose, which uses a CNN to extract human pose features from images and estimates the pose through coordinate regression [19, 20]. The calculation model is shown in Fig. 1. Two cascaded networks regress body key-point heatmaps to enhance model robustness and accuracy.

Fig. 1. Computational model.

../../Resources/ieie/IEIESPC.2026.15.1.55/fig1.png

Multi-person pose estimation methods fall into two types: top-down and bottom-up. Top-down methods detect people first and then locate their key points; for example, Fast R-CNN identifies the human bodies, a CPM network estimates the posture of each person, and integer linear programming removes outliers. The cascaded pyramid network (CPN), combining GlobalNet and RefineNet, enhances key-point detection accuracy, and a graph model with multi-peak prediction addresses multi-person pose estimation in crowded scenes. Bottom-up methods detect all key points in the image first and then combine them into poses with a graph-structure algorithm; for example, Fast R-CNN detects and labels key points, and integer linear programming clusters the connections to estimate multi-person poses. The PersonLab model predicts displacements and heat maps and realizes instance-level person segmentation [21, 22].

3.2. Human Pose Estimation

YOLOv3 first detects the human body region, which is fed into an SSTN (combining STN, SPPE, and SDTN) to generate candidate poses. During training, the STN network is optimized, a parallel SPPE branch improves efficiency, and PGPG is added to increase sample diversity and model generalization [23, 24]. Finally, the Pose-NMS module retains the highest-confidence pose and removes redundancy. The pose estimation algorithm model is shown in Fig. 2.

Fig. 2. Pose estimation algorithm model.

../../Resources/ieie/IEIESPC.2026.15.1.55/fig2.png

The STN module is a key component of pose recognition and contains three core parts: the STN, which optimizes candidate frames; the SPPE, which performs pose estimation; and the SDTN, which maps coordinates back. The STN locates the dominant region of the human body and centers the pose through the parallel SPPE to improve extraction accuracy. The SPPE uses heat maps to predict key points and reduce positioning errors. The SDTN reverses the STN operation and restores the estimated pose to the original image coordinate system. The overall pipeline efficiently integrates pose positioning and estimation and improves recognition accuracy [25].

In pose estimation, inaccurate target positioning leads to redundant poses. To solve this problem, P-NMS (Pose NMS) was proposed. The algorithm selects the pose that best fits the target person from all pre-selected poses. The basic principle of P-NMS is similar to NMS (Non-Maximum Suppression) [26, 27]: high-confidence poses are filtered, the one with the highest confidence is selected as the reference, the distance between each remaining pose and the reference is calculated, poses closer than a threshold are deleted, and the process repeats until no redundancy remains. The advantage of P-NMS is that it reduces redundant poses effectively and improves the accuracy of pose estimation, and it has become a widely used pose selection algorithm in practice. The deletion criterion is defined in Eq. (2):

(2)
$ f(G_i, G_j | \Lambda, \gamma) = 1, \quad (d(G_i, G_j | \Lambda, \gamma) \le \gamma). $

In Eq. (2), $f$ is the elimination criterion: pose $G_j$ is judged redundant with respect to the reference pose $G_i$ when the distance $d(G_i, G_j | \Lambda, \gamma)$ does not exceed the threshold $\gamma$ [27, 28]. $G$ denotes the set of candidate poses that the algorithm preliminarily identifies in the image, and $d$ is the distance metric that quantifies the relationship between two poses. The distance is calculated as shown in Eq. (3):

(3)
$ d(G_i, G_j | \Lambda) = K_{sim}(G_i, G_j | \sigma_1) + \lambda H_{sim}(G_i, G_j | \sigma_2). $

In Eq. (3), $K_{sim}$ measures the similarity of the key-point confidence scores of the two poses, $H_{sim}$ measures their spatial similarity, $\lambda$ balances the two terms, and $\sigma_1$, $\sigma_2$ are scale parameters. The confidence similarity is shown in Eq. (4):

(4)
$ K_{sim}(G_i, G_j | \sigma_1) = \begin{cases} \sum_n \tanh \frac{c_i^n}{\sigma_1} \tanh \frac{c_j^n}{\sigma_1}, & \text{if } k_j^n \text{ is within } B(k_i^n), \\ 0, & \text{otherwise.} \end{cases} $

In Eq. (4), $k_i^n$ denotes the location of the $n$-th key point of pose $G_i$, $c_i^n$ its confidence score, and $B(k_i^n)$ a box region centered at $k_i^n$; the term for key point $n$ contributes only when $k_j^n$ falls within $B(k_i^n)$. The spatial similarity of the two poses is shown in formula (5).

(5)
$ H_{sim}(G_i, G_j | \sigma_2) = \sum_n \exp \left[ - \frac{(k_i^n - k_j^n)^2}{\sigma_2^2} \right]. $
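As an illustrative sketch, the distance of Eqs. (3)-(5) can be accumulated per key-point pair. The parameter values below ($\sigma_1$, $\sigma_2$, $\lambda$, the radius of $B$, and the threshold $\gamma$) are placeholders of our own, since the paper does not state the values used:

```python
import math

def pose_distance(conf_i, conf_j, kps_i, kps_j,
                  sigma1=0.3, sigma2=2.0, lam=1.0, box_radius=1.0):
    # Eq. (3): d = K_sim + lambda * H_sim, summed over the n key points.
    k_sim, h_sim = 0.0, 0.0
    for n in range(len(kps_i)):
        dx = kps_i[n][0] - kps_j[n][0]
        dy = kps_i[n][1] - kps_j[n][1]
        dist2 = dx * dx + dy * dy
        # Eq. (4): the confidence term counts only if k_j^n lies in B(k_i^n).
        if dist2 <= box_radius ** 2:
            k_sim += math.tanh(conf_i[n] / sigma1) * math.tanh(conf_j[n] / sigma1)
        # Eq. (5): spatial similarity of the key-point pair.
        h_sim += math.exp(-dist2 / sigma2 ** 2)
    return k_sim + lam * h_sim

def eliminate(d, gamma):
    # Eq. (2): f = 1 (pose eliminated) when d(G_i, G_j) <= gamma.
    return d <= gamma
```

For identical poses the distance is maximal (both terms fire for every key point), while for well-separated poses both terms vanish.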

3.3. Evaluation Index

The model evaluation covers accuracy, precision, recall, F-score, missed alarm rate, and false alarm rate, comparing predictions with ground-truth labels to reflect classification performance. Accuracy measures the fraction of correct classifications; precision is the fraction of true positives among predicted positives; recall is the fraction of actual positives that are detected; the F-score is the harmonic mean of precision and recall; the missed alarm rate ($MA$) is the chance of missing a positive; and the false alarm rate ($FA$) is the chance that a predicted positive is actually negative. The formulas are given in Eqs. (6)-(11).

(6)
$ Accuracy = \frac{TP + TN}{TP + TN + FN + FP}, $
(7)
$ Precision = \frac{TP}{TP + FP}, $
(8)
$ Recall = \frac{TP}{TP + FN}, $
(9)
$ F = \frac{2 \times P \times R}{P + R}, $
(10)
$ MA = 1 - R, $
(11)
$ FA = 1 - P. $

Among the above formula, $TP$ (True Positives) refers to the number of samples that are actually positive and correctly identified by the model as positive classes. $TN$ (True Negatives) refers to the number of samples that are actually negative and correctly identified by the model as negatives. $FP$ (False Positives) is a false positive example, that is, the number of samples that are actually in a negative class but are incorrectly classified as positive by the model. $FN$ (False negatives) are false negatives, which refer to the number of samples that are actually positive but are incorrectly identified as negative by the model [29, 30]. $P$ usually stands for $Precision$ while $R$ stands for $Recall$.
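Eqs. (6)-(11) follow directly from these four counts. A minimal sketch (the function name is our own):

```python
def classification_metrics(tp, tn, fp, fn):
    # Eqs. (6)-(11): accuracy, precision, recall, F-score,
    # missed alarm rate (MA = 1 - R) and false alarm rate (FA = 1 - P).
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "f": f_score, "ma": 1 - recall, "fa": 1 - precision}
```

For example, with TP=8, TN=5, FP=2, FN=1, accuracy is 13/16 and precision is 8/10.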

4. Key Posture Detection in Sports Basketball

4.1. Spatial Feature Extraction

In human pose estimation, researchers comprehensively consider a variety of features, including color, texture, and shape, to identify and distinguish individuals more accurately. To capture subtle changes and details, local information such as corner and edge points is also widely used. These local features not only represent the overall characteristics of the image but also significantly reduce computation and boost processing speed.

In sports scenes, due to the instability of the basketball environment, traditional feature points may not provide stable and reliable tracking information. We therefore chose stable and distinctive corner features as key points. Corner features have clear trajectories in the video, which facilitates cross-frame tracking, effectively assists abnormal behavior detection, and enhances motion safety. In this paper, the SIFT algorithm is used to extract corner points from images; it captures key feature points while remaining invariant to changes in image scale, rotation, and viewing angle, and supports real-time tracking. Fig. 3 displays the mean and variance of the inner layers' parameters of the DINO-DETR model.

Fig. 3. Mean and variance of each layer within the DINO-DETR model.

../../Resources/ieie/IEIESPC.2026.15.1.55/fig3.png

4.1.1 Extreme point detection in scale space

The SIFT algorithm follows three key steps when detecting key points in images. First, the scale space is constructed: a series of Gaussian images at different scales is generated by multi-scale sampling and Gaussian blurring of the original image, providing rich information for subsequent feature detection. Then, the DoG (difference-of-Gaussians) pyramid is constructed by differencing the images at adjacent scales, which highlights key points and edge information and provides the basis for key-point detection. Finally, key points are detected by comparing each point with its neighbors in the same and adjacent scale spaces to find extreme points; these extrema represent key points such as corners and edges and are the basis of the subsequent feature description. The Gaussian kernel is the only linear kernel that can generate a scale space. The scale space of the image is constructed by convolving Gaussian kernels of different scales (Eq. (12)) with the image, as shown in Eq. (13).

(12)
$ G(a, b, \sigma) = \frac{1}{2\pi\sigma^2} e^{-\frac{(a^2 + b^2)}{2\sigma^2}}, $
(13)
$ L(a, b, \sigma) = G(a, b, \sigma) * I(a, b). $

Here, $a$ and $b$ are the pixel coordinates and $\sigma$ is the scale parameter that controls the width of the Gaussian. The difference operation is performed on the Gaussian pyramid to construct the difference-of-Gaussians pyramid. When detecting extreme points, the scale factor $k$ is considered, and the Gaussian difference operator is defined by formula (14).

(14)
$ \begin{aligned} D(a, b, \sigma) &= [G(a, b, k\sigma) - G(a, b, \sigma)] * I(a, b) \\ &= L(a, b, k\sigma) - L(a, b, \sigma). \end{aligned} $
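Eqs. (12)-(14) can be sketched with a separable Gaussian blur. This is a minimal NumPy illustration of ours; the kernel radius of $3\sigma$ and the scale factor $k = \sqrt{2}$ are conventional choices, not values stated in the paper:

```python
import numpy as np

def gaussian_kernel_1d(sigma, radius=None):
    # Eq. (12) sampled on a 1-D grid; the 2-D Gaussian is separable,
    # so blurring rows then columns is equivalent to a 2-D convolution.
    radius = radius or int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x ** 2 / (2 * sigma ** 2))
    return k / k.sum()

def gaussian_blur(img, sigma):
    # Eq. (13): L(a, b, sigma) = G(a, b, sigma) * I(a, b).
    k = gaussian_kernel_1d(sigma)
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, img)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, rows)

def dog(img, sigma, k=2 ** 0.5):
    # Eq. (14): D(a, b, sigma) = L(a, b, k*sigma) - L(a, b, sigma).
    return gaussian_blur(img, k * sigma) - gaussian_blur(img, sigma)
```

On a constant image the DoG response is zero away from the borders, as expected: a normalized blur leaves a flat region unchanged at every scale.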

4.1.2 Precise location of extreme points

The computed feature points include position, scale, and orientation. By copying feature points and considering their multiple directions, point sets with the same coordinates and scales but different directions are generated, and the main directions of these point sets are determined. Fig. 4 depicts a single bit flip error in the encoder/decoder block's linear layer. The gradient modulus and orientation of each key point in the scale space are calculated using Eqs. (15)-(16), where $L$ is the Gaussian-smoothed image of Eq. (13), and $m$ and $\theta$ are the gradient modulus and orientation.

Fig. 4. Single-bit flip error injected in the linear layer within the encoder/decoder block of the model.

../../Resources/ieie/IEIESPC.2026.15.1.55/fig4.png
(15)
$ m(a, b) = \sqrt{[L(a+1, b) - L(a-1, b)]^2 + [L(a, b+1) - L(a, b-1)]^2}, $
(16)
$ \theta(a, b) = \tan^{-1} \frac{L(a, b+1) - L(a, b-1)}{L(a+1, b) - L(a-1, b)}. $
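Eqs. (15)-(16) amount to central differences on the smoothed image. A minimal sketch of ours, using `atan2` to resolve the quadrant that the single-quadrant arctangent of Eq. (16) leaves ambiguous:

```python
import math

def keypoint_gradient(L, a, b):
    # Eqs. (15)-(16): gradient modulus m and orientation theta at (a, b),
    # from central differences on the Gaussian-smoothed image L.
    dx = L[a + 1][b] - L[a - 1][b]
    dy = L[a][b + 1] - L[a][b - 1]
    m = math.sqrt(dx * dx + dy * dy)
    theta = math.atan2(dy, dx)  # quadrant-aware form of Eq. (16)
    return m, theta
```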

4.2. Temporal Feature Extraction

Optical flow detects the motion of objects in the field of view and describes the visual appearance of motion changes in the scene. According to Assumption 1 (brightness constancy), formula (17) holds, where $x$ and $y$ are the coordinates of the object's position in two-dimensional space and $t$ is time:

(17)
$ I(x, y, t) = I(x + \Delta x, y + \Delta y, t + \Delta t). $

Based on Assumption 2 (small motion), formula (18) is obtained using the Taylor series expansion; under small movements the higher-order terms are ignored, simplifying to Eq. (19). Here $I$ is the image intensity, $x$ and $y$ are the spatial coordinates about which the expansion is taken, $t$ is time, and $\varepsilon$ collects the higher-order remainder terms of the expansion.

(18)
$ \begin{aligned} &I(x + \Delta x, y + \Delta y, t + \Delta t) \\ &= I(x, y, t) + \frac{\partial I}{\partial x} \Delta x + \frac{\partial I}{\partial y} \Delta y + \frac{\partial I}{\partial t} \Delta t + \varepsilon, \end{aligned} $
(19)
$ \frac{\partial I}{\partial x} \Delta x + \frac{\partial I}{\partial y} \Delta y + \frac{\partial I}{\partial t} \Delta t = 0. $

The final optical flow constraint is Eq. (20), where $I_x$, $I_y$, and $I_t$ are the partial derivatives of the image intensity with respect to $x$, $y$, and $t$, and $(v_x, v_y)$ is the optical flow vector, the velocity at which the image point moves over time:

(20)
$ I_x v_x + I_y v_y + I_t = 0. $

If the pixels in a neighborhood move consistently between the two images, the constraints can be stacked in matrix form, see formula (21).

(21)
$ \begin{bmatrix} I_{x1} & I_{y1} \\ I_{x2} & I_{y2} \\ \vdots & \vdots \\ I_{xn} & I_{yn} \end{bmatrix} \begin{bmatrix} v_x \\ v_y \end{bmatrix} = - \begin{bmatrix} I_{t1} \\ I_{t2} \\ \vdots \\ I_{tn} \end{bmatrix}. $
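Eq. (21) is an overdetermined linear system in $(v_x, v_y)$ and can be solved by least squares, as in the Lucas-Kanade method. A minimal NumPy sketch of ours; the choice of solver is an assumption, since the paper does not specify one:

```python
import numpy as np

def solve_flow(Ix, Iy, It):
    # Eq. (21): stack the per-pixel constraints A v = -b and solve for
    # v = (v_x, v_y) by least squares (overdetermined for n > 2 pixels).
    A = np.column_stack([Ix, Iy])
    b = -np.asarray(It, dtype=float)
    v, *_ = np.linalg.lstsq(A, b, rcond=None)
    return v  # (v_x, v_y)
```

With gradients generated from a known motion of (1, -2), the solver recovers that vector exactly.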

5. Experimental Results and Analysis

5.1. Experimental Data

In this paper, training and evaluation are performed on the FLIC and MPII Human Pose datasets. There is often more than one person in an image, so we train only on the person at the exact center of the image. The target person is cropped to the center, the input image is resized to 256x256, and data augmentation is performed by rotating the image (±30 degrees) and scaling it (0.75-1.25).

In video frame processing, we first use the AlphaPose algorithm to detect the key points of the human skeleton in each frame and save this key-point information. We then filter the image set for accurate and consistent detection, checking whether each frame has the correct number of key points for the number of people actually present during filming. If the count is exceeded, the detection result of that frame is considered erroneous and the frame is deleted. There are 300 test video clips, divided into two categories, normal motion and conflict behavior, with 150 clips each; each clip is about 5 seconds long at a frame rate of 20 frames/second.

5.2. Results and Analysis

In this experiment, we used 1080 images for model testing, 360 images for each of three different poses. Fig. 5 shows the vulnerability comparison. Analysis of the test results shows that the proposed recognition model performs well overall, with an average accuracy of 95.56%. Every movement behavior is recognized with an accuracy above 94%, showing high consistency and accuracy across behaviors.

Fig. 5. Comparison of vulnerability.

../../Resources/ieie/IEIESPC.2026.15.1.55/fig5.png

Fig. 6 shows the initialization and training network data analysis. When further analyzing the misidentified image frames, we found that there are two main problems. Firstly, some image frames are partially occluded due to abnormal behavior and posture, which makes the camera unable to fully capture the key information, thus affecting the recognition effect of the model. Secondly, in some cases, poor lighting conditions may cause the camera to be unable to accurately recognize the image content, which in turn affects the extraction of key feature points of human posture and leads to incorrect recognition.

Fig. 6. Initialization and training network data analysis.

../../Resources/ieie/IEIESPC.2026.15.1.55/fig6.png

Fig. 7 shows the similarity of the YOLOv3 layers. The algorithm analyzes each video frame and labels it "safe" or "unsafe". Safe status indicates that the athletes' behavior is normal, while unsafe status indicates an abnormality that needs attention. 540 frames of self-made video were used for evaluation, including 366 normal frames and 174 abnormal frames. Table 1 displays the evaluation results.

Fig. 7. CKA similarity between all layers U-Real layers in YOLOv3.

../../Resources/ieie/IEIESPC.2026.15.1.55/fig7.png

Table 1. Results of algorithm behavior detection and evaluation.

Evaluation index | Epoch | Result
Accuracy | 100 | 95.12%
Precision | 100 | 91.24%
Recall | 100 | 99.50%
F1 score | 100 | 93.24%
Missed alarm rate | 100 | 1.15%
False alarm rate | 100 | 6.03%

To evaluate the proposed spatiotemporal feature point-based abnormal behavior detection algorithm, we employ an experimental data set consisting of 5 video segments and 5 simulated video segments containing conflicting behaviors. Fig. 8 shows the normalized energy comparison. By calculating the average displacement of feature points on key frames and plotting the displacement change curve, we observe that normal and abnormal motion behaviors show clear separation points on the curve. This shows that the proposed algorithm is effective at identifying and distinguishing normal and abnormal behaviors.

Fig. 8. Normalized energy comparison.

../../Resources/ieie/IEIESPC.2026.15.1.55/fig8.png

Fig. 9 shows the analysis of NMS execution time and quantity. The experiment sets the kinetic-energy change threshold for motion to 4000. Observing the kinetic energy change curve of the image frames, we find that the average kinetic energy is significantly above the threshold in the interval from frame 30 to frame 110, and below the threshold before and after this interval. This shows that by analyzing the kinetic energy changes of image frames, we can accurately judge whether abnormal motion behavior is present.

Fig. 9. NMS execution time and quantity.

../../Resources/ieie/IEIESPC.2026.15.1.55/fig9.png

Fig. 10 shows the normalized RGB value analysis. Combining the average displacement of feature points with the kinetic energy change, the proposed athlete abnormal behavior detection algorithm is evaluated. The experimental data set contains 150 video clips of safe movement behavior and 150 video clips of abnormal movement behavior, each lasting about 5 seconds. The evaluation demonstrates the system's high accuracy in detecting athletes' abnormal behaviors.

Fig. 10. Normalized RGB value analysis.

../../Resources/ieie/IEIESPC.2026.15.1.55/fig10.png

According to the system test results, the accuracy of safe motion and abnormal motion detection is high, but it only gives an alarm when abnormal behavior is detected, which leads to a certain false alarm rate. In order to deeply analyze the causes of false alarms, we selected 15 safe motion video clips and 15 abnormal motion video clips from the test video set, which consisted of normal and abnormal key image frames. Through the analysis of these image frames, we found some possible causes of false alarms.

Fig. 11 is an experimental time histogram analysis, which details the system's performance on various motion recognition tasks. Specifically, the recognition accuracy of safe movements is as high as 97.04%, showing that the system can accurately distinguish normal, safe athlete behaviors in most cases. The recognition accuracy of abnormal movements reaches 95.03%, showing that the system is highly sensitive to abnormal behavior.

Fig. 11. Time histogram.

../../Resources/ieie/IEIESPC.2026.15.1.55/fig11.png

However, although the overall accuracy is high, the system still has a certain misjudgment rate: the sum of the probability of erroneously identifying a safe-motion frame as abnormal motion and the probability of erroneously identifying an abnormal-motion frame as safe. Through in-depth analysis of these misjudged frames, we found that the main causes include the following:

Because the video data was captured with the shooting angle limited to the front lower-right of the athletes, this viewing-angle limitation may prevent the system from fully capturing the athletes' movements in some cases. Especially when an athlete suddenly makes a large movement, the body may temporarily block the camera, making it impossible for the system to accurately capture and analyze the details of the movement and causing it to be misjudged as fighting or other abnormal behavior.

In the video, when two people perform large-scale actions at the same place and time for an extended period, the system may also misjudge. This is because the system may struggle to distinguish such coordinated action from real fighting behavior, especially when the action is complex, fast, and difficult to predict.

To reduce the misjudgment rate and improve the overall performance of the system, the following measures can be considered: first, optimize the shooting angle and position of the camera to ensure that the athletes' movements are comprehensively and clearly captured; second, introduce advanced image processing technologies and algorithms to enhance the system's ability to recognize complex actions and scenes; third, strengthen the monitoring and analysis of system misjudgments so that problems are discovered and corrected in a timely manner.

6. Conclusion

In this study, we proposed a key gesture recognition method for basketball based on the improved YOLOv3 algorithm and conducted an in-depth evaluation of its practical effect. By constructing a large-scale dataset containing multiple basketball actions, we rigorously trained and tested the improved algorithm. The experiments indicate a 10% average increase in key-pose recognition accuracy over the original YOLOv3 algorithm, reaching an accuracy of 92%. When processing real-time video streams, the average per-frame processing time is shortened by 20%, which greatly enhances the system's real-time performance.
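The reported 20% reduction in per-frame processing time translates directly into throughput. As a back-of-the-envelope check (the 40 ms baseline is an assumed example, not a figure from the paper; only the 20% reduction comes from the experiments):

```python
# Back-of-the-envelope throughput check. The 40 ms baseline per-frame
# time is a hypothetical assumption; only the 20% reduction is taken
# from the reported results.

baseline_ms = 40.0                 # assumed original per-frame time
improved_ms = baseline_ms * 0.8    # 20% shorter per the experiments

baseline_fps = 1000.0 / baseline_ms
improved_fps = 1000.0 / improved_ms

print(f"{baseline_fps:.1f} FPS -> {improved_fps:.1f} FPS")  # 25.0 FPS -> 31.2 FPS
```

In general, a fixed percentage cut in per-frame latency yields a 1/(1 - 0.2) = 25% increase in frames per second, whatever the baseline.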

The key gesture recognition technology for basketball based on the improved YOLOv3 algorithm shows broad application prospects in many respects. First, with ongoing advancements in deep learning, we expect the accuracy of the algorithm to improve further, providing coaches with more precise technical analysis. Second, with the spread of the Internet of Things and wearable devices, key gesture recognition technologies can be expected to find wide use in personal health monitoring and remote training. For example, by analyzing athletes' sports data, their physical state can be monitored in real time, providing a scientific basis for preventing sports injuries. In addition, we anticipate integrating key gesture recognition with VR and AR to enhance athletes' training environments: by simulating real competition scenarios, athletes can train in a virtual environment and improve their competitive adaptability. Finally, with enhanced computing power and optimized algorithms, we believe key gesture recognition technology will be used more widely in future sports events, providing referees with decision-making support and improving the fairness and professionalism of competitions.

References

[1] Liu B., He F., Du S., Li J., Liu W., 2023, An advanced YOLOv3 method for small object detection, Journal of Intelligent & Fuzzy Systems, Vol. 45, No. 4, pp. 5807-5819.
[2] Bai H., Zhang T., Lu C., Chen W., Xu F., Han Z.-B., 2020, Chromosome extraction based on U-Net and YOLOv3, IEEE Access, Vol. 8, pp. 178563-178569.
[3] Li D.-Y., Wang G.-F., Zhang Y., Wang S., 2022, Coal gangue detection and recognition algorithm based on deformable convolution YOLOv3, IET Image Processing, Vol. 16, No. 1, pp. 134-144.
[4] Chen J., Li X., Zhang Y., Wang H., Li Z., 2021, An improved YOLOv3 based on dual path network for cherry tomatoes detection, Journal of Food Process Engineering, Vol. 44, No. 10.
[5] Ding H., Wang H., Wang K., 2022, Improved YOLOv3 flame detection algorithm based on dynamic shape feature extraction and enhancement, Laser & Optoelectronics Progress, Vol. 59, No. 24.
[6] Shi T., Liu M., Niu Y., Yang Y., Huang Y., 2020, Underwater targets detection and classification in complex scenes based on an improved YOLOv3 algorithm, Journal of Electronic Imaging, Vol. 29, No. 4.
[7] Altamirano S. F. S., Pérez J. A., Pacheco D. L., Vásquez M. A., 2023, Utilizing image processing and the YOLOv3 network for real-time traffic light control, Journal of Engineering, Vol. 2023.
[8] Wan J., Wang H., Yang X., Wang Y., Wang Z., 2021, An efficient small traffic sign detection method based on YOLOv3, Journal of Signal Processing Systems, Vol. 93, No. 8, pp. 899-911.
[9] Fang M.-T., Chen Z.-J., Przystupa K., Li T., Majka M., Kochan O., 2021, Examination of abnormal behavior detection based on improved YOLOv3, Electronics, Vol. 10, No. 2.
[10] Taheri Tajar A., Ramazani A., Mansoorizadeh M., 2021, A lightweight Tiny-YOLOv3 vehicle detection approach, Journal of Real-Time Image Processing, Vol. 18, No. 6, pp. 2389-2401.
[11] Zhang G., Chen X., Zhao Y., Wang J., Yi G., 2022, Lightweight YOLOv3 algorithm for small object detection, Laser & Optoelectronics Progress, Vol. 59, No. 16.
[12] Zhang T., Li J., Jiang Y., Zeng M., Pang M., 2022, Position detection of doors and windows based on DSPP-YOLO, Applied Sciences, Vol. 12, No. 21.
[13] Dahan F., El Hindi K., Ghoneim A., Alsalman H., 2021, An enhanced ant colony optimization based algorithm to solve QoS-aware web service composition, IEEE Access, Vol. 9, pp. 34098-34111.
[14] Liu Y., Zhang H., Wang J., Li Z., Sun Y., 2020, Research on automatic location and recognition of insulators in substation based on YOLOv3, High Voltage, Vol. 5, No. 1, pp. 62-68.
[15] Zheng Z., Zhao J., Li Y., 2021, Research on detecting bearing-cover defects based on improved YOLOv3, IEEE Access, Vol. 9, pp. 10304-10315.
[16] Zhang M., Liang H., Wang Z., Wang L., Huang C., Luo X., 2024, Damaged apple detection with a hybrid YOLOv3 algorithm, Information Processing in Agriculture, Vol. 11, No. 2, pp. 163-171.
[17] Wang X., Wang S., Cao J., Wang Y., 2020, Data-driven based Tiny-YOLOv3 method for front vehicle detection inducing SPP-Net, IEEE Access, Vol. 8, pp. 110227-110236.
[18] Shao Y., Zhang X., Chu H., Zhang X., Zhang D., Rao Y., 2022, AIR-YOLOv3: aerial infrared pedestrian detection via an improved YOLOv3 with network pruning, Applied Sciences, Vol. 12, No. 7.
[19] Wei T., Liu H., Tian H., Wang X., 2023, A visual positioning method for wedge support robot based on Pruned-Ghost-YOLOv3, Proceedings of the Institution of Mechanical Engineers, Part C: Journal of Mechanical Engineering Science, Vol. 237.
[20] He B., Qian S., Niu Y., 2024, Visual recognition and location algorithm based on optimized YOLOv3 detector and RGB depth camera, The Visual Computer, Vol. 40, No. 3, pp. 1965-1981.
[21] Yang Z., Xu Z., Wang Y., 2022, Bidirection-fusion-YOLOv3: an improved method for insulator defect detection using UAV image, IEEE Transactions on Instrumentation and Measurement, Vol. 71.
[22] Wang S., Hu Y., Feng L., Guo L., 2022, Improved breast mass recognition YOLOv3 algorithm based on cross-layer feature aggregation, Laser & Optoelectronics Progress, Vol. 59, No. 4.
[23] Wang Z., Zhu H., Jia X., Bao Y., Wang C., 2022, Surface defect detection with modified real-time detector YOLOv3, Journal of Sensors, Vol. 2022.
[24] Cao L., Li H., Xie R., Zhu J., 2020, A text detection algorithm for image of student exercises based on CTPN and enhanced YOLOv3, IEEE Access, Vol. 8, pp. 176924-176934.
[25] Wang F., Ao X., Wu M., Kawata S., She J., 2024, Explainable deep learning for sEMG-based similar gesture recognition: a Shapley-value-based solution, Information Sciences, Vol. 672, pp. 120667.
[26] Wang J., Su S., Wang W., Chu C., Jiang L., Ji Y., 2022, An object detection model for paint surface detection based on improved YOLOv3, Machines, Vol. 10, No. 4.
[27] Li Y., Wei G., Desrosiers C., Zhou Y., 2024, Decoupled and boosted learning for skeleton-based dynamic hand gesture recognition, Pattern Recognition, Vol. 153, pp. 110536.
[28] Luo Z., Yu H., Zhang Y., 2020, Pine cone detection using boundary equilibrium generative adversarial networks and improved YOLOv3 model, Sensors, Vol. 20, No. 16.
[29] Qu M., Zhou J., Lv D., Zhang G., Zheng Y., Xie J., 2024, Synchronous gesture recognition and muscle force estimation based on piezoelectric micromachined ultrasound transducer, Sensors and Actuators A: Physical, Vol. 377, pp. 115687.
[30] Sun Z., 2024, Wearable glove gesture recognition based on fiber Bragg grating sensing using genetic algorithm–back propagation neural network, Optical Fiber Technology, Vol. 87, pp. 103874.
Shunmin Su
../../Resources/ieie/IEIESPC.2026.15.1.55/au1.png

Shunmin Su was born in a village in Changsha, Hunan, China, in 1980. He received his bachelor's degree from Hunan Normal University in 2003 and a master's degree from Hunan Normal University in 2011. From 2003 to 2025, he has been a physical education teacher at Hunan International Economics University. He is the author of two books and more than 10 articles. His research interests include sports training and the sports industry.

Shuangshuang Yan
../../Resources/ieie/IEIESPC.2026.15.1.55/au2.png

Shuangshuang Yan was born in Zibo City, Shandong Province, China, in 1991. She earned a master's degree in sports training from Capital University of Physical Education and Sports in 2015. Since 2017, she has been employed at Qingdao Hengxing University of Science and Technology, where she currently serves as Dean of the School of Music and Dance. Her academic contributions include over 20 published papers, 6 research projects, and over 30 awards in professional competitions. Her research expertise focuses on the professional development of sports dance, stage performance practices, and technical expertise in Latin dance.