Hosang Yu 1,a
Jaechan Park 1,2,a
Kyunghun Kang 3,*
Sungmoon Jeong 1,4,*

1 Research Center for Artificial Intelligence in Medicine, Kyungpook National University Hospital, Daegu, Korea ({youhs4554, jeongsm00}@gmail.com)
2 Department of Neurosurgery, School of Medicine, Kyungpook National University, Daegu, Korea (jparkmd@hotmail.com)
3 Department of Neurology, School of Medicine, Kyungpook National University, Daegu, Korea (kangkh@knu.ac.kr)
4 Department of Medical Informatics, School of Medicine, Kyungpook National University, Daegu, Korea
a These authors contributed equally.
Copyright © The Institute of Electronics and Information Engineers (IEIE)
Keywords
Video recognition, Computer vision, Gait analysis, Mask guided attention, Bi-level optimization
1. Introduction
Clinical gait analysis is a comprehensive assessment of an individual’s walking pattern,
which plays a crucial role in the diagnosis and management of various neurodegenerative
diseases [1-4]. It involves the systematic evaluation of the complex interaction of muscles, joints,
and the nervous system, which contribute to human gait. The importance of clinical
gait analysis lies in its ability to identify and quantify gait abnormalities, facilitate
early intervention, and monitor the progression and treatment outcomes of neurodegenerative
conditions.
In recent years, computer vision-based gait analysis methods have gained traction
as promising non-invasive approaches to assess and quantify gait characteristics.
These methods overcome the limitations of marker-based motion capture techniques,
which can be time-consuming, expensive, and require specialized equipment, discouraging
regular gait status assessments. Recent studies demonstrate that physical spatio-temporal
gait variables, such as gait speed and step width, can be measured accurately using
deep learning-based pose estimation algorithms such as OpenPose [5,6] and AlphaPose [7]. However, because these pose-based approaches predict body-joint coordinates in the image plane, they require camera calibration to compute physical gait variables, which limits their general applicability despite their strong performance.
Other vision-based methods involve pre-processing video frames to create gait energy
images (GEI) [8] or silhouette images [9-11], which are subsequently used to train predictors. Recent works have employed thin
skeleton-like images to train a 3D CNN [12], which performs spatio-temporal evaluations on them. By using skeleton-like images, the 3D CNN can more precisely evaluate gait cycles based on features that emphasize the lower body parts. This approach achieved favorable results on the CASIA-B gait benchmark dataset [13], which evaluates gait recognition under varying view angles, clothing, and carrying conditions.
These results demonstrated the potential of 3D CNNs for gait analysis. Thanks to its end-to-end trainable nature, this method delivers good results without requiring camera calibration. However, end-to-end trainable methods recognize gait by training a predictor and work well only under specific conditions, such as limited backgrounds and single-person indoor setups.
Vision-based gait analysis methods assume single-person situations, but such scenarios
are unnatural in real-world medical settings. Instead, multi-person situations are
more common. For instance, as shown in Fig. 1, medical staff often walk alongside a patient during a gait test to assist an elderly
patient who has difficulty walking. In this scenario, it might be challenging to apply
the vision-based methods because only the patient’s gait patterns should be recognized
while avoiding unrelated subjects or objects that could hinder accurate analysis.
In this paper, we propose a new method to address these challenges in multi-person
environments where multiple individuals coexist. The main challenge is extracting
target patient gait characteristics only while reducing the effects of distractors
for precise gait analysis. In recent years, deep learning-based models have shown
strong performance in object detection and segmentation tasks. For better results,
however, fine-tuning is generally required by adapting models to a dataset. Given
the tremendous number of video frames, annotating bounding boxes or segmentation masks
for each frame is time-consuming and expensive. To reduce annotation costs, we automatically detect patients using YOLO v8 [14], a state-of-the-art object detector offering both fast inference and high accuracy, and use its predicted detection boxes as pseudo-mask labels that provide additional supervision to focus on the patient's gait. We employed the Deep SORT tracking algorithm [15] to select the target patient's track ID and converted the corresponding detection boxes into binary mask images to serve as our pseudo-mask labels.
The main contribution of this paper is the proposal of a novel model architecture
called a Scaled Mask Guided Attention Network (SMAGNet), which effectively addresses
the issues in multi-person environments where multiple individuals coexist. We use the pseudo-mask labels as auxiliary targets for attention learning so that the model attends more to the target patient's gait. Our model is a variant of mask-guided approaches [16-19], which utilize pseudo-labeled binary masks to suppress irrelevant features. This mask-guided approach enables the model to treat the masked region, where the target ROIs (e.g., patients) are located, as more important than the rest of the frame.
Fig. 1. Real-world gait analysis environment using the electronic walkway system GAITRite.
2. Related Works
2.1 Mask-guided Attention Method
The mask-guided attention method is a simple but effective approach to address interference
issues in multiple-person environments. It suppresses irrelevant features through
the straightforward multiplication of convolutional feature maps and binary masks
of a target object. The masks provide supervision or guidance for predicting pixel-wise
spatial attention maps that highlight regions that are likely to be associated with
the target object. For example, by using mask-guided attention methods, the performance
of pedestrian re-identification tasks was notably improved in [16,18]. Re-identification is challenging in that the model must distinguish each pedestrian's unique gait patterns regardless of clothing and background. Box-shaped coarse masks from the Faster R-CNN detector [17] were utilized to focus on each detected individual and effectively address the person-to-person occlusion problem [16]. Binary semantic segmentation masks from Mask R-CNN [19] were utilized to reduce background clutter and prioritize the pedestrian's body parts [18]. Our approach differs from these mask-guided methods [16,18], which rely heavily on the quality of noisy pseudo-mask labels. Instead, we optimize auxiliary scaling weights that refine the mask supervision through bi-level optimization.
2.2 Bi-level Optimization Problem (BOP)
The goal in the bi-level optimization problem (BOP) is to find the optimal hyperparameters
when the optimization task involves a hierarchical structure with two levels of optimization.
A generative adversarial network (GAN) [32] is the most famous and successful example of BOP; it alternately solves two optimization problems at different levels using discriminator and generator networks. In many studies, BOP has been used to optimize the hyperparameters of deep neural networks. For example, BOP was utilized for class-incremental learning (CIL), in which the number of classes increases phase by phase [20]. To address the stability-plasticity dilemma between learning old and new classes, aggregation weights (i.e., hyperparameters) that adaptively balance the stability and plasticity building blocks are learned alongside the model parameters through an alternating training strategy. This approach demonstrated improved results on varied CIL benchmarks.
BOP was utilized for the gradient regularization method, in which gradients are boosted
when the gradients of both the training and validation sets agree in direction and
are regulated otherwise [21]. The authors introduced scalar weights, which adaptively modulate the magnitude of
the gradients during weight updates. These scalar weights were also considered as
hyperparameters and were fine-tuned right after updating model parameters to minimize
validation errors through a bi-level optimization strategy. They minimized training
errors by updating model parameters with the scalar weights fixed and then switched
to minimize validation errors. This bi-level approach resulted in consistent improvement
in generalization across various image classification benchmarks.
Inspired by previous studies, we introduce scaling weights for re-weighting mispredicted
attention maps. In order to train the proposed SMAGNet, we employed a bi-level optimization
scheme to optimize scaling weights. The SMAGNet showed refined attention patterns,
even when noisy pseudo-mask labels were used for training attention, as shown in Fig. 4. The overall architecture of the proposed SMAGNet is shown in Fig. 2.
Fig. 2. The proposed SMAGNet architecture.
3. Proposed Method
Unlike other mask-guided approaches [16,17], we perform an additional optimization round called the attention mask refinement
process, which progressively corrects mispredicted attention maps. To this end, we implemented an auxiliary scaling layer that re-weights mispredicted attention maps. To learn optimal scaling weights, we introduce a hyperparameter optimization method based on BOP [20,21]. In summary, the proposed attention mask refinement process involves training attention
maps using pseudo-mask labels and then fine-tuning the scaling weights (hyperparameters)
to correct attention errors. These two steps are alternately performed. For instance,
when prediction performance on gait variables is degraded due to mask supervision
onto non-target objects, the scaling layer learns to down-weight the corresponding
regions of convolutional feature maps in the direction of recovering the performance.
It is important to validate generalization performance using public benchmark data. We therefore conducted an extensive search for datasets suitable for evaluating the proposed spatiotemporal gait parameter regression task, including well-known options such as GREW [22], Gait3D [23], Human3.6M [24], and GPJATK [25]. The GREW dataset provides re-identification annotations that identify who each pedestrian is, which differs from the gait variable regression task addressed in this study, so it could not be included.
The Human3.6M and GPJATK datasets offer valuable 3D coordinate information for each
joint acquired through precise 3D motion capture systems. Unfortunately, these datasets
primarily focus on single-person scenarios within controlled laboratory settings.
Recent large-scale public datasets like GREW and Gait3D are limited to localized pedestrians,
making it difficult to evaluate gait parameter regression tasks within multi-person
conditions. Additionally, the provided annotations are incompatible with the objectives
of our study. We have found a similar study [26], but unfortunately, it has not been published yet, and the dataset is not publicly
available.
To the best of our knowledge, a dataset offering spatiotemporal gait parameters measured by gold-standard systems like GAITRite has not been released, particularly for a clinical
population. We evaluated the proposed model on our own gait video dataset collected
from a pressure-sensing electronic walkway system, GAITRite, which is the gold standard
in clinical gait analysis and has been verified [27-29]. Detailed information about the GAITRite dataset is presented in Table 4.
3.1 Mask-guided Attention
Following previous mask-guided attention operations [16,18], we define generic mask-guided attention as the element-wise multiplication of every channel of the base feature map $f_{base}$ with the attention mask $f_{mask}$:

$$f_{out}^{i}=f_{base}^{i}\otimes f_{mask},\quad (1)$$

where $i$ is the channel index, and $\otimes $ is element-wise multiplication. $f_{mask}\in \left[0,1\right]$ is defined as $f_{mask}=\sigma \left(W_{mask}*f_{base}\right)$, where $*$ is a convolution operator, $\sigma \left(x\right)=1/\left(1+\exp \left(-x\right)\right)$ is the sigmoid function, and $W_{mask}$ denotes the convolution filters of the mask prediction layer. The mask-guided approach assumes that a predicted attention mask helps identify a target object and provides supervision for the base feature by optimizing the attention loss $L_{att}$, defined as the per-pixel binary cross-entropy (BCE) loss:

$$L_{att}=-\frac{1}{N}\sum _{j=1}^{N}\left[m_{gt}^{j}\log f_{mask}^{j}+\left(1-m_{gt}^{j}\right)\log \left(1-f_{mask}^{j}\right)\right],\quad (2)$$

where $j$ indexes pixels, $N$ is the number of pixels, and $m_{gt}$ is the coarse-level ground-truth mask label obtained using the YOLO v8 object detector [14].
If a pixel lies within the target object’s detection box, it is annotated as 1. Otherwise,
it is annotated as 0. When a pseudo-mask label $m_{gt}$ is used as a target label,
it can be quite noisy, leading to invalid attention. To address this issue, we introduce
a scaling layer with weights that are learned by backpropagation. This layer is parallel to the mask prediction layer, and its output is multiplied with the original mask predictions to refine them, as illustrated in Fig. 2.
3.2 Attention Module
In order to reduce attention errors caused by noisy pseudo-mask labels from the YOLO detector,
we designed a simple and effective attention module that adaptively scales the intensity
of the mispredicted attention mask in an explicit manner. As shown in Fig. 2, the attention module includes two parallel convolution layers: a mask layer and
a scaling layer. These layers predict the attention mask $f_{mask}$ and scaling weights
$\phi _{scale}$, respectively. The scaling layer was implemented with small stacks
of convolutions with an architecture that is shared with the mask layer. Two $3\times
3\times 3$ convolution layers with rectified linear unit (ReLU) activation are used
for feature extraction, and $1\times 1\times 1$ convolution and a sigmoid function
are used for a single-channel output in the range of [0,1]. Then, we obtain refined features $f_{out}^{s}$ by re-weighting the base feature map using the scaled attention mask $\sqrt{f_{mask}\otimes \phi _{scale}}$:

$$f_{out}^{s,i}=f_{base}^{i}\otimes \sqrt{f_{mask}\otimes \phi _{scale}}.\quad (3)$$
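A minimal sketch of the attention module follows, mirroring the two parallel branches described above; the hidden channel width and the numerical epsilon inside the square root are our own assumptions.

```python
import torch
import torch.nn as nn

def conv_branch(channels, hidden=64):
    # Two 3x3x3 convolutions with ReLU for feature extraction, then a 1x1x1
    # convolution with sigmoid for a single-channel output in [0, 1].
    return nn.Sequential(
        nn.Conv3d(channels, hidden, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.Conv3d(hidden, hidden, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.Conv3d(hidden, 1, kernel_size=1), nn.Sigmoid(),
    )

class ScaledMaskGuidedAttention(nn.Module):
    """Two parallel branches predict f_mask and phi_scale (Fig. 2)."""
    def __init__(self, channels):
        super().__init__()
        self.mask_layer = conv_branch(channels)   # supervised with pseudo-masks
        self.scale_layer = conv_branch(channels)  # scaling weights (hyperparameters)

    def forward(self, f_base, eps=1e-8):
        f_mask = self.mask_layer(f_base)
        phi_scale = self.scale_layer(f_base)
        # Refined features via the scaled attention mask (Eq. 3); eps keeps the
        # gradient of sqrt finite at zero.
        f_out = f_base * torch.sqrt(f_mask * phi_scale + eps)
        return f_out, f_mask
```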
To build SMAGNet, the ResNet-based 3D CNN R(2+1)D-18 [30] was employed as the base model. The proposed attention module was placed after each
residual block (or stage) of the base model. By forwarding the refined features $f_{out}^{s}$ through every residual stage, we predict gait variables with a fully connected (fc) layer as $y_{pred}=W_{fc}\,f_{out,last}^{s}$, where $W_{fc}$ denotes the fc layer weights, and $f_{out,last}^{s}$ is the refined feature from the last residual stage.
We employed the R(2+1)D-18 model in our study due to its inherent capability to effectively
capture spatiotemporal characteristics. By factorizing 3D kernels into 2D and 1D kernels,
the model is able to separately capture spatial and temporal information. This systematic
kernel factorization method allows the R(2+1)D-18 model to accurately measure gait
variables by comprehensively detecting both spatial and temporal changes. Also, since this architecture is easy to customize, we employed it as our baseline model.
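The sketch below shows one plausible way to assemble SMAGNet from the torchvision R(2+1)D-18 backbone, inserting the attention module sketched above after each residual stage; layer names follow torchvision's VideoResNet, while the assembly details are our assumptions.

```python
import torch.nn as nn
from torchvision.models.video import r2plus1d_18, R2Plus1D_18_Weights

class SMAGNet(nn.Module):
    """R(2+1)D-18 backbone with an attention module after each residual stage."""
    def __init__(self, num_vars=9):
        super().__init__()
        base = r2plus1d_18(weights=R2Plus1D_18_Weights.KINETICS400_V1)
        self.stem = base.stem
        self.stages = nn.ModuleList([base.layer1, base.layer2, base.layer3, base.layer4])
        # Channel widths of the four residual stages of R(2+1)D-18
        self.attn = nn.ModuleList(ScaledMaskGuidedAttention(c) for c in (64, 128, 256, 512))
        self.pool = nn.AdaptiveAvgPool3d(1)
        self.fc = nn.Linear(512, num_vars)  # y_pred = W_fc f_out_last

    def forward(self, x):
        x = self.stem(x)
        masks = []  # per-stage attention masks, supervised by L_att
        for stage, attn in zip(self.stages, self.attn):
            x, f_mask = attn(stage(x))
            masks.append(f_mask)
        return self.fc(self.pool(x).flatten(1)), masks
```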
Our objective was to learn the optimal scaling weights $\phi _{scale}$ by suppressing
misguided features that adversely affect the main task loss, resulting in refined
attention maps. We used the smooth L1 loss as the main task loss $L_{main}$ for its robustness against outlier samples:

$$L_{main}=\begin{cases}0.5\left(y_{pred}-y_{gt}\right)^{2}/\delta , & \text{if } \left| y_{pred}-y_{gt}\right| <\delta \\ \left| y_{pred}-y_{gt}\right| -0.5\delta , & \text{otherwise},\end{cases}\quad (4)$$

where $y_{gt}$ denotes the ground-truth gait variables, and $\delta $ is a hyperparameter that adaptively determines whether L1 or L2 loss is used, depending on the input. We followed the default PyTorch implementation and used 1.0 for $\delta$.
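In PyTorch, this corresponds to the built-in criterion, where `beta` plays the role of $\delta$:

```python
import torch.nn as nn

# Smooth L1 main-task loss of Eq. (4); beta corresponds to delta = 1.0
main_criterion = nn.SmoothL1Loss(beta=1.0)
```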
3.3 Attention Mask Refining Process
The proposed attention module has two key layers to optimize: 1) the mask prediction
layer supervised with pseudo-mask labels and 2) the scaling layer that corrects mask
prediction errors. In this work, we introduce scaling weights as new hyperparameters
that adaptively re-weight mispredicted attention maps. To this end, we employed BOP,
which alternately solves two different levels of problems, where one task (i.e., the
lower-level task) is subject to the other task (i.e., the upper-level task). In our
formulation, the lower-level task involves learning attention maps using pseudo-mask
labels, whereas the upper-level task involves fine-tuning the scaling weights (i.e.,
hyperparameters) to achieve optimal mask predictions for improved validation prediction
performance, given previously learned attention maps. These two steps are alternately
performed to balance the mask and scaling layers until convergence. This is accomplished through two-round refinement steps, in which attention masks are progressively refined by alternately switching between optimizing the two parameter groups $\theta _{1}$ and $\theta _{2}$ of SMAGNet, where each subscript denotes the parameter group index.
Table 1. Overall performance evaluation results.

Method | RMSE | MAPE (%)
R(2+1)D-18 [30] | 4.11 | 10.0
SMAGNet (ours) | 2.17 | 6.63

Table 2. Results of ablations of each method. Reduction ratio in RMSE ($\Delta$) is reported.

Method | RMSE | $\Delta$ (%)
Baseline | 6.16 | -
+ attention module | 3.39 | 44.9
+ BOP-based mask refinements | 2.62 | 22.4
Table 3. Performance comparison for 9 types of gait variables. MAPEs (%) are compared.

Variable Name | Side | Baseline | SMAGNet | Change
Velocity (cm/min) | - | 6.02 | 5.28 | -0.74
Cadence (steps/min) | - | 4.47 | 4.02 | -0.45
Cycle Time (sec) | Left | 4.05 | 4.14 | +0.09
Cycle Time (sec) | Right | 4.51 | 4.15 | -0.36
Stride Length (cm) | Left | 5.06 | 3.72 | -1.34
Stride Length (cm) | Right | 4.82 | 3.61 | -1.21
Support Base (cm) | Left | 6.67 | 5.71 | -0.96
Support Base (cm) | Right | 6.42 | 6.13 | -0.29
Swing Percent (%) | Left | 4.92 | 3.53 | -1.39
Swing Percent (%) | Right | 6.71 | 3.98 | -2.73
Stance Percent (%) | Left | 2.53 | 1.69 | -0.84
Stance Percent (%) | Right | 3.23 | 1.86 | -1.37
Double Support Percent (%) | Left | 8.48 | 4.91 | -3.57
Double Support Percent (%) | Right | 7.67 | 4.94 | -2.73
Toe In Out (degree) | Left | 43.97 | 27.48 | -16.49
Toe In Out (degree) | Right | 41.09 | 21.02 | -20.07
In round 1, we jointly optimize our main task loss $L_{main}$ and attention loss $L_{att}$
for the training dataset. We backpropagate the gradients with respect to both $L_{main}$
and $L_{att}$ through layers of SMAGNet while keeping the scaling layers fixed. Thus,
in round 1, we learn to approximate attention maps using pseudo-mask labels by updating
the model parameters $\theta _{1}=\left[W_{base},W_{mask},~ W_{fc}\right]$ with a
learning rate of $\gamma _{1}$. Each element of the list represents the parameters of the base model ($W_{base}$), the mask layer ($W_{mask}$), and the fully connected output layer ($W_{fc}$), respectively.
In round 2, we subsequently optimize the scaling weights $\phi _{scale}$ on top of the predicted attention masks $f_{mask}$, which are used for the refined feature maps
${f}_{out}^{s}$ defined in Eq. (3). The optimization involves learning the optimal scaling weights to refine misguided
attention caused by invalid supervision of pseudo-labels, which could lead to bad
main-task performance. The parameters $\theta _{2}=\left[W_{scale}\right]$ are optimized
with a learning rate of $\gamma _{2}$, where $W_{scale}$ represents the model parameter
of the scaling layer.
In contrast to round 1, we update only the scaling layers, keeping all other model weights of SMAGNet fixed, and minimize the main task loss on the validation set. This
enables the model to learn optimal scaling weights for enhanced generalization capability
in the main task, which is equivalent to hyperparameter optimization. Thus, the proposed
bi-level optimization serves as a unified method for the continuous optimization of
both the model parameters and hyperparameters for better generalization, which is
akin to the procedure in cross validation. The proposed two-round refinement process
is repeated alternately until the model converges (i.e., round-1 model parameters
$\theta _{1}=\left[W_{base},W_{mask},W_{fc}\right]$ are updated in the n-th epoch,
and the round-2 model parameters $\theta _{2}=\left[W_{scale}\right]$ are updated
in the (n+1)-th epoch).
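The alternating schedule can be sketched as below, assuming the model, data loaders, and loss helpers from the previous sketches; the epoch-parity switch and optimizer settings follow Section 4.2, while variable names are illustrative.

```python
import torch

# Parameter groups: theta_2 is the scaling layers, theta_1 is everything else.
theta_2 = [p for n, p in model.named_parameters() if "scale_layer" in n]
theta_1 = [p for n, p in model.named_parameters() if "scale_layer" not in n]
opt1 = torch.optim.Adam(theta_1, lr=1e-4, weight_decay=1e-3)
opt2 = torch.optim.Adam(theta_2, lr=1e-4, weight_decay=1e-3)

for epoch in range(150):
    if epoch % 2 == 0:  # round 1: fit attention and main task on training data
        for x, m_gt, y_gt in train_loader:
            y_pred, masks = model(x)
            loss = main_criterion(y_pred, y_gt) + sum(
                attention_loss(m, m_gt) for m in masks)
            opt1.zero_grad()
            loss.backward()
            opt1.step()   # scaling layers receive no update in this round
    else:               # round 2: tune scaling weights on validation data
        for x, _, y_gt in val_loader:
            y_pred, _ = model(x)
            loss = main_criterion(y_pred, y_gt)  # main-task loss only
            opt2.zero_grad()
            loss.backward()
            opt2.step()   # only the scaling layers are updated
```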
4. Experiments
4.1 Dataset
Our GAITRite dataset was collected from a pressure-sensing electronic walkway system (GAITRite) in a real-world clinical environment. Distributions of the gait variables and demographics of the subjects are summarized in Table 4.
4.2 Experimental Settings
R(2+1)D-18 CNN models [30] pre-trained on Kinetics-400 were used as our baseline models. The official implementations
from the torchvision library were utilized. We implemented the mask layer and scaling
layer in parallel, as shown in Fig. 2, and integrated them into each residual stage of the R(2+1)D-18 model. Both the mask
layer and scaling layer were constructed using two stacks of convolution layers with
the sigmoid function as the output activation to predict binary attention masks and
scaling weights, respectively.
For all models, Adam [33] was used as an optimizer with the momentum set to 0.9 and weight decay set to 0.001.
All models were trained for 150 epochs using a batch size of 32 and a learning rate
of 0.0001. To implement the BOP-based attention mask refinement algorithm, which requires
two optimizers, we used the same settings for both optimizers. We allocated 80% of
the dataset for training and 20% for testing. Since gait analysis can be performed
multiple times for each patient, we carefully split the training/testing data to ensure
that the same patients were not included in both splits.
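The patient-level split can be realized with scikit-learn's grouped splitter; `patient_ids` (one ID per video) is an assumed array from our records:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# 80/20 split that keeps all videos of a patient in the same subset
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
indices = np.arange(len(patient_ids))
train_idx, test_idx = next(splitter.split(indices, groups=patient_ids))
```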
4.3 Preprocessing
As a preprocessing step, we obtained detection boxes for all individuals using YOLO v8 [14] and tracked each detection box's trajectory using the Deep SORT algorithm [15]. We utilized Deep SORT to minimize track ID switching caused by occlusion or varying viewpoints. Considering that our videos were recorded from a frontal viewpoint, we chose the trajectory that varied the most in the y-direction (vertical) relative to the x-direction (horizontal). The patient detection process is shown in the upper part of Fig. 2.
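A sketch of this selection and rasterization step is given below; it assumes the tracker output has already been collected into per-track arrays of (x_center, y_center, width, height) boxes, and the variance-ratio criterion is one plausible reading of our selection rule.

```python
import numpy as np

def select_patient_track(tracks):
    """Pick the track whose center varies most vertically relative to horizontally.

    tracks: dict mapping a Deep SORT track ID to an (N, 4) array of per-frame
    boxes (x_center, y_center, width, height) -- an assumed intermediate format.
    """
    def vertical_ratio(boxes):
        return np.var(boxes[:, 1]) / (np.var(boxes[:, 0]) + 1e-8)
    return max(tracks, key=lambda tid: vertical_ratio(tracks[tid]))

def boxes_to_pseudo_masks(boxes, height, width):
    """Rasterize the selected track's boxes into binary pseudo-mask labels."""
    masks = np.zeros((len(boxes), height, width), dtype=np.float32)
    for t, (xc, yc, w, h) in enumerate(boxes):
        x1, y1 = int(xc - w / 2), int(yc - h / 2)
        x2, y2 = int(xc + w / 2), int(yc + h / 2)
        masks[t, max(y1, 0):y2, max(x1, 0):x2] = 1.0  # pixels inside the box -> 1
    return masks
```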
From each video, we uniformly sampled 64 frames. We applied standard data augmentation
methods, such as random cropping and horizontal flipping. Moreover, we stacked 64
images with a resolution of 112$\times$112 pixels to use them as input for our model. We used 64 images because they empirically fit our GPU memory capacity (NVIDIA TITAN V; 12 GB). We also employed a frame resolution of 112$\times$112 because the R(2+1)D-18 model was pretrained on video frames with this resolution for the Kinetics-400 dataset [30]. As shown in Table 4, we used nine types of gait variables with different units and scales as our targets.
Therefore, we applied standard scaling for data normalization.
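The sampling and target normalization can be sketched as follows; `y_train` is an assumed (N, 9) array of training targets, and the scaler is fit on training data only:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

def sample_frames(video, num_frames=64):
    """Uniformly sample num_frames frames from a (T, H, W, C) video array."""
    idx = np.linspace(0, len(video) - 1, num_frames).astype(int)
    return video[idx]

# Standard scaling of the nine gait variables with different units and scales
scaler = StandardScaler().fit(y_train)
y_train_scaled = scaler.transform(y_train)
```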
Table 4. Dataset information. Distributions of the gait variables (mean ± SD) and demographics of subjects are listed. Gait variables with left/right pairs are averaged for simplicity.

Gait Variables:
Variable | Mean ± SD | Range | Definition
Velocity (cm/min) | 73.3 ± 28.1 | [27.0, 119] | walking speed
Cadence (steps/min) | 105 ± 16.6 | [76.6, 129] | number of steps per minute
Cycle Time (sec) | 1.16 ± 0.26 | [0.92, 1.56] | duration of a single gait cycle
Stride Length (cm) | 82.8 ± 27.3 | [36.0, 124] | displacement between successive strides of the same foot
Support Base (cm) | 11.4 ± 4.00 | [5.86, 18.7] | horizontal distance between the outer edges of both feet
Swing Percent (%) | 34.2 ± 4.54 | [25.7, 39.6] | percent of a gait cycle between toe-off and the next heel strike
Stance Percent (%) | 65.7 ± 4.63 | [60.3, 74.3] | percent of a gait cycle between heel strike and toe-off, when the foot is in contact with the ground
Double Support Percent (%) | 31.8 ± 9.36 | [21.0, 48.8] | percent of a gait cycle when both feet are in contact with the ground simultaneously
Toe In Out (degree) | 11.9 ± 7.27 | [-31.5, 41.0] | angle between the foot's long axis (from heel to toes) and the gait direction

Demographics:
Variable | Mean ± SD | Range
Age (years) | 68.1 ± 13.2 | [22, 93]
Height (cm) | 155 ± 43.6 | [132, 187]
Weight (kg) | 59.5 ± 20.3 | [34, 105]
Gender | 1,069 males; 1,215 females | N/A
4.4 Results
Table 1 presents the overall evaluation results on our GAITRite dataset. We compared the
root mean squared error (RMSE) and mean absolute percentage error (MAPE) using the
same base model of R(2+1)D-18. To evaluate the overall performance, we computed the
average RMSE and MAPE of all gait variables. In the results, our proposed SMAGNet
demonstrated significant improvements in both RMSE and MAPE, with values dropping
from 4.11 to 2.17 and from 10.0% to 6.63%, respectively. This result is notable since
we only utilized pseudo-mask labels obtained from the YOLO v8 object detector to train
SMAGNet’s attention capability without any labeling costs. Furthermore, our method requires only light modifications to the modeling pipeline: the insertion of attention modules and the implementation of our proposed bi-level optimization strategy for attention mask refinement.
Next, we conducted an ablation study to demonstrate the effectiveness of each proposed
method: the insertion of the attention modules and the BOP-based mask refinement process.
The resulting RMSE reduction ratio (${\Delta}$) is the relative change in RMSE before and after applying each method and is reported in Table 2. In the results, simply inserting the attention module had the most significant impact
on improving performance, showing a reduction ratio of 44.9% compared to the baseline
method. The BOP-based mask refinements enhanced the results, showing a reduction ratio
of 22.4% compared to the case of optimizing only the attention module without any
refinement. These results suggest that learning attention masks for the target patients,
even when using pseudo-mask labels, is critical for accurate gait analysis in multi-person
scenarios. In addition, they also indicate that by utilizing our proposed bi-level
strategy to refine potentially mispredicted mask predictions that are trained with
noisy pseudo-mask labels, we can achieve better results.
As shown in Table 3, we inspected the MAPE values for each gait variable for error analysis. The gait variable named ``Toe In-Out (degree)'' is defined as the angle between the foot and the gait direction. The MAPE was reduced most drastically for this variable, by roughly 20 percentage points.
Analyzing this variable is challenging as it requires accurately capturing the shape
of the foot, which occupies only a small portion of the video frame. Furthermore,
for gait variables that require precise detection of the gait cycle, such as ``Swing
Percent (%)'', ``Stance Percent (%)'', and ``Double Support Percent (%)'', SMAGNet
also demonstrated a notable reduction in MAPE. These results suggest that SMAGNet
has the ability to capture gait cycles by paying more attention to the target patient
in our multi-person scenarios effectively while preventing interference from non-target
individuals walking alongside the target patient.
In order to investigate the linear correlation between predictions and ground truth
from the GAITRite, we calculated Pearson correlation coefficients using SciPy’s stats
module and compared the proposed methods with the baseline. As shown in Fig. 3, our full method (denoted as ``w/ BOP'') showed the best correlations with all ground truth gait variables. However, when incorporating attention modules without BOP-based mask refinements (denoted as ``w/o BOP''), the correlation coefficients were poor for the angle-related gait variable ``Toe In-Out (degree)''. We suspect that the proposed
method of learning attention maps supervised with box-shaped masks might not be sufficient
to provide details of foot shapes. After applying BOP-based mask refinements (see ``w/ BOP''), we observed improved results for all gait variables, including the ``Toe In-Out
(degree)''. The proposed method can overcome the limitations of using box-shaped attention
masks by producing more complex and detailed attention masks through refinements,
leading to better results.
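For reference, the correlation analysis reduces to a single SciPy call per gait variable; `preds` and `gts` are assumed 1-D arrays of predictions and GAITRite measurements:

```python
from scipy import stats

# Pearson correlation and its significance for one gait variable
r, p_value = stats.pearsonr(preds, gts)
```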
Lastly, as shown in Fig. 4, we conducted a qualitative investigation of the attention capability of our proposed
method by visualizing intermediate convolutional features for the baseline method,
our approach before BOP-based refinements, and our approach after the refinements.
To examine the attention capability on challenging samples, we analyzed samples that provided inaccurate mask labels to the model. To find these samples, we applied a simple K-nearest neighbor outlier detection algorithm to the detection box coordinate vectors (i.e., width, height, x_center, y_center), as sketched below.
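A minimal version of this outlier-scoring step follows; the neighbor count k and the 95th-percentile threshold are illustrative assumptions:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_outlier_scores(box_vectors, k=5):
    """Mean distance to the k nearest neighbors as an outlier score.

    box_vectors: (N, 4) array of (width, height, x_center, y_center) per sample.
    """
    nn_model = NearestNeighbors(n_neighbors=k + 1).fit(box_vectors)
    dists, _ = nn_model.kneighbors(box_vectors)  # column 0 is the self-distance
    return dists[:, 1:].mean(axis=1)

scores = knn_outlier_scores(box_vectors)
outliers = np.where(scores > np.percentile(scores, 95))[0]  # flag the top 5%
```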
In our baseline model, R(2+1)D-18 [30], we observed less meaningful results, with high attention values on all moving individuals, regardless of whether they were patients. In contrast, after simply incorporating the attention module (denoted as ``before BOP'' in the figure), the attention shifted toward the patients' lower body parts in most cases. Intriguingly, as illustrated in the rightmost column, applying BOP-based attention mask refinements produced even more reasonable attention maps that focused particularly on the actual foot strides of the target patient, which is crucial for precise gait-cycle and foot-shape analysis. These results support the enhanced performance of our proposed SMAGNet compared to before applying BOP-based refinements, particularly for foot-angle-related and gait-cycle-related variables.
Fig. 3. Correlation coefficient comparison for all gait variables. All correlation coefficients were statistically significant (p < 0.05).
Fig. 4. Impact of the proposed method on attention capability in multi-person environments: convolutional features are visualized where the detection of the target patient is not accurate. From left to right are the inaccurate mask labels using detection boxes from YOLO v8, convolutional features by R(2+1)D-18 [30], our method without BOP refinement, and our full method (red color: higher values; blue color: lower values).
4.5 Comparison with State of the Art
Table 5 shows comparison results on our GAITRite dataset for a recent state-of-the-art gait recognition model named GaitBase [31]. GaitBase has achieved strong performance on many gait-based person re-identification benchmarks under in-the-wild conditions, which contain diverse noisy visual factors such as clothing and carrying conditions. Accordingly, it uses silhouette images, which contain fewer visual details, as inputs. We followed the official OpenGait codebase (\url{https://github.com/ShiqiYu/OpenGait}) to implement the preprocessing, including person tracking and silhouette extraction for each person. After tracking, we selected the patient silhouettes using the same selection criterion as our method.
In this experiment, we used the GaitBase model, which was pretrained on the largest
public gait dataset named GREW [22]. Robust silhouette embeddings were extracted from the pretrained GaitBase and fed
to the fc layer. The same regression loss $L_{main}$ defined in Eq. (4) was adopted to finetune the entire model weights. We employed the same training strategy
as our method. RMSE and MAPE(%) are compared in Table 5.
Table 5 shows that the proposed SMAGNet significantly outperforms GaitBase across all evaluation
metrics. The lower performance of GaitBase could arise from its heavy reliance on
the quality of silhouette inputs. As shown in Fig. 5, preprocessed silhouette images are too simple or noisy to provide enough information
for precise gait analysis. As a result, the GaitBase model, which uses only silhouette images as inputs, cannot cope with such irreversibly degraded inputs and performs worse than the proposed model, even though it was pretrained on the large-scale GREW dataset.
Table 5. Comparison of GaitBase [31] and the proposed SMAGNet on our dataset.

Method | RMSE | MAPE (%)
GaitBase [31] | 8.81 | 31.8
SMAGNet | 2.17 | 6.63
5. Conclusion
We have presented a new network architecture called SMAGNet designed for gait analysis
to address the challenges of multi-person environments in complex real-world medical
settings. Experimental results on the GAITRite dataset collected in real-world clinical
environments demonstrated significantly improved performance compared to the baseline
model, showcasing SMAGNet’s effectiveness in multi-person gait analysis and its potential
for real-world clinical use. Additionally, SMAGNet exhibited a strong correlation
with the GAITRite gold-standard gait analysis system. Validating the proposed method on public benchmark datasets remains future work; however, there is currently no publicly available dataset that provides quantitative gait variables measured by gold-standard gait analysis systems like GAITRite. Therefore, we will expand our method to a multi-center study to verify its generality in the future.
Our ultimate goal is to introduce this system in hospitals, enabling medical professionals
to seamlessly monitor the progression and treatment outcomes of neurodegenerative
conditions and make well-informed decisions regarding customized treatment plans for
each patient in a timely manner. Also, this system will eventually be integrated into
mobile phone applications or public kiosk platforms to promote regular gait status
assessments after an intensive validation process. To achieve this goal, we plan to
make the heavy 3D CNN architecture of SMAGNet more lightweight by considering efficiency
in the number of parameters, floating-point operations, and inference time.
ACKNOWLEDGMENTS
This research was supported by Kyungpook National University Research Fund, 2020.
REFERENCES
Y.-H. Lim et al., ``Quantitative Gait Analysis and Cerebrospinal Fluid Tap Test for
Idiopathic Normal-pressure Hydrocephalus,'' Sci Rep, vol. 9, no. 1, Art. no. 1, Nov.
2019.
C. Selge et al., ``Gait analysis in PSP and NPH: Dual-task conditions make the difference,''
Neurology, vol. 90, no. 12, pp. e1021-e1028, Mar. 2018.
D. Cabral et al., ``Frequency of Alzheimer’s Disease Pathology at Autopsy in Patients
with Clinical Normal Pressure Hydrocephalus,'' Alzheimers Dement, vol. 7, no. 5, pp.
509-513, Sep. 2011.
W. Pirker and R. Katzenschlager, ``Gait disorders in adults and the elderly: A clinical guide,'' Wien Klin Wochenschr, vol. 129, no. 3-4, pp. 81-95, Feb. 2017.
J. Kwon, Y. Lee, and J. Lee, ``Comparative Study of Markerless Vision-Based Gait Analyses
for Person Re-Identification,'' Sensors (Basel), vol. 21, no. 24, p. 8208, Dec. 2021.
D. Xue et al., ``Vision-Based Gait Analysis for Senior Care.'' arXiv, Dec. 01, 2018.
Y.-M. Tang et al., ``Diagnostic value of a vision-based intelligent gait analyzer
in screening for gait abnormalities,'' Gait Posture, vol. 91, pp. 205-211, Jan. 2022.
C. Wang, J. Zhang, J. Pu, X. Yuan, and L. Wang, ``Chrono-Gait Image: A Novel Temporal
Template for Gait Recognition,'' in Computer Vision - ECCV 2010, K. Daniilidis, P.
Maragos, and N. Paragios, Eds., in Lecture Notes in Computer Science. Berlin, Heidelberg:
Springer, 2010, pp. 257-270.
L. Wang, T. Tan, H. Ning, and W. Hu, ``Silhouette analysis-based gait recognition
for human identification,'' IEEE Transactions on Pattern Analysis and Machine Intelligence,
vol. 25, no. 12, pp. 1505-1518, Feb. 2003.
C. Prakash, A. Mittal, R. Kumar, and N. Mittal, ``Identification of gait parameters
from silhouette images,'' in 2015 Eighth International Conference on Contemporary
Computing (IC3), Aug. 2015, pp. 190-195.
N. Karimi Hosseini and M. J. Nordin, ``Human Gait Recognition: A Silhouette Based
Approach,'' Journal of Automation and Control Engineering, vol. 1, pp. 40-42, Mar.
2013.
P. Supraja, R. J. Tom, R. S. Tiwari, V. Vijayakumar, and Y. Liu, ``3D convolution
neural network-based person identification using gait cycles,'' Evolving Systems,
vol. 12, no. 4, pp. 1045-1056, Dec. 2021.
S. Yu, D. Tan, and T. Tan, ``A Framework for Evaluating the Effect of View Angle,
Clothing and Carrying Condition on Gait Recognition,'' in 18th International Conference
on Pattern Recognition (ICPR’06), Aug. 2006, pp. 441-444.
D. Reis, J. Kupec, J. Hong, and A. Daoudi, ``Real-Time Flying Object Detection with
YOLOv8.'' arXiv, May 17, 2023.
N. Wojke, A. Bewley, and D. Paulus, ``Simple Online and Realtime Tracking with a Deep
Association Metric.'' arXiv, Mar. 21, 2017.
Y. Pang, J. Xie, M. H. Khan, R. M. Anwer, F. S. Khan, and L. Shao, ``Mask-Guided Attention
Network for Occluded Pedestrian Detection,'' in 2019 IEEE/CVF International Conference
on Computer Vision (ICCV), Seoul, Korea (South): IEEE, Oct. 2019, pp. 4966-4974.
S. Ren, K. He, R. Girshick, and J. Sun, ``Faster R-CNN: Towards Real-Time Object Detection
with Region Proposal Networks.'' arXiv, Jan. 06, 2016.
C. Song, Y. Huang, W. Ouyang, and L. Wang, ``Mask-Guided Contrastive Attention Model
for Person Re-identification,'' in 2018 IEEE/CVF Conference on Computer Vision and
Pattern Recognition, Salt Lake City, UT: IEEE, Jun. 2018, pp. 1179-1188.
K. He, G. Gkioxari, P. Dollár, and R. Girshick, ``Mask R-CNN.'' arXiv, Jan. 24, 2018.
Y. Liu, B. Schiele, and Q. Sun, ``Adaptive Aggregation Networks for Class-Incremental
Learning,'' in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR), Jun. 2021, pp. 2544-2553.
S. Jenni and P. Favaro, ``Deep Bilevel Learning,'' in Computer Vision - ECCV 2018,
V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss, Eds., in Lecture Notes in Computer
Science. Cham: Springer International Publishing, 2018, pp. 632-648.
Z. Zhu et al., ``Gait Recognition in the Wild: A Benchmark,'' in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021.
J. Zheng et al., ``Gait Recognition in the Wild with Dense 3D Representations and a Benchmark,'' in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
C. Ionescu et al., ``Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments,'' IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 7, pp. 1325-1339, 2013.
B. Kwolek et al., ``Calibrated and synchronized multi-view video and motion capture dataset for evaluation of gait recognition,'' Multimedia Tools and Applications, vol. 78, pp. 32437-32465, 2019.
R. J. Cotton et al., ``Spatiotemporal Characterization of Gait from Monocular Videos with Transformers,'' 2021.
A. L. McDonough, M. Batavia, F. C. Chen, S. Kwon, and J. Ziai, ``The validity and
reliability of the GAITRite system’s measurements: A preliminary evaluation,'' Arch
Phys Med Rehabil, vol. 82, no. 3, pp. 419-425, Mar. 2001.
A. J. Nelson et al., ``The validity of the GaitRite and the Functional Ambulation
Performance scoring system in the analysis of Parkinson gait,'' NeuroRehabilitation,
vol. 17, no. 3, pp. 255-262, 2002.
B. Bilney, M. Morris, and K. Webster, ``Concurrent related validity of the GAITRite
walkway system for quantification of the spatial and temporal parameters of gait,''
Gait Posture, vol. 17, no. 1, pp. 68-74, Feb. 2003.
D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri, ``A Closer Look at
Spatiotemporal Convolutions for Action Recognition,'' in 2018 IEEE/CVF Conference
on Computer Vision and Pattern Recognition, Salt Lake City, UT: IEEE, Jun. 2018, pp.
6450-6459.
C. Fan et al., ``OpenGait: Revisiting Gait Recognition Towards Better Practicality,'' in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
I. J. Goodfellow et al., ``Generative Adversarial Networks.'' arXiv, Jun. 10, 2014.
D. P. Kingma and J. Ba, ``Adam: A Method for Stochastic Optimization.'' arXiv, Jan.
29, 2017.
Hosang Yu is a deep learning research engineer at the Research Center for Artificial Intelligence in Medicine, Kyungpook National University Hospital, Daegu, South Korea. He received his B.S. and M.S. degrees in Electronics Engineering from Kyungpook National University, Daegu, South Korea, in 2017 and 2019, respectively. His research interests include deep learning-based computer vision algorithms for gait analysis, medical image classification and segmentation, etc.
Jaechan Park is a neurosurgeon who trained as a resident in Kyungpook National
University Hospital, Daegu, South Korea and as a clinical fellow at the Detroit Medical
Center (Wayne State University), Detroit, USA. He received a Ph.D. degree in Neurosurgery from Seoul National University, Seoul, South Korea. He has clinical interests in vascular
neurosurgery and minimally invasive neurosurgery. His current research interests include
medical artificial intelligence, optical coherence tomography for intraoperative cerebral
angiography, endovascular simulator for angiography and endovascular procedures, etc.
Kyunghun Kang received his B.S. and M.S. degrees from Kyungpook National University
School of Medicine, Daegu, Korea, in 2003 and 2006, respectively. He received his
Ph.D. degree in Biomedical Engineering from Hanyang University, Seoul, Korea, in
2020. In 2014, he joined the Department of Neurology, Kyungpook National University
School of Medicine, Daegu, Korea, where he is currently working as an Associate Professor.
His research interests are the areas of neuroimaging, gait analysis, and normal-pressure
hydrocephalus.
Sungmoon Jeong received a Ph.D. degree in electronics engineering and computer science from Kyungpook National University, South Korea, in 2013. From 2013 to 2018, he was an assistant professor in the School of Information Science at the Japan Advanced Institute of Science and Technology (JAIST), Japan. Since 2018, he has been an assistant professor in the Department of Medical Informatics and the Research Center for Artificial Intelligence in Medicine at Kyungpook National University and Hospital. His current research interests include multi-modal medical data analysis, SaMD, intelligent hospital information systems, and deep learning-based medical applications.