Hosang Yu 1,a
Jaechan Park 1,2,a
Kyunghun Kang 3,*
Sungmoon Jeong 1,4,*

1 Research Center for Artificial Intelligence in Medicine, Kyungpook National University Hospital, Daegu, Korea ({youhs4554, jeongsm00}@gmail.com)
2 Department of Neurosurgery, School of Medicine, Kyungpook National University, Daegu, Korea (jparkmd@hotmail.com)
3 Department of Neurology, School of Medicine, Kyungpook National University, Daegu, Korea (kangkh@knu.ac.kr)
4 Department of Medical Informatics, School of Medicine, Kyungpook National University, Daegu, Korea
a These authors contributed equally.
Copyright © The Institute of Electronics and Information Engineers (IEIE)
Keywords
Video recognition, Computer vision, Gait analysis, Mask guided attention, Bi-level optimization
1. Introduction
Clinical gait analysis is a comprehensive assessment of an individual’s walking pattern,
which plays a crucial role in the diagnosis and management of various neurodegenerative
diseases [1-4]. It involves the systematic evaluation of the complex interaction of muscles, joints,
and the nervous system, which contribute to human gait. The importance of clinical
gait analysis lies in its ability to identify and quantify gait abnormalities, facilitate
early intervention, and monitor the progression and treatment outcomes of neurodegenerative
conditions.
In recent years, computer vision-based gait analysis methods have gained traction
as promising non-invasive approaches to assess and quantify gait characteristics.
These methods overcome the limitations of marker-based motion capture techniques,
which can be time-consuming, expensive, and require specialized equipment, discouraging
regular gait status assessments. Recent studies demonstrate that physical spatio-temporal
gait variables, such as gait speed and step width, can be measured accurately using
deep learning-based pose estimation algorithms such as OpenPose [5,6] and AlphaPose [7]. However, because these pose-based approaches predict body-joint coordinates in the image plane, they require camera calibration to compute physical gait variables, which limits their general applicability despite their strong performance.
Other vision-based methods involve pre-processing video frames to create gait energy
images (GEI) [8] or silhouette images [9-11], which are subsequently used to train predictors. Recent works have employed thin
skeleton-like images to train a 3D CNN [12], which performs spatio-temporal evaluations on them. By using skeleton-like images, the 3D CNN can more precisely evaluate gait cycles based on features that emphasize the lower body parts. This approach achieved favorable results on the CASIA-B gait benchmark dataset [13], which evaluates gait recognition under varying view angles, clothing, and carrying conditions.
These results demonstrated the potential of 3D CNNs for gait analysis. Thanks to its end-to-end trainable nature, this method delivers good results without requiring camera calibration. However, end-to-end trainable methods recognize gait by training a predictor and work well only under specific conditions, such as limited backgrounds and single-person indoor setups.
Vision-based gait analysis methods assume single-person situations, but such scenarios
are unnatural in real-world medical settings. Instead, multi-person situations are
more common. For instance, as shown in Fig. 1, medical staff often walk alongside a patient during a gait test to assist an elderly
patient who has difficulty walking. In this scenario, it might be challenging to apply
the vision-based methods because only the patient’s gait patterns should be recognized
while avoiding unrelated subjects or objects that could hinder accurate analysis.
In this paper, we propose a new method to address these challenges in multi-person
environments where multiple individuals coexist. The main challenge is extracting
target patient gait characteristics only while reducing the effects of distractors
for precise gait analysis. In recent years, deep learning-based models have shown
strong performance in object detection and segmentation tasks. For better results,
however, fine-tuning is generally required by adapting models to a dataset. Given
the tremendous number of video frames, annotating bounding boxes or segmentation masks
for each frame is time-consuming and expensive. To reduce annotation costs, we automatically detect patients using YOLO v8 [14], a state-of-the-art object detector offering both fast inference and high accuracy, and use its predicted detection boxes as pseudo-mask labels that provide additional supervision to focus on the patient's gait. We employed the Deep SORT tracking algorithm [15] to select the target patient's track ID and converted the corresponding detection boxes into binary mask images to serve as our pseudo-mask labels.
The main contribution of this paper is the proposal of a novel model architecture
called a Scaled Mask Guided Attention Network (SMAGNet), which effectively addresses
the issues in multi-person environments where multiple individuals coexist. We use the pseudo-mask labels as auxiliary targets for attention learning so that the model attends more to the target patient's gait. Our model is a variant of mask-guided approaches [16-19], which utilize pseudo-labeled binary masks to suppress irrelevant features. This mask-guided approach enables the model to treat the masked region, where the target ROIs (e.g., patients) are located, as more important than the rest of the frame.
Fig. 1. Real-world gait analysis environment using the electronic walkway system GAITRite.
2. Related Works
2.1 Mask-guided Attention Method
The mask-guided attention method is a simple but effective approach to address interference
issues in multiple-person environments. It suppresses irrelevant features through
the straightforward multiplication of convolutional feature maps and binary masks
of a target object. The masks provide supervision or guidance for predicting pixel-wise
spatial attention maps that highlight regions that are likely to be associated with
the target object. For example, by using mask-guided attention methods, the performance
of pedestrian re-identification tasks was notably improved in [16,18]. Re-identification is challenging in that the model must distinguish each pedestrian's unique gait patterns regardless of clothing and background. Box-shaped coarse masks from the Faster R-CNN detector [17] were utilized to focus on each detected individual and effectively address the person-to-person occlusion problem [16]. Binary semantic segmentation masks from Mask R-CNN [19] were utilized to reduce background clutter and prioritize the pedestrian's body parts [18]. Our approach differs from these mask-guided methods [16,18], which rely heavily on the quality of noisy pseudo-mask labels. Instead, we optimize auxiliary scaling weights that refine the mask supervision through bi-level optimization.
2.2 Bi-level Optimization Problem (BOP)
The goal in the bi-level optimization problem (BOP) is to find the optimal hyperparameters
when the optimization task involves a hierarchical structure with two levels of optimization.
A generative adversarial network (GAN) [32] is the most famous and successful example of BOP; it alternately solves two optimization problems at different levels using discriminator and generator networks. In many studies, BOP has been used to optimize the hyperparameters of deep neural networks. For example, BOP was utilized for class-incremental learning (CIL), in which the number of classes increases phase by phase [20]. To address the stability-plasticity dilemma between learning old and new classes, aggregation weights (i.e., hyperparameters) that adaptively balance the stability and plasticity building blocks are learned alongside the model parameters through an alternating training strategy. This approach demonstrated improved results on varied CIL benchmarks.
BOP was utilized for the gradient regularization method, in which gradients are boosted
when the gradients of both the training and validation sets agree in direction and
are regulated otherwise [21]. The authors introduced scalar weights, which adaptively modulate the magnitude of
the gradients during weight updates. These scalar weights were also considered as
hyperparameters and were fine-tuned right after updating model parameters to minimize
validation errors through a bi-level optimization strategy. They minimized training
errors by updating model parameters with the scalar weights fixed and then switched
to minimize validation errors. This bi-level approach resulted in consistent improvement
in generalization across various image classification benchmarks.
Inspired by previous studies, we introduce scaling weights for re-weighting mispredicted
attention maps. In order to train the proposed SMAGNet, we employed a bi-level optimization
scheme to optimize scaling weights. The SMAGNet showed refined attention patterns,
even when noisy pseudo-mask labels were used for training attention, as shown in Fig. 4. The overall architecture of the proposed SMAGNet is shown in Fig. 2.
Fig. 2. The proposed SMAGNet architecture.
3. Proposed Method
Unlike other mask-guided approaches [16,17], we perform an additional optimization round called the attention mask refinement
process, which progressively corrects mispredicted attention maps. To this end, we implemented an auxiliary scaling layer that re-weights mispredicted attention maps. To learn optimal scaling weights, we introduce a hyperparameter optimization method based on BOP [20,21]. In summary, the proposed attention mask refinement process involves training attention
maps using pseudo-mask labels and then fine-tuning the scaling weights (hyperparameters)
to correct attention errors. These two steps are alternately performed. For instance,
when prediction performance on gait variables is degraded due to mask supervision
onto non-target objects, the scaling layer learns to down-weight the corresponding
regions of convolutional feature maps in the direction of recovering the performance.
It is important to validate generalization performance using public benchmark data. We therefore conducted an extensive search for datasets suitable for evaluating the proposed spatiotemporal gait parameter regression task, including well-known options such as GREW [22], Gait3D [23], Human3.6M [24], and GPJATK [25]. The GREW dataset provides re-identification annotations that identify who each pedestrian is, which differs from the gait variable regression task addressed in this study, so it could not be included.
The Human3.6M and GPJATK datasets offer valuable 3D coordinate information for each
joint acquired through precise 3D motion capture systems. Unfortunately, these datasets
primarily focus on single-person scenarios within controlled laboratory settings.
Recent large-scale public datasets like GREW and Gait3D are limited to localized pedestrians,
making it difficult to evaluate gait parameter regression tasks within multi-person
conditions. Additionally, the provided annotations are incompatible with the objectives
of our study. We have found a similar study [26], but unfortunately, it has not been published yet, and the dataset is not publicly
available.
To the best of our knowledge, a dataset offering spatiotemporal gait parameters measured by gold-standard systems like GAITRite has not been released, particularly for a clinical
population. We evaluated the proposed model on our own gait video dataset collected
from a pressure-sensing electronic walkway system, GAITRite, which is the gold standard
in clinical gait analysis and has been verified [27-29]. Detailed information about the GAITRite dataset is presented in Table 4.
3.1 Mask-guided Attention
Following previous mask-guided attention operations [16,18], we define generic mask-guided attention as the element-wise multiplication of every channel of the base feature map $f_{base}$ with the attention mask $f_{mask}$:

$$f_{out}^{i}=f_{base}^{i}\otimes f_{mask},\quad (1)$$

where $i$ is the channel index, and $\otimes $ is element-wise multiplication. $f_{mask}\in \left[0,1\right]$ is defined as $f_{mask}=\sigma \left(W_{mask}*f_{base}\right)$, where $*$ is a convolution operator, $\sigma \left(x\right)=1/\left(1+\exp \left(-x\right)\right)$ is the sigmoid function, and $W_{mask}$ denotes the convolution filters of the mask prediction layer. The mask-guided approach assumes that a predicted attention mask helps identify a target object and provides supervision for the base feature by optimizing the attention loss $L_{att}$, defined as the per-pixel binary cross-entropy (BCE) loss:

$$L_{att}=-\frac{1}{N}\sum _{j=1}^{N}\left[m_{gt}^{j}\log f_{mask}^{j}+\left(1-m_{gt}^{j}\right)\log \left(1-f_{mask}^{j}\right)\right],\quad (2)$$

where $j$ indexes pixels, $N$ is the number of pixels, and $m_{gt}$ is the coarse-level ground-truth mask label obtained using the YOLO v8 object detector [14].
If a pixel lies within the target object’s detection box, it is annotated as 1. Otherwise,
it is annotated as 0. When a pseudo-mask label $m_{gt}$ is used as a target label,
it can be quite noisy, leading to invalid attention. To address this issue, we introduce
a scaling layer with weights that are learned by backpropagation. This layer is parallel to the mask prediction layer, and its output is multiplied with the original mask predictions to refine them, as illustrated in Fig. 2.
3.2 Attention Module
In order to reduce attention errors caused by noisy pseudo-mask labels from the YOLO detector,
we designed a simple and effective attention module that adaptively scales the intensity
of the mispredicted attention mask in an explicit manner. As shown in Fig. 2, the attention module includes two parallel convolution layers: a mask layer and
a scaling layer. These layers predict the attention mask $f_{mask}$ and scaling weights
$\phi _{scale}$, respectively. The scaling layer was implemented with small stacks
of convolutions with an architecture that is shared with the mask layer. Two $3\times
3\times 3$ convolution layers with rectified linear unit (ReLU) activation are used
for feature extraction, and $1\times 1\times 1$ convolution and a sigmoid function
are used for a single-channel output in the range of [0,1]. Then, we obtain refined features $f_{out}^{s}$ by re-weighting the base feature map using the scaled attention mask $\sqrt{f_{mask}\otimes \phi _{scale}}$:

$$f_{out}^{s,i}=f_{base}^{i}\otimes \sqrt{f_{mask}\otimes \phi _{scale}}.\quad (3)$$
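A minimal sketch of the attention module follows, mirroring the two parallel branches described above; the hidden channel width and the numerical epsilon inside the square root are our own assumptions.

```python
import torch
import torch.nn as nn

def conv_branch(channels, hidden=64):
    # Two 3x3x3 convolutions with ReLU for feature extraction, then a 1x1x1
    # convolution with sigmoid for a single-channel output in [0, 1].
    return nn.Sequential(
        nn.Conv3d(channels, hidden, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.Conv3d(hidden, hidden, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.Conv3d(hidden, 1, kernel_size=1), nn.Sigmoid(),
    )

class ScaledMaskGuidedAttention(nn.Module):
    """Two parallel branches predict f_mask and phi_scale (Fig. 2)."""
    def __init__(self, channels):
        super().__init__()
        self.mask_layer = conv_branch(channels)   # supervised with pseudo-masks
        self.scale_layer = conv_branch(channels)  # scaling weights (hyperparameters)

    def forward(self, f_base, eps=1e-8):
        f_mask = self.mask_layer(f_base)
        phi_scale = self.scale_layer(f_base)
        # Refined features via the scaled attention mask (Eq. 3); eps keeps the
        # gradient of sqrt finite at zero.
        f_out = f_base * torch.sqrt(f_mask * phi_scale + eps)
        return f_out, f_mask
```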
To build SMAGNet, the ResNet-based 3D CNN R(2+1)D-18 [30] was employed as the base model. The proposed attention module was placed after each
residual block (or stage) of the base model. By forwarding the refined features $f_{out}^{s}$ through every residual stage, we predict gait variables with a fully connected (fc) layer as $y_{pred}=W_{fc}\,f_{out,last}^{s}$, where $W_{fc}$ denotes the fc layer weights, and $f_{out,last}^{s}$ is the refined feature from the last residual stage.
We employed the R(2+1)D-18 model in our study due to its inherent capability to effectively
capture spatiotemporal characteristics. By factorizing 3D kernels into 2D and 1D kernels,
the model is able to separately capture spatial and temporal information. This systematic
kernel factorization method allows the R(2+1)D-18 model to accurately measure gait
variables by comprehensively detecting both spatial and temporal changes. Also, since this architecture is easy to customize, we employed it as our baseline model.
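The sketch below shows one plausible way to assemble SMAGNet from the torchvision R(2+1)D-18 backbone, inserting the attention module sketched above after each residual stage; layer names follow torchvision's VideoResNet, while the assembly details are our assumptions.

```python
import torch.nn as nn
from torchvision.models.video import r2plus1d_18, R2Plus1D_18_Weights

class SMAGNet(nn.Module):
    """R(2+1)D-18 backbone with an attention module after each residual stage."""
    def __init__(self, num_vars=9):
        super().__init__()
        base = r2plus1d_18(weights=R2Plus1D_18_Weights.KINETICS400_V1)
        self.stem = base.stem
        self.stages = nn.ModuleList([base.layer1, base.layer2, base.layer3, base.layer4])
        # Channel widths of the four residual stages of R(2+1)D-18
        self.attn = nn.ModuleList(ScaledMaskGuidedAttention(c) for c in (64, 128, 256, 512))
        self.pool = nn.AdaptiveAvgPool3d(1)
        self.fc = nn.Linear(512, num_vars)  # y_pred = W_fc f_out_last

    def forward(self, x):
        x = self.stem(x)
        masks = []  # per-stage attention masks, supervised by L_att
        for stage, attn in zip(self.stages, self.attn):
            x, f_mask = attn(stage(x))
            masks.append(f_mask)
        return self.fc(self.pool(x).flatten(1)), masks
```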
Our objective was to learn the optimal scaling weights $\phi _{scale}$ by suppressing
misguided features that adversely affect the main task loss, resulting in refined
attention maps. We used the smooth L1 loss as the main task loss $L_{main}$ for its robustness against outlier samples:

$$L_{main}=\begin{cases}0.5\left(y_{pred}-y_{gt}\right)^{2}/\delta , & \text{if } \left| y_{pred}-y_{gt}\right| <\delta \\ \left| y_{pred}-y_{gt}\right| -0.5\delta , & \text{otherwise},\end{cases}\quad (4)$$

where $y_{gt}$ denotes the ground-truth gait variables, and $\delta $ is a hyperparameter that adaptively determines whether L1 or L2 loss is used, depending on the input. We followed the default PyTorch implementation and used 1.0 for $\delta$.
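In PyTorch, this corresponds to the built-in criterion, where `beta` plays the role of $\delta$:

```python
import torch.nn as nn

# Smooth L1 main-task loss of Eq. (4); beta corresponds to delta = 1.0
main_criterion = nn.SmoothL1Loss(beta=1.0)
```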
3.3 Attention Mask Refining Process
The proposed attention module has two key layers to optimize: 1) the mask prediction
layer supervised with pseudo-mask labels and 2) the scaling layer that corrects mask
prediction errors. In this work, we introduce scaling weights as new hyperparameters
that adaptively re-weight mispredicted attention maps. To this end, we employed BOP,
which alternately solves two different levels of problems, where one task (i.e., the
lower-level task) is subject to the other task (i.e., the upper-level task). In our
formulation, the lower-level task involves learning attention maps using pseudo-mask
labels, whereas the upper-level task involves fine-tuning the scaling weights (i.e.,
hyperparameters) to achieve optimal mask predictions for improved validation prediction
performance, given previously learned attention maps. These two steps are alternately
performed to balance the mask and scaling layers until convergence. This is accomplished through two-round refinement steps, in which attention masks are progressively refined by alternately switching between optimizing the two parameter groups $\theta _{1}$ and $\theta _{2}$ of SMAGNet, where each subscript denotes the parameter group index.
Table 1. Overall performance evaluation results.

Method | RMSE | MAPE (%)
R(2+1)D-18 [30] | 4.11 | 10.0
SMAGNet (ours) | 2.17 | 6.63

Table 2. Results of ablations of each method. Reduction ratio in RMSE ($\Delta$) is reported.

Method | RMSE | $\Delta$ (%)
Baseline | 6.16 | -
+ attention module | 3.39 | 44.9
+ BOP-based mask refinements | 2.62 | 22.4
Table 3. Performance comparison for 9 types of gait variables. MAPEs (%) are compared.

Variable Name | Side | Baseline | SMAGNet | Change
Velocity (cm/min) | - | 6.02 | 5.28 | -0.74
Cadence (steps/min) | - | 4.47 | 4.02 | -0.45
Cycle Time (sec) | Left | 4.05 | 4.14 | +0.09
Cycle Time (sec) | Right | 4.51 | 4.15 | -0.36
Stride Length (cm) | Left | 5.06 | 3.72 | -1.34
Stride Length (cm) | Right | 4.82 | 3.61 | -1.21
Support Base (cm) | Left | 6.67 | 5.71 | -0.96
Support Base (cm) | Right | 6.42 | 6.13 | -0.29
Swing Percent (%) | Left | 4.92 | 3.53 | -1.39
Swing Percent (%) | Right | 6.71 | 3.98 | -2.73
Stance Percent (%) | Left | 2.53 | 1.69 | -0.84
Stance Percent (%) | Right | 3.23 | 1.86 | -1.37
Double Support Percent (%) | Left | 8.48 | 4.91 | -3.57
Double Support Percent (%) | Right | 7.67 | 4.94 | -2.73
Toe In Out (degree) | Left | 43.97 | 27.48 | -16.49
Toe In Out (degree) | Right | 41.09 | 21.02 | -20.07
In round 1, we jointly optimize our main task loss $L_{main}$ and attention loss $L_{att}$
for the training dataset. We backpropagate the gradients with respect to both $L_{main}$
and $L_{att}$ through layers of SMAGNet while keeping the scaling layers fixed. Thus,
in round 1, we learn to approximate attention maps using pseudo-mask labels by updating
the model parameters $\theta _{1}=\left[W_{base},W_{mask},~ W_{fc}\right]$ with a
learning rate of $\gamma _{1}$. Each element of the list represents the parameters of the base model ($W_{base}$), the mask layer ($W_{mask}$), and the fully connected output layer ($W_{fc}$), respectively.
In round 2, we subsequently optimize the scaling weights $\phi _{scale}$ on top of the predicted attention masks $f_{mask}$, which are used for the refined feature maps
${f}_{out}^{s}$ defined in Eq. (3). The optimization involves learning the optimal scaling weights to refine misguided
attention caused by invalid supervision of pseudo-labels, which could lead to bad
main-task performance. The parameters $\theta _{2}=\left[W_{scale}\right]$ are optimized
with a learning rate of $\gamma _{2}$, where $W_{scale}$ represents the model parameter
of the scaling layer.
In contrast to round 1, we update only the scaling layers, keeping all other model weights of SMAGNet fixed, and minimize the main task loss on the validation set. This
enables the model to learn optimal scaling weights for enhanced generalization capability
in the main task, which is equivalent to hyperparameter optimization. Thus, the proposed
bi-level optimization serves as a unified method for the continuous optimization of
both the model parameters and hyperparameters for better generalization, which is
akin to the procedure in cross validation. The proposed two-round refinement process
is repeated alternately until the model converges (i.e., round-1 model parameters
$\theta _{1}=\left[W_{base},W_{mask},W_{fc}\right]$ are updated in the n-th epoch,
and the round-2 model parameters $\theta _{2}=\left[W_{scale}\right]$ are updated
in the (n+1)-th epoch).
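The alternating schedule can be sketched as below, assuming the model, data loaders, and loss helpers from the previous sketches; the epoch-parity switch and optimizer settings follow Section 4.2, while variable names are illustrative.

```python
import torch

# Parameter groups: theta_2 is the scaling layers, theta_1 is everything else.
theta_2 = [p for n, p in model.named_parameters() if "scale_layer" in n]
theta_1 = [p for n, p in model.named_parameters() if "scale_layer" not in n]
opt1 = torch.optim.Adam(theta_1, lr=1e-4, weight_decay=1e-3)
opt2 = torch.optim.Adam(theta_2, lr=1e-4, weight_decay=1e-3)

for epoch in range(150):
    if epoch % 2 == 0:  # round 1: fit attention and main task on training data
        for x, m_gt, y_gt in train_loader:
            y_pred, masks = model(x)
            loss = main_criterion(y_pred, y_gt) + sum(
                attention_loss(m, m_gt) for m in masks)
            opt1.zero_grad()
            loss.backward()
            opt1.step()   # scaling layers receive no update in this round
    else:               # round 2: tune scaling weights on validation data
        for x, _, y_gt in val_loader:
            y_pred, _ = model(x)
            loss = main_criterion(y_pred, y_gt)  # main-task loss only
            opt2.zero_grad()
            loss.backward()
            opt2.step()   # only the scaling layers are updated
```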
4. Experiments
4.1 Dataset
Our GAITRite dataset was collected from a pressure-sensing electronic walkway system (GAITRite) in a real-world clinical environment. Distributions of the gait variables and demographics of the subjects are summarized in Table 4.
4.2 Experimental Settings
R(2+1)D-18 CNN models [30] pre-trained on Kinetics-400 were used as our baseline models. The official implementations
from the torchvision library were utilized. We implemented the mask layer and scaling
layer in parallel, as shown in Fig. 2, and integrated them into each residual stage of the R(2+1)D-18 model. Both the mask
layer and scaling layer were constructed using two stacks of convolution layers with
the sigmoid function as the output activation to predict binary attention masks and
scaling weights, respectively.
For all models, Adam [33] was used as an optimizer with the momentum set to 0.9 and weight decay set to 0.001.
All models were trained for 150 epochs using a batch size of 32 and a learning rate
of 0.0001. To implement the BOP-based attention mask refinement algorithm, which requires
two optimizers, we used the same settings for both optimizers. We allocated 80% of
the dataset for training and 20% for testing. Since gait analysis can be performed
multiple times for each patient, we carefully split the training/testing data to ensure
that the same patients were not included in both splits.
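The patient-level split can be realized with scikit-learn's grouped splitter; `patient_ids` (one ID per video) is an assumed array from our records:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# 80/20 split that keeps all videos of a patient in the same subset
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
indices = np.arange(len(patient_ids))
train_idx, test_idx = next(splitter.split(indices, groups=patient_ids))
```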
4.3 Preprocessing
As a preprocessing step, we obtained detection boxes for all individuals using YOLO v8 [14] and tracked each detection box's trajectory using the Deep SORT algorithm [15]. We utilized Deep SORT to minimize track ID switching caused by occlusion or varying viewpoints. Considering that our videos were recorded from a frontal viewpoint, we chose the trajectory that varied the most in the y-direction (vertical) relative to the x-direction (horizontal). The patient detection process is shown in the upper part of Fig. 2.
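A sketch of this selection and rasterization step is given below; it assumes the tracker output has already been collected into per-track arrays of (x_center, y_center, width, height) boxes, and the variance-ratio criterion is one plausible reading of our selection rule.

```python
import numpy as np

def select_patient_track(tracks):
    """Pick the track whose center varies most vertically relative to horizontally.

    tracks: dict mapping a Deep SORT track ID to an (N, 4) array of per-frame
    boxes (x_center, y_center, width, height) -- an assumed intermediate format.
    """
    def vertical_ratio(boxes):
        return np.var(boxes[:, 1]) / (np.var(boxes[:, 0]) + 1e-8)
    return max(tracks, key=lambda tid: vertical_ratio(tracks[tid]))

def boxes_to_pseudo_masks(boxes, height, width):
    """Rasterize the selected track's boxes into binary pseudo-mask labels."""
    masks = np.zeros((len(boxes), height, width), dtype=np.float32)
    for t, (xc, yc, w, h) in enumerate(boxes):
        x1, y1 = int(xc - w / 2), int(yc - h / 2)
        x2, y2 = int(xc + w / 2), int(yc + h / 2)
        masks[t, max(y1, 0):y2, max(x1, 0):x2] = 1.0  # pixels inside the box -> 1
    return masks
```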
From each video, we uniformly sampled 64 frames. We applied standard data augmentation
methods, such as random cropping and horizontal flipping. Moreover, we stacked 64
images with a resolution of 112$\times$112 pixels to use them as input for our model. We used 64 images because they empirically fit our GPU memory capacity (NVIDIA TITAN V; 12 GB). We also employed a frame resolution of 112$\times$112 because the R(2+1)D-18 model was pretrained on video frames with this resolution for the Kinetics-400 dataset [30]. As shown in Table 4, we used nine types of gait variables with different units and scales as our targets.
Therefore, we applied standard scaling for data normalization.
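The sampling and target normalization can be sketched as follows; `y_train` is an assumed (N, 9) array of training targets, and the scaler is fit on training data only:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

def sample_frames(video, num_frames=64):
    """Uniformly sample num_frames frames from a (T, H, W, C) video array."""
    idx = np.linspace(0, len(video) - 1, num_frames).astype(int)
    return video[idx]

# Standard scaling of the nine gait variables with different units and scales
scaler = StandardScaler().fit(y_train)
y_train_scaled = scaler.transform(y_train)
```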
Table 4. Dataset information. Distributions of the gait variables (mean ± SD) and demographics of subjects are listed. Gait variables with left/right pairs are averaged for simplicity.

Gait Variables:
Variable | Mean ± SD | Range | Definition
Velocity (cm/min) | 73.3 ± 28.1 | [27.0, 119] | walking speed
Cadence (steps/min) | 105 ± 16.6 | [76.6, 129] | number of steps per minute
Cycle Time (sec) | 1.16 ± 0.26 | [0.92, 1.56] | duration of a single gait cycle
Stride Length (cm) | 82.8 ± 27.3 | [36.0, 124] | displacement between successive strides of the same foot
Support Base (cm) | 11.4 ± 4.00 | [5.86, 18.7] | horizontal distance between the outer edges of both feet
Swing Percent (%) | 34.2 ± 4.54 | [25.7, 39.6] | percent of a gait cycle between toe-off and the next heel strike
Stance Percent (%) | 65.7 ± 4.63 | [60.3, 74.3] | percent of a gait cycle between heel strike and toe-off, when the foot is in contact with the ground
Double Support Percent (%) | 31.8 ± 9.36 | [21.0, 48.8] | percent of a gait cycle when both feet are in contact with the ground simultaneously
Toe In Out (degree) | 11.9 ± 7.27 | [-31.5, 41.0] | angle between the foot's long axis (from heel to toes) and the gait direction

Demographics:
Variable | Mean ± SD | Range
Age (years) | 68.1 ± 13.2 | [22, 93]
Height (cm) | 155 ± 43.6 | [132, 187]
Weight (kg) | 59.5 ± 20.3 | [34, 105]
Gender | 1,069 males; 1,215 females | N/A
4.4 Results
Table 1 presents the overall evaluation results on our GAITRite dataset. We compared the
root mean squared error (RMSE) and mean absolute percentage error (MAPE) using the
same base model of R(2+1)D-18. To evaluate the overall performance, we computed the
average RMSE and MAPE of all gait variables. In the results, our proposed SMAGNet
demonstrated significant improvements in both RMSE and MAPE, with values dropping
from 4.11 to 2.17 and from 10.0% to 6.63%, respectively. This result is notable since
we only utilized pseudo-mask labels obtained from the YOLO v8 object detector to train
SMAGNet’s attention capability without any labeling costs. Furthermore, our method requires only light modifications to the modeling pipeline: the insertion of attention modules and the implementation of our proposed bi-level optimization strategy for attention mask refinement.
Next, we conducted an ablation study to demonstrate the effectiveness of each proposed
method: the insertion of the attention modules and the BOP-based mask refinement process.
The resulting RMSE reduction ratio (${\Delta}$) is the relative change in RMSE before and after applying each method and is reported in Table 2. In the results, simply inserting the attention module had the most significant impact
on improving performance, showing a reduction ratio of 44.9% compared to the baseline
method. The BOP-based mask refinements enhanced the results, showing a reduction ratio
of 22.4% compared to the case of optimizing only the attention module without any
refinement. These results suggest that learning attention masks for the target patients,
even when using pseudo-mask labels, is critical for accurate gait analysis in multi-person
scenarios. In addition, they also indicate that by utilizing our proposed bi-level
strategy to refine potentially mispredicted mask predictions that are trained with
noisy pseudo-mask labels, we can achieve better results.
As shown in Table 3, we inspected the MAPE values for each gait variable for error analysis. The gait variable named ``Toe In-Out (degree)'' is defined as the angle between the foot and the gait direction. The MAPE was reduced most drastically for this variable, by roughly 20 percentage points.
Analyzing this variable is challenging as it requires accurately capturing the shape
of the foot, which occupies only a small portion of the video frame. Furthermore,
for gait variables that require precise detection of the gait cycle, such as ``Swing
Percent (%)'', ``Stance Percent (%)'', and ``Double Support Percent (%)'', SMAGNet
also demonstrated a notable reduction in MAPE. These results suggest that SMAGNet
has the ability to capture gait cycles by paying more attention to the target patient
in our multi-person scenarios effectively while preventing interference from non-target
individuals walking alongside the target patient.
In order to investigate the linear correlation between predictions and ground truth
from the GAITRite, we calculated Pearson correlation coefficients using SciPy’s stats
module and compared the proposed methods with the baseline. As shown in Fig. 3, our full method (denoted as ``w/ BOP'') showed the best correlations with all ground truth gait variables. However, when incorporating attention modules without BOP-based mask refinements (denoted as ``w/o BOP''), the correlation coefficients were poor for the angle-related gait variable ``Toe In-Out (degree)''. We suspect that the proposed
method of learning attention maps supervised with box-shaped masks might not be sufficient
to provide details of foot shapes. After applying BOP-based mask refinements (see ``w/ BOP''), we observed improved results for all gait variables, including the ``Toe In-Out
(degree)''. The proposed method can overcome the limitations of using box-shaped attention
masks by producing more complex and detailed attention masks through refinements,
leading to better results.
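For reference, the correlation analysis reduces to a single SciPy call per gait variable; `preds` and `gts` are assumed 1-D arrays of predictions and GAITRite measurements:

```python
from scipy import stats

# Pearson correlation and its significance for one gait variable
r, p_value = stats.pearsonr(preds, gts)
```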
Lastly, as shown in Fig. 4, we conducted a qualitative investigation of the attention capability of our proposed
method by visualizing intermediate convolutional features for the baseline method,
our approach before BOP-based refinements, and our approach after the refinements.
To examine the attention capability on challenging samples, we analyzed samples that provided inaccurate mask labels to the model. To find these samples, we applied a simple K-nearest neighbor outlier detection algorithm to the detection box coordinate vectors (i.e., width, height, x_center, y_center), as sketched below.
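A minimal version of this outlier-scoring step follows; the neighbor count k and the 95th-percentile threshold are illustrative assumptions:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_outlier_scores(box_vectors, k=5):
    """Mean distance to the k nearest neighbors as an outlier score.

    box_vectors: (N, 4) array of (width, height, x_center, y_center) per sample.
    """
    nn_model = NearestNeighbors(n_neighbors=k + 1).fit(box_vectors)
    dists, _ = nn_model.kneighbors(box_vectors)  # column 0 is the self-distance
    return dists[:, 1:].mean(axis=1)

scores = knn_outlier_scores(box_vectors)
outliers = np.where(scores > np.percentile(scores, 95))[0]  # flag the top 5%
```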
In our baseline model, R(2+1)D-18 [30], we observed less meaningful results, with high attention values on all moving individuals, regardless of whether they were patients. In contrast, after simply incorporating the attention module (denoted as ``before BOP'' in the figure), the attention shifted toward the patients' lower body parts in most cases. Intriguingly, as illustrated in the rightmost column, applying BOP-based attention mask refinements produced even more reasonable attention maps that focused particularly on the actual foot strides of the target patient, which is crucial for precise gait-cycle and foot-shape analysis. These results support the enhanced performance of our proposed SMAGNet compared to before applying BOP-based refinements, particularly for foot-angle-related and gait-cycle-related variables.
Fig. 3. Correlation coefficient comparison for all gait variables. All correlation coefficients were statistically significant (p < 0.05).
Fig. 4. Impact of the proposed method on attention capability in multi-person environments: convolutional features are visualized where the detection of the target patient is not accurate. From left to right are the inaccurate mask labels using detection boxes from YOLO v8, convolutional features by R(2+1)D-18 [30], our method without BOP refinement, and our full method (red color: higher values; blue color: lower values).
4.5 Comparison with State of the Art
Table 5 shows comparison results on our GAITRite dataset for a recent state-of-the-art gait recognition model named GaitBase [31]. GaitBase has achieved strong performance on many gait-based person re-identification benchmarks under in-the-wild conditions, which contain diverse noisy visual factors such as clothing and carrying conditions. Accordingly, it uses silhouette images, which contain fewer visual details, as inputs. We followed the official OpenGait codebase (\url{https://github.com/ShiqiYu/OpenGait}) to implement the preprocessing, including person tracking and silhouette extraction for each person. After tracking, we selected the patient silhouettes using the same selection criterion as our method.
In this experiment, we used the GaitBase model, which was pretrained on the largest
public gait dataset named GREW [22]. Robust silhouette embeddings were extracted from the pretrained GaitBase and fed
to the fc layer. The same regression loss $L_{main}$ defined in Eq. (4) was adopted to finetune the entire model weights. We employed the same training strategy
as our method. RMSE and MAPE(%) are compared in Table 5.
Table 5 shows that the proposed SMAGNet significantly outperforms GaitBase across all evaluation
metrics. The lower performance of GaitBase could arise from its heavy reliance on
the quality of silhouette inputs. As shown in Fig. 5, preprocessed silhouette images are too simple or noisy to provide enough information
for precise gait analysis. As a result, the GaitBase model, which uses only silhouette images as inputs, cannot cope with such irreversibly degraded inputs and performs worse than the proposed model, even though it was pretrained on the large-scale GREW dataset.
Table 5. Comparison of GaitBase [31] and the proposed SMAGNet on our dataset.

Method | RMSE | MAPE (%)
GaitBase [31] | 8.81 | 31.8
SMAGNet | 2.17 | 6.63
5. Conclusion
We have presented a new network architecture called SMAGNet designed for gait analysis
to address the challenges of multi-person environments in complex real-world medical
settings. Experimental results on the GAITRite dataset collected in real-world clinical
environments demonstrated significantly improved performance compared to the baseline
model, showcasing SMAGNet’s effectiveness in multi-person gait analysis and its potential
for real-world clinical use. Additionally, SMAGNet exhibited a strong correlation
with the GAITRite gold-standard gait analysis system. Validating the proposed method on public benchmark datasets remains future work; however, there is currently no publicly available dataset that provides quantitative gait variables measured by gold-standard gait analysis systems like GAITRite. Therefore, we will expand our method to a multi-center study to verify its generality in the future.
Our ultimate goal is to introduce this system in hospitals, enabling medical professionals
to seamlessly monitor the progression and treatment outcomes of neurodegenerative
conditions and make well-informed decisions regarding customized treatment plans for
each patient in a timely manner. Also, this system will eventually be integrated into
mobile phone applications or public kiosk platforms to promote regular gait status
assessments after an intensive validation process. To achieve this goal, we plan to
make the heavy 3D CNN architecture of SMAGNet more lightweight by considering efficiency
in the number of parameters, floating-point operations, and inference time.
ACKNOWLEDGMENTS
This research was supported by Kyungpook National University Research Fund, 2020.
REFERENCES
Y.-H. Lim et al., ``Quantitative Gait Analysis and Cerebrospinal Fluid Tap Test for
Idiopathic Normal-pressure Hydrocephalus,'' Sci Rep, vol. 9, no. 1, Art. no. 1, Nov.
2019.
C. Selge et al., ``Gait analysis in PSP and NPH: Dual-task conditions make the difference,''
Neurology, vol. 90, no. 12, pp. e1021-e1028, Mar. 2018.
D. Cabral et al., ``Frequency of Alzheimer’s Disease Pathology at Autopsy in Patients
with Clinical Normal Pressure Hydrocephalus,'' Alzheimers Dement, vol. 7, no. 5, pp.
509-513, Sep. 2011.
W. Pirker and R. Katzenschlager, ``Gait disorders in adults and the elderly: A clinical guide,'' Wien Klin Wochenschr, vol. 129, no. 3-4, pp. 81-95, Feb. 2017.
J. Kwon, Y. Lee, and J. Lee, ``Comparative Study of Markerless Vision-Based Gait Analyses
for Person Re-Identification,'' Sensors (Basel), vol. 21, no. 24, p. 8208, Dec. 2021.
D. Xue et al., ``Vision-Based Gait Analysis for Senior Care.'' arXiv, Dec. 01, 2018.
Y.-M. Tang et al., ``Diagnostic value of a vision-based intelligent gait analyzer
in screening for gait abnormalities,'' Gait Posture, vol. 91, pp. 205-211, Jan. 2022.
C. Wang, J. Zhang, J. Pu, X. Yuan, and L. Wang, ``Chrono-Gait Image: A Novel Temporal
Template for Gait Recognition,'' in Computer Vision - ECCV 2010, K. Daniilidis, P.
Maragos, and N. Paragios, Eds., in Lecture Notes in Computer Science. Berlin, Heidelberg:
Springer, 2010, pp. 257-270.
L. Wang, T. Tan, H. Ning, and W. Hu, ``Silhouette analysis-based gait recognition
for human identification,'' IEEE Transactions on Pattern Analysis and Machine Intelligence,
vol. 25, no. 12, pp. 1505-1518, Feb. 2003.
C. Prakash, A. Mittal, R. Kumar, and N. Mittal, ``Identification of gait parameters
from silhouette images,'' in 2015 Eighth International Conference on Contemporary
Computing (IC3), Aug. 2015, pp. 190-195.
N. Karimi Hosseini and M. J. Nordin, ``Human Gait Recognition: A Silhouette Based
Approach,'' Journal of Automation and Control Engineering, vol. 1, pp. 40-42, Mar.
2013.
P. Supraja, R. J. Tom, R. S. Tiwari, V. Vijayakumar, and Y. Liu, ``3D convolution
neural network-based person identification using gait cycles,'' Evolving Systems,
vol. 12, no. 4, pp. 1045-1056, Dec. 2021.
S. Yu, D. Tan, and T. Tan, ``A Framework for Evaluating the Effect of View Angle,
Clothing and Carrying Condition on Gait Recognition,'' in 18th International Conference
on Pattern Recognition (ICPR’06), Aug. 2006, pp. 441-444.
D. Reis, J. Kupec, J. Hong, and A. Daoudi, ``Real-Time Flying Object Detection with
YOLOv8.'' arXiv, May 17, 2023.
N. Wojke, A. Bewley, and D. Paulus, ``Simple Online and Realtime Tracking with a Deep
Association Metric.'' arXiv, Mar. 21, 2017.
Y. Pang, J. Xie, M. H. Khan, R. M. Anwer, F. S. Khan, and L. Shao, ``Mask-Guided Attention
Network for Occluded Pedestrian Detection,'' in 2019 IEEE/CVF International Conference
on Computer Vision (ICCV), Seoul, Korea (South): IEEE, Oct. 2019, pp. 4966-4974.
S. Ren, K. He, R. Girshick, and J. Sun, ``Faster R-CNN: Towards Real-Time Object Detection
with Region Proposal Networks.'' arXiv, Jan. 06, 2016.
C. Song, Y. Huang, W. Ouyang, and L. Wang, ``Mask-Guided Contrastive Attention Model
for Person Re-identification,'' in 2018 IEEE/CVF Conference on Computer Vision and
Pattern Recognition, Salt Lake City, UT: IEEE, Jun. 2018, pp. 1179-1188.
K. He, G. Gkioxari, P. Dollár, and R. Girshick, ``Mask R-CNN.'' arXiv, Jan. 24, 2018.
Y. Liu, B. Schiele, and Q. Sun, ``Adaptive Aggregation Networks for Class-Incremental
Learning,'' in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR), Jun. 2021, pp. 2544-2553.
S. Jenni and P. Favaro, ``Deep Bilevel Learning,'' in Computer Vision - ECCV 2018,
V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss, Eds., in Lecture Notes in Computer
Science. Cham: Springer International Publishing, 2018, pp. 632-648.
Z. Zhu et al., ``Gait Recognition in the Wild: A Benchmark,'' in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021.
J. Zheng et al., ``Gait Recognition in the Wild with Dense 3D Representations and a Benchmark,'' in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
C. Ionescu et al., ``Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments,'' IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 7, pp. 1325-1339, 2013.
B. Kwolek et al., ``Calibrated and synchronized multi-view video and motion capture dataset for evaluation of gait recognition,'' Multimedia Tools and Applications, vol. 78, pp. 32437-32465, 2019.
R. J. Cotton et al., ``Spatiotemporal Characterization of Gait from Monocular Videos with Transformers,'' 2021.
A. L. McDonough, M. Batavia, F. C. Chen, S. Kwon, and J. Ziai, ``The validity and
reliability of the GAITRite system’s measurements: A preliminary evaluation,'' Arch
Phys Med Rehabil, vol. 82, no. 3, pp. 419-425, Mar. 2001.
A. J. Nelson et al., ``The validity of the GaitRite and the Functional Ambulation
Performance scoring system in the analysis of Parkinson gait,'' NeuroRehabilitation,
vol. 17, no. 3, pp. 255-262, 2002.
B. Bilney, M. Morris, and K. Webster, ``Concurrent related validity of the GAITRite
walkway system for quantification of the spatial and temporal parameters of gait,''
Gait Posture, vol. 17, no. 1, pp. 68-74, Feb. 2003.
D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri, ``A Closer Look at
Spatiotemporal Convolutions for Action Recognition,'' in 2018 IEEE/CVF Conference
on Computer Vision and Pattern Recognition, Salt Lake City, UT: IEEE, Jun. 2018, pp.
6450-6459.
C. Fan et al., ``OpenGait: Revisiting Gait Recognition Towards Better Practicality,'' in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
I. J. Goodfellow et al., ``Generative Adversarial Networks.'' arXiv, Jun. 10, 2014.
D. P. Kingma and J. Ba, ``Adam: A Method for Stochastic Optimization.'' arXiv, Jan.
29, 2017.
Hosang Yu is a deep learning research engineer at the Research Center for Artificial Intelligence in Medicine, Kyungpook National University Hospital, Daegu, South Korea. He received his B.S. and M.S. degrees in Electronics Engineering from Kyungpook National University, Daegu, South Korea, in 2017 and 2019, respectively. His research interests include deep learning-based computer vision algorithms for gait analysis, medical image classification and segmentation, etc.
Jaechan Park is a neurosurgeon who trained as a resident in Kyungpook National
University Hospital, Daegu, South Korea and as a clinical fellow at the Detroit Medical
Center (Wayne State University), Detroit, USA. He received a Ph.D. degree in Neurosurgery from Seoul National University, Seoul, South Korea. He has clinical interests in vascular
neurosurgery and minimally invasive neurosurgery. His current research interests include
medical artificial intelligence, optical coherence tomography for intraoperative cerebral
angiography, endovascular simulator for angiography and endovascular procedures, etc.
Kyunghun Kang received his B.S. and M.S. degrees from Kyungpook National University
School of Medicine, Daegu, Korea, in 2003 and 2006, respectively. He received his
Ph.D. degree in Biomedical Engineering from Hanyang University, Seoul, Korea, in
2020. In 2014, he joined the Department of Neurology, Kyungpook National University
School of Medicine, Daegu, Korea, where he is currently working as an Associate Professor.
His research interests are the areas of neuroimaging, gait analysis, and normal-pressure
hydrocephalus.
Sungmoon Jeong received a Ph.D. degree in electronics engineering and computer science from Kyungpook National University, South Korea, in 2013. From 2013 to 2018, he was an assistant professor in the School of Information Science at the Japan Advanced Institute of Science and Technology (JAIST), Japan. Since 2018, he has been an assistant professor in the Department of Medical Informatics and the Research Center for Artificial Intelligence in Medicine at Kyungpook National University and Hospital. His current research interests include multi-modal medical data analysis, SaMD, intelligent hospital information systems, and deep learning-based medical applications.