
  1. (Department of Electronical Engineering, Sogang University / Seoul, Korea {anstmdgns97, beoungwoo, qazw5741} )

Object detection, Deep learning, Computer vision, Fall down, Real-time detector, Dataset

1. Introduction

Falls are currently one of the leading causes of death, and according to the Centers for Disease Control and Prevention (CDC), more than 800,000 people are hospitalized each year with fall injuries [1]. Falls can be caused by abnormal health conditions, violence, or accidents. Therefore, it is important that these falls are detected as quickly as possible and that appropriate actions are taken. The consequences may be irreversible if the response is late.

Recently, many studies have been conducted to detect falls in various fields. Park et al. [2] introduced a method that uses Mask R-CNN [3] to detect a fallen person in CCTV footage. Chen et al. [4] presented a method for detecting falls by extracting optical features using a CCD camera and a thermal imaging camera in a dark outdoor environment.

If a person falls on a deserted road or in the dark, such as at night or dawn, immediate help is often unavailable, and prompt action may not be possible. In these situations, CCTV offers an economical and effective way to detect a fallen person. For example, Xu et al. [5] detected falls using tracking methods with prior information such as motion information obtained from object detection in a CCTV environment. Salimi et al. [6] focused on a method of detecting falls through 2D-level human pose estimation by sorting out key points such as a person’s head or joints in a CCTV environment. These studies improved fallen person detection in CCTV environments, but they depend on prior information and require substantial hardware resources to keep that information in memory. Because hardware resources in CCTV systems are limited, these requirements make real-time operation difficult.

Another interest is creating a specialized dataset for fall detection. An et al. [7] proposed the specialized VFP290K benchmark dataset for fall detection by appending a new fallen person class to their dataset. This approach was capable of real-time operation. However, as shown in Fig. 1, this method caused a problem of falsely detecting dark objects such as dark shoes and black scooters as fallen people.

Moreover, in video-level operation, object detection can be interrupted in the middle of consecutive frames due to occlusion events. Also, situations that do not have high risk, such as stumbling and slipping, can be wrongly detected as a falling event. This results in both precision and recall degradation for the human class in the object detector, and unnecessary human resources may be wasted due to false alarms when the system is applied to an actual CCTV environment.

We propose a dataset that addresses these limitations by integrating the fallen person class and non-fallen class of the VFP290K dataset [7] into a single person class while excluding the data in night conditions. We included additional classes of common objects from the AI-Hub dataset to avoid falsely detecting dark objects as a fallen person. In addition, we introduce a bounding box ratio algorithm, an efficient yet robust method for detecting a fall event. Its computational cost is negligible compared with that of object detection, so it has virtually no effect on speed.

Furthermore, we exploit the observation that a fallen person moves very little. Based on this, we propose a novel algorithm specialized in tracking the person ID of a fallen person: the bounding box overlap algorithm. This algorithm performs robustly in video-level operation by adopting a time-merge method, which aggregates frames over time at the video level to solve the problem of object detection being intermittent. Combining these proposed methods improved the fall detection performance.

Fig. 1. Comparison of VFP290K [7] and Our Method.

2. Related Works

2.1 You Only Look Once (YOLO)

Object detection is used to find the location of given classes in an image. Vision-based object detection is a huge research topic in the field of computer vision and is used in a wide range of applications such as recognizing faces or obstacles. YOLO methods combine vision-based object detection with deep learning and act as one-stage detectors that carry out classification and localization simultaneously. Since YOLOv1 [8] was first introduced in 2015, YOLO methods have been continuously updated. They have a lightweight structure and high detection accuracy, so they have been recognized as state-of-the-art methods in the field of object detection.

YOLOv4 [9], Scaled-YOLOv4 [10], and YOLOR [11] were introduced in 2020 and 2021. The latest model, YOLOv7 [12], was introduced in 2022 and outperformed well-known object detectors such as R-CNN [13], YOLOv5, YOLOX [14], PPYOLO [15], and DETR [16]. It also reduced the number of parameters and the computational cost by optimizing the model structure and the training process. YOLOv4 [9] increased detection accuracy but failed to reduce the inference cost, while YOLOv7 [12] achieved both by presenting a planned re-parameterized method that trains multiple convolutional layers in parallel and combines them into one convolutional layer in the inference process.

YOLOv7 [12] has a model structure like that of YOLOv5, consisting of an input, backbone, and head. The backbone extracts features from the input image, and the head outputs predictions in a bounding box format. The image is pre-processed at the input stage, passes through the backbone, and is converted into a feature map. Finally, the prediction result is produced by Rep convolution and Imp convolution. We conducted experiments with VFP290K [7] as a baseline and compared YOLOv5, the detector used with VFP290K [7], against YOLOv7 [12], the state-of-the-art real-time object detector.

2.2 Fall Detection Specialized Datasets

Various studies have been conducted to create a dataset specialized for fall detection. Charfi et al. [17] proposed the Le2i dataset, which captured falls in four types of indoor environments (i.e., home, cafe, office, and classroom) with a single Kinect camera. Auvinet et al. [18] presented the MultiCam dataset, which captured falls in a living room environment with 8 general cameras. Mastorakis et al. [19] introduced a dataset containing 48 types of falls photographed in an indoor environment at a height of 2 m. Zhang et al. [20] proposed a dataset with occlusion cases. These approaches concentrated on robust fall detection in natural situations by considering various factors such as camera type, filming height, backgrounds, environments, and occlusion.

An et al. [7] proposed the VFP290K dataset, which is composed of 294,713 frames in diverse circumstances such as light conditions (i.e., day and night), background conditions (i.e., street, park, and building), camera heights (i.e., high and low), and occlusion cases. The dataset’s validity and usefulness were verified through a performance evaluation experiment using well-known object detectors like R-CNN [13] and YOLOv5.

Despite these attempts, the fall detection performance is not robust enough for real-world applications across wild environments with diverse domains. Training with two classes (i.e., fallen and non-fallen) also generated false-positive cases, where dark objects were predicted as a fallen person, which degraded precision and the fall detection performance of the system.

3. The Proposed Method

Fig. 2 shows an overview of our method. The object detection part receives input images obtained from the video stream. Person detection then takes place with the bounding box as an output, followed by fall detection via the bounding box ratio. The bounding box overlap algorithm calculates the IoU (i.e., Intersection over Union) with the bounding box of the immediately preceding frame.

If the IoU exceeds 0.9, that is, the bounding boxes overlap by more than 90%, the system regards the detection as the same fallen person. After that, the information of the corresponding bounding box is stored in the person ID memory. The duplicate person IDs are removed in the time-merge part. Finally, the fall-down memory that stores information about the falls is exported as output. In addition, we mixed the VFP290K dataset [7] and the AI-Hub dataset [21] provided by the National Information Society Agency (NIA) to make fall detection robust in natural conditions and to enable training on falls in diverse viewpoints and environments.

Fig. 2. Overall Process of the Proposed Method.

3.1 Fallen Person Detection

3.1.1 Bounding Box Ratio based Fall Detection

Fig. 3 shows the details of the bounding box ratio method. The basic premise of this method is that a standing person seen in an image would be vertically long, while a fallen person would be horizontally long. The ratio of the bounding boxes $\text{ratio}_{\mathrm{wh}}$ can be expressed as:

$ \text{ratio}_{\mathrm{wh}}=\frac{W}{H} $

where W represents the width of the bounding box, and H represents the height of the bounding box. We used the $\text{ratio}_{\mathrm{wh}}$ of each person predicted through object detection to determine a fall. We empirically obtained a threshold value of 1.2 by analyzing the $\text{ratio}_{\mathrm{wh}}$ of the ground truth of fallen persons in our dataset. This simple rule-based method determines a fall if $\text{ratio}_{\mathrm{wh}}$ is greater than 1.2 and a non-fall otherwise. As shown in Fig. 3, it is intuitive, fast, and powerful. Using this method, training and fall detection can be done with only a single class, and multiple falls can be detected robustly in a CCTV environment.
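The rule above can be illustrated with a minimal Python sketch. The `Box` type, its corner-coordinate layout, and the example coordinates are assumptions made only for illustration; the threshold of 1.2 is the value reported in the text.

```python
from typing import NamedTuple

RATIO_THRESHOLD = 1.2  # empirical threshold reported in the text


class Box(NamedTuple):
    """Axis-aligned bounding box in pixels (hypothetical layout: corners)."""
    x1: float
    y1: float
    x2: float
    y2: float


def ratio_wh(box: Box) -> float:
    """Return W / H, the width-to-height ratio of the bounding box."""
    width = box.x2 - box.x1
    height = box.y2 - box.y1
    if height <= 0:
        raise ValueError("bounding box height must be positive")
    return width / height


def is_fallen(box: Box, threshold: float = RATIO_THRESHOLD) -> bool:
    """Rule-based decision: a fall if W / H is strictly above the threshold."""
    return ratio_wh(box) > threshold


# Hypothetical detections: a standing person and a person lying down.
standing = Box(100, 50, 160, 230)  # 60 wide, 180 tall
lying = Box(80, 300, 260, 380)     # 180 wide, 80 tall

print(is_fallen(standing), is_fallen(lying))
```

Because the check is a single division and comparison per detected box, it adds almost nothing to the cost of the detector itself.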

Fig. 3. Bounding Box Ratio Algorithm.

3.1.2 Bounding Box Overlap for Video-level Fall Tracking

Fall events have an apparent restriction on the object’s movement compared to other events. Two assumptions can be made from this restriction. First, a person who falls will hardly move from one location, which means the bounding box will not have notable change in its size and location. Second, since the person has fallen, the width of the bounding box will be greater than the height, and $\text{ratio}_{\mathrm{wh}}$ will be greater than 1.2.

Using these assumptions, we propose the bounding box overlap algorithm, a tracking method specialized for fall detection. The flow of the algorithm is shown in Fig. 2. If a fallen person is predicted, a person ID is assigned in the person ID memory. The frame number and location information of the bounding box are saved in the memory. Then, the bounding box of this frame is matched against the bounding box of the previous frame stored in the person ID memory. If the IoU between these boxes exceeds 0.9, the person ID is regarded as the same, and the box is stored under the corresponding person ID.
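The matching step can be sketched as follows. This is a simplified illustration under stated assumptions, not the exact implementation: the memory layout (a dictionary from person ID to a list of frame/box pairs), the function names, and the example boxes are hypothetical, while the IoU threshold of 0.9 comes from the text.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed layout
IOU_THRESHOLD = 0.9                      # threshold reported in the text


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def assign_person_id(
    memory: Dict[int, List[Tuple[int, Box]]], frame_no: int, box: Box
) -> int:
    """Store (frame_no, box) under a matching person ID, or create a new ID."""
    for person_id, history in memory.items():
        _, last_box = history[-1]
        if iou(last_box, box) > IOU_THRESHOLD:
            history.append((frame_no, box))
            return person_id
    new_id = max(memory, default=-1) + 1
    memory[new_id] = [(frame_no, box)]
    return new_id


# Hypothetical fallen-person boxes in three consecutive frames.
memory: Dict[int, List[Tuple[int, Box]]] = {}
first = assign_person_id(memory, 0, (100, 300, 300, 380))
second = assign_person_id(memory, 1, (101, 300, 301, 380))  # nearly identical
third = assign_person_id(memory, 2, (400, 300, 600, 380))   # elsewhere in frame

print(first, second, third)
```

The first two boxes differ by one pixel and share an ID, while the third box does not overlap them and receives a new ID.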

The detection of a fallen person may not be performed accurately in every frame at the video-level since an unexpected situation such as occlusion can take place. To handle this, a time-merge process is performed to calculate the IoUs between the person IDs stored for up to 10 seconds and the new person ID. If no bounding box that is considered to be the same person ID is detected after 10 seconds, the person ID stored in the person ID memory is stored in the merge fall-down memory, and the person ID memory is reinitialized.

In the merge fall-down memory, the person IDs generated up to the n$^{\mathrm{th}}$ frame are stored, and the person IDs with IoU higher than 0.9 are integrated before storage. This method was optimized for application to environments where hardware resources are scarce, like a CCTV environment. When a new fall event is detected once and merge fall-down memory is created, the system checks the person ID one more time to prevent cases where the same person is incorrectly assigned a different person ID after the fall-down tracking. If another person ID is created for a fall detection every 30 seconds, the bounding box overlap algorithm is used again because it is likely to be the same person ID as that stored in the merge fall-down memory. If it is verified that the new ID is the same person ID through this operation, a single person that has several person IDs can finally be made into one person ID.

In addition, unintentional occlusions may occur, where other objects pass between the fallen person and the camera. This may make the system give the passing object the same person ID as the fallen person. In this case, the bounding box ratio method is applied again, and the person ID is given only to those satisfying the threshold. This can be done to accurately detect the fallen person and efficiently track the person through the time-merge process.
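A minimal sketch of the time-merge idea is given below, assuming a fixed frame rate. The class and attribute names (`TimeMerge`, `Track`, `active`, `merged`), the frame rate of 4 frames per second, and the example boxes are hypothetical; the 10-second window, the IoU threshold of 0.9, and the ratio threshold of 1.2 follow the text. The periodic 30-second re-check is omitted for brevity.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed layout
IOU_THRESHOLD = 0.9
RATIO_THRESHOLD = 1.2
WINDOW_SECONDS = 10


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


@dataclass
class Track:
    box: Box
    first_frame: int
    last_frame: int


class TimeMerge:
    """Keeps fallen-person tracks alive across short detection gaps."""

    def __init__(self, fps: float) -> None:
        self.window = int(WINDOW_SECONDS * fps)  # window length in frames
        self.active: List[Track] = []            # person ID memory
        self.merged: List[Track] = []            # merge fall-down memory

    def update(self, frame_no: int, boxes: List[Box]) -> None:
        for box in boxes:
            width, height = box[2] - box[0], box[3] - box[1]
            if height <= 0 or width / height <= RATIO_THRESHOLD:
                continue  # ratio check again: passing objects are ignored
            match = next(
                (t for t in self.active if iou(t.box, box) > IOU_THRESHOLD),
                None,
            )
            if match is not None:
                match.box, match.last_frame = box, frame_no
            else:
                self.active.append(Track(box, frame_no, frame_no))
        self._expire(frame_no)

    def _expire(self, frame_no: int) -> None:
        """Move tracks unseen for longer than the window to merged memory."""
        still_active = []
        for track in self.active:
            if frame_no - track.last_frame > self.window:
                self._store(track)
            else:
                still_active.append(track)
        self.active = still_active

    def _store(self, track: Track) -> None:
        """Integrate a finished track with an overlapping stored one, if any."""
        for stored in self.merged:
            if iou(stored.box, track.box) > IOU_THRESHOLD:
                stored.first_frame = min(stored.first_frame, track.first_frame)
                stored.last_frame = max(stored.last_frame, track.last_frame)
                return
        self.merged.append(track)


# Hypothetical video: the fallen person is occluded during frames 6 to 20.
tm = TimeMerge(fps=4)
fallen = (100, 300, 300, 380)
passer_by = (150, 200, 210, 380)  # standing person, fails the ratio check
for frame in range(81):
    visible = frame <= 5 or 21 <= frame <= 30
    boxes = [fallen] if visible else []
    if frame == 3:
        boxes.append(passer_by)
    tm.update(frame, boxes)

print(len(tm.merged), tm.merged[0].first_frame, tm.merged[0].last_frame)
```

In this example the 15-frame occlusion is shorter than the 40-frame window, so the detections before and after it are kept as one fall event rather than two.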

3.2 Mixed-up Dataset Configuration for Fall Detection

It is important that the model detects both people who have fallen and those who have not as the same single person class so that it is possible to track them at the video level. In addition, to solve the problem of mistaking a black object for a fallen person, as shown in Fig. 4, we created a new dataset by mixing in the AI-Hub dataset, as shown in the fall-down dataset in Fig. 2.

In the VFP290K dataset [7], the background and camera positions are limited to "high" and "low." As shown in Fig. 4, training using the VFP290K dataset [7] leads to falsely detecting a dark object as a fallen person. To address this limitation, we added commonly encountered objects to our dataset for the training process. We also added people wearing dark clothes to our training data.

Fig. 4. False positives of fall event detection via VFP290K.

4. Experiments

4.1.1 Experimental Setting

To verify the proposed method, we first evaluated object detection performance by training the backbone models YOLOv5 and YOLOv7 [12] on the VFP290K dataset [7] and on our dataset. We then evaluated fall detection performance in wild conditions with diverse domains using our test dataset. Our experiments were conducted with the same model as that of An et al. [7]: since the VFP290K dataset [7] has been evaluated using YOLOv5, our method also used YOLOv5 as the backbone. The current state-of-the-art real-time object detector, YOLOv7 [12], was also used as a backbone model to evaluate the proposed method. This experiment examined whether the fall detection method remains applicable when new YOLO methods are proposed in the future.

4.1.2 Test Dataset

Since the background, environment, and camera height of the VFP290K dataset [7] do not vary, it was necessary to verify that a model trained on our dataset robustly detects falls even in natural situations. Therefore, we constructed a test dataset by dividing 1,531 seconds of video taken with a new camera height and background into 6,124 frames (four frames per second). This video contains 21 fall events. All parts concerning personal information were cropped, mosaicked, and anonymized.

4.1.3 Evaluation Metrics

Precision is the proportion of detections that the model labels as falls that are truly falls:

$ \text{Precision}=\frac{\text{True Positives}}{\text{True Positives}+\text{False Positives}} $

True positives denote the number of correctly predicted falls, and false positives denote the cases where the model predicted a fall but no fall occurred.

Recall is the proportion of actual falls that the model correctly predicts, where false negatives denote actual falls that the model missed:

$ \text{Recall}=\frac{\text{True Positives}}{\text{True Positives}+\text{False Negatives}} $

AP is calculated as the mean precision of each class at certain thresholds. $\mathrm{mAP}_{50}$ is the average of AP over all detected classes with an IoU threshold of 0.5. In the case of VFP290K [7], it is the average value for the two classes, fallen and non-fallen. Since we trained the system with a single person class, it was computed for that class alone, counting a bounding box predicted to be a person as correct when its IoU with the ground truth label exceeds 50%.

$\mathrm{mAP}_{95}$ is the average of the mAP values for IoU thresholds ranging from 0.5 to 0.95 with a step size of 0.05. The F1 score is the harmonic mean of precision and recall, and in our study, it represents the comprehensive fall detection performance considering the tradeoff between precision and recall.

$ \text{F1 score}=2\times \frac{\text{Precision}\times \text{Recall}}{\text{Precision}+\text{Recall}} $
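As a small worked illustration of the three metrics, the following Python sketch computes them from event counts. The function names and the counts are hypothetical and are not results from our experiments.

```python
def precision(tp: int, fp: int) -> float:
    """TP / (TP + FP); defined as 0.0 when nothing was predicted."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """TP / (TP + FN); defined as 0.0 when there are no actual falls."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical counts: 6 falls detected correctly, 2 false alarms, 4 missed.
tp, fp, fn = 6, 2, 4
p = precision(tp, fp)
r = recall(tp, fn)
f1 = f1_score(p, r)

print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

With these counts, precision is 0.75 and recall is 0.6, and the F1 score of about 0.667 sits closer to the lower of the two, which is why it penalizes a detector that trades many missed falls for few false alarms or the reverse.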

4.2 Overall Performance Compared to VFP290K (Object Detector)

In this experiment, we evaluated the object detection performance compared to the baseline VFP290K [7] with the same backbone models, YOLOv5 and YOLOv7 [12]. The baseline was trained with two classes, fallen and non-fallen, but we trained the model with a single person class. The result is shown in Table 1. When trained with YOLOv5 as the backbone model, our dataset improved precision, recall, $\mathrm{mAP}_{50}$, and $\mathrm{mAP}_{95}$ by 0.126, 0.084, 0.156, and 0.11 over VFP290K, reaching 0.906, 0.724, 0.841, and 0.49, respectively. These gains indicate that our dataset is well suited to the fall detection task.

As shown in Fig. 4, the model trained on the baseline dataset repeatedly produced incorrect results and detected dark objects as fallen people. This was because the baseline dataset does not consider objects other than humans. This led to false-positive cases, which decreased precision, so the model could be inefficient when applied to a real CCTV environment.

Table 1 also shows the experimental results with YOLOv7 [12]. The precision, recall, $\mathrm{mAP}_{50}$, and $\mathrm{mAP}_{95}$ were 0.954, 0.844, 0.904, and 0.531, respectively, showing the best performance. In Fig. 5, it can be seen that the false-positive cases that occurred in our baseline method disappeared when YOLOv5 was trained on our dataset. This suggests that even if YOLO methods or other object detectors with better performance are proposed in the future, our method can be applied to them with robust fall detection performance.

Fig. 5. Object Detecting Performance Comparison Between Our Method and Baseline.

Table 1. Overall Performance Evaluation: Object Detector Trained with Different Datasets.

(Table body not recovered. Surviving labels: columns $\mathrm{mAP}_{50}$ and $\mathrm{mAP}_{95}$, repeated for two column groups; row "Our dataset".)

4.3 Overall Performance Compared to VFP290K (Fall Detector)

We created a new test video to evaluate the fall detection performance in wild conditions with a domain gap from the training data; the video was not included in the training dataset. The video contained a total of 21 fall events, and we compared our fall detection results with those of the model trained with YOLOv5 on the baseline dataset. The results are shown in Table 2.

Our method enhanced precision by 0.349 compared to the baseline. Our system achieved an F1 score of 0.624, which is 0.104 higher than that of the baseline. While the VFP290K dataset [7] showed relatively poor performance at the video level, our method’s fall detection performance was verified through the experiment.

Table 2. Comparison of VFP290K [7] and Our Proposed Method on Our Fall Dataset.

(Table body not recovered. Surviving label: row "Our method".)

5. Conclusion

We proposed new methods to detect fallen people and constructed a new dataset specialized for the fall detection task. Our method showed better performance than the baseline in both object detection and video-level fall detection. The algorithm could be extended to learn the characteristics or clothing of a fallen person and store them together with the person ID. It could also be used in a variety of ways, such as transmitting alerts to a hospital or police station, leading to quick action for people who have fallen.


This work used datasets from The Open AI Dataset Project (AI-Hub, S. Korea). All data can be accessed through AI-Hub (

This work is also supported by the Korea Agency for Infrastructure Technology Advancement (KAIA) grant funded by the Ministry of the Interior and Safety (Grant 22PQWO-C153359-04).


[1] "Facts About Falls," Centers for Disease Control, Aug. 2021.
[2] Park et al., "Emergency Situation Recognition System Using CCTV and Deep Learning," Korea Information Processing Society, Nov. 2020.
[3] He, Kaiming, Georgia Gkioxari, Piotr Dollar, and Ross Girshick, "Mask R-CNN," 2017 IEEE International Conference on Computer Vision (ICCV), Oct. 2017.
[4] Chen, Ying-Nong, Chi-Hung Chuang, Chih-Chang Yu, and Kuo-Chin Fan, "Fall Detection in Dusky Environment," SpringerLink, Nov. 2013.
[5] Xu, Teng, et al., "Fall Detection Based on Person Detection and Multi-target Tracking," 2021 11th International Conference on Information Technology in Medicine and Education (ITME), IEEE, Nov. 2021.
[6] Salimi, Mohammadamin, José JM Machado, and João Manuel RS Tavares, "Using Deep Neural Networks for Human Fall Detection Based on Pose Estimation," Sensors, 22(12), Jun. 2022.
[7] An, Jaeju, et al., "VFP290K: A Large-Scale Benchmark Dataset for Vision-based Fallen Person Detection," Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), Aug. 2021.
[8] Redmon, Joseph, et al., "You only look once: Unified, real-time object detection," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Jun. 2016.
[9] M. Ning, Y. Lu, W. Hou, and M. Matskin, "YOLOv4-object: an Efficient Model and Method for Object Discovery," 2021 IEEE 45th Annual Computers, Software, and Applications Conference (COMPSAC), Jul. 2021, pp. 31-36.
[10] Wang, Chien-Yao, Alexey Bochkovskiy, and Hong-Yuan Mark Liao, "Scaled-YOLOv4: Scaling cross stage partial network," Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Jul. 2021, pp. 13029-13038.
[11] Wang, Chien-Yao, I-Hau Yeh, and Hong-Yuan Mark Liao, "You only learn one representation: Unified network for multiple tasks," arXiv preprint, arXiv:2105.04206, May 2021.
[12] Wang, Chien-Yao, Alexey Bochkovskiy, and Hong-Yuan Mark Liao, "YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors," arXiv preprint, arXiv:2207.02696, Jul. 2022.
[13] Girshick, Ross, et al., "Rich feature hierarchies for accurate object detection and semantic segmentation," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Jun. 2014, pp. 580-587.
[14] Ge, Zheng, et al., "YOLOX: Exceeding YOLO series in 2021," arXiv preprint, arXiv:2107.08430, Aug. 2021.
[15] Long, Xiang, et al., "PP-YOLO: An effective and efficient implementation of object detector," arXiv preprint, arXiv:2007.12099, Aug. 2020.
[16] Zhu, Xizhou, et al., "Deformable DETR: Deformable transformers for end-to-end object detection," arXiv preprint, arXiv:2010.04159, Oct. 2020.
[17] Charfi, Imen, et al., "Definition and performance evaluation of a robust SVM based fall detection solution," 2012 Eighth International Conference on Signal Image Technology and Internet Based Systems, IEEE, Nov. 2012, pp. 218-224.
[18] Auvinet, Edouard, et al., "Multiple cameras fall dataset," DIRO-Université de Montréal, Tech. Rep., Jul. 2010, 1350: 24.
[19] Mastorakis, Georgios, and Dimitrios Makris, "Fall detection system using Kinect’s infrared sensor," Journal of Real-Time Image Processing, 9(4), Dec. 2014, pp. 635-646.
[20] Zhang, Zhong, Christopher Conly, and Vassilis Athitsos, "Evaluating depth-based computer vision methods for fall detection under occlusions," International Symposium on Visual Computing, Springer, Cham, 2014, pp. 196-207.


Seunghun Moon

Seunghun Moon received the B.S. degree in electronics engineering from Sogang University, Seoul, South Korea, in 2023, where he is currently pursuing the M.S. degree in electronics engineering. His current research interests include deep learning, anomaly detection, and computer vision.

Changhee Yang

Changhee Yang received the B.S. degree in electronics engineering from Dankook University, Jukjeon, South Korea, in 2022, and he is currently pursuing the M.S. degree in electronics engineering. His current research interests include image processing, 3D pose estimation, and computer vision.

Beoungwoo Kang

Beoungwoo Kang received the B.S. degree in electronics engineering from Sogang University, Seoul, South Korea, in 2022, where he is currently pursuing the M.S. degree in electronics engineering. His current research interests include deep learning, semantic segmentation, and computer vision.

Suk-Ju Kang

Suk-Ju Kang (Member, IEEE) received a B.S. degree in electronic engineering from Sogang University, South Korea, in 2006, and a Ph.D. degree in electrical and computer engineering from the Pohang University of Science and Technology, in 2011. From 2011 to 2012, he was a Senior Researcher with LG Display, where he was a project leader for resolution enhancement and multi-view 3D system projects. From 2012 to 2015, he was an Assistant Professor of Electrical Engineering at Dong-A University, Busan. He is currently a Professor of Electronic Engineering at Sogang University. He was a recipient of the IEIE/IEEE Joint Award for Young IT Engineer of the Year, in 2019. His current research interests include image analysis and enhancement, video processing, multimedia signal processing, circuit design for display systems, and deep learning systems.