Journal of Safety Science and Resilience 5 (2024) 355–364


Human behaviour detection dataset (HBDset) using computer vision for evacuation safety and emergency management

Yifei Ding a, Xinghao Chen a, Zilong Wang a, Yuxin Zhang a,b,c,*, Xinyan Huang a,*

a Research Centre for Fire Safety Engineering, Department of Building Environment and Energy Engineering, The Hong Kong Polytechnic University, Hong Kong, China
b State Key Laboratory of Disaster Reduction in Civil Engineering, Tongji University, Shanghai, China
c Department of Geotechnical Engineering, Tongji University, Shanghai, China

ARTICLE INFO

Keywords: Image dataset; Object detection; Human behaviour; Public safety; Evacuation process

ABSTRACT

During emergency evacuation, it is crucial to accurately detect and classify different groups of evacuees based on their behaviours using computer vision. Traditional object detection models trained on standard image databases often fail to recognise individuals in specific groups such as the elderly, disabled individuals and pregnant women, who require additional assistance during emergencies. To address this limitation, this study proposes a novel image dataset called the Human Behaviour Detection Dataset (HBDset), specifically collected and annotated for public safety and emergency response purposes. This dataset contains eight types of human behaviour categories, i.e. the normal adult, child, holding a crutch, holding a baby, using a wheelchair, pregnant woman, lugging luggage and using a mobile phone. The dataset comprises more than 1,500 images collected from various public scenarios, with more than 2,900 bounding box annotations. The images were carefully selected, cleaned and subsequently manually annotated using the LabelImg tool. To demonstrate the effectiveness of the dataset, classical object detection algorithms were trained and tested based on the HBDset, and the average detection accuracy exceeds 90 %, highlighting the robustness and universality of the dataset. The developed open HBDset has the potential to enhance public safety, provide early disaster warnings and prioritise the needs of vulnerable individuals during emergency evacuation.

1. Introduction

For the past few decades, the increase in natural and man-made calamities such as earthquakes [1], building fires [2], floods [3] and stampede accidents [4] has promoted the urgent demand for public emergency safety research. Once an emergency occurs, prompt and proper evacuation is the key priority for human life safety. However, past disasters have illustrated that humans may lack knowledge of how or where to evacuate, while inefficient evacuation strategies and behaviours cause serious injuries and casualties [5,6]. Furthermore, people with special characteristics, behaviours and existing physical or mental troubles, such as pregnant women, elderly people, children or people with disabilities (Fig. 1), delay the evacuation process on emergency sites and increase the difficulty of evacuation and rescue [7]. On the one hand, these special groups account for higher casualty risk; for example, the elderly aged over 65 comprise 32 % of home fire deaths but represent only 13 % of the population [8]. On the other hand, they may obstruct the egress of other evacuees owing to their slower speeds and the need for larger evacuation space [9]. Therefore, the provision of additional help and instruction for groups who need special attention is vitally significant to decrease injuries during disasters [10].

In recent years, there has been an increasing interest in applying artificial intelligence (AI) to the research and development of smart emergency management systems. In particular, visual object detection based on deep learning and computer vision has been critical and widely used in evacuation research and emergency system development. For example, Zhao et al. [11] leveraged machine learning to investigate factors affecting pre-evacuation decision-making of building occupants. Huang et al. [12] applied computer vision algorithms to estimate crowd density and simulate evacuation, aiming to reduce the stampede risk in public places. Cheng et al. [13] proposed a graph-based network to process real-time surveillance videos to detect and tally the number of evacuees in the target area for evacuation navigation. The abovementioned studies illustrate the application potential of AI in evacuation, but they are not focused on vulnerable populations or multiple human behaviours.

* Corresponding authors: Y. Zhang, X. Huang.

Received 5 January 2024; Received in revised form 13 April 2024; Accepted 26 April 2024; Available online 2 June 2024
2666-4496/© 2024 China Science Publishing & Media Ltd. Publishing Services by Elsevier B.V. on behalf of KeAi Communications Co. Ltd. This is an open access article under the CC BY-NC-ND license.

Object detection algorithms enable real-time and accurate detection of persons in the input frames and further extraction of people's distribution, movement speed, number of total evacuees and other physical information. In our previous study [14], the object detection model (YOLOv4) was redeveloped to build an evacuation monitoring system to extract and provide integrated information on the evacuation process; however, this system could only detect general human bodies. Because vulnerable people and people with disabilities introduce uncertainty into emergency response, object detection systems should cater not only to average occupants but also to groups that require special attention. Therefore, accurate detection of various categories of evacuees and their diverse behaviours is valuable and helpful in guiding emergency evacuation and rescue during disasters. While most object detection systems show satisfactory performance for normal person detection, they cannot classify specific human behaviours.

To better identify these categories, supervised learning, one of the most crucial branches of deep learning, is adopted, and various models have been developed for visual object detection. Generally, a well-trained object detection model requires a big dataset comprising numerous training samples and labels. However, most classical human image datasets, such as the COCO dataset [15] and PASCAL VOC dataset [16], overlook the diverse human categories and fail to distinguish and label them differently. Therefore, a mature AI model architecture alone is insufficient for complex human-behaviour identification, and a rich human dataset with vulnerable and other disruptive human-behaviour categories is necessary for developing a more powerful visual detection model.

In this paper, an open labelled Human Behaviour Detection dataset (HBDset) containing abundant specific human behaviour images and corresponding labels is introduced to help the public community to work on and enhance the effectiveness and accuracy of object detection algorithms for vulnerable people and other disruptive behaviour categories during evacuation. The proposed dataset contains common human behaviours during public emergencies such as those of pregnant women, children, people walking while playing on their phones and people using wheelchairs and crutches. Advanced object detection algorithms are adopted as examples to validate and calibrate the feasibility of the dataset. By establishing a more adequate database for human behaviours, this research provides an intelligent emergency management system framework for public emergency safety and early disaster warning and lays the foundation for the development of human recognition models. Fig. 2 illustrates the overall methodology of this work: (a) collecting raw image data, (b) classifying, annotating and splitting data to generate a detection dataset, (c) conducting experiments for the dataset using an object detection model and (d) providing perspectives of an intelligent monitoring system and a digital twin. The HBDset is released at [Link].

2. Related work

2.1. Object detection algorithms

The advancement of computer vision has greatly contributed to the development and improvement of models for visual object detection. Over the past few years, various object detection algorithms have emerged as accurate and lightweight systems widely utilised in pedestrian recognition. The conventional object detection method is based on handcrafted feature extraction [17], e.g., the scale-invariant feature transform [18], shape contexts [19] and histogram of gradients [20].

With the booming development of deep learning, convolutional neural network (CNN)-based object detection models have gained tremendous popularity, such as R-CNN (region-based convolutional neural network) [21], R-FCN (region-based fully convolutional network) [22] and SSD (single shot multibox detector) [23]. One notable algorithm is YOLO (you only look once) [24], which efficiently combines detection components into a single CNN operating on the entire image. By integrating localisation and classification tasks into a unified CNN framework, YOLO leverages features from the entire image to make predictions for each bounding box. Moreover, it can predict bounding boxes for multiple classes simultaneously, making global inferences about the entire image and its objects.

Up to now, the YOLO model has been upgraded and iterated through various classical versions with faster and more accurate detection performance, including YOLOv2 [25], YOLOv3 [26], YOLOv4 [27] and YOLOv7 [28], as well as YOLOv5 [29] and YOLOv8 [30] presented by Ultralytics. The YOLO algorithm has demonstrated great potential in fire engineering when applied to fire detection. For example, Wang et al. combined YOLO and binocular cameras to realise real-time fire detection and fire distance estimation [31]. Moreover, the model structures of YOLO have the potential to be applied in human-behaviour detection and evacuation scenarios. For example, Li et al. [32] employed YOLO to extract the evacuees' movement parameters from the videos of an earthquake evacuation case to build an evacuation velocity-classification model. For human abnormal-behaviour detection, Ji et al. used T-TINY-YOLO and improved the network model with a CNN tailoring scheme [33]. Nguyen et al. presented a novel form of real-time human detection using smart video surveillance at the edge [34]. In the present study, we select YOLOv5, YOLOv7 and YOLOv8 to demonstrate the performance and feasibility of the proposed HBDset.
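To make the single-pass design concrete, the sketch below runs a pretrained Ultralytics YOLOv8 model on one image and prints every predicted box with its class and confidence. This is only an illustrative sketch: the weight file yolov8n.pt and the image name sample.jpg are placeholder assumptions, not artefacts of this work.

```python
# Minimal single-pass detection sketch with the Ultralytics API
# (assumes `pip install ultralytics`; 'sample.jpg' is a placeholder image).
from ultralytics import YOLO

model = YOLO("yolov8n.pt")      # pretrained COCO weights, downloaded on first use
results = model("sample.jpg")   # one forward pass localises and classifies jointly

for r in results:
    for box in r.boxes:
        cls_name = model.names[int(box.cls)]   # class label for this box
        conf = float(box.conf)                 # detection confidence
        x1, y1, x2, y2 = box.xyxy[0].tolist()  # corner coordinates in pixels
        print(f"{cls_name}: {conf:.2f} at ({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f})")
```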

Fig. 1. Examples of special groups during evacuation: (a) children (image licensed under a Creative Commons License, [Link]), and (b) a person with disability (image licensed under a Creative Commons License, [Link]).


Fig. 2. Overall methodology of this work.

2.2. Human database

To train a specific AI model for an intelligent human detection system, a well-annotated dataset with various human categories is the dominant component. Using human datasets, researchers and system developers customise the performance and function of the object detection model to detect specific human classes well. However, most of the popular datasets for object detection do not focus on the classification of different human behaviours. For example, even though the PASCAL VOC (Visual Object Classes) datasets [16] annotate more than 10,000 human instances, and the MS COCO (Microsoft Common Objects in Context) datasets [15] annotate more than 580,000, the only human category is 'person'. TDB [35] is a specified pedestrian dataset for supervised learning, but it has a single 'individual pedestrian' category. Similarly, some human datasets contributing to person re-identification, such as the CUHK dataset and Market1501 [36], collect images of groups of people from different view angles and do not distinguish other categories.

In addition, some studies contributed to human action recognition and established numerous corresponding human datasets with diverse human actions. For example, Gu et al. [37] proposed the Ava dataset that comprises 1.59 million action labels and densely annotates 80 atomic visual human actions such as 'sit', 'stand' and 'play instrument'. Some datasets classify human images using different attribute tags such as 'backpack', 'short hair', 'shorts', 'skirt', 'woman' and 'man', as well as mutual combinations. For example, the Richly Annotated Pedestrian (RAP) [38] dataset provides 84,928 images with 72 types of attributes and additional tags. Other famous human attribute datasets include PA100-K [39] and PETA (PEdesTrian Attribute) [40].

Although these datasets classify human images into different attribute categories and make great contributions to person pattern recognition, the detection of overly diverse attribute tags, such as clothes, hairstyles and normal actions, is of little significance in terms of public safety in an emergency. Other researchers proposed road pedestrian datasets for autonomous driving and traffic safety. For instance, Zhang et al. [41] introduced a diverse city pedestrian dataset named CityPerson with 5000 images and four fine-grained person categories (pedestrian, rider, sitting and others). Notably, Sharma et al. [42] provided a new pedestrian dataset named BGVP (BG Vulnerable Pedestrian) focusing on vulnerable groups on a road, which contains 2000 images and 5932 bounding box annotations and classifies pedestrians into children without disability, elderly with and without disability and non-vulnerable people. Although the BGVP dataset tries to propose a guideline for vulnerable person identification, other vulnerable groups such as pregnant women, and groups with special human behaviours such as playing on phones while walking or holding a baby, must also be given increased attention in terms of emergency safety.

Although diverse and abundant human datasets have driven advances in computer vision-based person pattern recognition techniques, these datasets have neglected the importance of special human-behaviour detection in emergency scenarios. In this paper, a new open HBDset is introduced focusing more on vulnerable groups such as pregnant women, children and people using a wheelchair or a crutch, and on those with special behaviours that are not conducive during escape, such as playing on phones while walking, holding a baby and lugging luggage. The comparison of the HBDset (groups with special behaviours are presented in bold) with existing human datasets is listed in Table 1. The introduction of this novel benchmark dataset will serve to advance the ongoing progress in the domain of specialised human group recognition, thereby stimulating heightened research engagement within this particular field.

3. Human behaviour detection dataset (HBDset)

3.1. Data classification

Vulnerable groups and people with disruptive behaviours are considered special attention groups during evacuation scenarios. Once a public emergency occurs, the occupants should escape promptly and properly. However, individuals who belong to special attention groups, due to physical or mental conditions, may exhibit reduced responsiveness and mobility, along with an inadequate comprehension of evacuation strategies and routes. Therefore, these groups should be given extra attention, assistance or professional rescue support such as specialised wheelchairs or stretchers. An intelligent monitoring system should first encompass the initial identification, localisation and quantification of these demographic clusters. Naturally, rich detection datasets with abundant images of special groups are vitally significant for visual evacuee detection models based on deep learning. In accordance with the abovementioned evacuation monitoring demand, the evacuee detection instances in the HBDset are divided into eight categories: 'normal_person', 'child', 'holding_crutch', 'holding_baby', 'pregnancy', 'using_wheelchair', 'lugging_luggage' and 'playing_phone'.

1) Normal_person: This group of occupants can rationally move and respond immediately once an emergency scenario occurs. These people do not have any visible physical disability or vulnerability and behave normally without carrying any items that may affect evacuation. In the HBDset, we assume that the instances of this group refer to adults who walk normally without any belongings or special movements.


Table 1
Comparing the HBDset with existing human datasets.

| Type/Task          | Datasets           | Year | No. of human images | Human categories | Description                                                    |
|--------------------|--------------------|------|---------------------|------------------|----------------------------------------------------------------|
| General            | VOC [16]           | 2010 | >100,000            | 1                | For visual classification, detection and segmentation          |
|                    | COCO [15]          | 2014 | >580,000            | 1                | /                                                              |
| Person re-id       | CUHK01 [43]        | 2012 | >1900               | 1                | Focus on a person from various view angles                     |
|                    | Market1501 [36]    | 2015 | >500,000            | 1                | /                                                              |
| Action recognition | Ava [37]           | 2018 | 1.59 M              | 80               | Human actions: sit, kick, eat, stand and so on                 |
| Human attributes   | PETA [40]          | 2014 | 19,000              | 61               | Human attribute tags: gender, age, hairstyle, dress and so on  |
|                    | RAP [38]           | 2016 | 84,928              | 72               | /                                                              |
|                    | PA100-K [39]       | 2017 | /                   | 26               | /                                                              |
| Road pedestrian    | TDB [35]           | 2008 | 25,551              | 1                | Pedestrian detection                                           |
|                    | CityPerson [41]    | 2017 | 5000                | 4                | Pedestrian, rider, sitting and others                          |
|                    | BGVP [42]          | 2022 | 2000                | 4                | Vulnerable and disabled groups                                 |
| Special groups     | HBDset (this work) | 2023 | 1523                | 8                | Vulnerable and special human-behaviour groups                  |

2) Child: This group refers to children without disability. Nonetheless, a child has limited understanding of the severity of an emergency scenario and weaker physical ability. Moreover, a child may not be familiar with the evacuation route and can easily become frightened and panicked during the escape, making them more vulnerable to injury. Therefore, children must be in the special attention groups and should be provided extra assistance and support during evacuation.
3) Holding_crutch: Most people in this group are elderly people or disabled. They have limited mobility and balance issues, making them more susceptible to falling or tripping during urgent evacuation. Moreover, they will find it very difficult to pass if they meet obstacles in the egress route. Therefore, focusing on their trajectory during evacuation and providing prompt assistance are critical.
4) Holding_baby: People who are holding a baby during an evacuation scenario are required to use at least one arm to hold the baby, and their visibility may be obstructed by the baby, making it harder for them to move and navigate quickly. At the same time, they need more energy to safeguard their own and the baby's safety.
5) Pregnancy: Pregnant women have severe physical limitations, are more at risk of injury and are even prone to miscarriage. Hence, helping them evacuate or rescuing them is rather challenging, as it requires specialised strategies and labour assistance to guarantee their and the unborn baby's health, as well as emotional support to help them stay calm.
6) Using_wheelchair: People using wheelchairs are highly likely to be elderly or disabled with limited mobility. The stress and panic during an emergency scenario can make it more difficult for them to remain calm and focused, which can further impede their ability to evacuate safely. Their dependence on wheelchairs makes it even harder for them to cross obstacles and stairs. Therefore, the monitoring system must detect and track them in real time and provide extra professional assistance and specialised rescue support.
7) Lugging_luggage: This behaviour is disruptive for crowd evacuation. This group is not vulnerable or disabled, but their baggage is likely to make them move slowly and even inadvertently block the egress pathways or obstruct others' movement. Moreover, they may be more susceptible to falls, tripping or other accidents due to their luggage. Given these safety concerns, people who are lugging luggage during evacuation must also be given special attention.
8) Playing_phone: This behaviour is disruptive for crowd evacuation. Playing on phones during an evacuation scenario is very improper and dangerous. It may slow down the crowd evacuation process and put others at risk of injury. In addition, playing on phones can be a distraction and will prevent evacuees from being aware of their surroundings and the urgent atmosphere, making it harder for them to respond promptly to changing conditions or to follow instructions from emergency personnel. Hence, those who are playing on their phones must be detected, and on-time warnings must be provided.

In the process of constructing the database, we initially acquired data from well-established classical datasets. Notably, a portion of the 'normal_person' class images was sourced from the MS COCO dataset [15]. Furthermore, we drew inspiration from various classical image datasets, which draw their images from online resources. This inspiration led us to employ web crawlers and complementary plug-in components to systematically collect images from publicly accessible domains on the Internet. Additionally, a subset of the images incorporated into our dataset was captured within a public domain. In summary, the images of human behaviours are divided into eight categories based on vulnerable groups and improper behaviours.

3.2. Data annotation

In computer vision, the PASCAL VOC [16] format is widely used as the standard data annotation format in object detection-model training. The annotation work was conducted using the professional open-source image annotation software LabelImg [44] (Fig. 3). The labelled profiles are saved as XML files in the PASCAL VOC format, which contain key information regarding the images and annotations, including the storage path, image resolution, class and object coordinates. The VOC format is the more general format for object detection-model training, while recent versions of YOLO, such as Versions 5-8, use specialised TXT files in the YOLO format. The YOLO format simplifies the annotation process with a single TXT file per image, which contains all the necessary information, including the class label and bounding box coordinates of the objects. This streamlined format reduces the complexity of managing multiple XML files, and the two formats (PASCAL VOC format and YOLO format) can be converted into each other using a Python programme. In this study, the annotation data are saved in the YOLO format. After data annotation, the results were inspected by the other authors to ensure accuracy.
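Because both formats encode the same class and box information, the conversion mentioned above needs only a short script. The sketch below is a minimal, illustrative VOC-to-YOLO converter; the fixed class-name order is an assumption for illustration, and this is not the exact programme used by the authors.

```python
# Illustrative VOC (XML) to YOLO (TXT) conversion for the eight HBDset classes.
import xml.etree.ElementTree as ET

CLASSES = ["normal_person", "child", "holding_crutch", "holding_baby",
           "pregnancy", "using_wheelchair", "lugging_luggage", "playing_phone"]

def voc_to_yolo(xml_path: str, txt_path: str) -> None:
    root = ET.parse(xml_path).getroot()
    w = float(root.find("size/width").text)
    h = float(root.find("size/height").text)
    lines = []
    for obj in root.findall("object"):
        cls_id = CLASSES.index(obj.find("name").text)
        box = obj.find("bndbox")
        x1, y1 = float(box.find("xmin").text), float(box.find("ymin").text)
        x2, y2 = float(box.find("xmax").text), float(box.find("ymax").text)
        # YOLO format: class x_centre y_centre width height, normalised to [0, 1]
        xc, yc = (x1 + x2) / 2 / w, (y1 + y2) / 2 / h
        bw, bh = (x2 - x1) / w, (y2 - y1) / h
        lines.append(f"{cls_id} {xc:.6f} {yc:.6f} {bw:.6f} {bh:.6f}")
    with open(txt_path, "w") as f:
        f.write("\n".join(lines))
```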
3.3. Data statistics

The HBDset comprises 1523 images with a total of 2923 objects, and each object was annotated using a bounding box and a label. On average, each image contains two bounding boxes. The image and object distributions are shown in Fig. 4, providing a quantitative understanding of the dataset. In the database, approximately 200 images were available for each category within the vulnerability groups, each constituting 13 %-14 % of the comprehensive database; by contrast, the corresponding objects of each such category represented only 7.3 %-18 % of the entire database. Notably, the images depicting individuals classified as 'normal' comprised a mere 5.6 % of the total database, yet the targets associated with this category constituted a significant majority, encompassing 25 % of the comprehensive database. This observation aligns with real-world scenarios encountered within public spaces. The number of child objects is more than 500, the highest among all special attention groups.


Fig. 3. Demonstration of data annotation by LabelImg (the original image is taken by the authors).

Fig. 4. Distribution of objects and images in each class: (a) images and (b) objects.

The category with the least number of objects is 'using_wheelchair' because there is only one object in most images of this group. More details of the dataset statistics are provided in Table 2.

4. Experiments for datasets

In this section, the object detection models are trained and tested using the proposed HBDset to present the feasibility of diverse human behaviour detection and provide a benchmark for relevant evaluation. In this study, the recent official versions of the YOLO families, namely, YOLOv5, YOLOv7 and YOLOv8, are selected to evaluate their performance on the HBDset. YOLOv7 optimises the model architecture and introduces many innovative techniques, such as planned re-parameterised convolution, to enhance the accuracy of real-time object detection; it achieves state-of-the-art performance and surpasses all known others on the MS COCO dataset. Moreover, YOLOv5 and YOLOv8, presented by Ultralytics, are well packaged, achieve advanced robustness and are widely used in academia and industry. Therefore, we select YOLOv5, YOLOv7 and YOLOv8 to conduct the experiments to validate the robustness and feasibility of our HBDset.

4.1. Model training

The colour transformation techniques including blur, darker, brighter and salt noise processing are used as the data augmentation method during model training. The colour transformation of images can increase image diversity to some extent, improve model generalisation for deep learning and reduce model over-fitting problems. Moreover, the augmented dataset has better robustness and reduces the interference of different backgrounds. Some instance images are shown in Fig. 5.
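The four colour transformations named above can be implemented along the following lines with OpenCV and NumPy; the kernel size, brightness gains and noise ratio here are illustrative assumptions rather than the paper's exact values. Because only pixel values change, the original bounding box annotations remain valid for the augmented copies.

```python
# One possible implementation of the four colour transformations
# (parameters are illustrative; 'sample.jpg' is a placeholder image).
import cv2
import numpy as np

def blur(img):                      # Gaussian blur
    return cv2.GaussianBlur(img, (5, 5), 0)

def darker(img, gain=0.6):          # scale pixel intensities down
    return np.clip(img.astype(np.float32) * gain, 0, 255).astype(np.uint8)

def brighter(img, gain=1.4):        # scale pixel intensities up
    return np.clip(img.astype(np.float32) * gain, 0, 255).astype(np.uint8)

def salt_noise(img, ratio=0.02):    # set a random fraction of pixels to white
    noisy = img.copy()
    mask = np.random.rand(*img.shape[:2]) < ratio
    noisy[mask] = 255
    return noisy

img = cv2.imread("sample.jpg")
for name, fn in [("blur", blur), ("darker", darker),
                 ("brighter", brighter), ("salt", salt_noise)]:
    cv2.imwrite(f"sample_{name}.jpg", fn(img))
```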
Table 2
Image and object number of each category.

| Group                | Category         | No. of images | No. of objects |
|----------------------|------------------|---------------|----------------|
| No special attention | Normal_person    | 85            | 730            |
| Vulnerable           | Child            | 200           | 525            |
|                      | Holding_crutch   | 208           | 214            |
|                      | Holding_baby     | 206           | 239            |
|                      | Using_wheelchair | 210           | 231            |
|                      | Pregnancy        | 218           | 239            |
| Disruptive behaviour | Lugging_luggage  | 201           | 385            |
|                      | Playing_phone    | 195           | 360            |

The images and annotations in the database were randomly shuffled and then split into the training subset (1066 images, 70 %), validation subset (228 images, 15 %) and test subset (229 images, 15 %). The training set is utilised to train the model, the validation set is employed to assess the performance of the trained model during training and the test set is utilised to measure the accuracy of object detection. The training process is conducted on a desktop computer with the following configuration: NVIDIA GeForce RTX 3070 GPU and 12th Gen Intel(R) Core(TM) i5-12490F 3.00 GHz CPU. PyTorch [45] is applied as the deep learning framework to build the model structure.
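A random 70 %/15 %/15 % split of this kind can be reproduced with a few lines of Python; the directory layout below is an assumption for illustration, while the released repository ships its own splitting script.

```python
# Illustrative random 70/15/15 split of the image set (paths are placeholders).
import random
from pathlib import Path

images = sorted(Path("HBDset/images").glob("*.jpg"))
random.seed(0)                      # fix the seed so the split is reproducible
random.shuffle(images)

n = len(images)
n_train, n_val = int(0.70 * n), int(0.15 * n)
splits = {"train": images[:n_train],
          "val":   images[n_train:n_train + n_val],
          "test":  images[n_train + n_val:]}

for name, files in splits.items():
    # one image path per line, the listing style YOLO data files accept
    Path(f"{name}.txt").write_text("\n".join(str(p) for p in files))
    print(name, len(files))
```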


Fig. 5. Demonstration of data augmentation (the raw unprocessed image is taken by the authors).

During training, the batch size is set at 16 and a total of 300 epochs are set to make the model fully trained to convergence. To reduce training time and computational resources, the methodology of transfer learning [46] is applied to improve the performance of the training results with the limited sample size. Transfer learning is a technique in deep learning in which parts of the knowledge learned from an original task are re-used to boost performance on a related similar task. The official YOLO versions, as published, encompass the original task of 'person' detection and the associated network weights. Consequently, the insights and knowledge acquired during the 'person' detection training phase can be effectively leveraged when training the model for the detection of various other human groups and behaviours. During the practical training process, transfer learning is executed by iteratively adjusting the weights while keeping the official weight profiles such as '[Link]' frozen and retaining a portion of the neural network structure.
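With the Ultralytics releases of YOLO, this training configuration (pretrained weights, batch size 16, 300 epochs, partially frozen network) can be expressed roughly as follows. The dataset file hbdset.yaml and the number of frozen layers are illustrative assumptions, and YOLOv7 uses its own training scripts rather than this API.

```python
# Sketch of transfer-learning training with Ultralytics YOLOv8
# ('hbdset.yaml' lists the dataset paths and the eight class names;
# freezing 10 layers is an illustrative choice, not the paper's value).
from ultralytics import YOLO

model = YOLO("yolov8n.pt")          # start from official pretrained weights
model.train(
    data="hbdset.yaml",             # dataset definition file
    epochs=300,                     # as in the paper
    batch=16,                       # as in the paper
    freeze=10,                      # keep the first 10 layers frozen
)
metrics = model.val()               # evaluate on the validation subset
```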
Fig. 6 illustrates the training loss and validation loss during the training process. The total loss adopted by YOLOv5 and YOLOv7 accumulates the loss of the bounding box, the loss of detection and the loss of classification, shown as Eq. (1), while the total loss adopted by YOLOv8 accumulates the loss of the bounding box, the loss of classification and the distribution focal loss, shown as Eq. (2). As the epoch increases, the training loss and validation loss steadily decrease until reaching a very low value. This indicates that the model has converged and achieved a consistent and optimal performance. The learning curve provides valuable insights into the robustness and generalisability of the dataset.

$\mathrm{loss}_1 = \lambda_1 \mathcal{L}_{bb} + \lambda_2 \mathcal{L}_{cls} + \lambda_3 \mathcal{L}_{obj}$   (1)

$\mathrm{loss}_2 = \lambda_1 \mathcal{L}_{bb} + \lambda_2 \mathcal{L}_{cls} + \lambda_3 \mathcal{L}_{dfl}$   (2)

where $\mathcal{L}_{bb}$ represents the loss of the bounding box, induced by the pixel coordinates of the bounding box; $\mathcal{L}_{cls}$ represents the loss of classification, induced by the recognition of the object category; $\mathcal{L}_{obj}$ represents the loss of detection, induced by the confidence for the inclusion of a target object; $\mathcal{L}_{dfl}$ represents the distribution focal loss, induced by the predicted bounding box offset; and $\lambda_1$, $\lambda_2$ and $\lambda_3$ are hyperparameters, whose setting values are listed in Table 3.

Table 3
Hyperparameters of loss weights.

|             | YOLOv5 | YOLOv7 | YOLOv8 |
|-------------|--------|--------|--------|
| $\lambda_1$ | 0.05   | 0.05   | 7.5    |
| $\lambda_2$ | 0.005  | 0.03   | 0.5    |
| $\lambda_3$ | 1      | 0.7    | 1.5    |

Another metric to evaluate the training performance of the deep learning object detection model is the mean average precision (mAP) [47]. mAP is a comprehensive evaluation index combining Recall (R) and Precision (P), shown in Eqs. (3) and (4), which eliminates the limitation of using a single metric. Fig. 6 shows the average mAP curves of all classes in the training process, in which the blue solid curve represents the mAP score when the threshold equals 0.5, denoted as mAP@0.5, and the blue dashed curve represents the average mAP score over thresholds ranging from 0.5 to 0.95, denoted as mAP@0.5:0.95. The mAP curves show that the prediction accuracy increases with the number of epochs, and the average mAP score of all classes after 300 training epochs is >0.70. Overall, the three models afford stable training results, illustrating the high quality of the HBDset.

$AP = \int_0^1 P(R)\,\mathrm{d}R$   (3)

$mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i$   (4)

Fig. 6. Training process of the three object detection models on the HBDset: (a) YOLOv5, (b) YOLOv7 and (c) YOLOv8.
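As a worked illustration of Eqs. (3) and (4), AP can be approximated by numerically integrating precision over recall, and mAP is then the mean of the per-class APs; the precision-recall samples and per-class values below are made-up numbers, not results from this paper.

```python
# Numerical illustration of Eqs. (3) and (4); all values are made up.
import numpy as np

recall    = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.0])
precision = np.array([1.0, 0.95, 0.9, 0.85, 0.7, 0.5])

ap = np.trapz(precision, recall)    # AP = area under the P(R) curve, Eq. (3)
print(f"AP = {ap:.3f}")             # trapezoidal approximation of the integral

aps = [0.70, 0.86, 0.97, 0.82]      # hypothetical per-class APs
print(f"mAP = {np.mean(aps):.3f}")  # Eq. (4): mean over the N classes
```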


4.2. Model test

After training, the well-trained models were evaluated on the testing dataset containing a total of 229 images of the eight categories. The Recall (R), Precision (P) and mAP@0.5 (mAP) were used as the metrics, and the testing results are shown in Table 4. The results show that the overall mAP scores of the three models range from 0.86 to 0.91, where YOLOv7 achieves a higher mAP score (0.91) than YOLOv5 (0.86) and YOLOv8 (0.89). In addition, the category of 'using_wheelchair' achieves the highest mAP score, and the average mAP scores of the categories of 'holding_baby', 'holding_crutch' and 'pregnancy' are all >0.90. Notably, the class of 'normal_person' achieves the lowest accuracy score, which is significantly lower than the average mAP score. The attributes of each class are the reasons that different classes achieve different mAP scores. For example, highly accurate classes such as 'holding_baby' and 'using_wheelchair' have some very specialised characteristics, such as the baby and the wheelchair, which make it much easier for the CNN to identify these attributes.

Table 4
Testing results of the three models on the HBDset.

| Class            | YOLOv5 P | YOLOv5 R | YOLOv5 mAP | YOLOv7 P | YOLOv7 R | YOLOv7 mAP | YOLOv8 P | YOLOv8 R | YOLOv8 mAP |
|------------------|----------|----------|------------|----------|----------|------------|----------|----------|------------|
| Normal_person    | 0.76     | 0.63     | 0.70       | 0.72     | 0.71     | 0.91       | 0.65     | 0.64     | 0.69       |
| Child            | 0.87     | 0.79     | 0.90       | 0.80     | 0.83     | 0.84       | 0.90     | 0.70     | 0.86       |
| Holding_crutch   | 0.84     | 0.90     | 0.93       | 0.86     | 0.99     | 0.99       | 0.97     | 0.91     | 0.97       |
| Holding_baby     | 0.88     | 0.98     | 0.97       | 0.90     | 0.91     | 0.97       | 0.68     | 0.82     | 0.82       |
| Lugging_luggage  | 0.77     | 0.69     | 0.77       | 0.84     | 0.85     | 0.87       | 0.91     | 0.92     | 0.93       |
| Playing_phone    | 0.62     | 0.74     | 0.68       | 0.86     | 0.87     | 0.93       | 0.87     | 0.79     | 0.89       |
| Pregnancy        | 0.99     | 0.79     | 0.96       | 0.83     | 0.97     | 0.91       | 0.92     | 0.80     | 0.94       |
| Using_wheelchair | 0.95     | 0.89     | 0.97       | 0.90     | 0.96     | 0.98       | 0.99     | 0.89     | 0.99       |
| All              | 0.84     | 0.80     | 0.86       | 0.84     | 0.89     | 0.91       | 0.86     | 0.81     | 0.89       |

Moreover, relevant images were randomly selected from the test dataset to demonstrate the detection performance of the well-trained models. The test dataset was never used in the training process. We mark the boxes of different human categories with various colours. A demonstration of some detection instances is shown in Fig. 7. In the demonstration, the eight classes of human groups or human behaviours can be detected with high accuracy, and most of them achieve a confidence of >0.90. The results reveal that the detected human categories are accurately classified with few errors.

One limitation of this study is the assumption that one individual has one specific behaviour or characteristic. However, some individuals may have several characteristics or show varying behaviours, which may lead to a more vulnerable situation. Therefore, the accurate detection of, and corresponding evacuation strategies for, pedestrians with multiple attributes will be researched further.
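In principle, per-class numbers of the kind reported in Table 4 can be obtained by validating trained weights on the held-out split; the sketch below assumes the Ultralytics validation API and a placeholder weight file best.pt, and the metric attribute names are assumptions based on that library rather than the authors' evaluation code.

```python
# Sketch: per-class AP@0.5 on the test subset ('best.pt' and 'hbdset.yaml'
# are placeholder names for the trained weights and dataset definition).
from ultralytics import YOLO

model = YOLO("best.pt")
metrics = model.val(data="hbdset.yaml", split="test")

print(f"overall mAP@0.5 = {metrics.box.map50:.3f}")
for idx, ap in zip(metrics.box.ap_class_index, metrics.box.ap50):
    print(f"{model.names[idx]:>18s}: AP@0.5 = {ap:.3f}")
```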
4.3. Demonstration in Hong Kong International Airport

To test the performance of the object detection models trained on the HBDset, a demonstration in a public scenario of the Hong Kong International Airport was conducted for detecting and tracking special human groups, in which the tested videos are publicly available and were collected from the Internet. In real evacuation scenarios, the types and distribution of people are often complex. To verify the robustness and universality of the HBDset and provide an evacuation scenario application case, we combined YOLOv7 and Deep Simple Online Real-time Tracking (DeepSORT) [48].

DeepSORT incorporates appearance information and employs Kalman filtering to track objects across frames. It employs the Hungarian algorithm to perform frame-by-frame association, quantified through an association metric [49]. Therefore, this algorithm can be utilised to label personnel based on target detection, which is beneficial to evacuation tracking and command. To simulate the surveillance perspective, an oblique camera view was used in the shooting experiments, and the target crossed the scene multiple times during the tracking process. The result is shown in Fig. 8. The model trained on the HBDset can accurately complete the identification, the successful ratio of detection is >90 %, and the tracking performance is relatively stable, as shown in Videos S1 and S2 in the Supplementary Material.
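A detector-plus-tracker loop of this kind can be sketched as follows; the third-party deep-sort-realtime package is one commonly used DeepSORT implementation and stands in here for the authors' exact code, while best.pt and airport.mp4 are placeholder file names.

```python
# Sketch of the detection-plus-tracking loop (YOLO detector + DeepSORT tracker),
# using the 'deep-sort-realtime' package as a stand-in implementation.
import cv2
from ultralytics import YOLO
from deep_sort_realtime.deepsort_tracker import DeepSort

detector = YOLO("best.pt")             # trained HBDset weights (placeholder)
tracker = DeepSort(max_age=30)         # drop tracks unseen for 30 frames

cap = cv2.VideoCapture("airport.mp4")  # placeholder surveillance video
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    dets = []
    for box in detector(frame)[0].boxes:      # detections for this frame
        x1, y1, x2, y2 = box.xyxy[0].tolist()
        # DeepSORT expects ([left, top, width, height], confidence, class)
        dets.append(([x1, y1, x2 - x1, y2 - y1], float(box.conf), int(box.cls)))
    for track in tracker.update_tracks(dets, frame=frame):
        if track.is_confirmed():              # Kalman + appearance association
            print(track.track_id, track.to_ltrb())
cap.release()
```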

5. Perspectives of intelligent monitoring and digital twin systems

The HBDset not only contributes to human recognition research but also helps develop intelligent monitoring and digital twin systems. The HBDset could be used to train the object detection model and achieve automatic detection of vulnerable populations and other special behaviours, facilitating a more intelligent public safety system. Below is a discussion of the practical potential and contributions of the HBDset for the intelligent monitoring system and emergency management.

In public emergency disasters, intelligent monitoring should possess the capability to facilitate the precise localisation of occupants, monitor and record their movement trajectories and transfer the acquired monitoring data automatically. To realise these functions, an AI-based human detection algorithm is the critical prerequisite. Because evacuation behaviours in current public emergencies such as building fires or earthquakes are diverse and complex, the system must detect multiple human groups or behaviours as well as consider various specific human safety scenarios during evacuation.

Furthermore, an intelligent digital twin system transferring the physical scenario into a virtual data integration would benefit public safety and emergency response. On the one hand, the digital twin system allows the monitoring of the trajectories and behaviour of various human groups, enabling the optimisation of evacuation strategies. On the other hand, it can integrate real-time processed information into the virtual interface, enhancing monitoring efficiency. To achieve this goal, the framework of an intelligent multi-behaviour digital twin system for public safety (Fig. 9) is proposed based on our HBDset. The flowchart of the proposed digital twin system contains three parts: (a) technical installation based on a CCTV (closed-circuit television) network, (b) an AI engine based on a well-trained human-detection model and (c) a monitoring user interface. These are responsible, respectively, for raw video capturing, for vision processing and bounding box generation, and for presenting detection results and issuing corresponding instructions. For the implementation of the system, the CCTV network in public places can be used directly, or more monitoring cameras can be installed as the eyes of the system.

The CCTV network captures raw videos in real time and synchronously transfers the data to a cloud server or edge computing equipment. The AI engine invokes the well-trained multi-behaviour detection model to process the received video signals and assign bounding boxes to detected humans. The detected objects are marked with differently coloured boxes based on the behaviour categories. In this step, the object detection model must be trained using the proposed HBDset before deploying the algorithm.


Fig. 7. Demonstration of the detection effect in the test dataset. (The original undetected images are licensed under a Creative Commons License, free to share and adapt; source links for (a)-(g): [Link].)

Fig. 8. Detection and tracking effect in the Hong Kong International Airport (copyright: authors).

Fig. 9. Flowchart of an intelligent digital twin system for monitoring public safety and improving evacuation during earthquakes, building fires and stampedes.


In the future, we will add more top-view images to this dataset to further improve the model performance. The third part is the output edge, which presents the processed information in the user interface for on-site or remote evacuation directors. The displayed information should contain the real-time locations, trajectories and counts of the different human groups. The intelligent digital twin system can provide commands and advice based on the information on detected vulnerable groups and people with improper behaviours. For example, the system could suggest that more than five ambulance stretchers are needed, or that one pregnant woman and two women holding babies need extra help. Meanwhile, during the escape, the system could warn people playing on their phones or people lugging baggage, via alarms or on-site instructors, to avoid falls or even a stampede.
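As a toy illustration of how detections could be turned into such suggestions, the snippet below counts detected categories in one frame and emits simple rule-based messages; the counts, rules and thresholds are invented for illustration and are not part of the proposed system.

```python
# Toy rule-based alert generation from per-frame detection counts
# (the counts are example inputs; the rules are illustrative assumptions).
from collections import Counter

counts = Counter({"using_wheelchair": 5, "pregnancy": 1,
                  "holding_baby": 2, "playing_phone": 3})

alerts = []
if counts["using_wheelchair"] >= 5:
    alerts.append(f"{counts['using_wheelchair']} wheelchair users detected: "
                  "request ambulance stretchers and rescue staff")
if counts["pregnancy"] or counts["holding_baby"]:
    alerts.append(f"{counts['pregnancy']} pregnant woman/women and "
                  f"{counts['holding_baby']} people holding babies need extra help")
if counts["playing_phone"]:
    alerts.append(f"warn {counts['playing_phone']} phone-using evacuees via alarm")

for msg in alerts:
    print(msg)
```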
The core component of the proposed intelligent system is the modified object detection model pre-trained using the corresponding human image dataset. Therefore, the presented HBDset lays the foundation for developing the described intelligent system, which is the major contribution of this study. Building a practical intelligent digital twin system for use in public emergencies would require resolving a multitude of engineering challenges, such as whether the latency of data transmission, caused by the limits of wireless devices and computation resources, affects the system performance of real-time instruction and feedback. Moreover, human operators still undertake much of the major work in the intelligent system instead of a fully unmanned operation. Nonetheless, the automatic recognition and detection of diverse human behaviours is the first step to building a fully unmanned and high-level smart monitoring system framework in the future.

6. Conclusions

In this paper, an open human behaviour-detection dataset for deep learning in public emergency safety is proposed, denoted as the HBDset. The HBDset collected and annotated more than 1500 images with more than 2900 object bounding boxes, containing diverse vulnerable human groups and people with improper behaviours during evacuations. The collected images were augmented to increase generalisation and annotated in a standard object detection format for deep learning. The continued refinement of classical object detection models for complex human behaviour detection, resulting in the attainment of significant levels of accuracy in vulnerability group detection, serves to underscore the suitability of the dataset for the advancement of human behaviour detection models. Furthermore, a comprehensive framework for an intelligent multi-behaviour monitoring system is expounded, with potential applications in the domain of public emergency management and early disaster alerts.

In summary, by constructing this repository and the envisioned intelligent digital twin system for monitoring human conduct and public safety, we aspire to draw increased focus towards susceptible demographics during disaster evacuation. Through the identification of the spatial distribution, quantification and trajectory of special human groups, the consideration afforded to these human behaviours can be augmented, thus enhancing the likelihood of secure egress from disaster-stricken areas. Additionally, such data will empower decision-makers to formulate more efficacious evacuation strategies, which will facilitate the entire evacuation process and public emergency response.

CRediT authorship contribution statement

Yifei Ding: Writing – original draft, Methodology, Investigation, Formal analysis, Conceptualization. Xinghao Chen: Writing – original draft, Investigation, Formal analysis. Zilong Wang: Writing – review & editing, Supervision, Conceptualization. Yuxin Zhang: Writing – review & editing, Supervision, Funding acquisition, Conceptualization. Xinyan Huang: Writing – review & editing, Supervision, Funding acquisition, Conceptualization.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work is funded by the Hong Kong Research Grants Council Theme-based Research Scheme (T22-505/19-N), the National Natural Science Foundation of China (52204232) and the MTR Research Fund (PTU-23005).

Data availability

The HBDset uses the images for non-commercial research and/or educational purposes. The HBDset with images and annotations, the Python script for splitting the dataset into the train, validation and test subsets, the well-trained weights of the customised YOLOv5, YOLOv7 and YOLOv8 for human behaviour detection and the corresponding code repository links are available on GitHub ([Link]). The file [Link] serves as the user guide for the HBDset.

Supplementary materials

Supplementary material associated with this article can be found, in the online version, at doi:10.1016/[Link].2024.04.002.

References

[1] S. Grimaz, P. Malisan, A. Pividori, Sharing the post-earthquake situation for emergency response management in transborder areas: the e-Atlas tool, J. Saf. Sci. Resil. 3 (1) (2022) 72–86, [Link]
[2] Y. Zhang, X. Zhang, X. Huang, Design a safe firefighting time (SFT) for major fire disaster emergency response, Int. J. Disaster Risk Reduct. 88 (2023) 103606.
[3] A.F. Lee, A.V. Saenz, Y. Kawata, On the calibration of the parameters governing the PWRI distributed hydrological model for flood prediction, J. Saf. Sci. Resil. 1 (2) (2020) 80–90, [Link]
[4] X. Hu, H. Zhao, Y. Bai, J. Wu, Risk analysis of stampede in sporting venues based on catastrophe theory and Bayesian network, Int. J. Disaster Risk Reduct. 78 (2022) 103111, [Link]
[5] R. Lovreglio, D. Borri, L. dell'Olio, A. Ibeas, A discrete choice model based on random utilities for exit choice in emergency evacuations, Saf. Sci. 62 (2014) 418–426, [Link]
[6] R. Lovreglio, A. Fonzone, L. dell'Olio, D. Borri, A study of herding behaviour in exit choice during emergencies based on random utility theory, Saf. Sci. 82 (2016) 421–431, [Link]
[7] W. Weng, J. Wang, L. Shen, Y. Song, Review of analyses on crowd-gathering risk and its evaluation methods, J. Saf. Sci. Resil. 4 (1) (2023) 93–107, [Link]
[8] S.W. Gilbert, D.T. Butry, Identifying vulnerable populations to death and injuries from residential fires, Inj. Prev. 24 (5) (2018) 358–364.
[9] J. Koo, Y.S. Kim, B.-I. Kim, Estimating the impact of residents with disabilities on the evacuation in a high-rise building: a simulation study, Simul. Model. Pract. Theory 24 (2012) 71–83.
[10] J. Koo, Y.S. Kim, B.-I. Kim, K.M. Christensen, A comparative study of evacuation strategies for people with disabilities in high-rise building evacuation, Expert Syst. Appl. 40 (2) (2013) 408–417, [Link]
[11] X. Zhao, R. Lovreglio, D. Nilsson, Modelling and interpreting pre-evacuation decision-making using machine learning, Autom. Constr. 113 (2020) 103140, [Link]
[12] S. Huang, J. Ji, Y. Wang, W. Li, Y. Zheng, A machine vision-based method for crowd density estimation and evacuation simulation, Saf. Sci. 167 (2023) 106285, [Link]
[13] J.C.P. Cheng, K. Chen, P.K.-Y. Wong, W. Chen, C.T. Li, Graph-based network generation and CCTV processing techniques for fire evacuation, Build. Res. Inf. 49 (2) (2021) 179–196, [Link]
[14] Y. Ding, Y. Zhang, X. Huang, Intelligent emergency digital twin system for monitoring building fire evacuation, J. Build. Eng. 77 (2023) 107416, [Link]
[15] T.-Y. Lin, et al., Microsoft COCO: common objects in context, in: Computer Vision – ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6–12, 2014, Proceedings, Part V, Springer, 2014, pp. 740–755.


[16] M. Everingham, L. Van Gool, C.K.I. Williams, J. Winn, A. Zisserman, The PASCAL visual object classes (VOC) challenge, Int. J. Comput. Vis. 88 (2010) 303–338.
[17] J. Cao, Y. Pang, J. Xie, F.S. Khan, L. Shao, From handcrafted to deep features for pedestrian detection: a survey, IEEE Trans. Pattern Anal. Mach. Intell. 44 (9) (2022) 4913–4934, [Link]
[18] N. Chumuang, S. Hiranchan, M. Ketcham, W. Yimyam, P. Pramkeaw, T. Jensuttiwetchakult, Face detection system for public transport service based on scale-invariant feature transform, in: 2020 15th International Joint Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP), 2020, pp. 1–6, [Link]
[19] B. Yang, W. Zhan, P. Wang, C. Chan, Y. Cai, N. Wang, Crossing or not? Context-based recognition of pedestrian crossing intention in the urban environment, IEEE Trans. Intell. Transp. Syst. 23 (6) (2022) 5338–5349, [Link]
[20] H. Zhou, G. Yu, Research on pedestrian detection technology based on the SVM classifier trained by HOG and LTP features, Futur. Gener. Comput. Syst. 125 (2021) 604–615.
[21] R. Girshick, Fast R-CNN, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1440–1448.
[22] P. Yang, G. Zhang, L. Wang, L. Xu, Q. Deng, M.-H. Yang, A part-aware multi-scale fully convolutional network for pedestrian detection, IEEE Trans. Intell. Transp. Syst. 22 (2) (2021) 1125–1137, [Link]
[23] W. Liu, et al., SSD: single shot multibox detector, in: European Conference on Computer Vision, Springer, 2016, pp. 21–37.
[24] J. Redmon, S. Divvala, R. Girshick, A. Farhadi, You only look once: unified, real-time object detection, 2016. [Online]. Available: [Link]
[25] J. Redmon, A. Farhadi, YOLO9000: better, faster, stronger, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 7263–7271.
[26] J. Redmon, A. Farhadi, YOLOv3: an incremental improvement, arXiv preprint arXiv:1804.02767, 2018.
[27] A. Bochkovskiy, C.-Y. Wang, H.-Y.M. Liao, YOLOv4: optimal speed and accuracy of object detection, arXiv preprint arXiv:2004.10934, 2020.
[28] C.-Y. Wang, A. Bochkovskiy, H.-Y.M. Liao, YOLOv7: trainable bag-of-freebies sets new state-of-the-art for real-time object detectors, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 7464–7475.
[29] Ultralytics, YOLOv5, GitHub repository. [Online]. Available: [Link]
[30] Ultralytics, YOLOv8, GitHub repository. [Online]. Available: [Link]
[31] Z. Wang, Y. Ding, T. Zhang, X. Huang, Automatic real-time fire distance, size and power measurement driven by stereo camera and deep learning, Fire Saf. J. 140 (2023) 103891, [Link]
[32] S. Li, L. Tong, C. Zhai, Extraction and modelling application of evacuation movement characteristic parameters in real earthquake evacuation video based on deep learning, Int. J. Disaster Risk Reduct. 80 (2022) 103213, [Link]
[33] H. Ji, et al., Human abnormal behavior detection method based on T-TINY-YOLO, in: Proceedings of the 5th International Conference on Multimedia and Image Processing (ICMIP '20), Association for Computing Machinery, New York, NY, USA, 2020, pp. 1–5, [Link]
[34] H.H. Nguyen, T.N. Ta, N.C. Nguyen, V.T. Bui, H.M. Pham, D.M. Nguyen, YOLO based real-time human detection for smart video surveillance at the edge, in: 2020 IEEE Eighth International Conference on Communications and Electronics (ICCE), 2021, pp. 439–444, [Link]
[35] G. Overett, L. Petersson, N. Brewer, L. Andersson, N. Pettersson, A new pedestrian dataset for supervised learning, in: 2008 IEEE Intelligent Vehicles Symposium, 2008, pp. 373–378, [Link]
[36] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, Q. Tian, Scalable person re-identification: a benchmark, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1116–1124.
[37] C. Gu, et al., AVA: a video dataset of spatio-temporally localized atomic visual actions, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6047–6056.
[38] D. Li, Z. Zhang, X. Chen, K. Huang, A richly annotated pedestrian dataset for person retrieval in real surveillance scenarios, IEEE Trans. Image Process. 28 (4) (2018) 1575–1590.
[39] X. Liu, et al., Hydraplus-net: attentive deep features for pedestrian analysis, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 350–359.
[40] Y. Deng, P. Luo, C.C. Loy, X. Tang, Pedestrian attribute recognition at far distance, in: Proceedings of the 22nd ACM International Conference on Multimedia, 2014, pp. 789–792.
[41] S. Zhang, R. Benenson, B. Schiele, CityPersons: a diverse dataset for pedestrian detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3213–3221.
[42] D. Sharma, T. Hade, Q. Tian, Comparison of deep object detectors on a new vulnerable pedestrian dataset, arXiv preprint arXiv:2212.06218, 2022.
[43] W. Li, R. Zhao, X. Wang, Human reidentification with transferred metric learning, in: Computer Vision – ACCV 2012: 11th Asian Conference on Computer Vision, Daejeon, Korea, November 5–9, 2012, Revised Selected Papers, Part I, Springer, 2013, pp. 31–44.
[44] D. Tzutalin, LabelImg, GitHub repository, 2015.
[45] A. Paszke, et al., PyTorch: an imperative style, high-performance deep learning library, Adv. Neural Inf. Process. Syst. 32 (2019).
[46] S.J. Pan, Q. Yang, A survey on transfer learning, IEEE Trans. Knowl. Data Eng. 22 (10) (2010) 1345–1359, [Link]
[47] Z. Zou, K. Chen, Z. Shi, Y. Guo, J. Ye, Object detection in 20 years: a survey, Proc. IEEE, 2023.
[48] N. Wojke, A. Bewley, D. Paulus, Simple online and realtime tracking with a deep association metric, in: 2017 IEEE International Conference on Image Processing (ICIP), 2017, pp. 3645–3649, [Link]
[49] A. Bewley, Z. Ge, L. Ott, F. Ramos, B. Upcroft, Simple online and realtime tracking, in: 2016 IEEE International Conference on Image Processing (ICIP), 2016, pp. 3464–3468, [Link]

