Submitted to the School of Engineering and Digital Sciences in partial fulfillment of the requirements for the degree of

Submitted to the School of Engineering and Digital Sciences on April 29, 2022, in partial fulfillment of the requirements for the degree. I would also like to express my appreciation for the assistance provided by Enver Ever and Hakan Yatbaz, and by the Computer Science Department of Nazarbayev University. When it comes to monitoring the health of the elderly, one of the riskiest scenarios is a fall, in which the body strikes the ground violently [19].

Falls are the most common cause of injury in the elderly and, according to population-based studies, are responsible for over 60% of significant injuries sustained by people of all ages [19]. To provide the necessary care and protection, the user's actions must first be identified, which is one of the challenges for intelligent systems. Popular classification methods for HAR include Random Forest, SVM, CNN, LSTM, and k-Nearest Neighbors [10].
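
As a rough illustration of the classical end of this toolbox, the sketch below trains two of these classifiers on hand-crafted window features. The feature dimensionality, class count, and synthetic data are illustrative assumptions, not settings from this work.

```python
# Illustrative only: classical HAR baselines on hand-crafted window features.
# The synthetic arrays stand in for real accelerometer-derived features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 12))    # e.g. mean/std/min/max per accelerometer axis
y = rng.integers(0, 5, size=1000)  # five hypothetical activity labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

for clf in (RandomForestClassifier(n_estimators=100, random_state=0),
            KNeighborsClassifier(n_neighbors=5)):
    clf.fit(X_tr, y_tr)
    print(type(clf).__name__, clf.score(X_te, y_te))
```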

There is an emerging trend to design hybrid action recognition systems in response to concerns that fall detection systems using only inertial sensors produce too many false alarms, while video-based activity recognition systems lack precision and accuracy. Our goal is to build a state-of-the-art inertial sensor-based activity recognition model that incorporates falls. Much work has been done on activity recognition from inertial or video sensors.

Experiments show that both models achieve state-of-the-art accuracy scores for their tasks.

HAR from inertial sensor data

HAR from video data

The proposed ConvLSTM model achieved 95% accuracy on the UMAFall dataset and 98.39% on the SisFall dataset, making it state-of-the-art for both datasets. Using the accelerometer built into smartphones has become a desirable solution for human activity recognition, as it is practical and easy to implement. The proposed CNN achieved an accuracy rate of 82.41% on DMLSmartActions, becoming the state-of-the-art for this dataset.
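
Before any such model is trained, smartphone accelerometer streams are typically segmented into fixed-length, overlapping windows. The sketch below shows this common preprocessing step; the window length, stride, and sampling rate are assumptions, not the values used in this work.

```python
import numpy as np

def sliding_windows(signal: np.ndarray, window: int = 128, stride: int = 64) -> np.ndarray:
    """Cut a (timesteps, channels) signal into overlapping windows.

    Returns an array of shape (num_windows, window, channels).
    """
    starts = range(0, len(signal) - window + 1, stride)
    return np.stack([signal[s:s + window] for s in starts])

# Example: ~10 s of a tri-axial accelerometer sampled at ~50 Hz (synthetic).
stream = np.random.randn(500, 3)
windows = sliding_windows(stream)
print(windows.shape)  # (6, 128, 3) with these settings
```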

One published paper presents a real-time multi-person activity recognition system based on deep learning. It is a complex system that captures a video stream of a scene, uses YOLOv3 to identify the bounding-box coordinates of each person in the frame, applies the FaceNet approach for face recognition, and can perform automatic "zooming" when people in the frame are too far from the camera. A similar work implements an action recognition and fall detection system that uses YOLOv3 to detect objects in the frame.

YCL achieved an accuracy rate of 93.74% on a dataset collected and developed by the authors. Additionally, the authors developed a smartphone application that sends an alert SMS to the subject's relatives once a fall is confirmed.

Hybrid HAR approaches

Data description

Inertial sensor data or video frames are required to train the human activity recognition model. The DMLSmartActions dataset from the University of British Columbia's Digital Multimedia Lab is also used, as an alternative dataset, to create the CNN model. Five separate areas of the body (neck, wrist, waist, ankle, and pocket) were equipped with accelerometers and gyroscopes.

For the video-based activity recognition module, we only used image frames from one of the two RGB cameras.

Figure 3-1: Some human activities from the combined dataset, captured by a Kinect sensor and two HD streams from RGB cameras [6].

Proposed architectures

If a fall is not confirmed, the camera is turned off, while the accelerometer sensors continue to collect data and analyze the person's actions. Furthermore, because the camera does not run all the time, the memory footprint is reduced. This enables the proposed system to work more efficiently and robustly than the systems presented in the existing literature.
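
The duty-cycling logic described above can be sketched as follows. Every helper here (the threshold test, the camera interface, the alert) is a hypothetical stand-in for the actual learned components of the proposed system.

```python
# Sketch of the hybrid pipeline's control flow; all helpers are hypothetical.
import numpy as np

ACC_THRESHOLD_G = 2.5  # assumed impact threshold; the thesis uses a learned model


def suspected_fall(acc_window: np.ndarray) -> bool:
    """Placeholder for the inertial (accelerometer) activity model."""
    return float(np.max(np.linalg.norm(acc_window, axis=1))) > ACC_THRESHOLD_G


class Camera:
    """Dummy camera that is only powered on to verify a suspected fall."""

    def turn_on(self): print("camera on")
    def turn_off(self): print("camera off")
    def confirm_fall(self) -> bool:
        return True  # stand-in for the vision-based fall classifier


def monitor(acc_stream, camera: Camera) -> None:
    for window in acc_stream:          # accelerometer runs continuously
        if suspected_fall(window):
            camera.turn_on()           # camera wakes only to verify the event
            confirmed = camera.confirm_fall()
            camera.turn_off()          # and goes back off to save memory/power
            if confirmed:
                print("fall confirmed: raise alert")  # e.g. SMS to relatives


monitor([np.random.randn(128, 3) * 2.0], Camera())
```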

The development of deep learning-based approaches has led to significant improvements in the accuracy of machine learning models. In particular, convolutional neural networks (CNNs) are considered among the best methods for image classification. A CNN is a specific artificial neural network architecture designed to process data in pixel format [22].

We chose to develop a convolutional neural network from scratch so that we could adapt the architecture of our CNN model to the datasets we used. Hyperparameter optimization is applied to determine an optimal architecture for the model. In addition to hyperparameter tuning, different models with different architectures are investigated to find the best-performing one.
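
Purely as an illustration of the kind of small Keras CNN being tuned, the sketch below builds one candidate architecture; the layer counts, sizes, input resolution, and class count are arbitrary choices, not those of the final model in Figure 3-4.

```python
import tensorflow as tf

# Illustrative small CNN; the tuned architecture in the thesis differs.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(72, 72, 3)),        # assumed input resolution
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(4, activation="softmax"),   # assumed class count
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```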

Transfer learning is a technique in which the features of a model pre-trained on one task are reused in a new machine learning model. We used ResNet50, trained on ImageNet, and added a dense layer with 128 nodes and a ReLU activation function at the end. Another type of deep neural network architecture, now gaining popularity in various challenging tasks, is the transformer.
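
The described setup corresponds to a standard Keras transfer-learning pattern, sketched below. The ResNet50 backbone, ImageNet weights, and the 128-node ReLU dense layer follow the text; the input resolution, class count, and the decision to freeze the backbone are our assumptions.

```python
import tensorflow as tf

# ResNet50 backbone pre-trained on ImageNet, as described in the text.
base = tf.keras.applications.ResNet50(weights="imagenet", include_top=False,
                                      pooling="avg", input_shape=(224, 224, 3))
base.trainable = False  # assumption: freeze the pre-trained features

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(128, activation="relu"),   # dense layer from the text
    tf.keras.layers.Dense(4, activation="softmax"),  # assumed class count
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```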

A transformer weighs the significance of each part of the incoming data separately using the self-attention mechanism [14]. The system takes a 72-by-72-pixel image and transforms it into a grid of 6-by-6 patches, which is fed into our vision transformer model. The architecture of the vision transformer model is similar to the approach presented in [14].
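
Concretely, a 72-by-72 input split into a 6-by-6 grid yields 36 patches of 12 by 12 pixels. A minimal NumPy version of this patching step is shown below; flattening each patch into a vector is an assumption about the usual vision transformer pipeline, not a detail stated in the text.

```python
import numpy as np

def to_patches(image: np.ndarray, grid: int = 6) -> np.ndarray:
    """Split an (H, W, C) image into grid*grid flattened patches."""
    h, w, c = image.shape
    ph, pw = h // grid, w // grid            # 72 // 6 = 12-pixel patches
    patches = image.reshape(grid, ph, grid, pw, c).transpose(0, 2, 1, 3, 4)
    return patches.reshape(grid * grid, ph * pw * c)

img = np.random.rand(72, 72, 3)
print(to_patches(img).shape)  # (36, 432): 6*6 patches of 12*12*3 values
```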

Figure 3-4: CNN model architecture for vision-based HAR

Inertial sensor based activity recognition

Data description

Feature selection

Long short-term memory

Hybrid approach for human activity recognition

  • Data description
  • Preprocessing
  • Feature level fusion
  • Combined dataset for fall detection
  • DMLSmartActions dataset
  • UP-Fall dataset

The optical flow was calculated using the method of Horn and Schunck, described in [17]. For the combined fall detection dataset, the hyperparameters tuned via TensorBoard are the number of convolutional layers (from one to three), the size of the convolutional layers, and the number of dense layers (from zero to two). Class weights are assigned to each image class because of the unbalanced nature of the dataset.
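
Class weights of this kind are commonly derived from inverse class frequencies. A minimal sketch using scikit-learn follows; the label distribution is hypothetical.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels: 0 = "fallen", 1 = "dormant", 2 = "sitting".
y = np.array([0] * 500 + [1] * 450 + [2] * 150)

weights = compute_class_weight("balanced", classes=np.unique(y), y=y)
class_weight = dict(enumerate(weights))
print(class_weight)  # rarer classes receive larger weights
# In Keras: model.fit(X, y, class_weight=class_weight, ...)
```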

An alternative technique for balancing the dataset is a hybrid approach that directly changes the number of samples in each class. The numbers of images in the "fallen" and "dormant" classes are comparable and close to the average. The first step was to double the number of samples in the "sitting" class by flipping each image horizontally.
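
Doubling a class by horizontal flipping takes only a few lines; the array shapes below are illustrative, not those of the actual dataset.

```python
import numpy as np

# Hypothetical stack of "sitting" images, shape (N, H, W, C).
sitting = np.random.rand(100, 72, 72, 3)

flipped = sitting[:, :, ::-1, :]                 # mirror each image left-right
sitting_doubled = np.concatenate([sitting, flipped], axis=0)
print(sitting_doubled.shape)  # (200, 72, 72, 3): class size doubled
```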

The results with this dataset are similar to those obtained using class weights. Figure 4-2 shows the average confusion matrix of the model for the case where the hybrid approach is used to balance the dataset. The DMLSmartActions dataset from the Digital Multimedia Lab at the University of British Columbia is also used as an alternative dataset to develop the CNN model.

The final architecture of the CNN model differs slightly from the original architecture shown in Fig. We again assigned class weights to each class to handle the unbalanced nature of this dataset. This time we performed 10-fold cross-validation, as the larger dataset provides more samples for testing and thus allows training across a larger number of folds.
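
A sketch of this 10-fold protocol with per-fold class weights is given below. The stand-in model, data shapes, and training settings are assumptions; only the fold count and the use of class weights follow the text.

```python
import numpy as np
import tensorflow as tf
from sklearn.model_selection import StratifiedKFold
from sklearn.utils.class_weight import compute_class_weight

X = np.random.rand(500, 72, 72, 3).astype("float32")  # hypothetical images
y = np.random.randint(0, 4, size=500)                  # hypothetical labels

def build_model() -> tf.keras.Model:
    # Tiny stand-in for the CNN described above.
    m = tf.keras.Sequential([
        tf.keras.layers.Input((72, 72, 3)),
        tf.keras.layers.Conv2D(16, 3, activation="relu"),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(4, activation="softmax"),
    ])
    m.compile("adam", "sparse_categorical_crossentropy", metrics=["accuracy"])
    return m

scores = []
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, test_idx in kfold.split(X, y):
    w = compute_class_weight("balanced", classes=np.unique(y), y=y[train_idx])
    model = build_model()  # fresh model per fold
    model.fit(X[train_idx], y[train_idx], class_weight=dict(enumerate(w)),
              epochs=1, verbose=0)
    scores.append(model.evaluate(X[test_idx], y[test_idx], verbose=0)[1])
print(f"mean accuracy over 10 folds: {np.mean(scores):.3f}")
```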

The results show that the model is able to classify most actions with an accuracy of about 87%. Figure 4-3 shows the confusion matrix for one of the folds of the proposed CNN model on the DMLSmartActions dataset. Most of the misclassifications involve the "walk" and "nothing" classes, which is understandable: "walk" contains many samples in which the subjects are moving out of the frame with only small parts of their bodies visible, making it look as if nobody is in the frame.

Figure 3-8: Feature level fusion

Inertial sensor based module

In the transformer model, we used the AdamW optimizer, an Adam optimizer with weight decay. Since transformers have recently become so popular in image classification, it is not surprising that they outperformed the CNN and became our best model.
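
In Keras, this amounts to swapping in the AdamW optimizer, roughly as follows; the learning rate and weight-decay values are assumptions.

```python
import tensorflow as tf

# AdamW: Adam with decoupled weight decay (available in TF >= 2.11;
# older versions expose it via tensorflow_addons instead).
optimizer = tf.keras.optimizers.AdamW(learning_rate=1e-4, weight_decay=1e-4)
# model.compile(optimizer=optimizer, loss="sparse_categorical_crossentropy",
#               metrics=["accuracy"])
```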

Multimodal system

In contrast, our models perform classification across all 11 classes and include the only multimodal activity recognition model trained on UP-Fall. However, we recognize a problem common to most works on activity recognition, whether from inertial or video data: to properly evaluate a model, the out-of-sample data must be properly separated.

Instead, the data must be preprocessed in such a way that the sequences used for testing are set aside before the data is shuffled. Furthermore, for temporal and time-series data, the out-of-sample portion must come from the future relative to the training data. This is why, during the preprocessing of our sequences, we allocated the last 10% of the data for testing.
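
For time-ordered sequences, this amounts to slicing off the tail of the recording before any shuffling, as in the minimal sketch below.

```python
import numpy as np

def temporal_split(sequences: np.ndarray, test_frac: float = 0.1):
    """Hold out the last fraction of the (time-ordered) sequences for testing."""
    cut = int(len(sequences) * (1 - test_frac))
    return sequences[:cut], sequences[cut:]  # train on the past, test on the future

data = np.arange(100)         # stand-in for time-ordered windows
train, test = temporal_split(data)
print(len(train), len(test))  # 90 10 -> last 10% kept for testing
```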

We argue that this is the most realistic way to test out of sample. As a result, our multimodal activity recognition model achieved an accuracy of 85.84% on multiclass classification (11 labels) with this preprocessing method. The model combines features obtained from an LSTM for the inertial data and a ConvLSTM for the vision-based approach.
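
A schematic of this feature-level fusion in Keras is sketched below. Only the two branch types (LSTM for inertial data, ConvLSTM for video) and the 11-way output follow the text; all layer sizes and input shapes are assumptions.

```python
import tensorflow as tf

# Inertial branch: LSTM over windows of multi-axis sensor readings (shape assumed).
inertial_in = tf.keras.layers.Input(shape=(128, 6), name="inertial")
inertial_feat = tf.keras.layers.LSTM(64)(inertial_in)

# Vision branch: ConvLSTM over short clips of frames (shape assumed).
video_in = tf.keras.layers.Input(shape=(8, 64, 64, 3), name="video")
video_feat = tf.keras.layers.ConvLSTM2D(16, 3)(video_in)
video_feat = tf.keras.layers.GlobalAveragePooling2D()(video_feat)

# Feature-level fusion: concatenate both embeddings, classify into 11 activities.
fused = tf.keras.layers.Concatenate()([inertial_feat, video_feat])
out = tf.keras.layers.Dense(11, activation="softmax")(fused)

model = tf.keras.Model([inertial_in, video_in], out)
model.compile("adam", "sparse_categorical_crossentropy", metrics=["accuracy"])
```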

Therefore, fall detection and human activity recognition are integral parts of a system designed to help prevent the severe consequences of human falls.

References

  • Human Activity Recognition in Smart Homes Based on IoT Sensor Algorithms: Taxonomies, Challenges and Opportunities with Deep Learning.
  • Optimized spatiotemporal descriptors for real-time fall detection: a comparison of support vector machine and Adaboost-based classification.
  • Fred and Hugo Gamboa, editors, Proceedings of the 15th International Joint Conference on Biomedical Engineering Systems and Technologies, BIOSTEC 2022, Volume 5: HEALTH-INF, Online Streaming, February, pages 812–820.
  • A vision-based approach for fall detection using multiple cameras and convolutional neural networks: a case study using the fall detection dataset.
  • Deep neural network-based double-check method for fall detection using IMU-L sensor and RGB camera data.

Comparison of CNN architectures for the fall dataset

Accuracy scores of the CNN for DMLSmartActions

Comparison of the Proposed Model with the Literature

Performance of the CNN model on the UP-Fall dataset's videos

Performance of different deep learning models on UP-Fall videos

Accuracy scores of the LSTM for the inertial data module

Figures

Figure 3-1: Some human activities from the combined dataset, captured by a Kinect sensor and two HD streams from RGB cameras [6].
Figure 3-2: Some human activities from DMLSmartActions
Figure 3-3: The experimental setup from the UP-Fall dataset
Figure 3-4: CNN model architecture for vision-based HAR
