Human activities play a vital role in many health-related fields of research. For example, changes in activities of daily living can predict neurodegenerative diseases such as Alzheimer’s disease or the risk of suffering a fall. Automatically recognizing activities in videos has therefore become of scientific interest. In contrast to wearable devices, video cameras are truly unobtrusive, requiring no physical contact; they operate continuously, since they do not need charging; and they do not depend on the user remembering to wear a device. This work proposes a novel approach for multi-camera, multi-person human activity recognition (HAR) in videos. The aim is to classify each person in each frame of a video into one of five classes: “Lying”, “Sitting”, “Standing”, “Walking”, or “Falling”. We use a combination of YOLOv4 (person detection), DeepSort (tracking), a convolutional neural network (CNN, feature extraction), and an attention-based multi-layer long short-term memory (LSTM) network (classification) to track and classify the actions of multiple subjects. Our combined dataset comprises four publicly available datasets (ETRI - Activity 3D, MSR Daily Activity 3D, HAR-UP Fall Dataset, High-Quality Fall Simulation Data) with a total file size of 300 gigabytes. In our experiments, we pick random subsets of 1% and 0.25% of the data for training and testing, respectively. We achieve a classification accuracy of 95%. When only a single subject is present in the room, predictions from multiple cameras are combined using soft voting, which further improves the accuracy. In summary, HAR in videos is feasible using the proposed combination of machine learning techniques.
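
The classification stage can be sketched as follows: per-frame CNN features of one tracked person are fed into a multi-layer LSTM, and an attention mechanism weights the time steps before a linear layer produces logits for the five classes. This is a minimal sketch in PyTorch; the feature dimension (1280), hidden size (256), number of layers (2), window length (16 frames), and the simple additive attention over time steps are assumptions not specified in the abstract.

```python
import torch
import torch.nn as nn


class AttentionLSTMClassifier(nn.Module):
    """Classifies a tracked person's activity from a sequence of per-frame CNN features."""

    def __init__(self, feat_dim=1280, hidden_dim=256, num_layers=2, num_classes=5):
        super().__init__()
        # Assumed sizes; the abstract does not specify the architecture details.
        self.lstm = nn.LSTM(feat_dim, hidden_dim, num_layers=num_layers, batch_first=True)
        self.attn = nn.Linear(hidden_dim, 1)            # scores each time step
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        # x: (batch, time, feat_dim) CNN features of one tracked person
        out, _ = self.lstm(x)                            # (batch, time, hidden_dim)
        weights = torch.softmax(self.attn(out), dim=1)   # attention weights over time
        context = (weights * out).sum(dim=1)             # (batch, hidden_dim) weighted summary
        return self.classifier(context)                  # (batch, 5) class logits


# Example: a 16-frame feature sequence for one person track
model = AttentionLSTMClassifier()
logits = model(torch.randn(1, 16, 1280))
probs = torch.softmax(logits, dim=-1)                    # probabilities over the 5 classes
```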
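
For the multi-camera case with a single subject, soft voting averages the per-camera class probabilities and selects the most probable class. A minimal sketch, assuming each camera yields a probability vector over the five classes for the same frame; the function name soft_vote and the three-camera example values are illustrative.

```python
import numpy as np

ACTIVITIES = ["Lying", "Sitting", "Standing", "Walking", "Falling"]


def soft_vote(per_camera_probs):
    """Fuse class probabilities from several cameras by averaging them (soft voting)."""
    fused = np.mean(np.asarray(per_camera_probs), axis=0)   # (num_classes,)
    return ACTIVITIES[int(np.argmax(fused))], fused


# Example: three cameras observing the same (single) subject
label, fused = soft_vote([
    [0.10, 0.05, 0.70, 0.10, 0.05],   # camera 1
    [0.05, 0.10, 0.60, 0.20, 0.05],   # camera 2
    [0.15, 0.05, 0.55, 0.20, 0.05],   # camera 3
])
print(label)   # "Standing"
```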