Proceedings Article | 22 April 2022
KEYWORDS: Data modeling, Machine learning, Lawrencium, Performance modeling, Statistical modeling, Statistical analysis, Expectation maximization algorithms, Data analysis, Diagnostics, Analytical research
The prediction of depression in relation to safety behaviour during COVID-19 has seldom been investigated, and the effect of balancing the data with re-sampling methods has hardly been discussed. This paper investigates prediction performance using data collected from 26 countries, taking into account the effect of potential influencing factors. Specifically, the data were retrieved from the open-source dataset compiled by IGHI at Imperial College London, containing 384,250 valid individuals with measurements of age, gender, country, COVID status, employment status and behaviour score. Five machine-learning methods, namely logistic regression, MLR, random forest (RF), support vector machine (SVM) and k-nearest neighbours (k-NN), were compared on several statistical performance metrics. Based on the six chosen latent factors, RF is evaluated as the optimal model, with the highest F1 score (0.787) and G-mean (0.503) without re-sampling. Linear SVM, on the other hand, has the highest specificity (0.998) on the original data. Furthermore, although oversampling and undersampling increase sensitivity, they reduce prediction accuracy to a nearly random value (0.5). Overall, RF without re-sampling is considered the comparatively best model, with the highest sensitivity, precision, F1 score and G-mean among the five algorithms; its high sensitivity in particular limits the false negative rate, so that fewer patients with depression go unidentified. These results shed light on the choice of model for predicting depression status under different scenarios.
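The metrics named in the abstract (sensitivity, specificity, precision, F1 score and G-mean) can all be derived from the four cells of a binary confusion matrix. The following is a minimal sketch of those standard definitions, not the authors' evaluation code; the function name `binary_metrics` and the example counts are hypothetical and chosen only for illustration.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute standard binary-classification metrics from confusion-matrix counts.

    tp/fp/tn/fn are the numbers of true positives, false positives,
    true negatives and false negatives (hypothetical inputs here).
    """
    # Sensitivity (recall): share of actual positives that were detected.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # Specificity: share of actual negatives that were correctly rejected.
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    # Precision: share of predicted positives that are truly positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # F1 score: harmonic mean of precision and sensitivity.
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    # G-mean: geometric mean of sensitivity and specificity.
    g_mean = math.sqrt(sensitivity * specificity)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
        "g_mean": g_mean,
        "accuracy": accuracy,
    }


# Illustrative counts only; these are not taken from the paper.
metrics = binary_metrics(tp=80, fp=20, tn=60, fn=40)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because the G-mean multiplies sensitivity by specificity, a model with very high specificity can still have a low G-mean if its sensitivity is poor, which is why the two measures are reported side by side.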