Background: Lung cancer is one of the most common cancers in the United States and the deadliest, with 142,670 deaths in 2019. Accurately determining tumor response is critical to clinical treatment decisions and ultimately affects patient survival. To better differentiate between non-small cell lung cancer (NSCLC) responders and non-responders to therapy, radiomic analysis is emerging as a promising approach for identifying associated imaging features that are undetectable by the human eye. However, the plethora of variables extracted from an image may actually undermine the performance of computer-aided prognostic assessment, a problem known as the curse of dimensionality. In the present study, we show that correlation-driven hierarchical clustering improves feature selection and dimensionality reduction for high-dimensional radiomics data used to predict overall survival in NSCLC patients.

Methods: To select features from high-dimensional radiomics data, a correlation-incorporated hierarchical clustering algorithm automatically categorizes features into several groups. The truncation distance in the resulting dendrogram controls how the features are categorized. Low-rank dimensionality reduction is then applied within each cluster, providing descriptive features for Cox proportional hazards (CPH)-based survival analysis. Using a publicly available NSCLC radiogenomic dataset of 204 patients’ CT images, 429 established radiomics features were extracted. Low-rank dimensionality reduction via principal component analysis (PCA) was employed (k = 1, n < 1) to find the representative component of each cluster of features, and cluster robustness was calculated using the relative weighted consistency metric.

Results: Hierarchical clustering categorized radiomic features into several groups without specifying the number of clusters in advance, using correlation distance as the distance function and truncating the resulting dendrogram at different distances. The dimensionality was reduced from 429 to 67 features (for a truncation distance of 0.1). Within-cluster feature robustness ranged from -1.12 to -30.02 for truncation distances of 0.1 to 1.8, respectively, indicating that robustness decreases as the truncation distance increases and fewer feature classes (i.e., clusters) are selected. The best multivariate CPH survival model had a C-statistic of 0.71 for a truncation distance of 0.1, outperforming conventional PCA approaches by 0.04, even when the same number of principal components was used.

Conclusions: The truncation distance of the correlative hierarchical clustering algorithm is directly associated with the robustness of the selected feature clusters, and the algorithm can effectively reduce feature dimensionality while improving outcome prediction.
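
The clustering and per-cluster reduction described in the Methods can be illustrated with a minimal sketch. It is not the study's implementation: the function name `cluster_and_reduce`, the average-linkage choice, the z-score standardization, and the synthetic data are all assumptions made for illustration, and the relative weighted consistency metric and the CPH model fit are omitted.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist


def cluster_and_reduce(X, truncation_distance=0.1, method="average"):
    """Group correlated feature columns of X, then keep one PC per group.

    X is an (n_samples, n_features) array. Returns (Z, labels), where Z has
    one column per cluster and labels[j] is the cluster id of feature j.
    The linkage method is an assumption; the abstract does not state it.
    """
    # Standardize each feature so no single scale dominates (assumption).
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    # Correlation distance (1 - Pearson r) between features, i.e. columns.
    distances = pdist(Xs.T, metric="correlation")
    tree = linkage(distances, method=method)
    # Cut the dendrogram at a fixed height; no cluster count is preset.
    labels = fcluster(tree, t=truncation_distance, criterion="distance")
    components = []
    for cluster_id in np.unique(labels):
        block = Xs[:, labels == cluster_id]
        # First principal component (k = 1) from the SVD of the centered block.
        u, s, _ = np.linalg.svd(block, full_matrices=False)
        components.append(u[:, 0] * s[0])
    return np.column_stack(components), labels


# Synthetic stand-in for a radiomics matrix: 204 samples, 12 features built
# from 3 independent latent signals (4 noisy copies of each).
rng = np.random.default_rng(0)
latent = rng.normal(size=(204, 3))
X = np.repeat(latent, 4, axis=1) + 0.05 * rng.normal(size=(204, 12))

Z, labels = cluster_and_reduce(X, truncation_distance=0.1)
print(X.shape, "->", Z.shape)
```

Raising `truncation_distance` merges more features into each cluster and so yields fewer components. In a survival workflow, the columns of `Z` would then serve as covariates for a Cox proportional hazards model.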