In response to the critical need for timely and precise detection of lung lesions, we explored an innovative active learning approach for optimally selecting training data for deep-learning segmentation of computed tomography scans from nonhuman primates. Our guiding hypothesis was that by maximizing the information within a training set, accomplished by choosing images uniformly distributed in n-dimensional radiomic feature space, we could attain segmentation performance similar or superior to that of random dataset selection while using fewer labeled images. To test this hypothesis, we compared segmentation models trained on different subsets of the available training data: subsets that maximized diversity among datasets (i.e., diverse data), subsets that minimized diversity among datasets (i.e., concentrated data), and randomly chosen subsets (i.e., random data). A two-tiered feature-selection technique was used to reduce the radiomic feature space to reliable, relevant, and non-redundant features. We generated learning curves to assess model performance as a function of the number of training samples. We found that models trained on uniformly distributed (diverse) data consistently outperformed those trained on concentrated data, achieving higher median test Dice scores with lower variance. These results suggest that active learning and intelligent selection of data that are diverse and uniformly distributed within a radiomic feature space can significantly enhance segmentation model performance. This improvement has substantial implications for optimizing lung lesion characterization, disease management, and treatment evaluation, and it underscores the potential benefit of active learning and intelligent data selection in medical imaging segmentation tasks.
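To make the selection idea concrete, the following minimal Python sketch shows one common way to choose a diverse, well-spread training subset in a reduced radiomic feature space: greedy farthest-point (maximin) sampling. This is an illustrative assumption, not the authors' exact procedure; the feature matrix, subset size, and Euclidean distance metric are placeholders.

```python
# Illustrative sketch only: selecting a "diverse" training subset by greedy
# farthest-point sampling in a reduced radiomic feature space. The features,
# subset size, and distance metric here are assumptions for demonstration.
import numpy as np

def select_diverse_subset(features: np.ndarray, n_select: int, seed: int = 0) -> list:
    """Greedily pick case indices whose feature vectors are maximally spread out.

    features : (n_samples, n_features) array of standardized radiomic features.
    n_select : number of training cases to keep.
    """
    rng = np.random.default_rng(seed)
    n = features.shape[0]
    selected = [int(rng.integers(n))]                      # start from a random case
    dist_to_set = np.linalg.norm(features - features[selected[0]], axis=1)
    for _ in range(n_select - 1):
        nxt = int(np.argmax(dist_to_set))                  # farthest from current subset
        selected.append(nxt)
        # distance of every case to its nearest already-selected case
        dist_to_set = np.minimum(
            dist_to_set, np.linalg.norm(features - features[nxt], axis=1)
        )
    return selected

# Hypothetical usage: 200 scans described by 12 retained radiomic features.
X = np.random.default_rng(1).normal(size=(200, 12))
print(select_diverse_subset(X, n_select=20))
```

A "concentrated" subset, by contrast, could be formed by the opposite criterion (repeatedly adding the case closest to the current subset), and a "random" subset by uniform sampling of indices.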