Evaluating medical imaging algorithm performance on a test set can yield biased results, especially when the number of images is small. In the case of the Medical Imaging Data and Resource Center (midrc.org) sequestered imaging data commons, developers may seek evaluation of successive iterations of an algorithm on additional test subsets drawn from the sequestered commons, allowing repeat testing but also potentially enabling learning of the sequestered commons when test samples overlap. We developed a method to measure image reuse across test subsets and to evaluate the impact of the degree of image reuse on over- or under-estimation of performance, using the load factor, a metric from hash-table methodology that summarizes the average number of test subset pairings per image. We established a relationship between the standard error of the area under the receiver operating characteristic curve (AUC) and the load factor, and compared this relationship to the interquartile range of AUC for an image-derived predictor of COVID-19 severity on chest radiographs. As expected, AUC variation was inversely related to load factor while image reuse increased with load factor, and the predicted and observed relationships between AUC variation and load factor agreed closely. Notably, low AUC variation was observed at load factors well above 1, the value typically described as optimal in the hash-table literature. These results extend the use of the load factor to the characterization of stand-alone test sets, supporting future work on operationalizing the use of sequestered test subsets for algorithm evaluation.
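To make the load-factor idea concrete, the following is a minimal Python sketch. It assumes a hypothetical pool of 5,000 sequestered images and twenty 500-image test subsets, and it computes the load factor in the hash-table sense as total image draws divided by pool size; the paper's exact definition of "test subset pairings per image" may differ, so treat this as an illustration rather than the authors' method.

import numpy as np

rng = np.random.default_rng(0)

def load_factor(subsets, pool_size):
    """Load factor in the hash-table sense: total image draws across
    all test subsets divided by the number of distinct images available.
    (Assumed definition for illustration only.)"""
    total_draws = sum(len(s) for s in subsets)
    return total_draws / pool_size

# Hypothetical sequestered commons of 5,000 images; each test subset
# draws 500 images without replacement, but images may recur across subsets.
pool_size = 5000
subsets = [rng.choice(pool_size, size=500, replace=False) for _ in range(20)]

lf = load_factor(subsets, pool_size)  # 20 * 500 / 5000 = 2.0

# Per-image reuse: the number of subsets each image appears in.
counts = np.bincount(np.concatenate(subsets), minlength=pool_size)
print(f"load factor: {lf:.2f}, mean reuse per image: {counts.mean():.2f}")

At a load factor of 2.0, each image is used in two test subsets on average, which is the kind of reuse the abstract relates to the standard error of the AUC.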