Previous studies reported that the cancer subtypes radiologists struggle to detect in mammography interpretation vary across countries. However, little is known about whether such variation also exists in radiologists’ perception of local cancer-free areas. This study compared the cancer-free areas incorrectly flagged as cancer by radiologists from two populations when reading dense screening mammograms. We collected reading data from 20 Chinese and 16 Australian radiologists who had previously evaluated 60 dense screening cases. For each cohort, findings from all readers were pooled, and the local cancer-free areas classified as cancer were identified. In particular, the areas misclassified by readers from both cohorts were identified and displayed on the mammograms as overlaps. For each overlap, we calculated the error rate, defined as the proportion of readers in a cohort who failed to distinguish normality from abnormality, as a measure of the actual difficulty level for that cohort. The Spearman correlation was then used to explore whether the cohort-specific difficulty levels were correlated. A similar analysis was conducted on two geographically distant groups within China. Results showed that between Chinese and Australian radiologists, a correlation was found only in the cancer-free views of cancer cases (r=0.902, p=0.004). However, between the two groups within China, we found strong correlations in both cancer-containing (r=0.833, p=0.333) and cancer-free views (r=0.955, p=0.022) of cancer cases, despite an insignificant correlation in normal cases. In conclusion, radiologists from different populations display different error-making patterns when reading dense screening mammograms, while those with similar demographic characteristics agree to a certain degree in the areas they misclassify.
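The per-overlap error rate and the cohort-to-cohort Spearman correlation described above can be sketched as follows. This is a minimal illustration using only the Python standard library; the error counts are hypothetical and are not the study data, and the helper names (`error_rates`, `average_ranks`, `spearman`) are assumptions introduced here rather than the authors' code.

```python
def error_rates(error_counts, n_readers):
    """Proportion of readers in a cohort who flagged each overlap as cancer."""
    return [count / n_readers for count in error_counts]


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical counts of readers who misclassified each of five overlaps.
errors_cohort_a = [10, 4, 6, 2, 8]   # out of 20 readers
errors_cohort_b = [8, 2, 5, 1, 4]    # out of 16 readers

rates_a = error_rates(errors_cohort_a, n_readers=20)
rates_b = error_rates(errors_cohort_b, n_readers=16)
rho = spearman(rates_a, rates_b)
print(round(rho, 3))
```

The sketch only computes the coefficient; a p-value such as those reported in the abstract would need a significance test on top of it, which is omitted here.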
KEYWORDS: Digital breast tomosynthesis, Mammography, Education and training, Cancer, Breast cancer, Breast imaging, Diagnostics, Breast, Cancer detection, Radiology
The final stage in the medical imaging diagnostic system is the radiologist’s interpretation of the images, though research on the factors influencing performance in digital breast tomosynthesis (DBT) is inconclusive. This study seeks to understand the performance of radiologists in reading DBT images and the parameters impacting observer performance in three different countries. The study used a DBT mammogram test set to compare the performance of radiologists from Australia, China and Iran in reading thirty-five DBT cases. A range of performance metrics including specificity, sensitivity, lesion sensitivity, ROC AUC and JAFROC FOM were generated for each radiologist on completion of the test set. The radiologists also provided demographic information relating to their experience in reading digital mammograms and DBT. In each country, more radiologists had completed a breast imaging fellowship than had not. Australia had a greater percentage of radiologists who have completed training in DBT reading (Australia=88.2%), while China and Iran had a smaller percentage of radiologists who have not completed training in DBT reading (China=37%, Iran=40%). Significant differences were identified between the three countries in specificity (p=.001), lesion sensitivity (p=.016), ROC (p<.001) and JAFROC (p<.001). Australia had the highest mean value for all performance metrics, while China had the lowest mean value for all performance metrics. Among Australian radiologists, lesion sensitivity showed a moderate positive correlation with the number of years reading DBT images (r=.513, p=.042). Iranian radiologists who read more than 20 DBT cases per week obtained significantly higher lesion sensitivity (73.3% vs. 51.8%; p=.032) than those who read fewer than 20 DBT cases per week.
KEYWORDS: Digital breast tomosynthesis, Cancer, Cancer detection, Breast cancer, Education and training, Mammography, Breast, Diagnostics, Architectural distortion, Breast density
Introduction: Breast cancer is the most common cancer among women in China and early detection is key to reducing mortality. This study aimed to compare the diagnostic performance of Chinese radiologists reading FFDM (full-field digital mammography) and DBT (digital breast tomosynthesis) images in terms of lesion features and reader characteristics.
Methods: 32 Chinese radiologists read two mammogram test sets to identify cancer cases and to detect lesions. The first set was of FFDM images (60 cases, 21 cancers) and the second was of DBT images (35 cases, 15 cancers). The accuracy of radiologists in cancer case detection and lesion detection was analysed for each test set. The diagnostic performances of radiologists with different levels of working experience were also compared. Results were compared using the Wilcoxon signed-rank and Mann-Whitney U tests.
Results: Chinese radiologists recorded higher diagnostic accuracy with FFDM than DBT for detecting certain lesion types (calcifications, architectural distortion, mixed types) and lesions ≤ 10 mm. There was no significant difference in the accuracy for cancer case detection between FFDM and DBT. Radiologists who had more than eight years working experience, read more than 60 cases per week or had no DBT training had significantly higher lesion accuracy with FFDM than DBT.
Conclusion: Chinese radiologists had higher lesion accuracy with FFDM in certain lesion types and sizes than DBT. This may be related to the lack of appropriate DBT training for radiologists in China.
KEYWORDS: Digital breast tomosynthesis, Breast density, Mammography, Breast, Cancer, Education and training, Diagnostics, Breast cancer, Cancer detection, Radiology
Purpose: This study aims to investigate the diagnostic performances of Australian and Shanghai-based Chinese radiologists in reading full-field digital mammogram (FFDM) and digital breast tomosynthesis (DBT) with different levels of breast density.
Approach: Eighty-two Australian radiologists interpreted a 60-case FFDM set, and 29 radiologists also reported a 35-case DBT set. Sixty Shanghai radiologists read the same FFDM set, and 32 radiologists read the DBT set. The diagnostic performances of Australian and Shanghai radiologists were assessed using truth data (cancer cases were biopsy proven) and compared overall in specificity, case sensitivity, lesion sensitivity, receiver operating characteristics (ROC) area under the curve, and jack-knife free-response receiver operating characteristics (JAFROC) figure of merit, and they were stratified by case characteristics using the Mann–Whitney U test. The Spearman rank test was used to explore the association between radiologists’ performances and their work experience in mammogram interpretation.
Results: There were significantly higher performances of Australian radiologists compared with Shanghai radiologists in low breast density for case sensitivity, lesion sensitivity, ROC, and JAFROC in the FFDM set (P < 0.0001); in high breast density, Shanghai radiologists’ performances in lesion sensitivity and JAFROC were also lower than Australian radiologists (P < 0.0001). In the DBT test set, Australian radiologists performed better than Shanghai radiologists in cancer detection in both low and high breast density. The work experience of Australian radiologists was positively linked to their diagnostic performances, whereas this association was not statistically significant in Shanghai radiologists.
Conclusion: There were significant variations in reading performances between Australian and Shanghai radiologists in FFDM and DBT across different levels of breast density, lesion types, and lesion sizes. An effective training initiative tailored to suit local readers is essential to enhancing the diagnostic accuracy of Shanghai radiologists.
Objectives: To study the effect of the availability of prior screening mammograms on radiology trainees’ observer performance across seven unique educational test sets. Methods: Australian radiology trainees (n=150) completed 469 readings of seven educational test sets (each set with 60 cases, 40 normal and 20 cancer cases). The percentage of cases with a prior screening mammogram was 68.7%. Mammographic density (MD) evaluated via BIRADS was spread across the test sets, with 40.5% of cases having 25-50% glandular tissue (BIRADS “B”), 37.4% having 50-75% or “C”, 12.6% having >75% MD and 9.5% having the lowest MD rating “A”. Trainees were asked to score the cases on a scale of 1 (normal), 2 (benign), 3 (equivocal findings), 4 (suspicious finding) and 5 (highly suggestive of malignancy). The Mann-Whitney U test was used to compare the specificity and sensitivity of radiology trainees between cases with and without prior images. Results: Radiology trainees had significantly higher sensitivity across all MD levels when prior images were not available (A-B, P=0.006; C-D, P=0.027). Among trainees who read fewer than 20 cases per week, specificity was also significantly higher for high-MD (C-D) cases without prior images than for those with priors available (P=0.008). Conclusions: In a simulated environment, radiology trainees achieved better results in cases without prior images, especially those who read fewer than 20 cases per week. The utility of prior case inclusion when providing education and training in reading screening mammograms needs to be revisited, especially for women with high MD.
Current literature has described the usefulness of DBT in addition to FFDM because of the increase in cancer detection and decrease in recall rates. The primary limitation of using FFDM plus DBT for screening is the rise in radiation dose, which approximately doubles if both modalities are used. Consequently, synthesized two-dimensional views can be reconstructed from DBT slices with the aim of replacing FFDM. Although many studies have explored the value of DBT in addition to FFDM, little attention has been given to the value that synthesized views might offer radiologists as a supplementary view for DBT. The aim of this study was to investigate the diagnostic accuracy of radiology trainees with DBT only compared with DBT plus the synthesized view (C-View). Twenty radiology trainees were asked to report a set of 35 two-projection DBT images of left and right breasts (15 were cancer cases). Another group of 8 trainees read the same DBT set with the addition of the C-View. Participants searched for the presence of lesions within the cases using the Tabar RANZCR system, where 2 represented a benign lesion and 3-5 represented suspicion of malignancy, with a higher value indicating a higher likelihood of malignancy. The readers’ performances were compared between the two reading modes in terms of specificity, sensitivity, lesion sensitivity, ROC and JAFROC. The results demonstrated that the diagnostic metrics of participants reading DBT only were not significantly different from those of the group reading DBT plus the synthesized view (P>0.05). This finding implies that viewing DBT only could be equivalent to DBT plus C-View for radiology trainees.
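As an illustration of how case-level sensitivity and specificity can be derived from ratings of this kind, the sketch below treats a rating of 3 or above as a positive call, following the description that 3-5 indicate suspicion of malignancy. The ratings and truth labels are hypothetical, a rating of 1 is assumed to mean normal, and the function name `case_metrics` is an assumption introduced for this example; the study's own scoring software is not described in the abstract.

```python
def case_metrics(ratings, truth, threshold=3):
    """Return (sensitivity, specificity) for one reader.

    ratings: highest rating the reader gave each case (1-5).
    truth:   True for cancer cases, False for normal cases.
    A case counts as recalled when its rating is >= threshold.
    """
    tp = sum(1 for r, t in zip(ratings, truth) if t and r >= threshold)
    fn = sum(1 for r, t in zip(ratings, truth) if t and r < threshold)
    tn = sum(1 for r, t in zip(ratings, truth) if not t and r < threshold)
    fp = sum(1 for r, t in zip(ratings, truth) if not t and r >= threshold)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity


# Hypothetical reader: four cancer cases followed by six normal cases.
truth = [True, True, True, True, False, False, False, False, False, False]
ratings = [5, 3, 2, 4, 1, 2, 3, 1, 1, 4]

sensitivity, specificity = case_metrics(ratings, truth)
print(sensitivity, round(specificity, 3))
```

Lesion sensitivity and JAFROC additionally depend on whether each mark falls on the true lesion location, which this case-level sketch does not model.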
KEYWORDS: Breast, Mammography, Breast cancer, Diagnostics, Cancer, Radiology, Picture Archiving and Communication System, Breast imaging, Tissues, Receivers
Unlike Australia, China has no population-based early detection screening program, with radiological expertise being a barrier to implementation. This study explores observer performance between breast radiologists from China and Australia, and the role of peer-assisted reading in Chinese radiologists’ performance. A test set of 60 high-density screening mammograms (40 normal, 20 cancer cases) was constructed, with eight Chinese and 17 Australian radiologists reading the test set independently, while another ten Chinese radiologists read the test set as peer-duos, where discussion was encouraged but lesion marking was done separately. For independent readings by radiologists who read >20 cases per week, Chinese readers had lower performance in sensitivity, lesion sensitivity, AUC and JAFROC. There was no significant difference in performance between Chinese readers who read independently and those who used peer-assisted reading, so this strategy may have limited value in improving diagnostic efficacy.
Mammographic test sets are a simulation-based training methodology for radiologists to assess and improve their performance. However, while test-set records have indicated improvements over time in participants' performance within the tests, little is known about how those improvements translate into breast-screening readers’ performance in the clinic. This study investigated how the performance of readers who completed test-set training in the BreastScreen Reader Assessment Strategy (BREAST) platform has evolved in comparison to readers who have no history of test-set participation. Ten-year clinical audit data of 46 breast screening readers in New South Wales, Australia indicated that BREAST readers improved their positive predictive value (PPV) (p=0.001) in association with their test-set participation. They also had higher detection rates for invasive cancers (p=0.01), ductal carcinoma in situ (DCIS) (p=0.03), and the detection rate of all cancers and DCIS (p=0.01). In comparison, non-BREAST readers improved their recall rate in subsequent screens (p=0.03) and PPV (p=0.02). In conclusion, test-set participation is linked to enhanced capability of cancer detection, which may be due to the high proportion of cancer cases in the test sets in comparison to normal practice.
This study investigated whether radiologists from different countries share the same sensitivity to certain mammographic features. Retrospective data were collected from Chinese and Australian radiologists reading a high-density test set which contained 40 normal and 20 cancerous mammographic cases. Sixteen Australian radiologists, and 30 Chinese radiologists, including 18 from Nanchang and 12 from Hong Kong SAR/Shenzhen, were asked to read all images in this test set using the Royal Australian and New Zealand College of Radiologists (RANZCR) rating system and annotate the suspicious lesion(s). For each case and each radiologist group, the percentage of radiologists making the correct diagnoses was calculated. For cancer cases, we also calculated the percentage of radiologists who located the lesion correctly. Spearman correlation coefficient was used to explore the association between two radiologist groups. Data demonstrated a high correlation between Chinese and Australian radiologists in identifying cancer cases (r=0.839, p<0.0001), and locating lesions (r=0.802, p<0.0001), but no statistically significant relationship in identifying normal cases (r=0.236, p=0.142). However, between radiologists from two geographic regions of China, strong correlations were found in detecting cancer cases (r=0.686, p=0.0008), marking lesions (r=0.803, p<0.0001) and recognizing normal cases (r=0.562, p=0.0002). In conclusion, although Chinese and Australian radiologists may share the same difficulty in diagnosing and locating cancers, a difference in the challenge of identifying normal cases between them was shown. However, the performance by radiologists within China, although from different regions, remained consistent when reading high-density mammograms.
This study investigated the possibility of building an end-to-end deep learning-based model for the prediction of future breast cancer based on prior negative mammograms. We explored whether the probability of abnormal class membership given by the model was correlated with the gist of the abnormal as perceived by radiologists in negative prior mammograms. To build the model, an end-to-end network, previously developed for breast cancer detection, was fine-tuned for breast cancer prediction by using a dataset containing 650 prior mammograms from women who were diagnosed with breast cancer in a subsequent screening and 1000 cancer-free women. On a set of 630 test images, the model achieved an AUC of 0.73. For extracting gist responses, 17 experienced radiologists were recruited, viewed mammograms for 500 milliseconds and gave a score on a scale of 0-100 showing whether they would categorize the case as normal or abnormal. The image set contained 40 normal and 40 current cancer images along with 72 prior mammograms from women who would eventually develop breast cancer. We averaged the scores from the 17 readers and produced a single score per image. The network achieved an AUC of 0.75 for differentiating prior images from normal images. For the 72 prior mammograms, the output of the network was significantly correlated with the strength of the gist of the abnormal as perceived by experienced radiologists (Spearman’s correlation=0.84, p<0.01). This finding suggested that the network successfully learned the representation of the gist of the abnormal in prior mammograms as perceived by experienced radiologists.
Mammographic test sets are a prominent form of quality assurance in breast screening and they have been associated in the lab with positive changes in radiologists’ performance. Focusing on this educational value, we examined the clinical audit history of 19 participants in the BreastScreen Reader Assessment Strategy (BREAST) test sets to investigate if changes in clinical performance reflected test-set participation. Included participants were radiologists who read for BreastScreen New South Wales (NSW) in the period between 2010 and 2018 and who read on average 2000 cases or more in those years. Their audit data included 2 years before and 2 years after test-set participation. Wilcoxon signed-rank tests were used to investigate the difference in recall rates, cancer detection rates, and positive predictive value (PPV) for the cohort before and after test-set participation. The data indicated that, over time, radiologists significantly improved their recall rate (screening rounds 2+), PPV, and detection of ductal carcinoma in situ (DCIS). These results suggest that breast screen readers who participate in test-set readings improve their clinical performance.
Several previous studies have investigated the performance of radiologists in Western countries when reading 3D mammographic cases; however, the diagnostic efficacy of this modality in China is understudied. This study aimed to improve the understanding of reading performance of 3D mammography among Chinese radiologists and compare their performance with that of Australian radiologists. One test set consisting of 35 3D mammography cases was used to assess reading performance. Twelve Chinese and twelve Australian radiologists read the test set independently and provided a score of 1-5 for each perceived cancer lesion. Case sensitivity, specificity, lesion sensitivity and Area Under the receiver operating characteristic Curve (AUC) were used to assess performance, and radiologists’ characteristics were collected. Performance metrics and characteristics were compared using Mann-Whitney U tests and Fisher’s Exact tests. Higher specificity (0.65 vs 0.38, p=0.0003), lesion sensitivity (0.70 vs 0.40, p=0.0172) and AUC (0.81 vs 0.57, p=0.0001) were found in Australian radiologists compared to their Chinese counterparts. There was no significant difference in case sensitivity (0.82 vs 0.75, p=0.31). The Australian group reported more years reading 3D mammography (p=0.0194), more cases read per week (p=0.0122) and more hours per week reading 2D mammography (p=0.0094). In conclusion, Australian radiologists had higher reading performance when reading a 3D mammography test set compared to Chinese radiologists. Training and education programs in 3D mammography may effectively address this discrepancy.
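For readers unfamiliar with how an AUC can be obtained from 1-5 confidence scores, the sketch below uses the standard rank-based (Mann-Whitney) estimate: the probability that a randomly chosen cancer case is scored higher than a randomly chosen normal case, counting ties as one half. The scores are hypothetical, and reducing each case to its highest lesion score (with 1 for unmarked cases) is an assumption made for this example, not a detail stated in the abstract.

```python
def roc_auc(scores_cancer, scores_normal):
    """Empirical ROC AUC from ordinal scores via pairwise comparison.

    Each (cancer, normal) pair contributes 1 if the cancer case scored
    higher, 0.5 if the scores tie, and 0 otherwise.
    """
    wins = 0.0
    for c in scores_cancer:
        for n in scores_normal:
            if c > n:
                wins += 1.0
            elif c == n:
                wins += 0.5
    return wins / (len(scores_cancer) * len(scores_normal))


# Hypothetical case-level scores for one reader.
scores_cancer = [5, 3, 2, 4]
scores_normal = [1, 2, 3, 1, 1, 4]

auc = roc_auc(scores_cancer, scores_normal)
print(auc)
```

This pairwise form is quadratic in the number of cases, which is negligible for a 35-case test set.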
This study explored whether better performance under usual presentation conditions, more years of experience, and a higher volume of annual mammogram assessment make a radiologist better at perceiving the gist of the abnormal on a mammogram. Nineteen radiologists were recruited for two experiments. In the first (gist experiment), the initial impressions of the radiologists were collected based on a half-second image presentation on a scale from 0 (confident normal) to 100 (confident abnormal). In the second, radiologists viewed a similar set of cases using the BreastScreen Reader Assessment Strategy platform and rated each case on a scale of 1-5. Using Spearman correlation, we explored whether the areas under the receiver operating characteristic curve (AUC) in the two experiments were correlated. Radiologists were also grouped based on variables describing their experience levels and workload, and their performance in both experiments was compared among the groups. The AUC values in the gist experiment were not significantly correlated with the AUC values in the normal reporting experiment (Spearman correlation=0.183, p-value=0.453). Radiologists’ performance under normal reporting conditions was linked to the number of cases read per week (p=0.044), the number of hours per week currently spent reading mammograms (p=0.028), and the number of years they had been reading mammograms (p=0.041). However, none of the variables reached a p-value<0.05 for the AUC of the gist experiment. The results suggest that further studies should be done to establish relationships between the gist response and radiologists’ characteristics, since being a high-performing radiologist, being highly experienced, or reading a high volume of mammograms does not indicate superior capability when perceiving the gist of the abnormal.
Considering the rapid rise in breast cancer incidence in China and the lack of calibrated breast cancer prediction models for the Chinese female population, developing a breast cancer risk model for Chinese women is necessary. This study aimed at generating a breast cancer risk prediction model for Chinese women. A total of 1079 women (85 images contralateral to a cancer and 994 cases without breast cancer) were recruited from Fudan University Shanghai Cancer Centre. For each case, we collected sixteen demographic variables such as age, BMI, number of children, family history of breast cancer, and age at menarche. Moreover, the dense tissue was automatically segmented by AutoDensity. A set of quantitative features was extracted from the dense area. Using the 80th percentile of intensity values in the dense area, the segmented area was thresholded again and a second set of computer-extracted features was calculated. The features, i.e. the demographic variables and the texture features extracted from the mammographically dense areas of the image, were fed into an ensemble of 250 decision trees, whose results were combined using RUSBoost. The classifier achieved an AUC of 0.88 (CI: 0.84 - 0.91) for identifying high-risk images. Therefore, adopting such a model might augment the discriminatory power of currently used risk prediction models. However, it should be noted that the cancer cases were retrieved from the diagnostic environment (not screening) and further validation on a dataset from a screening set-up will be required.
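The second thresholding step can be pictured with a small sketch. It assumes a nearest-rank definition of the 80th percentile and assumes that pixels at or above the cutoff are the ones retained; the abstract states neither detail, and the intensity values and function names below are hypothetical.

```python
import math


def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def rethreshold(dense_intensities, pct=80):
    """Keep only the intensities at or above the pct-th percentile (assumed direction)."""
    cutoff = nearest_rank_percentile(dense_intensities, pct)
    kept = [v for v in dense_intensities if v >= cutoff]
    return cutoff, kept


# Hypothetical intensity values from a segmented dense area.
dense_intensities = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

cutoff, brightest = rethreshold(dense_intensities)
print(cutoff, brightest)
```

Texture features for the second feature set would then be computed on the retained region only; that step is not shown because the abstract does not list the features.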
This study explored the possibility of using the gist signal (radiologists’ first impression about a case) for improving the performance of two recently developed deep learning-based breast cancer detection tools. We investigated whether higher performance in identifying malignant cases can be achieved by combining the cancer class probability from the networks with the gist signal. In total, we recruited 53 radiologists, who provided an abnormality score on a scale from 0 to 100 for unilateral mammograms following a 500-millisecond presentation of the image. Twenty cancer cases, 40 benign cases, and 20 normal cases were included. Two state-of-the-art deep learning-based tools (M1 and M2) for breast cancer detection were adopted. The abnormality scores from the networks and the gist responses for each observer were fed into a support vector machine (SVM). The SVM was personalized for each radiologist and its performance was evaluated using leave-one-out cross-validation. We also considered the average reader, whose gist responses were the mean abnormality scores given by all 53 readers to each image. The mean and range of AUCs in the gist experiment were 0.643 and 0.492-0.794, respectively. The AUC values for M1 and M2 were 0.789 (0.632-0.892) and 0.814 (0.673-0.897), respectively. For the average reader, the AUCs for gist, gist+M1, and gist+M2 were 0.760 (0.617-0.862), 0.847 (0.754-0.928), and 0.897 (0.789-0.946), respectively. For 45 readers, the performance of at least one of the models improved after aggregating its output with the gist signal. The results showed that the gist signal has the potential to improve the performance of the adopted deep learning-based tools.