TY - JOUR
T1 - Subgroup evaluation to understand performance gaps in deep learning-based classification of regions of interest on mammography
AU - Woo, Min Jae
AU - Zhang, Linglin
AU - Brown-Mulry, Beatrice
AU - Hwang, In Chan
AU - Gichoya, Judy Wawira
AU - Gastounioti, Aimilia
AU - Banerjee, Imon
AU - Seyyed-Kalantari, Laleh
AU - Trivedi, Hari
N1 - Publisher Copyright: © 2025 Woo et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
PY - 2025/4
Y1 - 2025/4
N2 - This study evaluates a deep learning model for classifying normal versus potentially abnormal regions of interest (ROIs) on mammography, aiming to identify imaging, pathologic, and demographic characteristics that may induce suboptimal model performance in certain patient subgroups. We utilized the EMory BrEast imaging Dataset (EMBED), containing 3.4 million mammographic images from 115,931 patients. Full-field digital mammograms from women aged 18 years or older were used to create positive and negative patches with the patches matched based on size, location, patient demographics, and imaging features. Several convolutional neural network (CNN) architectures were tested, with ResNet152V2 demonstrating the best performance. The dataset was split into training (29,144 patches), validation (9,910 patches), and testing (13,390 patches) sets. Performance metrics included accuracy, AUC, recall, precision, F1 score, false negative rate, and false positive rate. Subgroup analysis was conducted using univariate and multivariate regression models to control for confounding effects. The classification model achieved an AUC of 0.975 and a recall of 0.927. False negative predictions were significantly associated with White patients (RR = 1.208; p = 0.050), those never biopsied (RR = 1.079; p = 0.011), and cases with architectural distortion (RR = 1.037; p < 0.001). Higher breast density significantly increased the risk of false positives, with BI-RADS density C (RR = 1.891; p < 0.001) and D (RR = 2.486; p < 0.001). Race and age were not significant predictors for false positives in multivariate analysis. These findings suggest that deep learning models for mammography may underperform in specific subgroups. The study underscores the need for more precise patient subgroup analysis and emphasizes the importance of considering confounding factors in deep learning model evaluations. These insights can help develop fair and interpretable decision-making models in mammography, ultimately enhancing the performance and equity of CADe and CADx applications.
UR - https://www.scopus.com/pages/publications/105002255253
U2 - 10.1371/journal.pdig.0000811
DO - 10.1371/journal.pdig.0000811
M3 - Article
C2 - 40198652
AN - SCOPUS:105002255253
SN - 2767-3170
VL - 4
JO - PLOS Digital Health
JF - PLOS Digital Health
IS - 4
M1 - e0000811
ER -