Subgroup evaluation to understand performance gaps in deep learning-based classification of regions of interest on mammography

  • Min Jae Woo
  • Linglin Zhang
  • Beatrice Brown-Mulry
  • In Chan Hwang
  • Judy Wawira Gichoya
  • Aimilia Gastounioti
  • Imon Banerjee
  • Laleh Seyyed-Kalantari
  • Hari Trivedi

Research output: Contribution to journal › Article › peer-review

1 Scopus citation

Abstract

This study evaluates a deep learning model for classifying normal versus potentially abnormal regions of interest (ROIs) on mammography, aiming to identify imaging, pathologic, and demographic characteristics that may induce suboptimal model performance in certain patient subgroups. We used the EMory BrEast imaging Dataset (EMBED), containing 3.4 million mammographic images from 115,931 patients. Full-field digital mammograms from women aged 18 years or older were used to create positive and negative patches, matched on size, location, patient demographics, and imaging features. Several convolutional neural network (CNN) architectures were tested, with ResNet152V2 demonstrating the best performance. The dataset was split into training (29,144 patches), validation (9,910 patches), and testing (13,390 patches) sets. Performance metrics included accuracy, AUC, recall, precision, F1 score, false negative rate, and false positive rate. Subgroup analysis was conducted using univariate and multivariate regression models to control for confounding effects. The classification model achieved an AUC of 0.975 and a recall of 0.927. False negative predictions were significantly associated with White patients (RR = 1.208; p = 0.050), those never biopsied (RR = 1.079; p = 0.011), and cases with architectural distortion (RR = 1.037; p < 0.001). Higher breast density significantly increased the risk of false positives, with BI-RADS density C (RR = 1.891; p < 0.001) and D (RR = 2.486; p < 0.001). Race and age were not significant predictors for false positives in multivariate analysis. These findings suggest that deep learning models for mammography may underperform in specific subgroups. The study underscores the need for more precise patient subgroup analysis and emphasizes the importance of considering confounding factors in deep learning model evaluations. These insights can help develop fair and interpretable decision-making models in mammography, ultimately enhancing the performance and equity of computer-aided detection (CADe) and computer-aided diagnosis (CADx) applications.
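
The threshold-based metrics named in the abstract, and the idea of a relative risk (RR) between subgroups, can be illustrated with a minimal Python sketch. This is not the authors' code. The names `Confusion`, `patch_metrics`, and `crude_relative_risk` are hypothetical, and all counts are invented for illustration. The study estimated RRs with univariate and multivariate regression models; the crude ratio below is unadjusted and does not control for confounding.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Confusion-matrix counts for a binary patch classifier (hypothetical helper)."""
    tp: int  # abnormal patches predicted abnormal
    fp: int  # normal patches predicted abnormal
    tn: int  # normal patches predicted normal
    fn: int  # abnormal patches predicted normal


def patch_metrics(c: Confusion) -> dict:
    """Compute the threshold-based metrics listed in the abstract (AUC is omitted)."""
    recall = c.tp / (c.tp + c.fn)
    precision = c.tp / (c.tp + c.fp)
    return {
        "accuracy": (c.tp + c.tn) / (c.tp + c.fp + c.tn + c.fn),
        "recall": recall,
        "precision": precision,
        "f1": 2 * precision * recall / (precision + recall),
        "false_negative_rate": c.fn / (c.tp + c.fn),
        "false_positive_rate": c.fp / (c.fp + c.tn),
    }


def crude_relative_risk(events_a: int, total_a: int,
                        events_b: int, total_b: int) -> float:
    """Unadjusted RR: error rate in subgroup A divided by error rate in subgroup B."""
    return (events_a / total_a) / (events_b / total_b)


# Invented counts, not taken from the study.
overall = Confusion(tp=90, fp=20, tn=180, fn=10)
metrics = patch_metrics(overall)

# Invented example: 12 false negatives among 100 abnormal patches in subgroup A
# versus 10 among 100 in reference subgroup B.
rr_false_negative = crude_relative_risk(12, 100, 10, 100)
```

An RR above 1 means the error is more frequent in subgroup A than in the reference subgroup. A multivariate model is still needed to tell whether that difference persists once other characteristics, such as breast density, are taken into account.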

Original language: English
Article number: e0000811
Journal: PLOS Digital Health
Volume: 4
Issue number: 4
State: Published - Apr 2025
