TY - JOUR
T1 - Natural Language Processing-Assisted Classification Models to Confirm Monoclonal Gammopathy of Undetermined Significance and Progression in Veterans' Electronic Health Records
AU - Wang, Mei
AU - Yu, Yao Chi
AU - Liu, Lawrence
AU - Schoen, Martin W.
AU - Kumar, Akhil
AU - Vargo, Kristin
AU - Colditz, Graham
AU - Thomas, Theodore
AU - Chang, Su Hsin
PY - 2023/9/1
Y1 - 2023/9/1
AB - PURPOSE: To develop and validate natural language processing (NLP)-assisted machine learning (ML)-based classification models to confirm diagnoses of monoclonal gammopathy of undetermined significance (MGUS) and multiple myeloma (MM) from electronic health records (EHRs) in the Veterans Health Administration (VHA). MATERIALS AND METHODS: We developed precompiled lexicons and classification rules as features for the following ML classifiers: logistic regression, random forest, and support vector machines (SVMs). The classifiers were trained with these features on 36,044 EHR documents from a random sample of 400 patients with at least one International Classification of Diseases code for MGUS diagnosis from 1999 to 2021. The best-performing feature combination was calibrated in the validation set (17,826 documents/200 patients) and evaluated in the testing set (9,250 documents/100 patients). Model performance in diagnosis confirmation was compared with manual chart review results (gold standard) using recall, precision, accuracy, and F1 score. For patients correctly labeled as disease-positive, the difference between model-identified diagnosis dates and the gold standard was also computed. RESULTS: In the testing set, the NLP-assisted classification model using SVMs achieved the best performance in both MGUS and MM confirmation, with recall/precision/accuracy/F1 of 98.8%/93.3%/93.0%/96.0% for MGUS and 100.0%/92.3%/99.0%/96.0% for MM. Diagnosis dates matched those of the gold standard (within ±45 days) in 73.0% of model-confirmed MGUS and 84.6% of model-confirmed MM. CONCLUSION: An NLP-assisted classification model can reliably confirm MGUS and MM diagnoses and dates and extract laboratory results using automated interpretation of EHR data. This algorithm has the potential to be adapted to other disease areas in the VHA EHR system.
UR - http://www.scopus.com/inward/record.url?scp=85178616843&partnerID=8YFLogxK
U2 - 10.1200/CCI.23.00081
DO - 10.1200/CCI.23.00081
M3 - Article
C2 - 38048516
AN - SCOPUS:85178616843
SN - 2473-4276
VL - 7
SP - e2300081
JO - JCO Clinical Cancer Informatics
JF - JCO Clinical Cancer Informatics
ER -