Natural Language Processing-Assisted Classification Models to Confirm Monoclonal Gammopathy of Undetermined Significance and Progression in Veterans' Electronic Health Records

Mei Wang, Yao Chi Yu, Lawrence Liu, Martin W. Schoen, Akhil Kumar, Kristin Vargo, Graham Colditz, Theodore Thomas, Su Hsin Chang

Research output: Contribution to journal › Article › peer-review

Abstract

PURPOSE: To develop and validate natural language processing (NLP)-assisted machine learning (ML)-based classification models to confirm diagnoses of monoclonal gammopathy of undetermined significance (MGUS) and multiple myeloma (MM) from electronic health records (EHRs) in the Veterans Health Administration (VHA).

MATERIALS AND METHODS: We developed precompiled lexicons and classification rules as features for the following ML classifiers: logistic regression, random forest, and support vector machines (SVMs). These features were trained on 36,044 EHR documents from a random sample of 400 patients with at least one International Classification of Diseases code for MGUS diagnosis from 1999 to 2021. The best-performing feature combination was calibrated in the validation set (17,826 documents/200 patients) and evaluated in the testing set (9,250 documents/100 patients). Model performance in diagnosis confirmation was compared with manual chart review results (gold standard) using recall, precision, accuracy, and F1 score. For patients correctly labeled as disease-positive, the difference between model-identified diagnosis dates and the gold standard was also computed.

RESULTS: In the testing set, the NLP-assisted classification model using SVMs achieved the best performance in both MGUS and MM confirmation, with recall/precision/accuracy/F1 of 98.8%/93.3%/93.0%/96.0% for MGUS and 100.0%/92.3%/99.0%/96.0% for MM. Dates of diagnosis matched (±45 days) those of the gold standard in 73.0% of model-confirmed MGUS and 84.6% of model-confirmed MM.

CONCLUSION: An NLP-assisted classification model can reliably confirm MGUS and MM diagnoses and dates and extract laboratory results using automated interpretation of EHR data. This algorithm has the potential to be adapted to other disease areas in the VHA EHR system.
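The evaluation measures named in the abstract can be illustrated with a short sketch. This is not the authors' code: the function names, the confusion-matrix counts in the usage note, and the choice to treat the ±45-day window as inclusive are all assumptions made for illustration. Only the metric definitions (recall, precision, accuracy, F1) and the 45-day window come from the abstract.

```python
from datetime import date


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def confirmation_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Recall, precision, accuracy, and F1 from confusion-matrix counts.

    Counts compare model labels against the manual chart review
    (gold standard). Hypothetical helper, not the study's implementation.
    """
    total = tp + fp + fn + tn
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return {
        "recall": recall,
        "precision": precision,
        "accuracy": accuracy,
        "f1": f1_score(precision, recall),
    }


def dates_match(model_date: date, gold_date: date, window_days: int = 45) -> bool:
    """True if the model-identified date is within the window of the gold date.

    Assumption: the window is inclusive on both sides.
    """
    return abs((model_date - gold_date).days) <= window_days
```

As a consistency check, `f1_score(0.933, 0.988)` and `f1_score(0.923, 1.0)` both round to 96.0%, which agrees with the F1 values reported for MGUS and MM. A call such as `confirmation_metrics(tp=8, fp=2, fn=0, tn=10)` uses made-up counts purely to show the interface.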

Original language: English
Pages (from-to): e2300081
Journal: JCO Clinical Cancer Informatics
Volume: 7
State: Published - Sep 1 2023
