TY - JOUR
T1 - Large-Scale Evaluation of Machine Learning Models in Identifying Follow-Up Recommendations in Radiology Reports
AU - Xiao, Pan
AU - Yu, Xiaobing
AU - Ha, Sung Min
AU - Bani, Abdalla
AU - Mintz, Aaron
AU - Wang, Jieqi
AU - Elbanan, Mohamed
AU - Mokkarala, Mahati
AU - Mattay, Govind
AU - Nazeri, Arash
AU - Kannampallil, Thomas
AU - Lai, Albert M.
AU - Narra, Vamsi R.
AU - Marcus, Daniel S.
AU - Bierhals, Andrew J.
AU - Sotiras, Aristeidis
N1 - Publisher Copyright:
© 2025, Radiological Society of North America Inc. All rights reserved.
PY - 2025/11/11
Y1 - 2025/11/11
AB - Background: Radiology reports often contain follow-up recommendations vital for optimal patient care, prevention of complications, and mitigation of legal risk. However, there is a lack of comprehensive comparisons of methods, including open-source large language models, for identifying these recommendations across large volumes of reports from various modalities. Purpose: To evaluate the performance of machine learning (ML) models, including Meta’s open-source LLAMA3 and OpenAI’s Health Insurance Portability and Accountability Act–compliant Generative Pre-trained Transformer, in identifying follow-up recommendations in radiology reports. Materials and Methods: In this retrospective study, three sets of radiology reports were analyzed across multiple imaging modalities from a large urban academic medical center: an expert-annotated dataset (n = 11 901) from January 1 to January 10, 2015; a dataset (n = 32 959) extracted through regular expressions (ie, sequences of characters that define search patterns in text) from January 11, 2015, to January 1, 2017; and a dataset (n = 4909) annotated during dictation from September 8, 2018, to February 23, 2021. To assess generalization on impressions, two expertly annotated datasets were used: 2000 chest radiography reports from the publicly available MIMIC-CXR database for external testing and 100 institutional CT reports from January 1 to January 15, 2024, for temporal testing. Thirty-two text classification methods were evaluated separately on the findings and impression sections of these reports. Performance metrics included precision, recall, accuracy, and F1 score, with 95% bootstrapped CIs, as well as areas under the precision-recall curve. Statistical comparisons were performed by using the McNemar test. Results: The study included 49 769 reports from 35 509 patients (mean age, 52.2 years ± 22.0 [SD]; 18 477 female patients) for training (n = 37 140), validation (n = 2584), and internal testing (n = 10 045). For the findings section, a generative-discriminative model initialized with Google’s Word2vec embeddings (Hybrid-google) achieved the highest F1 score (0.835; 95% CI: 0.825, 0.845). For the impression section, an attention-based bidirectional long short-term memory (LSTM) model with random initialization (AttBiLSTM-random) performed best, with an F1 score of 0.979 (95% CI: 0.976, 0.982). Prefixed prompting with GPT-4 demonstrated superior external and temporal generalization performance on the MIMIC-CXR and institutional CT datasets, achieving F1 scores of 0.969 (95% CI: 0.961, 0.977) and 0.973 (95% CI: 0.937, 1.000), respectively. Conclusion: ML models showed promise for automating the classification of follow-up recommendations in radiology reports.
UR - https://www.scopus.com/pages/publications/105021461305
U2 - 10.1148/radiol.242167
DO - 10.1148/radiol.242167
M3 - Article
C2 - 41217283
AN - SCOPUS:105021461305
SN - 0033-8419
VL - 317
JO - Radiology
JF - Radiology
IS - 2
M1 - e242167
ER -