Abstract

Background: Radiology reports often contain follow-up recommendations that are vital for optimal patient care, prevention of complications, and mitigation of legal risk. However, methods for identifying these recommendations across large volumes of reports from various modalities, including open-source large language models, have not been comprehensively compared.

Purpose: To evaluate the performance of machine learning (ML) models, including Meta’s open-source LLAMA3 and OpenAI’s Health Insurance Portability and Accountability Act–compliant Generative Pre-trained Transformer, in identifying follow-up recommendations in radiology reports.

Materials and Methods: In this retrospective study, three sets of radiology reports spanning multiple imaging modalities from a large urban academic medical center were analyzed: an expert-annotated dataset (n = 11 901) from January 1 to January 10, 2015; a dataset (n = 32 959) extracted with regular expressions (ie, sequences of characters that define search patterns in text) from January 11, 2015, to January 1, 2017; and a dataset (n = 4909) annotated during dictation from September 8, 2018, to February 23, 2021. To assess generalization on impressions, two expert-annotated datasets were used: 2000 chest radiography reports from the publicly available MIMIC-CXR database for external testing and 100 institutional CT reports from January 1 to January 15, 2024, for temporal testing. Thirty-two text classification methods were evaluated separately on the findings and impression sections of these reports. Performance metrics included precision, recall, accuracy, and F1 score, with 95% bootstrapped CIs, as well as areas under the precision-recall curve. Statistical comparisons were performed using the McNemar test.

Results: The study included 49 769 reports from 35 509 patients (mean age, 52.2 years ± 22.0 [SD]; 18 477 female patients) for training (n = 37 140), validation (n = 2584), and internal testing (n = 10 045).
For the findings section, a generative-discriminative model initialized with Google’s Word2vec embeddings (Hybrid-google) achieved the highest F1 score (0.835; 95% CI: 0.825, 0.845). For the impression section, an attention-based bidirectional long short-term memory (LSTM) network with random initialization (AttBiLSTM-random) performed best, with an F1 score of 0.979 (95% CI: 0.976, 0.982). Prefixed prompting with GPT-4 demonstrated superior external and temporal generalization performance on the MIMIC-CXR and institutional CT datasets, achieving F1 scores of 0.969 (95% CI: 0.961, 0.977) and 0.973 (95% CI: 0.937, 1.000), respectively.

Conclusion: ML models showed promise for automating the classification of follow-up recommendations in radiology reports.
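The Materials and Methods mention that one dataset was labeled with regular expressions. The study's actual expressions are not given in the abstract; the following is a minimal sketch of that style of rule-based labeling, with entirely hypothetical patterns chosen for illustration only.

```python
import re

# Hypothetical follow-up phrases; these are illustrative assumptions,
# not the patterns used in the study.
FOLLOWUP_PATTERNS = [
    r"\bfollow[- ]?up\b.{0,40}\brecommend",            # "follow-up ... recommended"
    r"\brecommend(?:ed)?\b.{0,40}\b(?:CT|MRI|ultrasound|imaging)\b",
    r"\brepeat\b.{0,30}\b(?:imaging|scan|study)\b",    # "repeat imaging/scan"
]
FOLLOWUP_RE = re.compile("|".join(FOLLOWUP_PATTERNS), re.IGNORECASE)

def has_followup_recommendation(report_text: str) -> bool:
    """Flag a report section if any follow-up pattern matches."""
    return FOLLOWUP_RE.search(report_text) is not None
```

Rule-based labeling of this kind is fast and transparent but brittle to phrasing variation, which is one motivation for comparing it against learned classifiers and LLMs.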
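The reported statistics follow standard definitions. As a rough sketch (function names and parameters are illustrative; the study's actual implementation is not given), the F1 score, a percentile-bootstrap 95% CI, and an exact McNemar test for paired model comparison can be computed as:

```python
import math
import random

def f1_score(y_true, y_pred):
    """F1 = harmonic mean of precision and recall for binary labels."""
    tp = sum(t == p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def bootstrap_f1_ci(y_true, y_pred, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample cases with replacement, recompute F1."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(f1_score([y_true[i] for i in idx],
                               [y_pred[i] for i in idx]))
    scores.sort()
    return scores[int((alpha / 2) * n_boot)], scores[int((1 - alpha / 2) * n_boot) - 1]

def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact (binomial) McNemar p value from the two discordant cell counts."""
    b = sum(pa == t and pb != t for t, pa, pb in zip(y_true, pred_a, pred_b))
    c = sum(pa != t and pb == t for t, pa, pb in zip(y_true, pred_a, pred_b))
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(math.comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

The McNemar test uses only the discordant pairs (cases where exactly one model is correct), which is why it is the usual choice for comparing two classifiers evaluated on the same reports.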

Original language: English
Article number: e242167
Journal: Radiology
Volume: 317
Issue number: 2
DOIs
State: Published - Nov 11 2025

Title: Large-Scale Evaluation of Machine Learning Models in Identifying Follow-Up Recommendations in Radiology Reports