TY - JOUR
T1 - GPT-4 Underperforms Experts in Detecting IV Fluid Contamination
AU - Spies, Nicholas C.
AU - Hubler, Zita
AU - Roper, Stephen M.
AU - Omosule, Catherine L.
AU - Senter-Zapata, Michael
AU - Roemmich, Brittany L.
AU - Brown, Hannah Marie
AU - Gimple, Ryan
AU - Farnsworth, Christopher W.
N1 - Publisher Copyright: © 2023 Association for Diagnostics & Laboratory Medicine. All rights reserved.
PY - 2023/11/1
Y1 - 2023/11/1
N2 - Background: Specimens contaminated with intravenous (IV) fluids are common in clinical laboratories. Current methods for detecting contamination rely on insensitive and workflow-disrupting delta checks or manual technologist review. Herein, we assessed the utility of large language models for detecting contamination by IV crystalloids and compared their performance with that of multiple, variably trained healthcare personnel (HCP). Methods: Contamination of basic metabolic panels was simulated using 0.9% normal saline (NS), with (n = 30) and without (n = 30) 5% dextrose (D5NS), at mixture ratios of 0.10 and 0.25. A multimodal language model (GPT-4) and a diverse panel of 8 HCP were asked to adjudicate between real and contaminated results. Classification performance, mixture quantification, and confidence were compared by Wilcoxon rank sum. Results: The 95% CIs for accuracy were 0.57-0.71 vs 0.73-0.80 for GPT-4 and HCP, respectively, on the NS set and 0.57-0.57 vs 0.73-0.80 on the D5NS set. HCP overestimated severity of contamination in the 0.10 mixture group (95% CI of estimate error, 0.05-0.20) for both fluids, while GPT-4 markedly overestimated the D5NS mixture at both ratios (0.16-0.33 for NS, 0.11-0.35 for D5NS). There was no correlation between reported confidence and likelihood of a correct classification. Conclusions: GPT-4 is less accurate than trained HCP for detecting IV fluid contamination of basic metabolic panel results. However, trained individuals were imperfect at identifying contaminated specimens, implying the need for novel, automated tools for its detection.
UR - http://www.scopus.com/inward/record.url?scp=85176495723&partnerID=8YFLogxK
U2 - 10.1093/jalm/jfad058
DO - 10.1093/jalm/jfad058
M3 - Article
C2 - 37702018
AN - SCOPUS:85176495723
SN - 2576-9456
VL - 8
SP - 1092
EP - 1100
JO - The Journal of Applied Laboratory Medicine
JF - The Journal of Applied Laboratory Medicine
IS - 6
ER -