TY - JOUR
T1 - Accuracy and Reliability of Chatbot Responses to Physician Questions
AU - Goodman, Rachel S.
AU - Patrinely, J. Randall
AU - Stone, Cosby A.
AU - Zimmerman, Eli
AU - Donald, Rebecca R.
AU - Chang, Sam S.
AU - Berkowitz, Sean T.
AU - Finn, Avni P.
AU - Jahangir, Eiman
AU - Scoville, Elizabeth A.
AU - Reese, Tyler S.
AU - Friedman, Debra L.
AU - Bastarache, Julie A.
AU - van der Heijden, Yuri F.
AU - Wright, Jordan J.
AU - Ye, Fei
AU - Carter, Nicholas
AU - Alexander, Matthew R.
AU - Choe, Jennifer H.
AU - Chastain, Cody A.
AU - Zic, John A.
AU - Horst, Sara N.
AU - Turker, Isik
AU - Agarwal, Rajiv
AU - Osmundson, Evan
AU - Idrees, Kamran
AU - Kiernan, Colleen M.
AU - Padmanabhan, Chandrasekhar
AU - Bailey, Christina E.
AU - Schlegel, Cameron E.
AU - Chambless, Lola B.
AU - Gibson, Michael K.
AU - Osterman, Travis J.
AU - Wheless, Lee E.
AU - Johnson, Douglas B.
N1 - Publisher Copyright:
© 2023 American Medical Association. All rights reserved.
PY - 2023/10/2
Y1 - 2023/10/2
AB - IMPORTANCE Natural language processing tools, such as ChatGPT (generative pretrained transformer, hereafter referred to as chatbot), have the potential to radically enhance the accessibility of medical information for health professionals and patients. Assessing the safety and efficacy of these tools in answering physician-generated questions is critical to determining their suitability in clinical settings, facilitating complex decision-making, and optimizing health care efficiency. OBJECTIVE To assess the accuracy and comprehensiveness of chatbot-generated responses to physician-developed medical queries, highlighting the reliability and limitations of artificial intelligence–generated medical information. DESIGN, SETTING, AND PARTICIPANTS Thirty-three physicians across 17 specialties generated 284 medical questions that they subjectively classified as easy, medium, or hard with either binary (yes or no) or descriptive answers. The physicians then graded the chatbot-generated answers to these questions for accuracy (6-point Likert scale, with 1 being completely incorrect and 6 being completely correct) and completeness (3-point Likert scale, with 1 being incomplete and 3 being complete plus additional context). Scores were summarized with descriptive statistics and compared using the Mann-Whitney U test or the Kruskal-Wallis test. The study (including data analysis) was conducted from January to May 2023. MAIN OUTCOMES AND MEASURES Accuracy, completeness, and consistency over time and between 2 different versions (GPT-3.5 and GPT-4) of chatbot-generated medical responses. RESULTS Across all questions (n = 284) generated by 33 physicians (31 faculty members and 2 recent graduates from residency or fellowship programs) across 17 specialties, the median accuracy score was 5.5 (IQR, 4.0-6.0) (between almost completely and completely correct) with a mean (SD) score of 4.8 (1.6) (between mostly and almost completely correct). The median completeness score was 3.0 (IQR, 2.0-3.0) (complete and comprehensive) with a mean (SD) score of 2.5 (0.7). For questions rated easy, medium, and hard, the median accuracy scores were 6.0 (IQR, 5.0-6.0), 5.5 (IQR, 5.0-6.0), and 5.0 (IQR, 4.0-6.0), respectively (mean [SD] scores were 5.0 [1.5], 4.7 [1.7], and 4.6 [1.6], respectively; P = .05). Accuracy scores for binary and descriptive questions were similar (median score, 6.0 [IQR, 4.0-6.0] vs 5.0 [IQR, 3.4-6.0]; mean [SD] score, 4.9 [1.6] vs 4.7 [1.6]; P = .07). Of 36 questions with scores of 1.0 to 2.0, 34 were requeried or regraded 8 to 17 days later with substantial improvement (median score, 2.0 [IQR, 1.0-3.0] vs 4.0 [IQR, 2.0-5.3]; P < .01). A subset of questions, regardless of initial scores (version 3.5),
UR - http://www.scopus.com/inward/record.url?scp=85173041456&partnerID=8YFLogxK
U2 - 10.1001/jamanetworkopen.2023.36483
DO - 10.1001/jamanetworkopen.2023.36483
M3 - Article
C2 - 37782499
AN - SCOPUS:85173041456
SN - 2574-3805
SP - e2336483
JO - JAMA Network Open
JF - JAMA Network Open
ER -