TY - JOUR
T1 - Comparison of the Usability and Reliability of Answers to Clinical Questions
T2 - AI-Generated ChatGPT versus a Human-Authored Resource
AU - Manian, Farrin A.
AU - Garland, Katherine
AU - Ding, Jimin
N1 - Publisher Copyright:
Copyright © 2024 The Author(s). Published by Wolters Kluwer Health, Inc.
PY - 2024/8/1
Y1 - 2024/8/1
N2 - Objectives: Our aim was to compare the usability and reliability of answers to "real-world" clinical questions raised during the care of patients, as provided by Chat Generative Pre-trained Transformer (ChatGPT) versus a human-authored Web source (www.Pearls4Peers.com). Methods: Two domains of clinical information quality were studied: usability, based on organization/readability, relevance, and usefulness, and reliability, based on clarity, accuracy, and thoroughness. The 36 most viewed real-world questions from the human-authored Web site (www.Pearls4Peers.com [P4P]) were posed to ChatGPT 3.5. Anonymized answers from ChatGPT and P4P (without literature citations) were assessed separately for usability by 18 practicing physicians ("clinician users") in triplicate and for reliability by 21 expert providers ("content experts") in duplicate or triplicate, each on a Likert scale ("definitely yes," "generally yes," or "no"). Participants also directly compared the usability and reliability of paired answers. Results: The usability and reliability of ChatGPT answers varied widely depending on the question posed. ChatGPT answers were not considered useful or accurate in 13.9% and 13.1% of cases, respectively. In within-individual rankings for usability, ChatGPT was inferior to P4P in organization/readability, relevance, and usefulness in 29.6%, 28.3%, and 29.6% of cases, respectively; for reliability, it was inferior to P4P in clarity, accuracy, and thoroughness in 38.1%, 34.5%, and 31.0% of cases, respectively. Conclusions: The quality of ChatGPT responses to real-world clinical questions varied widely, with nearly one-third or more of answers considered inferior to a human-authored source in several aspects of usability and reliability. Caution is advised when using ChatGPT in clinical decision making.
AB - Objectives: Our aim was to compare the usability and reliability of answers to "real-world" clinical questions raised during the care of patients, as provided by Chat Generative Pre-trained Transformer (ChatGPT) versus a human-authored Web source (www.Pearls4Peers.com). Methods: Two domains of clinical information quality were studied: usability, based on organization/readability, relevance, and usefulness, and reliability, based on clarity, accuracy, and thoroughness. The 36 most viewed real-world questions from the human-authored Web site (www.Pearls4Peers.com [P4P]) were posed to ChatGPT 3.5. Anonymized answers from ChatGPT and P4P (without literature citations) were assessed separately for usability by 18 practicing physicians ("clinician users") in triplicate and for reliability by 21 expert providers ("content experts") in duplicate or triplicate, each on a Likert scale ("definitely yes," "generally yes," or "no"). Participants also directly compared the usability and reliability of paired answers. Results: The usability and reliability of ChatGPT answers varied widely depending on the question posed. ChatGPT answers were not considered useful or accurate in 13.9% and 13.1% of cases, respectively. In within-individual rankings for usability, ChatGPT was inferior to P4P in organization/readability, relevance, and usefulness in 29.6%, 28.3%, and 29.6% of cases, respectively; for reliability, it was inferior to P4P in clarity, accuracy, and thoroughness in 38.1%, 34.5%, and 31.0% of cases, respectively. Conclusions: The quality of ChatGPT responses to real-world clinical questions varied widely, with nearly one-third or more of answers considered inferior to a human-authored source in several aspects of usability and reliability. Caution is advised when using ChatGPT in clinical decision making.
KW - ChatGPT
KW - Pearls4Peers
KW - artificial intelligence
KW - reliability
KW - usability
UR - https://www.scopus.com/pages/publications/85200531644
U2 - 10.14423/SMJ.0000000000001715
DO - 10.14423/SMJ.0000000000001715
M3 - Article
C2 - 39094795
AN - SCOPUS:85200531644
SN - 0038-4348
VL - 117
SP - 467
EP - 473
JO - Southern Medical Journal
JF - Southern Medical Journal
IS - 8
ER -