TY - JOUR
T1 - Solving Complex Pediatric Surgical Case Studies
T2 - A Comparative Analysis of Copilot, ChatGPT-4, and Experienced Pediatric Surgeons' Performance
AU - Gnatzy, Richard
AU - Lacher, Martin
AU - Berger, Michael
AU - Boettcher, Michael
AU - Deffaa, Oliver J.
AU - Kübler, Joachim
AU - Madadi-Sanjani, Omid
AU - Martynov, Illya
AU - Mayer, Steffi
AU - Pakarinen, Mikko P.
AU - Wagner, Richard
AU - Wester, Tomas
AU - Zani, Augusto
AU - Aubert, Ophelia
N1 - Publisher Copyright:
© 2025. Thieme. All rights reserved.
PY - 2025/4/2
Y1 - 2025/4/2
N2 - Introduction The emergence of large language models (LLMs) has led to notable advancements across multiple sectors, including medicine. Yet, their impact on pediatric surgery remains largely unexplored. This study aims to assess the ability of the artificial intelligence (AI) models ChatGPT-4 and Microsoft Copilot to propose diagnostic procedures, primary and differential diagnoses, and answers to clinical questions using complex clinical case vignettes of classic pediatric surgical diseases. Methods We conducted the study in April 2024. We evaluated the performance of the LLMs on 13 complex clinical case vignettes of pediatric surgical diseases and compared their responses to those of a human cohort of experienced pediatric surgeons. Additionally, pediatric surgeons rated the diagnostic recommendations of the LLMs for completeness and accuracy. To determine differences in performance, we performed statistical analyses. Results ChatGPT-4 achieved a higher test score (52.1%) than Copilot (47.9%) but a lower score than the pediatric surgeons (68.8%). Overall differences in performance between ChatGPT-4, Copilot, and pediatric surgeons were statistically significant (p < 0.01). ChatGPT-4 demonstrated superior performance in generating differential diagnoses compared to Copilot (p < 0.05). No statistically significant differences were found between the AI models regarding suggestions for diagnostics and primary diagnosis. Overall, the recommendations of the LLMs were rated as average by the pediatric surgeons. Conclusion This study reveals significant limitations in the performance of AI models in pediatric surgery. Although LLMs exhibit potential across various areas, their reliability and accuracy in clinical decision-making tasks are limited. Further research is needed to improve AI capabilities and establish their usefulness in the clinical setting.
AB - Introduction The emergence of large language models (LLMs) has led to notable advancements across multiple sectors, including medicine. Yet, their impact on pediatric surgery remains largely unexplored. This study aims to assess the ability of the artificial intelligence (AI) models ChatGPT-4 and Microsoft Copilot to propose diagnostic procedures, primary and differential diagnoses, and answers to clinical questions using complex clinical case vignettes of classic pediatric surgical diseases. Methods We conducted the study in April 2024. We evaluated the performance of the LLMs on 13 complex clinical case vignettes of pediatric surgical diseases and compared their responses to those of a human cohort of experienced pediatric surgeons. Additionally, pediatric surgeons rated the diagnostic recommendations of the LLMs for completeness and accuracy. To determine differences in performance, we performed statistical analyses. Results ChatGPT-4 achieved a higher test score (52.1%) than Copilot (47.9%) but a lower score than the pediatric surgeons (68.8%). Overall differences in performance between ChatGPT-4, Copilot, and pediatric surgeons were statistically significant (p < 0.01). ChatGPT-4 demonstrated superior performance in generating differential diagnoses compared to Copilot (p < 0.05). No statistically significant differences were found between the AI models regarding suggestions for diagnostics and primary diagnosis. Overall, the recommendations of the LLMs were rated as average by the pediatric surgeons. Conclusion This study reveals significant limitations in the performance of AI models in pediatric surgery. Although LLMs exhibit potential across various areas, their reliability and accuracy in clinical decision-making tasks are limited. Further research is needed to improve AI capabilities and establish their usefulness in the clinical setting.
KW - artificial intelligence
KW - case studies
KW - large language models
KW - natural language processing
KW - pediatric surgery
UR - https://www.scopus.com/pages/publications/105002181118
U2 - 10.1055/a-2551-2131
DO - 10.1055/a-2551-2131
M3 - Article
C2 - 40043742
AN - SCOPUS:105002181118
SN - 0939-7248
VL - 35
SP - 382
EP - 389
JO - European Journal of Pediatric Surgery
JF - European Journal of Pediatric Surgery
IS - 5
ER -