TY - JOUR
T1 - Comparative benchmarking of the DeepSeek large language model on medical tasks and clinical reasoning
AU - Tordjman, Mickael
AU - Liu, Zelong
AU - Yuce, Murat
AU - Fauveau, Valentin
AU - Mei, Yunhao
AU - Hadjadj, Jerome
AU - Bolger, Ian
AU - Almansour, Haidara
AU - Horst, Carolyn
AU - Parihar, Ashwin Singh
AU - Geahchan, Amine
AU - Meribout, Anis
AU - Yatim, Nader
AU - Ng, Nicole
AU - Robson, Phillip
AU - Zhou, Alexander
AU - Lewis, Sara
AU - Huang, Mingqian
AU - Deyer, Timothy
AU - Taouli, Bachir
AU - Lee, Hao Chih
AU - Fayad, Zahi A.
AU - Mei, Xueyan
N1 - Publisher Copyright:
© The Author(s), under exclusive licence to Springer Nature America, Inc. 2025.
PY - 2025/8
Y1 - 2025/8
AB - DeepSeek is a newly introduced large language model (LLM) designed for enhanced reasoning, but its medical-domain capabilities have not yet been evaluated. Here we assessed the capabilities of three LLMs (DeepSeek-R1, ChatGPT-o1 and Llama 3.1-405B) on four different medical tasks: answering questions from the United States Medical Licensing Examination (USMLE), interpreting and reasoning on the basis of text-based diagnostic and management cases, classifying tumors according to RECIST 1.1 criteria and summarizing diagnostic imaging reports across multiple modalities. In the USMLE test, the performance of DeepSeek-R1 (accuracy 0.92) was slightly inferior to that of ChatGPT-o1 (accuracy 0.95; P = 0.04) but better than that of Llama 3.1-405B (accuracy 0.83; P < 10⁻³). For text-based case challenges, DeepSeek-R1 performed similarly to ChatGPT-o1 (accuracy 0.57 versus 0.55, P = 0.76, on New England Journal of Medicine cases; 0.74 versus 0.76, P = 0.06, on Médicilline cases). For RECIST classifications, DeepSeek-R1 also performed similarly to ChatGPT-o1 (accuracy 0.74 versus 0.81; P = 0.10). Diagnostic reasoning steps provided by DeepSeek-R1 were deemed more accurate than those provided by ChatGPT-o1 and Llama 3.1-405B (average Likert scores of 3.61, 3.22 and 3.13, respectively; P = 0.005 and P < 10⁻³). However, summarized imaging reports provided by DeepSeek-R1 exhibited lower global quality than those provided by ChatGPT-o1 (5-point Likert score: 4.5 versus 4.8; P < 10⁻³). This study highlights the potential of the DeepSeek-R1 LLM for medical applications but also underlines areas needing improvement.
UR - https://www.scopus.com/pages/publications/105004456262
U2 - 10.1038/s41591-025-03726-3
DO - 10.1038/s41591-025-03726-3
M3 - Article
C2 - 40267969
AN - SCOPUS:105004456262
SN - 1078-8956
VL - 31
SP - 2550
EP - 2555
JO - Nature Medicine
JF - Nature Medicine
IS - 8
ER -