TY - JOUR
T1 - A Case Demonstration of the Open Health Natural Language Processing Toolkit From the National COVID-19 Cohort Collaborative and the Researching COVID to Enhance Recovery Programs for a Natural Language Processing System for COVID-19 or Postacute Sequelae of SARS CoV-2 Infection
T2 - Algorithm Development and Validation
AU - National COVID Cohort Collaborative
AU - Wen, Andrew
AU - Wang, Liwei
AU - He, Huan
AU - Fu, Sunyang
AU - Liu, Sijia
AU - Hanauer, David A.
AU - Harris, Daniel R.
AU - Kavuluru, Ramakanth
AU - Zhang, Rui
AU - Natarajan, Karthik
AU - Pavinkurve, Nishanth P.
AU - Hajagos, Janos
AU - Rajupet, Sritha
AU - Lingam, Veena
AU - Saltz, Mary
AU - Elowsky, Corey
AU - Moffitt, Richard A.
AU - Koraishy, Farrukh M.
AU - Palchuk, Matvey B.
AU - Donovan, Jordan
AU - Lingrey, Lora
AU - Stone-Derhagopian, Garo
AU - Miller, Robert T.
AU - Williams, Andrew E.
AU - Leese, Peter J.
AU - Kovach, Paul I.
AU - Pfaff, Emily R.
AU - Zemmel, Mikhail
AU - Pates, Robert D.
AU - Guthe, Nick
AU - Haendel, Melissa A.
AU - Chute, Christopher G.
AU - Liu, Hongfang
AU - Wilcox, Adam B.
AU - Lee, Adam M.
AU - Graves, Alexis
AU - Anzalone, Alfred (Jerrod)
AU - Manna, Amin
AU - Saha, Amit
AU - Olex, Amy
AU - Zhou, Andrea
AU - Southerland, Andrew
AU - Girvin, Andrew T.
AU - Walden, Anita
AU - Sharathkumar, Anjali A.
AU - Amor, Benjamin
AU - Bates, Benjamin
AU - Hendricks, Brian
AU - Patel, Brijesh
AU - Alexander, Caleb
AU - Bramante, Carolyn
AU - Ward-Caviness, Cavin
AU - Madlock-Brown, Charisse
AU - Suver, Christine
AU - Dillon, Christopher
AU - Wu, Chunlei
AU - Schmitt, Clare
AU - Takemoto, Cliff
AU - Housman, Dan
AU - Gabriel, Davera
AU - Eichmann, David A.
AU - Mazzotti, Diego
AU - Brown, Don
AU - Boudreau, Eilis
AU - Hill, Elaine
AU - Zampino, Elizabeth
AU - Marti, Emily Carlson
AU - French, Evan
AU - Mariona, Federico
AU - Prior, Fred
AU - Sokos, George
AU - Martin, Greg
AU - Lehmann, Harold
AU - Spratt, Heidi
AU - Mehta, Hemalkumar
AU - Sidky, Hythem
AU - Hayanga, J. W. Awori
AU - Pincavitch, Jami
AU - Clark, Jaylyn
AU - Harper, Jeremy Richard
AU - Islam, Jessica
AU - Ge, Jin
AU - Gagnier, Joel
AU - Saltz, Joel H.
AU - Saltz, Joel
AU - Loomba, Johanna
AU - Buse, John
AU - Mathew, Jomol
AU - Rutter, Joni L.
AU - McMurry, Julie A.
AU - Guinney, Justin
AU - Starren, Justin
AU - Crowley, Karen
AU - Bradwell, Katie Rebecca
AU - Walters, Kellie M.
AU - Wilkins, Ken
AU - Gersing, Kenneth R.
AU - Cato, Kenrick Dwain
AU - Murray, Kimberly
AU - Kostka, Kristin
AU - Northington, Lavance
AU - Pyles, Lee Allan
AU - Misquitta, Leonie
AU - Cottrell, Lesley
AU - Portilla, Lili
AU - Deacy, Mariam
AU - Bissell, Mark M.
AU - Clark, Marshall
AU - Emmett, Mary
AU - Adams, Meredith
AU - Temple-O'Connor, Meredith
AU - Kurilla, Michael G.
AU - Morris, Michele
AU - Qureshi, Nabeel
AU - Safdar, Nasia
AU - Garbarini, Nicole
AU - Sharafeldin, Noha
AU - Sadan, Ofer
AU - Francis, Patricia A.
AU - Burgoon, Penny Wung
AU - Robinson, Peter
AU - Payne, Philip R. O.
AU - Fuentes, Rafael
AU - Jawa, Randeep
AU - Erwin-Cohen, Rebecca
AU - Patel, Rena
AU - Zhu, Richard L.
AU - Kamaleswaran, Rishi
AU - Hurley, Robert
AU - Pyarajan, Saiju
AU - Michael, Sam G.
AU - Bozzette, Samuel
AU - Mallipattu, Sandeep
AU - Vedula, Satyanarayana
AU - Chapman, Scott
AU - O'Neil, Shawn T.
AU - Setoguchi, Soko
AU - Hong, Stephanie S.
AU - Johnson, Steve
AU - Bennett, Tellen D.
AU - Callahan, Tiffany
AU - Topaloglu, Umit
AU - Sheikh, Usman
AU - Gordon, Valery
AU - Subbian, Vignesh
AU - Kibbe, Warren A.
AU - Hernandez, Wenndy
AU - Beasley, Will
AU - Cooper, Will
AU - Hillegass, William
AU - Zhang, Xiaohan Tanner
N1 - Publisher Copyright:
© 2024 JMIR Publications Inc. All rights reserved.
PY - 2024
Y1 - 2024
AB - Background: A wealth of clinically relevant information is only obtainable within unstructured clinical narratives, leading to great interest in clinical natural language processing (NLP). While a multitude of approaches to NLP exist, current algorithm development approaches have limitations that can slow the development process. These limitations are exacerbated when the task is emergent, as is the case currently for NLP extraction of signs and symptoms of COVID-19 and postacute sequelae of SARS-CoV-2 infection (PASC). Objective: This study aims to highlight the current limitations of existing NLP algorithm development approaches that are exacerbated by NLP tasks surrounding emergent clinical concepts and to illustrate our approach to addressing these issues through the use case of developing an NLP system for the signs and symptoms of COVID-19 and PASC. Methods: We used 2 preexisting studies on PASC as a baseline to determine a set of concepts that should be extracted by NLP. This concept list was then used in conjunction with the Unified Medical Language System to autonomously generate an expanded lexicon to weakly annotate a training set, which was then reviewed by a human expert to generate a fine-tuned NLP algorithm. The annotations from a fully human-annotated test set were then compared with NLP results from the fine-tuned algorithm. The NLP algorithm was then deployed to 10 additional sites that were also running our NLP infrastructure. Of these 10 sites, 5 were used to conduct a federated evaluation of the NLP algorithm. Results: An NLP algorithm consisting of 12,234 unique normalized text strings corresponding to 2366 unique concepts was developed to extract COVID-19 or PASC signs and symptoms. An unweighted mean dictionary coverage of 77.8% was found for the 5 sites. Conclusions: The evolutionary and time-critical nature of the PASC NLP task significantly complicates existing approaches to NLP algorithm development. In this work, we present a hybrid approach using the Open Health Natural Language Processing Toolkit aimed at addressing these needs with a dictionary-based weak labeling step that minimizes the need for additional expert annotation while still preserving the fine-tuning capabilities of expert involvement.
KW - COVID
KW - COVID-19
KW - NLP
KW - OHNLP
KW - Open Health Natural Language Processing
KW - PASC
KW - SARS-CoV-2
KW - clinical information extraction
KW - clinical phenotyping
KW - extract
KW - extraction
KW - narratives
KW - natural language processing
KW - phenotype
KW - phenotyping
KW - unstructured
UR - http://www.scopus.com/inward/record.url?scp=85204977309&partnerID=8YFLogxK
U2 - 10.2196/49997
DO - 10.2196/49997
M3 - Article
C2 - 39250782
AN - SCOPUS:85204977309
SN - 2291-9694
VL - 12
JO - JMIR Medical Informatics
JF - JMIR Medical Informatics
M1 - e49997
ER -