TY - JOUR
T1 - Tree-based classification model for Long-COVID infection prediction with age stratification using data from the National COVID Cohort Collaborative
AU - The National COVID Cohort Collaborative (N3C) Consortium
AU - Wang, Will Ke
AU - Jeong, Hayoung
AU - Hershkovich, Leeor
AU - Cho, Peter
AU - Singh, Karnika
AU - Lederer, Lauren
AU - Roghanizad, Ali R.
AU - Shandhi, Md Mobashir Hasan
AU - Kibbe, Warren A.
AU - Dunn, Jessilyn
AU - Wilcox, Adam B.
AU - Lee, Adam M.
AU - Graves, Alexis
AU - Anzalone, Alfred Jerrod
AU - Manna, Amin
AU - Saha, Amit
AU - Olex, Amy
AU - Zhou, Andrea
AU - Williams, Andrew E.
AU - Southerland, Andrew
AU - Girvin, Andrew T.
AU - Walden, Anita
AU - Sharathkumar, Anjali A.
AU - Amor, Benjamin
AU - Bates, Benjamin
AU - Hendricks, Brian
AU - Patel, Brijesh
AU - Alexander, Caleb
AU - Bramante, Carolyn
AU - Ward-Caviness, Cavin
AU - Madlock-Brown, Charisse
AU - Suver, Christine
AU - Chute, Christopher
AU - Dillon, Christopher
AU - Wu, Chunlei
AU - Schmitt, Clare
AU - Takemoto, Cliff
AU - Housman, Dan
AU - Gabriel, Davera
AU - Eichmann, David A.
AU - Mazzotti, Diego
AU - Brown, Don
AU - Boudreau, Eilis
AU - Hill, Elaine
AU - Zampino, Elizabeth
AU - Marti, Emily Carlson
AU - Pfaff, Emily R.
AU - French, Evan
AU - Koraishy, Farrukh M.
AU - Mariona, Federico
AU - Prior, Fred
AU - Sokos, George
AU - Martin, Greg
AU - Lehmann, Harold
AU - Spratt, Heidi
AU - Mehta, Hemalkumar
AU - Liu, Hongfang
AU - Sidky, Hythem
AU - Awori Hayanga, J. W.
AU - Pincavitch, Jami
AU - Clark, Jaylyn
AU - Harper, Jeremy Richard
AU - Islam, Jessica
AU - Ge, Jin
AU - Gagnier, Joel
AU - Saltz, Joel H.
AU - Saltz, Joel
AU - Loomba, Johanna
AU - Buse, John
AU - Mathew, Jomol
AU - Rutter, Joni L.
AU - McMurry, Julie A.
AU - Guinney, Justin
AU - Starren, Justin
AU - Crowley, Karen
AU - Bradwell, Katie Rebecca
AU - Walters, Kellie M.
AU - Wilkins, Ken
AU - Gersing, Kenneth R.
AU - Cato, Kenrick Dwain
AU - Murray, Kimberly
AU - Kostka, Kristin
AU - Northington, Lavance
AU - Pyles, Lee Allan
AU - Misquitta, Leonie
AU - Cottrell, Lesley
AU - Portilla, Lili
AU - Deacy, Mariam
AU - Bissell, Mark M.
AU - Clark, Marshall
AU - Emmett, Mary
AU - Saltz, Mary Morrison
AU - Palchuk, Matvey B.
AU - Haendel, Melissa A.
AU - Adams, Meredith
AU - Temple-O'Connor, Meredith
AU - Kurilla, Michael G.
AU - Morris, Michele
AU - Qureshi, Nabeel
AU - Safdar, Nasia
AU - Garbarini, Nicole
AU - Sharafeldin, Noha
AU - Sadan, Ofer
AU - Francis, Patricia A.
AU - Burgoon, Penny Wung
AU - Robinson, Peter
AU - Payne, Philip R.O.
AU - Fuentes, Rafael
AU - Jawa, Randeep
AU - Erwin-Cohen, Rebecca
AU - Patel, Rena
AU - Moffitt, Richard A.
AU - Zhu, Richard L.
AU - Kamaleswaran, Rishi
AU - Hurley, Robert
AU - Miller, Robert T.
AU - Pyarajan, Saiju
AU - Michael, Sam G.
AU - Bozzette, Samuel
AU - Mallipattu, Sandeep
AU - Vedula, Satyanarayana
AU - Chapman, Scott
AU - O'Neil, Shawn T.
AU - Setoguchi, Soko
AU - Hong, Stephanie S.
AU - Johnson, Steve
AU - Bennett, Tellen D.
AU - Callahan, Tiffany
AU - Topaloglu, Umit
AU - Sheikh, Usman
AU - Gordon, Valery
AU - Subbian, Vignesh
AU - Hernandez, Wenndy
AU - Beasley, Will
AU - Cooper, Will
AU - Hillegass, William
AU - Zhang, Xiaohan Tanner
N1 - Publisher Copyright: © The Author(s) 2024. Published by Oxford University Press on behalf of the American Medical Informatics Association.
PY - 2024/12/1
Y1 - 2024/12/1
N2 - Objectives: We propose and validate a domain knowledge-driven classification model for diagnosing post-acute sequelae of SARS-CoV-2 infection (PASC), also known as Long COVID, using electronic health record (EHR) data. Materials and Methods: We developed a robust model that incorporates features strongly indicative of PASC or associated with the severity of COVID-19 symptoms, as identified in our literature review. The XGBoost tree-based architecture was chosen for its ability to handle class-imbalanced data and its potential for high interpretability. Using the training data provided by the Long COVID Computation Challenge (L3C), which was a sample of the National COVID Cohort Collaborative (N3C), our models were fine-tuned and calibrated to optimize the area under the receiver operating characteristic curve (AUROC) and the F1 score, following best practices for the class-imbalanced N3C data. Results: Our age-stratified classification model demonstrated strong performance, with an average 5-fold cross-validated AUROC of 0.844 and F1 score of 0.539 across the young adult, mid-aged, and older-aged populations in the training data. In an independent testing dataset, which was made available after the challenge was over, we achieved an overall AUROC of 0.814 and F1 score of 0.545. Discussion: The results demonstrated the utility of knowledge-driven feature engineering on sparse EHR data, and of demographic stratification in model development, for diagnosing a complex and heterogeneously presenting condition like PASC. The model's architecture, mirroring natural clinician decision-making processes, contributed to its robustness and interpretability, which are crucial for clinical translatability. Further, the model's generalizability was evaluated on new cross-sectional data provided in the later stages of the L3C challenge. Conclusion: The study proposed age-stratified, tree-based classification models and validated their effectiveness for diagnosing PASC. Our approach highlights the potential of machine learning in addressing the diagnostic challenges posed by the heterogeneity of Long-COVID symptoms.
KW - Long COVID
KW - N3C
KW - PASC
KW - clinical decision model
KW - electronic health records
UR - http://www.scopus.com/inward/record.url?scp=85209120158&partnerID=8YFLogxK
U2 - 10.1093/jamiaopen/ooae111
DO - 10.1093/jamiaopen/ooae111
M3 - Article
C2 - 39524607
AN - SCOPUS:85209120158
SN - 2574-2531
VL - 7
JO - JAMIA Open
JF - JAMIA Open
IS - 4
M1 - ooae111
ER -