TY - JOUR
T1 - Predicting Long COVID in the National COVID Cohort Collaborative Using Super Learner
T2 - Cohort Study
AU - The National COVID Cohort Collaborative (N3C) Consortium
AU - Butzin-Dozier, Zachary
AU - Ji, Yunwen
AU - Li, Haodong
AU - Coyle, Jeremy
AU - Shi, Junming
AU - Phillips, Rachael V.
AU - Mertens, Andrew N.
AU - Pirracchio, Romain
AU - van der Laan, Mark J.
AU - Patel, Rena C.
AU - Colford, John M.
AU - Hubbard, Alan E.
AU - Wilcox, Adam B.
AU - Lee, Adam M.
AU - Graves, Alexis
AU - Anzalone, Alfred (Jerrod)
AU - Manna, Amin
AU - Saha, Amit
AU - Olex, Amy
AU - Zhou, Andrea
AU - Williams, Andrew E.
AU - Southerland, Andrew
AU - Girvin, Andrew T.
AU - Walden, Anita
AU - Sharathkumar, Anjali A.
AU - Amor, Benjamin
AU - Bates, Benjamin
AU - Hendricks, Brian
AU - Patel, Brijesh
AU - Alexander, Caleb
AU - Bramante, Carolyn
AU - Ward-Caviness, Cavin
AU - Madlock-Brown, Charisse
AU - Suver, Christine
AU - Chute, Christopher G.
AU - Dillon, Christopher
AU - Wu, Chunlei
AU - Schmitt, Clare
AU - Takemoto, Cliff
AU - Housman, Dan
AU - Gabriel, Davera
AU - Eichmann, David A.
AU - Mazzotti, Diego
AU - Brown, Don
AU - Boudreau, Eilis
AU - Hill, Elaine
AU - Zampino, Elizabeth
AU - Marti, Emily Carlson
AU - Pfaff, Emily R.
AU - French, Evan
AU - Koraishy, Farrukh M.
AU - Mariona, Federico
AU - Prior, Fred
AU - Sokos, George
AU - Martin, Greg
AU - Lehmann, Harold
AU - Spratt, Heidi
AU - Mehta, Hemalkumar
AU - Liu, Hongfang
AU - Sidky, Hythem
AU - Hayanga, J. W. Awori
AU - Pincavitch, Jami
AU - Clark, Jaylyn
AU - Harper, Jeremy Richard
AU - Islam, Jessica
AU - Ge, Jin
AU - Gagnier, Joel
AU - Saltz, Joel H.
AU - Saltz, Joel
AU - Loomba, Johanna
AU - Buse, John
AU - Mathew, Jomol
AU - Rutter, Joni L.
AU - McMurry, Julie A.
AU - Guinney, Justin
AU - Starren, Justin
AU - Crowley, Karen
AU - Bradwell, Katie Rebecca
AU - Walters, Kellie M.
AU - Wilkins, Ken
AU - Gersing, Kenneth R.
AU - Cato, Kenrick Dwain
AU - Murray, Kimberly
AU - Kostka, Kristin
AU - Northington, Lavance
AU - Pyles, Lee Allan
AU - Misquitta, Leonie
AU - Cottrell, Lesley
AU - Portilla, Lili
AU - Deacy, Mariam
AU - Bissell, Mark M.
AU - Clark, Marshall
AU - Emmett, Mary
AU - Saltz, Mary Morrison
AU - Palchuk, Matvey B.
AU - Haendel, Melissa A.
AU - Adams, Meredith
AU - Temple-O'Connor, Meredith
AU - Kurilla, Michael G.
AU - Morris, Michele
AU - Qureshi, Nabeel
AU - Safdar, Nasia
AU - Garbarini, Nicole
AU - Sharafeldin, Noha
AU - Sadan, Ofer
AU - Francis, Patricia A.
AU - Burgoon, Penny Wung
AU - Robinson, Peter
AU - Payne, Philip R. O.
AU - Fuentes, Rafael
AU - Jawa, Randeep
AU - Erwin-Cohen, Rebecca
AU - Patel, Rena
AU - Moffitt, Richard A.
AU - Zhu, Richard L.
AU - Kamaleswaran, Rishi
AU - Hurley, Robert
AU - Miller, Robert T.
AU - Pyarajan, Saiju
AU - Michael, Sam G.
AU - Bozzette, Samuel
AU - Mallipattu, Sandeep
AU - Vedula, Satyanarayana
AU - Chapman, Scott
AU - O'Neil, Shawn T.
AU - Setoguchi, Soko
AU - Hong, Stephanie S.
AU - Johnson, Steve
AU - Bennett, Tellen D.
AU - Callahan, Tiffany
AU - Topaloglu, Umit
AU - Sheikh, Usman
AU - Gordon, Valery
AU - Subbian, Vignesh
AU - Kibbe, Warren A.
AU - Hernandez, Wenndy
AU - Beasley, Will
AU - Cooper, Will
AU - Hillegass, William
AU - Zhang, Xiaohan Tanner
N1 - Publisher Copyright:
©Zachary Butzin-Dozier, Yunwen Ji, Haodong Li, Jeremy Coyle, Junming Shi, Rachael V Phillips, Andrew N Mertens, Romain Pirracchio, Mark J van der Laan, Rena C Patel, John M Colford, Alan E Hubbard, The National COVID Cohort Collaborative (N3C) Consortium.
PY - 2024
Y1 - 2024
N2 - Background: Postacute sequelae of COVID-19 (PASC), also known as long COVID, is a broad grouping of a range of long-term symptoms following acute COVID-19. These symptoms can occur across a range of biological systems, leading to challenges in determining risk factors for PASC and the causal etiology of this disorder. An understanding of characteristics that are predictive of future PASC is valuable, as this can inform the identification of high-risk individuals and future preventative efforts. However, current knowledge regarding PASC risk factors is limited. Objective: Using a sample of 55,257 patients (at a ratio of 1 patient with PASC to 4 matched controls) from the National COVID Cohort Collaborative, as part of the National Institutes of Health Long COVID Computational Challenge, we sought to predict individual risk of PASC diagnosis from a curated set of clinically informed covariates. The National COVID Cohort Collaborative includes electronic health records for more than 22 million patients from 84 sites across the United States. Methods: We predicted individual PASC status, given covariate information, using Super Learner (an ensemble machine learning algorithm also known as stacking) to learn the optimal combination of gradient boosting and random forest algorithms to maximize the area under the receiver operator curve. We evaluated variable importance (Shapley values) based on 3 levels: individual features, temporal windows, and clinical domains. We externally validated these findings using a holdout set of randomly selected study sites. Results: We were able to predict individual PASC diagnoses accurately (area under the curve 0.874). The individual features of the length of observation period, number of health care interactions during acute COVID-19, and viral lower respiratory infection were the most predictive of subsequent PASC diagnosis. 
Temporally, we found that baseline characteristics were the most predictive of future PASC diagnosis, compared with characteristics immediately before, during, or after acute COVID-19. We found that the clinical domains of health care use, demographics or anthropometry, and respiratory factors were the most predictive of PASC diagnosis. Conclusions: The methods outlined here provide an open-source, applied example of using Super Learner to predict PASC status using electronic health record data, which can be replicated across a variety of settings. Across individual predictors and clinical domains, we consistently found that factors related to health care use were the strongest predictors of PASC diagnosis. This indicates that any observational studies using PASC diagnosis as a primary outcome must rigorously account for heterogeneous health care use. Our temporal findings support the hypothesis that clinicians may be able to accurately assess the risk of PASC in patients before acute COVID-19 diagnosis, which could improve early interventions and preventive care. Our findings also highlight the importance of respiratory characteristics in PASC risk assessment.
KW - chronic
KW - covariate
KW - covariates
KW - COVID-19
KW - ensemble
KW - infectious
KW - long COVID
KW - long term
KW - machine learning
KW - predict
KW - prediction
KW - predictions
KW - predictive
KW - respiratory
KW - risk
KW - risks
KW - SARS-CoV-2
KW - sequelae
KW - stacking
KW - Super Learner
UR - http://www.scopus.com/inward/record.url?scp=85201348130&partnerID=8YFLogxK
U2 - 10.2196/53322
DO - 10.2196/53322
M3 - Article
C2 - 39146534
AN - SCOPUS:85201348130
SN - 2369-2960
VL - 10
JO - JMIR Public Health and Surveillance
JF - JMIR Public Health and Surveillance
M1 - e53322
ER -