TY - JOUR
T1 - The National COVID Cohort Collaborative
T2 - Analyses of original and computationally derived electronic health record data
AU - Foraker, Randi
AU - Guo, Aixia
AU - Thomas, Jason
AU - Zamstein, Noa
AU - Payne, Philip R.O.
AU - Wilcox, Adam
N1 - Publisher Copyright:
©Randi Foraker, Aixia Guo, Jason Thomas, Noa Zamstein, Philip RO Payne, Adam Wilcox, N3C Collaborative.
PY - 2021/10
Y1 - 2021/10
N2 - Background: Computationally derived (“synthetic”) data can enable the creation and analysis of clinical, laboratory, and diagnostic data as if they were the original electronic health record data. Synthetic data can support data sharing to answer critical research questions to address the COVID-19 pandemic. Objective: We aim to compare the results from analyses of synthetic data to those from original data and assess the strengths and limitations of leveraging computationally derived data for research purposes. Methods: We used the National COVID Cohort Collaborative’s instance of MDClone, a big data platform with data-synthesizing capabilities (MDClone Ltd). We downloaded electronic health record data from 34 National COVID Cohort Collaborative institutional partners and tested three use cases, including (1) exploring the distributions of key features of the COVID-19–positive cohort; (2) training and testing predictive models for assessing the risk of admission among these patients; and (3) determining geospatial and temporal COVID-19–related measures and outcomes, and constructing their epidemic curves. We compared the results from synthetic data to those from original data using traditional statistics, machine learning approaches, and temporal and spatial representations of the data. Results: For each use case, the results of the synthetic data analyses successfully mimicked those of the original data such that the distributions of the data were similar and the predictive models demonstrated comparable performance. Although the synthetic and original data yielded overall nearly the same results, there were exceptions that included an odds ratio on either side of the null in multivariable analyses (0.97 vs 1.01) and differences in the magnitude of epidemic curves constructed for zip codes with low population counts. Conclusions: This paper presents the results of each use case and outlines key considerations for the use of synthetic data, examining their role in collaborative research for faster insights.
KW - COVID-19
KW - Data analysis
KW - Electronic health records and systems
KW - Protected health information
KW - Synthetic data
UR - http://www.scopus.com/inward/record.url?scp=85116545156&partnerID=8YFLogxK
U2 - 10.2196/30697
DO - 10.2196/30697
M3 - Article
C2 - 34559671
AN - SCOPUS:85116545156
SN - 1438-8871
VL - 23
JO - Journal of Medical Internet Research
JF - Journal of Medical Internet Research
IS - 10
M1 - e30697
ER -