Abstract
Much of the microdata used for epidemiological studies contain sensitive measurements on real individuals. As a result, such microdata cannot be published out of privacy concerns, and without public access to these data, any statistical analyses originally published on them are nearly impossible to reproduce. To promote the dissemination of key datasets for analysis without jeopardizing the privacy of individuals, we introduce a cohesive Bayesian framework for the generation of fully synthetic high-dimensional microdatasets of mixed categorical, binary, count, and continuous variables. This process centers around a joint Bayesian model that is simultaneously compatible with all of these data types, enabling the creation of mixed synthetic datasets through posterior predictive sampling. Furthermore, a focal point of epidemiological data analysis is the study of conditional relationships between various exposures and key outcome variables through regression analysis. We design a modified data synthesis strategy to target and preserve these conditional relationships, including both nonlinearities and interactions. The proposed techniques are deployed to create a synthetic version of a confidential dataset containing dozens of health, cognitive, and social measurements on nearly 20,000 North Carolina children.
| Original language | English |
|---|---|
| Pages (from-to) | 2577-2602 |
| Number of pages | 26 |
| Journal | Annals of Applied Statistics |
| Volume | 16 |
| Issue number | 4 |
| State | Published - Dec 2022 |
Keywords
- Copula
- data privacy
- factor model
- nonparametric regression
Publication title: Bayesian Data Synthesis and the Utility-Risk Trade-Off for Mixed Epidemiological Data