BAYESIAN DATA SYNTHESIS AND THE UTILITY-RISK TRADE-OFF FOR MIXED EPIDEMIOLOGICAL DATA

  • Joseph Feldman
  • Daniel R. Kowal

    Research output: Contribution to journal › Article › peer-review

    Abstract

    Much of the microdata used for epidemiological studies contain sensitive measurements on real individuals. As a result, such microdata cannot be published out of privacy concerns, and without public access to these data, any statistical analyses originally published on them are nearly impossible to reproduce. To promote the dissemination of key datasets for analysis without jeopardizing the privacy of individuals, we introduce a cohesive Bayesian framework for the generation of fully synthetic high-dimensional microdatasets of mixed categorical, binary, count, and continuous variables. This process centers around a joint Bayesian model that is simultaneously compatible with all of these data types, enabling the creation of mixed synthetic datasets through posterior predictive sampling. Furthermore, a focal point of epidemiological data analysis is the study of conditional relationships between various exposures and key outcome variables through regression analysis. We design a modified data synthesis strategy to target and preserve these conditional relationships, including both nonlinearities and interactions. The proposed techniques are deployed to create a synthetic version of a confidential dataset containing dozens of health, cognitive, and social measurements on nearly 20,000 North Carolina children.

    Original language: English
    Pages (from-to): 2577-2602
    Number of pages: 26
    Journal: Annals of Applied Statistics
    Volume: 16
    Issue number: 4
    State: Published - Dec 2022

    Keywords

    • Copula
    • data privacy
    • factor model
    • nonparametric regression
