De-identifying Socioeconomic Data at the Census Tract Level for Medical Research Through Constraint-based Clustering

  • Yongtai Liu
  • Douglas Conway
  • Zhiyu Wan
  • Murat Kantarcioglu
  • Yevgeniy Vorobeychik
  • Bradley A. Malin

Research output: Contribution to journal, Article, peer-review

4 Scopus citations

Abstract

Numerous studies have shown that a person's health status is closely related to their socioeconomic status. Incorporating socioeconomic data associated with a patient's geographic area of residence into clinical datasets can therefore support medical research. However, most socioeconomic variables are unique in combination and are tied to small geographic regions (e.g., census tracts) that often contain fewer than 20,000 people. Thus, sharing such tract-level data can violate the Safe Harbor implementation of de-identification under the Health Insurance Portability and Accountability Act of 1996 (HIPAA). In this paper, we introduce a constraint-based k-means clustering approach to generate census tract-level socioeconomic data that is de-identification compliant. Our experimental analysis with data from the American Community Survey illustrates that the approach generates a protected dataset with high similarity to the unaltered values and achieves substantially better data utility than the HIPAA Safe Harbor recommendation of 3-digit ZIP codes.
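The abstract does not spell out the algorithm, so the following is only a rough sketch of the general idea and not the authors' method. It assumes that "constraint" means every cluster of tracts must cover more than 20,000 people, runs plain k-means on the socioeconomic features, merges any undersized cluster into its nearest neighbour in feature space, and then replaces each tract's values with a population-weighted cluster average. The function names, the merge heuristic, and the synthetic data are all hypothetical; geography and any other constraint the paper may enforce are ignored.

```python
import numpy as np


def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per row of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every tract to every center.
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels


def merge_small_clusters(X, pop, labels, min_pop=20_000):
    """Assumed repair step: fold the least populous cluster into its
    nearest neighbour until every cluster covers more than min_pop people."""
    labels = labels.copy()
    while True:
        ids = np.unique(labels)
        totals = np.array([pop[labels == c].sum() for c in ids])
        # Stop when the constraint holds, or when a single cluster is left
        # (in which case the constraint cannot be met for this dataset).
        if len(ids) == 1 or totals.min() > min_pop:
            return labels
        i = totals.argmin()
        centers = np.array([X[labels == c].mean(axis=0) for c in ids])
        dist = ((centers - centers[i]) ** 2).sum(axis=1)
        dist[i] = np.inf  # a cluster cannot merge with itself
        labels[labels == ids[i]] = ids[dist.argmin()]


def generalize(X, pop, labels):
    """Replace each tract's values with its cluster's weighted mean."""
    out = np.empty_like(X, dtype=float)
    for c in np.unique(labels):
        m = labels == c
        out[m] = np.average(X[m], axis=0, weights=pop[m])
    return out


# Synthetic stand-in data (not American Community Survey values):
# 200 "tracts", 3 standardized socioeconomic features each.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
pop = rng.integers(1_200, 8_000, size=200)

labels = merge_small_clusters(X, pop, kmeans(X, k=40))
protected = generalize(X, pop, labels)
```

Starting with a larger `k` keeps clusters small and the protected values closer to the originals, at the cost of more merge steps; a purpose-built constrained k-means would enforce the population bound during assignment rather than repairing it afterwards.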

Original language: English
Pages (from-to): 793-802
Number of pages: 10
Journal: AMIA ... Annual Symposium proceedings. AMIA Symposium
Volume: 2021
State: Published - 2021
