TY - JOUR
T1 - Quality assessment and refinement of chromatin accessibility data using a sequence-based predictive model
AU - Han, Seong Kyu
AU - Muto, Yoshiharu
AU - Wilson, Parker C.
AU - Humphreys, Benjamin D.
AU - Sampson, Matthew G.
AU - Chakravarti, Aravinda
AU - Lee, Dongwon
N1 - Publisher Copyright: Copyright © 2022 the Author(s). Published by PNAS.
PY - 2022/12/20
Y1 - 2022/12/20
N2 - Chromatin accessibility assays are central to the genome-wide identification of gene regulatory elements associated with transcriptional regulation. However, the data have highly variable quality arising from several biological and technical factors. To surmount this problem, we developed a sequence-based machine learning method to evaluate and refine chromatin accessibility data. Our framework, gapped k-mer SVM quality check (gkmQC), provides the quality metrics for a sample based on the prediction accuracy of the trained models. We tested 886 DNase-seq samples from the ENCODE/Roadmap projects to demonstrate that gkmQC can effectively identify “high-quality” (HQ) samples with low conventional quality scores owing to marginal read depths. Peaks identified in HQ samples are more accurately aligned at functional regulatory elements, show greater enrichment of regulatory elements harboring functional variants, and explain greater heritability of phenotypes from their relevant tissues. Moreover, gkmQC can optimize the peak-calling threshold to identify additional peaks, especially for rare cell types in single-cell chromatin accessibility data.
KW - chromatin accessibility
KW - gkmQC
KW - quality control
KW - sequence-based model
UR - http://www.scopus.com/inward/record.url?scp=85143994232&partnerID=8YFLogxK
U2 - 10.1073/pnas.2212810119
DO - 10.1073/pnas.2212810119
M3 - Article
C2 - 36508674
AN - SCOPUS:85143994232
SN - 0027-8424
VL - 119
JO - Proceedings of the National Academy of Sciences of the United States of America
JF - Proceedings of the National Academy of Sciences of the United States of America
IS - 51
M1 - e2212810119
ER -