TY - JOUR
T1 - Random forest fishing
T2 - A novel approach to identifying organic group of risk factors in genome-wide association studies
AU - Yang, Wei
AU - Gu, C. Charles
N1 - Funding Information: This research is supported in part by NIH grants HL091028, HL071782, DA012854, and DA027995.
PY - 2014/2
Y1 - 2014/2
N2 - Genome-wide association studies (GWAS) have brought methodological challenges in handling massive high-dimensional data and also real opportunities for studying the joint effect of many risk factors acting in concert as an organic group. The random forest (RF) methodology is recognized by many for its potential in examining interaction effects in large data sets. However, RF is not designed to directly handle GWAS data, which typically have hundreds of thousands of single-nucleotide polymorphisms as predictor variables. We propose and evaluate a novel extension of RF, called random forest fishing (RFF), for GWAS analysis. RFF repeatedly updates a relatively small set of predictors obtained by RF tests to find globally important groups predictive of the disease phenotype, using a novel search algorithm based on genetic programming and simulated annealing. A key improvement of RFF results from the use of guidance incorporating empirical test results of genome-wide pairwise interactions. Evaluated using simulated and real GWAS data sets, RFF is shown to be effective in identifying important predictors, particularly when both marginal effects and interactions exist, and is applicable to very large GWAS data sets.
AB - Genome-wide association studies (GWAS) have brought methodological challenges in handling massive high-dimensional data and also real opportunities for studying the joint effect of many risk factors acting in concert as an organic group. The random forest (RF) methodology is recognized by many for its potential in examining interaction effects in large data sets. However, RF is not designed to directly handle GWAS data, which typically have hundreds of thousands of single-nucleotide polymorphisms as predictor variables. We propose and evaluate a novel extension of RF, called random forest fishing (RFF), for GWAS analysis. RFF repeatedly updates a relatively small set of predictors obtained by RF tests to find globally important groups predictive of the disease phenotype, using a novel search algorithm based on genetic programming and simulated annealing. A key improvement of RFF results from the use of guidance incorporating empirical test results of genome-wide pairwise interactions. Evaluated using simulated and real GWAS data sets, RFF is shown to be effective in identifying important predictors, particularly when both marginal effects and interactions exist, and is applicable to very large GWAS data sets.
KW - epistasis
KW - genetic algorithms
KW - genome-wide association
KW - interactions
KW - random forest
KW - statistical learning
UR - http://www.scopus.com/inward/record.url?scp=84892831770&partnerID=8YFLogxK
U2 - 10.1038/ejhg.2013.109
DO - 10.1038/ejhg.2013.109
M3 - Article
C2 - 23695277
AN - SCOPUS:84892831770
SN - 1018-4813
VL - 22
SP - 254
EP - 259
JO - European Journal of Human Genetics
JF - European Journal of Human Genetics
IS - 2
ER -