TY - JOUR
T1 - Statistical analysis of zero-inflated nonnegative continuous data
T2 - A review
AU - Liu, Lei
AU - Shih, Ya Chen Tina
AU - Strawderman, Robert L.
AU - Zhang, Daowen
AU - Johnson, Bankole A.
AU - Chai, Haitao
N1 - Funding Information:
This research is partly supported by AHRQ R01 HS 020263, NIH/NCI R01 CA 85848, and NSF DMS-1308009. Dr. Liu is a consultant to Celladon, Zensun, and Outcome Research Solutions, Inc. We are grateful to Drs. Jinsong Chen, Xuelin Huang, and Mingyao Li for helpful discussions and comments. Wiley and Sage have granted copyright permission to reuse tables and figures in Section 8. Part of the content was presented in a short course at the 2016 Joint Statistical Meetings. The authors thank Gary Deyter, technical writer from the Department of Health Services Research at The University of Texas MD Anderson Cancer Center, for his editorial assistance.
Publisher Copyright:
© Institute of Mathematical Statistics, 2019.
PY - 2019/5/1
Y1 - 2019/5/1
AB - Zero-inflated nonnegative continuous (or semicontinuous) data arise frequently in biomedical, economic, and ecological studies. Examples include substance abuse, medical costs, medical care utilization, biomarkers (e.g., CD4 cell counts, coronary artery calcium scores), single-cell gene expression rates, and (relative) microbiome abundance. Such data are often characterized by a large proportion of zero values and by positive continuous values that are right-skewed and heteroscedastic. Both features suggest that no simple parametric distribution may be suitable for modeling this type of outcome. In this paper, we review statistical methods for analyzing zero-inflated nonnegative outcome data. We start with the cross-sectional setting, discussing ways to separate zero and positive values and introducing flexible models to characterize right skewness and heteroscedasticity in the positive values. We then present models for correlated zero-inflated nonnegative continuous data, using random effects to handle both the correlation among repeated measures from the same subject and the correlation across different parts of the model. We also discuss extensions to related topics, for example, zero-inflated count and survival data, nonlinear covariate effects, and joint models of longitudinal zero-inflated nonnegative continuous data and survival. Finally, we present applications to three real datasets (microbiome, medical costs, and alcohol drinking) to illustrate these methods. Example code is provided to facilitate applications of these methods.
KW - Cure rate
KW - Frailty model
KW - Health econometrics
KW - Joint model
KW - Semiparametric regression
KW - Splines
KW - Tobit model
KW - Two-part model
UR - http://www.scopus.com/inward/record.url?scp=85069883889&partnerID=8YFLogxK
U2 - 10.1214/18-STS681
DO - 10.1214/18-STS681
M3 - Article
AN - SCOPUS:85069883889
SN - 0883-4237
VL - 34
SP - 253
EP - 279
JO - Statistical Science
JF - Statistical Science
IS - 2
ER -