TY - JOUR
T1 - Correlates of cannabis use disorder in the United States
T2 - A comparison of logistic regression, classification trees, and random forests
AU - Dell, Nathaniel A.
AU - Vaughn, Michael G.
AU - Prasad Srivastava, Sweta
AU - Alsolami, Abdulaziz
AU - Salas-Wright, Christopher P.
N1 - Publisher Copyright: © 2022 Elsevier Ltd
PY - 2022/7
Y1 - 2022/7
AB - Although several recent studies have examined psychosocial and demographic correlates of cannabis use disorder (CUD) in adults, few, if any, have evaluated the performance of machine learning methods relative to standard logistic regression for identifying correlates of CUD. The present study used pooled data from the 2015–2018 National Survey on Drug Use and Health to evaluate psychosocial and demographic correlates of CUD in adults. In addition, we compared the performance of logistic regression, classification trees, and random forest methods in classifying CUD. On the test data set, classification trees (AUC = 0.84, 95% CI: 0.82, 0.85) and random forests (AUC = 0.83, 95% CI: 0.82, 0.85) performed similarly to each other and better than logistic regression (AUC = 0.77, 95% CI: 0.74, 0.79). Results of the random forests indicate that marital status, risk propensity, age, and cocaine dependence variables contributed most to node purity, whereas model accuracy would decrease significantly if county type, income, race, and education variables were excluded from the model. Machine learning techniques are one possible approach to improving the efficiency, interpretability, and clinical insight of analyses of CUD correlates.
KW - Cannabis
KW - Cannabis use disorder
KW - Classification tree
KW - Machine learning
KW - Random forest
KW - Substance use disorders
UR - https://www.scopus.com/pages/publications/85130792177
U2 - 10.1016/j.jpsychires.2022.05.021
DO - 10.1016/j.jpsychires.2022.05.021
M3 - Article
C2 - 35636037
AN - SCOPUS:85130792177
SN - 0022-3956
VL - 151
SP - 590
EP - 597
JO - Journal of Psychiatric Research
JF - Journal of Psychiatric Research
ER -