TY - JOUR
T1 - Unsupervised machine learning and prognostic factors of survival in chronic lymphocytic leukemia
AU - Coombes, Caitlin E.
AU - Abrams, Zachary B.
AU - Li, Suli
AU - Abruzzo, Lynne V.
AU - Coombes, Kevin R.
N1 - Publisher Copyright: © The Author(s) 2020. Published by Oxford University Press on behalf of the American Medical Informatics Association.
PY - 2020/7/1
Y1 - 2020/7/1
N2 - Objective: Unsupervised machine learning approaches hold promise for large-scale clinical data. However, the heterogeneity of clinical data raises new methodological challenges in feature selection, choosing a distance metric that captures biological meaning, and visualization. We hypothesized that clustering could discover prognostic groups from patients with chronic lymphocytic leukemia, a disease that provides biological validation through well-understood outcomes. Methods: To address this challenge, we applied k-medoids clustering with 10 distance metrics to 2 experiments ("A" and "B") with mixed clinical features collapsed to binary vectors and visualized with both multidimensional scaling and t-stochastic neighbor embedding. To assess prognostic utility, we performed survival analysis using a Cox proportional hazard model, log-rank test, and Kaplan-Meier curves. Results: In both experiments, survival analysis revealed a statistically significant association between clusters and survival outcomes (A: overall survival, P = .0164; B: time from diagnosis to treatment, P = .0039). Multidimensional scaling separated clusters along a gradient mirroring the order of overall survival. Longer survival was associated with mutated immunoglobulin heavy-chain variable region gene (IGHV) status, absent ZAP70 expression, female sex, and younger age. Conclusions: This approach to mixed-type data handling and selection of distance metric captured well-understood, binary, prognostic markers in chronic lymphocytic leukemia (sex, IGHV mutation status, ZAP70 expression status) with high fidelity.
KW - chronic lymphocytic leukemia
KW - clinical informatics
KW - mixed-type data
KW - clustering
KW - unsupervised machine learning
UR - http://www.scopus.com/inward/record.url?scp=85088606882&partnerID=8YFLogxK
U2 - 10.1093/jamia/ocaa060
DO - 10.1093/jamia/ocaa060
M3 - Article
C2 - 32483590
AN - SCOPUS:85088606882
SN - 1067-5027
VL - 27
SP - 1019
EP - 1027
JO - Journal of the American Medical Informatics Association
JF - Journal of the American Medical Informatics Association
IS - 7
ER -