TY - GEN
T1 - MDL hierarchical clustering for stemmatology
AU - Lai, Po Hsiang
AU - Roos, Teemu
AU - O'Sullivan, Joseph A.
PY - 2010
Y1 - 2010
N2 - In real life, one often encounters situations where one needs to infer a structural relationship among data points based on an incomplete dataset. Stemmatology and phylogenetics are two classes of such problems where partial text scripts or genome sequences are available and the goal is to reconstruct the copying history of scripts or evolutionary relations among species. In this paper, we study the potential applications of minimum description length (MDL) concepts to the structural inference problem, particularly focusing on stemmatology where in addition to missing data points, the available data points have missing values. We offer new insights on how to handle these issues, especially missing values. We develop a general algorithm based on MDL insights that is simple to implement and can be used along with other existing algorithms, and propose a generic MDL encoder with minimal assumptions made about the data. In simulations, our method performs reasonably well on a simple dataset and outperforms major existing methods in a larger and much more realistic dataset. We discuss directions and ongoing efforts to further improve performance.
AB - In real life, one often encounters situations where one needs to infer a structural relationship among data points based on an incomplete dataset. Stemmatology and phylogenetics are two classes of such problems where partial text scripts or genome sequences are available and the goal is to reconstruct the copying history of scripts or evolutionary relations among species. In this paper, we study the potential applications of minimum description length (MDL) concepts to the structural inference problem, particularly focusing on stemmatology where in addition to missing data points, the available data points have missing values. We offer new insights on how to handle these issues, especially missing values. We develop a general algorithm based on MDL insights that is simple to implement and can be used along with other existing algorithms, and propose a generic MDL encoder with minimal assumptions made about the data. In simulations, our method performs reasonably well on a simple dataset and outperforms major existing methods in a larger and much more realistic dataset. We discuss directions and ongoing efforts to further improve performance.
UR - https://www.scopus.com/pages/publications/77955699041
U2 - 10.1109/ISIT.2010.5513627
DO - 10.1109/ISIT.2010.5513627
M3 - Conference contribution
AN - SCOPUS:77955699041
SN - 9781424469604
T3 - IEEE International Symposium on Information Theory - Proceedings
SP - 1403
EP - 1407
BT - 2010 IEEE International Symposium on Information Theory, ISIT 2010 - Proceedings
T2 - 2010 IEEE International Symposium on Information Theory, ISIT 2010
Y2 - 13 June 2010 through 18 June 2010
ER -