TY - GEN
T1 - Discovering significant OPSM subspace clusters in massive gene expression data
AU - Gao, Byron J.
AU - Griffith, Obi L.
AU - Ester, Martin
AU - Jones, Steven J.M.
PY - 2006
Y1 - 2006
N2 - Order-preserving submatrixes (OPSMs) have been accepted as a biologically meaningful subspace cluster model, capturing the general tendency of gene expressions across a subset of conditions. In an OPSM, the expression levels of all genes induce the same linear ordering of the conditions. OPSM mining is reducible to a special case of the sequential pattern mining problem, in which a pattern and its supporting sequences uniquely specify an OPSM cluster. Small twig clusters, specified by long patterns with naturally low support, incur explosive computational costs and would be completely pruned off by most existing methods for massive datasets containing thousands of conditions and hundreds of thousands of genes, which are common in today's gene expression analysis. However, it is of particular interest to biologists to reveal such small groups of genes that are tightly coregulated under many conditions, and some pathways or processes might require only two genes to act in concert. In this paper, we introduce the KiWi mining framework for massive datasets, which exploits two parameters k and w to provide biased testing on a bounded number of candidates, substantially reducing the search space and problem scale and targeting highly promising seeds that lead to significant clusters and twig clusters. Extensive biological and computational evaluations on real datasets demonstrate that KiWi can effectively mine biologically meaningful OPSM subspace clusters with good efficiency and scalability.
AB - Order-preserving submatrixes (OPSMs) have been accepted as a biologically meaningful subspace cluster model, capturing the general tendency of gene expressions across a subset of conditions. In an OPSM, the expression levels of all genes induce the same linear ordering of the conditions. OPSM mining is reducible to a special case of the sequential pattern mining problem, in which a pattern and its supporting sequences uniquely specify an OPSM cluster. Small twig clusters, specified by long patterns with naturally low support, incur explosive computational costs and would be completely pruned off by most existing methods for massive datasets containing thousands of conditions and hundreds of thousands of genes, which are common in today's gene expression analysis. However, it is of particular interest to biologists to reveal such small groups of genes that are tightly coregulated under many conditions, and some pathways or processes might require only two genes to act in concert. In this paper, we introduce the KiWi mining framework for massive datasets, which exploits two parameters k and w to provide biased testing on a bounded number of candidates, substantially reducing the search space and problem scale and targeting highly promising seeds that lead to significant clusters and twig clusters. Extensive biological and computational evaluations on real datasets demonstrate that KiWi can effectively mine biologically meaningful OPSM subspace clusters with good efficiency and scalability.
KW - Gene expression data
KW - Order-preserving submatrix
KW - Scalability
KW - Sub-space clustering
KW - Twig cluster
UR - http://www.scopus.com/inward/record.url?scp=33749583437&partnerID=8YFLogxK
U2 - 10.1145/1150402.1150529
DO - 10.1145/1150402.1150529
M3 - Conference contribution
AN - SCOPUS:33749583437
SN - 1595933395
SN - 9781595933393
T3 - Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
SP - 922
EP - 928
BT - KDD 2006
PB - Association for Computing Machinery (ACM)
T2 - KDD 2006: 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Y2 - 20 August 2006 through 23 August 2006
ER -