Abstract
Order-preserving submatrix (OPSM) has been widely accepted as a biologically meaningful cluster model, capturing the general tendency of gene expression across a subset of experiments. In an OPSM, the expression levels of all genes induce the same linear ordering of the experiments. The OPSM problem is to discover those statistically significant OPSMs from a given data matrix. The problem is reducible to a special case of the sequential pattern mining problem, where a pattern and its supporting sequences uniquely specify an OPSM. Unfortunately, existing methods do not scale well to massive data sets containing thousands of experiments and hundreds of thousands of genes, which are common in todays gene expression analysis. In particular, deep OPSMs, corresponding to long patterns with few supporting sequences, incur explosive computational costs in their discovery and are completely pruned off by existing methods. However, it is of particular interest of biologists to determine small groups of genes that are tightly coregulated across many experiments, and some pathways or processes may require as few as two genes to act in concert. In this paper, we study the discovery of deep OPSMs from massive data sets. We propose a novel best effort mining framework KiWi that exploits two parameters k and w to bound the available computational resources and search a selected search space, and does what it can to find as many as possible deep OPSMs. Extensive biological and computational evaluations on real data sets demonstrate the validity and importance of the deep OPSM problem, and the efficiency and effectiveness of the KiWi mining framework.
Original language | English |
---|---|
Article number | 5645633 |
Pages (from-to) | 309-325 |
Number of pages | 17 |
Journal | IEEE Transactions on Knowledge and Data Engineering |
Volume | 24 |
Issue number | 2 |
DOIs | |
State | Published - 2012 |
Keywords
- OPSM
- Order-preserving submatrix
- best effort
- data mining
- deep OPSM
- deep pattern
- gene expression analysis
- negative correlation
- pattern-based clustering
- scalability
- sequential pattern mining
- subspace clustering