TY - JOUR
T1 - Stream data clustering based on grid density and attraction
AU - Tu, Li
AU - Chen, Yixin
PY - 2009/7/1
Y1 - 2009/7/1
N2 - Clustering real-time stream data is an important and challenging problem. Existing algorithms such as CluStream are based on the k-means algorithm. These clustering algorithms have difficulties finding clusters of arbitrary shapes and handling outliers. Further, they require the knowledge of k and user-specified time window. To address these issues, this article proposes D-Stream, a framework for clustering stream data using a density-based approach. Our algorithm uses an online component that maps each input data record into a grid and an offline component that computes the grid density and clusters the grids based on the density. The algorithm adopts a density decaying technique to capture the dynamic changes of a data stream and a attraction-based mechanism to accurately generate cluster boundaries. Exploiting the intricate relationships among the decay factor, attraction, data density, and cluster structure, our algorithm can efficiently and effectively generate and adjust the clusters in real time. Further, a theoretically sound technique is developed to detect and remove sporadic grids mapped by outliers in order to dramatically improve the space and time efficiency of the system. The technique makes high-speed data stream clustering feasible without degrading the clustering quality. The experimental results show that our algorithm has superior quality and efficiency, can find clusters of arbitrary shapes, and can accurately recognize the evolving behaviors of real-time data streams.
AB - Clustering real-time stream data is an important and challenging problem. Existing algorithms such as CluStream are based on the k-means algorithm. These clustering algorithms have difficulties finding clusters of arbitrary shapes and handling outliers. Further, they require the knowledge of k and user-specified time window. To address these issues, this article proposes D-Stream, a framework for clustering stream data using a density-based approach. Our algorithm uses an online component that maps each input data record into a grid and an offline component that computes the grid density and clusters the grids based on the density. The algorithm adopts a density decaying technique to capture the dynamic changes of a data stream and a attraction-based mechanism to accurately generate cluster boundaries. Exploiting the intricate relationships among the decay factor, attraction, data density, and cluster structure, our algorithm can efficiently and effectively generate and adjust the clusters in real time. Further, a theoretically sound technique is developed to detect and remove sporadic grids mapped by outliers in order to dramatically improve the space and time efficiency of the system. The technique makes high-speed data stream clustering feasible without degrading the clustering quality. The experimental results show that our algorithm has superior quality and efficiency, can find clusters of arbitrary shapes, and can accurately recognize the evolving behaviors of real-time data streams.
KW - Clustering
KW - Data mining
KW - Density-based algorithms
KW - Stream data
UR - https://www.scopus.com/pages/publications/73349087457
U2 - 10.1145/1552303.1552305
DO - 10.1145/1552303.1552305
M3 - Article
AN - SCOPUS:73349087457
SN - 1556-4681
VL - 3
JO - ACM Transactions on Knowledge Discovery from Data
JF - ACM Transactions on Knowledge Discovery from Data
IS - 3
M1 - 12
ER -