Density-based clustering for real-time stream data

  • Yixin Chen
  • Li Tu

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-reviewed

516 Scopus citations

Abstract

Existing data-stream clustering algorithms such as CluStream are based on k-means. These algorithms are unable to find clusters of arbitrary shapes and cannot handle outliers. Further, they require knowledge of k and a user-specified time window. To address these issues, this paper proposes D-Stream, a framework for clustering stream data using a density-based approach. The algorithm uses an online component, which maps each input data record into a grid, and an offline component, which computes the grid density and clusters the grids based on that density. The algorithm adopts a density decaying technique to capture the dynamic changes of a data stream. Exploiting the intricate relationships between the decay factor, data density, and cluster structure, our algorithm can efficiently and effectively generate and adjust the clusters in real time. Further, a theoretically sound technique is developed to detect and remove sporadic grids mapped to by outliers, in order to dramatically improve the space and time efficiency of the system. The technique makes high-speed data stream clustering feasible without degrading the clustering quality. The experimental results show that our algorithm has superior quality and efficiency, can find clusters of arbitrary shapes, and can accurately recognize the evolving behaviors of real-time data streams.
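
The online/offline split described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `DecayingGrid`, the parameters `cell_width`, `decay`, and `threshold`, the exponential form of the decay, and the rule that two dense cells are neighbors when they differ by one step along a single dimension are all assumptions made for illustration. Sporadic-grid removal is not covered.

```python
import math


class DecayingGrid:
    """Illustrative grid with decaying densities (hypothetical, not the paper's code)."""

    def __init__(self, cell_width=1.0, decay=0.9):
        self.cell_width = cell_width  # assumed uniform cell size in every dimension
        self.decay = decay            # assumed decay factor in (0, 1)
        self.cells = {}               # cell index -> (density, time of last update)

    def cell_of(self, point):
        # Online step: map a record to the index of the grid cell containing it.
        return tuple(math.floor(x / self.cell_width) for x in point)

    def density(self, cell, now):
        # Density stored at the last update, decayed to the current time.
        if cell not in self.cells:
            return 0.0
        value, last = self.cells[cell]
        return value * self.decay ** (now - last)

    def insert(self, point, now):
        # Decay the old density, then add 1 for the new record.
        cell = self.cell_of(point)
        self.cells[cell] = (self.density(cell, now) + 1.0, now)
        return cell

    def clusters(self, now, threshold):
        # Offline step: group dense cells into connected components.
        dense = {c for c in self.cells if self.density(c, now) >= threshold}
        seen, result = set(), []
        for start in dense:
            if start in seen:
                continue
            seen.add(start)
            stack, component = [start], set()
            while stack:
                cell = stack.pop()
                component.add(cell)
                for i in range(len(cell)):
                    for step in (-1, 1):
                        nb = cell[:i] + (cell[i] + step,) + cell[i + 1:]
                        if nb in dense and nb not in seen:
                            seen.add(nb)
                            stack.append(nb)
            result.append(component)
        return result


grid = DecayingGrid(cell_width=1.0, decay=0.5)
grid.insert((0.2, 0.3), now=0)
grid.insert((0.7, 0.9), now=1)   # same cell as the first record
grid.insert((1.5, 0.5), now=2)   # adjacent cell
grid.insert((5.5, 5.5), now=2)   # isolated cell
print(grid.clusters(now=2, threshold=0.7))
```

Because densities are decayed lazily, at lookup or update time, the online step touches only the one cell a record falls into; the more expensive grouping runs only when clusters are requested.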

Original language: English
Title of host publication: KDD-2007
Subtitle of host publication: Proceedings of the Thirteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Pages: 133-142
Number of pages: 10
State: Published - 2007
Event: KDD-2007: 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - San Jose, CA, United States
Duration: Aug 12 2007 to Aug 15 2007

Publication series

Name: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

Conference

Conference: KDD-2007: 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Country/Territory: United States
City: San Jose, CA
Period: 08/12/07 to 08/15/07

Keywords

  • D-stream
  • Density-based clustering
  • Sporadic grids
  • Stream data mining
