ABSTRACT

The problem of data clustering has been widely studied in the data mining and machine learning literature because of its numerous applications to summarization, learning, segmentation, and target marketing. In probabilistic models, the core idea is to model the data as samples drawn from a generative process. Generative models are among the most fundamental of all clustering methods because they attempt to understand the underlying process by which clusters are generated. Density- and grid-based methods are two closely related classes that explore the data space at a high level of granularity. The streaming scenario is particularly challenging for clustering algorithms because of the requirement of real-time analysis and the evolution and concept drift in the underlying data. Whereas streaming algorithms work under the assumption that the data are too large to be stored explicitly, the big data framework leverages advances in storage technology in order to actually store and process the data.