ABSTRACT

This chapter describes the notion of data mining in depth. Data mining has traditionally been used to extract hidden, potentially useful, and valuable information from very large amounts of data; state-of-the-art data mining tools can handle high-dimensional, heterogeneous, and complex data, including so-called non-traditional data. Earlier, such data was available in simple form and stored in a central place; it was mined by issuing queries to discover hidden patterns and, in turn, derive knowledge. In the current scenario, traditional techniques cannot cater to big multidimensional data, since such data is too large, is distributed across different places, and is heterogeneous. Moreover, real-world data is usually dirty, so it must be converted into quality data using data preprocessing techniques, which include data cleaning, data integration, data selection, and data transformation. After preprocessing, data mining techniques can be applied to the cleaned data in order to extract hidden knowledge. Data mining strategies include classification, clustering, association, prediction, and estimation. Classification techniques, such as decision trees, Naive Bayes, support vector machines (SVM), and artificial neural networks (ANN), assign the records of a given dataset to two or more distinct classes. Clustering techniques, such as k-means, group the given data into clusters so that items within a cluster are similar and items in different clusters are dissimilar. Association rule mining finds links among items in the data. Potential applications of data mining include financial data analysis, the retail industry, the telecommunication industry, biological data analysis, and other scientific applications. Recent trends in data mining described in this chapter include visual data mining, distributed data mining, web mining, and graph mining. Issues related to classification include data cleaning, relevance analysis, and data transformation and reduction. In the present chapter, we also describe Weka, open-source software developed at the University of Waikato in New Zealand, which is used for the analysis of the collected data and which readers can apply in big data analytics applications such as the one presented in this book.
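
As a concrete illustration of the Weka-based workflow summarized above, the following is a minimal sketch (not taken from the chapter itself) of loading a dataset, building a J48 decision-tree classifier, and running k-means clustering through the Weka Java API (version 3.8 assumed). The file name iris.arff, the assumption that the class label is the last attribute, and the choice of three clusters are placeholders; any ARFF dataset with a nominal class attribute would serve equally well.

    import java.util.Random;

    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.clusterers.SimpleKMeans;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class WekaSketch {
        public static void main(String[] args) throws Exception {
            // Load an ARFF dataset; "iris.arff" is a placeholder file name.
            Instances data = new DataSource("iris.arff").getDataSet();
            // Assume the class label is the last attribute.
            data.setClassIndex(data.numAttributes() - 1);

            // Classification: build a J48 decision tree and evaluate it
            // with 10-fold cross-validation.
            J48 tree = new J48();
            tree.buildClassifier(data);
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(tree, data, 10, new Random(1));
            System.out.println(eval.toSummaryString("J48 results:\n", false));

            // Clustering: drop the class attribute, then run k-means
            // with k = 3 (an assumed value).
            Instances unlabeled = new Instances(data);
            unlabeled.setClassIndex(-1);
            unlabeled.deleteAttributeAt(unlabeled.numAttributes() - 1);
            SimpleKMeans kMeans = new SimpleKMeans();
            kMeans.setNumClusters(3);
            kMeans.buildClusterer(unlabeled);
            System.out.println(kMeans);
        }
    }

The same steps can be carried out interactively in the Weka Explorer GUI; the sketch above simply shows how the preprocessing, classification, and clustering stages discussed in this chapter map onto API calls.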