ABSTRACT

This chapter emphasizes the paradigm shift from Data Mining to Web Mining, driven by real-world demand for live data analysis that can uncover valuable, hidden, and potentially useful patterns in large amounts of data. The dataset derived in the previous chapter illustrates data distributed across different locations and serves as a first step toward web mining. As presented earlier, web mining is a methodology for extracting valuable patterns or knowledge from data that is distributed across remote locations or servers. A free shareware tool such as site-analyzer helps in building such a dataset for the purpose of web mining. As demonstrated, data from approximately 150 commercial websites has been collected and used for the analysis. The site-analyzer tool gathers useful metrics such as global score, web accessibility, design, texts, multimedia, and networking. The collected data is then stored in tabular form, parameter by parameter, for further web mining. Preprocessing techniques are then applied to remove unexpected and irrelevant data items. As presented in this chapter, a few commercial websites have data items that are hidden or not displayed, and these must be removed from the dataset. Where parameter values are reported as either ‘yes’ or ‘no’, ‘yes’ is replaced with 1 and ‘no’ with 0. Other parameters are handled similarly. Care is taken to ensure that the data for all parameters is free from error and completely preprocessed. As further demonstrated in this chapter, filtering techniques are needed to remove columns that play no significant role in the analysis. Together, these steps bring the data into a standard form that is ready for further processing and analysis using Weka.
The methodology reported in this chapter is unique and can be applied to any domain in which data originates from multiple remote sources.
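
The preprocessing steps summarized above (removing records with hidden values, filtering out unwanted columns, and replacing ‘yes’/‘no’ with 1/0) can be illustrated with a minimal Python sketch. The column names, the sample values, the markers used to detect hidden items, and the function name preprocess are all hypothetical; they are assumptions made for illustration and are not taken from the chapter's dataset or from the site-analyzer tool.

```python
# Minimal sketch of the preprocessing described in the abstract.
# All column names, sample values, and hidden-value markers are assumptions.

HIDDEN_MARKERS = {"", "hidden", "not displayed"}  # assumed markers for hidden items


def preprocess(rows, drop_columns=()):
    """Drop unwanted columns, skip rows with hidden values, encode yes/no as 1/0."""
    cleaned = []
    for row in rows:
        # Filtering: keep only the columns that matter for the analysis.
        kept = {key: value for key, value in row.items() if key not in drop_columns}

        # Remove records in which any remaining value is hidden or not displayed.
        if any(str(value).strip().lower() in HIDDEN_MARKERS for value in kept.values()):
            continue

        # Encode 'yes' as 1 and 'no' as 0; leave every other value unchanged.
        encoded = {}
        for key, value in kept.items():
            text = str(value).strip().lower()
            if text == "yes":
                encoded[key] = 1
            elif text == "no":
                encoded[key] = 0
            else:
                encoded[key] = value
        cleaned.append(encoded)
    return cleaned


# Hypothetical records, one per website.
rows = [
    {"site": "example-a.com", "global_score": 71.5, "has_sitemap": "yes", "notes": "x"},
    {"site": "example-b.com", "global_score": "hidden", "has_sitemap": "no", "notes": "y"},
    {"site": "example-c.com", "global_score": 64.0, "has_sitemap": "No", "notes": "z"},
]

result = preprocess(rows, drop_columns={"notes"})
for record in result:
    print(record)
```

In this sketch the unwanted columns are dropped before the hidden-value check, so a record is not discarded because of a column that would be filtered out anyway; this ordering is a design choice of the sketch, not something the chapter specifies. The cleaned records could then be written to a CSV file for loading into Weka.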