ABSTRACT

This chapter surveys the major approaches to deep web crawling, identifies the key challenges in the area, and outlines solutions to those challenges. The deep web is abstracted as a graph, and the crawling problem is modeled using random graph theory. We classify the deep web into several categories, each of which poses its own crawling challenges. In model M0, documents have a uniform probability of being captured (zero variation); in model Mh, documents have heterogeneous capture probabilities; and in model Mr, documents are ranked and only the top k are returned for each query. For each model, we delineate the cost of crawling and methods to improve crawling performance. The chapter serves as a reference for researchers and practitioners in deep web crawling.
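
One way to read the three capture models is as a minimal simulation sketch, shown below. The function names, the per-query matching process, and the static ranking scores in model Mr are illustrative assumptions for this sketch, not the chapter's formal definitions.

```python
import random

def crawl_m0(n_docs: int, p: float, n_queries: int) -> set[int]:
    # M0: every document is captured by each query with the same probability p.
    captured: set[int] = set()
    for _ in range(n_queries):
        captured.update(d for d in range(n_docs) if random.random() < p)
    return captured

def crawl_mh(probs: list[float], n_queries: int) -> set[int]:
    # Mh: document i has its own capture probability probs[i].
    captured: set[int] = set()
    for _ in range(n_queries):
        captured.update(i for i, p in enumerate(probs) if random.random() < p)
    return captured

def crawl_mr(probs: list[float], k: int, n_queries: int) -> set[int]:
    # Mr: a query matches documents as in Mh, but the search engine returns
    # only the top-k matches under a fixed ranking (here: random static scores,
    # an assumption of this sketch).
    scores = [random.random() for _ in probs]
    captured: set[int] = set()
    for _ in range(n_queries):
        matches = [i for i, p in enumerate(probs) if random.random() < p]
        matches.sort(key=lambda i: scores[i], reverse=True)
        captured.update(matches[:k])  # documents below rank k stay hidden
    return captured
```

Comparing the coverage of the three functions for the same number of queries illustrates the crawling cost each model implies: under Mr, low-ranked documents may never be captured no matter how many queries are issued, which is why the ranked model is the hardest to crawl.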