Web Data Extraction using Semantic Generators

doi:10.1201/b12168-11

ABSTRACT

Currently, the Web provides a huge amount of unstructured, non-semantic information to both users and automatic crawler programs. Although the Web is evolving toward sites with structured and semantic information, these new kinds of Web sites are meant to be deployed in business-to-business (B2B) scenarios. Therefore, most users (and Web applications) will continue to access data in HTML format [1]. To build systems able to access and extract the information stored in HTML sources, wrappers are commonly used. Wrappers are specialized programs that automatically extract data from documents and convert it into a structured format. A wrapper must implement three main functions. First, it must be able to download HTML pages from a website. Second, it must search for, recognize, and extract the specified data. Third, it has to save this data in a suitably structured format to enable further manipulation. XML is well suited to structuring this information, as many tools and technologies (such as XPath) can process it. Several research fields, such as Information Gathering, Information Extraction, and Web Mining, have been involved in the process of extracting and managing the information stored in the Web [5].
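
The three wrapper functions described above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not the implementation presented in this paper: the names `download`, `ListItemExtractor`, `to_xml`, and `run_wrapper` are hypothetical, the URL is a placeholder, and the download step is stubbed with a fixed HTML string so that the example runs without network access.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


def download(url: str) -> str:
    """Function 1: obtain the HTML source (stubbed; no real request is made)."""
    return "<html><body><ul><li>Alpha</li><li>Beta</li></ul></body></html>"


class ListItemExtractor(HTMLParser):
    """Function 2: search for, recognize, and extract the text of <li> items."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._in_li = False

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._in_li = True
            self.items.append("")

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_li = False

    def handle_data(self, data):
        # Text is collected only while the parser is inside a list item.
        if self._in_li:
            self.items[-1] += data


def to_xml(items):
    """Function 3: save the extracted data in a structured (XML) format."""
    root = ET.Element("items")
    for text in items:
        ET.SubElement(root, "item").text = text.strip()
    return root


def run_wrapper(url):
    extractor = ListItemExtractor()
    extractor.feed(download(url))
    return to_xml(extractor.items)


root = run_wrapper("http://example.org/list.html")  # placeholder URL
xml_text = ET.tostring(root, encoding="unicode")
# The XML output can then be queried with an XPath-style expression.
names = [element.text for element in root.findall("./item")]
```

Once the data is in XML, further manipulation no longer depends on the layout of the original page: `root.findall("./item")` uses the limited XPath subset supported by `xml.etree.ElementTree`.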

This paper presents both a general rule-based approach that can be used to automatically generate wrappers and a wrapper generator assistant (called WebMantic) that builds the wrapper. Our approach allows the creation of wrappers that obtain XML documents from HTML pages. We have defined a flexible filtering and preprocessing technique that translates the pieces of a page containing the desired information into a more understandable, semantic representation. Only the required portions of the page are translated; non-selected parts of the HTML document are ignored. Finally, we work under a reasonable limitation: only structured information, such as data stored in lists or tables, is considered.
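
As a rough illustration of rule-driven translation, the following Python sketch turns one selected HTML table into XML and ignores everything else on the page. It does not reproduce WebMantic or its rule format: the `RULE` dictionary, the `TableRule` class, the `translate` function, and the sample page are all assumptions made for this example, and the sketch handles only flat, well-formed tables whose rows are marked with `<tr>` and whose cells are marked with `<td>`.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical rule: which table to translate and how to name its contents.
RULE = {"table_id": "books", "root": "books", "record": "book",
        "fields": ["title", "author"]}

PAGE = """
<html><body>
<p>Navigation and adverts are not selected, so they are ignored.</p>
<table id="menu"><tr><td>Home</td><td>About</td></tr></table>
<table id="books">
  <tr><td>Dune</td><td>Frank Herbert</td></tr>
  <tr><td>Emma</td><td>Jane Austen</td></tr>
</table>
</body></html>
"""


class TableRule(HTMLParser):
    """Collects cell text only from the table selected by the rule."""

    def __init__(self, rule):
        super().__init__()
        self.rule = rule
        self.rows = []
        self._active = False  # True while inside the selected table
        self._cell = None     # text of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._active = dict(attrs).get("id") == self.rule["table_id"]
        elif self._active and tag == "tr":
            self.rows.append([])
        elif self._active and tag == "td":
            self._cell = ""

    def handle_endtag(self, tag):
        if tag == "table":
            self._active = False
        elif self._active and tag == "td" and self._cell is not None:
            self.rows[-1].append(self._cell.strip())
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell += data


def translate(page, rule):
    """Apply the rule to the page and build the semantic XML tree."""
    parser = TableRule(rule)
    parser.feed(page)
    root = ET.Element(rule["root"])
    for row in parser.rows:
        record = ET.SubElement(root, rule["record"])
        for name, value in zip(rule["fields"], row):
            ET.SubElement(record, name).text = value
    return root


result = translate(PAGE, RULE)
xml_text = ET.tostring(result, encoding="unicode")
```

In this sketch the rule gives each column a meaningful element name, so positional `<td>` cells become `<title>` and `<author>` elements, while the paragraph and the `menu` table never reach the output.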