Text Mining in Archaeology: Extracting Information from Archaeological Reports

doi:10.1201/b18530-17

ABSTRACT

Archaeologists generate large quantities of text, ranging from unpublished technical fieldwork reports (the ‘grey literature’) to synthetic journal articles. However, the indexing and analysis of these documents can be time-consuming and lacks consistency when done by hand. It is also rarely integrated with the wider archaeological information domain, and bibliographic searches have to be undertaken independently of database queries. Text mining offers a means of extracting information from large volumes of text, providing researchers with an easy way of locating relevant texts and also of identifying patterns in the literature. In recent years, techniques of Natural Language Processing (NLP) and its subfield, Information Extraction (IE), have been adopted to allow researchers to find, compare and analyse relevant documents, and to link them to other types of data. This chapter introduces the underpinning mathematics and provides a short presentation of the algorithms used, from the point of view of artificial intelligence and computational logic. It describes the different NLP schools of thought and compares the pros and cons of rule-based vs. machine learning approaches to IE. The role of ontologies and named entity recognition is discussed and the chapter demonstrates how IE can provide the basis for semantic annotation and how it contributes to the construction of a semantic web for archaeology. The authors have worked on a number of projects that have employed techniques from NLP and IE in archaeology, including Archaeotools, STAR and STELLAR, and draw on these projects to discuss the problems and challenges, as well as the potential benefits, of employing text mining in the archaeological domain.