ABSTRACT

Taxonomic descriptions are the core output of systematics research and are critically important for key questions in biology, earth science and environmental science. These descriptions contain vast amounts of information about the morphological features of organisms on Earth, their geographic distribution and, for fossils, their geological history. Many of these data are not readily available to their many potential users because they are predominantly published as hard copy in systematics journals or monographs. Digitization of these descriptions would make them much more widely available, but doing this manually would be an enormous and unrealistic task. This chapter describes an alternative: automating the digitization of taxonomic descriptions using new techniques in computing science that exploit the high degree of structure and organization imposed by systematic convention and rigorous editorial procedures. The method involves parsing such partially structured text to generate XML tags around discrete sections of the text. Once the text is tagged, complex queries that were not possible with the untagged text can be run across the data, and the tagged text can more readily be imported into an existing database if required. A major bottleneck in the construction of biodiversity databases would therefore be overcome if the extensive data present in taxonomic descriptions could be extracted by computer rather than entered manually into database fields by human operators. The advantages of automating the data capture phase of biodiversity database development are numerous: the process is fast, flexible in terms of input and output data, accurate, and readily updated. Adopting this strategy would mean that computers do the boring, repetitive part of the process, for which they are ideally suited, freeing humans to devote more time and resources to the creative, analytical exploitation of these data.
Issues such as copyright and intellectual property need to be addressed, but these are well within the capabilities of the kinds of cyber-infrastructure being developed in computing science. The chapter also suggests that museums and other repositories of natural history collections should urgently review their policies on the publication of taxonomic descriptions based on specimens in their collections. In the digital world, it may well be that digitized data from collections-based research should be managed and maintained every bit as assiduously as the specimens themselves. An obvious way forward would be to adopt a twofold strategy: preparing XML templates for future taxonomic descriptions that allow synchronous publication and digital capture, and a separate phase of scanning and digitizing existing taxonomic monographs (many of which are full of relevant and beautifully illustrated taxonomic data).
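
To make the tagging idea concrete, the following is a minimal sketch in Python, not the chapter's actual implementation. It assumes that each section of a description begins on a new line with a conventional label followed by a full stop or colon. The label list, the element names, the function `tag_description`, the taxon name and the sample text are all hypothetical and chosen only for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical section labels; real descriptions follow journal-specific conventions.
SECTION_LABELS = ("Diagnosis", "Description", "Distribution", "Occurrence")


def tag_description(taxon_name: str, text: str) -> ET.Element:
    """Wrap each labelled section of a plain-text description in an XML element."""
    root = ET.Element("taxon", name=taxon_name)
    # A label counts only at the start of a line, followed by "." or ":".
    pattern = re.compile(r"^(%s)[.:]\s*" % "|".join(SECTION_LABELS), re.MULTILINE)
    matches = list(pattern.finditer(text))
    for i, match in enumerate(matches):
        # A section runs from the end of its label to the start of the next label.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        section = ET.SubElement(root, match.group(1).lower())
        section.text = " ".join(text[match.end():end].split())
    return root


# Invented sample text, not taken from any published description.
SAMPLE = """Diagnosis. Shell small, biconvex, with fine radial ribs.
Description. Ventral valve moderately convex; hinge line straight.
Occurrence. Silurian limestones."""

root = tag_description("Examplea hypothetica", SAMPLE)
print(ET.tostring(root, encoding="unicode"))

# Once tagged, a single section can be retrieved directly by name.
print(root.findtext("occurrence"))
```

The sketch relies on the regularity that systematic convention imposes on descriptions. Text that departs from the assumed labels would simply go untagged, so a practical parser would need a richer grammar and some error reporting.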