Classification and the Internet | 20 | v3 | Classification Made Simple

ABSTRACT

One could, for example, select ‘Reference’ from the above list. This section contains a number of subordinate categories, one of which is ‘Library and Information Sciences’. When this is selected, a further list is presented which includes ‘Knowledge Management’. From the list under ‘Knowledge Management’, ‘Knowledge Retrieval’ might be chosen and, from the list under ‘Knowledge Retrieval’, ‘Classification’. An examination of this search shows that the searcher is being guided through a hierarchical structure (indentation indicating the subordinate topics):

Libraries and Information Science
    Knowledge Management
        Knowledge Retrieval
            Classification

This categorisation can be extremely useful, for its ‘classificatory’ nature concentrates the search on the appropriate subject area. Compare this with a search on the keyword ‘Classification’ in the main Google home page which, currently (2009), produces over 100 million hits, many of which are irrelevant to the required topic!

Other search engines provide a similar categorisation facility. Yahoo,5 for instance, has a subject-based directory listing Websites in a wide range of topics, from arts, entertainment, society and culture, to science, education and health. If one were looking on Yahoo for Websites covering the programming language Visual Basic, ‘Computers & Internet’ would be selected from the top level. As with Google, the search could then progress through the hierarchical structure via subsidiary lists:

Computers & Internet
    Programming & Development (1171)
        Languages (968)
            Visual Basic (46)

The numbers in parentheses indicate the number of relevant sites. One of the sites listed under Visual Basic, for example, is VB Web (www.vbweb.co.uk), which presents downloads, tutorials, source code and news relating to this programming language.

Dodd (1996) is perhaps right to describe some of these attempts at categorisation as ‘semi-professional’. Although the hierarchical structures do support subject browsing, the ‘classification’ does not appear to be as systematic as that found in the more traditional, established schemes; cross-classification is frequently apparent, and a further possible disadvantage is that the structures have no notation. Whilst, as has been pointed out earlier in this text, a notation is not absolutely necessary for a system of classification to work, it can provide added value.

Use of conventional classification schemes

Is it possible, then, to utilise conventional schemes such as the Dewey Decimal Classification or the Library of Congress Classification for searching the Web? The answer, of course, is a qualified ‘Yes’, although services which make use of these schemes are usually non-commercial, much more restrictive than the search engines referred to above, and possibly experimental. In the early days of the Web, classification schemes such as UDC were used as organisers for many subject gateways but, by 2004, Broughton was claiming that they had ‘largely been superseded in this role’. Various reasons can account for this: lack of funding; difficulty of keeping up with the growth of the Web; change to another form of indexing; collaboration and merging of subject gateways and portals; and so on.
In the previous edition of this present work, for instance, NISS (National Information Services and Systems) was cited as an example of a service which used UDC as a subject gateway to its Directory of Networked Resources, but NISS became part of HERO (Higher Education & Resource Opportunities) and ceased using UDC in 2003.

However, there are still a number of services that use traditional schemes. The Dewey Decimal Classification is used by ‘Webrary’ and ‘BUBL’. The Library of Congress Classification is used by ‘Cyberstacks’, and the ‘Scout Report Archives’ uses Library of Congress Subject Headings.

Webrary6 is a service provided by the Morton Grove [US] Public Library. The Webrary Links Menu consists of links to what are claimed to be the most useful reference and informational Web sites, organised by Dewey class numbers. Selecting one of the ten main classes of Dewey provides a breakdown for that class, e.g. for class 000:

000 Computers
010 Bibliography
020 Library Science
etc.

Upon selecting 020, two of the first relevant sites named are those of the ‘American Library Association’ and the ‘Library of Congress Information System’.

The BUBL7 Information Service for the library and higher education community also uses the Dewey Decimal Classification as the primary organisational structure for its catalogue of Internet resources. It is run by the Centre for Digital Library Research at the University of Strathclyde. The name was originally short for the Bulletin Board for Libraries. The search method is similar to that described for Webrary. The home page shows the ten main classes of Dewey, from which you select the class that you require. This will give you a breakdown of the divisions in that class, from which you can make a further selection, and so on. For example, one might select 200 Religion, then 220 Bible, and then 220.3 Encyclopedias and topical dictionaries. One of the resources given at this number is the Internet version of Easton’s 1897 Bible dictionary, which defines unusual or religious terms from the scripture.

Cyberstacks (sm)8 is a collection of significant World Wide Web and other Internet resources categorised using the Library of Congress Classification. As with Webrary, resources are categorised firstly within a broad classification, e.g.

G Geography, Anthropology and Recreation
H Social Sciences
J Political Sciences
. . .

then within narrower subclasses and finally under a specific classification range and associated subject description. For example, ‘T Technology’ might be selected from the main menu, then ‘TL Motor vehicles, Aeronautics and Astronautics’ from the sub-classes, and finally ‘TL 787-4050 Astronautics’ from the specific classification range. A relevant resource listed here is ‘NASA Astronaut Biographies’. For each resource, a brief summary is provided and, when necessary, instructions on using the resource. At present, CyberStacks(sm) is a prototype demonstration service and is limited to significant Internet resources in selected fields of science and technology.

The Scout Report Archives,9 located in the Computer Sciences Department of the University of Wisconsin, is a searchable and browsable database of nearly nine years of the Scout Report. Searches may be made by keyword, or a more advanced search may be made in other fields such as title, author, publisher or Library of Congress Subject Headings. If, for instance, ‘D’ is selected from the ‘Browse by Library of Congress Subject Headings’ alphabetical array, a list of headings beginning with that letter is displayed. Among these will be found ‘Dinosaurs (67)’. This indicates that there are 67 resources with subject headings that start with the term ‘Dinosaurs’. Selecting this entry will reveal a list of ‘Resources (19)’ entered directly under ‘Dinosaurs’ and further ‘Classifications’, or subheadings, such as ‘Bibliographies (4) … Databases (1) …’ and ‘Study and Teaching (15)’, relating to other resources. One subheading listed here is ‘Cardiovascular system (2)’ and selecting this will, for example, reveal the site ‘Willo: the dinosaur with a heart’, which refers to the discovery of a fossilised dinosaur heart. At the time of writing, the Scout Report Archives contain over 25,000 critical annotations of carefully selected Internet sites.

In certain subject areas, the use of special rather than general classification schemes or lists of subject headings may be more appropriate. Among the options offered by Intute: Social Sciences,10 for instance, is an online thesaurus. Intute: Social Sciences is a service which aims to provide details of the very best Web resources for education and research in the social sciences.
The thesaurus engine helps to indicate alternative search terms to the one selected, organised into a hierarchy of relationships. For example, searching for ‘Offences’ will produce the following suggested matches:

drinking offences [0]
driving offences [4]
offences [78]
political offences [2]
sexual offences [19]

At this point, one may examine the items indexed under a particular term (the number being indicated in parentheses) or continue to consult the thesaurus. Clicking on ‘sexual offences’, for instance, will give the following:

sexual offences [16]
Broader terms / Narrower terms / Related terms:
offences [65], pornography [15], prostitution [38], rape [44], sexual behaviour [21], sexual harassment [29], gender [730], child sexual abuse [30], behavioural disorders [5]

Again, one has the option of examining the items indexed or continuing to consult the thesaurus. If, at this point, it is decided that ‘behavioural disorders’ is the term required, selecting this will reveal the five items. These include, for instance, details of the Website of the Association for the Advancement of Behavioural Therapy.

It should be noted that services such as Webrary, BUBL and the Scout Report Archives offer other facilities such as searching by keywords. However, as stated above, these services are far less exhaustive than commercial search engines. They are more properly described as search directories or catalogues rather than search engines. A search engine attempts to seek out as many as possible of the Web pages that match a particular search criterion. A directory is more selective in that it is limited to its own database; sites are usually categorised and therefore a directory is more likely to utilise hierarchically organised and cross-referenced structures. The difference in coverage can be illustrated by the fact that the 38 ‘hits’ for ‘prostitution’ on the Intute: Social Sciences service compare with over 16 million on Google!

Online Public Access Catalogues (OPACs)

As described in the previous chapter, most libraries now have online public access to their catalogues. Many of these catalogues are available on the Internet. The University of Liverpool catalogue12 (see pages 119-120) and the catalogue of the Library of the African Studies Centre Leiden13 (see pages 121-2) are two examples.
Thus it is possible to sit at home and use one’s computer to search a great number of systems for citations and availability of required documents. In many instances, classification plays a central role in this search process. Some institutions provide merely a version of the in-house system. Other institutions are more innovative. The London Business School, for example, provides ‘Concept space’,13 a visual search tool for business concepts linked to a wide range of information sources. This is a ‘point and click’ system based upon the London Classification of Business Studies (see pages 31-3, 104-5 and 114). It is possible to navigate through the system using a ‘graphic’ view or a ‘text’ view. Colour coding is used in both views, e.g.:

red = entry term; green = broader term; blue = narrower term; purple = related term.

Having found a relevant term in either view, the user can click on the ‘Search related sites’ link in order to obtain information on where to find books, articles, academic information, related companies and more.

Another institution, the North Carolina State University,14 uses a ‘Guided navigation’ system, designed by Endeca, which it is claimed ‘lets you search its libraries’ collections using faceted classification’.15 However, this is not faceted classification as described in this text. Endeca itself states that it only ‘resembles’ what librarians call faceted classification.16 The ‘facets’ include ‘topic’, ‘author’, ‘genre’, ‘language’, ‘format’, ‘material type’ and ‘availability’. The ‘topic’ facet does allow a search to be narrowed to a more relevant subject area but use is then made of the far from ‘faceted’ Library of Congress Subject Headings. For example, the topics revealed by a search for ‘Roses’ include: ‘African Americans’, ‘Architecture’, ‘Awards’, ‘Catholic Church’, ‘Children’, ‘Children’s poetry’, and so on. Nevertheless, this is a sophisticated system which is said to provide ‘the speed and flexibility of popular online search engines while capitalizing on existing catalog records’.17

Classification of electronic documents

MacLennan (2000) considers that if Internet resources were adequately classified there seems every probability that schemes such as Dewey and Library of Congress could provide adequate access. We have seen how these schemes can be used for directories and catalogues, but is it feasible that all electronic documents carried on the Internet could be classified in the same way as the items in a conventional library are classified, thus permitting a particular class number to be used by a search engine as the search criterion? In order to achieve this, the relevant classification information would need to be carried within the document itself.
A metadata (data about data) standard that could be used for this purpose is the Dublin Core.

The Dublin Core

The Dublin Core Metadata Element Set (DCMES) is a system that facilitates the inclusion of tagged description and identification within Web documents. It provides a simple and standardised set of conventions for describing documents stored online in a way that people can understand, in order to make them easier to find. The basic Dublin Core metadata consists of fifteen ‘properties’ which comprise a standard data set for describing electronic documents. The name ‘Dublin’ derives from its origin at a 1995 invitational workshop in Dublin, Ohio. It is maintained by the Dublin Core Metadata Initiative (DCMI). The term ‘core’ is used because its elements are broad and generic, usable for describing a wide range of resources. These elements include the ‘creator’, ‘title’, ‘date’, ‘format’ and so on. One of the fifteen elements is the ‘subject’ and, typically, this will be represented using keywords, key phrases, or classification codes. It is the last of these that is of interest here, as it would enable classification numbers from a scheme such as Dewey or Library of Congress to be added to a document, which could then be retrieved with a classification number search. Whilst this is therefore theoretically possible, the problem is one of number: there are many, many millions of electronic documents on the Internet and the logistics are daunting.

In relation to the Dublin Core, readers might well come across a variety of terms and abbreviations. For example, implementations of the Dublin Core typically make use of XML (eXtensible Markup Language) and also RDF (Resource Description Framework). XML is a specification for computer-based documents; it provides a syntax for describing how data is structured rather than what the data means. RDF is used to represent data; it provides a standard model to describe Web resources. More information is provided about XML and RDF in the section on the Semantic Web (page 140).

Classification schemes as aids to searching

Despite the difficulty of including classification numbers in documents, there is the possibility of using a general classification scheme such as Dewey or Library of Congress as an aid to searching. There are machine-readable versions of Dewey and Library of Congress which could assist here and, if the complete scheme was considered to be too detailed, outlines of these schemes are available.
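Returning to the Dublin Core ‘subject’ element discussed above, the idea of carrying a class number inside a document and then searching on it can be sketched as follows. This is a minimal illustration only, not a description of any existing search service. It assumes the common convention of embedding Dublin Core in HTML as `<meta name="DC.subject" ...>` elements with a `scheme` attribute naming the classification; the label `DDC`, the two sample documents and the helper names `subjects_of` and `find_by_class` are all hypothetical.

```python
from html.parser import HTMLParser


class SubjectExtractor(HTMLParser):
    """Collect (scheme, value) pairs from DC.subject meta elements."""

    def __init__(self):
        super().__init__()
        self.subjects = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "dc.subject":
            self.subjects.append(
                (attributes.get("scheme"), attributes.get("content"))
            )


def subjects_of(html):
    """Return the subject entries embedded in one HTML document."""
    parser = SubjectExtractor()
    parser.feed(html)
    return parser.subjects


def find_by_class(documents, scheme, prefix):
    """Return ids of documents whose class number in `scheme` starts with `prefix`."""
    return [
        doc_id
        for doc_id, html in documents.items()
        if any(
            s == scheme and (value or "").startswith(prefix)
            for s, value in subjects_of(html)
        )
    ]


# Hypothetical documents; the class numbers 220.3 and 020 are taken from
# the Dewey examples earlier in this chapter.
DOCUMENTS = {
    "bible-dictionary": (
        '<html><head><meta name="DC.title" content="A Bible dictionary">'
        '<meta name="DC.subject" scheme="DDC" content="220.3"></head></html>'
    ),
    "library-association": (
        '<html><head><meta name="DC.title" content="A library association">'
        '<meta name="DC.subject" scheme="DDC" content="020"></head></html>'
    ),
}

print(find_by_class(DOCUMENTS, "DDC", "22"))
```

Because a Dewey number expresses hierarchy in its digits, searching on the prefix ‘22’ retrieves everything classed at 220 and below, which is the kind of class number search envisaged in the text.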
Schemes are also available online on the Web. ‘WebDewey’18 is provided by OCLC; ‘UDConline’19 by the British Standards Institution; and ‘Classification Web’20 (giving access to the Library of Congress Classification and Library of Congress Subject Headings) by the Library of Congress Cataloging Distribution Service. Special systems such as the ACM Computing Classification21 (see pages 48-9) are also available on the Web.

Both Dewey and Library of Congress, as the reader now knows, are enumerative schemes. One faceted classification based on modern theory which has been advocated for use as an aid to searching is the Broad System of Ordering (BSO) (see pages 33-5). The following description attempts to explain how the scheme might be used: For an Internet searcher who has exhausted keyword and hypertext link search

BSO can also be consulted online.23

Automatic classification

As noted above, it is clearly not feasible for a classifier to sit down and classify all of the electronic documents that appear on the Internet in the same way as this might be done in a library or information service. However, is it possible to remove the human intellectual effort and allow the computer to assign documents to appropriate subject categories automatically? Is there a role for data obtained automatically from source documents to be used in the classification process? Is it possible to identify automatically the characteristics that a document should have in order to place it in a particular category or class?

Whilst there do not appear to be any practical, ‘real-life’ examples of this at present, research into automatic classification has been going on for a considerable length of time. This research has assumed a greater importance as the need to improve access to Internet resources has intensified. Internet users can often be irritated and frustrated by a search engine’s tendency to produce a plethora of results from which relevant information has to be sifted.

A number of projects have attempted to examine whether automatic classification can be used to improve this situation. Various algorithms and sampling techniques have been used in investigations aimed at discovering whether automatic classification is a practical proposition. One method, for instance, entails the statistical analysis of the way in which terms co-occur in documents. ‘Documents that share the same frequently occurring keywords and concepts are usually relevant to the same queries. Clustering such documents together enables them to be retrieved more easily and helps to avoid the retrieval of irrelevant information’ (Jenkins, 2001). There is also the possibility of a statistical comparison of citations. Documents that cite a similar set of sources, or documents that are cited by a similar set of documents, are clearly subject-related.

In the United States, the OCLC Office of Research has undertaken research into whether standard library classification schemes can be adapted to classify materials automatically, especially Web resources and other digitised electronic documents.24 OCLC has also investigated whether there is a role for indexes and topic maps (see also page 141) that are obtained directly from source documents. The OCLC Scorpion project25 offers software that implements a system for automatically classifying Web-accessible text documents. Scorpion is intended for use by investigators who have a machine-readable subject classification scheme (such as the Library of Congress Classification) or thesaurus and wish to incorporate it into an automatic classification system. There is also the OCLC RDF Topicmaps Project,26 which explores subject navigation of Web sites using semi-automatically generated finding aids.

Joanna Yi-Hang Pong and others (2008) describe a comparative study of automatic document classification methods using two well-known machine learning algorithms, k-nearest neighbour (KNN) and naïve Bayes, applied to the Library of Congress Classification. A full explanation of these and the other research projects in this area is outside the scope of an elementary text of this nature, but there seems little doubt that automatic classification can have a part to play in the provision of Internet access.
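The general idea behind approaches of this kind, that documents sharing the same terms tend to belong to the same class, can be sketched in a few lines. The following is a toy k-nearest neighbour classifier over word counts; it is an illustration under stated assumptions and is not the method used by Scorpion or by Pong and others. The training texts are invented, the labels ‘TL’ and ‘Z’ are Library of Congress subclass letters chosen for the example, and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Count the lower-cased alphabetic words in a text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def classify(text, training, k=3):
    """Assign the class held by most of the k most similar training texts."""
    query = term_vector(text)
    ranked = sorted(
        training,
        key=lambda item: cosine(query, term_vector(item[0])),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Invented training data: a short description and an assigned class.
TRAINING = [
    ("astronaut biographies and space flight missions", "TL"),
    ("rocket propulsion and orbital space flight", "TL"),
    ("spacecraft design for lunar missions", "TL"),
    ("library cataloguing and classification schemes", "Z"),
    ("subject headings and library catalogues", "Z"),
    ("bibliography and indexing for library collections", "Z"),
]

print(classify("biographies of astronauts on space missions", TRAINING))
print(classify("classification schemes used in library catalogues", TRAINING))
```

A practical system would need far more training data, stemming (so that ‘astronaut’ and ‘astronauts’ match), and weighting to discount very common words; the sketch shows only the shape of the technique.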
Nevertheless, it is interesting to note that, at an ISKO UK (International Society for Knowledge Organization) conference in 2008, in a linked presentation on the implementation of autocategorisation by three British Broadcasting Corporation (BBC) information architects, one of the conclusions was that ‘you still need humans’! There is also the view that no single classificatory technique, whether manual or automatic, outperforms other classificatory methods in all situations, and perhaps there is a need to provide different approaches which can be used simultaneously, with the results combined so that users can achieve an optimum, satisfactory outcome (Blumberg and Atre, 2003).

E-commerce

The buying and selling of products and services over the Internet and other computer networks has become known as ‘electronic commerce’ or ‘e-commerce’ or ‘e-business’. We have already seen that classification has a significant part to play in this activity. Examples can be found in Chapter 6 in the classifications used by the book trade and in Chapter 9 in the codification system used by NATO.

Hierarchical classification

We saw in Chapter 1 how classification must be used in a supermarket in order to aid the shopper in the selection of goods. Although not quite so essential for the Internet shopper, some form of categorisation is nevertheless very helpful.

Ebay27

Currently one of the most used buying and selling websites is Ebay. Searching can be done via a straight keyword search or by categories. If one is interested in, for example, ‘Antique ceramic boxes’, a search directly under those terms will reveal the number of items in all categories. Alternatively, one can select an appropriate category from the many listed on the Ebay home page:

Antiques
Art
Baby
Books
etc.

Selecting ‘Antiques’ will reveal a number of further categories, one of which is:

Decorative Arts

Within ‘Decorative Arts’, divisions include:

Ceramics, Porcelain
Clocks
Glass
Lamps
etc.

and within ‘Ceramics, Porcelain’ we find:

Bowls (638)
Boxes (176)
Creamers, Sugar Bowls (565)
Cups, Saucers (1933)
etc.

The numbers (which are constantly changing as items are offered or sold) refer to the number of items available. Clicking on ‘Boxes’ will enable the searcher to browse through the large number of boxes. The searcher is being guided through the hierarchical structure:

Antiques
    Decorative Arts
        Ceramics, Porcelain
            Boxes
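The category path followed in this search can be modelled as a simple nested structure. The sketch below is illustrative only: it uses the category names quoted above, omits the item counts (which change constantly), and makes no claim about how Ebay itself stores its categories. The function names `options` and `find_paths` are hypothetical.

```python
# Each category maps to a dictionary of its subcategories.
CATEGORIES = {
    "Antiques": {
        "Decorative Arts": {
            "Ceramics, Porcelain": {
                "Bowls": {},
                "Boxes": {},
                "Creamers, Sugar Bowls": {},
                "Cups, Saucers": {},
            },
            "Clocks": {},
            "Glass": {},
            "Lamps": {},
        },
    },
    "Art": {},
    "Baby": {},
    "Books": {},
}


def options(tree, path):
    """List the subcategories offered after following `path` from the top."""
    node = tree
    for name in path:
        node = node[name]
    return list(node)


def find_paths(tree, term, prefix=()):
    """Yield the full path to every category whose name contains `term`."""
    for name, children in tree.items():
        path = prefix + (name,)
        if term.lower() in name.lower():
            yield path
        yield from find_paths(children, term, path)


print(options(CATEGORIES, ["Antiques", "Decorative Arts"]))
print(" > ".join(next(find_paths(CATEGORIES, "boxes"))))
```

The first call mirrors browsing by category, one level at a time; the second mirrors a keyword search that reports where in the hierarchy a matching category sits.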

A great many retail websites offer a similar facility. If one is interested in computing, for example, the website of the particular retail dealer chosen may well list various categories, e.g.

Computing equipment
Desktop PCs
Laptops
Monitors
Peripherals
Printers
etc.

As with any hierarchical methodology, the selection of a particular category will reveal a further set of options.

UNSPSC

Today, ‘increasingly business is e-business’ and ‘fast, simple, accurate classification of goods and services is imperative in the marketplace’ (UNSPSC, 2009). In order to achieve this aim, the United Nations Development Programme and the Dun & Bradstreet Corporation jointly developed UNSPSC (United Nations Standard Products and Services Code), which can be used to classify all types of products and services. It is claimed to be ‘the most efficient, accurate and flexible classification system available today for achieving company-wide visibility of spend analysis, enabling procurement to deliver on cost-effectiveness demands and allowing full exploitation of electronic commerce capabilities’ (UNSPSC, 2009). Microsoft has selected UNSPSC as its standard commodity classification system (Turner, 200?).
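The text above does not describe the internal layout of a UNSPSC code. The sketch below assumes the commonly documented form: an eight-digit number made up of four two-digit levels (segment, family, class and commodity), each level being written out in full by padding with zeros. The sample code 43211503 is used purely for illustration, and the function name `split_unspsc` is hypothetical.

```python
import re

# Assumed level names, from broadest to most specific.
LEVELS = ("segment", "family", "class", "commodity")


def split_unspsc(code):
    """Return the code for each level of an eight-digit UNSPSC-style number."""
    if not re.fullmatch(r"\d{8}", code):
        raise ValueError("expected exactly eight digits, got %r" % code)
    return {
        level: code[: 2 * (depth + 1)].ljust(8, "0")
        for depth, level in enumerate(LEVELS)
    }


for level, value in split_unspsc("43211503").items():
    print(level, value)
```

As with a Dewey number, the notation itself expresses the hierarchy: truncating the code at any level gives the broader category to which the item belongs, which is what makes such a scheme convenient for summarising spending at different levels of detail.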