ABSTRACT

The executive director of the American Historical Association, James Grossman (2012), has defined big data as “the zillions of pieces of information that traverse the internet.” He suggests that “there is something here for historians. Maybe even something Big [sic].” However, historians have been slow to realize the full potential of big data. There are a number of reasons. First, in some countries such as Britain, a lot of the big data is provided by commercial publishers, and thus, access requires an individual or institutional subscription/payment. Second, the big data website search engines rely on optical character recognition (OCR), the accuracy of which may be considerably less than 100%, as in the case of pre-twentieth-century newspapers. There is a tradeoff between the number of scanned pages and the accuracy of OCR. Third, many historians are unaware of the full extent of the big data available. As Zhang, Liu, and Matthews (2015) suggest, there is a disconnect between the library and information services professionals who are developing the digital humanities resources and the humanities academics who are the intended user community. Fourth, historians also face the daunting volume of digitized historical data. Rosenzweig (2011, 23) observes, “Surely, the injunction of traditional historians to look at ‘everything’ cannot survive in a digital era in which ‘everything’ has survived.” Fifth, using big data requires historians to adopt a different research methodology.
Rosenzweig and Cohen (2011, 31) observe that historians are accustomed to analyzing discrete historical sources with great care, whereas “[c]omputer scientists specialize in areas such as ‘data mining’ (finding statistical trends), ‘information retrieval’ (extracting specific bits of text or data), and ‘reputational systems’ (determining reliable documents or actors), all of which presuppose large corpuses on which to subject algorithms.” Sixth, it would also appear that some historians regard digital sources as being of inferior value. Jonathan Blaney (2016), a librarian at the University of London’s Institute of Historical Research (IHR), has observed that historians using British History Online (2016), the IHR’s digital library of key printed primary and secondary sources for the history of Britain and Ireland, often contact him for the page numbers of the original paper books, articles, or documents. They are unwilling to cite the web addresses of the digitized versions of the sources in their publications. Rosenzweig and Cohen (2011, 31) note that, with some exceptions, historians have not adopted the research methodology of computer scientists to engage with big data. Historians generally prefer to apply their own minds to analyze data rather than use digital technology to do it for them. The New Economic Historians from the predigitization era are among the exceptions. This school of economic history produced some noteworthy research findings such as Fogel and Engerman’s (1974) Time on the Cross, which argued that the Ante-Bellum American South had a growing rather than stagnating economy and that slave agriculture was not inefficient compared with free agriculture.
Seventh, Zhang, Liu, and Matthews (2015, 366) also note that “The unit of currency in DH [Digital Humanities] is not necessarily an article or a book, but rather a project, which is usually published using an open web platform, allowing users to dynamically interact with underlying data.” In countries where research funding privileges traditional articles and books, this may result in historians eschewing full engagement with big data.