ABSTRACT

Text analysis in the digital humanities is challenged by legal hurdles, which make it difficult to access and especially to redistribute datasets of modern texts. As large digitisation projects grow, copyright challenges become increasingly acute. We discuss the legal landscape around large bibliographic datasets and explore the principles of non-expressive and non-consumptive access as one way to enable research access to sensitive texts. Non-consumptive access seeks to make text available in an abstracted but maximally useful form, supporting research use without distributing the original, readable text. The HathiTrust Research Center is presented as a case study of these principles. Devoted to scholarly access to the 17 million works of the HathiTrust, the Research Center has been enabling access through feature datasets, high-level visualisation tools, an in-browser analysis suite and a secure virtual machine environment. Each of these approaches has different strengths and challenges, and we consider how each may be instructive for the future of research over sensitive text.