ABSTRACT

The chapter presents a number of digital technicalities that directly bear on the processing of digital textual data. Starting from considerations and practical examples on how to set up a working environment in which the techniques presented in the volume can be replicated, it then introduces the three most widely adopted open formats for data processing. These are the formats employed throughout the volume and represent a subset of those that are standard in the digital and data sciences. The chapter then outlines a set of practices and tools – mostly borrowed from digital archiving efforts – aimed at narrowing the impact of the inherently subjective process of data preparation by offering ways in which researchers can document and keep track of the changes made to the initial ‘raw data’. Documenting these changes and giving other researchers access to ‘what is data’ does not diminish subjectivity but makes it accountable, and consequently both debatable and open to further contributions. Last, two major concepts surrounding the processing of digital textual data (character encodings and regular expressions) are presented: these are arguably the most relevant factors in the treatment of digital textual data, directly affecting the scientific validity of a corpus and of its resulting analyses.