Establishing a corpus of the archived web: The case of the Danish web from 2005 to 2015

doi:10.4324/9781315231662-9

ABSTRACT

This chapter investigates how a corpus to support the historical study of a national web can be established within a national web archive. The point of departure is that a national web archive usually holds several versions of the same web entity, so a corpus has to be established from this comprehensive collection. The chapter discusses how different approaches to doing so affect the resulting corpus. Using different datasets obtained from the Danish national web archive for 2005–2015, and different ways of handling them, it shows that the results differ significantly. Finally, it discusses the possible implications of these differences for research.