ABSTRACT

Texts are generally considered unstructured data. Most machine learning algorithms, however, require data in a structured format in which objects are characterized by feature vectors. This chapter describes the techniques that can be applied to texts in order to derive features and their values, with an emphasis on the classical bag-of-words model. It covers the most common steps: working with different encodings, language identification, tokenization, sentence detection, filtering of stop words and of common and rare words, removing diacritics, normalization (making tokens that look different but have the same meaning look the same), annotation (distinguishing tokens that look the same but have different meanings), and calculating vector values from three components (a local weight, a global weight, and a normalization factor). So that a structured representation, once derived, can be transferred from one system to another, several common formats for structured data exchange are also discussed.
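
As a rough illustration of the three-component vector values mentioned above, the following minimal sketch builds bag-of-words vectors for a toy corpus. The specific choices are assumptions made for illustration only and are not taken from the chapter: raw term frequency as the local weight, log(N/df) as the global weight, and cosine (Euclidean length) normalization. The corpus and the names `docs`, `tokenize`, and `vectorize` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy corpus (an assumption for illustration).
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

def tokenize(text):
    # Simplistic tokenization: lowercase and split on whitespace.
    return text.lower().split()

tokenized = [tokenize(d) for d in docs]

# The vocabulary defines the dimensions of the feature vectors.
vocabulary = sorted({token for doc in tokenized for token in doc})

# Document frequency: in how many documents each token occurs.
n_docs = len(tokenized)
doc_freq = Counter(token for doc in tokenized for token in set(doc))

def vectorize(tokens):
    counts = Counter(tokens)
    # Local weight (assumed): raw term frequency in this document.
    # Global weight (assumed): inverse document frequency, log(N / df).
    weights = [counts[t] * math.log(n_docs / doc_freq[t]) for t in vocabulary]
    # Normalization factor (assumed): Euclidean length of the vector.
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm if norm > 0 else 0.0 for w in weights]

vectors = [vectorize(doc) for doc in tokenized]
```

Each document becomes a vector with one position per vocabulary term, and under these assumptions every non-empty vector has unit length. Other local weights, global weights, or normalization factors can be swapped in without changing the overall structure.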