ABSTRACT

The problem of feature engineering for text data is closely related to the problem of text representation, because the set of features extracted from the data serves as a representation of the original data from a particular perspective. This chapter provides a systematic review of all the major techniques developed in multiple communities over the years for computing a wide range of features from text data. It emphasizes techniques that are relatively general and robust, since such techniques can potentially be applied to text data on any topic and in any natural language. A natural generalization of the string representation of documents is to change the granularity of the sequence from the character/glyph level to the level of individual words. By far the dominant approach to text representation is the "bag of words" representation, which summarizes a document's word sequence by computing a histogram over the words it contains.
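
As a rough illustration of the word-histogram idea, the following minimal Python sketch counts how often each word occurs in a document. The tokenizer (lowercasing and splitting on runs of letters, digits, and apostrophes) and the function names `tokenize` and `bag_of_words` are assumptions made for this example only; the chapter does not prescribe a particular tokenization scheme.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens.

    Assumption for illustration: a word is a run of ASCII letters,
    digits, or apostrophes. Real tokenizers are language dependent.
    """
    return re.findall(r"[a-z0-9']+", text.lower())


def bag_of_words(text):
    """Return a histogram mapping each word to its number of occurrences."""
    return Counter(tokenize(text))


if __name__ == "__main__":
    bow = bag_of_words("The cat sat on the mat.")
    # Print each word with its count, most frequent first.
    for word, count in bow.most_common():
        print(word, count)
```

Because the histogram keeps only counts, two documents that contain the same words in a different order, such as "dog bites man" and "man bites dog", receive identical representations under this sketch.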