ABSTRACT

Text often needs to be preprocessed before it can be analyzed or used to build models. Text preprocessing includes cleaning the data to remove unwanted elements. Techniques include lowercasing, removing commonly occurring words that convey little meaning, segmenting sentences into words, and many more. Python code is shared for each preprocessing technique, and a full end-to-end example shows which preprocessing techniques can be used for a specific scenario.

Once the data is clean, it can be visualized, converted to numerical representations for building models, or both. Several numerical representations of text are discussed and implemented in Python, including word frequency-based vectors, word embedding models, and others.

Data visualization tools are discussed along with Python implementations.

Finally, data augmentation techniques for artificially synthesizing text samples are discussed.