ABSTRACT

Text data typically contains different versions of one base word, often called a stem. Stemming is the practice of identifying and extracting the base, or stem, of a word using rules and heuristics; it draws on morphology, the subfield of linguistics concerned with how words are formed. We encourage you to think of stemming as a preprocessing step in text modeling, one that must be thought through and chosen (or not) with good judgment. Stemming reduces the sparsity of text data, which can be helpful when training models, but at the cost of throwing information away. Typical stemming algorithms are somewhat aggressive and have been built to favor sensitivity (or recall, the true positive rate) at the expense of specificity (the true negative rate) and precision. Lemmatization is another way to normalize words to a root, based on language structure and how words are used in their context.
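
As a minimal sketch of what rule-based stemming looks like, the Python snippet below applies three plural-stripping rules in the spirit of simple "S-stemmer" heuristics. The function name `toy_stem`, the exact form of the rules, and the token list are illustrative assumptions, not a published algorithm or any library's API.

```python
def toy_stem(word: str) -> str:
    """Strip a few common English plural suffixes (illustrative rules only)."""
    w = word.lower()
    # Rule 1: "ies" -> "y", unless the word ends in "eies" or "aies"
    if w.endswith("ies") and not w.endswith(("eies", "aies")):
        return w[:-3] + "y"
    # Rule 2: "es" -> "e", unless the word ends in "aes", "ees", or "oes"
    if w.endswith("es") and not w.endswith(("aes", "ees", "oes")):
        return w[:-1]
    # Rule 3: drop a final "s", unless the word ends in "us" or "ss"
    if w.endswith("s") and not w.endswith(("us", "ss")):
        return w[:-1]
    return w


# Hypothetical tokens chosen to show both the benefit and the cost.
tokens = ["tree", "trees", "pony", "ponies", "glass", "bus", "news"]
stems = [toy_stem(t) for t in tokens]

# Fewer distinct strings after stemming means a less sparse vocabulary.
vocab_before = len(set(tokens))
vocab_after = len(set(stems))
print(list(zip(tokens, stems)))
print(vocab_before, vocab_after)
```

The vocabulary shrinks because "trees" and "ponies" collapse onto "tree" and "pony", which is the reduction in sparsity. The same rules also turn "news" into "new", merging two unrelated words, which is the kind of information loss an aggressive stemmer accepts in exchange for higher recall.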