ABSTRACT

With the tidymodels framework, you can use classification modeling to predict labels or categorical variables from a data set, including data sets that contain text. The goal of predictive modeling with text input features and a categorical outcome is to learn and model the relationship between those input features, typically created through the steps outlined in Chapters 1 through 5, and the class label or categorical outcome. Naive Bayes models can perform well with text data because each feature is handled independently, so large numbers of features remain computationally feasible. This matters because bag-of-words text models can involve thousands of tokens. We also saw that regularized linear models, such as lasso, often work well for text data sets and have the benefit of interpretable variable importance. Your own domain knowledge about your text data is valuable, and using that knowledge to carefully engineer custom features can improve your model in some cases. Performance metrics appropriate for classification models differ from those for regression, and can be used to compare, tune, and choose models.