ABSTRACT

You can use regression modeling with the tidymodels framework to predict a continuous variable from a data set, including a text data set. The goal of predictive modeling with text input features and a continuous outcome is to learn the relationship between those features and the numeric outcome. Linear support vector machine (SVM) models often work well for text data sets, while tree-based models such as random forests often perform poorly in practice. Linear SVMs also tend to work well without much hyperparameter tuning and are directly interpretable. Many preprocessing steps for text data may improve your model, including stop word removal, n-gram tokenization, feature hashing, and lemmatization. Resampling and careful use of metrics allow you to make good choices among these options, given your own concerns and priorities.