ABSTRACT

Fields that hold a constant value across all cases, whether continuous (e.g., zip code) or categorical (e.g., state code), should not be included in any predictive or descriptive data mining model, since a value that is identical for every case cannot help to discriminate between or group individual cases. At the other extreme, customer information that is unique to each case, such as phone numbers and social security numbers, should also be excluded from predictive data mining. However, these unique-value variables can be used as ID variables to identify individual cases and to exclude extreme outliers. Also, avoid including highly correlated continuous predictor variables (correlation coefficient > 0.95) in predictive modeling, since they can produce unstable models that fit only the sample used to build them.
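
The three screening rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used in this paper: the function names (`pearson`, `screen_variables`), the column names, and the sample values are all hypothetical; string columns are treated as categorical and numeric columns as continuous; a string column with a distinct value in every row is treated as ID-like; and the 0.95 cutoff is applied to the absolute value of the Pearson correlation coefficient.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def screen_variables(data, corr_threshold=0.95):
    """Screen columns of `data` (dict: column name -> list of values).

    Assumption: str values mark a categorical column, int/float values
    mark a continuous one. Returns a dict with three lists:
      constant         - columns with a single distinct value
      id_like          - string columns with a distinct value in every row
      correlated_pairs - (col_a, col_b, r) for numeric pairs with |r| > threshold
    """
    n_rows = len(next(iter(data.values())))
    constant, id_like, numeric = [], [], []
    for name, values in data.items():
        n_distinct = len(set(values))
        is_text = all(isinstance(v, str) for v in values)
        if n_distinct == 1:
            constant.append(name)      # same value in every case
        elif is_text and n_distinct == n_rows:
            id_like.append(name)       # keep as an ID variable, not a predictor
        elif not is_text:
            numeric.append(name)       # candidate continuous predictor

    correlated_pairs = []
    for a, b in combinations(numeric, 2):
        r = pearson(data[a], data[b])
        if abs(r) > corr_threshold:
            correlated_pairs.append((a, b, r))

    return {
        "constant": constant,
        "id_like": id_like,
        "correlated_pairs": correlated_pairs,
    }


# Hypothetical sample: every case is from the same state, and height is
# recorded twice in different units.
data = {
    "ssn": ["001", "002", "003", "004", "005"],
    "state": ["NV", "NV", "NV", "NV", "NV"],
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "income": [40.0, 85.0, 30.0, 62.0, 51.0],
}

result = screen_variables(data)
print(result["constant"])   # ['state']
print(result["id_like"])    # ['ssn']
print([(a, b) for a, b, _ in result["correlated_pairs"]])
```

In this sample, `state` is flagged as constant, `ssn` as ID-like, and the two height columns as a highly correlated pair, of which only one would normally be kept as a predictor. Which member of a correlated pair to drop is a modeling decision that this sketch does not attempt to make.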