ABSTRACT

Regression commonly refers to the process of developing an empirical (data-driven) model to predict and/or explain one or more attributes in a database or data set. It is most frequently associated with the simple linear model (y = mx + b) taught in most introductory statistics courses, but the same ideas have been extended in many directions, including classification problems. When the emphasis is on hypothesis testing and simple models, the regression output is typically a small number of parameters that provide a direct link from the input variables to the predicted variable (or class). In other situations the emphasis is on explaining as much of the variability in the output variables as is "reasonable" given the input variables. For this purpose there are a number of "advanced" techniques, such as smoothing splines, decision trees, and neural nets, each with many "free" parameters; the meaning of any one of these parameters can be obscure. Many Data Mining techniques are, at their core, variations on well-known regression techniques.
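
For concreteness, the simple case mentioned above can be sketched in a few lines. The following Python snippet (using synthetic data invented for illustration, not data from this paper) fits y = mx + b by ordinary least squares, yielding exactly two directly interpretable parameters, in contrast to the many free parameters of the "advanced" techniques.

```python
import numpy as np

# Synthetic (hypothetical) data: inputs x and noisy outputs y,
# generated from a known line so the fit can be checked by eye.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.5 * x + 1.0 + rng.normal(scale=1.5, size=50)

# Ordinary least squares for y = m*x + b:
#   m = cov(x, y) / var(x),   b = mean(y) - m * mean(x)
m = np.cov(x, y, bias=True)[0, 1] / np.var(x)
b = y.mean() - m * x.mean()

print(f"slope m = {m:.3f}, intercept b = {b:.3f}")
```

Here the entire fitted model is the pair (m, b), which gives the direct input-to-output linkage the abstract describes for simple regression models.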