Multiple Regression

Multiple regression is one type of complex associational statistical method. We have already done assignments using another complex associational method, Cronbach’s alpha, which, like multiple regression, is based on a correlation matrix of all the variables to be considered in a problem. In addition to multiple regression, two other complex associational analyses, logistic regression and discriminant analysis, will be computed in Chapter 8. Like multiple regression, logistic regression and discriminant analysis have the general purpose of predicting a dependent or criterion variable from several independent or predictor variables. As you can tell from examining Table 6.4, these three techniques for predicting one outcome measure from several independent variables vary in the level of measurement and type of independent variables and/or type of outcome variable.

There are several different ways of computing multiple regression that are used under somewhat different circumstances. We will have you use several of these approaches so that you can see how the method used to compute multiple regression influences the information obtained from the analysis. If the researcher has no prior ideas about which variables will create the best prediction equation and has a reasonably small set of predictors, then simultaneous regression, which SPSS calls Enter, is the best method to use. The hierarchical method is preferable when one has an idea about the order in which to enter the predictors and wants to know how prediction by certain variables improves on prediction by others. Hierarchical regression appropriately corrects for capitalization on chance, whereas stepwise regression, another method available in SPSS in which variables are entered sequentially, does not.
Both simultaneous regression and hierarchical regression require that you specify exactly which variables serve as predictors, and they provide significance levels based on this number of predictors. Sometimes you have a relatively large set of variables that you think may be good predictors of the dependent variable, but you cannot enter such a large set simultaneously without sacrificing the power to find significant results. In such a case, stepwise regression might be used. However, as indicated earlier, stepwise regression capitalizes on chance more than many researchers find acceptable. In essence, stepwise regression computes the correlation of each predictor with the outcome variable and enters the predictor with the largest correlation first. Next, the remaining variables are evaluated to determine which one, when added to the model, will increase R² the most, and that variable is entered. This continues until all the variables have been considered and the highest R² has been found. Finally, the computer considers whether removal of any predictor will increase R². Many researchers do not use stepwise regression because it uses the wrong degrees of freedom, it capitalizes on sampling error, and R² is not always optimized, so we will not demonstrate it here. Many researchers suggest that a better approach is to aggregate correlated predictors, thereby reducing the number of predictors.

Other methods include forward and backward regression. Forward regression adds variables one at a time, at each step entering the variable with the smallest probability of F (i.e., p value); this continues until no remaining variable has a p value of .05 or less. With backward regression, all the variables are entered into the model and then eliminated one by one, with the variable that has the largest probability of F (i.e., p value) removed at each step, until every remaining variable has a p value of .10 or less.
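The forward-selection loop just described can be sketched in Python rather than SPSS. Everything below is illustrative: the fit() callback, the predictor names, and the p values are hypothetical stand-ins for the F tests a statistics package would actually compute.

```python
# A minimal sketch of forward regression: at each step, enter the candidate
# predictor with the smallest p value, stopping when no candidate meets the
# entry criterion. fit() and the predictor names/p values are hypothetical
# stand-ins; a real analysis would take p values from F tests.
def forward_select(candidates, fit, entry_p=0.05):
    selected, remaining = [], list(candidates)
    while remaining:
        # p value for each candidate if it were added to the current model
        trial_p = {x: fit(selected, x) for x in remaining}
        best = min(trial_p, key=trial_p.get)
        if trial_p[best] > entry_p:
            break  # no remaining predictor meets the entry criterion
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical p values standing in for real F tests:
fake_p = {"motivation": 0.001, "grades": 0.030, "gender": 0.200}
def fit(model, candidate):
    return fake_p[candidate]

print(forward_select(list(fake_p), fit))  # -> ['motivation', 'grades']
```

Backward regression is the mirror image: start with every predictor in the model and repeatedly remove the one with the largest p value, until all remaining predictors meet the removal criterion (.10, as noted above).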
Many researchers believe that none of these techniques finds the “best” model and instead use an approach in which all subsets of the variables are analyzed to find the best model. Unfortunately, at this time SPSS does not do this computation. In this chapter we will present how to conduct the simultaneous, hierarchical, forward, and backward regression techniques.

Conditions of Multiple Linear Regression

There are a few important conditions for multiple regression. The dependent or outcome variable should be an interval- or scale-level variable that is normally distributed in the population from which it is drawn. The independent variables should be mostly interval- or scale-level variables, but multiple regression can also have dichotomous independent variables, which are called dummy variables. Dummy variables are often nominal categories that have been given numerical codes, usually 1 and 0. The 0 stands for whatever the 1 is not and is thus said to be “dumb” or silent. Thus, when we use gender as a dummy variable in multiple regression, we’re really coding it as 1 = female and 0 = not female (i.e., male). This gets more complex when there are more than two nominal categories. In that case, we need to convert the multiple-category variable into a set of dichotomous variables indicating presence versus absence of each category. For example, if we were to use the ethnic group variable, we would have to code it into several dichotomous dummy variables such as Euro-American and not Euro-American, African-American and not African-American, and Latino-American and not Latino-American.

A condition that can be extremely problematic is multicollinearity, which can lead to misleading and/or inaccurate results. Multicollinearity (or collinearity) occurs when there are high intercorrelations among some set of the predictor variables. In other words, multicollinearity happens when two or more predictors contain much of the same information.
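The dummy coding described above for a multi-category variable can be sketched in Python; the category labels and sample values here are hypothetical.

```python
# Convert a multi-category nominal variable into 0/1 dummy variables,
# one per category (1 = member of the category, 0 = not).
# Category labels and sample data are hypothetical.
def dummy_code(values, categories):
    return {c: [1 if v == c else 0 for v in values] for c in categories}

categories = ["Euro-American", "African-American", "Latino-American"]
sample = ["Euro-American", "Latino-American", "African-American", "Euro-American"]

dummies = dummy_code(sample, categories)
print(dummies["Euro-American"])    # -> [1, 0, 0, 1]
print(dummies["Latino-American"])  # -> [0, 1, 0, 0]
```

Note that if all k dummies were entered along with an intercept they would be perfectly collinear (each case is a 1 on exactly one of them), so in practice only k − 1 dummies are entered and the omitted category serves as the reference group.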
Although a correlation matrix indicating the intercorrelations among all pairs of predictors is helpful in determining whether multicollinearity is likely to be a problem, it will not always indicate that the condition exists, because multicollinearity may occur when several predictors, taken together, are related to some other predictor or set of predictors. For this reason, it is important to test for multicollinearity when doing multiple regression.

Assumptions of Multiple Linear Regression

There are many assumptions to consider, but we will focus on the major ones that are easily tested with SPSS. The assumptions for multiple regression include the following: the relationship between each of the predictor variables and the dependent variable is linear, and the error, or residual, is normally distributed and uncorrelated with the predictors.

Retrieve your data file: hsbdataNew.sav.
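As a rough illustration of the multicollinearity check discussed above, the pairwise correlation between two predictors, and the variance inflation factor (VIF) it implies in the two-predictor case, can be computed directly. The predictor scores below are hypothetical; in practice SPSS reports collinearity statistics (tolerance and VIF) for you.

```python
import math

# Pearson correlation between two predictors. With only two predictors,
# the R-squared for predicting one from the other is r**2, and the
# variance inflation factor is VIF = 1 / (1 - R-squared).
# The scores below are hypothetical.
def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

math_achievement = [5, 7, 8, 6, 9]
grades_in_hs     = [4, 6, 9, 5, 8]

r = pearson_r(math_achievement, grades_in_hs)
vif = 1 / (1 - r ** 2)
print(round(r, 2), round(vif, 1))  # -> 0.91 6.1
```

Predictors with very high intercorrelations (and hence large VIFs; a common rule of thumb flags VIF above 10, i.e., tolerance below .10) are candidates for aggregation or removal, but as noted above, a clean correlation matrix alone does not guarantee the absence of multicollinearity.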