ABSTRACT

Once again, it is assumed that the e’s associated with each subpopulation are normally distributed with all variances approximately equal. Our best estimate of the true population regression line is the straight line that we can draw through our sample data. However, if asked to draw this line with a straight edge, it is unlikely that any two people, relying on visual inspection, would draw exactly the same line of best fit through these points. A variety of slopes and intercepts could therefore be approximated. There are in fact an infinite number of possible lines, y = a + bx, that could be drawn through our data points. How can we select the “best” line from all the possible lines that can pass through these data points?

The least-squares line is the line that best describes the linear relationship between the independent and dependent variables. On a scatter diagram, the data points are usually scattered on either side of this best-fitting straight line. Also called the regression line, it is the line for which the sum of squared differences between the observed (x, y_i) coordinates and the corresponding (x, y_c) coordinates on the line, measured along the y-axis, is smallest (the sum of the squared vertical deviations). In other words, of all possible lines, this “best fit” line has the smallest sum of squared vertical distances from the points in the scatter diagram to the line. The calculation of the line that best fits the sample data is presented below. The slope of this line (Eq. 13.7), in its usual computational form, is:

$$b = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{n\sum x^{2} - \left(\sum x\right)^{2}}$$

Data to solve this equation can be generated in a table similar to the one used for the correlation coefficient (Table 13.4). The sample slope (b) is our best estimate of the true regression coefficient (β) for the population, but as will be discussed later, it is only an estimate. The greater the change in y for a constant change in x, the steeper the slope of the line. With the calculated slope of the line that best fits the observed points in the scatter diagram, it is possible to calculate an “anchor point” on the y-axis (the y-intercept, that point where the x-value is zero) using Eq. 13.7, written here in its usual computational form:

$$a = \frac{\sum y - b\sum x}{n}$$
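The slope and intercept calculations can be sketched in a few lines of Python. This is a minimal illustration, not the text's own procedure: the function name `least_squares_line` is hypothetical, and the formulas are the standard computational forms built from the column sums, which are assumed to correspond to the equations cited above. The data are the eight (x, y) pairs listed in Table 14.1.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x.

    Uses the computational formulas based on column sums
    (assumed to match the equations referenced in the text).
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired observations")

    # Column totals, as they would appear in the bottom row of the table
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))

    denominator = n * sum_x2 - sum_x ** 2
    if denominator == 0:
        raise ValueError("all x-values are identical; slope is undefined")

    b = (n * sum_xy - sum_x * sum_y) / denominator  # slope
    a = (sum_y - b * sum_x) / n                     # y-intercept
    return a, b


# Data from Table 14.1
x = [5, 10, 15, 20, 25, 30, 35, 40]
y = [1.2, 1.9, 3.1, 3.6, 5.3, 5.8, 7.4, 7.7]

a, b = least_squares_line(x, y)
print(f"y = {a:.4f} + {b:.4f}x")  # prints: y = 0.0643 + 0.1971x
```

The intercept is computed from the unrounded slope; rounding b to four decimals first would shift the intercept in the third decimal place.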

An alternative approach to the scatter diagram is to display the information in a table. The regression line can be calculated for the data points in Figure 14.1 by arranging the data in tabular format, as presented in Table 14.1. Similar to the manipulation of data for the correlation coefficient, each x-value and y-value is squared, and the product of the x- and y-values is calculated at each data point. These five columns are then summed to produce Σx, Σy, Σx², Σy², and Σxy. Note that Σy² is not required for determining the regression line, but will be used later in additional calculations required for the linear regression model. Using the results in Table 14.1, the computations for the slope and y-intercept would be as follows:

Table 14.1 Data Manipulation of Regression Line for Figure 14.1

           x       y      x²       y²        xy
           5     1.2      25     1.44      6.00
          10     1.9     100     3.61     19.00
          15     3.1     225     9.61     46.50
          20     3.6     400    12.96     72.00
          25     5.3     625    28.09    132.50
          30     5.8     900    33.64    174.00
          35     7.4    1225    54.76    259.00
          40     7.7    1600    59.29    308.00
Σ =      180    36.0    5100   203.40   1017.00

n = 8
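Substituting the column totals from Table 14.1 into the computational formulas gives the following worked calculation. The formulas are written in their standard computational form, which is assumed to correspond to the slope and intercept equations cited in the text. Results are rounded to four decimal places, and the intercept uses the slope carried to six decimals to avoid rounding error.

```latex
\[
b = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}
         {n\sum x^{2} - \left(\sum x\right)^{2}}
  = \frac{8(1017.00) - (180)(36.0)}{8(5100) - (180)^{2}}
  = \frac{8136 - 6480}{40800 - 32400}
  = \frac{1656}{8400}
  \approx 0.1971
\]

\[
a = \frac{\sum y - b\sum x}{n}
  = \frac{36.0 - (0.197143)(180)}{8}
  = \frac{36.0 - 35.4857}{8}
  \approx 0.0643
\]
```

The fitted line for these data is therefore approximately y = 0.0643 + 0.1971x.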

Figure 14.2 Regression line for two continuous variables (axes: x and y).
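To illustrate why this line is described as the “best fit”, the sketch below compares the sum of squared vertical deviations for the least-squares line with that of a few slightly shifted lines. It is an illustration under stated assumptions: the helper name `sum_squared_vertical_deviations` and the size of the shifts are arbitrary choices, and the slope and intercept are obtained from the standard computational formulas applied to the Table 14.1 data.

```python
def sum_squared_vertical_deviations(xs, ys, a, b):
    """Sum of squared vertical distances from each point to y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Data from Table 14.1
x = [5, 10, 15, 20, 25, 30, 35, 40]
y = [1.2, 1.9, 3.1, 3.6, 5.3, 5.8, 7.4, 7.7]
n = len(x)

# Least-squares slope and intercept from the column sums
sum_x, sum_y = sum(x), sum(y)
sum_x2 = sum(xi * xi for xi in x)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = (sum_y - b * sum_x) / n

best = sum_squared_vertical_deviations(x, y, a, b)
print(f"least-squares line: a={a:.4f}, b={b:.4f}, sum of squares={best:.4f}")

# Hypothetical alternative lines: nudge the intercept or the slope a little
candidates = [(a + 0.1, b), (a - 0.1, b), (a, b + 0.01), (a, b - 0.01)]
for a_alt, b_alt in candidates:
    alt = sum_squared_vertical_deviations(x, y, a_alt, b_alt)
    print(f"a={a_alt:.4f}, b={b_alt:.4f}, sum of squares={alt:.4f}")
```

Every shifted line yields a larger sum of squares than the least-squares line, which is the sense in which the regression line minimizes the squared vertical deviations.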