ABSTRACT

In this section we shall give some of the classical formulae in correlation and regression analysis that will be useful in the following. For simplicity we shall first consider the case of two variables.

Suppose that we have made a large number of observations, each observation being characterized by the magnitude of two quantitative attributes X1 and X2. Thus each of our observations consists of two measurements, namely, the measurement of X1 and the measurement of X2. Any number of examples may be given. Our observation may, for instance, consist of the measurements of X1 = height and X2 = weight of a recruit. The measurement of the first recruit may be designated (1X1, 1X2), the measurement of the second recruit may be designated (2X1, 2X2), and so on. Again X1 and X2 may be represented, say, by the quantity traded of a commodity and the price of this same commodity. Or X1 may represent pig iron production and X2 represent the interest rate, and so on.

If we pick out any of the observations we shall generally denote it by the ‘frontscript’ t, thus tX1, tX2. The last of the observations we shall designate by the frontscript N, that is to say, the number of observations is equal to N. If the two variables considered represent magnitudes varying in time, then the variable t in our notation would actually designate time. But if the variables considered are not time series but variables in any statistical distribution, then t would only be a number designating the particular observation considered. And in this case the succession of the various observations will as a rule be unessential for the problem at hand, while in the first case, namely, when t actually designates time, the succession of the observations might be essential.

The result of such a set of observations may be represented graphically by a scatter diagram. This is constructed in the following way: we draw a system of axes (X1, X2).
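The frontscript notation can be made concrete with a small sketch. The figures below are invented purely for illustration (they are not data from the text), and the helper name `observation` is likewise hypothetical. The sketch stores N observations as pairs (tX1, tX2) and looks them up by the frontscript t, counted from 1 to N as in the text.

```python
# Hypothetical measurements: X1 = height (cm), X2 = weight (kg) of five recruits.
# These numbers are made up for illustration only.
observations = [
    (170.0, 65.0),  # (1X1, 1X2)
    (175.0, 72.0),  # (2X1, 2X2)
    (168.0, 60.0),  # (3X1, 3X2)
    (182.0, 80.0),  # (4X1, 4X2)
    (177.0, 74.0),  # (NX1, NX2), with N = 5
]

# The number of observations, N in the text.
N = len(observations)


def observation(t):
    """Return the pair (tX1, tX2) for frontscript t, where t runs from 1 to N."""
    if not 1 <= t <= N:
        raise IndexError(f"frontscript t must lie in 1..{N}, got {t}")
    # Python lists are indexed from 0, the frontscript from 1.
    return observations[t - 1]


# The separate series of each variable, in order of observation.
X1 = [x1 for x1, _ in observations]
X2 = [x2 for _, x2 in observations]

first = observation(1)  # the first observation
last = observation(N)   # the last observation
```

Each pair in `observations` corresponds to one point to be marked in the system of axes (X1, X2). When t designates time, the order of the list carries information; otherwise the pairs could be rearranged without loss.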
In this diagram we mark off a point that has the coordinates (1X1, 1X2), that is to say, a point whose abscissa is equal to 1X1 and whose ordinate is equal to 1X2. This point represents the first observation. For shortness we may represent this point itself by the symbol 1X. We shall call it the first observation point. Similarly, the second observation (2X1, 2X2) is represented by a point whose abscissa is equal to 2X1 and whose ordinate is equal to 2X2. This point we designate 2X and we call it the second observation point. Similarly the following observations are represented by a set of points 3X, 4X, ..., NX. If the number of observations is N there will be in all N points (see Figure 3.1).

This set of points will form a ‘swarm’ or a ‘cluster’, the nature of which is characteristic for the observations. All the information obtained by the observations at hand is represented in this scatter diagram. Studying the result of our observations is therefore equivalent to studying the nature of the scatter diagram. The problem of analysing the result of observations may therefore be looked upon as a problem of studying the various characteristics of the scatter diagram.

To give a few examples: one of the essential properties of the scatter diagram is its centre of gravity. Another feature is the density with which the points cluster around this centre. In this regard we may speak of the concentration in the vertical or in the horizontal direction. And another feature is whether or not the cluster is organized in the sense that if we increase one of the variables then we will on the average also increase the other. This latter problem, namely the problem of the organization of the cluster, contains in itself two separate problems, namely:

1 The problem regarding the nature of the relationship which is exhibited by the scatter diagram. That is to say, what is the nature of the underlying trend in the cluster: is it linear, curvilinear, etc.?