ABSTRACT

Several Measures of Reliability

This assignment illustrates several of the methods for computing measurement reliability. It is important to assess the reliability of your data prior to conducting inferential statistics; if your reliability tests indicate that your data have low reliability, the results of inferential testing will be suspect. Using existing measures that have already been tested and shown to produce reliable data can help to increase the chances that your new data will be reliable. Regardless, it is important to assess the level of reliability for your data set, particularly if your sample differs in some way from the standardization sample.

Internal consistency reliability for multiple-item scales. In this assignment, we will compute the most commonly used type of internal consistency reliability, Cronbach's coefficient alpha. This measure indicates the consistency of a multiple-item scale. Alpha is typically used when you have several Likert-type items that are summed to make a composite score or summated scale. Alpha is based on the mean or average correlation of each item in the scale with every other item. Alpha is widely used in the social science literature because it provides a measure of reliability that can be obtained from a single testing session or a single administration of a questionnaire. In Problems 3.1, 3.2, and 3.3, you will compute alphas for the three math attitude scales (motivation, competence, and pleasure) that items 1 to 14 were designed to assess.

Reliability for one score/measure. In Problem 3.4, you will compute a correlation coefficient to check the reliability of visualization scores. Several types of reliability can be illustrated by this correlation. If the visualization retest was each participant's score from retaking the test a month or so after they initially took it, then the correlation would be a measure of test-retest reliability. On the other hand, if the visualization retest was a score on an alternative/parallel or equivalent version of the visualization test, then this would be a measure of equivalent forms reliability. Imagine instead that the visualization test and the visualization retest were observer ratings of participants' responses to a behavioral test. Then, if two different raters' scores for this test comprised the visualization test and the retest, the correlation could be used as an index of interrater reliability. This latter type of reliability is needed when behaviors or answers to questions involve some degree of subjective judgment (e.g., when there are open-ended questions or ratings based on observations).

Reliability for nominal variables. In addition to the correlation coefficient, there are several other methods of computing interrater or interobserver reliability. In Problem 3.5, Cohen's kappa is used to assess interobserver agreement when the data are nominal.
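As a rough illustration of the alpha computation described above (not part of the assignment's SPSS steps), the following minimal Python sketch applies the standard formula, alpha = k/(k - 1) * (1 - sum of item variances / variance of the total score), to a small set of made-up Likert-type responses. The function name and the data values are purely hypothetical, not the actual hsbdataB items.

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's coefficient alpha for an (n_respondents x k_items) array.

    alpha = k / (k - 1) * (1 - sum of item variances / variance of total score)
    """
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)      # variance of each item
    total_variance = items.sum(axis=1).var(ddof=1)  # variance of the summed scale
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical example: 5 respondents answering 4 Likert-type items (1-5 scale)
responses = [[4, 5, 4, 4],
             [2, 3, 2, 3],
             [5, 5, 4, 5],
             [3, 3, 3, 2],
             [4, 4, 5, 4]]
print(round(cronbach_alpha(responses), 3))
```

In Problems 3.1 to 3.3, the corresponding alphas are obtained for the motivation, competence, and pleasure items rather than computed by hand.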
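For the single-score reliability described in Problem 3.4, the index is simply the Pearson correlation between the two sets of scores (test and retest, two equivalent forms, or two raters). A minimal sketch with made-up scores follows; the variable names and values are hypothetical placeholders, not the actual hsbdataB variables.

```python
import numpy as np

# Hypothetical visualization test and retest scores for the same participants
visual_test = np.array([14.8, 4.7, 7.5, 21.0, 2.2, 10.5, 9.8, 14.3])
visual_retest = np.array([14.3, 6.0, 8.1, 19.5, 3.0, 9.7, 11.2, 13.6])

# Pearson correlation between the two administrations; a high positive r
# is evidence of test-retest (or equivalent forms) reliability
r = np.corrcoef(visual_test, visual_retest)[0, 1]
print(round(r, 3))
```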
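Cohen's kappa, used in Problem 3.5 for nominal data, corrects the raw percentage of interobserver agreement for the agreement expected by chance: kappa = (p_observed - p_expected) / (1 - p_expected). A minimal sketch with hypothetical observer codes (the categories and values are invented for illustration):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters' nominal codes of the same cases."""
    n = len(rater_a)
    # Observed proportion of agreement
    p_obs = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement based on each rater's marginal category frequencies
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    categories = set(counts_a) | set(counts_b)
    p_exp = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    return (p_obs - p_exp) / (1 - p_exp)

# Hypothetical example: two observers assigning each of 10 cases to a category
obs1 = ["yes", "no", "yes", "yes", "no", "yes", "no", "no", "yes", "yes"]
obs2 = ["yes", "no", "yes", "no", "no", "yes", "no", "yes", "yes", "yes"]
print(round(cohens_kappa(obs1, obs2), 3))
```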
Assumptions for Measures of Reliability

When two or more measures, items, or assessments are viewed as measuring the same underlying variable (construct), reliability can be assessed. Reliability is used to indicate the extent to which scores are consistent with one another (hopefully, in measuring the intended construct/variable) and the extent to which the data are free from measurement error. It is assumed that each item or score is composed of a true score measuring the underlying construct, plus error; there is almost always some error in the measurement. Therefore, one assumption is that the measures or items are related systematically to one another in a linear manner because they are believed to be measures of the same construct. In addition, because true error should not be correlated systematically with anything else, a second assumption is that the errors (residuals) for the different measures or assessments are uncorrelated. If errors are correlated, the residual is not simply error; rather, the different measures have not only the proposed underlying variable in common but also something else systematic in common, and reliability estimates may be inflated. An example of a situation in which the assumption of uncorrelated errors might be violated would be when all items are parts of a cognitive test that is timed. The performance features that are affected by timing the test, in addition to the cognitive skills involved, might systematically affect responses to the items. The best way to determine whether part of the reliability score is due to these extraneous variables is to do multiple types of reliability assessments (e.g., equivalent forms and test-retest).
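One informal way to examine the first assumption above (that items believed to measure the same construct relate to one another in a roughly linear, positive way) is to inspect the inter-item correlation matrix before summing the items. A minimal sketch using the same kind of made-up responses as in the alpha sketch earlier; the values are hypothetical.

```python
import numpy as np

# Hypothetical Likert-type responses (rows = respondents, columns = items)
items = np.array([[4, 5, 4, 4],
                  [2, 3, 2, 3],
                  [5, 5, 4, 5],
                  [3, 3, 3, 2],
                  [4, 4, 5, 4]])

# Inter-item correlation matrix: items assumed to measure the same construct
# should correlate positively with one another
print(np.round(np.corrcoef(items, rowvar=False), 2))
```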
Conditions for Measures of Reliability

A condition that is necessary for measures of reliability is that the scores or categories being related to one another need to be comparable. If you use split-half reliability, then both halves of the test need to be equivalent. If you use alpha (which we demonstrate in this chapter), then it is assumed that every item is measuring the same underlying construct. It is assumed that respondents should answer similarly on the parts being compared, with any differences being due to measurement error.

• Retrieve your data file: hsbdataB.