ABSTRACT

Test developers face two issues: (a) what to measure, and (b) how to measure it (Lindquist, 1936). For most large-scale testing programs, test blueprints specify content and cognitive demands, addressing “what to measure.” Regarding “how to measure,” one dilemma facing designers is the choice of item format. The issue is significant in several ways. First, score interpretations vary according to item format. Second, for policymakers, the cost of scoring open-ended items can be enormous compared with that of scoring multiple-choice items. Third, using any given format may affect instruction in ways that foster or hinder the development of the cognitive skills being measured, an effect related to systemic validity (Frederiksen & Collins, 1989). All parties to these discussions point to the centrality of validity concerns. Whether our attention is on systemic validity, on a unitary construct validity orientation (Messick, 1989), or on consequential validity (see Mehrens, chap. 7, this volume; Messick, 1994), meaning and inference remain our central concerns.