ABSTRACT

During a second prototyping phase of the new TOEFL® project, from 2000 to 2001, we sought evidence for the quality of the measures in the proposed blueprint that resulted from the first phase (the blueprint is summarized in Table 4.8). Our focus in this second phase broadened from issues related to the design of tasks to those related to the design of measures composed of these tasks. The research in this phase was motivated primarily by the need to support the generalization inference, under which the scores that test takers obtained on each of the four measures would be interpreted as reflective of the scores they would obtain on parallel versions of the measures. In particular, the design of the speaking and writing measures needed to take into account the many factors that would affect score generalizability, as well as practical constraints. These measures were composed of different task types that varied in their dependence on input from other modalities and that required complex constructed responses. Only a few of these tasks, which were time-consuming to administer and costly to score, could be included on a test form. Therefore, a study was designed to assess the impact of different task configurations and rating designs on the reliability of the test measures.