ABSTRACT

The assessment of performance in understanding and using a second language potentially spans an extraordinary range of domains, tasks, testing techniques, scoring procedures, and score interpretations. The notion of “performance” is usually associated with overt simulation of real-world language use, but it also extends to the recognition of well-formed utterances and sentences, as evidenced in assessment methods such as grammaticality judgments, picture identification, and the ubiquitous multiple-choice item. Performance on these types of instruments is primarily passive and indirect, tapping into learners’ declarative knowledge about a second language. Test takers need only indicate their capacity to identify what is well-formed in the target language. Indirect methods of assessing language proficiency have for many decades been found wanting in terms of content, construct, and predictive validity, but have survived primarily because of their relative cost-effectiveness and familiarity.

Demonstrations of procedural knowledge, in tasks such as timed essays, role plays, interviews, and simulations, represent the most direct forms of performance assessment. Direct assessments of language proficiency have been associated with task-based language assessment (Norris et al., 1998), though the transition from the most indirect discrete-point forerunners to current task-based assessments spanned more than a decade of experimentation with testing methods such as translation, cloze, and dictation, which were thought to require the integration of declarative knowledge into quasi-productive modalities (Oller, 1979). The integrated skills approach was considered an advance over its discrete-point, indirect predecessors, but as Wesche (1987) concluded, integrative tests may not be direct enough to predict proficiency accurately in contexts where language use requires overt performances.

As performance assessment becomes increasingly synonymous with task-based assessment, the specification of what an assessment task actually entails is subject to interpretive variability. Norris et al. (1998) identify tasks as facsimiles of what occurs in the real world, reproduced in an assessment context. Bachman (2002), in contrast, postulates that tasks designed without detailed specifications about the target domain can easily under-represent the construct they are supposed to measure. Claims about language proficiency implied by performance on assessment tasks are thus crucially dependent on how thoroughly the tasks sample and represent the constructs they are designed to measure, the extent to which their content corresponds to language use outside the assessment context, and the degree to which performances on such tasks predict proficiency in non-assessment contexts.

Basic to performance task specification and design is the notion of task difficulty. Attempts to design facets of difficulty into tasks (e.g., Robinson, 2001) have been predicated on the idea that particular characteristics of tasks can be proactively and reliably manipulated by designers to calibrate their relative difficulty. The difficulty-by-design approach has not, however, consistently produced predictable differences in difficulty among tasks. As Bachman (2002) has stressed, task difficulty is not solely dependent on analytically derived characteristics of a task but depends also on how individual test takers’ experiences and abilities interact with particular tasks. This issue is a core challenge for task-based assessment design. Norris et al. (2002), for instance, argue that while some a priori estimates of task difficulty are borne out in empirical results, inferences about candidates’ performances on other tasks in the same domain might not be trustworthy. Their findings suggest that the task-feature approach may be an insufficient basis for task design: a feature checklist does not predict actual difficulty well, nor does it necessarily satisfy expectations of face or content validity.

Central to any validity claim is content domain analysis. Cumming et al. (2004) provide an example of how content validity, which is subjectively argued, can be approached. They examined the perceptions of English as a Second Language (ESL) instructors about the authenticity of specimen tasks devised for the performance-oriented revision of the Test of English as a Foreign Language (TOEFL). Because the perception of authenticity is central to the evidence-centered design approach adopted for the redesign of the TOEFL, expert judgment is considered essential for supporting claims about the validity of tasks devised to be facsimiles of performance in the target use domain. Content validity is regarded by some measurement theorists (e.g., Lissitz and Samuelsen, 2007) as essential to defining validity. The issue of content validity is even more crucial for performance assessment, but it remains a necessary, though not sufficient, condition for a cohesive validity claim. Bachman and Palmer (1996) and Bachman (2002, 2005) argue for rigorous construct validation of performance assessments in addition to validity claims predicated on correspondences of task content to performance in the target domain. The rationale for test use thus shifts from domain sampling to the interpretation of test outcomes: the assessment outcomes need to be construct-valid and to predict future performances in the target use domain.

Recent efforts to specify the relations among constructs, the tasks devised to measure those constructs, and performance characteristics have been applied to a variety of language assessments. The evidence-centered design approach (Mislevy et al., 2002; Chapelle et al., 2008) uses a set of inferential procedures derived from Toulmin’s (2003) argument structure to integrate models of examinee performance, construct validation, and task design specifications, collectively strengthening the linkage of examinee performance characteristics to the constructs the performances are meant to instantiate. These schematic models require a multi-step, interlinking chain of arguments from the domain description to the utilization of assessment results, and at each step evidence-based justifications are required for the ultimate validity claims to be made. For performance assessments, the evidential basis for test score interpretation crucially depends on the validity of the arguments linking the body of information or experience (i.e., the backing) to the test data that justify interpretive claims (i.e., the warrant). Bachman (2005) provides a hypothetical example of a performance task examined through the lens of the Toulmin argument structure. An analogous example is presented here, but with excerpts from an authentic performance assessment specimen: a role play extracted from an oral proficiency interview.