ABSTRACT

One of the important features of an empirical study is the quality of its measurements, i.e., are the constructs measured appropriately (i.e., validity, Chapelle, this volume) and are these measurements trustworthy (i.e., reliability, J. D. Brown, this volume). We expect the scores on a testing instrument to be a reflection of the construct we intend to measure, for example, language proficiency, and nothing else. However, in practice, other factors influence test performance as well. For example, if we test a candidate’s writing proficiency by means of a writing prompt, eliciting a written composition, the writing score the candidate receives will be the result of the candidate’s writing proficiency, the characteristics of the writing prompt, the rating criteria, the rater’s severity, the interactions between these factors and probably some random other factors as well. If, shortly afterwards, the candidate has to take a new writing test, all these characteristics and their effects on the writing score are likely to be somewhat different, except for the candidate’s writing proficiency. So, the comparability of the scores will largely depend on the size of the effects of the factors mentioned and their interactions, compared to the candidate’s proficiency. Classical test theory (CTT, see J. D. Brown, this volume) focuses on the subdivision of the variance in test takers’ scores into variance due to proficiency differences (‘true score’ variance) and variance due to other, random factors (error variance). Generalisability theory (G-theory) aims at teasing apart the effects of different factors, such as writing prompt and rater, and thus determining the generalisability of test scores across, for example, prompts and raters. As such, G-theory can be seen as an extension of CTT, but G-theory also adds a number of interesting possibilities for evaluating the quality of (complex) measurements and for designing new assessments. In the next sections, we will introduce a few applications of G-theory. The flexibility of this approach comes at the cost of computational complexity. Here, we will confine ourselves to the basic elements of G-theory. The interested reader is referred to the suggested readings at the end of this chapter.

When we want to determine the generalisability of an assessment or test score, the first question that needs to be addressed is to what domain we want to generalise, and thus: what is the universe of admissible observations from which we consider our measurement to be a (random) sample. When measuring writing proficiency, we might consider our writing assignment and the subsequent expert rating as appropriate. At the same time we probably acknowledge that another writing assignment and the rating by another expert could have been used equally well. Actually, we assume that our current assessment score has predictive value for the performance on other (similar) assessments. This predictive value may concern just the ranking of the candidates (relative decisions) or the actual level of the scores (absolute decisions). In all cases, we have to consider what the construct is that we intend to measure, and what we consider admissible observations or measurements of that construct. G-theory is often presented as just another approach to reliability, but its concerns and questions about the facets of a measurement underscore that generalisability is very much intertwined with validity issues.

Assessments for a given construct can be described in terms of one or more aspects of the testing conditions, such as writing prompts, text types and raters. Each aspect of the testing conditions is called a facet and has its own categories. For instance, the facet of text types may consist of persuasive, descriptive and narrative texts. A facet can be considered a set of similar conditions (Brennan, 1992). The facets involved in a study define the universe of admissible observations. For example, in a study with a three-faceted operationalisation of writing proficiency, this universe is determined by combinations of the different (acceptable) prompts, the different (acceptable) text types, and the different (acceptable) raters.
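
To make this concrete in set notation (the notation here is added for illustration and is not the chapter’s own): if T is the set of acceptable prompts, G the set of acceptable text types and R the set of acceptable raters, the three-faceted universe of admissible observations consists of all combinations

\[ U = T \times G \times R = \{(t, g, r) \mid t \in T,\ g \in G,\ r \in R\}, \]

of which any single assessment samples only a few.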

In CTT the test taker is assumed to have a ‘true’ score that, with some error, determines the actual observed scores. G-theory’s counterpart of the true score is defined as a person’s mean score across all admissible observations, that is, the mean score over an infinite number of writing prompts for an infinite number of text types with ratings by an infinite number of raters. Of course, this is a theoretical definition. The actual measures in an assessment are taken as a sample of all possible measures, which allows us to estimate a person’s mean score over all admissible scores. This mean score is called the universe score and can be seen as the analogue of a person’s ‘true score’ in CTT. G-theory is concerned with the generalisability of actual observations to all admissible scores and thus to a person’s universe score. Most commonly in applied linguistic research, the language learners are the so-called objects of measurement and for convenience we will take our examples from these types of studies, but obviously the following also applies to studies with other objects of measurement, such as schools or sentences.
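
As a sketch of this definition, and using notation of our own rather than the chapter’s, consider for simplicity a fully crossed design with two random facets, in which every person p responds to every prompt t and is scored by every rater r. The universe score is then the expectation of the observed score over the universes of admissible prompts and raters,

\[ \mu_p = \mathbf{E}_t\, \mathbf{E}_r\, X_{ptr}, \]

and each observed score can be thought of as the sum of a grand mean, main effects for person, prompt and rater, their interactions, and a residual,

\[ X_{ptr} = \mu + \nu_p + \nu_t + \nu_r + \nu_{pt} + \nu_{pr} + \nu_{tr} + \nu_{ptr,e}, \]

where, for example, \(\nu_p = \mu_p - \mu\) is person p’s deviation from the grand mean.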

Evaluating the generalisability of scores, we need to be aware of the kind of information we want to include in our generalisation. In some studies the absolute level of the scores is not very meaningful or relevant. In those cases we are mainly interested in the ranking of candidates, for instance, for correlational analyses. We make so-called relative decisions about candidates. For example, two raters who differ in severity, but rank the candidates in the same way, will still come to the same relative decisions. However, if the absolute level of a candidate’s score is meaningful or relevant, and we want to make claims about candidates’ level of performance, we make absolute decisions. This could be the case in forms of criterion-referenced testing where candidates have to reach a certain preset score level for admission or immigration (Davidson and J. D. Brown, this volume, both discuss criterion-referenced testing; see Glaser, 1963, for the origin of the concept). It is not hard to imagine that in these latter situations raters of different severity are not interchangeable, and that when (severity) differences are large, the generalisability of single ratings is limited. So, the kind of decisions we want to make based on our scores determines the kinds of differences between raters, tasks and such we have to take into account when we evaluate the generalisability of the scores. Below we will introduce separate generalisability indexes for relative and absolute decisions.
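
The distinction can be sketched for a simple one-facet design in which every candidate is scored by the same n_r randomly sampled raters (again, the notation is ours; the chapter’s own indexes follow below). For relative decisions only effects that can change the ranking of candidates count as error, essentially the person-by-rater interaction (confounded with the residual); for absolute decisions, differences in rater severity count as error as well:

\[ \sigma^2(\delta) = \frac{\sigma^2_{pr,e}}{n_r}, \qquad \sigma^2(\Delta) = \frac{\sigma^2_{r}}{n_r} + \frac{\sigma^2_{pr,e}}{n_r}. \]

The corresponding indexes compare the person (universe-score) variance with the person variance plus the relevant error variance.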

So far, we have assumed that the conditions within a facet are random. We consider our raters as a random sample of all acceptable raters. When, for example, we employ two raters in our studies and we consider them interchangeable with any other two acceptable raters, raters is a random facet of our measurement. The same reasoning can be applied to a set of items in a reading test: items as a random sample of all possible items, and interchangeable with another set of items. However, in cases where the conditions of a facet are limited and exhausted by the observations in the measurement, the facet is considered a fixed facet (Shavelson and Webb, 1991; Brennan, 2001). For example, when we administer writing tasks for fictional writing and for factual writing, and we intend to generalise to a universe consisting of these two conditions, fictional and factual writing, the facet of factualness is a fixed facet. Since the universe of generalisation is restricted, there is no longer a concern about generalisation across sorts of writing.

The generalisability of the observed scores is determined, on the one hand, by the kind of decisions (relative or absolute) and generalisations (over random or fixed facets) we want to make, and, on the other hand, by the sample of measurements we have to work with. G-theory uses analysis of variance to estimate both the variance due to individual differences in proficiency (the facet person representing the objects of measurement) and the effects of different facets on score variance. The first stage of calculating the variance components of the different facets and their interactions is called a G(eneralisability)-study. The results of the G-study, i.e., the variance estimates, can be used to design future assessments and to make decisions about the number of tasks, raters and such. This second stage of designing a new assessment and estimating its expected generalisability is usually referred to as a D(ecision)-study. In the next sections, we will present some of the statistical underpinnings of G- and D-studies by discussing a few examples in more detail.
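
As a concrete, deliberately simplified illustration of these two stages, the sketch below estimates variance components for a persons-by-raters crossed design with a two-way analysis of variance and then projects coefficients for different numbers of raters. It is a minimal sketch in Python using only numpy; the scores, the two-facet design and all variable names are invented here for illustration and are not taken from the chapter.

import numpy as np

# Hypothetical G-study data: rows are persons, columns are raters
# (one score per person-rater combination, fully crossed design).
scores = np.array([
    [4.0, 3.5, 4.5],
    [2.5, 2.0, 3.0],
    [5.0, 4.5, 5.0],
    [3.0, 2.5, 3.5],
    [4.5, 4.0, 4.0],
])
n_p, n_r = scores.shape

# G-study: two-way ANOVA without replication.
grand = scores.mean()
person_means = scores.mean(axis=1)
rater_means = scores.mean(axis=0)

ms_p = n_r * np.sum((person_means - grand) ** 2) / (n_p - 1)
ms_r = n_p * np.sum((rater_means - grand) ** 2) / (n_r - 1)
resid = scores - person_means[:, None] - rater_means[None, :] + grand
ms_pr = np.sum(resid ** 2) / ((n_p - 1) * (n_r - 1))

# Expected mean squares of the random model yield the variance components
# (negative estimates are truncated to zero, as is conventional).
var_pr_e = ms_pr                       # person-by-rater interaction + residual
var_p = max((ms_p - ms_pr) / n_r, 0)   # persons (universe-score variance)
var_r = max((ms_r - ms_pr) / n_p, 0)   # raters (severity differences)

# D-study: projected coefficients when averaging over n_prime raters.
for n_prime in (1, 2, 3, 4):
    rel_error = var_pr_e / n_prime            # error for relative decisions
    abs_error = var_r / n_prime + rel_error   # error for absolute decisions
    g_coef = var_p / (var_p + rel_error)      # index for relative decisions
    phi = var_p / (var_p + abs_error)         # index for absolute decisions
    print(f"{n_prime} rater(s): relative = {g_coef:.2f}, absolute = {phi:.2f}")

Adding raters shrinks both error terms, so the projected indexes rise; the chapter’s next sections work through such calculations in more detail.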