Measurement is the systematic process of assigning numbers to represent quantities of a property or attribute. Examples include the use of millimeters as a measure of length, lumens as a measure of visible light, and standardized test scores as a measure of achievement. The first two of these examples, length and visible light, differ from the third in that they involve physical properties that are directly observable. Achievement, on the other hand, is an example of a psychological attribute that can only be observed indirectly, and that is thus more difficult to conceptualize and operationalize. In the context of education, consequential decisions are often informed by the measurement of underlying psychological attributes. To effectively represent these attributes, educational measurement must be supported by theory, carried out with appropriate statistical modeling, and evaluated according to established standards.

The field of educational measurement has evolved through research and practice over roughly the past century, with contributions from the fields of psychology and statistics. Applications include intelligence testing, licensure and certification examinations, and classroom assessments of student learning. Early applications of educational measurement relied on frameworks that fit within what is called classical test theory, which presents observed test scores as the sum of two uncorrelated components: a systematic component, the true score, and a randomly varying component, measurement error. With developments in methodology and improvements in computing power, current measurement frameworks tend to be based on item response theory, which models test performance in terms of interactions between test items and test takers. The chosen measurement theory typically lends itself to a specific statistical model and conceptualization of reliability, as well as procedures for scoring, scaling, and evaluating results. Classical test theory traditionally supplies simpler estimates of reliability and measurement error that are fixed across the score scale, as well as item and person statistics that depend on the particular test administration. Item response theory allows for the estimation of more complex item and person statistics that are, in theory, independent of a test administration, as well as reliability estimates that can vary over the score scale.
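
The contrast between the two frameworks can be illustrated with a small simulation. The following Python sketch is illustrative only and is not drawn from any particular testing program: the function names, the normal distributions, and the parameter values are all assumptions chosen for the example, and the Rasch (one-parameter logistic) model is used as just one simple member of the item response theory family. Note also that true scores are never observed in practice; they are available here only because the data are simulated.

```python
import math
import random
import statistics


def simulate_ctt(n_persons=2000, true_sd=10.0, error_sd=5.0, seed=0):
    """Classical test theory sketch: observed = true + error.

    Errors are drawn independently of true scores (assumed normal here),
    so the two components are uncorrelated in expectation.
    """
    rng = random.Random(seed)
    true_scores = [rng.gauss(50.0, true_sd) for _ in range(n_persons)]
    errors = [rng.gauss(0.0, error_sd) for _ in range(n_persons)]
    observed = [t + e for t, e in zip(true_scores, errors)]
    return true_scores, observed


def ctt_reliability(true_scores, observed):
    """Reliability as the ratio of true-score variance to observed-score variance."""
    return statistics.pvariance(true_scores) / statistics.pvariance(observed)


def ctt_sem(observed, reliability):
    """Standard error of measurement: one value for the whole score scale."""
    return statistics.pstdev(observed) * math.sqrt(1.0 - reliability)


def rasch_probability(theta, b):
    """Rasch model: probability of a correct response given person
    ability theta and item difficulty b (both on the logit scale)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def rasch_sem(theta, difficulties):
    """Conditional standard error under the Rasch model: the reciprocal
    square root of test information, which changes with theta."""
    info = 0.0
    for b in difficulties:
        p = rasch_probability(theta, b)
        info += p * (1.0 - p)
    return 1.0 / math.sqrt(info)


if __name__ == "__main__":
    true_scores, observed = simulate_ctt()
    rel = ctt_reliability(true_scores, observed)
    print("CTT reliability:", round(rel, 3))
    print("CTT SEM (constant):", round(ctt_sem(observed, rel), 3))

    items = [-1.0, -0.5, 0.0, 0.5, 1.0]  # hypothetical item difficulties
    for theta in (-3.0, 0.0, 3.0):
        print("theta", theta, "Rasch SEM:", round(rasch_sem(theta, items), 3))
```

Under these assumptions the classical reliability should land near the ratio of the assumed variances, 100 / (100 + 25) = 0.8, and a single standard error applies to every score. The Rasch standard error, by contrast, is smallest where item difficulties are concentrated and grows toward the extremes of the ability scale, which is the sense in which precision can vary over the score scale.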

When measurement results are used to inform educational decision-making, validity evidence must be collected to document the appropriateness of the information provided. Interpretations and uses of measurement results gain support for their validity as research and practice, typically led by test developers, confirm their appropriateness. Common sources of validity evidence include expert review of testing materials and procedures, correlations with other measures, and dimensionality analysis, as well as documentation that professional standards for test development and administration were followed.