Human Benchmarking of Natural Language Systems

doi:10.4324/9781315044682-4

ABSTRACT

This chapter documents an approach to the evaluation of intelligent computer systems, in particular natural language (NL) understanding systems. Our overall project strategy was to develop a multidimensional system to evaluate both qualitative and quantitative elements of natural language computer programs. The reasons for this research were threefold. First, it is difficult for program managers in government and potential users of intelligent computer systems to get clear and consistent indicators of improvement in system performance in other than very technical terms. Second, the evaluation of such systems in the computer science community had, to this point, proceeded unsystematically and generally without regard to the long history of evaluation and measurement in the social sciences. Third, as a research enterprise, we were interested in understanding how, and to what extent, computer programs purported to model intelligence (i.e., artificial intelligence) can be referenced back to the performance of humans.