ABSTRACT

Before any interface (robotic or otherwise) can be evaluated, it is necessary to understand the users’ relevant skills and mental models and to develop evaluation criteria with those users in mind. Evaluations based on empirically validated sets of heuristics (Nielsen, 1994) have been used on desktop UIs and Web-based applications. However, current human-robot interfaces differ widely depending on platforms and sensors, and existing guidelines are not adequate to support heuristic evaluations.

Messina, Meystel, and Reeker (2001) proposed some criteria in the intelligent systems literature, but they are qualitative criteria that apply to the performance of the robot only, as opposed to the robots and the users acting as a cooperating system. An example criterion is, “The system ... ought to have the capability to interpret incomplete commands, understand higher level, more abstract commands, and to supplement the given command with additional information that helps to generate more specific plans internally” (p. 1).

In contrast, Scholtz (2002) proposed six evaluation guidelines that can be used as high-level evaluation criteria:

1. Is the necessary information present for the human to be able to determine that an intervention is needed?
2. Is the information presented in an appropriate form?
3. Is the interaction language efficient for both the human and the intelligent system?
4. Are interactions handled efficiently and effectively, both from the user and the system perspective?
5. Does the interaction architecture scale to multiple platforms and interactions?
6. Does the interaction architecture support evolution of platforms?

Usability evaluations use effectiveness, efficiency, and user satisfaction as metrics for evaluation of UIs. Effectiveness metrics evaluate the performance of tasks through the UI. In HRI, the operators’ tasks are to monitor the behavior of robots (if the system has some level of autonomy); to intervene when necessary; and to control navigation either by assigning waypoints, issuing a command such as “back up,” or teleoperating the robot if necessary. In addition, in search and rescue, operators have the task of identifying victims and their location.

Not only must the necessary information be present, it must also be presented in such a way as to maximize its utility. Information can be present but in separated areas of the interface, requiring users to manipulate windows to gain an overall picture of system state. Such manipulation takes time and can result in an event not being noticed for some time. Information fusion is another aspect of presentation. Time delays and errors occur when users need to fuse a number of different pieces of information.

As robots become more useful in various applications, we think in terms of using multiple robots. Therefore, the UIs and the interaction architectures must scale to support operators controlling more than one robot.

Robot platforms have made amazing progress in the last decade and will continue to progress. Rather than continually developing new user interaction schemes, is it possible to design interaction architectures and UIs to support hardware evolution? Can new sensors, new types of mobility, and additional levels of autonomy be easily incorporated into an existing UI?

We use Scholtz’s (2002) guidelines as an organizing theme for our analysis, operationalizing and tailoring them to be specific to the urban search and rescue environment.

Evaluation methods from the HCI and CSCW worlds can be adapted for use in HRI as long as they take into account the complex, dynamic, and autonomous nature of robots. The HCI community often speaks of three major classes of evaluation methods: inspection methods (evaluation by UI experts), empirical methods (evaluation involving users), and formal methods (evaluation focusing on analytical approaches).
Robot competitions lend themselves to empirical evaluation because they involve users performing typical tasks in as realistic an environment as possible (for a description of some robot competitions, see Yanco, 2001). Unfortunately (from the viewpoint of performing the empirical technique known as formal usability testing), robot competitions normally involve the robot developers, not the intended users of the robots, operating the robots during the competition. The performance attained by robot developers, however, can be construed as an “upper bound” for the performance of more typical users. Specifically, if the robot developers have difficulty using aspects of the interface, then typical users will likely experience even more difficulty. In addition, robot competitions afford an interesting opportunity (one not attained so far in formal usability testing of HRI) to correlate HRI performance under controlled conditions to HRI design approaches.

Although the AAAI Robot Competition provided us with an opportunity to observe users performing search and rescue tasks, there were two limitations. First, we were not able to converse with the operators due to the time constraints they were under, which eliminated the possibility of conducting think-aloud (Ericsson & Simon, 1980) or talk-aloud (Ericsson & Simon, 1993) protocols, and also eliminated our ability to have operators perform tasks other than those implied by the competition (i.e., search for victims). Second, the competition simulated a rescue environment. Many of the hazards (beyond those incorporated in the arena) and stress-inducing aspects of an actual search and rescue environment were missing. Nonetheless, this environment was probably the closest we could use in studying search and rescue tasks due to safety and time constraints in actual search and rescue missions.

Two patterns were observed in previous HRI empirical testing efforts that limit the insights obtained to date. The first, as mentioned previously, is a tendency for robot performance to be evaluated using atypical users. For example, Yanco (2000) used a version of a usability test as part of an evaluation of a robotic wheelchair system but did not involve the intended users operating the wheelchair (the wheelchair was observed operating with able-bodied occupants). We have started to break this pattern by also analyzing the use of two urban search and rescue robot systems by a fire chief, a more typical user, after the competition runs were completed.

The second pattern that limits HRI empirical testing effectiveness is the tendency to conduct such tests very informally. For example, Draper, Pin, Rowe, and Jansen (1999) tested the Next Generation Munitions Handler/Advanced Technology Demonstrator, which involves a robot that re-arms military tactical fighters. Although experienced munitions loaders were used as test participants, testing sessions were actually hybrid testing and training sessions, and test parameters were not held constant during the course of the experiment. Data analysis was primarily confined to noting test participants’ comments