ABSTRACT

Building automatic dialogue systems that match human flexibility and reactivity has proven difficult. Many factors impede the progress of such systems, useful as they may be, from low-level real-time audio signal analysis and noise filtering, through medium-level turn-taking cues and control signals, to high-level interpretation of dialogue intent and content. Of these, we have focussed on the dynamics of turn-taking, that is, the real-time control of who has the turn and how turns are exchanged, and on how to integrate these in an expandable architecture for dialogue generation and control. Manual categorization of silences, prosody, and other candidate turn-giving signals, or analysis of corpora to produce static decision trees for this purpose, cannot address the high within- and between-individual variability observed in natural interaction. As an alternative, we have developed an architecture with integrated machine learning, allowing the system to automatically acquire proper turn-taking behavior. The system learns cooperative ("polite") turn-taking in real time by talking to humans via Skype. Results show performance close to that of humans in naturally occurring dialogue, with 20% of turn transitions taking place in under 300 milliseconds (msecs) and 50% in under 500 msecs. Key contributions of this work are methods for constructing more capable dialogue systems with an increasing number of integrated features, an implementation of adaptivity for turn-taking, and a firmer theoretical ground on which to build holistic dialogue architectures.