ABSTRACT

Communication is a concept of information exchange. It incorporates speech perception, visual perception, and understanding, which are correlated by sets of complex and interspersed conscious and subconscious processes. The complementary nature of multimodal information suggests that verbal information is accompanied by spontaneous and planned body language, especially gesturing with the lips, hands, and arms. In fact, the perception of a speech sound is affected by a non-matching lip/mouth movement (Sargin et al., 2006). Most researchers agree that speech-synchronized coverbal behavior is essential in communication. Consequently, artificial interfaces should also integrate natural speech-gesture production to at least some degree. Communication between different entities, whether human or synthetic, commonly reflects personality and emotions (Wallbott, 1998), idiosyncratic features (Gallaher, 1992), and communicative functions (Allwood et al., 2007a). These nonverbal features are regarded as coverbal behavior. Coverbal behavior benefits the understanding process of the person being addressed (Bavelas and Chovil, 2000; Kendon, 2004), and the cognitive process of the speaker (Pine et al., 2007; Kita and Davies, 2009). Information expressed by co-aligned verbal and non-verbal behavior is better understood and achieves its purpose (e.g. inducing a social response) faster. Embodied Conversational Agents (ECAs), such as those in (Kopp and Wachsmuth, 2002; Mlakar and Rojc, 2011; Poggi et al., 2005; Thiebaux et al., 2008), present a paradigm of artificial bodies that can control and move different body parts, and are capable of communicating by using their voices, faces, hands, and arms (or their full bodies). ECAs may represent coverbal behavior in the form of a communicative function and/or by directly representing the semiotic nature of the spoken dialog (e.g. as an iconic/metaphoric representation, by stressing the importance of spoken segments, or simply by regulating the flow of information exchange). The believability of synthetic coverbal behavior closely relates to the term expressivity, namely, the ability to perform continuous, smooth, and context-adaptable communicative acts that emulate natural movement tendencies and dynamics, in synchrony with the situational context and/or the verbal flow. Interaction incorporating expressive ECAs has been shown to provide visual meaning and to benefit the understanding of the spoken words and actions performed in multimodal interfaces. Although ECAs and synthetic ‘communicative’ behavior have been well researched, the co-alignment of speech and non-verbal expressions still represents an important and challenging task. The behavior overlaid by such agents is, therefore, often limited to lip-sync (Tang et al., 2008; Zorić and Pandžić, 2008) and facial expressions (Lankes and Bernhaupt, 2011), or is based on behavior generation/realization engines that incorporate scenarios and/or semantically tagged text (Ćereković and Pandžić, 2011; Krenn et al., 2011; Nowina-Krowicki et al., 2011; van Oijen et al., 2012).
In general, the correlation between verbal and non-verbal signals originates from the semantic, pragmatic, and temporal features of the multimodal content (Jokinen, 2009; Kendon, 2000; McNeill, 1992). Some coverbal gestures, like iconic expressions (Hadar and Krauss, 1999; Straube et al., 2011), symbolic expressions (Barbieri et al., 2009), and mimicry (Holler and Wilkin, 2011), are tightly interlinked with speech. These gestures may be identified by the linguistic (semantic) properties of the input text, such as word type, word-type order, word affiliation, etc. Other coverbal gestures, especially those representing communicative functions (e.g. indexical and adaptive expressions, Allwood, 2001), have little (if any) evident semantic or linguistic alignment with the text. However, they may still be identified by linguistic fillers (Grenfell, 2011), turn-taking, and directional signals. Although speech and coverbal gestures are manifested by the same underlying process (i.e. they are different sides of the same coin), each modality conveys information in a different way. For instance, two gestures produced together do not necessarily form a gesture expressing a complex meaning. Gestures are also not completely linguistic in nature: several gestures may represent a similar meaning, whereas similar gestures may represent totally different meanings. There are also no grammatical rules governing the movement structure by which a gesture is propagated. Language, on the other hand, has grammar and order.