This chapter describes robotic imitation aided by verbal suggestions in addition to demonstration, as one phase of integrative multimedia understanding by a robot. Robotic, or artificial, imitation is a kind of machine learning applied to human actions, and a considerable number of studies have addressed imitation learning from human actions demonstrated without any verbal hint. The mental image directed semantic theory proposes a model of human attention-guided perception that yields omnisensory images, which inevitably reflect certain movements of the observer's focus of attention as it scans matters in the world. The most remarkable feature of the mental image description language is its capability of formalizing spatiotemporal matter concepts grounded in human or robotic sensation, whereas other similar knowledge representation languages are designed to describe logical relations among conceptual primitives represented by lexical tokens.