This work explores the border area between vision and natural language with respect to a particular task: obtaining verbal descriptions of scenes with motion. The task involves image understanding, since we assume that the time-varying scene to be described is represented by an image sequence. Hence, part of the problem is image-sequence analysis. We focus on high-level aspects: recognizing interesting occurrences that extend over time. Very little is said about the lower-level processes that constitute the scope of vision in a narrow sense. The concepts and representations proposed in this work can be viewed as extending the scope of a vision system beyond the level of object recognition. In this respect, our work is a contribution to the question raised by Waltz (1979): What should the output of a (complete) vision system be?