ABSTRACT

Recognizing human activities in video is a critical capability of an intelligent vision system in many applications, including video surveillance. On this task, human vision still outperforms any existing automatic technique. Humans tend to describe, remember, perform, and compare an action as a sequence of key poses of the body. Most existing vision techniques attempt to mimic this to a certain degree, explicitly or implicitly, by modeling and classifying human actions based on key poses and their order. Unfortunately, such techniques typically rely on sophisticated feature extraction (e.g., explicit detection and tracking of body parts, or doing so implicitly through complex representation and detection of body motion with high-dimensional spatiotemporal features), which is a challenging task in its own right, especially given the wide variability of acquisition conditions.