ABSTRACT

Reliable recovery and tracking of articulated human motion from video is a challenging problem in computer vision, due to the versatility of human movement, the variability of body types, the variety of movement styles and signatures, and the 3D nature of the human body. Vision-based tracking of articulated motion is a temporal inference problem, and numerous computational frameworks address it. Some frameworks use training data (e.g., Urtasun et al., 2006) to inform the tracking, while others attempt to infer the articulated motion directly without any training data (e.g., Deutscher et al., 2000). When training data are available, articulated motion tracking can be cast as a statistical learning and inference problem: given a set of training examples, a learning and inference framework must be developed to track both seen and unseen movements performed by known or unknown subjects. In terms of the learning and inference structure, existing 3D tracking algorithms can be roughly grouped into two categories, namely generative-based and discriminative-based approaches. Generative-based approaches (e.g., Deutscher et al., 2000; Kakadiaris and Metaxas, 2000; Sidenbladh et al., 2000) usually assume knowledge of a 3D body model of the subject and dynamical models of the related movement, from which kinematic predictions and
corresponding image observations can be generated. The movement dynamics are learned from training examples using various dynamic system models, e.g., autoregressive (AR) models (Agarwal and Triggs, 2004), hidden Markov models (Pavlovic et al., 2000), Gaussian process dynamical models (Urtasun et al., 2006), and piecewise linear models in the form of a mixture of factor analyzers (Li et al., 2007). A recursive filter is often deployed to temporally propagate the posterior distribution of the state. In particular, particle filters have been used extensively in movement tracking to handle nonlinearity in both the observation and the dynamic equations. Discriminative-based approaches (e.g., Mori and Malik, 2002; Grauman et al., 2003; Sminchisescu et al., 2005a,b; Agarwal and Triggs, 2006; Thayananthan et al., 2006) treat kinematic recovery from images as a regression problem from the image space to the body kinematics space: the relationship between image observations and body poses is learned from training data using machine learning techniques. Each approach has its own pros and cons. In general, generative-based methods exploit movement dynamics and produce more accurate tracking results, although they are more time-consuming, and the conditional distribution of the kinematics given the current image observation is usually not used directly. On the other hand, discriminative-based methods learn such conditional distributions of kinematics given image observations from training data and often yield fast image-based kinematic inference. However, movement kinematics are usually not fully exploited by discriminative-based methods, so the rich temporal correlation of body kinematics between adjacent frames goes unused in tracking.
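As a rough illustration of the generative, recursive-filtering pipeline described above, the following Python sketch shows one particle-filter update; the dynamical model, image likelihood, particle count, and pose dimensionality are hypothetical placeholders, not the models used in any of the cited works.

```python
import numpy as np

N_PARTICLES = 500   # number of pose hypotheses (assumed value)
POSE_DIM = 30       # e.g., joint angles of an articulated body model (assumed)

def propagate(poses, rng):
    """Placeholder dynamical model: a real tracker would use a learned
    AR model, HMM, or GPDM prediction plus process noise."""
    return poses + rng.normal(scale=0.05, size=poses.shape)

def likelihood(poses, image):
    """Placeholder image likelihood: a real tracker would render the 3D body
    model for each pose and compare it with silhouettes or edge maps."""
    return np.exp(-0.01 * np.sum(poses ** 2, axis=1))

def particle_filter_step(poses, weights, image, rng):
    # 1. Resample particles according to their current weights.
    idx = rng.choice(len(poses), size=len(poses), p=weights)
    poses = poses[idx]
    # 2. Predict each hypothesis forward with the dynamical model.
    poses = propagate(poses, rng)
    # 3. Re-weight by how well each prediction explains the new frame.
    weights = likelihood(poses, image)
    weights = weights / weights.sum()
    return poses, weights

rng = np.random.default_rng(0)
poses = rng.normal(size=(N_PARTICLES, POSE_DIM))
weights = np.full(N_PARTICLES, 1.0 / N_PARTICLES)
for image in [None] * 10:            # stands in for a sequence of video frames
    poses, weights = particle_filter_step(poses, weights, image, rng)
    estimate = weights @ poses       # posterior-mean pose for this frame
```

A discriminative-based method would instead replace the propagate/likelihood pair with a regressor that maps image features directly to poses, which is why, as noted above, it forgoes the temporal model and the frame-to-frame correlation it carries.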