ABSTRACT

The gradient-based direct policy search methods introduced in Chapter 7 are particularly useful for controlling continuous systems. However, appropriately choosing the step size of gradient ascent is often difficult in practice. In this chapter, we introduce another direct policy search method, based on the expectation-maximization (EM) algorithm, that does not involve a step size parameter. In Section 8.1, the main idea of the EM-based method is described; it is expected to converge faster than the gradient-based approach because policies are updated more aggressively. In practice, however, direct policy search often requires a large number of samples to obtain a stable policy update estimator. To improve stability when the sample size is small, reusing previously collected samples is a promising approach. In Section 8.2, the sample-reuse technique that has been successfully used to improve the performance of policy iteration (see Chapter 4) is applied to the EM-based method. The experimental performance of the resulting method is evaluated in Section 8.3, and the chapter is concluded in Section 8.4.
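As a schematic contrast (using generic notation not fixed by this abstract: $\boldsymbol{\theta}$ for the policy parameter, $J(\boldsymbol{\theta})$ for the expected return, and $\varepsilon$ for the step size), the two update rules may be sketched as
\[
\boldsymbol{\theta} \longleftarrow \boldsymbol{\theta} + \varepsilon \nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta})
\quad \text{(gradient ascent; requires choosing $\varepsilon$)},
\]
\[
\boldsymbol{\theta} \longleftarrow \mathop{\mathrm{argmax}}_{\boldsymbol{\theta}'} \, \underline{J}(\boldsymbol{\theta}' \,|\, \boldsymbol{\theta})
\quad \text{(EM-style; maximizes a lower bound $\underline{J}$ of $J$, no step size)},
\]
where the precise form of the lower bound $\underline{J}$ is given in Section 8.1.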