ABSTRACT

In Chapter 4, we considered the off-policy situation, where the data-collecting policy and the target policy are different. In the framework of sample-reuse policy iteration, new samples are always collected following the target policy. However, a clever choice of sampling policies can further improve performance. In statistics and machine learning, the problem of choosing sampling policies is called active learning. In this chapter, we address the choice of sampling policies in sample-reuse policy iteration. In Section 5.1, we explain how a statistical active learning method can be employed to optimize the sampling policy in value function approximation. In Section 5.2, we introduce active policy iteration, which incorporates this active learning idea into the framework of sample-reuse policy iteration. The effectiveness of active policy iteration is numerically investigated in Section 5.3, and Section 5.4 concludes the chapter.