ABSTRACT

Off-policy reinforcement learning aims to make efficient use of data samples gathered from a policy that differs from the policy currently being optimized. A common approach is to use importance sampling techniques to compensate for the bias caused by the difference between the data-sampling policy and the target policy. In this chapter, we explain how importance sampling can be utilized to efficiently reuse previously collected data samples in policy iteration. After formulating the problem of off-policy value function approximation in Section 4.1, we review representative off-policy value function approximation techniques, including adaptive importance sampling, in Section 4.2. Section 4.3 then explains how the adaptivity of importance sampling can be optimally controlled. In Section 4.4, the off-policy value function approximation techniques are integrated into the framework of least-squares policy iteration for efficient sample reuse. Experimental results are presented in Section 4.5, and Section 4.6 concludes the chapter.
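
For reference, a minimal sketch of the importance sampling correction mentioned above, written in generic notation (the symbols $p$, $q$, $f$, and $N$ are illustrative and do not necessarily match the chapter's notation): an expectation under the target distribution $p$ (e.g., trajectories generated by the target policy) can be estimated from samples drawn from a different sampling distribution $q$ (e.g., trajectories generated by the data-sampling policy) by reweighting each sample with the importance weight $p(x)/q(x)$,
\[
\mathbb{E}_{x \sim p}\bigl[f(x)\bigr]
  = \mathbb{E}_{x \sim q}\!\left[\frac{p(x)}{q(x)}\, f(x)\right]
  \approx \frac{1}{N}\sum_{n=1}^{N} \frac{p(x_n)}{q(x_n)}\, f(x_n),
  \qquad x_n \sim q .
\]
This estimator is unbiased whenever $q(x) > 0$ wherever $p(x) f(x) \neq 0$, but its variance can be large when $p$ and $q$ differ substantially; controlling this bias-variance trade-off is the role of the adaptive importance sampling techniques discussed in this chapter.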