ABSTRACT

The framework of least-squares policy iteration (LSPI) introduced in Chapter 2 is attractive thanks to its computational efficiency and analytical tractability. However, due to the squared loss, it tends to be sensitive to outliers in observed rewards. In this chapter, we introduce an alternative policy iteration method that employs the absolute loss to enhance robustness and reliability. In Section 6.1, the robustness and reliability brought by the use of the absolute loss are discussed. In Section 6.2, the policy iteration framework based on the absolute loss, called least-absolute policy iteration (LAPI), is introduced. In Section 6.3, the usefulness of LAPI is illustrated through experiments. Variations of LAPI are considered in Section 6.4, and finally the chapter is concluded in Section 6.5.
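As a schematic illustration of the contrast (a minimal sketch; the notation $\widehat{Q}_{\theta}$ for the approximated value function and $\{(s_i, a_i, r_i)\}_{i=1}^{n}$ for the observed samples is generic, not necessarily the chapter's exact formulation), the squared loss used in LSPI penalizes a residual quadratically, so a single outlying reward can dominate the fit, whereas the absolute loss used in LAPI grows only linearly in the residual:
\[
J_{\mathrm{LS}}(\theta) = \sum_{i=1}^{n} \bigl( r_i - \widehat{Q}_{\theta}(s_i, a_i) \bigr)^2,
\qquad
J_{\mathrm{LA}}(\theta) = \sum_{i=1}^{n} \bigl| r_i - \widehat{Q}_{\theta}(s_i, a_i) \bigr|.
\]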