ABSTRACT

This chapter introduces model-free methods for policy evaluation, a.k.a. reinforcement learning (RL) prediction, in which the value function vπ(s) is estimated for a fixed, arbitrary policy π. It also discusses model-free methods for optimal policy search, a.k.a. RL control, in which a policy that maximizes the value function is found iteratively. In RL, the state encodes an “environmental model” and other inputs, including the agent’s own dynamics and the kinematics of moving objects from the previous frames. The chapter explores several classic model-free control algorithms, including Monte Carlo (MC) control, Sarsa, Q-learning, policy gradient, and actor-critic. There are two widely used MC methods: first-visit MC and every-visit MC. The chapter also examines Temporal-Difference (TD) learning, a method that combines the advantages of dynamic programming (DP) and MC, namely the bootstrapping of DP and the model-free sampling of MC.
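
As a minimal, illustrative sketch of the TD idea summarized above (not an example from the chapter), the Python snippet below runs tabular TD(0) prediction on a hypothetical 5-state random walk; the environment, step size, and discount factor are assumptions made purely for illustration.

# A minimal sketch of tabular TD(0) policy evaluation, showing how TD combines
# DP-style bootstrapping with MC-style sampling. The 5-state random-walk
# environment below is a hypothetical example, not one from the chapter.
import random

N_STATES = 5             # non-terminal states 0..4; episodes start in the middle
ALPHA, GAMMA = 0.1, 1.0  # step size and discount factor (assumed values)

def random_walk_episode():
    """Sample one episode: move left/right uniformly at random; reward +1 only
    when exiting on the right, 0 otherwise."""
    s = N_STATES // 2
    while True:
        s_next = s + random.choice([-1, 1])
        if s_next < 0:
            yield s, 0.0, None           # terminated on the left
            return
        if s_next >= N_STATES:
            yield s, 1.0, None           # terminated on the right
            return
        yield s, 0.0, s_next
        s = s_next

V = [0.0] * N_STATES
for _ in range(1000):
    for s, r, s_next in random_walk_episode():
        # TD(0) update: bootstrap from the current estimate of the next state
        target = r + (GAMMA * V[s_next] if s_next is not None else 0.0)
        V[s] += ALPHA * (target - V[s])

print([round(v, 2) for v in V])   # approaches [1/6, 2/6, 3/6, 4/6, 5/6]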