ABSTRACT

This chapter introduces reinforcement learning (RL) from a Markov decision process (MDP) viewpoint. It begins with the MDP-based problem formulation of RL and describes its fundamental elements, such as the environment, reward, actions, policy, and value, followed by a description of foundational algorithms for solving RL problems and perspectives on future work. Q-learning is one of the most popular RL methods, in which the optimal action-value function is approximated directly. Compared to value-based methods, policy gradient methods can handle larger and continuous action spaces because they directly approximate the policy function, and they can find an optimal policy even when the value function is not well defined. Actor-critic methods combine the advantages of value-based and policy-based methods: they handle continuous action spaces while reducing the variance of the gradient estimator.
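
To make the value-based approach mentioned above concrete, the following is a minimal sketch of tabular Q-learning on a small hand-built chain MDP. The environment, reward values, and hyperparameters are illustrative assumptions, not part of the chapter; the sketch only shows the core update toward r + gamma * max_a' Q(s', a').

```python
# Minimal tabular Q-learning sketch on a toy chain MDP (illustrative only;
# the environment, rewards, and hyperparameters below are assumptions).
import random

N_STATES = 5          # states 0..4; state 4 is terminal (the goal)
ACTIONS = [0, 1]      # 0 = move left, 1 = move right
ALPHA, GAMMA, EPS = 0.1, 0.9, 0.1

def step(state, action):
    """Toy dynamics: moving right eventually reaches the goal, which pays reward 1."""
    next_state = min(state + 1, N_STATES - 1) if action == 1 else max(state - 1, 0)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    done = next_state == N_STATES - 1
    return next_state, reward, done

# Q-table: a direct approximation of the optimal action-value function Q*(s, a).
Q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]

for episode in range(500):
    s, done = 0, False
    while not done:
        # Epsilon-greedy behavior policy: explore with probability EPS, else act greedily.
        a = random.choice(ACTIONS) if random.random() < EPS else max(ACTIONS, key=lambda x: Q[s][x])
        s_next, r, done = step(s, a)
        # Q-learning (off-policy TD) update toward r + gamma * max_a' Q(s', a').
        target = r if done else r + GAMMA * max(Q[s_next])
        Q[s][a] += ALPHA * (target - Q[s][a])
        s = s_next

print("Greedy action per state:", [max(ACTIONS, key=lambda x: Q[s][x]) for s in range(N_STATES)])
```

Running the sketch, the greedy policy recovered from the learned Q-table moves right in every non-terminal state, which is the optimal policy for this toy chain.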