ABSTRACT

This chapter introduces the reinforcement learning (RL) problem and its framework. It deals with two important simplified versions of the RL problem: Bandits and Contextual Bandits. The transition from Bandits to Contextual Bandits, and from Contextual Bandits to RL, should seem natural and straightforward. Before turning to the full reinforcement learning problem, the chapter explains some common elements and themes that emerge from the multi-armed bandit (MAB) problem and that will also be common and critical to the RL problem. It presents one algorithm that follows directly from the upper confidence bound algorithm for MAB. The ε-greedy algorithm is a straightforward algorithm that nicely illustrates the concept of balancing exploration and exploitation. Under ε-greedy, the agent chooses to explore with probability ε and to exploit with probability 1 − ε. The softmax algorithm follows a similar structure to the ε-greedy algorithm; in fact, the only difference is the construction of the categorical action-generating distribution.
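
To make the two action-selection strategies concrete, the sketch below shows a minimal ε-greedy and softmax selection step for a multi-armed bandit. The function names, the example value estimates, and the hyperparameters (epsilon, temperature) are illustrative assumptions, not code from the chapter itself.

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon=0.1):
    # Explore: with probability epsilon, pick an arm uniformly at random.
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    # Exploit: otherwise pick the arm with the highest estimated value.
    return int(np.argmax(q_values))

def softmax_action(q_values, temperature=1.0):
    # Build a categorical distribution over arms from the value estimates;
    # lower temperatures concentrate probability on the greedy arm.
    prefs = np.asarray(q_values, dtype=float) / temperature
    prefs -= prefs.max()  # shift for numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return int(rng.choice(len(q_values), p=probs))

# Hypothetical estimated action values for a 3-armed bandit.
q_estimates = [0.2, 0.5, 0.1]
print(epsilon_greedy(q_estimates, epsilon=0.1))
print(softmax_action(q_estimates, temperature=0.5))
```

As the sketch suggests, the only structural difference between the two strategies is how the sampling distribution over arms is formed: ε-greedy mixes a uniform draw with a greedy choice, while softmax samples from a categorical distribution built directly from the value estimates.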