ABSTRACT

A Markov decision process (MDP) is a controlled Markov chain (Section 5.2), the objective of the control being to minimize a sum or average of (possibly discounted) costs associated with each stage (the term used for all aspects of a single time index). The set of admissible controls is precisely given, so we have a well-defined optimization problem. Formally, the MDP itself is a selection, through the control process, from a class of Markov chains, and so may not itself be a Markov chain; whether it is depends on whether the control itself is Markovian, in the sense that it depends only on the system’s current state. It can be shown that under general conditions the optimal control can be chosen to be Markovian, assuming perfect system identification. But even when this holds, there may be formidable computational challenges to overcome, so there has been considerable development of approximation methods for MDPs; see, for example, Hernández-Lerma (1989b), Bertsekas and Tsitsiklis (1996), Chang et al. (2007), Si et al. (2004), Buşoniu et al. (2010), and Powell (2011).
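As a minimal sketch of the discounted-cost case described above (the notation $x_t$, $a_t$, $c$, $\beta$, and $\Pi$ is assumed here for illustration, not fixed by this abstract): with state process $(x_t)$, control process $(a_t)$, stage cost $c$, and discount factor $\beta \in (0,1)$, the controller seeks an admissible policy attaining

$$
V^{\pi}(x) \;=\; \mathbb{E}^{\pi}_{x}\!\left[\sum_{t=0}^{\infty} \beta^{t}\, c(x_t, a_t)\right],
\qquad
V^{*}(x) \;=\; \inf_{\pi \in \Pi} V^{\pi}(x),
$$

where $\Pi$ is the precisely given set of admissible controls. The Markovian-optimality result mentioned above then says that, under general conditions, the infimum can be attained by a policy that selects each action as a function of the current state $x_t$ alone.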