ABSTRACT

Markov chains have been widely studied [1-6]. Many engineering problems can be modelled as finite controlled Markov chains whose transition probabilities depend on the control action. The control actions are generated to achieve a desired goal (control objective), such as the maximization of the expected average reward or the minimization of a loss function. The control problem for Markov chains with known transition probabilities has been extensively studied by several authors [1-2, 7-8] and solved on the basis of dynamic programming and linear programming. Many studies have been devoted to the control of Markov chains whose transition probabilities depend on a constant, unknown parameter taking values in a finite set [9-20], or on a time-varying parameter with a certain period [21]. These studies adopt the self-tuning (certainty equivalence) approach: the unknown parameters are estimated, and the control strategy is designed as if the estimated parameters were the true system parameters [22]. The maximum likelihood estimation procedure has been used by several authors. In [15] the problem of adaptive control of Markov chains is treated as a kind of multi-armed bandit problem. The certainty equivalence control with forcing approach [23] has been used in [15] and [16] to derive adaptive control strategies for finite Markov chains. In this approach, at a priori specified instants, the system is forced (the forcing or experimenting phase) to apply other control actions in order to escape false identification traps. The forcing phase is similar to the introduction of extra perturbations in adaptive systems to obtain good excitation (persistent excitation, a uniform identifiability condition).
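The certainty equivalence scheme with forcing described above can be illustrated by a minimal sketch. The example below is a hypothetical two-state, two-action controlled chain (the transition probabilities, reward values, horizon, and forcing period are all illustrative assumptions, not taken from the cited works): the controller maintains maximum-likelihood estimates of the unknown transition probabilities, acts greedily on those estimates (certainty equivalence), and at pre-specified instants applies a randomly chosen action instead (the forcing phase), which keeps every action persistently excited and prevents the estimates from locking onto a false model.

```python
import random

# Hypothetical 2-state, 2-action chain. TRUE_P[a] is the (unknown to the
# controller) probability of moving to state 1 under action a.
TRUE_P = {0: 0.2, 1: 0.8}
REWARD = [0.0, 1.0]          # reward collected in state 0 / state 1
FORCING_PERIOD = 50          # a priori specified forcing instants

random.seed(0)

# counts[a] = [number of trials, number of transitions into state 1];
# initialized at [1, 1] so the ML estimate is defined from the start.
counts = {a: [1, 1] for a in TRUE_P}

def estimate(a):
    """Maximum-likelihood estimate of P(next state = 1 | action a)."""
    trials, ones = counts[a]
    return ones / trials

total_reward = 0.0
for t in range(1, 2001):
    if t % FORCING_PERIOD == 0:
        # Forcing (experimenting) phase: try an arbitrary action.
        action = random.choice(list(TRUE_P))
    else:
        # Certainty equivalence: act as if the estimates were exact.
        action = max(TRUE_P, key=estimate)
    next_state = 1 if random.random() < TRUE_P[action] else 0
    counts[action][0] += 1
    counts[action][1] += next_state
    total_reward += REWARD[next_state]
```

Without the forcing instants, an unlucky early sample for the better action could freeze the greedy rule on the inferior one forever; the periodic forced actions guarantee that both estimates keep improving, so the estimated ranking of the actions eventually matches the true one.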