Abstract
A class of discounted Markov decision processes (MDPs) is formed by bringing together individual MDPs sharing the same discount rate. These are in competition in the sense that at each decision epoch a single action is chosen from the union of the action sets of the individual MDPs. Such families of competing MDPs have been used to model a variety of problems in stochastic resource allocation and in the sequential design of experiments. Suppose that S is a stationary strategy for such a family, that S* is an optimal strategy and that R(S), R(S*) denote the respective rewards earned. The paper extends (and explains) existing theory based on the Gittins index to give bounds on R(S*) − R(S) for this important class of processes. The procedures are illustrated by examples taken from the fields of stochastic scheduling and research planning.
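To make the quantities concrete, the following sketch (not taken from the paper) computes Gittins indices for the simplest competing family: bandit arms with deterministic, finite reward sequences under a common discount rate beta. The arm names, reward sequences and helper functions here are illustrative assumptions; for an arm whose remaining rewards are r_0, r_1, …, the index is the maximum over stopping times tau ≥ 1 of (Σ_{t<tau} β^t r_t) / (Σ_{t<tau} β^t), and the index policy (always continue an arm of largest index) is optimal for this class, so R(S*) − R(S) ≥ 0 for any other stationary strategy S.

```python
def gittins_index(rewards, beta):
    """Gittins index of an arm whose remaining rewards are `rewards`."""
    best = float("-inf")
    num = den = 0.0
    for t, r in enumerate(rewards):
        num += beta**t * r       # discounted reward up to stopping time t+1
        den += beta**t           # discounted time up to stopping time t+1
        best = max(best, num / den)
    return best

def total_discounted_reward(order, arms, beta):
    """Discounted reward earned by pulling arms in the fixed sequence `order`."""
    pos = {i: 0 for i in arms}   # next reward position within each arm
    total, disc = 0.0, 1.0
    for i in order:
        total += disc * arms[i][pos[i]]
        pos[i] += 1
        disc *= beta
    return total

def index_policy_order(arms, beta):
    """At each epoch, pull the arm of highest Gittins index (an optimal S*)."""
    pos = {i: 0 for i in arms}
    order = []
    for _ in range(sum(len(r) for r in arms.values())):
        live = [i for i in arms if pos[i] < len(arms[i])]
        i = max(live, key=lambda j: gittins_index(arms[j][pos[j]:], beta))
        order.append(i)
        pos[i] += 1
    return order
```

For instance, with arms {"A": [5, 1], "B": [4, 3]} and beta = 0.5, the index policy pulls A, B, B, A and earns R(S*) = 7.875, while the stationary strategy that exhausts B first earns R(S) = 6.875, a gap R(S*) − R(S) = 1.0 of the kind the paper's bounds control.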
Glazebrook, K.D. Competing Markov decision processes. Ann Oper Res 29, 537–563 (1991). https://doi.org/10.1007/BF02283613