Double-Factored Decision Theory for Markov Decision Processes with Multiple Scenarios of the Parameters

Published in: Journal of the Operations Research Society of China

Abstract

This article proposes a double-factored decision theory for Markov decision processes with multiple scenarios of the parameters. We introduce the scenario belief to describe the probability distribution of scenarios in the system, and the scenario expectation to formulate the expected total discounted reward of a policy. We establish a new framework, named the double-factored Markov decision process (DFMDP), in which the physical state and the scenario belief are shown to be the double factors serving as sufficient statistics for the history of the decision process. Four classes of policies for finite horizon DFMDPs are studied, and it is shown that there exists a double-factored Markovian deterministic policy that is optimal among all policies. We also formulate infinite horizon DFMDPs and present their optimality equation. An exact solution method, named double-factored backward induction, is proposed for the finite horizon DFMDPs. It is used to find optimal policies for the numeric examples, which are then compared with policies derived from other methods in the related literature.

Fig. 1
Fig. 2

References

  1. Puterman, M.L.: Markov Decision Processes: Discrete Stochastic Dynamic Programming, 2nd edn. Wiley, New Jersey (2005)

  2. Beemsterboer, B.J., Land, M.J., Teunter, R.H.: Flexible lot sizing in hybrid make-to-order/make-to-stock production planning. Eur. J. Oper. Res. 260(3), 1014–1023 (2017)

  3. Chen, N., Kou, S., Wang, C.: A partitioning algorithm for Markov decision processes with applications to market microstructure. Manag. Sci. 64(2), 784–803 (2018)

  4. Steimle, L.N., Kaufman, D.L., Denton, B.T.: Multi-model Markov decision processes. IISE Trans. 53(10), 1124–1139 (2022)

  5. Buchholz, P., Scheftelowitsch, D.: Computation of weighted sums of rewards for concurrent MDPs. Math. Methods Oper. Res. 89(1), 1–42 (2019)

  6. Givan, R., Leach, S., Dean, T.: Bounded-parameter Markov decision processes. Artif. Intell. 122(1–2), 71–109 (2000)

  7. Delgado, K.V., Sanner, S., de Barros, L.N.: Efficient solutions to factored MDPs with imprecise transition probabilities. Artif. Intell. 175(9–10), 1498–1527 (2011)

  8. Witwicki, S.J., Melo, F.S., Capitan, J., Spaan, M.T.J.: A flexible approach to modeling unpredictable events in MDPs. In: Proceedings of the Twenty-Third International Conference on Automated Planning and Scheduling (ICAPS 2013), pp. 260–268 (2013)

  9. Duff, M.: Optimal learning: computational procedures for Bayes-adaptive Markov decision processes. Ph.D. thesis, University of Massachusetts Amherst, Amherst, MA (2002)

  10. Castro, P.S., Precup, D.: Using linear programming for Bayesian exploration in Markov decision processes. In: International Joint Conference on Artificial Intelligence (IJCAI 2007), pp. 2437–2442 (2007)

  11. Kumar, P.: Information theoretic learning methods for Markov decision processes with parametric uncertainty. Ph.D. thesis, University of Washington (2018)

  12. Iyengar, G.: Robust dynamic programming. Math. Oper. Res. 30(2), 257–280 (2005)

  13. Nilim, A., El Ghaoui, L.: Robust control of Markov decision processes with uncertain transition matrices. Oper. Res. 53(5), 780–798 (2005)

  14. Delgado, K.V., de Barros, L.N., Dias, D.B., Sanner, S.: Real-time dynamic programming for Markov decision processes with imprecise probabilities. Artif. Intell. 230(8), 192–223 (2016)

  15. Moreira, D.A.M., Delgado, K.V., de Barros, L.N.: Robust probabilistic planning with ILAO*. Appl. Intell. 45(3), 662–672 (2016)

  16. Delage, E., Mannor, S.: Percentile optimization for Markov decision processes with parameter uncertainty. Oper. Res. 58(1), 203–213 (2010)

  17. Adulyasak, Y., Varakantham, P., Ahmed, A., Jaillet, P.: Solving uncertain MDPs with objectives that are separable over instantiations of model uncertainty. In: Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, Austin, Texas, AAAI Press, pp. 3454–3460 (2015)

  18. Ahmed, A., Varakantham, P., Lowalekar, M., Adulyasak, Y., Jaillet, P.: Sampling based approaches for minimizing regret in uncertain Markov decision processes (MDPs). J. Artif. Intell. Res. 59, 229–264 (2017)

  19. Meraklı, M., Küçükyavuz, S.: Risk aversion to parameter uncertainty in Markov decision processes with an application to slow-onset disaster relief. IISE Trans. 52(8), 811–831 (2019)

  20. Shani, G., Heckerman, D., Brafman, R.: An MDP-based recommender system. J. Mach. Learn. Res. 6(43), 1265–1295 (2005)

  21. Chen, Q., Ayer, T., Chhatwal, J.: Sensitivity analysis in sequential decision models: a probabilistic approach. Med. Decis. Making 37(2), 243–252 (2017)

  22. Bala, M.V., Mauskopf, J.A.: Optimal assignment of treatments to health states using a Markov decision model. Pharmacoeconomics 24(4), 345–354 (2006)


Acknowledgements

The author is grateful to the editors and reviewers for their dedicated input and support during the review process. The author would like to thank his advisor, Prof. Theodore Allen of Systems Engineering at The Ohio State University, for his continuous support and guidance. The author also expresses his sincere appreciation to Prof. Nick Hall of the Fisher College of Business at The Ohio State University for his valuable suggestions on this work.

Author information


Contributions

C.-J. Hou contributed to all elements of this paper, including framework conceptualization, algorithm design, mathematical proofs, and numerical validation.

Corresponding author

Correspondence to Cheng-Jun Hou.

Ethics declarations

Competing Interests

The author declares that there are no competing interests related to the research work and the writing of this paper.

Additional information

This research was originally supported by the (United States) National Science Foundation (No. 1409214).

Appendix

1.1 Proof of Theorem 1

For any given \({h}_{0}=\langle {s}_{0},{{\varvec{b}}}_{{s}_{0}}\rangle \in S\times {\Delta }^{C}\) and a fixed π ∊ ΠDHD, the history sets \({H}_{t}^{\pi }\) for t = 0,1, ···, Z are determined by step 1 of the policy evaluation algorithm. Equation (22) shows that when the history up to time t is \(h_{t} \in H_{t}^{\pi }\), the expected value of policy π over decision epochs t, t + 1, ···, Z equals the reward \({\overline{r} }_{{s}_{t}}^{{x}_{t}({h}_{t})}\) received by selecting action xt(ht) plus the expected total discounted reward over the remaining periods. The second term contains the product of \({\overline{p} }_{{s}_{t},{s}_{t+1}}^{{x}_{t}({h}_{t})}\), the probability that the state transitions from st to st+1 when action xt(ht) is taken at decision epoch t, and \({U}_{t+1}^{\pi }\), the expected total discounted reward obtained by applying π at decision epochs t + 1, ···, Z when the history up to time t + 1 is \({h}_{t+1}=\left({h}_{t},{x}_{t}({h}_{t}),\langle {s}_{t+1},{{\varvec{b}}}_{{s}_{t+1}}\rangle \right)\). Summing over all possible states st+1 gives the desired expectation expressed in terms of \({U}_{t+1}^{\pi }\). Hence Eq. (22) can be written as

$${U}_{t}^{\pi }\left({h}_{t}\right)={\overline{r} }_{{s}_{t}}^{{x}_{t}\left({h}_{t}\right)}+\gamma {E}_{{h}_{t}}^{\pi }\left\{{U}_{t+1}^{\pi }\left({h}_{t},{x}_{t}\left({h}_{t}\right),\langle {s}_{t+1},{{\varvec{b}}}_{{s}_{t+1}}\rangle \right)\right\}.$$
(A1)

The rest of the proof is by backward induction where the index of induction is t. It is obvious that (21) holds when t = Z. Suppose now that (21) holds for t + 1, ···, Z. Then by using (A1) and the induction hypothesis, we have

$$\begin{aligned} & U_{t}^{\pi } \left( {h_{t} } \right) = \overline{r}_{{s_{t} }}^{{x_{t} \left( {h_{t} } \right)}} + \gamma E_{{h_{t} }}^{\pi } \left\{ {E_{{h_{t + 1} }}^{\pi } \left[ {\mathop \sum \limits_{n = t + 1}^{Z - 1} \gamma^{{n - \left( {t + 1} \right)}} \overline{r}_{{s_{n} }}^{{x_{n} \left( {h_{n} } \right)}} + \gamma^{{Z - \left( {t + 1} \right)}} \overline{r}_{{s_{Z} }}^{0} } \right]} \right\} \\ & \quad = \overline{r}_{{s_{t} }}^{{x_{t} \left( {h_{t} } \right)}} + \gamma E_{{h_{t} }}^{\pi } \left\{ {\mathop \sum \limits_{n = t + 1}^{Z - 1} \gamma^{{n - \left( {t + 1} \right)}} \overline{r}_{{s_{n} }}^{{x_{n} \left( {h_{n} } \right)}} + \gamma^{{Z - \left( {t + 1} \right)}} \overline{r}_{{s_{Z} }}^{0} } \right\} = E_{{h_{t} }}^{\pi } \left\{ {\mathop \sum \limits_{n = t}^{Z - 1} \gamma^{n - t} \overline{r}_{{s_{n} }}^{{x_{n} \left( {h_{n} } \right)}} + \gamma^{Z - t} \overline{r}_{{s_{Z} }}^{0} } \right\}. \\ \end{aligned}$$

Thus (21) holds for t, which completes the induction. In particular, taking t = 0 gives \({V}^{\pi }\left({s}_{0},{{\varvec{b}}}_{{s}_{0}}\right)={U}_{0}^{\pi }\left({h}_{0}\right)\).
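As an illustrative aside only (not the author's implementation), the recursion (A1), together with the belief-averaged parameters in (14) and the belief update τ, can be sketched in code. The layout `p[k][s][a][s2]`, `r[k][s][a]`, `r_term[k][s]`, and the toy numbers are all hypothetical stand-ins for the paper's notation.

```python
# Hedged sketch of the policy-evaluation recursion (A1): U_t is computed from
# U_{t+1} using belief-averaged reward/transition (14) and the update tau.

def evaluate_policy(policy, s0, b0, Z, gamma, p, r, r_term):
    n_scen, n_states = len(b0), len(p[0])

    def U(t, s, b):
        if t == Z:  # boundary condition: belief-averaged terminal reward
            return sum(b[k] * r_term[k][s] for k in range(n_scen))
        a = policy(t, s, b)
        val = sum(b[k] * r[k][s][a] for k in range(n_scen))  # rbar
        for s2 in range(n_states):
            pbar = sum(b[k] * p[k][s][a][s2] for k in range(n_scen))
            if pbar > 0:
                # belief update tau: posterior over scenarios given (s, a, s2)
                b2 = tuple(b[k] * p[k][s][a][s2] / pbar for k in range(n_scen))
                val += gamma * pbar * U(t + 1, s2, b2)
        return val

    return U(0, s0, tuple(b0))

# Tiny two-state, two-scenario example: scenario 0 stays put, scenario 1 moves.
p = [[[[1.0, 0.0]], [[0.0, 1.0]]],
     [[[0.0, 1.0]], [[1.0, 0.0]]]]
r = [[[1.0], [0.0]], [[1.0], [0.0]]]   # reward 1 in state 0
r_term = [[2.0, 0.0], [2.0, 0.0]]      # terminal reward 2 in state 0
v = evaluate_policy(lambda t, s, b: 0, 0, (0.5, 0.5), 1, 1.0, p, r, r_term)
# v == 2.0: immediate reward 1 plus a 0.5 chance of terminal reward 2
```

The nested function mirrors the proof's structure: the boundary condition at t = Z, then one application of (A1) per epoch.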

1.2 Proof of Theorem 3

The proof is in two parts. First, we establish by induction that \({U}_{t}\left({h}_{t}\right)\geqslant {U}_{t}^{*}\left({h}_{t}\right)\) for all ht\(\in\)Ht and t = 0,1, ···, Z. Clearly, \({U}_{Z}\left({h}_{Z}\right)={\overline{r} }_{{s}_{Z}}^{0}={U}_{Z}^{\pi }\left({h}_{Z}\right)\) for all hZ\(\in\)HZ and π\(\in\) ΠDHR; therefore \({U}_{Z}\left({h}_{Z}\right)={U}_{Z}^{*}\left({h}_{Z}\right)\) for all hZ\(\in\)HZ. Now assume that \({U}_{t}\left({h}_{t}\right)\geqslant {U}_{t}^{*}\left({h}_{t}\right)\) for all \(h_{t} \in H_{t}\) for t = n + 1, ···, Z. Let \(\pi^{\prime } = \left( {x_{0}^{\prime } ,x_{1}^{\prime } ,\cdots,x_{Z - 1}^{\prime } } \right)\) be an arbitrary policy in ΠDHR. Then for t = n, we have

$$U_{n} \left( {h_{n} } \right) = \mathop {{\text{max}}}\limits_{{a_{n} \in A}} \left\{ {\overline{r}_{{s_{n} }}^{{a_{n} }} + \gamma \mathop \sum \limits_{{s_{n + 1} \in S}} \overline{p}_{{s_{n} ,s_{n + 1} }}^{{a_{n} }} U_{n + 1} \left( {h_{n} ,a_{n} ,\langle s_{n + 1} ,{\varvec{b}}_{{s_{n + 1} }} \rangle } \right)} \right\}$$
$$\geqslant \mathop {{\text{max}}}\limits_{{a_{n} \in A}} \left\{ {\overline{r}_{{s_{n} }}^{{a_{n} }} + \gamma \mathop \sum \limits_{{s_{n + 1} \in S}} \overline{p}_{{s_{n} ,s_{n + 1} }}^{{a_{n} }} U_{n + 1}^{*} \left( {h_{n} ,a_{n} ,\langle s_{n + 1} ,{\varvec{b}}_{{s_{n + 1} }} \rangle } \right)} \right\}$$
(A2)
$$\geqslant \mathop {{\text{max}}}\limits_{{a_{n} \in A}} \left\{ {\overline{r}_{{s_{n} }}^{{a_{n} }} + \gamma \mathop \sum \limits_{{s_{n + 1} \in S}} \overline{p}_{{s_{n} ,s_{n + 1} }}^{{a_{n} }} U_{n + 1}^{{\pi {^{\prime}}}} \left( {h_{n} ,a_{n} ,\langle s_{n + 1} ,{\varvec{b}}_{{s_{n + 1} }} \rangle } \right)} \right\}$$
(A3)
$$\geqslant \mathop \sum \limits_{{a_{n} \in A}} \mu_{{h_{n} }}^{^{\prime}} \left( {a_{n} } \right)\left\{ {\overline{r}_{{s_{n} }}^{{a_{n} }} + \gamma \mathop \sum \limits_{{s_{n + 1} \in S}} \overline{p}_{{s_{n} ,s_{n + 1} }}^{{a_{n} }} U_{n + 1}^{{\pi {^{\prime}}}} \left( {h_{n} ,a_{n} ,\langle s_{n + 1} ,{\varvec{b}}_{{s_{n + 1} }} \rangle } \right)} \right\}$$
(A4)
$$= U_{n}^{{\pi {^{\prime}}}} \left( {h_{n} } \right).$$

Line (A2) holds because of the induction hypothesis and non-negativity of \(\overline{{\varvec{p}} }\). Line (A3) holds because of the definition of \({U}_{n+1}^{*}\). Line (A4) follows from Lemma 1. The last equality follows from (23) and Theorem 2.

Because \({\pi}'\) is arbitrary, we have

$${U}_{n}\left({h}_{n}\right)\geqslant {U}_{n}^{\pi }\left({h}_{n}\right)$$

for all π\(\in\) ΠDHR. Thus \({U}_{t}\left({h}_{t}\right)\geqslant {U}_{t}^{*}\left({h}_{t}\right)\) for t = n, and the induction step is complete. This finishes the first part.

For the second part, we establish that for any ε > 0, there always exists a  \({\pi}' {\in}\) ΠDHD so that

$${U}_{t}^{\pi \mathrm{^{\prime}}}\left({h}_{t}\right)+\left(Z-t\right)\varepsilon \geqslant {U}_{t}\left({h}_{t}\right)$$
(A5)

for all ht\(\in\)Ht and t = 0,1, ··· , Z. Since \(U_{Z}^{{\pi^{\prime } }} \left( {h_{Z} } \right) = U_{Z} \left( {h_{Z} } \right) = \overline{r}_{{s_{Z} }}^{0}\), the claim holds for t = Z. Assuming that \(U_{t}^{{\pi^{\prime } }} \left( {h_{t} } \right) + \left( {Z - t} \right)\varepsilon \geqslant U_{t} \left( {h_{t} } \right)\) for t = n + 1, ···, Z, we have

$$\begin{aligned} U_{n}^{\pi ^{\prime}} \left( {h_{n} } \right) & = \overline{r}_{{s_{n} }}^{{x_{n} \left( {h_{n} } \right)}} + \gamma \mathop \sum \limits_{{s_{n + 1} \in S}} \overline{p}_{{s_{n} ,s_{n + 1} }}^{{x_{n} \left( {h_{n} } \right)}} U_{n + 1}^{\pi ^{\prime}} \left( {h_{n} ,x_{n} \left( {h_{n} } \right),\langle s_{n + 1} ,{\varvec{b}}_{{s_{n + 1} }} \rangle } \right) \\ & \geqslant \overline{r}_{{s_{n} }}^{{x_{n} \left( {h_{n} } \right)}} + \gamma \mathop \sum \limits_{{s_{n + 1} \in S}} \overline{p}_{{s_{n} ,s_{n + 1} }}^{{x_{n} \left( {h_{n} } \right)}} U_{n + 1} \left( {h_{n} ,x_{n} \left( {h_{n} } \right),\langle s_{n + 1} ,{\varvec{b}}_{{s_{n + 1} }} \rangle } \right) - \left( {Z - n - 1} \right)\varepsilon \\ & \geqslant U_{n} \left( {h_{n} } \right) - \left( {Z - n} \right)\varepsilon .\\ \end{aligned}$$
(A6)

The second line in (A6) holds by the induction hypothesis, and the last inequality holds because the action \(x_{n}(h_{n})\) of π′ can be chosen so that its objective value is within ε of the maximum defining \(U_{n}(h_{n})\). Thus the inductive step is satisfied and (A5) holds for t = 0,1, ···, Z.

Thus for any ε > 0, there exists a \({\pi}'\)\(\in\) ΠDHR for which

$$U_{t}^{*} \left( {h_{t} } \right) + \left( {Z - t} \right)\varepsilon \geqslant U_{t}^{{\pi^{\prime } }} \left( {h_{t} } \right) + \left( {Z - t} \right)\varepsilon \geqslant U_{t} \left( {h_{t} } \right) \geqslant U_{t}^{*} \left( {h_{t} } \right),$$

then part (i) in the theorem holds. Part (ii) in the theorem holds because

$${U}_{0}\left({h}_{0}\right)={U}_{0}^{*}\left({h}_{0}\right)=\underset{\pi \in {\Pi }^{\text{DHR}}}{\text{max}}{U}_{0}^{\pi }\left({h}_{0}\right)=\underset{\pi \in {\Pi }^{\text{DHR}}}{\text{max}}{V}^{\pi }\left({s}_{0},{{\varvec{b}}}_{{s}_{0}}\right)={V}^{*}\left({s}_{0},{{\varvec{b}}}_{{s}_{0}}\right).$$

1.2.1 Proof of Theorem 4

For t = Z, clearly \({U}_{Z}^{{\pi }^{*}}\left({h}_{Z}\right)={U}_{Z}^{*}\left({h}_{Z}\right)\) for all hZ\(\in\)HZ. Assume the statement in part (i) of the theorem is true for t = n + 1, ···, Z. Then, for t = n and \({h}_{n}=\left({h}_{n-1},{x}_{n-1}^{*}({h}_{n-1}),\langle {s}_{n},{{\varvec{b}}}_{{s}_{n}}\rangle \right)\),

$$\begin{aligned} U_{n}^{*} \left( {h_{n} } \right) & = \mathop {\max }\limits_{{a_{n} \in A}} \left\{ {\overline{r}_{{s_{n} }}^{{a_{n} }} + \gamma \mathop \sum \limits_{{s_{n + 1} \in S}} \overline{p}_{{s_{n} ,s_{n + 1} }}^{{a_{n} }} U_{n + 1}^{*} \left( {h_{n} ,a_{n} ,\langle s_{n + 1} ,{\varvec{b}}_{{s_{n + 1} }} \rangle } \right)} \right\} \\ & = \overline{r}_{{s_{n} }}^{{x_{n}^{*} \left( {h_{n} } \right)}} + \gamma \mathop \sum \limits_{{s_{n + 1} \in S}} \overline{p}_{{s_{n} ,s_{n + 1} }}^{{x_{n}^{*} \left( {h_{n} } \right)}} U_{n + 1}^{{\pi^{*} }} \left( {h_{n} ,x_{n}^{*} \left( {h_{n} } \right),\langle s_{n + 1} ,{\varvec{b}}_{{s_{n + 1} }} \rangle } \right) = U_{n}^{{\pi^{*} }} \left( {h_{n} } \right). \\ \end{aligned}$$

Thus the induction step holds and part (i) is true. Part (ii) of the theorem follows from Theorem 1 and Theorem 3-(ii).

1.3 Proof of Theorem 6

We show that (i) holds by induction. Since \({U}_{Z}^{*}\left({h}_{Z}\right)={U}_{Z}^{*}\left({h}_{Z-1},{a}_{Z-1},\langle {s}_{Z},{{\varvec{b}}}_{{s}_{Z}}\rangle \right)={\overline{r} }_{{s}_{Z}}^{0}\) for all hZ–1\(\in\)HZ–1 and aZ–1\(\in\)A, \({U}_{Z}^{*}\left({h}_{Z}\right)={U}_{Z}^{*}\left({s}_{Z},{{\varvec{b}}}_{{s}_{Z}}\right)\). Assume now that (i) is true for t = n + 1, ···, Z. Then let t = n. For any \({h}_{n}=\left({h}_{n-1},{a}_{n-1},\langle {s}_{n},{{\varvec{b}}}_{{s}_{n}}\rangle \right)\in {H}_{n}\), it follows from (24), the induction hypothesis and (14) that

$$\begin{aligned} U_{n}^{*} \left( {h_{n} } \right) & = \mathop {\max }\limits_{{a_{n} \in A}} \left\{ {\overline{r}_{{s_{n} }}^{{a_{n} }} + \gamma \mathop \sum \limits_{{s_{n + 1} \in S}} \overline{p}_{{s_{n} ,s_{n + 1} }}^{{a_{n} }} U_{n + 1}^{*} \left( {h_{n} ,a_{n} ,\langle s_{n + 1} ,{\varvec{b}}_{{s_{n + 1} }} \rangle } \right)} \right\} \\ & = \mathop {\max }\limits_{{a_{n} \in A}} \left\{ {\overline{r}_{{s_{n} }}^{{a_{n} }} + \gamma \mathop \sum \limits_{{s_{n + 1} \in S}} \overline{p}_{{s_{n} ,s_{n + 1} }}^{{a_{n} }} U_{n + 1}^{*} \left( {s_{n + 1} ,{\varvec{b}}_{{s_{n + 1} }} } \right)} \right\} \\ & = \mathop {\max }\limits_{{a_{n} \in A}} \left\{ {\mathop \sum \limits_{k \in C} b_{{s_{n} ,k}} r_{{s_{n} ,k}}^{{a_{n} }} + \gamma \mathop \sum \limits_{{s_{n + 1} \in S}} \mathop \sum \limits_{k \in C} b_{{s_{n} ,k}} p_{{s_{n} ,s_{n + 1} ,k}}^{{a_{n} }} U_{n + 1}^{*} \left( {s_{n + 1} ,\tau \left( {\langle s_{n} ,{\varvec{b}}_{{s_{n} }} \rangle ,a_{n} ,s_{n + 1} } \right)} \right)} \right\}. \\ \end{aligned}$$
(A7)

Because the quantities within brackets in the last line of (A7) depend on hn only through \(\langle {s}_{n},{{\varvec{b}}}_{{s}_{n}}\rangle\), we have \({U}_{n}^{*}\left({h}_{n}\right)={U}_{n}^{*}\left({s}_{n},{{\varvec{b}}}_{{s}_{n}}\right)\). Thus part (i) of the theorem is true for t = 0,1, ···, Z.

When S and A are finite, there exists a policy \(\pi^{*} = \left( {x_{0}^{*} ,x_{1}^{*},\,\dots\,,x_{Z - 1}^{*} } \right) \in {\Pi }^{{{\text{DMD}}}}\) derived from

$${x}_{t}^{*}\left({s}_{t},{{\varvec{b}}}_{{s}_{t}}\right)\in \underset{{a}_{t}\in A}{\text{argmax}}\left\{\sum \limits_{k \in C}{b}_{{s}_{t},k}{r}_{{s}_{t},k}^{{a}_{t}}+\gamma \sum \limits_{{s_{t + 1} \in S}}\sum \limits_{k \in C} {b}_{{s}_{t},k}{p}_{{s}_{t},{s}_{t+1},k}^{{a}_{t}}{U}_{t+1}^{*}\left({s}_{t+1},\tau \left(\langle {s}_{t},{{\varvec{b}}}_{{s}_{t}}\rangle ,{a}_{t},{s}_{t+1}\right)\right)\right\}.$$

Therefore, by part (i) in the theorem and Theorem 4-(ii), \({\pi }^{*}\) is optimal.
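The recursion underlying this argmax is the double-factored backward induction method. The following minimal sketch evaluates \(U_t^*\) and returns a maximizing action; the parameter layout (`p[k][s][a][s2]`, `r[k][s][a]`, `r_term[k][s]`) and all numbers are illustrative assumptions, and memoization over \(\langle s, \varvec{b}_s\rangle\) is omitted for clarity.

```python
# Hedged sketch of double-factored backward induction (the optimality
# recursion of Theorem 6), evaluated on the fly from the initial pair.

def backward_induction(s0, b0, Z, gamma, p, r, r_term):
    n_scen, n_states, n_actions = len(b0), len(p[0]), len(r[0][0])

    def U_star(t, s, b):
        if t == Z:
            return sum(b[k] * r_term[k][s] for k in range(n_scen)), None
        best_val, best_a = float("-inf"), None
        for a in range(n_actions):
            val = sum(b[k] * r[k][s][a] for k in range(n_scen))
            for s2 in range(n_states):
                pbar = sum(b[k] * p[k][s][a][s2] for k in range(n_scen))
                if pbar > 0:
                    b2 = tuple(b[k] * p[k][s][a][s2] / pbar
                               for k in range(n_scen))
                    val += gamma * pbar * U_star(t + 1, s2, b2)[0]
            if val > best_val:
                best_val, best_a = val, a
        return best_val, best_a

    return U_star(0, s0, tuple(b0))

# One-state, two-action, two-scenario toy: action 1 is better on average.
p = [[[[1.0], [1.0]]], [[[1.0], [1.0]]]]
r = [[[1.0, 0.0]], [[0.0, 2.0]]]     # r[k][s][a]
r_term = [[0.0], [0.0]]
val, act = backward_induction(0, (0.5, 0.5), 1, 1.0, p, r, r_term)
# val == 1.0 and act == 1: the belief-averaged reward of action 1 dominates
```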

1.4 Proof of Theorem 7

We will show that \({W}^{\pi }\left({s}_{0},{{\varvec{b}}}_{{s}_{0}}\right)\) in (4) is equal to \({V}^{\pi }({s}_{0},{{\varvec{b}}}_{{s}_{0}})\) in (16). First, we derive an alternative expression for the scenario belief \({{\varvec{b}}}_{{s}_{t}}\) in state st at decision epoch t. For any \(h_{0} = \langle s_{0} ,{\varvec{b}}_{{s_{0} }} \rangle \in S \times {\Delta }^{C}\), applying the history recursion \({h}_{t}=({h}_{t-1},{a}_{t-1},\langle {s}_{t},\tau (\langle {s}_{t-1},{{\varvec{b}}}_{{s}_{t-1}}\rangle ,{a}_{t-1},{s}_{t})\rangle )\) repeatedly, the belief \({{\varvec{b}}}_{{s}_{t}}\) within \(\langle {s}_{t},{{\varvec{b}}}_{{s}_{t}}\rangle\) in ht\(\in\)Ht can be expressed as

$$b_{{s_{t} ,k}} = \frac{{b_{{s_{0} ,k}} p_{{s_{0} ,s_{1} ,k}}^{{a_{0} }} p_{{s_{1} ,s_{2} ,k}}^{{a_{1} }} \ldots p_{{s_{t - 1} ,s_{t} ,k}}^{{a_{t - 1} }} }}{{\mathop \sum\limits_{\text{k}' \in \text{C}}b_{{s_{0} ,k{^{\prime}}}} p_{{s_{0} ,s_{1} ,k{^{\prime}}}}^{{a_{0} }} p_{{s_{1} ,s_{2} ,k{^{\prime}}}}^{{a_{1} }} \ldots p_{{s_{t - 1} ,s_{t} ,k{^{\prime}}}}^{{a_{t - 1} }} }}, \forall\,k \in C,$$
(A8)

where s0, a0, s1, ···, at−1, st are the state realizations and actions taken up to t.
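As a reading aid, (A8) can be checked numerically: folding the one-step update τ along a trajectory must reproduce the closed-form posterior, which is proportional to the prior times the product of scenario-specific likelihoods. The layout `p[k][s][a][s2]` and the numbers below are illustrative assumptions, not the paper's data.

```python
# Hedged sketch of the belief recursion (A8).

def belief_after_history(b0, p, states, actions):
    """Fold the one-step update tau over (s_0, a_0, s_1, ..., a_{t-1}, s_t)."""
    b = list(b0)
    for s, a, s2 in zip(states[:-1], actions, states[1:]):
        w = [b[k] * p[k][s][a][s2] for k in range(len(b))]
        z = sum(w)                      # normalizing constant of (A8)
        b = [x / z for x in w]
    return b

# Two scenarios, two states, one action; trajectory 0 -> 1 -> 1.
p = [[[[0.9, 0.1]], [[0.2, 0.8]]],
     [[[0.5, 0.5]], [[0.5, 0.5]]]]
b = belief_after_history([0.5, 0.5], p, states=[0, 1, 1], actions=[0, 0])
# Direct form of (A8): b[0] is proportional to 0.5 * 0.1 * 0.8 = 0.04,
# normalized by 0.04 + 0.5 * 0.5 * 0.5 = 0.165
```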

For a fixed policy π\(\in\) ΠDHD, starting from the boundary condition \({U}_{Z}^{\pi }\left({h}_{Z}\right)={\overline{r} }_{{s}_{Z}}^{0}\), we evaluate recursively \({U}_{t}^{\pi }\) for \(t=Z-1,\cdots,1,0\) by (22), that is,

$$\begin{aligned} U_{Z - 1}^{\pi } \left( {h_{Z - 1} } \right) & = \overline{r}_{{s_{Z - 1} }}^{{x_{Z - 1} \left( {h_{Z - 1} } \right)}} + \gamma \mathop \sum \limits_{{s_{Z} \in S}} \overline{p}_{{s_{Z - 1} ,s_{Z} }}^{{x_{Z - 1} \left( {h_{Z - 1} } \right)}} U_{Z}^{\pi } \left( {h_{Z - 1} ,x_{Z - 1} \left( {h_{Z - 1} } \right),\langle s_{Z} ,{\varvec{b}}_{{s_{Z} }} \rangle } \right) \\ & = \overline{r}_{{s_{Z - 1} }}^{{x_{Z - 1} \left( {h_{Z - 1} } \right)}} + \gamma \mathop \sum \limits_{{s_{Z} \in S}} \overline{p}_{{s_{Z - 1} ,s_{Z} }}^{{x_{Z - 1} \left( {h_{Z - 1} } \right)}} \overline{r}_{{s_{Z} }}^{0} ,\forall\, h_{Z - 1} \in H_{Z - 1}^{\pi } , \\ \end{aligned}$$
$$\begin{aligned} &U_{Z - 2}^{\pi } \left( {h_{Z - 2} } \right) = \overline{r}_{{s_{Z - 2} }}^{{x_{Z - 2} \left( {h_{Z - 2} } \right)}} + \gamma \mathop \sum \limits_{{s_{Z - 1} \in S}} \overline{p}_{{s_{Z - 2} ,s_{Z - 1} }}^{{x_{Z - 2} \left( {h_{Z - 2} } \right)}} U_{Z - 1}^{\pi } \left( {h_{Z - 2} ,x_{Z - 2} \left( {h_{Z - 2} } \right),\langle s_{Z - 1} ,{\varvec{b}}_{{s_{Z - 1} }} \rangle } \right) \\ &\quad\quad\quad\quad\!\!\quad\quad\quad = \overline{r}_{{s_{Z - 2} }}^{{x_{Z - 2} \left( {h_{Z - 2} } \right)}} + \gamma \mathop \sum \limits_{{s_{Z - 1} \in S}} \overline{p}_{{s_{Z - 2} ,s_{Z - 1} }}^{{x_{Z - 2} \left( {h_{Z - 2} } \right)}} \left\{ {\overline{r}_{{s_{Z - 1} }}^{{x_{Z - 1} \left( {h_{Z - 1} } \right)}} + \gamma \mathop \sum \limits_{{s_{Z} \in S}} \overline{p}_{{s_{Z - 1} ,s_{Z} }}^{{x_{Z - 1} \left( {h_{Z - 1} } \right)}} \overline{r}_{{s_{Z} }}^{0} } \right\},\forall\, h_{Z - 2} \in H_{Z - 2}^{\pi } , \\ & \cdots, \\ \end{aligned}$$
$$\begin{aligned} U_{0}^{\pi } \left( {h_{0} } \right) & = \overline{r}_{{s_{0} }}^{{x_{0} \left( {h_{0} } \right)}} + \gamma \mathop \sum \limits_{{s_{1} \in S}} \overline{p}_{{s_{0} ,s_{1} }}^{{x_{0} \left( {h_{0} } \right)}} \left\{ {\overline{r}_{{s_{1} }}^{{x_{1} \left( {h_{1} } \right)}} + \gamma \mathop \sum \limits_{{s_{2} \in S}} \overline{p}_{{s_{1} ,s_{2} }}^{{x_{1} \left( {h_{1} } \right)}} } \right. \\ & \quad \cdot \left\{ {\overline{r}_{{s_{2} }}^{{x_{2} \left( {h_{2} } \right)}} + \ldots + \gamma \mathop \sum \limits_{{s_{Z - 1} \in S}} \overline{p}_{{s_{Z - 2} ,s_{Z - 1} }}^{{x_{Z - 2} \left( {h_{Z - 2} } \right)}} \left. {\left\{ {\overline{r}_{{s_{Z - 1} }}^{{x_{Z - 1} \left( {h_{Z - 1} } \right)}} + \gamma \mathop \sum \limits_{{s_{Z} \in S}} \overline{p}_{{s_{Z - 1} ,s_{Z} }}^{{x_{Z - 1} \left( {h_{Z - 1} } \right)}} \overline{r}_{{s_{Z} }}^{0} } \right\} \ldots } \right\}} \right\}. \\ \end{aligned}$$
(A9)

Substituting (14) into the last equation in (A9) and using (A8), we have

$$\begin{aligned} U_{0}^{\pi } \left( {h_{0} } \right) & = \mathop \sum \limits_{k \in C} b_{{s_{0} ,k}} \left\{ {r_{{s_{0} ,k}}^{{x_{0} \left( {h_{0} } \right)}} + \gamma \mathop \sum \limits_{{s_{1} \in S}} p_{{s_{0} ,s_{1} ,k}}^{{x_{0} \left( {h_{0} } \right)}} r_{{s_{1} ,k}}^{{x_{1} \left( {h_{1} } \right)}} } \right. + \gamma^{2} \mathop \sum \limits_{{s_{1} \in S}} \mathop \sum \limits_{{s_{2} \in S}} p_{{s_{0} ,s_{1} ,k}}^{{x_{0} \left( {h_{0} } \right)}} p_{{s_{1} ,s_{2} ,k}}^{{x_{1} \left( {h_{1} } \right)}} r_{{s_{2} ,k}}^{{x_{2} \left( {h_{2} } \right)}} \\ & \quad + \ldots + \gamma^{Z - 1} \mathop \sum \limits_{{s_{1} \in S}} \mathop \sum \limits_{{s_{2} \in S}} \ldots \mathop \sum \limits_{{s_{Z - 1} \in S}} p_{{s_{0} ,s_{1} ,k}}^{{x_{0} \left( {h_{0} } \right)}} p_{{s_{1} ,s_{2} ,k}}^{{x_{1} \left( {h_{1} } \right)}} \ldots p_{{s_{Z - 2} ,s_{Z - 1} ,k}}^{{x_{Z - 2} \left( {h_{Z - 2} } \right)}} r_{{s_{Z - 1} ,k}}^{{x_{Z - 1} \left( {h_{Z - 1} } \right)}} \\ & \quad \left. { + \gamma^{Z} \mathop \sum \limits_{{s_{1} \in S}} \mathop \sum \limits_{{s_{2} \in S}} \ldots \mathop \sum \limits_{{s_{Z - 1} \in S}} \mathop \sum \limits_{{s_{Z} \in S}} p_{{s_{0} ,s_{1} ,k}}^{{x_{0} \left( {h_{0} } \right)}} p_{{s_{1} ,s_{2} ,k}}^{{x_{1} \left( {h_{1} } \right)}} \ldots p_{{s_{Z - 2} ,s_{Z - 1} ,k}}^{{x_{Z - 2} \left( {h_{Z - 2} } \right)}} p_{{s_{Z - 1} ,s_{Z} ,k}}^{{x_{Z - 1} \left( {h_{Z - 1} } \right)}} r_{{s_{Z} ,k}}^{0} } \right\}. \\ \end{aligned}$$
(A10)

On the other hand, Steimle et al. [4] show that optimal policies for the problem in (5) are in general not Markovian. We therefore consider the history-dependent deterministic policies of standard MDPs, \(\pi = \left\{ {x_{t} \left( {h_{t} } \right):h_{t} \in S \times A \times \cdots \times A \times S,t \in T} \right\}\). Using the finite horizon policy evaluation algorithm presented in [1] and the same logic as in (A9), we obtain \({V}_{k}^{\pi }\left({s}_{0}\right)\) in (3) as follows:

$$\begin{aligned} V_{k}^{\pi } \left( {s_{0} } \right) & = r_{{s_{0} ,k}}^{{x_{0} \left( {h_{0} } \right)}} + \gamma \mathop \sum \limits_{{s_{1} \in S}} p_{{s_{0} ,s_{1} ,k}}^{{x_{0} \left( {h_{0} } \right)}} r_{{s_{1} ,k}}^{{x_{1} \left( {h_{1} } \right)}} + \gamma^{2} \mathop \sum \limits_{{s_{1} \in S}} p_{{s_{0} ,s_{1} ,k}}^{{x_{0} \left( {h_{0} } \right)}} \mathop \sum \limits_{{s_{2} \in S}} p_{{s_{1} ,s_{2} ,k}}^{{x_{1} \left( {h_{1} } \right)}} r_{{s_{2} ,k}}^{{x_{2} \left( {h_{2} } \right)}} \\ & \quad + \ldots + \gamma^{Z - 1} \mathop \sum \limits_{{s_{1} \in S}} p_{{s_{0} ,s_{1} ,k}}^{{x_{0} \left( {h_{0} } \right)}} \mathop \sum \limits_{{s_{2} \in S}} p_{{s_{1} ,s_{2} ,k}}^{{x_{1} \left( {h_{1} } \right)}} \ldots \mathop \sum \limits_{{s_{Z - 1} \in S}} p_{{s_{Z - 2} ,s_{Z - 1} ,k}}^{{x_{Z - 2} \left( {h_{Z - 2} } \right)}} r_{{s_{Z - 1} ,k}}^{{x_{Z - 1} \left( {h_{Z - 1} } \right)}} \\ & \quad + \gamma^{Z} \mathop \sum \limits_{{s_{1} \in S}} p_{{s_{0} ,s_{1} ,k}}^{{x_{0} \left( {h_{0} } \right)}} \mathop \sum \limits_{{s_{2} \in S}} p_{{s_{1} ,s_{2} ,k}}^{{x_{1} \left( {h_{1} } \right)}} \ldots \mathop \sum \limits_{{s_{Z - 1} \in S}} p_{{s_{Z - 2} ,s_{Z - 1} ,k}}^{{x_{Z - 2} \left( {h_{Z - 2} } \right)}} \mathop \sum \limits_{{s_{Z} \in S}} p_{{s_{Z - 1} ,s_{Z} ,k}}^{{x_{Z - 1} \left( {h_{Z - 1} } \right)}} r_{{s_{Z} ,k}}^{0} . \\ \end{aligned}$$
(A11)

It holds from (4), (A11) and (A10) that

$${W}^{\pi }\left({s}_{0},{{\varvec{b}}}_{{s}_{0}}\right)=\sum_{k\in C}{b}_{{s}_{0},k}{V}_{k}^{\pi }\left({s}_{0}\right)={U}_{0}^{\pi }\left({h}_{0}\right)={V}^{\pi }\left({s}_{0},{{\varvec{b}}}_{{s}_{0}}\right).$$

Then it follows from (5) and (29) that the finite horizon DFMDP is equivalent to the WVP of the multi-scenario MDP.
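The equivalence just proved can be illustrated numerically. In the hedged sketch below (all parameters synthetic; a Markovian policy is used purely for brevity, since it is a special case of a history-dependent one), the belief-weighted sum of per-scenario values coincides with the DFMDP value.

```python
# Hedged numeric illustration of Theorem 7: W = sum_k b_k V_k equals U_0.

def scenario_value(k, t, s, pol, Z, gamma, p, r, r_term):
    """Standard finite-horizon evaluation of pol under scenario k, as in (A11)."""
    if t == Z:
        return r_term[k][s]
    a = pol(t, s)
    return r[k][s][a] + gamma * sum(
        p[k][s][a][s2] * scenario_value(k, t + 1, s2, pol, Z, gamma, p, r, r_term)
        for s2 in range(len(p[k])))

def dfmdp_value(t, s, b, pol, Z, gamma, p, r, r_term):
    """DFMDP evaluation of pol via belief-averaged parameters, as in (A9)/(A10)."""
    if t == Z:
        return sum(b[k] * r_term[k][s] for k in range(len(b)))
    a = pol(t, s)
    val = sum(b[k] * r[k][s][a] for k in range(len(b)))
    for s2 in range(len(p[0])):
        pbar = sum(b[k] * p[k][s][a][s2] for k in range(len(b)))
        if pbar > 0:
            b2 = [b[k] * p[k][s][a][s2] / pbar for k in range(len(b))]
            val += gamma * pbar * dfmdp_value(t + 1, s2, b2, pol, Z, gamma, p, r, r_term)
    return val

p = [[[[0.7, 0.3]], [[0.4, 0.6]]],
     [[[0.2, 0.8]], [[0.9, 0.1]]]]
r = [[[1.0], [0.5]], [[0.3], [2.0]]]
r_term = [[1.0, 0.0], [0.5, 1.5]]
b0, gamma, Z = [0.6, 0.4], 0.9, 2
pol = lambda t, s: 0
w = sum(b0[k] * scenario_value(k, 0, 0, pol, Z, gamma, p, r, r_term)
        for k in range(2))
u = dfmdp_value(0, 0, b0, pol, Z, gamma, p, r, r_term)
# w and u agree, matching the identity W = U_0 established above
```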

1.5 Proof of Theorem 8

For a given \(\langle {s}_{0},{{\varvec{b}}}_{{s}_{0}}\rangle\), (10) becomes

$$b_{{s_{t} ,k}} = \frac{{b_{{s_{t - 1} ,k}} p_{{s_{t - 1} ,s_{t} ,0}}^{{a_{t - 1} }} }}{{p_{{s_{t - 1} ,s_{t} ,0}}^{{a_{t - 1} }} \mathop \sum \nolimits_{{k{^{\prime}} \in {\text{C}}}} b_{{s_{t - 1} ,k{^{\prime}}}} }} = b_{{s_{t - 1} ,k}} ,\forall\, a_{t - 1} \in A,s_{t - 1} ,s_{t} \in S, k \in C,$$

since p1 = … = p|C| = p0. This implies that all scenario beliefs, for any state at any time, remain constant and equal to \({{\varvec{b}}}_{{s}_{0}}\). As a result, \(\overline{r}_{{s_{t} }}^{{a_{t} }} = \mathop \sum \nolimits_{k \in C} b_{{s_{0} ,k}} r_{{s_{t} ,k}}^{{a_{t} }} ,\forall\, s_{t} \in S,a_{t} \in A,t \in T\), i.e., \(\overline{\varvec{r}} = \mathop \sum \nolimits_{k \in C} b_{{s_{0} ,k}} {\varvec{r}}_{k}\), which means that the scenario-expected reward does not change over time. Therefore, the DFMDP with type-I scenarios reduces to a standard MDP with parameter pair (\({{\varvec{p}}}_{0}\), \(\overline{{\varvec{r}} }\)).

1.6 Proof of Theorem 9

First, we prove that H is an isotone mapping. Let V, U\(\in\)\(\mathcal{V}\) with U\(\geqslant\)V. Suppose that

$${a}^{*}(s,{{\varvec{b}}}_{s})\in \underset{a\in A}{\text{argmax}}\left\{ \sum \limits_{k\in C}{b}_{s,k}{r}_{s,k}^{a}+\gamma \sum \limits_{s\mathrm{^{\prime}}\in S}\sum \limits_{k\in C}{b}_{s,k}{p}_{s,s\mathrm{^{\prime}},k}^{a}V\left(s\mathrm{^{\prime}},\tau \left(\langle s,{{\varvec{b}}}_{s}\rangle ,a,s\mathrm{^{\prime}}\right)\right)\right\},$$

then for any \(\langle s,{{\varvec{b}}}_{s}\rangle \in S\times {\Delta }^{C}\),

$$\begin{aligned} & \left( {HU} \right)\left( {s,{\varvec{b}}_{s} } \right) - \left( {HV} \right)\left( {s,{\varvec{b}}_{s} } \right) \\ & \geqslant \mathop \sum \limits_{k \in C} b_{s,k} r_{s,k}^{{a^{*} \left( {s,{\varvec{b}}_{s} } \right)}} + \gamma \mathop \sum \limits_{{s{^{\prime}} \in S}} \mathop \sum \limits_{k \in C} b_{s,k} p_{{s,s{^{\prime}},k}}^{{a^{*} \left( {s,{\varvec{b}}_{s} } \right)}} U\left( {s{^{\prime}},\tau \left( {\langle s,{\varvec{b}}_{s} \rangle ,a^{*} \left( {s,{\varvec{b}}_{s} } \right),s{^{\prime}}} \right)} \right) \\ & \quad - \mathop \sum \limits_{k \in C} b_{s,k} r_{s,k}^{{a^{*} \left( {s,{\varvec{b}}_{s} } \right)}} - \gamma \mathop \sum \limits_{{s{^{\prime}} \in S}} \mathop \sum \limits_{k \in C} b_{s,k} p_{{s,s{^{\prime}},k}}^{{a^{*} \left( {s,{\varvec{b}}_{s} } \right)}} V\left( {s{^{\prime}},\tau \left( {\langle s,{\varvec{b}}_{s} \rangle ,a^{*} \left( {s,{\varvec{b}}_{s} } \right),s{^{\prime}}} \right)} \right) \\ & = \gamma \mathop \sum \limits_{{s{^{\prime}} \in S}} \mathop \sum \limits_{k \in C} b_{s,k} p_{{s,s{^{\prime}},k}}^{{a^{*} \left( {s,{\varvec{b}}_{s} } \right)}} \left\{ {U\left( {s{^{\prime}},\tau \left( {\langle s,{\varvec{b}}_{s} \rangle ,a^{*} \left( {s,{\varvec{b}}_{s} } \right),s{^{\prime}}} \right)} \right) - V\left( {s{^{\prime}},\tau \left( {\langle s,{\varvec{b}}_{s} \rangle ,a^{*} \left( {s,{\varvec{b}}_{s} } \right),s{^{\prime}}} \right)} \right)} \right\} \geqslant 0. \\ \end{aligned}$$

This shows that HU\(\geqslant\)HV whenever U\(\geqslant\)V, so H is an isotone mapping.

Next, we prove that \(\left \| HV-HU \right \| \leqslant \gamma \left \| V-U \right \|\) holds for all V, U\(\in\)\({\mathcal{V}}\) and any 0 < γ < 1. Let V, U\(\in\)\({\mathcal{V}}\) and suppose HV(s, bs) \(\geqslant\)HU(s, bs) for a fixed \(\langle s,{{\varvec{b}}}_{s}\rangle \in S\times {\Delta }^{C}\). Again, suppose that

$${a}^{*}\left(s,{{\varvec{b}}}_{s}\right)\in \underset{a\in A}{\text{argmax}}\left\{ \sum \limits_{k\in C}{b}_{s,k}{r}_{s,k}^{a}+\gamma \sum \limits_{{s}^{\prime} \in S} \sum \limits_{k\in C}{b}_{s,k}{p}_{s,{s}^{\mathrm{^{\prime}}},k}^{a}V\left({s}^{\prime},\tau \left(\langle s,{{\varvec{b}}}_{s}\rangle ,a,{s}^{\prime}\right)\right)\right\}.$$

Then

$$\begin{aligned} 0 & \leqslant \left( {HV} \right)\left( {s,{\varvec{b}}_{s} } \right) - \left( {HU} \right)\left( {s,{\varvec{b}}_{s} } \right) \\ & \leqslant \mathop \sum \limits_{k \in C} b_{s,k} r_{s,k}^{{a^{*} \left( {s,{\varvec{b}}_{s} } \right)}} + \gamma \mathop \sum \limits_{{s{^{\prime}} \in S}} \mathop \sum \limits_{k \in C} b_{s,k} p_{{s,s{^{\prime}},k}}^{{a^{*} \left( {s,{\varvec{b}}_{s} } \right)}} V\left( {s{^{\prime}},\tau \left( {\langle s,{\varvec{b}}_{s} \rangle ,a^{*} \left( {s,{\varvec{b}}_{s} } \right),s{^{\prime}}} \right)} \right) \\ & \quad - \mathop \sum \limits_{k \in C} b_{s,k} r_{s,k}^{{a^{*} \left( {s,{\varvec{b}}_{s} } \right)}} - \gamma \mathop \sum \limits_{{s{^{\prime}} \in S}} \mathop \sum \limits_{k \in C} b_{s,k} p_{{s,s{^{\prime}},k}}^{{a^{*} \left( {s,{\varvec{b}}_{s} } \right)}} U\left( {s{^{\prime}},\tau \left( {\langle s,{\varvec{b}}_{s} \rangle ,a^{*} \left( {s,{\varvec{b}}_{s} } \right),s{^{\prime}}} \right)} \right) \\ & = \gamma \mathop \sum \limits_{{s{^{\prime}} \in S}} \mathop \sum \limits_{k \in C} b_{s,k} p_{{s,s{^{\prime}},k}}^{{a^{*} \left( {s,{\varvec{b}}_{s} } \right)}} \left\{ {V\left( {s^{\prime},\tau \left( {\langle s,{\varvec{b}}_{s} \rangle ,a^{*} \left( {s,{\varvec{b}}_{s} } \right),s^{\prime}} \right)} \right) - U\left( {s^{\prime},\tau \left( {\langle s,{\varvec{b}}_{s} \rangle ,a^{*} \left( {s,{\varvec{b}}_{s} } \right),s^{\prime}} \right)} \right)} \right\} \\ & \leqslant \gamma \mathop \sum \limits_{{s{^{\prime}} \in S}} \mathop \sum \limits_{k \in C} b_{s,k} p_{{s,s{^{\prime}},k}}^{{a^{*} \left( {s,{\varvec{b}}_{s} } \right)}} \Vert V - U \Vert \leqslant \gamma \Vert V - U \Vert . \\ \end{aligned}$$

If instead HU(s, bs) \(\geqslant\)HV(s, bs), the same logic implies that

$$\left|\left(HV\right)\left(s,{{\varvec{b}}}_{s}\right)-\left(HU\right)\left(s,{{\varvec{b}}}_{s}\right)\right|\leqslant \gamma \Vert V-U\Vert$$

for any \(\langle s,{{\varvec{b}}}_{s}\rangle \in S\times {\Delta }^{C}\). This yields \(\left \| HV-HU \right \| \leqslant \gamma \left \| V-U \right \|\), so H is a contraction mapping.
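As a sanity check of the contraction bound, consider the special case of Theorem 8 in which beliefs never move; there H reduces to the standard Bellman optimality operator over the physical states, and \(\Vert HV-HU\Vert \leqslant \gamma \Vert V-U\Vert\) can be verified on random synthetic instances (all data below is made up).

```python
import random

# Hedged numeric check of the contraction property of H in the
# belief-invariant special case (common transition law p0, averaged reward).

def H(V, p0, rbar, gamma):
    """Bellman optimality operator of the reduced (Theorem 8) problem."""
    nS, nA = len(rbar), len(rbar[0])
    return [max(rbar[s][a] + gamma * sum(p0[s][a][s2] * V[s2] for s2 in range(nS))
                for a in range(nA)) for s in range(nS)]

def rand_dist(n):
    w = [random.random() for _ in range(n)]
    z = sum(w)
    return [x / z for x in w]

random.seed(0)
nS, nA, gamma = 4, 3, 0.9
p0 = [[rand_dist(nS) for _ in range(nA)] for _ in range(nS)]
rbar = [[random.random() for _ in range(nA)] for _ in range(nS)]
V = [random.uniform(-5, 5) for _ in range(nS)]
U = [random.uniform(-5, 5) for _ in range(nS)]

lhs = max(abs(x - y) for x, y in zip(H(V, p0, rbar, gamma), H(U, p0, rbar, gamma)))
rhs = gamma * max(abs(x - y) for x, y in zip(V, U))
# lhs <= rhs: the sup-norm distance contracts by at least the factor gamma
```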

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Hou, CJ. Double-Factored Decision Theory for Markov Decision Processes with Multiple Scenarios of the Parameters. J. Oper. Res. Soc. China (2023). https://doi.org/10.1007/s40305-023-00457-5
