Abstract
This article proposes a double-factored decision theory for Markov decision processes with multiple scenarios of the parameters. We introduce the scenario belief to describe the probability distribution over scenarios in the system, and the scenario expectation to formulate the expected total discounted reward of a policy. We establish a new framework, the double-factored Markov decision process (DFMDP), in which the physical state and the scenario belief are shown to be the double factors that serve as sufficient statistics for the history of the decision process. Four classes of policies for finite horizon DFMDPs are studied, and it is shown that there exists a double-factored Markovian deterministic policy that is optimal among all policies. We also formulate infinite horizon DFMDPs and present their optimality equation. An exact solution method, double-factored backward induction, is proposed for finite horizon DFMDPs. It is used to find optimal policies for the numerical examples, which are then compared with policies derived from other methods in the related literature.


References
Puterman, M.L.: Markov Decision Processes: Discrete Stochastic Dynamic Programming, 2nd edn. Wiley, New Jersey (2005)
Beemsterboer, B.J., Land, M.J., Teunter, R.H.: Flexible lot sizing in hybrid make-to-order/make-to-stock production planning. Eur. J. Oper. Res. 260(3), 1014–1023 (2017)
Chen, N., Kou, S., Wang, C.: A partitioning algorithm for Markov decision processes with applications to market microstructure. Manag. Sci. 64(2), 784–803 (2018)
Steimle, L.N., Kaufman, D.L., Denton, B.T.: Multi-model Markov decision processes. IISE Trans. 53(10), 1124–1139 (2021)
Buchholz, P., Scheftelowitsch, D.: Computation of weighted sums of rewards for concurrent MDPs. Math. Methods Oper. Res. 89(1), 1–42 (2019)
Givan, R., Leach, S., Dean, T.: Bounded-parameter Markov decision processes. Artif. Intell. 122(1–2), 71–109 (2000)
Delgado, K.V., Sanner, S., de Barros, L.N.: Efficient solutions to factored MDPs with imprecise transition probabilities. Artif. Intell. 175(9–10), 1498–1527 (2011)
Witwicki, S.J., Melo, F.S., Capitan, J., Spaan, M.T.J.: A flexible approach to modeling unpredictable events in MDPs. In: Proceedings of Twenty-Third International Conference on Automated Planning and Scheduling ICAPS2013, pp. 260–268 (2013)
Duff, M.: Optimal learning: computational procedures for Bayes-adaptive Markov decision processes. Ph.D. thesis, University of Massachusetts Amherst, Amherst, MA (2002)
Castro, P. S., Precup, D.: Using linear programming for Bayesian exploration in Markov decision processes. In: International Joint Conference on Artificial Intelligence IJCAI2007, pp. 2437–2442 (2007)
Kumar, P.: Information theoretic learning methods for Markov decision processes with parametric uncertainty. Ph.D. thesis, University of Washington (2018)
Iyengar, G.: Robust dynamic programming. Math. Oper. Res. 30(2), 257–280 (2005)
Nilim, A., El Ghaoui, L.: Robust control of Markov decision processes with uncertain transition matrices. Oper. Res. 53(5), 780–798 (2005)
Delgado, K.V., de Barros, L.N., Dias, D.B., Sanner, S.: Real-time dynamic programming for Markov decision process with imprecise probabilities. Artif. Intell. 230(8), 192–223 (2016)
Moreira, D.A.M., Delgado, K.V., de Barros, L.N.: Robust probabilistic planning with ILAO*. Appl. Intell. 45(3), 662–672 (2016)
Delage, E., Mannor, S.: Percentile optimization for Markov decision processes with parameter uncertainty. Oper. Res. 58(1), 203–213 (2010)
Adulyasak, Y., Varakantham, P., Ahmed, A., Jaillet, P.: Solving uncertain MDPs with objectives that are separable over instantiations of model uncertainty. In: Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, Austin, Texas, AAAI Press, pp. 3454–3460 (2015)
Ahmed, A., Varakantham, P., Lowalekar, M., Adulyasak, Y., Jaillet, P.: Sampling based approaches for minimizing regret in uncertain Markov decision processes (MDPs). J. Artif. Intell. Res. 59, 229–264 (2017)
Meraklı, M., Küçükyavuz, S.: Risk aversion to parameter uncertainty in Markov decision processes with an application to slow-onset disaster relief. IISE Trans. 52(8), 811–831 (2019)
Shani, G., Heckerman, D., Brafman, R.: An MDP-based recommender system. J. Mach. Learn. Res. 6(43), 1265–1295 (2005)
Chen, Q., Ayer, T., Chhatwal, J.: Sensitivity analysis in sequential decision models: a probabilistic approach. Med. Decis. Making 37(2), 243–252 (2017)
Bala, M.V., Mauskopf, J.A.: Optimal assignment of treatments to health states using a Markov decision model. Pharmacoeconomics 24(4), 345–354 (2006)
Acknowledgements
The author is grateful to the editors and reviewers for their dedicated input and support during the review process. The author would like to thank his advisor, Prof. Theodore Allen of Systems Engineering at The Ohio State University, for his continuous support and guidance. The author also expresses his sincere appreciation to Prof. Nick Hall of the Fisher College of Business at The Ohio State University for his valuable suggestions on this work.
Author information
Authors and Affiliations
Contributions
C.-J. Hou contributed to all elements of this paper, including framework conceptualization, algorithm design, mathematical proofs, and numerical validation.
Corresponding author
Ethics declarations
Competing Interests
The author declares that there are no competing interests related to the research work and the writing of this paper.
Additional information
This research was originally supported by the (United States) National Science Foundation (No. 1409214).
Appendix
1.1 Proof of Theorem 1
For any given \({h}_{0}=\langle {s}_{0},{{\varvec{b}}}_{{s}_{0}}\rangle \in S\times {\Delta }^{C}\) and a fixed π ∊ ΠDHD, the history sets \({H}_{t}^{\pi }\) for t = 0,1, ···, Z are determined by step 1 of the policy evaluation algorithm. Equation (22) shows that when the history up to time t is \(h_{t} \in H_{t}^{\pi }\), the expected value of policy π over decision epochs t, t + 1, ···, Z equals the reward \({\overline{r} }_{{s}_{t}}^{{x}_{t}({h}_{t})}\) received by selecting action xt(ht) plus the expected total discounted reward over the remaining periods. The second term contains the product of \({\overline{p} }_{{s}_{t},{s}_{t+1}}^{{x}_{t}({h}_{t})}\), the probability of the state transitioning from st to st+1 when action xt(ht) is taken at decision epoch t, and \({U}_{\mathrm{t}+1}^{\pi }\), the expected total discounted reward obtained by applying π at decision epochs t + 1, ···, Z when the history up to time t + 1 is \({h}_{t+1}=\left({h}_{t},{x}_{t}({h}_{t}),\langle {s}_{t+1},{{\varvec{b}}}_{{s}_{t+1}}\rangle \right)\). Summing over all possible states st+1 gives the desired expectation expressed in terms of \({U}_{\mathrm{t}+1}^{\pi }\). Thus Eq. (22) can be written as
The rest of the proof proceeds by backward induction on t. Clearly, (21) holds when t = Z. Suppose now that (21) holds for t + 1, ···, Z. Then, using (A1) and the induction hypothesis, we have
Hence (21) holds for t. In particular, \({V}^{\pi }\left({s}_{0},{{\varvec{b}}}_{{s}_{0}}\right)={U}_{0}^{\pi }\left({h}_{0}\right)\) when t = 0.
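To make the recursion in (A1) concrete, the following Python sketch evaluates a fixed double-factored policy by the same backward recursion over the reachable ⟨s, b⟩ pairs. The array layouts P[k][s, a, s'] and R[k][s, a], the terminal reward array R_term, and the Bayes-rule form of the belief update τ are illustrative assumptions rather than notation taken from the paper.

```python
import numpy as np

def evaluate_policy(P, R, R_term, policy, s0, b0, gamma, Z):
    """Evaluate a fixed double-factored policy by the backward recursion in (A1):
    U_t(s, b) = r_bar + gamma * sum_{s'} p_bar * U_{t+1}(s', tau(s, b, a, s')).

    P[k][s, a, s'] : per-scenario transition probabilities (hypothetical layout)
    R[k][s, a]     : per-scenario rewards
    R_term[k][s]   : per-scenario terminal rewards (stands in for the terminal term)
    policy(t, s, b): a double-factored decision rule x_t(<s, b>)
    """
    K, nS = len(P), P[0].shape[0]

    def tau(s, b, a, s_next):
        # Assumed Bayes-rule form of the belief update in Eq. (14).
        w = np.array([b[k] * P[k][s, a, s_next] for k in range(K)])
        return w / w.sum() if w.sum() > 0 else b

    def U(t, s, b):
        if t == Z:  # boundary condition: scenario-weighted terminal reward
            return sum(b[k] * R_term[k][s] for k in range(K))
        a = policy(t, s, b)
        value = sum(b[k] * R[k][s, a] for k in range(K))          # r_bar
        for s_next in range(nS):
            p_bar = sum(b[k] * P[k][s, a, s_next] for k in range(K))
            if p_bar > 0:
                value += gamma * p_bar * U(t + 1, s_next, tau(s, b, a, s_next))
        return value

    return U(0, s0, np.asarray(b0, dtype=float))
```

Because each distinct history generates its own belief, the recursion branches over histories exactly as the sets \({H}_{t}^{\pi }\) do.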
1.2 Proof of Theorem 3
The proof is in two parts. First, we establish by induction that \({U}_{t}\left({h}_{t}\right)\geqslant {U}_{t}^{*}\left({h}_{t}\right)\) for all ht \(\in\) Ht and t = 0,1, ···, Z. Clearly, \({U}_{Z}\left({h}_{Z}\right)={\overline{r} }_{{s}_{Z}}^{0}={U}_{Z}^{\pi }\left({h}_{Z}\right)\) for all hZ \(\in\) HZ and π \(\in\) ΠDHR, so \({U}_{Z}\left({h}_{Z}\right)={U}_{Z}^{*}\left({h}_{Z}\right)\) for all hZ \(\in\) HZ. Now assume that \({U}_{t}\left({h}_{t}\right)\geqslant {U}_{t}^{*}\left({h}_{t}\right)\) for all \(h_{t} \in H_{t}\) and t = n + 1, ···, Z. Let \(\pi^{\prime } = \left( {x_{0}^{\prime } ,x_{1}^{\prime } ,\cdots,x_{Z - 1}^{\prime } } \right)\) be an arbitrary policy in ΠDHR. Then, for t = n, we have
Line (A2) holds because of the induction hypothesis and the non-negativity of \(\overline{{\varvec{p}} }\). Line (A3) holds by the definition of \({U}_{n+1}^{*}\). Line (A4) follows from Lemma 1. The last equality follows from (23) and Theorem 2.
Because \({\pi}'\) is arbitrary, we have
for all π \(\in\) ΠDHR. Thus \({U}_{t}\left({h}_{t}\right)\geqslant {U}_{t}^{*}\left({h}_{t}\right)\) for t = n, which completes the induction. This finishes the first part of the proof.
For the second part, we establish that for any ε > 0 there exists a \({\pi}' {\in}\) ΠDHD such that
for all ht \(\in\) Ht and t = 0,1, ··· , Z. Since \(U_{Z}^{{\pi^{\prime } }} \left( {h_{Z} } \right) = U_{Z} \left( {h_{Z} } \right) = \overline{r}_{{s_{Z} }}^{0}\), the induction hypothesis holds for t = Z. Assuming that \(U_{t}^{{\pi^{\prime } }} \left( {h_{t} } \right) + \left( {Z - t} \right)\varepsilon \geqslant U_{t} \left( {h_{t} } \right)\) for t = n + 1, ···, Z, we have
The second line in (A6) holds because of the induction hypothesis and ε > 0. Thus the induction step is complete, and (A5) holds for t = 0,1, ···, Z.
Thus for any ε > 0, there exists a \({\pi}'\) \(\in\) ΠDHR for which
so part (i) of the theorem holds. Part (ii) of the theorem holds because
1.3 Proof of Theorem 4
For t = Z, clearly \({U}_{Z}^{{\pi }^{*}}\left({h}_{Z}\right)={U}_{Z}^{*}\left({h}_{Z}\right)\) for all hZ \(\in\) HZ. Assume that the statement in part (i) of the theorem holds for t = n + 1, ···, Z. Then, for t = n and \({h}_{n}=\left({h}_{n-1},{x}_{n-1}^{*}({h}_{n-1}),\langle {s}_{n},{{\varvec{b}}}_{{s}_{n}}\rangle \right)\),
Thus the induction step holds. Part (ii) of the theorem follows from Theorem 1 and Theorem 3-(ii).
1.4 Proof of Theorem 6
We show that (i) holds by induction. Since \({U}_{Z}^{*}\left({h}_{Z}\right)={U}_{Z}^{*}\left({h}_{Z-1},{a}_{Z-1},\langle {s}_{Z},{{\varvec{b}}}_{{s}_{Z}}\rangle \right)={\overline{r} }_{{s}_{Z}}^{0}\) for all hZ–1 \(\in\) HZ–1 and aZ–1 \(\in\) A, we have \({U}_{Z}^{*}\left({h}_{Z}\right)={U}_{Z}^{*}\left({s}_{Z},{{\varvec{b}}}_{{s}_{Z}}\right)\). Assume now that (i) is true for t = n + 1, ···, Z, and let t = n. For any \({h}_{n}=\left({h}_{n-1},{a}_{n-1},\langle {s}_{n},{{\varvec{b}}}_{{s}_{n}}\rangle \right)\in {H}_{n}\), it follows from (24), the induction hypothesis and (14) that
Because the quantities within brackets in the last line of (A7) depend on hn only through \(\langle {s}_{n},{{\varvec{b}}}_{{s}_{n}}\rangle\), we have \({U}_{n}^{*}\left({h}_{n}\right)={U}_{n}^{*}\left({s}_{n},{{\varvec{b}}}_{{s}_{n}}\right)\). Thus part (i) of the theorem holds for t = 0,1, ···, Z.
When S and A are finite, there exists a policy \(\pi^{*} = \left( {x_{0}^{*} ,x_{1}^{*},\,\dots\,,x_{Z - 1}^{*} } \right) \in {\Pi }^{{{\text{DMD}}}}\) derived from
Therefore, by part (i) of the theorem and Theorem 4-(ii), \({\pi }^{*}\) is optimal.
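For illustration, the sketch below implements the backward induction implied by the recursion in (A7) and the argmax rule above: it records, for every reachable ⟨s, b⟩ pair, an action attaining the maximum, thereby returning a double-factored Markovian deterministic policy. The per-scenario arrays and the Bayes-rule belief update are the same illustrative assumptions as in the previous sketch; this is a minimal sketch, not the paper's double-factored backward induction algorithm itself.

```python
import numpy as np

def double_factored_backward_induction(P, R, R_term, s0, b0, gamma, Z):
    """Backward induction over reachable <s, b> pairs:
    U*_t(s, b) = max_a { r_bar + gamma * sum_{s'} p_bar * U*_{t+1}(s', tau) },
    with the maximizing action stored for each (t, s, b)."""
    K, nS, nA = len(P), P[0].shape[0], P[0].shape[1]
    policy = {}

    def tau(s, b, a, s_next):
        # Assumed Bayes-rule belief update; beliefs kept as tuples so they can key the policy dict.
        w = np.array([b[k] * P[k][s, a, s_next] for k in range(K)])
        return tuple(w / w.sum()) if w.sum() > 0 else b

    def U(t, s, b):
        if t == Z:
            return sum(b[k] * R_term[k][s] for k in range(K))
        best_value, best_action = -np.inf, None
        for a in range(nA):
            value = sum(b[k] * R[k][s, a] for k in range(K))
            for s_next in range(nS):
                p_bar = sum(b[k] * P[k][s, a, s_next] for k in range(K))
                if p_bar > 0:
                    value += gamma * p_bar * U(t + 1, s_next, tau(s, b, a, s_next))
            if value > best_value:
                best_value, best_action = value, a
        policy[(t, s, b)] = best_action
        return best_value

    b0 = tuple(float(x) for x in b0)
    return U(0, s0, b0), policy
```

Since the set of reachable beliefs grows with the horizon, a practical implementation would memoize on (t, s, b) or otherwise control the branching of the recursion.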
1.5 Proof of Theorem 7
We show that \({W}^{\pi }\left({s}_{0},{{\varvec{b}}}_{{s}_{0}}\right)\) in (4) is equal to \({V}^{\pi }({s}_{0},{{\varvec{b}}}_{{s}_{0}})\) in (16). First, we derive an alternative expression for the scenario belief \({{\varvec{b}}}_{{s}_{t}}\) in state st at decision epoch t. For any \(h_{0} = \langle s_{0} ,{\varvec{b}}_{{s_{0} }} \rangle \in S \times {\Delta }^{C}\), applying the history recursion \({h}_{t}=({h}_{t-1},{a}_{t-1},\langle {s}_{t},\tau (\langle {s}_{t-1},{{\varvec{b}}}_{{s}_{t-1}}\rangle ,{a}_{t-1},{s}_{t})\rangle )\) repeatedly, the belief \({{\varvec{b}}}_{{s}_{t}}\) within \(\langle {s}_{t},{{\varvec{b}}}_{{s}_{t}}\rangle\) in ht \(\in\) Ht can be expressed as
where s0, a0, s1, ···, at−1, st are the state realizations and actions taken up to t.
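As an illustration of this recursion, the sketch below unrolls the belief update along an observed state-action trajectory. It assumes the Bayes-rule form of τ in Eq. (14) and the hypothetical array layout P[k][s, a, s'] for the per-scenario transition probabilities.

```python
import numpy as np

def beliefs_along_trajectory(P, b0, trajectory):
    """Unrolls the belief recursion behind (A8): starting from the initial scenario
    belief b_{s_0}, the (assumed Bayes-rule) update tau is applied after each
    observed transition.  trajectory is a list of (s_t, a_t, s_{t+1}) triples."""
    b = np.asarray(b0, dtype=float)
    history = [b.copy()]
    for s, a, s_next in trajectory:
        w = np.array([b[k] * P[k][s, a, s_next] for k in range(len(P))])
        b = w / w.sum()              # posterior over scenarios given the observed move
        history.append(b.copy())
    return history
```

The recursion amounts to weighting the initial belief by the likelihood of the observed path under each scenario and renormalizing.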
For a fixed policy π \(\in\) ΠDHD, starting from the boundary condition \({U}_{Z}^{\pi }\left({h}_{Z}\right)={\overline{r} }_{{s}_{Z}}^{0}\), we evaluate \({U}_{t}^{\pi }\) recursively for \(t=Z-1,\cdots,1,0\) using (22), that is,
Substituting (14) into the last equation in (A9) and using (A8), we obtain
On the other hand, Steimle et al. [4] show that optimal policies for the problem in (5) are in general not Markovian. We therefore consider the history-dependent deterministic policies of standard MDPs, \(\pi = \left\{ {x_{t} \left( {h_{t} } \right):h_{t} \in S \times A \times \cdots \times A \times S,t \in T} \right\}\). Using the finite horizon policy evaluation algorithm presented in [1] and the same logic as in (A9), we obtain \({V}_{k}^{\pi }\left({s}_{0}\right)\) in (3) as follows
It follows from (4), (A10) and (A11) that
Then it follows from (5) and (29) that the finite horizon DFMDP is equivalent to the WVP of the multi-scenario MDP.
1.6 Proof of Theorem 8
For a given \(\langle {s}_{0},{{\varvec{b}}}_{{s}_{0}}\rangle\), (10) becomes
since p1 = … = p|C| = p0. This implies that all scenario beliefs, for any state at any time, are constant and equal to \({{\varvec{b}}}_{{s}_{0}}\). As a result, \(\overline{r}_{{s_{t} }}^{{a_{t} }} = \mathop \sum \nolimits_{k \in C} b_{{s_{0} ,k}} r_{{s_{t} ,k}}^{{a_{t} }} ,\forall\, s_{t} \in S,a_{t} \in A,t \in T\), i.e., \(\overline{\varvec{r}} = \mathop \sum \nolimits_{k \in C} b_{{s_{0} ,k}} {\varvec{r}}_{k}\), which means that the scenario expected reward does not change over time. Therefore, the DFMDP with type-I scenarios can be reduced to a standard MDP with parameter pair (\({{\varvec{p}}}_{0}\), \(\overline{{\varvec{r}} }\)).
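A short sketch of this reduction, under the stated assumption that all scenarios share the common transition law p0 and using the same hypothetical array layout as in the earlier sketches, is:

```python
import numpy as np

def reduce_type_one_dfmdp(P0, R, b0):
    """Under the assumption of Theorem 8 (all scenarios share the common transition
    law p_0), the scenario belief never changes, so the DFMDP collapses to a standard
    MDP with parameter pair (p_0, r_bar), where r_bar = sum_k b_{s_0,k} * r_k.
    P0[s, a, s'] is the common transition array and R[k][s, a] the scenario rewards."""
    r_bar = sum(b0[k] * np.asarray(R[k], dtype=float) for k in range(len(R)))
    return P0, r_bar  # feed this pair into any standard finite horizon MDP solver
```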
1.7 Proof of Theorem 9
First, we prove that H is an isotone mapping. Let V, U \(\in\) \(\mathcal{V}\) with U \(\geqslant\) V, and suppose that
then for any \(\langle s,{{\varvec{b}}}_{s}\rangle \in S\times {\Delta }^{C}\),
This shows that HU \(\geqslant\) HV whenever U \(\geqslant\) V, so H is an isotone mapping.
Next, we prove that \(\left \| HV-HU \right \| \leqslant \gamma \left \| V-U \right \|\) for all V, U \(\in\) \({\mathcal{V}}\), where 0 < γ < 1 is the discount factor. Let V, U \(\in\) \({\mathcal{V}}\) and suppose HV(s, bs) \(\geqslant\) HU(s, bs) for a fixed \(\langle s,{{\varvec{b}}}_{s}\rangle \in S\times {\Delta }^{C}\). Again, suppose that
Then
If instead HU(s, bs) \(\geqslant\) HV(s, bs), the same argument implies that
for any \(\langle s,{{\varvec{b}}}_{s}\rangle \in S\times {\Delta }^{C}\). This yields \(\left \| \, HV-HU \right \| \leqslant \gamma \left \| V-U \right \|\), so H is a contraction mapping.
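To illustrate the role of the contraction property, the sketch below runs successive approximations of H on a finite grid of scenario beliefs; the grid discretization and nearest-point projection are simplifications for illustration only, not the solution scheme of the paper. By \(\left \| HV-HU \right \| \leqslant \gamma \left \| V-U \right \|\), the iterates converge to the unique fixed point of the discretized operator.

```python
import numpy as np

def value_iteration_on_belief_grid(P, R, gamma, grid, tol=1e-6):
    """Successive approximation of the (discretized) operator H over pairs (s, b),
    where b ranges over a finite grid of belief vectors (tuples).  P[k][s, a, s']
    and R[k][s, a] are hypothetical per-scenario arrays as in the earlier sketches."""
    K, nS, nA = len(P), P[0].shape[0], P[0].shape[1]

    def tau(s, b, a, s_next):
        w = np.array([b[k] * P[k][s, a, s_next] for k in range(K)])
        b_new = w / w.sum() if w.sum() > 0 else np.asarray(b)
        # snap the updated belief back onto the grid (illustrative discretization)
        return min(grid, key=lambda g: np.linalg.norm(np.asarray(g) - b_new))

    V = {(s, b): 0.0 for s in range(nS) for b in grid}
    while True:
        V_new = {}
        for (s, b) in V:
            values = []
            for a in range(nA):
                v = sum(b[k] * R[k][s, a] for k in range(K))
                for s_next in range(nS):
                    p_bar = sum(b[k] * P[k][s, a, s_next] for k in range(K))
                    if p_bar > 0:
                        v += gamma * p_bar * V[(s_next, tau(s, b, a, s_next))]
                values.append(v)
            V_new[(s, b)] = max(values)
        if max(abs(V_new[key] - V[key]) for key in V) < tol:
            return V_new
        V = V_new
```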
Keywords
- Dynamic programming
- Markov decision process
- Parameter uncertainty
- Multiple scenarios of the parameters
- Double-factored Markov decision process