Book contents
- Frontmatter
- Contents
- Preface
- 1 Introduction
- Part I Stochastic Models and Bayesian Filtering
- Part II Partially Observed Markov Decision Processes: Models and Applications
- Part III Partially Observed Markov Decision Processes: Structural Results
- 9 Structural results for Markov decision processes
- 10 Structural results for optimal filters
- 11 Monotonicity of value function for POMDPs
- 12 Structural results for stopping time POMDPs
- 13 Stopping time POMDPs for quickest change detection
- 14 Myopic policy bounds for POMDPs and sensitivity to model parameters
- Part IV Stochastic Approximation and Reinforcement Learning
- Appendix A Short primer on stochastic simulation
- Appendix B Continuous-time HMM filters
- Appendix C Markov processes
- Appendix D Some limit theorems
- References
- Index
12 - Structural results for stopping time POMDPs
from Part III - Partially Observed Markov Decision Processes: Structural Results
Published online by Cambridge University Press: 05 April 2016
Summary
Introduction
The previous chapter established conditions under which the value function of a POMDP is monotone with respect to the MLR order. Conditions were also given for the optimal policy of a two-state POMDP to be monotone (threshold). This and the next chapter develop structural results for the optimal policy of multi-state POMDPs. To establish the structural results, we will use submodularity and stochastic dominance on the lattice of belief states to analyze Bellman's dynamic programming equation; such analysis falls under the area of “Lattice Programming” [144]. Lattice programming and “monotone comparative statics”, pioneered by Topkis [322] (see also [15, 26]), provide a general set of sufficient conditions for the existence of monotone strategies. Once a POMDP is shown to have a monotone policy, gradient-based algorithms that exploit this structure can be designed to estimate this policy. This and the next two chapters rely heavily on the structural results for filtering (Chapter 10) and monotone value function (Chapter 11). Please see Figure 10.1 on page 220 for the context of this chapter.
Main results
This chapter deals with structural results for the optimal policy of stopping time POMDPs. Stopping time POMDPs have action space U = {1 (stop), 2 (continue)}. They arise in sequential detection such as quickest change detection and machine replacement. Establishing structural results for stopping time POMDPs is easier than for general POMDPs (which are considered in the next chapter). The main structural results in this chapter regarding stopping time POMDPs are:
Convexity of stopping region: §12.2 shows that the set of beliefs where it is optimal to apply action 1 (stop) is a convex subset of the belief space. This result unifies several well-known results about the convexity of the stopping set for sequential detection problems.
Monotonicity of the optimal policy: §12.3 gives conditions under which the optimal policy of a stopping time POMDP is monotone with respect to the monotone likelihood ratio (MLR) order. The MLR order is naturally suited for POMDPs since it is preserved under conditional expectations.
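As a rough illustration of the MLR order mentioned above, the sketch below checks the standard pairwise definition (belief p1 MLR-dominates p2 if p1(i) p2(j) ≤ p2(i) p1(j) for all i < j) and then checks that the ordering survives a Bayes' rule update with a common observation likelihood. The function names, the example beliefs, and the likelihood vector are hypothetical choices made for this sketch; they are not taken from the chapter.

```python
from itertools import combinations


def mlr_geq(p1, p2, tol=1e-12):
    """Return True if p1 dominates p2 in the MLR order,
    i.e. p1[i] * p2[j] <= p2[i] * p1[j] for every pair i < j."""
    return all(
        p1[i] * p2[j] <= p2[i] * p1[j] + tol
        for i, j in combinations(range(len(p1)), 2)
    )


def bayes_update(prior, likelihood):
    """Condition a belief on one observation via Bayes' rule.
    likelihood[x] is an assumed P(observation | state x)."""
    unnormalized = [p * l for p, l in zip(prior, likelihood)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Hypothetical three-state beliefs and observation likelihood.
p1 = [0.1, 0.3, 0.6]
p2 = [0.3, 0.4, 0.3]
likelihood = [0.5, 0.2, 0.7]

before = mlr_geq(p1, p2)
after = mlr_geq(bayes_update(p1, likelihood), bayes_update(p2, likelihood))
print(before, after)
```

Both beliefs are multiplied elementwise by the same likelihood, so the pairwise inequalities are scaled by the same positive factors on each side; this is why the ordering is unchanged in the example.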
Figure 12.1 displays these structural results. For X = 2, we will show that the stopping set is the interval [π*, 1] and the optimal policy μ*(π) is a step function; see Figure 12.1(a). So it is only necessary to compute the threshold state π*.
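The step-function policy for X = 2 can be sketched in a few lines. This is only an illustration under stated assumptions: the belief is represented by a single scalar in [0, 1], the threshold value 0.6 is a made-up number (the chapter does not give one here), and `threshold_policy` is a hypothetical helper name.

```python
STOP, CONTINUE = 1, 2  # action labels as in U = {1 (stop), 2 (continue)}


def threshold_policy(belief, threshold):
    """Step-function policy on a scalar belief in [0, 1]:
    stop on the interval [threshold, 1], continue below it."""
    if not 0.0 <= belief <= 1.0:
        raise ValueError("belief must lie in [0, 1]")
    return STOP if belief >= threshold else CONTINUE


# Hypothetical threshold, chosen only for illustration.
pi_star = 0.6
actions = [threshold_policy(b / 10, pi_star) for b in range(11)]
print(actions)
```

Because the policy is fully described by one number, any algorithm that estimates the policy only has to search over the scalar threshold rather than over arbitrary maps from beliefs to actions.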
- Type: Chapter
- Information: Partially Observed Markov Decision Processes: From Filtering to Controlled Sensing, pp. 255-283. Publisher: Cambridge University Press. Print publication year: 2016