Structural results for stopping time POMDPs

Vikram Krishnamurthy

doi:10.1017/CBO9781316471104.016

Introduction

The previous chapter established conditions under which the value function of a POMDP is monotone with respect to the MLR order. Conditions were also given for the optimal policy of a two-state POMDP to be monotone (threshold). This and the next chapter develop structural results for the optimal policy of multi-state POMDPs. To establish these results, we will use submodularity and stochastic dominance on the lattice of belief states to analyze Bellman's dynamic programming equation; such analysis falls under the area of “Lattice Programming” [144]. Lattice programming and “monotone comparative statics”, pioneered by Topkis [322] (see also [15, 26]), provide a general set of sufficient conditions for the existence of monotone strategies. Once a POMDP is shown to have a monotone policy, gradient-based algorithms that exploit this structure can be designed to estimate the policy. This and the next two chapters rely heavily on the structural results for filtering (Chapter 10) and for the monotone value function (Chapter 11). Please see Figure 10.1 on page 220 for the context of this chapter.

Main results

This chapter deals with structural results for the optimal policy of stopping time POMDPs. Stopping time POMDPs have action space U = {1 (stop), 2 (continue)}. They arise in sequential detection problems such as quickest change detection, and in machine replacement. Establishing structural results for stopping time POMDPs is easier than for general POMDPs (which are considered in the next chapter). The main structural results in this chapter regarding stopping time POMDPs are:

Convexity of stopping region: §12.2 shows that the set of beliefs where it is optimal to apply action 1 (stop) is a convex subset of the belief space. This result unifies several well-known results about the convexity of the stopping set for sequential detection problems.
Monotonicity of the optimal policy: §12.3 gives conditions under which the optimal policy of a stopping time POMDP is monotone with respect to the monotone likelihood ratio (MLR) order. The MLR order is naturally suited for POMDPs since it is preserved under conditional expectations.
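
As a small numerical illustration of the MLR order mentioned above, the following Python sketch checks MLR dominance between two belief vectors and then verifies, for one example, that the order survives a Bayesian update with a common observation likelihood. It assumes the standard definition that π1 MLR-dominates π2 when π1(i) π2(j) ≤ π2(i) π1(j) for all i < j. The function names `mlr_dominates` and `bayes_update`, the belief vectors, and the likelihood values are illustrative assumptions, not taken from the chapter.

```python
import numpy as np


def mlr_dominates(p1, p2, tol=1e-12):
    """Return True if p1 >=_r p2 in the MLR order.

    Assumed definition: p1[i] * p2[j] <= p2[i] * p1[j] for all i < j,
    i.e. the ratio p1 / p2 is nondecreasing in the state index.
    """
    p1 = np.asarray(p1, dtype=float)
    p2 = np.asarray(p2, dtype=float)
    cross = np.outer(p1, p2)        # cross[i, j] = p1[i] * p2[j]
    diff = cross - cross.T          # diff[i, j] = p1[i]*p2[j] - p1[j]*p2[i]
    upper = np.triu_indices(len(p1), k=1)   # index pairs with i < j
    return bool(np.all(diff[upper] <= tol))


def bayes_update(prior, likelihood):
    """Posterior proportional to likelihood * prior (elementwise)."""
    post = np.asarray(likelihood, dtype=float) * np.asarray(prior, dtype=float)
    return post / post.sum()


# Hypothetical three-state beliefs; pi_hi / pi_lo is increasing in the state.
pi_hi = [0.1, 0.3, 0.6]
pi_lo = [0.3, 0.4, 0.3]
# Hypothetical likelihood of one observation under each state.
obs_likelihood = [0.5, 0.2, 0.9]

before = mlr_dominates(pi_hi, pi_lo)
after = mlr_dominates(bayes_update(pi_hi, obs_likelihood),
                      bayes_update(pi_lo, obs_likelihood))
print(before, after)
```

The check uses the cross-product form of the definition rather than dividing the two vectors, so beliefs with zero entries do not cause a division by zero.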

Figure 12.1 displays these structural results. For X = 2, we will show that the stopping set is the interval [π*, 1] and the optimal policy μ*(π) is a step function (see Figure 12.1(a)). So it is only necessary to compute the threshold state π*.
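
The step-function form of the two-state policy can be sketched in a few lines of Python. This is a minimal illustration, assuming the belief is summarized by a single scalar in [0, 1] and that the threshold π* is already known; the function name `threshold_policy` and the value 0.7 are hypothetical, and computing the actual threshold is not shown here.

```python
STOP, CONTINUE = 1, 2  # action labels from U = {1 (stop), 2 (continue)}


def threshold_policy(belief, threshold):
    """Step-function policy for a two-state stopping time POMDP.

    `belief` is a scalar in [0, 1]; the stopping set is assumed to be
    the interval [threshold, 1], so the policy stops exactly when the
    belief has reached the threshold.
    """
    if not 0.0 <= belief <= 1.0:
        raise ValueError("belief must lie in [0, 1]")
    return STOP if belief >= threshold else CONTINUE


# Hypothetical threshold, for illustration only.
pi_star = 0.7
actions = [threshold_policy(b, pi_star) for b in (0.0, 0.3, 0.7, 1.0)]
print(actions)
```

Because the stopping set is an interval, the whole policy is described by the single number `pi_star`, which is why only the threshold needs to be estimated.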
