Paper Title
Solving infinite-horizon POMDPs with memoryless stochastic policies in state-action space
Paper Authors
Paper Abstract
Reward optimization in fully observable Markov decision processes is equivalent to a linear program over the polytope of state-action frequencies. Taking a similar perspective in the case of partially observable Markov decision processes with memoryless stochastic policies, the problem was recently formulated as the optimization of a linear objective subject to polynomial constraints. Based on this, we present an approach for Reward Optimization in State-Action space (ROSA). We test this approach experimentally in maze navigation tasks. We find that ROSA is computationally efficient and can yield stability improvements over other existing methods.
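As background for the formulation the abstract refers to, the linear program over state-action frequencies for a discounted MDP can be sketched as follows; this is the standard dual LP, and the notation ($\mu$ for the state-action frequencies, $r$ for the reward, $P$ for the transition kernel, $\gamma$ for the discount factor, $\rho$ for the initial state distribution) is ours rather than taken from the paper:

$$
\max_{\mu \ge 0} \; \sum_{s,a} r(s,a)\,\mu(s,a)
\quad \text{s.t.} \quad
\sum_{a} \mu(s',a) \;=\; (1-\gamma)\,\rho(s') \;+\; \gamma \sum_{s,a} P(s' \mid s,a)\,\mu(s,a) \quad \forall s'.
$$

In the partially observable case with memoryless stochastic policies, the feasible region is no longer this polytope: requiring the policy induced by $\mu$ to depend on the state only through its observation adds polynomial (non-linear) constraints on $\mu$, which is the constrained formulation the abstract builds on.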