Paper Title
Recurrent Sum-Product-Max Networks for Decision Making in Perfectly-Observed Environments
Paper Authors
Paper Abstract
Recent investigations into sum-product-max networks (SPMNs), which generalize sum-product networks (SPNs), offer a data-driven alternative for decision making, a field that has predominantly relied on handcrafted models. SPMNs computationally represent a probabilistic decision-making problem whose solution scales linearly in the size of the network. However, SPMNs are not well suited for sequential decision making over multiple time steps. In this paper, we present recurrent SPMNs (RSPMNs) that learn from and model decision-making data over time. RSPMNs utilize a template network that is unfolded as needed depending on the length of the data sequence. This is significant because RSPMNs not only inherit the benefits of SPMNs in being data driven and mostly tractable, but are also well suited for sequential problems. We establish conditions on the template network that guarantee the resulting SPMN is valid, and we present a structure learning algorithm to learn a sound template network. We demonstrate that RSPMNs learned on a testbed of sequential decision-making data sets generate MEUs and policies that are close to optimal on perfectly-observed domains. They readily improve on a recent batch-constrained reinforcement learning method, which is important because RSPMNs offer a new model-based approach to offline reinforcement learning.