Paper Title

Continuous Motion Planning with Temporal Logic Specifications using Deep Neural Networks

Authors

Chuanzheng Wang, Yinan Li, Stephen L. Smith, Jun Liu

Abstract

In this paper, we propose a model-free reinforcement learning method to synthesize control policies for motion planning problems with continuous states and actions. The robot is modelled as a labeled discrete-time Markov decision process (MDP) with continuous state and action spaces. Linear temporal logic (LTL) is used to specify high-level tasks. We then train deep neural networks to approximate the value function and policy using an actor-critic reinforcement learning method. The LTL specification is converted into an annotated limit-deterministic Büchi automaton (LDBA) for continuously shaping the reward so that dense rewards are available during training. A naïve way of solving a motion planning problem with LTL specifications using reinforcement learning is to sample a trajectory and then assign a high reward for training if the trajectory satisfies the entire LTL formula. However, the sample complexity needed to find such a trajectory is prohibitively high for complex LTL formulas over continuous state and action spaces. As a result, it is very unlikely that we obtain enough reward for training if all sample trajectories start from the initial state of the automaton. In this paper, we propose a method that samples not only an initial state from the state space, but also an arbitrary state in the automaton at the beginning of each training episode. We test our algorithm in simulation using a car-like robot and find that our method successfully learns policies for different working configurations and LTL specifications.
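The central idea, sampling an arbitrary automaton state alongside a random robot state at the start of each episode so that dense shaped rewards from the LDBA are encountered early in training, can be sketched as follows. This is a minimal illustration under assumed interfaces, not the authors' implementation: the names sample_robot_state, ldba.states, ldba.transition, shaped_reward, agent.policy, and agent.update are all hypothetical.

```python
import random

# Minimal sketch of the paper's episode-initialization idea: start each
# episode from a random continuous robot state AND an arbitrary LDBA state
# (not just the automaton's initial state), so the actor-critic agent sees
# shaped rewards near accepting states from the very first episodes.
# All objects and method names below are assumptions for illustration.

def run_episode(env, ldba, agent, max_steps=200):
    x = env.sample_robot_state()           # random continuous initial state
    q = random.choice(ldba.states)         # arbitrary automaton state, not just q0
    for _ in range(max_steps):
        a = agent.policy(x, q)             # actor network conditioned on (x, q)
        x_next = env.step(x, a)            # continuous MDP dynamics
        q_next = ldba.transition(q, env.label(x_next))  # automaton follows labels
        r = ldba.shaped_reward(q, q_next)  # dense reward from the annotated LDBA
        agent.update(x, q, a, r, x_next, q_next)        # actor-critic update
        x, q = x_next, q_next
```

Conditioning the policy on the product state (x, q) is what lets a single network represent task progress: the same robot state can demand different actions depending on which stage of the LTL task the automaton is in.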
