An Adaptive Deep RL Method for Non-Stationary Environments with Piecewise Stable Context

Paper Title

An Adaptive Deep RL Method for Non-Stationary Environments with Piecewise Stable Context

Authors

Xiaoyu Chen, Xiangming Zhu, Yufeng Zheng, Pushi Zhang, Li Zhao, Wenxue Cheng, Peng Cheng, Yongqiang Xiong, Tao Qin, Jianyu Chen, Tie-Yan Liu

Abstract

One of the key challenges in deploying RL to real-world applications is adapting to variations in unknown environment contexts, such as changing terrain in robotic tasks and fluctuating bandwidth in congestion control. Existing work on adaptation to unknown environment contexts either assumes the context stays the same for the whole episode or assumes the context variables are Markovian. However, in many real-world applications, the environment context usually stays stable for a stochastic period and then changes abruptly and unpredictably within an episode, resulting in a segment structure that existing work fails to address. To leverage the segment structure of piecewise stable context, we propose a Segmented Context Belief Augmented Deep (SeCBAD) RL method. Our method jointly infers the belief distribution over the latent context and the posterior over segment length, and performs more accurate belief context inference using the data observed within the current context segment. The inferred belief context can be used to augment the state, leading to a policy that can adapt to abrupt variations in context. We demonstrate empirically that SeCBAD infers context segment length accurately and outperforms existing methods on a toy grid world environment and MuJoCo tasks with piecewise-stable context.
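To make the idea of a posterior over segment length concrete, the following is a minimal sketch and not the SeCBAD algorithm itself. It swaps in classical Bayesian online changepoint detection on a toy problem, assuming a scalar latent context, Gaussian observation noise with known variance, a Gaussian prior on the context, and a constant per-step change probability. The names `run_length_posterior`, `hazard`, `obs_var`, `prior_mean`, and `prior_var` are hypothetical and do not come from the paper.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def run_length_posterior(observations, hazard=0.05, obs_var=1.0,
                         prior_mean=0.0, prior_var=10.0):
    """Jointly track P(segment length) and a Gaussian belief over the context.

    Assumption (illustrative only): each observation is the latent context
    plus Gaussian noise, and the context is redrawn from the prior with
    probability `hazard` at every step.

    Returns (posteriors, means, variances). posteriors[t][r] is the
    probability that the current segment holds r observations after step t;
    means[r] and variances[r] describe the context belief for that
    hypothesis after the final step.
    """
    means, variances, probs = [prior_mean], [prior_var], [1.0]
    posteriors = []
    for x in observations:
        # Predictive likelihood of x under each segment-length hypothesis.
        pred = [gaussian_pdf(x, m, v + obs_var) for m, v in zip(means, variances)]
        # Segment continues: length r becomes r + 1.
        growth = [p * l * (1.0 - hazard) for p, l in zip(probs, pred)]
        # Context changes: a fresh segment of length 0 begins.
        change = sum(p * l * hazard for p, l in zip(probs, pred))
        unnormalized = [change] + growth
        total = sum(unnormalized)
        probs = [p / total for p in unnormalized]
        # Conjugate Gaussian update of the context belief for continued segments.
        new_means, new_vars = [prior_mean], [prior_var]
        for m, v in zip(means, variances):
            post_var = 1.0 / (1.0 / v + 1.0 / obs_var)
            new_means.append(post_var * (m / v + x / obs_var))
            new_vars.append(post_var)
        means, variances = new_means, new_vars
        posteriors.append(list(probs))
    return posteriors, means, variances


# Toy data: the context is 0.0 for 20 steps, then jumps abruptly to 5.0.
data = [0.0] * 20 + [5.0] * 20
posteriors, means, variances = run_length_posterior(data)
final = posteriors[-1]
map_length = max(range(len(final)), key=final.__getitem__)
print(map_length, round(means[map_length], 3))
```

In this sketch the belief for each hypothesis uses only the observations assigned to the current segment, which mirrors the abstract's point that inference restricted to the current context segment is more accurate than pooling data across a change.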
