Paper Title
Efficient Learning in Non-Stationary Linear Markov Decision Processes
Paper Authors
Paper Abstract
We study episodic reinforcement learning in non-stationary linear (a.k.a. low-rank) Markov Decision Processes (MDPs), i.e., both the reward and transition kernel are linear with respect to a given feature map and are allowed to evolve either slowly or abruptly over time. For this problem setting, we propose OPT-WLSVI, an optimistic model-free algorithm based on weighted least-squares value iteration which uses exponential weights to smoothly forget data that are far in the past. We show that our algorithm, when competing against the best policy at each time, achieves a regret that is upper bounded by $\widetilde{\mathcal{O}}(d^{5/4} H^2 \Delta^{1/4} K^{3/4})$, where $d$ is the dimension of the feature space, $H$ is the planning horizon, $K$ is the number of episodes, and $\Delta$ is a suitable measure of the non-stationarity of the MDP. Moreover, we point out technical gaps in the analysis of forgetting strategies for non-stationary linear bandits in previous works, and we propose a fix to their regret analysis.
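The exponential forgetting described in the abstract can be illustrated with a minimal sketch. The snippet below computes an exponentially weighted ridge-regression estimate in which older samples receive geometrically decaying weights, so the fitted parameter tracks a slowly drifting linear target. The function name `weighted_ridge`, the discount `gamma`, and the regularizer `lam` are illustrative assumptions, not the paper's exact specification of OPT-WLSVI.

```python
import numpy as np


def weighted_ridge(Phi, y, gamma=0.99, lam=1.0):
    """Exponentially weighted least-squares (ridge) estimate.

    A sketch of the forgetting mechanism: sample i (0 = oldest) gets weight
    gamma**(n-1-i), so the most recent observation has weight 1 and past data
    are smoothly discounted. The exact weighting and regularization used by
    OPT-WLSVI may differ from this simplified version.
    """
    n, d = Phi.shape
    w = gamma ** np.arange(n - 1, -1, -1)                   # decaying sample weights
    Lambda = Phi.T @ (w[:, None] * Phi) + lam * np.eye(d)   # weighted Gram matrix
    theta = np.linalg.solve(Lambda, Phi.T @ (w * y))        # weighted LS solution
    return theta, Lambda


# Example usage: fit a linear target from noisy observations.
rng = np.random.default_rng(0)
Phi = rng.normal(size=(200, 5))
theta_true = np.linspace(0.0, 1.0, 5)
y = Phi @ theta_true + 0.1 * rng.normal(size=200)
theta_hat, _ = weighted_ridge(Phi, y)
```

With `gamma` close to 1 the estimator behaves like ordinary ridge regression; smaller values of `gamma` forget faster, which is the trade-off the regret bound balances against the non-stationarity measure $\Delta$.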