Paper Title
Reinforcement Learning for Linear Quadratic Control is Vulnerable Under Cost Manipulation
Paper Authors
Paper Abstract
In this work, we study the deception of a Linear-Quadratic-Gaussian (LQG) agent by manipulating the cost signals. We show that a small falsification of the cost parameters will only lead to a bounded change in the optimal policy, and the bound is linear in the amount of falsification the attacker can apply to the cost parameters. We propose an attack model in which the attacker aims to mislead the agent into learning a 'nefarious' policy by intentionally falsifying the cost parameters. We formulate the attacker's problem as a convex optimization problem and develop necessary and sufficient conditions for checking the achievability of the attacker's goal. We showcase the adversarial manipulation on two types of LQG learners: a batch RL learner and an adaptive dynamic programming (ADP) learner. Our results demonstrate that with only 2.296% falsification of the cost data, the attacker misleads the batch RL learner into learning a 'nefarious' policy that drives the vehicle to a dangerous position. The attacker can also gradually trick the ADP learner into learning the same 'nefarious' policy by consistently feeding it a falsified cost signal that stays close to the actual cost signal. The paper aims to raise awareness of the security threats faced by RL-enabled control systems.
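The bounded-change claim can be illustrated with a minimal numerical sketch. The system below is a hypothetical discrete-time double integrator, not the paper's vehicle model, and the fixed-point Riccati iteration stands in for whatever solver the learners use: a small perturbation of the state-cost matrix Q produces only a small change in the optimal feedback gain.

```python
import numpy as np

def lqr_gain(A, B, Q, R, iters=500):
    """Optimal LQR gain K (u = -K x) via fixed-point iteration
    on the discrete-time Riccati equation."""
    P = Q.copy()
    for _ in range(iters):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
    return np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)

# Hypothetical double-integrator dynamics (illustrative, not from the paper).
A = np.array([[1.0, 0.1],
              [0.0, 1.0]])
B = np.array([[0.0],
              [0.1]])
Q = np.eye(2)          # true state cost
R = np.array([[1.0]])  # true control cost

K = lqr_gain(A, B, Q, R)

# A small falsification of the cost parameter Q ...
eps = 0.01
K_falsified = lqr_gain(A, B, Q + eps * np.eye(2), R)

# ... yields only a small change in the optimal policy.
print(np.linalg.norm(K_falsified - K))
```

The abstract's attack works in the opposite direction: rather than bounding the effect of a small perturbation, the attacker solves a convex program for the smallest cost falsification that makes a chosen 'nefarious' gain optimal.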