Paper Title
Adversarial Policies Beat Superhuman Go AIs
Paper Authors
Paper Abstract
We attack the state-of-the-art Go-playing AI system KataGo by training adversarial policies against it, achieving a >97% win rate against KataGo running at superhuman settings. Our adversaries do not win by playing Go well. Instead, they trick KataGo into making serious blunders. Our attack transfers zero-shot to other superhuman Go-playing AIs, and is comprehensible to the extent that human experts can implement it without algorithmic assistance to consistently beat superhuman AIs. The core vulnerability uncovered by our attack persists even in KataGo agents adversarially trained to defend against our attack. Our results demonstrate that even superhuman AI systems may harbor surprising failure modes. Example games are available at https://goattack.far.ai/.