Title
Value Alignment Verification
Authors
Abstract
As humans interact with autonomous agents to perform increasingly complicated, potentially risky tasks, it is important to be able to efficiently evaluate an agent's performance and correctness. In this paper we formalize and theoretically analyze the problem of efficient value alignment verification: how to efficiently test whether the behavior of another agent is aligned with a human's values. The goal is to construct a kind of "driver's test" that a human can give to any agent, verifying value alignment via a minimal number of queries. We study alignment verification problems both with idealized humans that have an explicit reward function and with humans whose values are implicit. We analyze verification of exact value alignment for rational agents, and we propose and analyze heuristic and approximate value alignment verification tests in a wide range of gridworlds and a continuous autonomous driving domain. Finally, we prove that there exist sufficient conditions under which we can verify exact and approximate alignment across an infinite set of test environments via a constant-query-complexity alignment test.
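The "driver's test" idea can be made concrete with a toy sketch. This is not the paper's actual construction: the gridworld, the human's reward function, the agent policies, and the query set below are all invented for illustration. The tester queries the agent's chosen action at a handful of test states and passes the agent only if every answer is optimal under the human's reward, so the number of queries equals the size of the test set.

```python
# Toy illustration (assumed setup, not the paper's algorithm): verify value
# alignment of a rational agent in a small gridworld by querying its action
# at a few test states and checking each against the set of actions that are
# optimal under the human's reward function.

ROWS, COLS = 3, 3
GOAL = (2, 2)  # the human's reward is 1 for reaching this cell, 0 elsewhere
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
GAMMA = 0.9

def step(state, action):
    """Deterministic transition; moves off the grid leave the state unchanged."""
    r, c = state
    dr, dc = ACTIONS[action]
    return (max(0, min(ROWS - 1, r + dr)), max(0, min(COLS - 1, c + dc)))

def human_reward(state):
    return 1.0 if state == GOAL else 0.0

def value_iteration(reward_fn, iters=100):
    """Compute state values under the human's reward (goal is absorbing)."""
    V = {(r, c): 0.0 for r in range(ROWS) for c in range(COLS)}
    for _ in range(iters):
        for s in V:
            if s == GOAL:
                continue
            V[s] = max(reward_fn(step(s, a)) + GAMMA * V[step(s, a)]
                       for a in ACTIONS)
    return V

def optimal_actions(V, reward_fn, s):
    """All actions whose Q-value ties the maximum at state s."""
    q = {a: reward_fn(step(s, a)) + GAMMA * V[step(s, a)] for a in ACTIONS}
    best = max(q.values())
    return {a for a, v in q.items() if abs(v - best) < 1e-9}

def verify_alignment(agent_policy, test_states):
    """Pass iff the agent's action is optimal under the human's reward at
    every queried test state; query count = len(test_states)."""
    V = value_iteration(human_reward)
    return all(agent_policy(s) in optimal_actions(V, human_reward, s)
               for s in test_states)

# Hypothetical agents: one goal-seeking, one that always moves up.
aligned = lambda s: "down" if s[0] < GOAL[0] else "right"
misaligned = lambda s: "up"

queries = [(0, 0), (1, 2), (2, 0)]  # a small, hand-picked test set
print(verify_alignment(aligned, queries))     # → True
print(verify_alignment(misaligned, queries))  # → False
```

In this toy setting three queries suffice to separate the two policies; the paper's contribution concerns when such small (even constant-size) tests provably exist and how to construct them.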