Paper title
Neuroevolution-Based Generation of Tests and Oracles for Games
Paper authors
Paper abstract
Game-like programs have become increasingly popular in many software engineering domains such as mobile apps, web applications, or programming education. However, creating tests for programs that have the purpose of challenging human players is a daunting task for automatic test generators. Even if test generation succeeds in finding a relevant sequence of events to exercise a program, the randomized nature of games means that it may neither be possible to reproduce the exact program behavior underlying this sequence, nor to create test assertions checking if observed randomized game behavior is correct. To overcome these problems, we propose Neatest, a novel test generator based on the NeuroEvolution of Augmenting Topologies (NEAT) algorithm. Neatest systematically explores a program's statements, and creates neural networks that operate the program in order to reliably reach each statement -- that is, Neatest learns to play the game in a way to reliably cover different parts of the code. As the networks learn the actual game behavior, they can also serve as test oracles by evaluating how surprising the observed behavior of a program under test is compared to a supposedly correct version of the program. We evaluate this approach in the context of Scratch, an educational programming environment. Our empirical study on 25 non-trivial Scratch games demonstrates that our approach can successfully train neural networks that are not only far more resilient to random influences than traditional test suites consisting of static input sequences, but are also highly effective with an average mutation score of more than 65%.
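The abstract's contrast between static input sequences and controllers that react to the game state can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the one-dimensional catching game, the names `play`, `chase`, `record_static`, and `replay`, and the hand-written `chase` controller, which merely stands in for a trained network. It is not Neatest, does not implement NEAT, and does not model Scratch.

```python
import random

WIDTH, STEPS = 10, 20  # hypothetical game parameters


def chase(step, player, apple):
    """Reactive controller: looks at the game state and steps toward the apple.
    A hand-written stand-in for a learned network, not a trained model."""
    return (apple > player) - (apple < player)  # -1, 0, or +1


def play(controller, seed):
    """Run one randomized game. Returns True if the 'catch' branch,
    the statement we want a test to cover, is reached."""
    rng = random.Random(seed)
    apple = rng.randrange(WIDTH)  # randomized program behavior
    player = WIDTH // 2
    for step in range(STEPS):
        player += controller(step, player, apple)
        player = max(0, min(WIDTH - 1, player))  # stay on the board
    return player == apple


def record_static(seed):
    """Record the moves that covered the target in one particular run,
    as a conventional test generator producing a static sequence might."""
    moves = []

    def recorder(step, player, apple):
        move = chase(step, player, apple)
        moves.append(move)
        return move

    play(recorder, seed)
    return moves


recorded = record_static(seed=0)


def replay(step, player, apple):
    """Static test: replays the recorded inputs and ignores the game state."""
    return recorded[step]


seeds = range(100)
static_rate = sum(play(replay, s) for s in seeds) / len(seeds)
reactive_rate = sum(play(chase, s) for s in seeds) / len(seeds)
print(f"static sequence covers target in {static_rate:.0%} of runs")
print(f"reactive controller covers target in {reactive_rate:.0%} of runs")
```

The replayed sequence covers the target only when the random apple position happens to match the recorded run, whereas the state-aware controller covers it under every seed. This is the kind of resilience to randomness that the abstract attributes to networks which learn to play the game.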