Progress and limitations of deep networks to recognize objects in unusual poses

Paper title

Progress and limitations of deep networks to recognize objects in unusual poses

Authors

Abbas, Amro; Deny, Stéphane

Abstract

Deep networks should be robust to rare events if they are to be successfully deployed in high-stakes real-world applications (e.g., self-driving cars). Here we study the capability of deep networks to recognize objects in unusual poses. We create a synthetic dataset of images of objects in unusual orientations, and evaluate the robustness of a collection of 38 recent and competitive deep networks for image classification. We show that classifying these images is still a challenge for all networks tested, with an average accuracy drop of 29.5% compared to when the objects are presented upright. This brittleness is largely unaffected by various network design choices, such as training losses (e.g., supervised vs. self-supervised), architectures (e.g., convolutional networks vs. transformers), dataset modalities (e.g., images vs. image-text pairs), and data-augmentation schemes. However, networks trained on very large datasets substantially outperform others, with the best network tested (Noisy Student EfficientNet-L2 trained on JFT-300M) showing a relatively small accuracy drop of only 14.5% on unusual poses. Nevertheless, a visual inspection of the failures of Noisy Student reveals a remaining gap in robustness relative to the human visual system. Furthermore, combining multiple object transformations (3D rotations and scaling) further degrades the performance of all networks. Altogether, our results provide another measurement of the robustness of deep networks that is important to consider when using them in the real world. Code and datasets are available at https://github.com/amro-kamal/ObjectPose.
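
The accuracy drop quoted in the abstract can be illustrated with a minimal sketch. It assumes the drop is the difference in top-1 accuracy, in percentage points, between upright and unusually posed renderings of the same objects; the abstract does not spell out the exact definition. The function names and toy predictions below are hypothetical and are not taken from the authors' repository.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], labels: Sequence[str]) -> float:
    """Return top-1 accuracy in percent."""
    if len(predictions) != len(labels) or len(labels) == 0:
        raise ValueError("predictions and labels must be non-empty and of equal length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return 100.0 * correct / len(labels)


def accuracy_drop(upright_preds: Sequence[str],
                  posed_preds: Sequence[str],
                  labels: Sequence[str]) -> float:
    """Assumed definition: upright accuracy minus unusual-pose accuracy, in points."""
    return accuracy(upright_preds, labels) - accuracy(posed_preds, labels)


# Toy data (hypothetical): the same four objects, classified upright and rotated.
labels = ["chair", "car", "mug", "lamp"]
upright_preds = ["chair", "car", "mug", "lamp"]
posed_preds = ["chair", "boat", "mug", "hat"]

drop = accuracy_drop(upright_preds, posed_preds, labels)
print(f"accuracy drop: {drop:.1f} points")
```

Under this assumed definition, averaging the per-network drop over all evaluated networks would give a single summary figure of the kind reported above.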

Paper title