Paper Title

Real-Time High-Performance Semantic Image Segmentation of Urban Street Scenes

Paper Authors

Genshun Dong, Yan Yan, Chunhua Shen, Hanzi Wang

Paper Abstract

Deep Convolutional Neural Networks (DCNNs) have recently shown outstanding performance in semantic image segmentation. However, state-of-the-art DCNN-based semantic segmentation methods usually suffer from high computational complexity due to their complex network architectures, which greatly limits their application in real-world scenarios that require real-time processing. In this paper, we propose a real-time, high-performance DCNN-based method for robust semantic segmentation of urban street scenes, which achieves a good trade-off between accuracy and speed. Specifically, a Lightweight Baseline Network with Atrous convolution and Attention (LBN-AA) is first used as our baseline network to efficiently obtain dense feature maps. Then, a Distinctive Atrous Spatial Pyramid Pooling (DASPP) module, which exploits pooling operations of different sizes to encode rich and distinctive semantic information, is developed to detect objects at multiple scales. Meanwhile, a Spatial detail-Preserving Network (SPN) with shallow convolutional layers is designed to generate high-resolution feature maps that preserve detailed spatial information. Finally, a simple but practical Feature Fusion Network (FFN) effectively combines the deep features from the semantic branch (DASPP) with the shallow features from the spatial branch (SPN). Extensive experimental results show that the proposed method achieves 73.6% and 68.0% mean Intersection over Union (mIoU) at inference speeds of 51.0 fps and 39.3 fps on the challenging Cityscapes and CamVid test datasets, respectively (using only a single NVIDIA TITAN X card). This demonstrates that the proposed method offers excellent performance at real-time speed for semantic segmentation of urban street scenes.
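To make the two-branch design described in the abstract concrete, below is a minimal PyTorch-style sketch: a downsampled semantic branch ending in an ASPP-like context module (parallel atrous convolutions plus average pooling at different window sizes), a shallow high-resolution spatial branch, and a simple fusion head. The module names (`DASPPLike`, `TwoBranchSegNet`, `conv_bn_relu`), channel widths, dilation rates, and pooling sizes are illustrative assumptions, not the authors' exact LBN-AA/DASPP/SPN/FFN configuration.

```python
# A minimal, self-contained sketch of the two-branch design summarized in the
# abstract. All channel widths, dilation rates, and pooling sizes below are
# illustrative assumptions, not the paper's settings.
import torch
import torch.nn as nn
import torch.nn.functional as F


def conv_bn_relu(in_ch, out_ch, k=3, stride=1, dilation=1):
    """Conv -> BatchNorm -> ReLU with 'same'-style padding."""
    pad = dilation * (k // 2)
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, stride, pad, dilation=dilation, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )


class DASPPLike(nn.Module):
    """ASPP-style context module: parallel atrous branches plus average-pooling
    branches with different window sizes, concatenated and projected."""

    def __init__(self, in_ch, out_ch, rates=(3, 6, 12), pool_sizes=(3, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            [conv_bn_relu(in_ch, out_ch, k=1)]
            + [conv_bn_relu(in_ch, out_ch, k=3, dilation=r) for r in rates]
            + [nn.Sequential(nn.AvgPool2d(p, stride=1, padding=p // 2),
                             conv_bn_relu(in_ch, out_ch, k=1)) for p in pool_sizes]
        )
        self.project = conv_bn_relu(len(self.branches) * out_ch, out_ch, k=1)

    def forward(self, x):
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))


class TwoBranchSegNet(nn.Module):
    """Semantic branch (context-rich, 1/8 resolution) + spatial branch
    (shallow, 1/4 resolution), fused and upsampled to per-pixel logits."""

    def __init__(self, num_classes=19):
        super().__init__()
        # Semantic branch: a stand-in lightweight backbone ending with an
        # atrous convolution that enlarges the receptive field at 1/8 resolution.
        self.backbone = nn.Sequential(
            conv_bn_relu(3, 32, stride=2),
            conv_bn_relu(32, 64, stride=2),
            conv_bn_relu(64, 128, stride=2),
            conv_bn_relu(128, 128, dilation=2),
        )
        self.context = DASPPLike(128, 128)
        # Spatial branch: a few shallow convolutions keeping 1/4 resolution.
        self.spatial = nn.Sequential(
            conv_bn_relu(3, 32, stride=2),
            conv_bn_relu(32, 64, stride=2),
        )
        # Simple fusion: upsample, concatenate, project, classify.
        self.fuse = conv_bn_relu(128 + 64, 128, k=1)
        self.classifier = nn.Conv2d(128, num_classes, 1)

    def forward(self, x):
        sem = self.context(self.backbone(x))    # 1/8 resolution, rich context
        spa = self.spatial(x)                   # 1/4 resolution, spatial detail
        sem = F.interpolate(sem, size=spa.shape[2:], mode="bilinear",
                            align_corners=False)
        out = self.classifier(self.fuse(torch.cat([sem, spa], dim=1)))
        return F.interpolate(out, size=x.shape[2:], mode="bilinear",
                             align_corners=False)


if __name__ == "__main__":
    # 19 classes matches the Cityscapes evaluation protocol.
    logits = TwoBranchSegNet(num_classes=19)(torch.randn(1, 3, 512, 1024))
    print(logits.shape)  # -> torch.Size([1, 19, 512, 1024])
```

The sketch only illustrates the data flow (context aggregation in one branch, detail preservation in the other, then fusion); the paper's reported accuracy and speed depend on its specific lightweight backbone, attention modules, and training setup.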
