Paper Title
Pix2Shape: Towards Unsupervised Learning of 3D Scenes from Images using a View-based Representation
Paper Authors
Paper Abstract
We infer and generate three-dimensional (3D) scene information from a single input image and without supervision. This problem is under-explored, with most prior work relying on supervision from, e.g., 3D ground-truth, multiple images of a scene, image silhouettes, or key-points. We propose Pix2Shape, an approach to solve this problem with four components: (i) an encoder that infers the latent 3D representation from an image, (ii) a decoder that generates an explicit 2.5D surfel-based reconstruction of a scene from the latent code, (iii) a differentiable renderer that synthesizes a 2D image from the surfel representation, and (iv) a critic network trained to discriminate between images generated by the decoder-renderer and those from a training distribution. Pix2Shape can generate complex 3D scenes that scale with the view-dependent on-screen resolution, unlike representations that capture world-space resolution, i.e., voxels or meshes. We show that Pix2Shape learns a consistent scene representation in its encoded latent space and that the decoder can then be applied to this latent representation in order to synthesize the scene from a novel viewpoint. We evaluate Pix2Shape with experiments on the ShapeNet dataset as well as on a novel benchmark we developed, called 3D-IQTT, to evaluate models based on their ability to enable 3D spatial reasoning. Qualitative and quantitative evaluations demonstrate Pix2Shape's ability to solve scene reconstruction, generation, and understanding tasks.
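The four-component pipeline described in the abstract (encoder → latent code → surfel decoder → differentiable renderer, with a critic scoring the result) can be sketched as follows. This is a minimal illustrative NumPy mock-up, not the paper's implementation: the networks are replaced by random linear maps, the renderer by toy Lambertian shading, and all dimensions (32×32 images, a 64-dim latent) are hypothetical choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper): 32x32 images, 64-dim latent.
H = W = 32
LATENT = 64

# (i) Encoder: image -> latent 3D code (a linear map as a stand-in for a CNN).
W_enc = rng.standard_normal((LATENT, H * W)) * 0.01
def encode(image):
    return np.tanh(W_enc @ image.ravel())

# (ii) Decoder: latent code -> per-pixel 2.5D surfels (depth + unit normal).
# One surfel per on-screen pixel, so detail scales with view resolution
# rather than with a fixed world-space grid (unlike voxels or meshes).
W_dec = rng.standard_normal((H * W * 4, LATENT)) * 0.01
def decode(z):
    out = (W_dec @ z).reshape(H, W, 4)
    depth = np.exp(out[..., 0])                    # strictly positive depth
    normals = out[..., 1:]
    normals = normals / (np.linalg.norm(normals, axis=-1, keepdims=True) + 1e-8)
    return depth, normals

# (iii) Differentiable renderer: Lambertian shading of the surfel map under a
# fixed directional light (a toy stand-in for the paper's renderer).
light = np.array([0.0, 0.0, 1.0])
def render(depth, normals):
    shading = np.clip(normals @ light, 0.0, 1.0)
    return shading / (1.0 + depth)                 # simple depth attenuation

# (iv) Critic: scores an image as real vs. rendered (linear stand-in).
w_critic = rng.standard_normal(H * W) * 0.01
def critic(image):
    return float(w_critic @ image.ravel())

# One forward pass of the full pipeline on a random "input image".
img = rng.random((H, W))
z = encode(img)
depth, normals = decode(z)
fake = render(depth, normals)
score = critic(fake)
```

In training, the critic's score would drive an adversarial loss so that rendered images become indistinguishable from the training distribution; here the pass only illustrates how the four components compose.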