ABSTRACT

Compared with unimodal deep learning algorithms that process 3D point clouds directly, multi-modal fusion algorithms that leverage 2D images as supplementary information offer performance advantages. In this work, the performance of the open-source multi-modal algorithm MVPNet on the 3D semantic segmentation task is improved by using KPConv as a more robust 3D backbone. Modules of the two networks are combined in a complementary way: the 2D-3D lifting method provided by MVPNet aggregates selected 2D image features onto the 3D point cloud, and KPConv then fuses these features with geometric information to make predictions. On a subset of the ScanNet dataset, the proposed network significantly outperforms both the original MVPNet and KPConv, regardless of the fusion structure. By integrating COLMAP into the workflow, we further extend the proposed method to a custom dataset. The results demonstrate the improved ability of our multi-modal fusion algorithm to identify relevant object categories in 3D scenes.