MASSTAR: A Multi-Modal Large-Scale Scene Dataset with a Versatile Toolchain for Surface Prediction and Completion

Submitted to IROS 2024

Guiyong Zheng1,4,*, Jinqi Jiang1,3,*, Chen Feng2,*, Shaojie Shen2, Boyu Zhou1,†

1 Sun Yat-Sen University.    2 The Hong Kong University of Science and Technology.   
3 Harbin Institute of Technology.    4 Xidian University.   
*Equal Contribution    †Corresponding Author

Abstract


We propose MASSTAR, a multi-modal large-scale scene dataset with a versatile toolchain for surface prediction and completion.
Overview

Surface prediction and completion have been widely studied in various applications. Recently, research on surface completion has evolved from small objects to complex large-scale scenes. As a result, researchers have begun to increase the volume of data and to leverage a greater variety of data modalities, including rendered RGB images, descriptive texts, depth images, etc., to enhance algorithm performance. However, existing datasets suffer from a shortage of scene-level models and of the corresponding multi-modal information. Therefore, a method to efficiently scale such datasets and generate the multi-modal information within them is essential. To bridge this research gap, we propose MASSTAR: a multi-modal large-scale scene dataset with a versatile toolchain for surface prediction and completion. We develop a versatile and efficient toolchain for processing raw 3D data from the environments. It screens out a set of fine-grained scene models and generates the corresponding multi-modal data. Using the toolchain, we generate an example dataset composed of over a thousand scene-level models, augmented with partial real-world data. We compare MASSTAR with existing datasets, which validates its superiority: the ability to efficiently extract high-quality models from complex scenarios to expand the dataset. Additionally, several representative surface completion algorithms are benchmarked on MASSTAR, revealing that existing algorithms can hardly handle scene-level completion.

Multi-modal Dataset Generation Toolchain


A. 3D Scene Segmentation

An overview of 3D scene segmentation. We first render a bird's-eye view of each scene to generate a depth image and an RGB image. Users can employ SAM to segment the top-view images in either manual or automatic mode. The 3D mesh model is then sliced using Blender, and CLIP is used to filter out non-architectural categories.
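A minimal sketch of the segment-then-filter idea, assuming the segment-anything and CLIP Python packages; the checkpoint path, image path, category prompts, and thresholding are illustrative assumptions, and the Blender mesh-slicing step is omitted.

# Sketch: segment a rendered top-view image with SAM, then use CLIP to keep
# only segments that look architectural. Paths, prompts and the decision rule
# are illustrative assumptions, not the exact toolchain configuration.
import numpy as np
import torch
import clip
from PIL import Image
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

device = "cuda" if torch.cuda.is_available() else "cpu"

# 1) Automatic-mode SAM segmentation of the bird's-eye-view rendering.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth").to(device)
mask_generator = SamAutomaticMaskGenerator(sam)
image = np.array(Image.open("top_view.png").convert("RGB"))
masks = mask_generator.generate(image)

# 2) CLIP zero-shot filtering of each segment against text prompts.
clip_model, preprocess = clip.load("ViT-B/32", device=device)
prompts = ["a building", "a road", "vegetation", "water"]
text_tokens = clip.tokenize(prompts).to(device)

architectural_masks = []
for m in masks:
    x, y, w, h = m["bbox"]
    crop = Image.fromarray(image[y:y + h, x:x + w])
    with torch.no_grad():
        logits_per_image, _ = clip_model(preprocess(crop).unsqueeze(0).to(device), text_tokens)
        probs = logits_per_image.softmax(dim=-1).cpu().numpy()[0]
    if probs.argmax() == 0:          # best match is "a building"
        architectural_masks.append(m)

# The kept masks would then drive the Blender-based slicing of the 3D mesh.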

B. Image Rendering

An example of the image rendering part of the toolchain. We offer a random mode (left) and a trajectory mode (right).
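As a rough illustration of the two modes, the sketch below generates look-at camera poses with NumPy: random viewpoints on a hemisphere around the scene for the random mode, and poses interpolated along user waypoints for the trajectory mode. The radius, elevation range, and waypoints are placeholder values, and the actual renderer is not shown.

# Sketch: camera extrinsics for the two rendering modes. All numbers
# (radius, elevation range, waypoints) are illustrative placeholders.
import numpy as np

def look_at(eye, target, up=np.array([0.0, 0.0, 1.0])):
    """Return a 4x4 camera-to-world pose whose +z axis points at the target."""
    forward = target - eye
    forward /= np.linalg.norm(forward)
    right = np.cross(forward, up)
    right /= np.linalg.norm(right)
    true_up = np.cross(right, forward)
    pose = np.eye(4)
    pose[:3, 0], pose[:3, 1], pose[:3, 2], pose[:3, 3] = right, true_up, forward, eye
    return pose

def random_mode(center, radius=30.0, n_views=64, rng=np.random.default_rng(0)):
    """Random viewpoints on a hemisphere, all looking at the scene center."""
    poses = []
    for _ in range(n_views):
        azim = rng.uniform(0.0, 2.0 * np.pi)
        elev = rng.uniform(np.deg2rad(15), np.deg2rad(75))
        eye = center + radius * np.array(
            [np.cos(elev) * np.cos(azim), np.cos(elev) * np.sin(azim), np.sin(elev)])
        poses.append(look_at(eye, center))
    return poses

def trajectory_mode(waypoints, center, steps_per_leg=10):
    """Viewpoints interpolated along a user-specified waypoint trajectory."""
    poses = []
    for a, b in zip(waypoints[:-1], waypoints[1:]):
        for t in np.linspace(0.0, 1.0, steps_per_leg, endpoint=False):
            poses.append(look_at((1 - t) * a + t * b, center))
    return poses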

C. Descriptive Texts Generation

An example of the descriptive text generation part of the toolchain. BLIP is employed to perform zero-shot image-to-text generation.
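A minimal captioning sketch, assuming BLIP is accessed through the Hugging Face transformers checkpoints; the model name and image path are illustrative, and the toolchain may load BLIP differently.

# Sketch: zero-shot image captioning with BLIP via Hugging Face transformers.
# The checkpoint name and image path are assumptions for illustration.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("rendered_view.png").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=40)
caption = processor.decode(output_ids[0], skip_special_tokens=True)
print(caption)   # a short natural-language description of the rendered view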

D. Partial Point Clouds Generation

An example of the partial point cloud rendering part of the toolchain.
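One common way to produce such partial point clouds is to ray-cast the mesh from a virtual pinhole camera and back-project the hit points. The sketch below uses Open3D's RaycastingScene as an assumed stand-in for the toolchain's renderer, with an illustrative mesh path and camera parameters.

# Sketch: render a partial point cloud of a mesh from a single viewpoint by
# ray casting. Open3D is an assumed stand-in for the toolchain's renderer;
# the mesh path and camera parameters are illustrative.
import open3d as o3d

mesh = o3d.io.read_triangle_mesh("scene_model.obj")
scene = o3d.t.geometry.RaycastingScene()
scene.add_triangles(o3d.t.geometry.TriangleMesh.from_legacy(mesh))

# Pinhole camera looking at the scene origin from an arbitrary viewpoint.
rays = o3d.t.geometry.RaycastingScene.create_rays_pinhole(
    fov_deg=60.0,
    center=[0.0, 0.0, 0.0],      # look-at point
    eye=[30.0, 30.0, 20.0],      # camera position
    up=[0.0, 0.0, 1.0],
    width_px=640,
    height_px=480,
)

ans = scene.cast_rays(rays)
hit = ans["t_hit"].isfinite()
# Back-project: origin + t * direction for every ray that hit the surface.
points = rays[hit][:, :3] + rays[hit][:, 3:] * ans["t_hit"][hit].reshape((-1, 1))

partial = o3d.t.geometry.PointCloud(points)
o3d.io.write_point_cloud("partial_cloud.ply", partial.to_legacy())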

Toolchain Tested on Existing Datasets



Examples of independently processing existing datasets with the proposed toolchain.

Examples from the Dataset


Some high-quality and real-world models.
More scene-level models.

Surface Completion Benchmark


We conducted a comparative analysis of three surface prediction and completion algorithms, namely SPM, PCN, and XMFnet.
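The evaluation metric is not listed on this page; Chamfer distance is the usual choice for completion benchmarks, and the sketch below shows a simple NumPy/SciPy version that could score any method's predicted cloud against the ground truth. This is an assumed protocol for illustration, not necessarily the paper's exact setup.

# Sketch: symmetric Chamfer distance between a predicted and a ground-truth
# point cloud, a common completion metric. Assumed for illustration only.
import numpy as np
from scipy.spatial import cKDTree

def chamfer_distance(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean squared nearest-neighbour distance, averaged over both directions."""
    d_pred_to_gt, _ = cKDTree(gt).query(pred)    # for each predicted point
    d_gt_to_pred, _ = cKDTree(pred).query(gt)    # for each ground-truth point
    return float(np.mean(d_pred_to_gt ** 2) + np.mean(d_gt_to_pred ** 2))

# Example usage with random stand-in clouds of shape (N, 3):
rng = np.random.default_rng(0)
print(chamfer_distance(rng.random((2048, 3)), rng.random((16384, 3))))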

BibTeX


@misc{zheng2024masstar,
  title={MASSTAR: A Multi-Modal and Large-Scale Scene Dataset with a Versatile Toolchain for Surface Prediction and Completion}, 
  author={Guiyong Zheng and Jinqi Jiang and Chen Feng and Shaojie Shen and Boyu Zhou},
  year={2024},
  eprint={2403.11681},
  archivePrefix={arXiv},
  primaryClass={cs.RO}
}