O₂V-Mapping: Online Open-Vocabulary Mapping with Neural Implicit Representation

1 Academy for Engineering & Technology, Fudan University, China
2 School of Future Technology, Harbin Institute of Technology, China
3 Department of Computer Science and Engineering, Shanghai Jiao Tong University, China

metie22@m.fudan.edu.cn
Corresponding Authors

Online O2V Mapping and Text-Based Search Results.

Abstract

Online construction of open-ended language scenes is crucial for robotic applications that require open-vocabulary interactive scene understanding. Recently, neural implicit representations have provided a promising direction for online interactive mapping. However, implementing open-vocabulary scene understanding in online neural implicit mapping still faces three challenges: the lack of local scene-updating ability, blurry spatial hierarchical semantic segmentation, and the difficulty of maintaining multi-view consistency. To this end, we propose O2V-Mapping, which utilizes voxel-based language and geometric features to create an open-vocabulary field, allowing local updates during the online training process. Additionally, we leverage a foundation model for image segmentation to extract language features from object-level entities, achieving clear segmentation boundaries and hierarchical semantic features. To preserve the consistency of 3D object properties across different viewpoints, we propose a spatially adaptive voxel adjustment mechanism and a multi-view weight selection method. Extensive experiments on open-vocabulary object localization and semantic segmentation demonstrate that O2V-Mapping achieves online construction of language scenes while enhancing accuracy, outperforming the previous SOTA method. Our code is open-sourced at https://github.com/Fudan-MAGIC-Lab/O2Vmapping.git.
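As an illustration of how such an open-vocabulary field can be queried with free-form text, here is a minimal sketch (not the authors' code) that scores per-voxel CLIP language features against a CLIP text embedding via cosine similarity. The voxel count, the random placeholder features, and the query_field helper are all illustrative assumptions.

# Minimal sketch of text-based retrieval against a fused voxel language field.
# The per-voxel features here are random placeholders standing in for the
# CLIP features fused during mapping.
import torch
import clip  # OpenAI CLIP, https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

# Hypothetical fused field: N voxels, each storing a 512-d CLIP feature.
num_voxels = 10_000
voxel_lang_feats = torch.randn(num_voxels, 512, device=device)
voxel_lang_feats = voxel_lang_feats / voxel_lang_feats.norm(dim=-1, keepdim=True)

def query_field(text: str, top_k: int = 100) -> torch.Tensor:
    """Return indices of the voxels most similar to a text query."""
    tokens = clip.tokenize([text]).to(device)
    with torch.no_grad():
        text_feat = model.encode_text(tokens).float()
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    # Cosine similarity between the query and every voxel's language feature.
    sims = (voxel_lang_feats @ text_feat.T).squeeze(-1)  # (N,)
    return sims.topk(top_k).indices

chair_voxels = query_field("chair")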


Top: Optimization of voxel-based neural radiance fields. Trilinear interpolation over the nearest voxels yields color and geometric features for spatially sampled points. Then, using NeRF's volume rendering, the samples along each ray are composited to produce RGB and depth images.
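As a rough illustration of this top branch, the following sketch uses torch.nn.functional.grid_sample for the trilinear interpolation and standard alpha compositing for the volume rendering. The grid size, the toy linear decoder, and the render_rays helper are assumptions for the example, not the paper's actual architecture.

# Sketch: trilinear feature interpolation + NeRF-style volume rendering.
import torch
import torch.nn.functional as F

# Illustrative feature grid: (1, C, D, H, W), coordinates normalized to [-1, 1].
feat_grid = torch.randn(1, 16, 32, 32, 32)

def interp_features(points: torch.Tensor) -> torch.Tensor:
    """Trilinearly interpolate grid features at (R, S, 3) sample points."""
    R, S, _ = points.shape
    grid = points.view(1, R, S, 1, 3)  # grid_sample expects 5-D coordinates
    feats = F.grid_sample(feat_grid, grid, mode="bilinear",  # trilinear for 3-D input
                          align_corners=True)                # (1, C, R, S, 1)
    return feats.squeeze(0).squeeze(-1).permute(1, 2, 0)     # (R, S, C)

# Toy decoder standing in for the paper's MLP heads: feature -> (r, g, b, sigma).
decoder = torch.nn.Linear(16, 4)

def render_rays(points: torch.Tensor, z_vals: torch.Tensor):
    """Alpha-composite samples along each ray into RGB and depth.

    points: (R, S, 3) sample positions; z_vals: (R, S) depths along each ray.
    """
    out = decoder(interp_features(points))
    rgb, sigma = torch.sigmoid(out[..., :3]), F.relu(out[..., 3])
    deltas = torch.diff(z_vals, dim=-1, append=z_vals[:, -1:] + 1e10)
    alpha = 1.0 - torch.exp(-sigma * deltas)                 # (R, S)
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], -1), -1)[:, :-1]
    weights = alpha * trans
    color = (weights[..., None] * rgb).sum(1)                # rendered RGB, (R, 3)
    depth = (weights * z_vals).sum(1)                        # rendered depth, (R,)
    return color, depth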

Bottom: Optimization of our O2V field. We employ SAM to segment input RGB images into instances, then obtain language features for each instance through CLIP encoding. Feature indexing is performed to prepare for feature fusion. Finally, voxel splitting and multi-view voting yield fine-grained 3D open-vocabulary results.
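The fusion step of this bottom branch might be sketched as follows, assuming SAM masks and per-mask CLIP features are precomputed. A weighted running average over views stands in here for the paper's voxel splitting and multi-view weight selection, so treat it as a simplification; all argument names and shapes are hypothetical.

# Simplified sketch of fusing per-view instance language features into voxels.
import numpy as np

def fuse_view(masks, clip_feats, pixel_to_voxel, voxel_feats, voxel_weights,
              view_weight=1.0):
    """Accumulate one view's per-instance CLIP features into the voxel field.

    masks:          list of (H, W) boolean SAM instance masks for this frame
    clip_feats:     (M, 512) array, one CLIP feature per mask
    pixel_to_voxel: (H, W) int voxel index per pixel (e.g. from rendered
                    depth), -1 where no voxel is hit
    voxel_feats:    (N, 512) running weighted feature sums
    voxel_weights:  (N,) running weight sums
    """
    for mask, feat in zip(masks, clip_feats):
        hit = np.unique(pixel_to_voxel[mask])
        hit = hit[hit >= 0]                       # drop pixels with no voxel
        voxel_feats[hit] += view_weight * feat    # feature "votes" from this view
        voxel_weights[hit] += view_weight

def fused_language_field(voxel_feats, voxel_weights):
    """Normalize accumulated votes into one language feature per voxel."""
    return voxel_feats / np.maximum(voxel_weights, 1e-8)[:, None]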

Video Presentation

Interactive comparison sliders (×4) showing text-based query results for "chair", "door", "couch", and "trash can".

BibTeX

@inproceedings{tie20242,
  title={O2V-Mapping: Online Open-Vocabulary Mapping with Neural Implicit Representation},
  author={Tie, Muer and Wei, Julong and Wu, Ke and Wang, Zhengjun and Yuan, Shanshuai and Zhang, Kaizhao and Jia, Jie and Zhao, Jieru and Gan, Zhongxue and Ding, Wenchao},
  booktitle={European Conference on Computer Vision},
  pages={318--333},
  year={2024},
  organization={Springer}
}