GenRC: 3D Indoor Scene Generation
from Sparse Image Collections

ECCV 2024


1Carnegie Mellon University, 2National Tsing Hua University,
3National Yang Ming Chiao Tung University, 4Amazon

Given a sparse collection of RGBD images that capture a scene,
our method can generate complete room-scale 3D meshes with high-fidelity texture.

Abstract


Sparse RGBD scene completion is a challenging task, especially when consistent textures and geometry are required throughout the entire scene. Unlike existing solutions that rely on human-designed text prompts or predefined camera trajectories, we propose GenRC, an automated, training-free pipeline that completes a room-scale 3D mesh with high-fidelity textures. To achieve this, we first project the sparse RGBD images into a highly incomplete 3D mesh. Instead of iteratively generating novel views to fill in the voids, we utilize our proposed E-Diffusion to generate a view-consistent panoramic RGBD image, which ensures global geometric and appearance consistency. Furthermore, we maintain stylistic consistency between the input and output scenes through textual inversion, replacing human-designed text prompts. To bridge the domain gap among datasets, E-Diffusion leverages models trained on large-scale datasets to generate diverse appearances. GenRC outperforms state-of-the-art methods under most appearance and geometric metrics on the ScanNet and ARKitScenes datasets, even though it is neither trained on these datasets nor reliant on predefined camera trajectories.
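As a rough, illustrative sketch of the first step (projecting sparse RGBD images into scene geometry), the Python snippet below back-projects a single RGBD frame into colored world-space 3D points given pinhole intrinsics and a camera-to-world pose. The function and variable names are ours for illustration and do not come from the GenRC codebase; fusing the per-frame points and meshing them (e.g., via Poisson reconstruction) is omitted.

import numpy as np

def backproject_rgbd(rgb, depth, K, cam_to_world):
    """Back-project one RGBD frame to colored 3D points in world space.

    rgb:          (H, W, 3) uint8 color image
    depth:        (H, W) depth in meters (0 where invalid)
    K:            (3, 3) pinhole intrinsics
    cam_to_world: (4, 4) camera-to-world pose
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    valid = depth > 0

    # Pixel -> camera coordinates: x = (u - cx) * z / fx, y = (v - cy) * z / fy
    z = depth[valid]
    x = (u[valid] - K[0, 2]) * z / K[0, 0]
    y = (v[valid] - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=-1)   # (N, 4)

    # Camera -> world coordinates
    pts_world = (cam_to_world @ pts_cam.T).T[:, :3]           # (N, 3)
    colors = rgb[valid] / 255.0
    return pts_world, colors

Aggregating the points from all input frames yields the highly incomplete scene geometry that the rest of the pipeline then completes.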


Pipeline of GenRC

Pipeline of GenRC: (a) First, we use textual inversion to extract a text-embedding token that represents the style of the provided RGBD images, and project these images into a 3D mesh. (b) Next, we render a panorama from a plausible room center and use equirectangular projection to obtain perspective views of the scene from the panoramic image. We then apply our proposed E-Diffusion, which respects equirectangular geometry, to concurrently denoise these views and estimate their depth via monocular depth estimation, yielding a cross-view consistent panoramic RGBD image. (c) Finally, we sample novel views from the mesh to fill in the remaining holes, resulting in a complete mesh.
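The sketch below illustrates the equirectangular projection used in step (b): sampling a perspective view from a panoramic image for a given viewing direction and field of view. It is a minimal nearest-neighbor version with illustrative names, not the actual GenRC implementation.

import numpy as np

def perspective_from_equirect(pano, yaw, pitch, fov_deg, out_hw):
    """Sample a perspective view from an equirectangular panorama.

    pano:        (Hp, Wp, 3) equirectangular image
    yaw, pitch:  viewing direction in radians
    fov_deg:     horizontal field of view of the virtual camera
    out_hw:      (H, W) output resolution
    """
    H, W = out_hw
    f = 0.5 * W / np.tan(0.5 * np.radians(fov_deg))

    # Rays through each output pixel in camera coordinates (x right, y down, z forward).
    u, v = np.meshgrid(np.arange(W) - 0.5 * W, np.arange(H) - 0.5 * H)
    dirs = np.stack([u, v, np.full_like(u, f)], axis=-1)
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)

    # Rotate rays by pitch (about x) and yaw (about y).
    cp, sp = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    Rx = np.array([[1, 0, 0], [0, cp, -sp], [0, sp, cp]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    dirs = dirs @ (Ry @ Rx).T

    # Map ray directions to equirectangular (longitude, latitude) pixel coordinates.
    lon = np.arctan2(dirs[..., 0], dirs[..., 2])              # [-pi, pi]
    lat = np.arcsin(np.clip(dirs[..., 1], -1.0, 1.0))         # [-pi/2, pi/2]
    px = ((lon / np.pi + 1) * 0.5 * (pano.shape[1] - 1)).astype(int)
    py = ((lat / (0.5 * np.pi) + 1) * 0.5 * (pano.shape[0] - 1)).astype(int)
    return pano[py, px]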

Concepts of E-Diffusion

Multi-view diffusion with equirectangular geometry: (a) Given an incomplete panoramic image, we first obtain several incomplete perspective images via equirectangular projection. (b) To denoise the perspective image at the i-th view by one step, we first denoise all images to clean estimates and warp them to the i-th view to obtain an averaged image. We then add random noise back to the averaged image, yielding a perspective image that has been denoised by one step. Note that while we illustrate the process with images in RGB space, the entire procedure operates in latent space.
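A minimal sketch of one such denoising step is given below: every view's latent is denoised to a clean estimate, all estimates are warped into the i-th view and averaged, and noise is added back for the next timestep. The denoiser, warping operator, and noise scheduler are passed in as placeholder callables; this is an assumption-laden illustration, not the released E-Diffusion code.

import torch

@torch.no_grad()
def e_diffusion_step(latents, t, t_next, denoise_to_x0, warp_to_view, add_noise):
    """One E-Diffusion-style step over V perspective views of a panorama.

    latents:        (V, C, h, w) noisy latents, one per perspective view
    denoise_to_x0:  callable (latent, t) -> predicted clean latent x0
    warp_to_view:   callable (x0_j, j, i) -> (warped_x0, validity_mask) in view i
    add_noise:      callable (x0, t_next) -> re-noised latent at timestep t_next
    """
    V = latents.shape[0]
    # 1) Denoise every view to a clean (x0) estimate at the current timestep.
    x0_all = torch.stack([denoise_to_x0(latents[j], t) for j in range(V)])

    next_latents = []
    for i in range(V):
        # 2) Warp all clean estimates into view i and average where they overlap.
        acc = torch.zeros_like(x0_all[i])
        weight = torch.zeros_like(x0_all[i])
        for j in range(V):
            warped, mask = warp_to_view(x0_all[j], j, i)
            acc += warped * mask
            weight += mask
        x0_avg = acc / weight.clamp(min=1e-6)

        # 3) Add noise back so view i is only denoised by one step.
        next_latents.append(add_noise(x0_avg, t_next))
    return torch.stack(next_latents)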

Visual Comparison with RGBD2

Results on ScanNet

Comparison with baselines on ScanNet: GenRC produces a comprehensive room-scale mesh with high-fidelity texture even when provided with only sparse RGBD observations. Compared to the prior method RGBD2 [1], GenRC generates more complete meshes and higher-fidelity images. In addition, while T2R+RGBD (adapted from Text2Room [2]) achieves high-fidelity texture, it may generate cross-view inconsistent geometry and artifacts.

BibTeX


      @inproceedings{ming2024GenRC,
        author    = {Ming-Feng Li and Yueh-Feng Ku and Hong-Xuan Yen and Chi Liu and Yu-Lun Liu and Albert Y. C. Chen and Cheng-Hao Kuo and Min Sun},
        title     = {GenRC: 3D Indoor Scene Generation from Sparse Image Collections},
        booktitle = {ECCV},
        year      = {2024}
      }

References

[1] Lei, Jiabao, Jiapeng Tang, and Kui Jia. "RGBD2: Generative Scene Synthesis via Incremental View Inpainting Using RGBD Diffusion Models." CVPR 2023.
[2] Höllein, Lukas, et al. "Text2Room: Extracting Textured 3D Meshes from 2D Text-to-Image Models." ICCV 2023.