# VIPScene

**Video Perception Models for 3D Scene Synthesis**

VIPScene is a novel framework that exploits the commonsense knowledge of the 3D physical world encoded in video generation models to ensure coherent scene layouts and consistent object placements across views. VIPScene accepts both text and image prompts and seamlessly integrates video generation, feedforward 3D reconstruction, and open-vocabulary perception models to analyze each object in a scene both semantically and geometrically.

## Core Dependencies & External Tools

VIPScene is built on the Holodeck codebase and AI2-THOR (which provides Unity-based visualization). It integrates the following state-of-the-art tools into its perception framework; a short depth-estimation example follows the list:

- **UniDepth**: universal monocular metric depth estimation
- **Grounded-SAM**: Grounded Segment Anything for open-vocabulary object segmentation
- **Fast3R**: fast feedforward 3D reconstruction
- **MASt3R**: 3D reconstruction from multiple views for scene understanding
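
As a concrete example of how one of these perception components is driven, below is a minimal monocular depth-estimation sketch following the usage pattern from the UniDepth repository. The exact model class and checkpoint name are assumptions that may vary across UniDepth versions; check their README before running.

```python
import numpy as np
import torch
from PIL import Image
from unidepth.models import UniDepthV2  # class name per the UniDepth repo; verify for your version

# Load a pretrained metric-depth model (checkpoint name is an assumption; see the UniDepth README).
model = UniDepthV2.from_pretrained("lpiccinelli/unidepth-v2-vitl14")
model = model.to("cuda" if torch.cuda.is_available() else "cpu").eval()

# UniDepth expects an RGB tensor of shape (C, H, W); normalization is handled by the model.
rgb = torch.from_numpy(np.array(Image.open("frame_000.png").convert("RGB"))).permute(2, 0, 1)

# infer() returns a dict containing metric depth, a point cloud, and predicted intrinsics.
predictions = model.infer(rgb)
depth = predictions["depth"]  # metric depth map in meters
print(depth.shape)
```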

## Installation

1. **Clone and install VIPScene** (a quick AI2-THOR smoke test follows this list):

   ```bash
   git clone https://github.com/VIPScene/VIPScene.git
   cd VIPScene
   conda create --name vipscene python=3.10
   conda activate vipscene
   pip install -r requirements.txt
   pip install --extra-index-url https://ai2thor-pypi.allenai.org ai2thor==0+8524eadda94df0ab2dbb2ef5a577e4d37c712897
   ```

2. **Set up external tools**: please refer to the respective repositories linked above (UniDepth, Grounded-SAM, Fast3R, MASt3R) for their installation instructions and environment setup.
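
To verify that the pinned AI2-THOR build installed correctly, you can run the smoke test below using AI2-THOR's standard controller API. `FloorPlan1` is one of AI2-THOR's built-in iTHOR scenes; launching the Unity process requires a display or an equivalent headless-rendering setup.

```python
from ai2thor.controller import Controller

# Launch the Unity-backed simulator with a built-in iTHOR scene.
controller = Controller(scene="FloorPlan1")

# Execute one action and read the agent's pose back from the event metadata.
event = controller.step(action="RotateRight")
print(event.metadata["agent"]["position"])

controller.stop()
```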

## Data

Please refer to Holodeck for data download instructions.

## Usage

**Prerequisite:** Our system uses `gpt-4o-2024-05-13`. Please ensure you have an OpenAI API key with access to this model. You can set it as an environment variable:

```bash
export OPENAI_API_KEY=<your_key>
```
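
Once exported, the official `openai` Python client picks the key up from the environment automatically. A minimal sanity check against the pinned model might look like the following; this uses the standard OpenAI SDK, and VIPScene's internal prompting code may differ:

```python
from openai import OpenAI

# The client reads OPENAI_API_KEY from the environment automatically.
client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-2024-05-13",  # the model snapshot VIPScene expects
    messages=[{"role": "user", "content": "Reply with OK if you can read this."}],
)
print(response.choices[0].message.content)
```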

For detailed step-by-step instructions on running the full video processing pipeline (including object analysis, depth estimation, segmentation, and 3D reconstruction), please refer to the Usage Documentation.
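
Purely as an illustration of the stage order described above, a hypothetical orchestration of the pipeline is sketched below. Every function name in this sketch is an invented stub and does not correspond to VIPScene's actual API; the Usage Documentation has the real commands.

```python
# Hypothetical pipeline sketch: all names below are illustrative stubs, NOT VIPScene's API.
from typing import Any, List

def generate_scene_video(prompt: str) -> List[Any]: ...               # video generation model
def estimate_depth(frame: Any) -> Any: ...                            # e.g. UniDepth
def segment_objects(frame: Any) -> Any: ...                           # e.g. Grounded-SAM
def reconstruct_3d(frames: List[Any], depths: List[Any]) -> Any: ...  # e.g. Fast3R / MASt3R
def analyze_objects(scene: Any, masks: List[Any]) -> Any: ...         # per-object semantics + geometry
def export_layout(layout: Any, path: str) -> None: ...                # layout for rendering

def run_pipeline(prompt: str) -> None:
    frames = generate_scene_video(prompt)          # 1. text/image prompt -> video frames
    depths = [estimate_depth(f) for f in frames]   # 2a. metric depth per frame
    masks = [segment_objects(f) for f in frames]   # 2b. open-vocabulary instance masks
    scene = reconstruct_3d(frames, depths)         # 3. feedforward 3D reconstruction
    layout = analyze_objects(scene, masks)         # 4. fuse masks + geometry into placements
    export_layout(layout, "scene.json")            # 5. export for AI2-THOR rendering
```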

## Citation

Please cite the following paper if you use this code in your work.

```bibtex
@article{huang2025video,
  title={Video Perception Models for 3D Scene Synthesis},
  author={Huang, Rui and Zhai, Guangyao and Bauer, Zuria and Pollefeys, Marc and Tombari, Federico and Guibas, Leonidas and Huang, Gao and Engelmann, Francis},
  journal={arXiv preprint arXiv:2506.20601},
  year={2025}
}
```

## Acknowledgement

We would like to express our sincere gratitude to the authors of Holodeck, UniDepth, Grounded-SAM, Fast3R, and MASt3R for their excellent open-source work.
