VIPScene is a novel framework that exploits the commonsense knowledge of the 3D physical world encoded in video generation models to ensure coherent scene layouts and consistent object placements across views. VIPScene accepts both text and image prompts and seamlessly integrates video generation, feedforward 3D reconstruction, and open-vocabulary perception models to semantically and geometrically analyze each object in a scene.
VIPScene is built on the Holodeck codebase and AI2-THOR (which supports visualization in Unity). It integrates the following state-of-the-art tools into its perception framework (a minimal sketch of how their outputs combine follows the list):
- UniDepth: Universal Monocular Metric Depth Estimation
- Grounded-SAM: Grounded Segment Anything for open-vocabulary object segmentation
- Fast3R: Fast 3D Reconstruction
- MASt3R: 3D Reconstruction from Multiple Views for scene understanding
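To give a concrete sense of the geometric analysis these tools enable, here is a minimal sketch (not the repository's actual code) of lifting a segmented object into 3D by back-projecting its masked depth pixels through the camera intrinsics. The depth map stands in for a UniDepth prediction and the boolean mask for a Grounded-SAM output; both are treated as given inputs here.

```python
import numpy as np

def lift_object_to_3d(depth: np.ndarray, mask: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Back-project the masked pixels of a metric depth map into
    camera-space 3D points using pinhole intrinsics K (3x3).

    depth: (H, W) metric depth in meters (e.g., a UniDepth prediction).
    mask:  (H, W) boolean object mask (e.g., from Grounded-SAM).
    """
    v, u = np.nonzero(mask)              # pixel coordinates inside the mask
    z = depth[v, u]
    x = (u - K[0, 2]) * z / K[0, 0]      # X = (u - cx) * Z / fx
    y = (v - K[1, 2]) * z / K[1, 1]      # Y = (v - cy) * Z / fy
    return np.stack([x, y, z], axis=-1)  # (N, 3) object point cloud

# Toy example: a synthetic 4x4 depth map with a 2x2 object mask.
K = np.array([[500.0, 0.0, 2.0], [0.0, 500.0, 2.0], [0.0, 0.0, 1.0]])
depth = np.full((4, 4), 2.0)
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True
points = lift_object_to_3d(depth, mask, K)
print(points.shape)  # (4, 3); the centroid gives a rough object position
```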
## Clone and Install VIPScene
```bash
git clone https://github.com/VIPScene/VIPScene.git
cd VIPScene
conda create --name vipscene python=3.10
conda activate vipscene
pip install -r requirements.txt
pip install --extra-index-url https://ai2thor-pypi.allenai.org ai2thor==0+8524eadda94df0ab2dbb2ef5a577e4d37c712897
```
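To confirm the AI2-THOR install works, a quick smoke test like the following can help (a minimal sketch, assuming a display or X server is available; `FloorPlan1` is one of AI2-THOR's stock scenes):

```python
from ai2thor.controller import Controller

# Launch a stock iTHOR scene and take one step to verify the install.
controller = Controller(scene="FloorPlan1")
event = controller.step(action="RotateRight")
print(event.metadata["lastActionSuccess"])
controller.stop()
```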
## Set Up External Tools

Please refer to the respective repositories linked above (UniDepth, Grounded-SAM, Fast3R, MASt3R) for their installation instructions and environment setup.
## Download Data

Please refer to Holodeck for data download instructions.
## Usage

Prerequisite: Our system uses `gpt-4o-2024-05-13`. Please ensure you have an OpenAI API key with access to this model. You can set it as an environment variable:

```bash
export OPENAI_API_KEY=<your_key>
```

For detailed step-by-step instructions on running the full video processing pipeline (including object analysis, depth estimation, segmentation, and 3D reconstruction), please refer to the Usage Documentation.
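As a quick sanity check that your key can reach this model, you can run a minimal request (a sketch, assuming the `openai` Python package v1+ is installed; this is not part of the pipeline itself):

```python
from openai import OpenAI

# The client reads OPENAI_API_KEY from the environment by default.
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-2024-05-13",
    messages=[{"role": "user", "content": "Reply with OK."}],
    max_tokens=5,
)
print(response.choices[0].message.content)
```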
## Citation

Please cite the following paper if you use this code in your work.
```bibtex
@article{huang2025video,
  title={Video Perception Models for 3D Scene Synthesis},
  author={Huang, Rui and Zhai, Guangyao and Bauer, Zuria and Pollefeys, Marc and Tombari, Federico and Guibas, Leonidas and Huang, Gao and Engelmann, Francis},
  journal={arXiv preprint arXiv:2506.20601},
  year={2025}
}
```

## Acknowledgements

We would like to express our sincere gratitude to the authors of Holodeck, UniDepth, Grounded-SAM, Fast3R, and MASt3R for their excellent open-source work.