VIPScene is a novel framework that exploits the commonsense knowledge of the 3D physical world encoded in video generation models to ensure coherent scene layouts and consistent object placements across views. VIPScene accepts both text and image prompts and seamlessly integrates video generation, feedforward 3D reconstruction, and open-vocabulary perception models to semantically and geometrically analyze each object in a scene.
VIPScene is built on the Holodeck codebase and AI2-THOR (which supports visualization in Unity). It integrates the following state-of-the-art tools into its perception framework (a minimal sketch of how their outputs combine follows the list):
- UniDepth: Universal Monocular Metric Depth Estimation
- Grounded-SAM: Grounded Segment Anything for open-vocabulary object segmentation
- Fast3R: Fast 3D Reconstruction
- MASt3R: 3D Reconstruction from Multiple Views for scene understanding
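To give a concrete sense of the geometric analysis these tools enable, here is a minimal sketch (not the repository's actual code) of lifting a segmented object into 3D by back-projecting its masked depth pixels through the camera intrinsics. The depth map stands in for a UniDepth prediction and the boolean mask for a Grounded-SAM output; both are treated as given inputs here.

```python
import numpy as np

def lift_object_to_3d(depth: np.ndarray, mask: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Back-project the masked pixels of a metric depth map into
    camera-space 3D points using pinhole intrinsics K (3x3).

    depth: (H, W) metric depth in meters (e.g., a UniDepth prediction).
    mask:  (H, W) boolean object mask (e.g., from Grounded-SAM).
    """
    v, u = np.nonzero(mask)              # pixel coordinates inside the mask
    z = depth[v, u]
    x = (u - K[0, 2]) * z / K[0, 0]      # X = (u - cx) * Z / fx
    y = (v - K[1, 2]) * z / K[1, 1]      # Y = (v - cy) * Z / fy
    return np.stack([x, y, z], axis=-1)  # (N, 3) object point cloud

# Toy example: a synthetic 4x4 depth map with a 2x2 object mask.
K = np.array([[500.0, 0.0, 2.0], [0.0, 500.0, 2.0], [0.0, 0.0, 1.0]])
depth = np.full((4, 4), 2.0)
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True
points = lift_object_to_3d(depth, mask, K)
print(points.shape)  # (4, 3); the centroid gives a rough object position
```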
## Clone and Install VIPScene
```bash
git clone https://github.com/VIPScene/VIPScene.git
cd VIPScene
conda create --name vipscene python=3.10
conda activate vipscene
pip install -r requirements.txt
pip install --extra-index-url https://ai2thor-pypi.allenai.org ai2thor==0+8524eadda94df0ab2dbb2ef5a577e4d37c712897
```
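To confirm the AI2-THOR install works, a quick smoke test like the following can help (a minimal sketch, assuming a display or X server is available; `FloorPlan1` is one of AI2-THOR's stock scenes):

```python
from ai2thor.controller import Controller

# Launch a stock iTHOR scene and take one step to verify the install.
controller = Controller(scene="FloorPlan1")
event = controller.step(action="RotateRight")
print(event.metadata["lastActionSuccess"])
controller.stop()
```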
## Set Up External Tools

Please refer to the respective repositories linked above (UniDepth, Grounded-SAM, Fast3R, MASt3R) for their installation instructions and environment setup.
## Download Data

Please refer to Holodeck for data download instructions.
## Usage

Prerequisite: Our system uses `gpt-4o-2024-05-13`. Please ensure you have an OpenAI API key with access to this model. You can set it as an environment variable:

```bash
export OPENAI_API_KEY=<your_key>
```

For detailed step-by-step instructions on running the full video processing pipeline (including object analysis, depth estimation, segmentation, and 3D reconstruction), please refer to the Usage Documentation.
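As a quick sanity check that your key can reach this model, you can run a minimal request (a sketch, assuming the `openai` Python package v1+ is installed; this is not part of the pipeline itself):

```python
from openai import OpenAI

# The client reads OPENAI_API_KEY from the environment by default.
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-2024-05-13",
    messages=[{"role": "user", "content": "Reply with OK."}],
    max_tokens=5,
)
print(response.choices[0].message.content)
```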
## Citation

Please cite the following paper if you use this code in your work.
```bibtex
@article{huang2025video,
  title={Video Perception Models for 3D Scene Synthesis},
  author={Huang, Rui and Zhai, Guangyao and Bauer, Zuria and Pollefeys, Marc and Tombari, Federico and Guibas, Leonidas and Huang, Gao and Engelmann, Francis},
  journal={arXiv preprint arXiv:2506.20601},
  year={2025}
}
```

## Acknowledgements

We would like to express our sincere gratitude to the authors of Holodeck, UniDepth, Grounded-SAM, Fast3R, and MASt3R for their excellent open-source work.