[Paper] | [Online Demo] | [Project Page]
[🤗 Full Model Card (Diffusers)] | [🤗 LoRA Model Card (Diffusers)]
[🤗 Dataset Card]
- [2025-5-15] 🤗🤗🤗 VisualCloze has been merged into the official pipelines of diffusers. For usage guidance, please refer to the Model Card.
- [2025-5-18] 🥳🥳🥳 We have released the LoRA weights supporting diffusers at LoRA Model Card 384 and LoRA Model Card 512.
An in-context-learning-based universal image generation framework.
- Support various in-domain tasks. 🔥 Examples
- Generalize to unseen tasks through in-context learning. 🔥 Examples
- Unify multiple tasks into one step and generate both the target image and intermediate results. 🔥 Examples
- Support reverse-engineering a set of conditions from a target image. 🔥 Examples
Using in-context examples as task demonstrations enables the model to generalize to unseen tasks.
Our method can unify multiple tasks into one step and generate not only the target image but also the intermediate results.
Our method supports reverse generation, i.e., reverse-engineering a set of conditions from a target.
See the installation instructions for details.
We have released the Graph200K dataset on Hugging Face. To use it in VisualCloze, preprocess it with the script.
Please refer to the dataset documentation for more details.
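For a quick look at the raw data before preprocessing, the following is a minimal sketch using the `datasets` library. The repository id `VisualCloze/Graph200K` and the `train` split are assumptions based on the dataset card linked above; adjust them if they differ.

```python
# A minimal sketch for browsing Graph200K before preprocessing.
# The repo id "VisualCloze/Graph200K" and the "train" split are assumptions;
# check the dataset card for the actual values.
from datasets import load_dataset

dataset = load_dataset("VisualCloze/Graph200K", split="train", streaming=True)

# Inspect the first record to see which annotations (e.g., condition images) are available.
first = next(iter(dataset))
print(first.keys())
```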
After preprocessing the Graph200K dataset as described above, set the path item in visualcloze.yaml to the generated JSON file.
```yaml
META:
  - path: "the json file of the training set after preprocessing"
    type: 'image_grid_graph200k'
```

Then, you can train the model using a script like `exps/train.sh`. You should personalize `gpu_num`, `batch_size`, and `micro_batch_size` according to your device.
```bash
bash exps/train.sh
```

For training, we use 8 A100 GPUs with a batch size of 2, requiring 50GB of memory with Fully Sharded Data Parallelism. Gradient accumulation can be employed to support a larger batch size. Additionally, 40GB GPUs can also be used when the batch size is set to 1.
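As a rough guide, the effective batch size is the product of the GPU count, the per-GPU batch size, and the gradient-accumulation steps. The sketch below is only arithmetic; the variable names mirror the options mentioned above, and how gradient accumulation is exposed in `exps/train.sh` is an assumption about your launcher.

```python
# Back-of-the-envelope check of the effective batch size before launching training.
# Whether batch_size is per GPU and how gradient accumulation is configured depends
# on exps/train.sh, so treat this purely as arithmetic.
gpu_num = 8              # e.g., 8 A100 GPUs as in the reference setup
micro_batch_size = 2     # per-GPU batch size (use 1 on 40GB GPUs)
grad_accum_steps = 1     # increase to emulate a larger batch without more memory

effective_batch_size = gpu_num * micro_batch_size * grad_accum_steps
print(f"Effective batch size: {effective_batch_size}")  # 8 * 2 * 1 = 16
```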
On Hugging Face, we release visualcloze-384-lora and visualcloze-512-lora, trained with grid resolutions of 384 and 512, respectively. The grid resolution means that each image is resized so that its area equals the square of this value before the images are concatenated into a grid layout.
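For illustration, the resizing rule can be sketched as follows. This is a hedged example of the rule described above (preserving the aspect ratio and using the LANCZOS filter are assumptions), not the repository's actual preprocessing code.

```python
# A minimal sketch of the grid-resolution resizing rule: each image is rescaled so
# that its area matches resolution**2, here while preserving the aspect ratio.
import math
from PIL import Image

def resize_to_area(image: Image.Image, resolution: int = 384) -> Image.Image:
    target_area = resolution * resolution
    w, h = image.size
    scale = math.sqrt(target_area / (w * h))
    new_w, new_h = max(1, round(w * scale)), max(1, round(h * scale))
    return image.resize((new_w, new_h), Image.LANCZOS)

# Example: a 1024x768 image becomes roughly 443x333 (~384**2 pixels) for grid resolution 384.
```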
Note: In addition to Graph200K, the released models are trained on part of our internal multi-task datasets to cover more diverse tasks and improve generalization.
Download our model using huggingface-cli:
Note: The weights here are provided for the training, testing, and Gradio demo in this repository. For usage with Diffusers, please refer to Custom Sampling with Diffusers.
```bash
huggingface-cli download --resume-download VisualCloze/VisualCloze --local-dir /path/to/ckpt
```

or clone the model you want to use with git:
```bash
git clone https://huggingface.co/VisualCloze/VisualCloze
```
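If you prefer to download within Python, a hedged alternative uses `snapshot_download` from `huggingface_hub`; the repository id matches the CLI command above, and the target directory is a placeholder.

```python
# Programmatic alternative to the huggingface-cli / git commands above.
# snapshot_download resumes interrupted downloads by default.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="VisualCloze/VisualCloze",
    local_dir="/path/to/ckpt",  # same target directory as in the CLI example
)
print(f"Weights downloaded to: {local_dir}")
```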
To host a local Gradio demo for interactive inference, run the following command:

```bash
# By default, we use the model trained under the grid resolution of 384.
python app.py --model_path "path to downloaded visualcloze-384-lora.pth" --resolution 384

# To use the model with the grid resolution of 512, set the resolution parameter to 512.
python app.py --model_path "path to downloaded visualcloze-512-lora.pth" --resolution 512
```

- SDEdit is used to upsample the generated image, which has an initial resolution of 384/512 when using the grid resolution of 384/512. You can set `upsampling noise` in the advanced options to adjust the noise level added to the image. For tasks that have strict requirements on the spatial alignment of inputs and outputs, you can increase `upsampling noise` or even set it to 1 to disable SDEdit.
- ⚠️ Before clicking the generate button, please wait until all images, prompts, and other components are fully loaded, especially when using task examples. Otherwise, the inputs from the previous and current sessions may get mixed.
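For intuition about `upsampling noise` (which appears to correspond to `upsampling_strength` in the Diffusers example below), the sketch illustrates SDEdit-style partial noising under a simple linear interpolation. This interpolation is a conceptual assumption for illustration only, not the pipeline's actual scheduler code.

```python
# Conceptual illustration of SDEdit-style upsampling: the low-resolution result is
# resized to the target size, partially replaced with noise according to the strength,
# and then denoised by the model. Linear interpolation is assumed here for intuition only.
import torch

def partially_noise(latent: torch.Tensor, strength: float, generator=None) -> torch.Tensor:
    noise = torch.randn(latent.shape, generator=generator, dtype=latent.dtype)
    # strength = 0.0 keeps the resized image unchanged (maximal spatial fidelity);
    # strength = 1.0 discards it entirely, so the low-resolution result no longer
    # constrains the output, i.e., SDEdit is effectively disabled.
    return (1.0 - strength) * latent + strength * noise

latent = torch.randn(1, 16, 128, 96)            # hypothetical latent of the resized image
noised = partially_noise(latent, strength=0.3)  # matches upsampling_strength=0.3 used below
```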
VisualCloze has been merged into the official pipelines of diffusers. For usage guidance, please refer to the Model Card.
First, please install diffusers.
```bash
pip install git+https://github.com/huggingface/diffusers.git
```

Note that Chinese users can use the commands below to download the model:
```bash
git lfs install
git clone https://www.wisemodel.cn/VisualCloze/VisualClozePipeline-384.git
git clone https://www.wisemodel.cn/VisualCloze/VisualClozePipeline-512.git
```

Then you can use VisualClozePipeline to run the model:
```python
import torch
from diffusers import VisualClozePipeline
from diffusers.utils import load_image

# Load in-context images (make sure the paths are correct and accessible)
# The images are from the VITON-HD dataset at https://github.com/shadow2496/VITON-HD
image_paths = [
    # in-context examples
    [
        load_image('https://github.com/lzyhha/VisualCloze/raw/main/examples/examples/tryon/00700_00.jpg'),
        load_image('https://github.com/lzyhha/VisualCloze/raw/main/examples/examples/tryon/03673_00.jpg'),
        load_image('https://github.com/lzyhha/VisualCloze/raw/main/examples/examples/tryon/00700_00_tryon_catvton_0.jpg'),
    ],
    # query with the target image
    [
        load_image('https://github.com/lzyhha/VisualCloze/raw/main/examples/examples/tryon/00555_00.jpg'),
        load_image('https://github.com/lzyhha/VisualCloze/raw/main/examples/examples/tryon/12265_00.jpg'),
        None,  # the target image is left empty and will be generated
    ],
]

# Task and content prompt
task_prompt = "Each row shows a virtual try-on process that aims to put [IMAGE2] the clothing onto [IMAGE1] the person, producing [IMAGE3] the person wearing the new clothing."
content_prompt = None

# Load the VisualClozePipeline
pipe = VisualClozePipeline.from_pretrained("VisualCloze/VisualClozePipeline-384", resolution=384, torch_dtype=torch.bfloat16)
pipe.to("cuda")

# Loading the VisualClozePipeline via LoRA
# pipe = VisualClozePipeline.from_pretrained("black-forest-labs/FLUX.1-Fill-dev", resolution=384, torch_dtype=torch.bfloat16)
# pipe.load_lora_weights('VisualCloze/VisualClozePipeline-LoRA-384', weight_name='visualcloze-lora-384.safetensors')
# pipe.to("cuda")

# Run the pipeline
image_result = pipe(
    task_prompt=task_prompt,
    content_prompt=content_prompt,
    image=image_paths,
    upsampling_height=1632,
    upsampling_width=1232,
    upsampling_strength=0.3,
    guidance_scale=30,
    num_inference_steps=30,
    max_sequence_length=512,
    generator=torch.Generator("cpu").manual_seed(0),
).images[0][0]

# Save the resulting image
image_result.save("visualcloze.png")
```

We also implement a pipeline of VisualCloze in visualcloze.py of this repository. It can be easily used for custom inference. In inference.py, we show an example of usage on virtual try-on.
```python
from visualcloze import VisualClozeModel

model = VisualClozeModel(
    model_path="the path of model weights",
    resolution=384,  # or 512
    lora_rank=256,
)

'''
grid_h: The number of in-context examples + 1.
        Without in-context examples, it should be set to 1.
grid_w: The number of images involved in a task.
        In the Depth-to-Image task, it is 2. In Virtual Try-On, it is 3.
'''
model.set_grid_size(grid_h, grid_w)

'''
images: List[List[PIL.Image.Image]]. A grid-layout image collection, where each row
    represents an in-context example or the current query; the current query should
    be placed in the last row. The target image can be None in the input. The other
    images should be instances of the PIL Image class (Image.Image).
prompts: List[str]. Three prompts, representing the layout prompt, task prompt,
    and content prompt, respectively.
'''
result = model.process_images(
    images,
    prompts,
)[-1]  # returns a PIL.Image.Image
```
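As a concrete, hedged usage example of this interface, the sketch below runs a depth-to-image query without in-context examples. The input path and prompt wording are illustrative placeholders; the prompts actually used by the repository are defined in prefix_instruction.py.

```python
# Hedged usage example of the interface documented above: a depth-to-image query
# without in-context examples (grid_h=1) and two images per row (grid_w=2).
# The file path and prompts are placeholders for illustration.
from PIL import Image
from visualcloze import VisualClozeModel

model = VisualClozeModel(
    model_path="path to downloaded visualcloze-384-lora.pth",
    resolution=384,
    lora_rank=256,
)
model.set_grid_size(1, 2)  # one row (the query only), two images: depth map -> target image

depth_map = Image.open("examples/depth.png").convert("RGB")  # hypothetical input image
images = [[depth_map, None]]  # the target image is left as None and will be generated

prompts = [
    "A grid layout with 1 row and 2 columns, showing 1 image pair.",            # layout prompt (illustrative)
    "Each row maps [IMAGE1] a depth map to [IMAGE2] the corresponding photo.",  # task prompt (illustrative)
    "",                                                                         # content prompt (optional)
]

result = model.process_images(images, prompts)[-1]  # returns a PIL.Image.Image
result.save("depth_to_image.jpg")
```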
Execute the usage example in inference.py and see the output in example.jpg:

```bash
python inference.py --model_path "path to downloaded visualcloze-384-lora.pth" --resolution 384
python inference.py --model_path "path to downloaded visualcloze-512-lora.pth" --resolution 512
```

To generate images on the test set of Graph200K, run the following command:
```bash
# Set data_path to the json file of the test set, which is generated when preprocessing.
# Set model_path to the path of model weights.
bash exp/sample.sh
```

You can modify test_task_dicts in prefix_instruction.py to customize your required tasks.
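To see which tasks are currently configured before editing them, a minimal sketch (assuming the repository root is on your PYTHONPATH so that prefix_instruction.py is importable):

```python
# Quick inspection of the evaluation tasks defined in prefix_instruction.py.
# The structure of each entry is whatever the repository defines, so we only print it.
from prefix_instruction import test_task_dicts

for i, task in enumerate(test_task_dicts):
    print(i, task)
```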
If you find VisualCloze useful for your research and applications, please cite using this BibTeX:
```bibtex
@article{li2025visualcloze,
  title={VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning},
  author={Li, Zhong-Yu and Du, Ruoyi and Yan, Juncheng and Zhuo, Le and Li, Zhen and Gao, Peng and Ma, Zhanyu and Cheng, Ming-Ming},
  journal={arXiv preprint arXiv:2504.07960},
  year={2025}
}
```




